Live from Las Vegas, it's theCUBE, covering Informatica World 2019. Brought to you by Informatica.

Welcome back, everyone, to theCUBE's live coverage of Informatica World 2019. I'm your host, Rebecca Knight, along with my co-host, John Furrier. We're joined by Ali Ghodsi. He is the CEO of Databricks. Thank you so much for coming on, for returning to theCUBE. You're a CUBE veteran.

Yes, thank you for having me.

So I want to pick up on something that you said up on the main stage, and that is that every enterprise on the planet wants to add AI capabilities, but the hardest part of AI is not AI, it's the data. Can you riff on that a little bit for our viewers, elaborate?

Yeah, actually, the interesting part is that if you look at the companies that have succeeded with AI, the AI algorithms they're using were actually developed in the 70s. That's 50 years ago. So how come they're succeeding now? Those same algorithms weren't working in the 70s, so people gave up on them. Things like neural nets, right? Now they're in vogue and they're super successful. The reason is you have to apply orders of magnitude more data. If you feed those algorithms that we thought were broken orders of magnitude more data, you actually get great results. But that's actually hard. Dealing with petabyte-scale data, cleaning it, and making sure it's the right data for the task at hand is not easy. So that's the part people are struggling with.

When I saw you up on stage, I thought, oh, Ali's here, that's awesome. Glad that you stopped by theCUBE, it's been a while. I want to get a quick update, because you guys have been on a tear, doing some great work at Cal. We just talked before we came on camera. But what are you doing here? Are there any announcements or news with Informatica? What's the story?
Yeah, we're doing a partnership around Delta Lake, which is, you know, our next-generation engine that we've built. So we're super excited about that. It integrates with all of the Informatica platform: their ingestion tools, their transformation tools, and the catalog that they also have. So we think together this can really help enterprises make that transition into the AI era.

So, you know, we've been following you since the start; this is our tenth year. Remember when we were in the Cloudera office with Mike Olson and Amr Awadallah when we first started, when the Hadoop movement started. And then the cloud came along. Right when you guys started your company, cloud growth took off. You guys were instrumental in changing the equation in dealing with data, data lakes, or whatever they were calling it back then, to now data holistically as a systems architecture. On-premise is a huge challenge; cloud-native, no real challenge, people love that. Data feeds AI: a lot of risk-taking, a lot of reward. We're seeing the SaaS business explode, Zoom, the list goes on and on. Enterprises are trying to be SaaS, and it's hard. You can't just take data from an enterprise and make it SaaS-ified. You really have to think differently. What are you guys doing? How have you evolved to meet that challenge? Because this is where your core value proposition started to change. Take us through the Databricks story and how you're solving that problem today.

Yeah, it's a great question. Really, what happened is that people started collecting a lot of data about a decade ago. And the promise was, you can do great things with this. There were all these aspirational use cases around machine learning and real time. It's going to be amazing, right? So people started collecting it. They started storing one petabyte, two petabytes, and they kept going back to their boss and saying, hey, this project is really successful, and now I have five petabytes in it.
At some point the business said, okay, that's great, but what can you do with it? What business problems are you actually addressing? What are you solving? And so in the last couple of years there's been a push toward proving the value of these data lakes. And actually, many of these projects are falling short; many are failing. The reason is that people have just been dumping data into the data lakes without thinking about the structure, the quality, or how it's going to be used. The use cases have been an afterthought. So the number one thing top of mind for everyone right now is: how do we make these data lakes successful so we can prove some business value to our management? That's the main problem we're focusing on. Toward this, we built something called Delta Lake. It's something you situate on top of your data lake, and what it does is increase the quality, the reliability, the performance, and the scale of your data lake.

It's just like a filter. The cream rises to the top.

Exactly.

And the sludge, the data swamp, stays below the clean water, if you will.

Exactly, actually, you nailed it. So basically, we look at the data as it comes in and filter it, as you said. If there are any quality issues, we put that data back in the data lake. It's fine, it can stay there; we'll figure out how to get value out of it later. But if it makes it into the Delta Lake, it will have high quality, right? So that's great. And since we're looking at all the data as it's coming in anyway, we might as well also store a lot of indices and other things that let us optimize performance later on, so that later, when people are actually trying to use that data, they get really high performance and really good quality. We also added ACID transactions, so now you're also getting all those transactional use cases working on your existing data lake.
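The quality-gate pattern described above can be sketched in a few lines of plain Python. To be clear, this is a conceptual illustration only, not the actual Delta Lake API, and the field names and checks are hypothetical:

```python
# Conceptual sketch of the quality gate described above: records are
# checked as they arrive; clean records land in the curated table, and
# everything else stays in the raw data lake for later reprocessing.
# This is NOT the real Delta Lake API; fields and checks are made up.

def quality_gate(records, required_fields=("user_id", "event_ts")):
    """Split incoming records into (curated, raw_retained)."""
    curated, raw_retained = [], []
    for rec in records:
        # A record passes only if every required field is present and non-empty.
        if all(rec.get(f) not in (None, "") for f in required_fields):
            curated.append(rec)
        else:
            raw_retained.append(rec)  # kept in the lake, never discarded
    return curated, raw_retained

incoming = [
    {"user_id": "u1", "event_ts": "2019-05-21T10:00:00"},
    {"user_id": "",   "event_ts": "2019-05-21T10:01:00"},  # fails the gate
    {"user_id": "u2", "event_ts": None},                   # fails the gate
]
curated, retained = quality_gate(incoming)
# curated holds the one clean record; retained holds the other two
```

The key point, per the interview, is that records failing the gate are retained in the raw lake rather than discarded, so value can still be extracted from them later.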
At my daughter's graduation at Cal Berkeley this weekend, I saw people walking around with Databricks backpacks. Very popular in academia. You've got the young generation coming in. What's the update on the company? How many employees? What's the traction? Just a quick business update.

Yeah, we're about 800 employees now. About 100 people in Europe, I would say, and maybe 40 or 50 people in Asia-Pacific. We're expanding the EMEA and Asia business.

Growth mode.

Yeah, growth mode. So it's expanding as fast as possible. Although actually, as a CEO, I try to always slow the hiring down to make sure we keep the quality bar. So that's top of mind for me. But yeah, we're...

You did a Delta Lake on that one.

Yeah, exactly. We're super excited about working with these universities. We get a lot of graduate students from the top universities.

And Cal had the first-ever class in data from its new data college; the inaugural data analytics class just graduated. Shows you how early it is.

Yeah, yeah. And Cal actually used Databricks Community Edition as the platform for a class of over 1,000 students. So they're going to be trained in data science as they come out.

So I want to ask about that. As you said, you're trying to slow down the hiring to make sure you maintain a high bar for your new hires. But I'm sure there's huge demand, because you are in growth mode. So what are you doing? You said you're working with universities to make sure the next generation is trained up and capable of performing at Databricks. Tell us more about those efforts.

Yeah, I mean, obviously university recruiting is big for us. At Cal, I think Databricks has the longest line of all the companies that come on career fair day. So we work very closely with these universities. The generation that's coming out today is actually data science trained.
So it's a big difference. There is a huge skills gap out there. Every big enterprise you talk to tells you, my biggest problem is actually that I don't have skilled people. Can you help me hire people? We say, hey, we're not in the recruiting business. But the good news is, if you look at the universities, they're all training thousands and thousands of data scientists every year now. I can tell you, just at Cal, because I happen to be on the faculty there, almost every applicant to grad school now wants to do something AI-related. Which has actually led to this: if you look at all the programs in universities today, professors who used to do networking say, we do intelligent networks. People who do databases say, we do intelligent databases. People who do systems research say, hey, we do intelligent systems, right? So what that means is, in a couple of years you'll have lots of students coming out, and these companies that are now struggling to hire will be able to hire this talent and actually succeed better with these AI projects.

As they say in Berkeley, a good revolution once in a while is a good thing. AI is kind of changing everyone over. I've got to ask you, for the young kids out there, and for parents who have kids in elementary school or high school: everyone's trying to figure it out, and there's no clear playbook yet. We're starting to see first-generation training, but is there a skill set? Because there's a wide surface area, from hardcore coding to ethics and everything in between, including visualization, multiple dimensions of opportunity. What skills do you see that people could hone or tweak, skills that may not be on a curriculum, or pieces of different curricula in school, that would be a good foundation for folks learning and wanting to jump into data and data value, whether it's coding or ethics?
Yeah, I mean, just looking at my own background and what I got to learn in school, the thing that was lacking compared to what's needed today is statistics. Understanding of statistics, statistical knowledge: that, I think, is going to be pervasive. So 10 or 15 years from now, no matter which field you're in, whatever job you have, you'll need some basic level of statistical understanding, because the systems you're working with will be spitting out statistics and numbers, and you'll need to understand: what is a false positive? What is a sample? What do these things mean? So that's one thing that's definitely missing, and it's coming. That's one. The second is that computing will continue to be important. The intersection of those two, I think, is the key.

In all the fields we were talking about earlier, biology, biochemistry, whatever, everything's intersecting, right?

Yeah.

I've got to ask you about this. I'm old school, I'm 53 years old. I remember when I broke into the business coding, I used to walk into departments that were called DP, data processing. So now we're getting into a new data processing world. You've got statistics, you've got pipelines; these are data concepts. So I've got to ask you: companies in the enterprise, maybe slower to move to the cutting edge like you guys are, have to figure out where to store the data.
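On the statistics point above, the kind of literacy being described is concrete: for instance, knowing how to read a false positive rate off a model's predictions. A minimal, made-up example:

```python
# A minimal, made-up illustration of the statistical literacy mentioned
# above: computing a false positive rate from predictions and labels.

def false_positive_rate(predictions, labels):
    """FPR = FP / (FP + TN): the share of actual negatives flagged positive."""
    fp = sum(1 for p, y in zip(predictions, labels) if p == 1 and y == 0)
    tn = sum(1 for p, y in zip(predictions, labels) if p == 0 and y == 0)
    return fp / (fp + tn)

preds = [1, 0, 1, 1, 0, 0, 1, 0]  # model output (1 = flagged)
truth = [1, 0, 0, 1, 0, 1, 0, 0]  # ground truth
fpr = false_positive_rate(preds, truth)  # 2 of 5 actual negatives flagged: 0.4
```

Note the denominator is the count of actual negatives, not all predictions, which is exactly the kind of detail a non-statistician reading a dashboard can get wrong.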
So can you share your opinion on how customers are thinking, and how they maybe should be architecting data on-premise and in the cloud? Certainly the cloud is great if you're cloud-native, pure SaaS born in the cloud like a startup. But if you're a large enterprise and you want to be SaaS-like, to have all that benefit, to take the risk along with the reward of being agile, you've got to have data. If you don't get the data into machine learning or AI, you're not going to have good AI. So you need to get that data feeding in fast, and if it's constrained by regulation and compliance, you're screwed. So what's your view on this? Where should it be stored? What's your opinion?

Yeah, I mean, we've had the same opinion for five or six years, right? The data belongs in the cloud. Don't try to do this yourself. Don't try to do this on-prem. Don't store it in Hadoop; it's not built for this. Store it in the cloud. In the cloud, first of all, you get a lot of the security benefits that the cloud vendors are already working on. So that's one good thing about it. Second, it's reliable: you get, you know, ten or eleven nines of availability, so that's great. Start collecting data there. Another reason you want to do it in the cloud is that a lot of the data sets you need to get good-quality results are available in the cloud. Often what happens with AI is you build the predictive model, but it's terrible; it didn't work well. So you go back, and the first trick you use to increase the quality is augmenting your data with other data sets. You might purchase those data sets from other vendors. You don't want to be shipping hard drives around or getting that into your data center. Those data sets will be available in the cloud, so you can augment your data. So we're big fans of storing your data in data lakes in the cloud.
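The augmentation trick described above, enriching your own records with a third-party data set hosted in the same cloud, reduces in miniature to a keyed join. A hedged sketch, where all the names and figures are invented for illustration:

```python
# Sketch of the augmentation trick: enrich internal records with a
# purchased data set keyed on a shared field before retraining a model.
# All field names and numbers here are hypothetical.

internal = [
    {"zip": "94105", "sales": 120},
    {"zip": "10001", "sales": 95},
]
# Hypothetical purchased demographic data, keyed on the same zip code.
purchased = {
    "94105": {"median_income": 112000},
    "10001": {"median_income": 86000},
}

# Keyed join: merge purchased features into each internal record,
# leaving a record unchanged when there is no match.
augmented = [{**row, **purchased.get(row["zip"], {})} for row in internal]
```

At petabyte scale this would be a distributed join rather than a list comprehension, but the point from the interview stands: when both data sets already live in the same cloud, the join happens in place, with no hard drives shipped around.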
We obviously believe you need to make that data high quality and reliable, and for that we believe Delta Lake, the open-source project we created, is a great vehicle. But I think movement to the cloud is the number one thing.

And hybrid works with that, if you need to have something on-premise?

In my opinion, the two worlds are so different that it's hard. You hear a lot of vendors say, you know, we're the hybrid solution that works on both, and so on. But the two models are so fundamentally different that it's hard to actually make them work well. I have not yet seen a customer or enterprise succeed with it. You see a lot of offerings where people say hybrid is the way; of course, a lot of on-prem vendors are now saying, hey, we're the hybrid solution. I haven't actually seen that be successful, to be frank. Maybe someone will crack that nut, but...

I think it's an operational question, to see if they can make it work. Ali, congratulations on all your success. Great to see you.

Yeah, it's been great having you on the show.

Thank you so much for having me.

You are watching theCUBE at Informatica World 2019. I'm Rebecca Knight, for John Furrier. Stay tuned.