Live from San Francisco, it's theCUBE, covering Spark Summit 2017. Brought to you by Databricks.

Welcome back to theCUBE, day two at Spark Summit. It's very exciting. I can't wait to talk to this gentleman. We have the CEO of Databricks, Ali Ghodsi, joining us. Ali, welcome to the show.

Thank you so much.

Well, we sat here and watched the keynote this morning with bated breath, and you delivered some big announcements. Before we get into some of that, I want to ask you: it's been about a year and a half since you transitioned from VP of products and engineering into the CEO role. What's the most fun part of that, and maybe what's the toughest part?

Oh, I see. That's a good question, and a tough question too. The most fun part is that you touch many more facets of the business. In engineering, it's all the tech, and you're dealing mostly with engineers. Customers are one hop away; there's a product management layer between you and the customers. So you're very inward-focused. As a CEO, you're dealing with marketing, finance, sales, all these different functions, and then externally with media, with stakeholders, a lot of customer calls. So there are many, many more facets of the business that you're seeing. It also gives you a perspective you couldn't have before. You see how the pieces fit together, so you can see a little bit further out than you could before. Before, I was in a more myopic situation, seeing just the things relating to engineering. So that's the best part.

Obviously you work closely with customers; you introduced a few of them this morning up on stage. But after the keynote, did you hear any reactions from people? What are they saying?

Yes, the keynote was just recently, so on my way here a couple of people high-fived me just before I got up on stage, about the serverless offering.
People are really excited about that. Less DevOps, less configuration, let them focus on the innovation. They want that. So that's something that was celebrated yesterday.

Can you recap that real quickly for our audience here, what the serverless offering is?

Absolutely. It's very simple: we want lots and lots of data scientists to be able to do machine learning without having to worry about the infrastructure underneath it. So we have something called serverless pools, and with serverless pools we can just have lots of data scientists share them. Under the hood, this pool of resources shrinks and expands automatically. It adds storage if needed, and you don't have to worry about its configuration. It also makes sure the different data scientists are isolated from each other, so if one data scientist happens to run something that takes much more resources, it won't affect the other data scientists sharing the pool. The short story is that you cut costs significantly. You can now have 30, 100 people share the same resources, and it enables them to move faster because they don't have to worry about all the DevOps they would otherwise have to do.

Yeah. George, is that a really big deal for the industry?

Well, we know that whenever there's infrastructure that gets between a developer or data scientist and their outcomes, that's friction. I'd be curious to put that into a bigger perspective: if you go back several years, what was the class of apps Spark was being used for, and in conjunction with what other technologies? Then bring us forward to today, and then maybe look out three years.

Yeah, that's a great question. From the very beginning, data has been key for any of the predictive analytics we're doing. So that was always a key thing. But back then we saw more Hadoop data lakes. There were more data lakes, data reservoirs, data marts that people were building out. We also saw a lot of traditional data warehousing.
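To make the serverless-pool idea above concrete, here is a toy, plain-Python sketch of the behavior described: shared capacity that grows and shrinks with demand, with a per-user cap so one heavy job can't starve the others. This is an illustration only, with made-up class and parameter names; it is not Databricks' serverless pools implementation.

```python
# Toy sketch of a shared, autoscaling pool with per-user isolation.
# All names and limits here are illustrative assumptions, not Databricks code.
class ServerlessPool:
    def __init__(self, min_workers=2, max_workers=100, per_user_cap=8):
        self.min_workers = min_workers
        self.max_workers = max_workers
        self.per_user_cap = per_user_cap   # isolation: hard cap per data scientist
        self.allocations = {}              # user -> workers currently held

    def request(self, user, workers):
        # Enforce the per-user cap first, then the shared headroom, so one
        # user's heavy job can't take resources away from everyone else.
        granted = min(workers, self.per_user_cap - self.allocations.get(user, 0))
        headroom = self.max_workers - sum(self.allocations.values())
        granted = max(0, min(granted, headroom))
        self.allocations[user] = self.allocations.get(user, 0) + granted
        return granted

    def release(self, user):
        # When a job finishes, its workers return to the pool.
        self.allocations.pop(user, None)

    def size(self):
        # The pool "expands and shrinks automatically": it is sized to current
        # demand, never below min_workers or above max_workers.
        return max(self.min_workers, sum(self.allocations.values()))
```

For example, a user asking for 50 workers would be granted only 8 under these caps, leaving the rest of the pool for the other data scientists sharing it.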
These days we see more and more things moving into the cloud. So we often see the Hadoop data lake in enterprises being transformed into cloud blob storage. It's cheaper, it's geo-replicated, it's on many continents. That's something we've seen happen, and we work across any of these, frankly. From the very beginning, one of Spark's strengths has been that it integrates really well wherever your data is. And there's a huge community of developers around it, over a thousand people now who have contributed to it. Many of these people are in other organizations; they're employed by other companies, and their job is to make sure that Databricks or Spark works really, really well with, say, Cassandra, or with S3. So that's a shift we're seeing. In terms of the applications people are building, it's moving more into production. Four years ago, much more of it was interactive and exploratory. Now we're seeing production use cases. The fraud analytics use case I mentioned, that's running continuously, and the requirements there are different. You can't go down for, say, 10 minutes on a Saturday morning at 4 a.m. when you're doing credit card fraud detection, because that's a lot of fraud, and that affects the business of, say, Capital One. So that's much more crucial for them.

So what would be the surrounding infrastructure and applications to make that whole solution work? I mean, would you plug into a traditional system of record at the sales-order-entry kind of process point? Are you working off semi-real-time or near-real-time data? And did you train the models on the data lake? How do the pieces fit together?

Yeah, so unfortunately the answer depends on the particular architecture the customer has. Every enterprise is slightly different. But it's not uncommon that as the data is coming in, they're using, say, Spark Structured Streaming in Databricks to get it into S3. So that's one piece of the puzzle.
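The ingestion step just described — events streaming in and landing continuously in blob storage — can be sketched in plain Python. In the real setup this would be Spark Structured Streaming writing to S3; here a local directory plays the role of the blob store, and the event fields, paths, and partitioning scheme are all illustrative assumptions.

```python
# Plain-Python stand-in for "stream -> blob storage" ingestion. Not Spark code:
# a local directory stands in for S3, and partitioning by date mimics how a
# streaming job typically lays out landed data.
import json
import os

def land_events(events, landing_dir):
    """Append each event to landing_dir, partitioned by the event's date."""
    for event in events:
        # e.g. landing_dir/date=2017-06-06/ for an event at 2017-06-06T09:30:00
        partition = os.path.join(landing_dir, f"date={event['ts'][:10]}")
        os.makedirs(partition, exist_ok=True)
        with open(os.path.join(partition, "part-0000.json"), "a") as f:
            f.write(json.dumps(event) + "\n")
```

Once data lands this way, the downstream consumers Ghodsi describes next (interactive SQL, anomaly detection, periodic model training) can all read from the same landed copy.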
Then once it ends up there, from then on it funnels out to many different use cases. It could be a data warehousing use case where they're just running interactive SQL on it; that's the traditional interactive use case. But it could be a real-time use case where it's actually taking the data it has processed, detecting anomalies, and putting triggers into other systems, and then those systems downstream will react to those triggers. It could also be that it's periodically training models and storing the models somewhere. Oftentimes that might be in Cassandra or in Redis or something of that sort; it'll store the model there. And then some web application can take it from there, do point queries to it and say, okay, I have a particular user that came in here, George now, quickly look up his feature vector, figure out what product recommendations we should show this person, and then it takes it from there.

So in those cases, Cassandra or Redis are playing the serving layer, but the prediction model is generated by you, and they're just doing the inferencing, the prediction itself. So if you look out several years, without asking you for the roadmap, answer what you feel free to: how do you see that scope of apps expanding, or the share of an existing app like that?

Yeah, so there are two interesting trends I believe in; I'll be foolish enough to make predictions. One is that I think data warehousing as we know it today will continue to exist. However, it will be transformed, and all the data warehousing solutions you have today will add predictive capabilities, or they will disappear. So let me motivate that. Say you have a data warehouse with customer data in a fact table; you have all your transactions there, you have all your products there.
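The serving pattern Ghodsi describes — feature vectors and a periodically trained model stored in a key-value store, with the web app doing a point query at request time — can be sketched minimally. A plain dict stands in here for Cassandra or Redis, and the model, features, and scoring rule are all toy assumptions, not anything from the interview.

```python
# Minimal sketch of the serving-layer pattern: point queries against stored
# features and a stored model. A dict stands in for the Cassandra/Redis
# serving store; keys, weights, and features are illustrative.
serving_store = {
    "model:recs": {"weights": {"electronics": 0.9, "books": 0.4, "garden": 0.1}},
    "features:george": {"electronics": 1.0, "books": 0.5, "garden": 0.0},
}

def recommend(user_id, top_n=2):
    # Point query: fetch this user's feature vector and the current model.
    features = serving_store[f"features:{user_id}"]
    weights = serving_store["model:recs"]["weights"]
    # Score each product category and return the top-N recommendations.
    scores = {cat: weights[cat] * features.get(cat, 0.0) for cat in weights}
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```

The training job would periodically overwrite `model:recs`; the web application only ever does cheap lookups and a small amount of scoring, which is exactly the "serving layer vs. inferencing" split discussed above.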
Today you can plug BI tools in on top of that and see what your business health is today and yesterday, but you can't ask it to tell you about tomorrow. Why not? The data is there. Why can I not ask this customer data: tell me which of these customers are going to churn, or which of them should I reach out to because I can possibly upsell them? Why wouldn't I want to do that? I think everyone will want to do that, and every data warehousing solution in 10 years will have these capabilities. Now with Spark SQL you can do that, and the announcement yesterday showed how you can bake machine learning models and export them so that a SQL analyst can access them directly with no machine learning experience. It's just a simple function call, right? And it just works. So that's one prediction I'll make. The second prediction is that we're going to see lots of revolutions in different industries, beyond the traditional "get people to click on ads" and "understand social behavior." We're going to go beyond that. For those use cases, it'll be closer to the things I mentioned, like Shell, and what you need to do there is involve the domain experts. The domain experts will come in, the doctors or the machine specialists. You have to involve them in the loop, and they'll be able to transform maybe much less exotic applications. It's not this super high-tech Silicon Valley stuff, but it's nevertheless extremely important to every enterprise, every vertical on the planet. That, I think, is the exciting part of where predictions will go in the next decade or two.

If I were to try and pick out the most man-bites-dog kind of observation in there, the thing that was supposed to be unexpected?
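The "simple function call" idea above — an exported model that a SQL analyst invokes directly from a query — can be demonstrated in a self-contained way. This sketch uses Python's built-in SQLite rather than Spark SQL, and the "model" is a hand-written scoring rule rather than a trained one; table, column, and function names are all hypothetical.

```python
# Analogous sketch of "model as a SQL function call", using SQLite's
# create_function in place of a Spark SQL UDF. The churn "model" is a
# hard-coded stand-in, not a real trained model.
import sqlite3

def churn_score(days_inactive, monthly_spend):
    # Stand-in for an exported model: more inactivity and lower spend -> higher risk.
    return min(1.0, days_inactive / 90.0) * (1.0 if monthly_spend < 20 else 0.5)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT, days_inactive INT, monthly_spend REAL)")
conn.executemany("INSERT INTO customers VALUES (?, ?, ?)",
                 [("ann", 80, 10.0), ("bo", 5, 150.0)])
conn.create_function("CHURN_SCORE", 2, churn_score)

# The analyst just writes SQL -- no machine learning experience needed.
rows = conn.execute(
    "SELECT name FROM customers WHERE CHURN_SCORE(days_inactive, monthly_spend) > 0.5"
).fetchall()
```

The point is the interface: once the model is registered, asking "which customers will churn" looks like any other SQL predicate.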
I would say it's where you said all data warehouses are going to become predictive services, because what we've been hearing is sort of the other side of that coin, which is that all the operational databases will get the predictive capabilities. But you said something very, very different, and I guess my question is: are you seeing the advanced analytics go to the data warehouse because the repository of data is going to be bigger there, so you can build better models, or because it's not burdened with transaction SLAs, so you can serve up predictions quicker?

Well, I'm saying something simple, which is that data warehousing has been about basic statistics, right? SQL, the language that's used, is there to get descriptive statistics: tables with averages and medians. That's statistics. Why wouldn't you want advanced statistics, which does predictions, on top of it? It just so happens that SQL is not the right interface for that. So it's going to be very natural that people who have already been asking statistical questions for the last 30 years of their customer data, these massive troves of data they have stored, will also want to say, okay, now give me more advanced statistics. I'm not an expert on advanced statistics, but you, the system, tell me what I should watch out for. Which of these customers should I talk to? Which of the products are in trouble, or which parts of my business are not doing well now? Predict the future for me.

When you're doing that, though, you're now doing it on data that has a fair amount of latency built into it, because that's how it got into the data warehouse. Whereas if it's in the operational database, it's really low-latency, typically low-latency stuff. Where and why do you see that distinction?

That's great. I do think we'll also see more and more real-time engines take over.
If you do things in real time, you can do it for a fraction of the cost, so we'll also see those capabilities come in. So your question is, why would you want to batch everything once a week into a central data warehouse? I agree with that. It'll be streaming in live, and on that you can do predictions, you can do basic analytics. I think the lines will basically blur between all these technologies we're seeing. And in some sense Spark was actually the precursor to all of that. Spark was already unifying machine learning, SQL, ETL, real time, and you're going to see that everywhere.

Okay, so you mentioned Shell as an example, one of your customers. You also had HP and Capital One, and you've developed this unified analytics platform that's solving some of their common problems. Now that you're in the mood to make predictions, what do you think are going to be the most compelling use cases or industries where you're going to see Databricks going in the future?

That's a hard one. Right now I think healthcare: there are a lot of data sets, there's a lot of gene-sequencing data, and they want to be able to use machine learning. In fact, I think those industries are slowly being transformed from classical statistics to machine learning. We've actually helped some of these companies do that. We've set up workshops, they've gotten people trained, and now they're hiring machine learning experts who are coming in. So that's one: the healthcare industry, whether it's for drug testing, clinical trials, even diagnosis. That's a big one. I do think industrial IoT: these are big companies with lots and lots of equipment. They have tons of sensor data, massive data sets, and there are a lot of predictions they can do on that. So that's a second one, I would say. The financial industry, they've always been about predictions, so it makes a lot of sense that they continue doing that.
Those are the biggest ones for Databricks, but as other verticals slowly move into the cloud, we'll see more other use cases as well. Those are the biggest ones I see right now. It's hard to say where it will be 10 years from now, 15. Things are going so fast that it's hard to even predict six months out.

You believe IoT is going to be a big business driver?

Yeah, absolutely, right.

I want to circle back to where you said that we've got different types of databases but we're going to unify the capabilities, without saying one wins and one loses.

Yes, I didn't want to make that prediction.

But describe maybe the characteristics of what a database that complements Spark really well might look like.

Yeah, that's hard for me to say. The capabilities of Spark, I think, are here to stay. The ability to ETL a variety of data that doesn't have structure, where structured query language, SQL, is not fit for it, that is really important, and it's going to become more important. If data is the new oil, as they say, then it's going to be very important to be able to work with all kinds of data and get it into these systems. And there are more things being created every day, devices, IoT, whatever it is, that are spewing out this data in different forms and shapes, so being able to work with that variety is going to be an important property. So they'll have to do that. That's the ETL portion, or the ELT portion. Then there's the real-time portion: not having to do this in a batch manner once a week, because now time is a competitive advantage. If I'm one week behind you, that means I'm going to lose out. So doing that in real time, or near human real time, that's going to be really important. That's going to come as well. I think people will demand it, and it's going to be a competitive advantage. Wherever you can add that secret sauce, it's going to add value for customers.
And then finally there's the predictive stuff, adding the predictive capabilities. But I think people will also want to continue doing all the old stuff they've been doing. I don't think that's going to go away. Those things bring value to customers; they want to do all those traditional use cases as well.

So what about now, where customers expect to have some portion, it's not clear how much, of an application platform like Spark on-prem, some in the cloud, now that you've totally reordered the TCO equation, but then also at the edge for IoT-type use cases? Do you have to slim down Spark to work at the edge? If you have serverless working in the cloud, does that mean you have to change the management paradigm on-prem? What does that mix look like? How does a Fortune 200 company get their arms around that?

Yeah, I mean, this is a surprising thing. The most surprising thing for me in the last year is how many of those Fortune 200s that I was talking to three years ago were saying, no way, we're not going into the cloud, you don't understand the regulations we're facing, or the amount of data we have, or we can do it better, or the security requirements we have, no one can match that. Now those very same companies are saying, absolutely we're going, it's not about if, it's about when. So now I would be hard pressed to find any enterprise that says, no, we're never going to go. And we've even seen some companies go from the cloud to on-prem and then now back, because the prices are getting more competitive in the cloud, right? There are now at least three major players competing, and they're well-funded companies. In some sense, you have ad money and office money and retail money being thrown at this problem. Prices are getting competitive. Very soon, most IT folks will realize there's no way they can do this faster, better, more reliably, or more securely themselves.
We've got just a minute to go here before the break, so we're going to wrap it up. We've got over 3,000 people here at Spark Summit, so this is the Spark community. I want you to talk to them for a moment. What problems do you want them to work on the most, and what are we going to be talking about a year from now at this table?

Okay, that second one is harder, yeah. So I think the Spark community is doing a phenomenal job. I'm not going to tell them what to do. They should continue doing what they're already doing, which is integrating Spark into the ecosystem, adding more and more integrations with the great technologies that are happening out there. So continue the innovation, and we're super happy to have them here. We'll continue it as well. We'll continue to host this event, and we look forward to also having a Spark Summit in Europe and on the East Coast soon.

Great, so I'm not going to ask you to make any more predictions. All right, excellent. Ali, this was great stuff today. Thank you so much for taking some time and giving us more insight after the keynote this morning. Good luck with the rest of the show.

Thank you. Thanks so much.

Thank you, George. And thank you for watching. That's Ali Ghodsi, CEO of Databricks. We are at Spark Summit 2017 here on theCUBE. Thanks for watching. Stay with us.