From New York, extracting the signal from the noise. It's theCUBE, covering Spark Summit East. Brought to you by Spark Summit. Now your hosts, Dave Vellante and George Gilbert. This is theCUBE, we're here live at Spark Summit East in Midtown at the Hilton Hotel. This is our second big data event this year. It's interesting to note it's in New York. A lot of the doers in big data are in New York. theCUBE goes out to the events, we extract the signal from the noise. Ali Ghodsi is here, he's the CEO of Databricks. Ali, welcome to theCUBE. Thank you very much. I love when an executive stands up and, I'm a big fan of Simon Sinek, people don't buy what you do, they buy why you do it. And you started off saying the reason why we started the company was to simplify big data. I love that. It's a simple but defining statement. Talk about that a little bit. Well, I mean, it was pretty simple for us. At that time, we were all active in the Hadoop project, and what we saw is that even really simple operations were extremely complicated. So if you're given, say, a thousand numbers and I ask you to just compute a statistical average of those thousand numbers, you put it in Excel, you click, and then you get the answer. Exactly the same simple, trivial problem, if you're asked to do it, say, on a petabyte of data, is rocket science all of a sudden. You need PhDs, you need to spend a lot of money, you need to set up projects, you need to spend many, many months, and then at the end of it, you'll get that answer: okay, the answer is 42. So our vision was, it won't stay this way. 10, 20, 30 years from now, this complexity is going to go away. It's going to become just as easy. So we want to be part of that journey to simplify, and what it ultimately means is that companies can extract insights more easily, it lowers their costs, and they can do great things with it.
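The average-at-scale problem Ghodsi describes is, at heart, a partial-aggregation problem: each machine reduces its chunk to a (sum, count) pair, and those pairs combine into the final answer. A minimal sketch of that idea in plain Python (standing in for a distributed Spark job, with made-up data) might look like:

```python
from functools import reduce

# Pretend each "partition" is a chunk of the data living on a different machine.
partitions = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0, 7.0, 8.0, 9.0]]

# Map side: each partition independently reduces to a (sum, count) pair,
# so only two numbers per machine ever cross the network.
partials = [(sum(p), len(p)) for p in partitions]

# Reduce side: combine the pairs, then finish the average on the driver.
total, count = reduce(lambda a, b: (a[0] + b[0], a[1] + b[1]), partials)
average = total / count
print(average)  # 5.0
```

The point of the sketch is why the naive approach (shipping every number to one machine and averaging there) stops working at a petabyte, while the (sum, count) decomposition scales.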
And you guys talked a lot about the training efforts that you made this year. I'm struck by the story of the early days of the automobile: people were concerned that there wouldn't be enough chauffeurs to drive around the only people who could afford the vehicle. Of course, that didn't scale, so they had to make it something people could drive themselves. Not self-driving; self-driving we'll get to. But in that spirit, there's so much data that we as humans can't absorb it anymore. So we need data scientists, we need trained developers. You can only train so many; eventually software has to take over. Talk about what you're doing from a training standpoint. That was one of your big learnings of this past year. Yeah, enabling people to get access directly to the data. I mean, access to data sets is one; the other one is great training material, having that available so that they don't have to go through those hoops of setting it up themselves, getting machines, configuring it, stitching it together. It's a full-time job to just do that, and only then are you set up, ready to start exploring, to write your own programs and figure out how to extract insights from your data. So you want to lower that bar and make that happen much more seamlessly. You talked about the early days of big data, and how working with a petabyte versus a thousand rows in Excel becomes rocket science. That was a developer perspective, but we had a similar problem on the admin side, where Hadoop is an ecosystem, not a product. Even for a vendor, Doug Cutting told us, it's kind of scary: every time they get a new project, it's like, well, how do we manage it, and we can't deprecate the old stuff? You took a very different approach. Yeah, I mean, this is one of the beauties of the cloud. In the cloud, there are a lot of different companies, and they have services that they're hosting for you. They have already configured it for you.
It's up and running, it's elastic, so you can compose those services or products that you need for your use cases seamlessly, or much more easily than if you had to do it on-prem. On-premises, you have to get the software, it has to fit well, you have to make sure that you open up the security and make that work out. And it's just a very different game than something where it's already a service up there. So oftentimes, Databricks will get a customer that says, we'd also like to do this other thing; do you have any answer to this piece of the equation or the problem that we're trying to solve? And we might not have an answer to that, but there's a cloud provider that already has that service, and we'll just integrate immediately with it, seamlessly. And that company will be responsible for hosting that service and configuring it and managing it, and we're responsible for ours. But as a service provider, you've made it easy to deliver and operate your services. Someone like a Cloudera with Director can go and make some assumptions about how things should be configured, but you've gone another step further, which is rather than having 30-some projects each on their own release cycle, you have one engine. Tell us a little bit how that makes it easier from an admin's point of view. Well, first of all, we'll make sure that it's always integrated in one version. Another advantage of the cloud is that we only have two versions of our software at any given time running in the whole world. That is not true if you're on premises, right? There are many older versions of your software running in different data centers. For us, there are only two versions. So N and N minus one are the only two versions. Exactly, exactly, and we move in lockstep, and we make sure that the next version that we put out is integrated and configured and working out of the box. We have extensive automated testing that tries that out.
So you don't have to, really; it's a completely different play. We also release every week. So a new release goes out every week and then it's pushed out. So in a two-week period, customers will see the version that you're working on. So that's also a completely different model. Whereas with open source Spark, which is also used in non-cloud settings, those releases happen on three-month or six-month, longer horizons. That's a different play. Those projects tend to be much more cyclical. You do development, then you freeze the code, and there's a QA period, a documentation period, and it sort of goes like this. With a cloud service, it's not like that. It's continuously developing features and deploying them to customers. And from the customer standpoint, that integrated package is so attractive. You mentioned today, your platform that you launched last summer comes with open source Spark, SQL, machine learning, other components, as well as the core platform. So essentially you can deploy that at software-like marginal economics. The repeatability is there, so that simplifies the customer's world dramatically. And obviously, from your standpoint, it's a better business model. So talk about that a little bit. What customer impact have you seen there? Yeah, so first of all, there is this line of what goes in open source and what doesn't for Databricks. Databricks, I think, has developed about a million lines of code of software that is part of the cloud offering that we have. That's not part of the open source. And then there's open source Spark, which I think is roughly 250,000 lines of code these days. So that's one of the things that we were very excited about when we started Databricks: how do you draw that line, for open source companies, between which pieces should be proprietary and which pieces should be open source? A lot of companies struggle with that. And how do you monetize that?
And we had an answer that we were very happy about. So for us, the line is: the things that you run on our service, on top of Spark, on the execution engine, we don't want to lock you in on that. So if you can run it there, you should be able to pull that out and run it on any other Spark service anywhere else. And we have certifications for Spark that Databricks provides that actually ensure that it runs well on those other distros or other places. So that's really important. But things like integrations, security, or things that speed it up or make it work really well out of the box in certain environments, that's really important to enterprises, and that doesn't necessarily need to be in the open source. Okay, so that's how you will define your competitive advantage going forward, and that's your business model. Yeah, and we've been true to this. So Spark SQL, for instance, is part of Spark now. We actually developed it at Databricks after we founded the company. It was not part of Spark originally, and we contributed it back to open source. And it was an excellent decision looking back now, because we want to make sure that anyone who can run on the Spark open source platform should be able to run it also in environments other than the Databricks-hosted one. So let's talk about, back up a little bit, or hit the escape key on, how customers are applying your technology. I was saying to George, in the early days, we were here at this hotel for the early days of Hadoop World, and it was Pig and Hive and Sqoop and Flume and all kinds of geeky stuff, and there's a lot of that going on here as well. Let's take it to the business impact. So your goal is to simplify big data, AKA simplify Hadoop. People attacked the storage problem with Hadoop; that's number one. Number two is that the whole data warehouse, business intelligence business never lived up to its promise, so now big data analytics puts forth this big promise.
Which, so far, Hadoop has not lived up to, except for the first piece, reducing my investment, the denominator. What does Spark bring to solving those business problems? Well, companies don't care about the technology per se. They have a problem they want to solve, so they're looking for a solution to their problems. They have a use case, they have a project around that, they have an initiative or a platform they want to deploy. For them, the problem that they're attacking has many different pieces. A small portion of that might be one of these technologies that you mentioned, one of those three-letter acronyms, and so they're not really focused on that. So one of the visions of Spark, and of what we provide, has all along been: how can we give them something that works end to end? So it's turnkey. Rather than having to stitch together different pieces, how do we do that? And that was actually one of the visions of the Spark project from the very early days. We used to use the term unifying, so it unifies different use cases. So if I take a step back and look at Spark: when we developed Spark, there were really two separate problems we wanted to attack. One was that companies had already stored their data in either data warehouses, or maybe in data lakes, or enterprise data hubs, or in Hadoop, and more and more these days also in the cloud. But the data was siloed and they couldn't make decisions on it, so that was one problem. And as part of that, the data that they had, you know, was sometimes unstructured, sometimes semi-structured. So it was coming in in different forms, especially now with IoT devices exploding; you're collecting a lot of data from different environments and sensors. So how do you work with all those data sets? It was not just the scale, it was also the variety that you saw, and then there was this integration problem.
So many companies would have these siloed data sets and they couldn't work with them. So we wanted to unify those. How can we give you access to all of them in a really efficient manner? That was one, so you can make decisions on the data that you have. The second was unifying the use cases that companies had. And what we saw is that, yes, they want to have a SQL engine that they can run BI on top of. Yeah, absolutely, they want to do that. But they also want to go further. There are also advanced analytics use cases that they want to do, which the traditional software does not give them. They want to be able to do forecasting. They want to be able to do anomaly detection. They want to be able to cluster or segment the data that they have based on features they've seen. I mean, this is the dawn of AI, right? Strong AI, you mentioned self-driving cars. There is Siri, there is Watson. So every company also wants to hit these use cases. And then finally, the last one is real time. They want to do all of what I said in real time, on recent data, not just on data from a month ago or a week ago. So we unify these. Yeah, so unifying these is where you provide value to the customer. So not having the siloed data, and being able to unify the use cases that they have, so they can build end-to-end pipelines that solve the problems or give them the turnkey solutions that they need. Okay, so the silo busting adds value, and there's a simplification component there, unifying the use cases in real time. I wonder if you could comment on a broader theme that we've been discussing, which is that for the past decade-plus, information has become widely available to consumers. So the pricing power has shifted from the brands to the consumers.
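Of the advanced analytics use cases Ghodsi lists, anomaly detection is the easiest to show concretely. A toy sketch in plain Python (illustrative only, with made-up sensor data; a real Spark pipeline would run the same logic over distributed data with MLlib or SQL) flags readings that fall far from the mean:

```python
import statistics

# Made-up sensor readings; the 250.0 is an injected outlier.
readings = [10.2, 9.8, 10.1, 10.4, 9.9, 250.0, 10.0, 10.3]

mean = statistics.mean(readings)
stdev = statistics.stdev(readings)

# Flag anything more than two standard deviations from the mean.
anomalies = [x for x in readings if abs(x - mean) > 2 * stdev]
print(anomalies)  # [250.0]
```

The two-sigma threshold here is an arbitrary choice for the sketch; production systems tune the threshold, use robust statistics, or fit a model, but the shape of the problem, "learn what normal looks like, flag what deviates," is the same.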
And so it seems like a lot of the big data initiatives are ways in which brands can begin to develop proprietary data that they can act on to regain some power in the marketplace relative to their competition, as a business opportunity. Do you see that? I mean, is that a fair assertion on our part, that pricing power has moderated that asymmetry where the brands had the advantage, and now it's the brands trying to get that advantage back? Is that actually happening? Well, I think the way I would see it is that data has become democratized. So you have to keep the companies honest. The value that they provide has to match the price that they charge. And that's what's happening: with so many offerings and so many comparison points, companies can now much more easily compare and see and make sure that they're actually paying for the value that they're getting. I think that's good for everyone. It is, but now if you look at Amazon and Google and Netflix, they seem to be creating a new advantage for themselves, which is they know more about me than I know about me. They know when I'm ready to buy, they know when I run out of paper towels, et cetera. But that's providing value for you, right? That's valuable to me. Oh, absolutely. But the value comes from creating some proprietary knowledge about me that I don't even have. Because they can silo-bust in a way that you can't. Yes, and so there's value creation. It's a little scary, maybe, sometimes, but that seems to us to be the big opportunity for businesses who really do a good job in this space. Yeah, I mean, they're figuring out what you need, right? Before you even know it. Yeah. Absolutely. I mean, we have to also be careful, you know, but as long as it's machines that are looking over the data sets, there's not someone sitting there reading your data, and they're doing it to help you, I think that's value. Of course, it can go wrong. So you have to be careful.
That's another big theme. Six, seven years ago at these events we were talking about, you know, humans: can you take humans out of the equation? And the theme was always, no, no, humans are that last mile, like the old cable analogy. Is that changing? Yeah, I mean, we're definitely automating more and more and more. It's unclear how much humans will be needed. The line keeps moving, doesn't it? I think that's great. I mean, I think the more you can automate, the better it will be. It doesn't mean that it's going to be bad for those people that, you know, no longer do those jobs. It just means, you know. Well, I'm not sure, actually. No, no, I'm saying it is what it is. Yeah, well, more importantly, the resources on the planet are the same. So it just means we have more time to do other things. Maybe we don't need to have someone doing that really mundane task; they can focus instead on something else. Well, and that's kind of what you guys are doing, right? When you simplify big data, you're freeing up people to do more. Not to fire people, but to free people. Now, people may lose jobs, who knows, but other jobs will be created, hopefully. Yes, absolutely, yeah. You said something interesting about busting not just the silos of the different data sets and data structures, but the use cases as well: SQL, machine learning and its variants, and real time. Now, those can encompass larger and larger surface areas. Where are your customers taking you, and in what order, to push back those boundaries that are there now? I see. I mean, what we're seeing really strongly, especially this year, more recently, is a push for real time. It was different a year or two ago. A year or two ago, there were a lot of advanced analytics use cases that companies had, in addition to traditional use cases.
Now, this year, we're seeing that, you know, all three are now being combined. So you set up processes where you break those silos, and for that, you need something that can work across these different storage systems and formats. And once you've done that, you have your data in a format that you can leverage and start doing more analytics on, and maybe more advanced analytics. And then now, companies also want to do that in real time. So that's a strong use case that's coming in. So basically bringing transaction and analytics systems together to affect an outcome in near real time, anyway, so that you can gain market share or maybe save a life? Yes, yes, exactly. By the way, you said near real time; that's a good point. You know, when we ask customers, what are your actual requirements? What are really the requirements? Oftentimes, you hear this thing that I call human real time. It's fast enough so I can react to it. If it's a millisecond or a hundred milliseconds, I don't really care. But I don't want to work on half-an-hour or two-hour-old data. Yeah, we define real time as before you lose the customer. Yes. Whatever timeframe that is. Right, right, yeah. And it's shrinking, I guess, isn't it? So anyway, I wanted to ask you about your thoughts on Spark Summit and the growth. Earlier, you guys, and Matei, showed some growth charts, which were quite interesting. Yes. You've got to be thrilled. Yeah, very excited, yeah. Talk about this event, its global nature. Yeah. You know, what's coming next? Yeah, I mean, there are a lot of developers that come to this conference. We'd like to keep it that way. In the future, we might separate it out more clearly. With tracks or with a different conference? It'll be the same conference, but we'll delineate it more. Otherwise, you face this challenge of having two split audiences that you want to cater to at the same time.
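The "human real time" requirement Ghodsi describes, being fast enough to react but never working on half-hour-old data, is commonly implemented as a sliding time window over a stream. A minimal sketch in plain Python (illustrative only; Spark's actual streaming APIs express windows declaratively) keeps just the recent events and evicts stale ones:

```python
from collections import deque

# Keep only the last 60 seconds of events: old data is evicted, not analyzed.
WINDOW_SECONDS = 60
window = deque()  # (timestamp, value) pairs, oldest first

def add_event(now, value):
    # Append the new event, then evict anything that has aged out of the window.
    window.append((now, value))
    while window and window[0][0] < now - WINDOW_SECONDS:
        window.popleft()

# Simulated clock: events arriving at t=0, t=30, and t=90 seconds.
add_event(0, 100)
add_event(30, 200)
add_event(90, 300)   # the t=0 event is now stale and gets dropped

values = [v for _, v in window]
print(values)  # [200, 300]
```

Whether the window is a millisecond or a minute matters less than the guarantee that decisions are always made on data inside it; that is the distinction between "human real time" and batch reports over last month's data.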
We want to definitely keep the developer focus. So a lot of the developers come here; I think 70, 80%, when we survey them, have some kind of developer background. We want to still keep them engaged. Many of them, when we surveyed them on why they're here, one of the main reasons they give is: we want to see where the future will be in big data. So we want to keep that, have exciting talks for them. I mean, we're indebted to this community. We're extremely thankful. It's not going to be easy. I mean, I remember Hadoop World 2009, 2010: it was that same community, but the demand from the business community was overwhelming. But your commitment is to try to preserve that flavor. And then what, serve the business community with a different sort of agenda? We could have tracks, and actually for Spark Summit West, there might be some different way we organize this, so that we can have a clear delineation, and those that are interested in the business perspective can go there and watch those talks and the keynotes from industry leaders. And then on the developer side, there'll be exciting developer talks like the ones you saw from Matei today. Well, we saw the early days; George really pushed us to get into the whole Spark community. So we appreciate, Ali, you guys having us here and giving us this great space, and thank you for coming on theCUBE. Thank you. It's our pleasure. Thank you. All right, keep right there, everybody. We'll be back right after this. This is theCUBE. We're live from Spark Summit East. We'll be right back.