 live from Union Square in the heart of San Francisco. It's theCUBE covering Spark Summit 2016, brought to you by Databricks and IBM. Now here are your hosts, John Walls and George Gilbert. Well good morning and welcome back to Spark Summit 2016 here on theCUBE. I'm John Walls along with George Gilbert. Great to have you here for our second day of coverage here from the Hilton Hotels. We continue our look at what's going on in the world of Spark. It has certainly been a fantastic couple of days we've had here already looking forward to joining you throughout this day as well. Along with George, we are very, very pleased to welcome Doug Cutting, who is the chief architect at Cloudera, the father of Hadoop. I don't have to introduce you, Doug, but thank you for joining us here for the open. We appreciate that. Oh, thanks for having me. Pleasure to be here. You've had this great seat, you know, for the past 10, 12 years, you know, obviously with Hadoop and that evolution MapReduce leading now to Apache Spark. You're talking about that a bit this morning. Characterize it from really the best view of all, I think, that you've had and how you would characterize this evolution toward Apache Spark. Yeah, no, as I said in the keynote this morning, you know, from the very early days, it wasn't one project. It's an ecosystem that developed within a year or two. We had things like Pig and Hive and HBase building on top of the core Hadoop. And Spark is just, you know, another step in that, a big one, you know, Spark is incredibly useful. It's on its way to replacing MapReduce. And that's what excites me the most is seeing this process, this evolution that this platform is reinventing itself on a regular basis and improving at a rate that we haven't seen before in enterprise software. It's a wonderful time. 
Well, and this might be getting way too far ahead, but I'm just kind of curious, we talked a lot about this acceleration and what's happened with Spark and how this easier, simpler, faster mantra has really become, you know, the code here. But do you see, is it possible that there's a barrier coming, there's going to be a new iteration, a new product, a new system or service that's going to then render Spark to the back page or even, yeah, because this energy that's created through open source really kind of lends itself toward that. We'll see, I mean, obviously I don't know the answer because the community sort of votes, users decide where we go as an ecosystem based on what's useful. My guess is there's some tensions, you know, people, they learn some new system, they adopt it, they embrace it and they can't give it up lightly. They invest a lot in it and so you want to pick things that are going to last. So in order to really become a new thing that's widely used, there's a certain threshold that has to be passed. I think there will be new things. Will they replace major existing components? Hard to say, I mean, so with Hadoop, there's sort of three components, HDFS, YARN, MapReduce, and Spark is really on its way to replacing MapReduce, as I said, but HDFS and YARN look to still have a very strong line. Still very strong places. Yeah, and so we'll see, you know, will something ever replace Spark completely? Nothing ever dies in software, right? There's always legacy. But could some new generation of execution engine come along that's superior in every way? It's possible. It wouldn't be my most likely outcome in the short term. I think we've got enough invested in Spark as an execution engine, but there's things beyond that. We're seeing Kafka providing tremendous utility. It's not replacing Spark by any means. It's a complement, another complement. So I think we see more things like that, but change is the new normal. 
And so eventually, a lot of these things will go away. And it's a good thing, generally speaking, because we're getting people more effective systems. All right. So let me key off something you said about the innovation in the ecosystem. And I mean, it is unprecedented. But there's a trade-off, you know, on the one extreme sort of, let's say, Oracle for data management controls and carefully integrates everything. And so it's slow to evolve, but it's integrated so it's a little bit easier on the developer, the administrator. At the Cloudera Analyst Day, I talked briefly to the VP of Engineering. I forgot his name. And he said, you know, when we take a new project, a new Apache project to be part of our curated distro, we budget 50% of the engineering cycles for that project for interop. Is he talking about interop with your management tools like Cloudera Manager and Navigator? Or is it something where it's one off, well, not really one off, but engine by engine? Probably more the latter. A lot of it is making sure of the security. You have a coherent security story across the stack that things stay encrypted using the same keys, ideally. And the same key services. So you need coherence at that level. The logging, the auditability. And that ends up appearing in our management suite. But it's actually stuff that is shared by the people who are using other management suites. So I think he means investment in the open source projects, not investment in proprietary by any means. Okay, okay, that says a lot. Because what he's saying is you need a sort of coherent conceptual model for security, for example, to wrap the other tools around, whether yours or others. I mean, the early adopters of these tools are always happy to use them as islands with no security. But as they move down the line and you start having somebody who sees something providing some incremental value, they just want it to slot in. 
And so it comes to vendors like Cloudera to do a lot of the work of making them slot in simply. So Cloudera was way out in front in terms of championing Spark as a MapReduce replacement. What are some of the compute engines that used to sit on MapReduce that you're wrapping now around Spark, ones you've done and ones you still have yet to do? I mean, most of the big ones have moved, or are in the process, you're able to now at least try running them on top of Spark instead of MapReduce. So Pig has now been ported to live on top of Spark, as has Hive. And those were the two, you know, classic engines of the ecosystem from the old days. And getting them moved has been huge. And then we see, you know, a lot of integration work, making Spark talk well to HBase, making it work well with Solr, and then other applications. So I also mentioned this morning that GATK, the Genome Analysis Toolkit from, what do they call it, the Broad Institute in Boston, getting that ported, so that's a standard of that industry. You know, it's a vertical piece of software, but having it use Spark as its underlying engine is a huge step, moving existing workloads. And we're seeing lots of folks talking about that today on the stage. Would it be fair to say that that convergence around the new execution engine simplifies life for not just developers, but maybe admins as well? Oh, sure, sure. The more we can consolidate on one engine. I mean, I think generally speaking, it's nice if the ecosystem has one primary general purpose engine that can handle all kinds of different workloads, and then a handful of more specific engines that are highly optimized for tasks. So you've got Solr, you know, Spark is never probably going to be an interactive search engine. It's never going to be a key value store like HBase. There's certain things that it's not going to do, but for everything else, it's nice to have this general purpose API that you can plug under them and support lots of engines. 
And Spark is the best one out there today. And that was actually my next question, which is, can we see, maybe obviously not around Spark, but can we see some convergence around storage? Not complete, but more so that, you know, we don't, the cloud guys talked to us about, actually IBM of all companies, you know, about a data management fabric where they're orchestrating different, you know, storage engines. Can we see something that moves in that direction? You know, it might be nice. It's hard to say. HDFS is going strong for the on-premises storage engine. And HBase obviously builds on that. Now we see Kudu, Apache Kudu coming along, which doesn't build on HDFS. So as much as possible, we need to make it play well. It's got a lot of benefits for a lot of applications, and there's good reasons why it's not built on HDFS. And then also we have, you know, cloud storage, the block storage, and things like S3 and similar from Microsoft and Google, are really important to people. And they need to be first-class citizens in the big data world, which they aren't really today. So, you know, in some ways it'd be nice to have a consolidated storage story. I don't think we're headed there right now. Let's shift to the application world a little bit. I mean, we've heard a lot of, I mean, intelligence is being thrown around every which way, like intelligent augmentation, artificial intelligence, business intelligence, intelligent cloud, heard about that today. Ultimately, what does this all mean to developing an application? I mean, the context and the relevance and all that that you're able to create is really rich experiences now that maybe I couldn't before. I mean, intelligence is clearly a buzzword, you know, and like smart, you know. With that said, you know, the buzzword is useful in that we are doing things better than we were before. And we're using data to accomplish it. And we're seeing fundamental advances in so many industries. 
It's not just, you know, the web companies delivering ads better by any means. It's, you know, it's financial institutions running tighter ships, fighting fraud, managing, understanding their risk exposure, it's, you know, pharmaceuticals better understanding their medicines, better understanding, you know, healthcare, better hospitals, retailers managing inventory, factories managing production, farmers their crops. It's all over. Trains, planes, and automobiles. So when does this hit the marketplace, everyone? When do you think, as a society, that we're going to start appreciating and realizing these gains? As ultimate end users, everybody here is going to benefit from that, not just professionally, but obviously personally and in their personal interactions. It's a slower process than you might expect. It's not like consumers adopting a new app on their phone, which can happen, you know, within a year. You can have this huge groundswell. Enterprises shifting their technology stack just necessarily happens slower. You've got big institutions, they're big ships to turn. Now and then you get startups that can really sort of reinvent things. You know, we see that with companies like Uber and Tesla, which are really using big data from the outset and are doing phenomenally well. But your classic major companies are all adopting this stuff, but they've got to integrate it and turn these ships. So they all have pilots. They all have some early products in production, but it's a long road, still a small percentage. So I think it's going to be over the next decade. We're going to really see this become the standard. It's now the standard for new work, but it's not replacing all this legacy work that has to be. And that just takes a long time to turn over. And in some cases it's thankless work and it's smarter a lot of times, frankly, to wait until you really are forced to redo that part rather than be proactive. 
Just a question on the sort of the spectrum of who owns making the stuff work and how simple it can be for developers. So let's say we have a spectrum. There's the Hadoop that runs in your data center. There's the Hadoop that's been designed to assume it runs in the cloud. So it knows about separation of storage and compute. It understands about ephemeral infrastructure, stuff like that. And then there's the cloud native services, like Kinesis, Firehose, Lambda, DynamoDB, or Redshift. Help us walk through the trade-offs between those. And I assume at this end, the sort of cloud native end of the spectrum, you have simplicity, but you're trading choices. Help us think through that. I mean, the classic choice is that of lock-in, a vendor lock-in. In the RDBMS industry, people built applications at the center of their organizations that were owned by a company that could charge arbitrarily and change the rates and maybe not advance things the way you'd want and so on. Now we've moved to a different basis with open source where people can much more easily choose among vendors and how to, as you mentioned, how to deploy it, whether it's in the cloud or on-premises. And yet, people seem to almost be willing to sacrifice that flexibility. They, I mean, people complain a lot about lock-in. And I think they don't realize a lot of times that the cloud can lock you in completely. And if you're not careful, but I think you can be careful. I think if all of the APIs that you build to are open source components, then you can very easily build something which is portable. So if I'm understanding you right, the APIs will give you data freedom. It's more than the APIs. You've got to have the implementation that is, I mean, because different implementations may not implement the APIs equivalently. You know, I mean. Well, like MapR. Yeah, just because something compiles doesn't mean it runs against somebody else's implementation. 
But that's one of the great things about open source is that we don't standardize on APIs as much as implementations. And we all agree on this as the implementation of this functionality, and that gets you a much higher degree of compatibility. So let me ask one last question, which is, so your distro, there's a lot of work. I mean, even from the very name, Cloudera, you know, assumes we're going to be there. But the management tools that make it, you know, consumable, they have to be very different from what's on-prem in your data center to what's in the cloud. Is that a fair statement? Not completely. There's definitely some distinctions. But largely we think of the, we've got a product called Cloudera Director, which helps you run in the cloud. And largely it helps you run our management suite in the cloud. And then the standard management suite, the same one you'd use on-premises, manages your instances in the cloud. There's obviously features that we're going to add to the management around ephemeral clusters and, you know, adding and removing nodes on the fly that are fairly cloud-specific. And so there's some of that. But for the most part, the cloud can be a layer that helps you bring up instances and bring them in. Doug, before we let you take off, just some overall thoughts about what you're seeing here this week, compared to the first gathering some three years ago. It's a little different scene right now, isn't it? It's phenomenal to see the excitement around Spark. It's really come of age. No longer just something that people are using in science projects. It now has those features that people really need, the sort of security and manageability. And we're seeing the fruit of that in all of these applications and the interest that you can see all around us here. We've got, what, 1,000 people or more here? 3,500? 3,500? All right, there we go. That's phenomenal. Yeah, it sure is. Well, thank you for the time. It's been a pleasure. 
Pleasure to chat with you once again. Always welcome here on theCUBE. We appreciate the time and wish you all the best down the road here. Down the road in your many hybrids, as a matter of fact. A man on the cutting edge, you might say. All right, we'll be back with more. George and I will as we continue our coverage here on theCUBE.