From Munich, Germany, it's theCUBE, covering DataWorks Summit Europe 2017. Brought to you by Hortonworks.
Okay, welcome back everyone. We're here in Munich, Germany for DataWorks Summit 2017, formerly Hadoop Summit, powered by Hortonworks. It's their event, but it's now called DataWorks because data is at the center of the value proposition: Hadoop, plus all data and storage. I'm John Furrier, with my co-host Dave Vellante. Our next guest is Scott Gnau, the CTO of Hortonworks, joining us again from the keynote stage. Good to see you again.
Thanks for having me back. Great to be here.
Welcome. Good to have you back. Let's get down and dirty and get technical. I'm super excited about the conversations that are happening in the industry right now, for a variety of reasons. One is, you can't get more excited about what's happening in the data business. Machine learning and AI have really brought the hype around to mainstream America. People can visualize AI, they see the self-driving car, and they understand now how software is powering all this. But it's still data-driven, and Hadoop is extending into data. You're seeing that natural extension, and Cloudera has filed its S-1 to go public. So it brings back the conversations about this open source community that's been doing all this work in the big data industry, originally riding in on the horse of Hadoop. You guys have an update to your Hadoop data platform, but we'll get to that in a second. I want to ask you about the many stories around Hadoop. I say that Hadoop was the first horse that everyone rode in on in the big data industry. When I say big data, I mean DevOps, cloud, the whole open source ethos. It's evolving, but it's not being replaced. So I want you to clarify your position on this, because we were just talking about some of the false premises, a lot of stories being written about the demise of Hadoop. Long live Hadoop.
Yeah, well, how long do we have?
You know, I think you hit it first. We're here at DataWorks Summit 2017 and we rebranded it; it was previously Hadoop Summit, right? And we rebranded it really to recognize that there's this bigger thing going on, and it's not just Hadoop. Hadoop is a big contributor, a big driver, a very important part of the ecosystem, but it's more than that. It's really about being able to manage and deliver analytic content on all data across that data's lifecycle: from when it gets created at the edge, to when it's moving through networks, to when it's landed and stored in a cluster, to when analytics run and decisions go back out. It's that entire lifecycle. And you mentioned some of the megatrends that I talked about this morning in the opening keynote, right? AI and streaming and IoT, all of these things converging, are creating a much larger problem set and, frankly, opportunity for us as an industry to go solve. And so that's the context that we're really looking at.
And there's real demand there. I mean, certainly there's a hype factor on AI, but IoT is real. You have data now as not just a back-office concept; it's front-facing and business-centric. I mean, there's real customer demand here.
There's real customer demand, and it really creates the ability to dramatically change a business. A simple example that I used on stage this morning: think about the electric utility business, right? I live in Southern California. 25 years ago (and by the way, I studied to be an electrical engineer 20, 30 years ago), that business, while not entirely simple, was about building a big power plant and distributing electrons out to all the consumers of electrons. It was one direction, and the optimization of that grid and network and that business was very hard, and there were billions of dollars at stake. Fast forward to today, right?
Now you've still got those generating plants online, but you've also got folks like me generating their own power and putting it back into the grid. So now you've got bi-directional electrons, and the optimization is totally different. And then how do you figure out how to most effectively create capacity and distribute that capacity? Because capacity that is created and not consumed is 100% spoiled. So it's a huge data problem, and it's a huge data problem meaning IoT, right? Devices, smart meters, devices out at the edge, creating data, doing it in real time. A cloud blew over, the generating capacity on my roof went down, so I've got to pull from the grid. Combining all of that data to make real-time decisions, we're talking hundreds of billions of dollars, and it's being done today. And it's in an industry that's not a high-tech Silicon Valley kind of industry. Electric utilities are taking advantage of this technology today.
So we were talking off-camera about some commentary that Hadoop has failed, and obviously you take exception to that. You also made the point that it's not just about Hadoop, but in a way it is, because Hadoop is the catalyst of all this open source. Why has Hadoop not failed, in your view?
Well, because we have customers. And you know, the great thing about conferences like this is we're actually able to get a lot of folks to come in and talk about what they're doing with the technology, how they're driving business benefit, and share that business benefit with their colleagues. So we see that business benefit coming along. In any hype cycle, people can get down a path where maybe they had false expectations, right? Early on, six years ago, 10 years ago, people were talking about, hey, this open source Hadoop's going to come along and replace the EDW. Complete fallacy, right?
What I talked about in that opportunity, being able to store all kinds of disparate data, being able to manage and maneuver analytics in real time, that's the value proposition, and it's very different from some of the legacy tech. So if you view it as, hey, this thing's going to replace that thing, okay, maybe not. But the point is, it's very successful for what it's designed to do.
Well, just to clarify what you just said there: you guys never took that position, but Cloudera did with Impala, at least initially. Dave, you don't agree with that?
Publicly, they would say, oh, it's not a replacement. But you're right, the actions were maybe designed to suggest to the marketplace that that might have been one of the outcomes.
But they pivoted quickly when they realized that was a failed strategy. But that became a premise that people locked in on.
If that becomes your yardstick for measuring.
So, but wouldn't you agree that Hadoop, in many respects, was designed to solve some of the problems that the EDW never could?
Exactly. So, again, when you think about the variety of data, when you think about the analytic content, doing time series analyses, that's very hard to do in a relational model. So it's a new tool in the workbench to go solve analytic problems. And when you look at it from that perspective, and I use the utility example, the manufacturing example, financial, consumer finance, telco, all of these companies are using this technology, leveraging this technology, to solve problems they couldn't solve before and, frankly, to build new businesses that they couldn't build before, because they didn't have access to that real-time streaming data.
And so money did shift from pouring money into the EDW with limited returns, because you were at the flat part of the S-curve, to, hey, let's put it over here, in this so-called big data thing.
And that's why the market, I think, was conditioned to come to that simple conclusion. But the spending did shift, did it not?
Yeah, I mean, if you subscribe to that herd mentality. The net increase, the net new expenditure, in the new technology is always going to outpace the growth of the existing, kind of plateaued technology. That's just math.
The growth, yes, but not the size, not the absolute dollars.
And so you have a lot of companies right now struggling in the traditional legacy space, and you've got this rocket ship going in big data. And again, if you think about the converging forces that are out there, in addition to IoT and streaming, frankly, Hadoop is an enabler of AI. When you think about the success of AI and machine learning, it's about having massive, massive, massive amounts of data, right? I think back 25 years ago, my first data mart was 30 gigabytes, and we thought that was all the data in the world, right? Now it fits on your phone. So when you think about just having the utter capacity and the ability to actually process that capacity of data, these are technology breakthroughs that have been driven in the core open source and Hadoop community. Combine that with the ability to execute in the cloud with ephemeral kinds of workloads, and now, instead of going to a capital committee for $20 million for a bunch of hardware to do an exabyte kind of study where you may not get an answer that means anything, you can spin that up in the cloud and for a couple of thousand dollars get the answer, then take that answer and go build a new system of insight that's going to drive your business. This is a whole new area of opportunity driven by the convergence of all that tech.
I mean, it's absurd to say Hadoop and big data have failed. I mean, it's crazy. Okay, but despite the growth, I call it profitless prosperity. Can the industry fund itself?
I mean, you've got to make big bets: YARN, Tez, different clouds. How does the industry turn into one that is profitable and growing?
Well, I mean, obviously it creates new business models and new ways of monetizing software and deploying software. One of the key things that is core to our belief system is that really leveraging and working with and nurturing the community is going to be a key success factor for our business, right? Nurturing that innovation and collaboration across the community to keep up with the pace of change is one of the aspects of being relevant as a business. And then obviously creating a great service experience for our customers, so that they know they can depend on enterprise-class support, enterprise-class security and governance, and operational management in the cloud and on-prem. Creating that value proposition, along with the advanced and accelerated delivery of innovation, is where I think we intersect uniquely in the industry.
And one of the things that I think people point out, and I have this conversation all the time with people who try to squint through the Wall Street implications of the value proposition of the industry, and this is something I want to get your thoughts on, is that open source in this era that we're living in today is creating so much value outside of just Hortonworks, your company. Dave made a comment in the intro package we were doing that the practitioners are getting a lot of value, people out in the field. So these are the white spaces of value, and they're actually transformative. Can you give some examples where things are getting done that are of real value, use cases that you guys can highlight? Because I think that's the unwritten story that no one's talking about: there's a rising tide floating all boats.
Yeah, there is.
What are some of those use cases, the white spaces?
Yeah, some of those use cases, again, really involve integrating legacy, traditional transactional information, right? Very valuable information about a company, its operations, its customers, its products, and all those kinds of things. But it's being able to combine that with the ability to do real-time sensor management, and ultimately have a technology stack that enables the connection of all of those sources of data for an analytic. And that's an important differentiation. For the first 25 years of my career, it was all about, let's pull all this data into one place, then let's do something with it, and then we can push analytics back. Not an entirely bad model, but a model that breaks in the world of IoT and connected devices. There just, frankly, isn't enough money to spend on bandwidth to make that happen. And as fast as the speed of light is, it creates latency, so those decisions aren't going to be made in time. So we're seeing it even in traditional industries. I mentioned the utility business; think about manufacturing, oil and gas, right? Sensors everywhere. Being able to take advantage, not by collecting all the sensor data, but by actually creating analytics based on sensor data and pushing those analytics out to the sensors to make real-time decisions that can affect hundreds of millions of dollars of production or equipment: those are the use cases that we're seeing deployed today, and that's complete white space that was unavailable before, right?
And customer demand too. I mean, Dave and I were also debating about this not being a new trend. This is just big data happening. The customers are demanding production workloads. So you're seeing a lot more forcing functions driven by the customer. And you guys have some news I want to get to and get your thoughts on: HDP, the Hortonworks Data Platform, 2.6. What's the key news there? Obviously real time, you've been talking about real time.
Yeah, it's about real time, flexibility, and choice, right? You know, motherhood and apple pie.
And the major highlights of that upgrade?
So the upgrades are really inside of Hive. We now have operational analytic query capabilities where we can do tactical response time, second and sub-second kind of response time. Hadoop and Hive weren't previously known for that kind of tactical response. We've now been able to add, inside of that technology, the ability to do that workload. And we have customers building these white space applications who have hundreds or thousands of users or applications that depend on consistent, very quick analytic response time. We now deliver that inside the platform. What's really cool about it, in addition to the fact that it works, is that we did it inside of Hive. So we didn't create yet another project or yet another thing that a customer has to integrate with or rewrite their application for. Any Hive-based application can now take advantage of this performance enhancement, and that's part of our thinking of it as a platform. The second thing we've done inside of that, which really accrues to those kinds of workloads, is we've really enhanced the ability to do incremental data acquisition, right? Whether it be streaming, whether it be batch upserts, right, on the SQL side, doing upserts. Being able to do that data maintenance in an ACID-compliant fashion, completely automatically and behind the scenes, so that those applications, again, can just run without any heavy lifting behind them.
It's just a data-in-motion kind of thing going on, right?
It's anywhere from data in motion to batch, to mini-batch, and anywhere in between. But when you're doing those incremental data loads, it's easy to get the same file twice by mistake. You don't want to double count. You want to have sanctity of the transactions. We now handle that inside of Hive with ACID compliance.
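To make the incremental-load point concrete, here is a minimal sketch of an idempotent upsert, where receiving the same file twice does not change the result. It is an illustration only, not how Hive implements this: it uses Python's built-in sqlite3 module as a stand-in for a transactional table, and the table names, file names, and the load_file helper are all hypothetical.

```python
import sqlite3

def make_store():
    """Create an in-memory database standing in for a transactional table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE readings (meter_id TEXT PRIMARY KEY, kwh REAL)")
    # Ledger of files already applied, so a re-delivered file can be detected.
    conn.execute("CREATE TABLE loaded_files (name TEXT PRIMARY KEY)")
    return conn

def load_file(conn, file_name, rows):
    """Apply one incremental file atomically.

    Returns False (and changes nothing) if the file was already loaded.
    """
    with conn:  # single transaction: commit on success, roll back on error
        seen = conn.execute(
            "SELECT 1 FROM loaded_files WHERE name = ?", (file_name,)
        ).fetchone()
        if seen:
            return False
        conn.execute("INSERT INTO loaded_files (name) VALUES (?)", (file_name,))
        # Upsert: insert new meters, replace the value for existing ones.
        conn.executemany(
            "INSERT OR REPLACE INTO readings (meter_id, kwh) VALUES (?, ?)", rows
        )
    return True

def total_kwh(conn):
    return conn.execute("SELECT COALESCE(SUM(kwh), 0) FROM readings").fetchone()[0]

conn = make_store()
load_file(conn, "batch-001.csv", [("m1", 10.0), ("m2", 5.0)])
load_file(conn, "batch-002.csv", [("m1", 12.0)])               # upsert: m1 updated
load_file(conn, "batch-001.csv", [("m1", 10.0), ("m2", 5.0)])  # duplicate: ignored
print(total_kwh(conn))
```

Because the ledger entry and the row changes commit together, a re-delivered file can neither double count nor roll a meter back to a stale value.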
So, layperson question for the CTO, if I may. You mentioned Hadoop was not known for real-time response. You just mentioned ACID. It was never, in the early days, known for ACID compliance. Others would say Hadoop, the original big data platform, is not designed for the matrix math of AI, for example. Are these misconceptions? It's like Tim Berners-Lee, when we met Tim Berners-Lee, on Web 2.0: this is what the web was designed for. Would you say the same thing about Hadoop and big data?
Yeah, I mean, ultimately, from my perspective, and netting it out, Hadoop was designed for the easy acquisition of data, the easy onboarding of data. And then, once you've onboarded that data, it was also known for enabling new kinds of analytics that could be plugged in. Certainly, starting out, MapReduce and HDFS were the core. But the whole idea is that I now have a flexible way to easily acquire data in its native form, without having to apply a schema, without having to impose any format in the store. I can get it exactly as it was and store it, and then I can apply whatever schema, whatever rules, whatever analytics I want on top of that. So the center of gravity, to my mind, has really moved up to YARN, which enables a multi-tenancy approach with pluggable file formats and pluggable analytics and data access methods, whether it be SQL, whether it be machine learning, whether it be HBase for lookup and indexing, and anywhere in between. It's that Swiss Army knife, as it were, for handling all of this new stuff that is changing; every second we sit here, data has changed.
And just a quick follow-up, if I can, for clarification. So you said new types of analytics can be plugged in by design because of its openness, is that right?
By design, because of its openness and the flexibility that the platform was built for.
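The idea of storing data exactly as it arrived and applying a schema only when reading can be sketched in a few lines. This is a toy illustration under stated assumptions, not Hadoop code: RAW_STORE stands in for files landed untouched in a cluster, and the land and read_with helpers, the field names, and both schemas are hypothetical.

```python
import json

RAW_STORE = []  # stands in for raw files landed in the cluster, unmodified

def land(raw_line):
    """Store the record exactly as received; no schema or validation applied."""
    RAW_STORE.append(raw_line)

def read_with(schema):
    """Apply a schema at read time.

    schema maps an output field name to (source key, converter function).
    Source keys missing from a record are simply skipped for that record.
    """
    for line in RAW_STORE:
        record = json.loads(line)
        yield {
            field: convert(record[key])
            for field, (key, convert) in schema.items()
            if key in record
        }

# Two differently shaped records land side by side.
land('{"meter": "m1", "kwh": "10.5", "ts": "2017-04-05T10:00:00"}')
land('{"meter": "m2", "kwh": "3", "firmware": "1.2"}')

# Two consumers read the same raw data through different schemas.
billing_schema = {"meter_id": ("meter", str), "kwh": ("kwh", float)}
ops_schema = {"meter_id": ("meter", str), "firmware": ("firmware", str)}

print(list(read_with(billing_schema)))
print(list(read_with(ops_schema)))
```

Neither consumer had to agree on a format before the data was stored, which is the flexibility described above; the trade-off is that type conversion errors surface at read time instead of at load time.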
In addition to all of the performance work, we've also got a new update to Spark, and usability, consumability, and collaboration for data scientists using the latest versions of Spark inside the platform. We've got a whole lot of other features and functions that our customers have asked for. And then on the flexibility and choice, it's available as public cloud infrastructure as a service, public cloud platform as a service, on-prem x86, and net new on-prem with POWER8.
I've got a final question for you. As the industry evolves, what are some of the key areas that open source can pivot to that really take advantage of the machine learning and AI trend that's going on? Because you start to see that really increase the narrative around the importance of data. And a lot of people are scratching their heads going, okay, I need to do the back-office things, I need to set up my IT, I need to have all this great stuff, all these open source projects, the whole Hadoop data platform. But then I've got to get down and dirty. I might have multiple clouds, with hybrid cloud going on. I might want to leverage some of the cool new containers and Kubernetes and microservices and DevOps. Where's that transition happening? As a CTO, how do you see that, and how do you talk to customers about this transition, this evolution of how the data business is getting more and more mainstream?
Yeah, I think the big thing that people have had to get over is that we've reversed polarity from, again, 30 years of: I want a stack vendor to have an integrated stack of everything, I plug and play. It's integrated end to end. It might not be 100% what I want, but with the cost leverage that I get out of the stack versus what I'd have to go do myself, that's perfect. In this world, it's the opposite. It's about enabling the ecosystem.
And that's where having, and by the way, it's a combination of open source and proprietary software. Some of our partners have proprietary software, and that's okay. But it's really about enabling the ecosystem. And I think the biggest service that we, as an open source community, can do is to continue to keep that standard kernel for the platform and make it very usable and very easy for mini apps and software providers and other folks to plug into our testing. It's a thousand-flowers-bloom kind of concept.
And that's what you're talking about with the white spaces, as use cases are evolving very rapidly. And then the bigger apps are kind of settling into the workload with real time.
You know, think about the next generation of IT professional, the next generation of business professional. They grew up with iPhones and Android phones. They grew up in a mini-app world, where, I mean, I download an app, I'm going to try it, it's a widget, boom, and it's going to help me get something done. But it's not a big stack that I'm going to spend 30 years to implement.
Yeah.
And then I want to take those widgets and connect them together to do things that I haven't been able to do before. And that's how this ecosystem is really.
Yeah, very DevOps culture, very agile. That's their mindset. Well, Scott, congratulations on your 2.6 upgrade. Great stuff. ACID compliance is a really big deal. Again, these compliance things, little things, are important in the enterprise, right?
Absolutely.
All right, thanks for coming on theCUBE. This is DataWorks Summit in Munich, Germany. I'm John Furrier, with Dave Vellante. Thanks for watching. More coverage live here in Germany after this short break.