Live from San Jose, in the heart of Silicon Valley, it's theCUBE, covering DataWorks Summit 2018. Brought to you by Hortonworks.

Welcome back to theCUBE's live coverage of DataWorks here in San Jose, California. I'm your host, Rebecca Knight, along with my co-host, James Kobielus. We're joined by Aaron Murthy. Arun Murthy, sorry, he is the co-founder and Chief Product Officer of Hortonworks. Thank you so much for returning to theCUBE. It's great to have you on.

It's been fun getting back, yeah.

So you were on the main stage this morning in the keynote, and you were describing the journey, the data journey, that so many customers are on right now. You were talking about the cloud, saying that the cloud is part of the strategy, but it really needs to fit into the overall business strategy. Can you describe a little bit how you approach that?

Absolutely. The way we look at this is we help customers leverage data to actually deliver better capabilities, better services, better experiences to their customers. And that's the business, right? Now, with that, we obviously look at cloud as a really key part of the overall strategy, in terms of how you want to manage data on-prem and in the cloud. You know, we kind of joke that we're living in a world of real-time data. Data is everywhere: you might have trucks on the road, you might have drones, you might have sensors, and you'll have them all over the world, right? At that point, we've gotten to a point where enterprises understand that they'll manage some of the infrastructure, but in a lot of cases it'll make a lot more sense to actually lease some of it, and that's the cloud, right? It's the same way that if you're delivering packages, you don't go buy planes; you go to FedEx and, you know, let them handle that for you. That's kind of what the cloud is. So that is why we really fundamentally believe that we have to help customers leverage infrastructure whenever it makes sense, pragmatically, both from an architecture standpoint and from a financial standpoint. And that's why we talked about how your cloud strategy is part of your data strategy, which is actually fundamentally part of your business strategy.

So how are you helping customers to leverage this? What is on their minds, and what's your response?

Yeah, it's really interesting. Like I said, cloud and infrastructure management is certainly at the top of mind for a CIO today, right? And what we've consistently heard is they need a way to manage all of this data and all of this infrastructure in a hybrid, multi-cloud fashion, right? Because in some geos, you might not have your favorite cloud vendor; go to parts of Asia, as a great example, and you might have to use one of the Chinese clouds. You go to parts of Europe and, especially with things like GDPR and the data residency laws, you have to be very, very cognizant of where your data gets stored and where the infrastructure is present. And that is why we fundamentally believe it's really important to give an enterprise a fabric with which they can manage all of this, and hide the details of the underlying infrastructure from them as much as possible.

And that's DataPlane Services.

Exactly.
Hortonworks DataPlane Services, which we launched in October of last year; actually, I was on theCUBE talking about it back then, too. We see a lot of interest and a lot of excitement around it, because now they understand that, again, this doesn't mean we actually drive everything down to the least common denominator. It is about helping enterprises leverage the key differentiators that each of the cloud vendors will provide. So for example, Google, which we announced a partnership with, they're really strong on AI and ML, right? So if you're running TensorFlow and you want to deal with things like Kubernetes, GKE is a great place to do it. And, for example, you can now go to Google Cloud and get TPUs, which work great for TensorFlow. Similarly, a lot of customers run on Amazon for a bunch of the operational stuff; Redshift is an example. So the world we live in is that we want to help the CIO leverage the best pieces of the cloud, but then give them a consistent way to manage and govern that data, right? We were joking on stage that IT has just about learned how to deal with Kerberos and Hadoop, right? And now we're telling them, oh, go figure out IAM on Google, which is also called IAM on Amazon, but the two are completely different. The only thing that's consistent is the name, right? So I think we have a unique opportunity, especially with the open source technologies like Atlas, Ranger, Knox, and so on, to draw a consistent fabric of security and governance over all of this and help the enterprise leverage the best parts of the cloud to put together a best-fit architecture, right? Which also happens to be a best-of-breed architecture.

So the fabric — everything you're describing, all of the Apache open source projects in which Hortonworks is a primary committer and contributor — is able to manage schemas and policies and metadata and so forth across this distributed, heterogeneous fabric of public and private cloud segments.

Exactly.

It's increasingly being containerized, in terms of the applications, for deployment to edge nodes. Containerization is a big theme in HDP 3.0, which you announced at this show. So Arun, can you give us a quick sense of how that containerization capability plays into more of an edge focus for what your customers are doing? And again, the core of the fabric is obviously the open source projects, but you've also done a lot of net-new innovation with DataPlane, which, by the way, is also open source. It's a new product and a new platform that you can actually leverage, layered over the open source ones you're familiar with.

Exactly, great point. And again, like you said, containerization is what is actually driving the fundamentals of this. The details matter at the scale at which we operate. We're talking about thousands of nodes, petabytes of data; the details really matter, because a 5% improvement at that scale leads to millions of dollars in optimization for CapEx and OpEx. So all of those details are being fueled and driven by the community, which is what we deliver with HDP 3.0. And some of the key ones, like you said, are containerization, because now we can actually get complete agility in terms of how you deploy the applications. You get isolation not only at the resource management level with containers, but you also get it at the software level, right? Which means if two data scientists want to use different versions of Python or Scala or Spark or whatever it is, they get that consistently and holistically, and they can go from the test/dev cycle into production in a completely consistent manner, right?
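To make that isolation point concrete, here is a hedged sketch of submitting a Dockerized workload through the YARN Services REST API that ships with Hadoop 3.1, the YARN in HDP 3.0. The spec fields follow the published YARN service specification, but the ResourceManager host, Docker image, user, and resource sizes are placeholders, and a real cluster needs the Docker runtime enabled (Kerberized clusters authenticate differently):

```python
# Sketch only: submit a containerized app via the YARN Services REST API
# (Hadoop 3.1 / HDP 3.0). Host, image, and user below are assumptions.
import requests

service_spec = {
    "name": "team-a-python36",
    "version": "1.0",
    "components": [{
        "name": "worker",
        "number_of_containers": 1,
        # Each team pins its own Docker image, so two data scientists can
        # run different Python/Spark versions side by side on one cluster.
        "artifact": {"id": "library/python:3.6", "type": "DOCKER"},
        "launch_command": "python3 -c 'import sys; print(sys.version)'",
        "resource": {"cpus": 1, "memory": "1024"},  # memory in MB
    }],
}

resp = requests.post(
    "http://resourcemanager.example.com:8088/app/v1/services",  # placeholder RM
    json=service_spec,
    params={"user.name": "datasci"},  # simple-auth clusters; Kerberos differs
)
resp.raise_for_status()
print(resp.json())
```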
So that's why containers are so big: now we can actually leverage them across the stack, and you have things like MiNiFi showing up, right? We can actually—

Define MiNiFi before you go further. What is MiNiFi, for our listeners?

Yeah, so we've always had NiFi, right? Which is real-time data flow management. NiFi still sort of lives within your data center. What MiNiFi is, is a really small, thin library, if you will, that you can throw on a phone, a doorbell, a sensor, and it gives you all the capabilities of NiFi, but at the edge, right? And it's actually not just data flow. What is really cool about NiFi is the command and control, right? You can do bidirectional command and control, so you can actually change, in real time, the flows you want, the processing you do, and so on. So what we're trying to do with MiNiFi is not just collect data from the edge, but also push the processing as much as possible to the edge. Because we really do believe a lot more processing is going to happen at the edge, right? Especially with the ASICs and so on coming out, there will be custom hardware that you can throw in, and MiNiFi can leverage that hardware at the edge to do this processing. And we want to do that even at the cost of the data never landing at rest. Because at the end of the day, we're in the insights business, not in the data storage business.
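MiNiFi agents are actually configured with flow definitions designed in NiFi rather than hand-written code, but the edge pattern Murthy describes — commands flowing down, only insights flowing up — can be sketched in a few lines of illustrative Python. Everything here (the config fields, the stub functions) is hypothetical scaffolding, not the MiNiFi API:

```python
# Illustrative pseudo-agent, NOT the MiNiFi API: shows bidirectional command
# and control with processing pushed to the edge so raw data never leaves.
import random

def fetch_flow_update(config):
    """Stand-in for the agent receiving an updated flow from the controller;
    here we occasionally tighten the alert threshold to simulate a command."""
    if random.random() < 0.2:
        return {**config, "alert_threshold": config["alert_threshold"] - 1.0}
    return config

def read_sensor():
    """Simulated sensor reading, e.g. a temperature in Fahrenheit."""
    return random.gauss(75.0, 5.0)

def ship_insight(insight):
    """Stand-in for forwarding a derived event upstream to NiFi."""
    print("-> shipping insight:", insight)

config = {"alert_threshold": 80.0, "window": 5}
for _ in range(20):                        # a real agent would loop forever
    config = fetch_flow_update(config)     # commands flow DOWN to the edge
    window = [read_sensor() for _ in range(config["window"])]
    avg = sum(window) / len(window)        # process locally at the edge...
    if avg > config["alert_threshold"]:
        ship_insight({"avg": round(avg, 1),
                      "threshold": config["alert_threshold"]})
    # ...and only the insight flows UP; raw readings never land at rest
```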
Well, I want to get back to that. You were talking about innovation and how so much of it is driven by the open source community, and you're a veteran of the big data open source community. How do we maintain that? How does that continue to be the fuel?

Yeah, and a lot of it starts with just being consistent. From day one — James was around back when we started in 2011 — we've always said we're going to be open source, right? Because we fundamentally believe that the community is going to out-innovate any one vendor, regardless of how much money they have in the bank. So we do believe that's really the best way to innovate, mostly because there's a sense of shared ownership of the product. It's not just one vendor throwing some code out there and trying to shove it down the customer's throat. And we've seen this over and over again. Three years ago, a lot of the data plane stuff we talked about — Atlas and Ranger and so on — none of it existed. These came from the fruits of collaboration with the community, with some very large enterprises being part of it. So it's a great example of how we continue to drive it, because we fundamentally believe that's the best way to innovate, and we continue to believe so.

Right, and the Apache community as a whole has so many different projects. In streaming, for example, there's Kafka and there are others that address a core set of common requirements, but in different ways — supporting different approaches to doing streaming, with stateful transactions and so forth, or stateless semantics and so forth. It seems to me that Hortonworks is shifting towards being more of a streaming-oriented vendor, away from data at rest. Though I should say HDP 3.0 has great scalability and storage efficiency capabilities baked in. I wonder if you could break down a little bit what the innovations or enhancements are in HDP 3.0 for those of your core customers — which is most of them — who are managing massive multi-terabyte, multi-petabyte, distributed, federated big data lakes. What's in HDP 3.0 for them?

Oh, lots. Again, like I said, we obviously spend a lot of time on the streaming side, because that's where we see things going — we live in a real-time world — but we don't do it at the cost of our core business, which continues to be HDP. And as you can see, the community continues to drive it. We talked about containerization, a massive step up for the HDP community. We've also added support for GPUs. Again, think about true at-scale machine learning.

Graphics processing units. AI, deep learning.

Yeah, for deep learning, TensorFlow, and so on, you really need that sort of custom hardware, GPUs if you will. So that's coming; that's in HDP 3.0. We've added a whole bunch of scalability improvements to HDFS. We've added federation, and now you can go over a billion files, a billion objects, in HDFS. We also added capabilities for—

But you indicated yesterday, when we were talking, that very few of your customers need that capacity yet — but you think they will.

Oh, for sure. Again, part of this is that as we enable more sources of data in real time, that's the fuel which drives the need for scale. And that was always the strategy behind the HDF product, right? It was about whether we can leverage the synergies between the real-time world and feed that into what you do today in your classic enterprise with data at rest. And that's what's driving the necessity for scale. We've done that. We've also spent a lot of work on lowering the total cost of ownership, the TCO, so we added erasure coding.

What is that, exactly?

Yeah, so erasure coding is a classic storage concept. HDFS has always kept three replicas, right? For redundancy, fault tolerance, and recovery. Now, keeping three replicas sounds okay because disk is cheap, right? But when you start to think about our customers running 70, 80, 100 petabytes of data, those three replicas add up: you've gone from 80 petabytes of effective data to a quarter of an exabyte in terms of raw storage. So with erasure coding, instead of storing three full copies of each block, we store parity — an encoding of the data — which means we can go down from 3x to 2x, 1.5x, whatever you want to do, right? And if you can get from 3x to 1.5x, especially for your cold data — the data you're not accessing every day — it results in massive savings in infrastructure cost. And that's kind of what we're in the business of doing: helping customers do better with the data they have without having to spend more, whether it's on-prem or in the cloud. We want to help customers be comfortable getting more data under management, along with security and governance, at a lower TCO.
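The replication arithmetic above is easy to check. A short back-of-the-envelope calculation, using HDFS's classic three replicas versus a Reed-Solomon 6+3 erasure coding layout (the RS-6-3 policy that ships with Hadoop 3):

```python
# Raw-storage overhead: 3x replication vs. Reed-Solomon 6+3 erasure coding.
data_pb = 80                                   # effective data, in petabytes

replicated_raw = data_pb * 3                   # three full copies per block
# RS-6-3: every 6 data blocks get 3 parity blocks => 9/6 = 1.5x overhead,
# yet any 3 of the 9 blocks can be lost and the data is still recoverable.
ec_raw = data_pb * (6 + 3) / 6

print(f"3x replication : {replicated_raw:.0f} PB raw")  # 240 PB (~quarter EB)
print(f"RS-6-3 encoding: {ec_raw:.0f} PB raw")          # 120 PB
print(f"savings        : {replicated_raw - ec_raw:.0f} PB of disk")
```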
The other big piece I'm really excited about in HDP 3.0 is all the work that's happened in the Hive community for what we call the real-time database. As you guys know — you've followed the whole SQL wars in the Hadoop space — Hive has changed a lot in the last several years. It's very different from what it was five years ago. The only thing that's the same from five years ago is the name, right? The community has done a phenomenal job taking what we used to call a SQL engine on HDFS and driving it, with Hive 3, which is part of HDP 3.0, to a full-fledged database. It's got full ACID support. In fact, the ACID support is so good that writing ACID tables is now at least as fast as writing non-ACID tables, right? And you can do that not only on—

It's a transactional database.

Exactly. Not only can you do it on-prem, you can do it on S3, so you can actually drive transactions through Hive on S3. We've done a lot of work — you were there yesterday when we were talking about some of the performance work we've done with LLAP and so on — to give consistent performance both on-prem and in the cloud. And that took a lot of effort, simply because the performance characteristics you get from the storage layer with HDFS versus S3 are significantly different, right? So we've been able to bridge those with things like LLAP. We've also done a lot of work to enhance the security model around it — governance and security. So now you get things like column-level masking, row-level filtering, all the standard stuff that you'd expect, and more, from an enterprise warehouse. We talk to a lot of our customers who maintain literally tens of thousands of views because they didn't have the capabilities that exist in Hive now. And I'm sitting here kind of amazed that an open source set of tools now has the best security and governance; it's pretty amazing, coming from where we started.

And it's absolutely essential for GDPR compliance, and for compliance with HIPAA and every other mandate and sensitivity that requires you to protect personally identifiable information. So, very important. In many ways, Hortonworks has one of the premier big data catalogs for all manner of compliance requirements that your customers are facing.

Yeah, and James, you saw that in the context of Data Steward Studio, which we introduced.

Yes. Things like consent management, right? Having...

A consent portal.

A consent portal, where customers indicate the degree to which they require controls over the management of their PII — possibly to be forgotten, and so forth.

Yeah, it's the right to be forgotten, and it's consent even for analytics. Within the context of GDPR, you have to allow the customer to opt out of analytics — of being part of an analytic itself, right? So things like those are now something we enable through the enhanced security models we've built in Ranger. The really cool part about what we've done now for GDPR is that you can get all of these capabilities on existing data, with Hive and existing applications, by just adding a security policy — not rewriting your SQL query, right?
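As a hedged sketch of what "just adding a security policy" can look like: Ranger exposes a public REST API for creating policies, and a row-level filter policy can append a consent predicate to every Hive query that touches a table, with no application SQL changes. The JSON shape below follows Ranger's public v2 API, but the host, credentials, service name, table, and consent column are all assumptions to adapt and verify against your Ranger version:

```python
# Sketch: a Ranger row-filter policy that enforces analytics consent on an
# existing Hive table. Host, service, table, and column names are placeholders.
import requests

policy = {
    "service": "cluster_hive",            # name of the Hive service in Ranger
    "name": "gdpr-analytics-consent",
    "policyType": 2,                      # 2 = row-level filter policy
    "resources": {
        "database": {"values": ["customer_db"]},
        "table": {"values": ["customers"]},
    },
    "rowFilterPolicyItems": [{
        "users": ["analyst"],
        "accesses": [{"type": "select", "isAllowed": True}],
        # Hypothetical consent flag: Hive transparently appends this
        # predicate, so opted-out customers never appear in results.
        "rowFilterInfo": {"filterExpr": "analytics_consent = true"},
    }],
}

resp = requests.post(
    "http://ranger.example.com:6080/service/public/v2/api/policy",
    json=policy,
    auth=("admin", "admin"),              # placeholder credentials
)
resp.raise_for_status()
print("created policy id:", resp.json().get("id"))
```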
It's a massive, massive deal, and I cannot tell you how excited customers are about it, because they now understand — they were sort of freaking out that they'd have to go to 30, 40, 50 thousand enterprise apps and change them to actually provide consent and the right to be forgotten. The fact that we can do that now by changing a security policy in Ranger is huge for them.

Yeah. Arun, thank you so much for coming on theCUBE. It's always fun talking to you.

Likewise. Thank you so much.

I learn something every time I listen to you.

Indeed, indeed.

I'm Rebecca Knight, for James Kobielus. We will have more from theCUBE's live coverage of DataWorks just after this.