Live from Boston, Massachusetts, it's theCUBE, covering Spark Summit East 2017, brought to you by Databricks. Now, here are your hosts, Dave Vellante and George Gilbert.

Welcome back to snowy Boston, everybody. This is theCUBE, the leader in live tech coverage. Arun Murthy is here, he's the founder and vice president of engineering at Hortonworks. Father of YARN — can I call you that? Godfather of YARN, is that fair? Anyway, he's so modest. Welcome back to theCUBE, it's great to see you.

It's a pleasure to have you.

Coming off the big keynote — you ended the session this morning, so that was great. Glad you made it in to Boston. And a lot of talk about security and governance, and you know, we've been talking about that for years, and it feels like it's really starting to come into the mainstream, Arun.

Well, I think it's just a reflection of what customers are doing with the tech now. You know, three, four years ago, a lot of it was pilots, a lot of it was people playing with the tech. Increasingly it's about people actually deploying stuff in production — having Hadoop be the system of record, running workloads both on-prem and in the cloud. Cloud is certainly becoming more and more real at mainstream enterprises. So a lot of it means, if you take any of the examples today, an interesting app will have some sort of real-time data feed that's probably coming from a cell phone or a sensor somewhere, which means that data, in most cases, is actually not coming on-prem. It's getting collected in a local cloud somewhere. It's just more cost-effective. Why would you put up 25 data centers if you don't have to, right?
So then you've got to connect that data up with, you know, transactional data you have, or customer data, or data you might have purchased, and then join them up, run some interesting analytics — geo-based stuff, real-time threat detection, cybersecurity. A lot of it means that you need a common way to secure data and govern it, and that's where we see the action. I think it's a really good sign for the market and for the Hadoop community that people are pushing Hadoop, the broader Hadoop ecosystem, on these dimensions — because it means that people are actually using it for real production workloads.

Well, in the early days of Hadoop, we really didn't talk that much about cloud, you know? And now it's like, cloud, it's everywhere. And of course the whole hybrid cloud thing comes into play. What are you seeing there? What are the things that you can do in a hybrid, you know, or on-prem, that you can't do in a public cloud? And what does the dynamic look like?

Well, it's definitely not an either-or, right? What you're seeing is, increasingly, interesting apps need data which is born in the cloud and will stay in the cloud, but they also need transactional data which stays on-prem. You might have an EDW, for example, right? People want to solve business problems, not just move data from one place to another, or tech from one place to another. So it's not interesting to move your EDW to the cloud. It's similarly not interesting to bring your IoT data or sensor data back on-prem, right? It just makes sense. So naturally what happens is, at Hortonworks we talk of a concept of a modern data app, which means a modern data app has to encompass both on-prem data and cloud data.
Yeah, you talked about that in your keynote. I remember years ago, Furrier said that the data is the new development kit. And now you're seeing the apps are just so data-rich, and they have to span physical locations. But then this whole thing of IoT comes up. We've been having a conversation on theCUBE, over the last several shows — okay, how much stays out? How much comes in? There's a lot of debate about that. There's reasons not to bring it in. But you talked today about how some of the important stuff will come back, and...

Yeah, so the way we see this is, there's always going to be a lot of data which is born in the cloud and stays there — the IoT data, for instance. But then what will happen increasingly is that key summaries of the data will move back and forth. So key summaries of your EDW will move into the cloud. Sometimes key summaries of your IoT data — say you want to do some sort of historical training and analytics — will come back on-prem. So I think there's bi-directional data movement, but it just won't be all the data, right? It'll be the key, interesting summaries of the data, but not all of it.

And a lot of times people say, well, it doesn't matter where it lives. You know, cloud should be an operating model, not a place where you put data and applications. And while that's true — we would agree with that — from a customer standpoint it matters, in terms of performance and latency issues and cost and regulation.

And security and governance, absolutely. You've got to think those things through.

Exactly. So that's what we're focused on: to make sure that you have a common security and governance model regardless of where data is. So you can think of it as infrastructure you own and infrastructure you lease.

Right, right.

Now, the details matter, of course.
You know, when you go to the cloud, you'll use S3, for example, or ADLS from Microsoft, right? But you've got to make sure that there's a common sort of security and governance layer in front of it. So as an example, in the open source community, Ranger is really the key project right now from a security authorization and authentication standpoint. We've done a lot of work with our friends at Microsoft to make sure you can actually now manage data in WASB, which is their object store, the equivalent of S3, natively with Ranger. So you can set a policy that says only Dave can access these files, or George can access these columns. That sort of stuff is natively done on the Microsoft platform, thanks to the relationship we have with them. So it's actually really interesting for the open source communities.

So you've talked about sort of the commodity storage at the bottom layer, and even if there are different sorts of interfaces and implementations, it's still commodity storage. And now what's really helpful to customers is that they have a common security model.

Exactly.

Authorization, authentication...

Authentication, lineage, provenance. You want to make sure all of these are common services across.

Okay. But you mentioned also the different data patterns, like the stuff that might be streaming in on the cloud. Assuming you're not putting it into just a file system or an object store, and you want to sort of merge it with historical data — what are some of the data stores other than the file system, in other words, newfangled databases, to manage this sort of interaction?

So I think what you're seeing is, we certainly have the raw data. The raw data is going to land in whatever the cloud-native storage is, right? It's going to be S3 on Amazon, WASB, ADLS, Google Cloud Storage, right?
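The kind of Ranger policy Arun describes a little earlier — "only Dave can access these files, George can access these columns" — can be sketched as a simple data structure plus an access check. This is a hedged illustration only: the field names, policy shape, and `is_allowed` helper are hypothetical stand-ins, not Ranger's actual policy schema or REST API.

```python
# Hypothetical sketch of a Ranger-style authorization policy.
# Field names and the check function are illustrative, not Ranger's real schema.

POLICIES = [
    {"user": "dave", "resource": "files", "paths": ["/data/iot/*"], "access": ["read", "write"]},
    {"user": "george", "resource": "columns", "columns": ["region", "device_id"], "access": ["read"]},
]

def is_allowed(user: str, resource: str, access: str) -> bool:
    """Return True if any policy grants `user` the requested access on `resource`."""
    return any(
        p["user"] == user and p["resource"] == resource and access in p["access"]
        for p in POLICIES
    )

print(is_allowed("dave", "files", "read"))    # True: dave can read files
print(is_allowed("george", "files", "read"))  # False: george only has column access
```

The point of the common-layer argument is that the same policy store answers the question regardless of whether the files live in WASB, S3, or HDFS.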
But then the patterns change. You have raw data, you have some sort of an ETL process, right? What's interesting in the cloud is that even the processed data — you take the unstructured raw data and structure it — that structured data also needs to live on the cloud platform, right? The reason that's important is because, A, it's cheaper to use the native platform rather than set up your own database on top of it, right? The other one is you also want to take advantage of all the native services that the cloud storage vendor provides. So for example, they'll give you replication. With data in WASB, you can set up a policy and easily say: this structured data, the table that I have which has a summary of all the IoT activity in the last 24 hours — using the cloud provider's technologies, you can make it show up easily in Europe. You don't have to do any work, right? So increasingly what we at Hortonworks focus a lot on is to make sure that all of the compute engines — whether it's Spark or Hive or MapReduce or Pig, it doesn't matter — are all natively working on the cloud provider's storage platform.

Okay.

Right, so that's a really key consideration for us.

And a follow-up to that: there's a bit of a misconception that Spark replaces Hadoop, but it actually can be a compute engine that complements or replaces some of the compute engines in Hadoop. Help us frame how you talk about it with your customers.

For us, it's really simple. In the past, the only option you had on Hadoop to do any computation was MapReduce. Now, I started working on MapReduce 11 years ago. So as you can imagine, that's a pretty good run for any technology, right?
Spark is definitely the interesting engine for anything from machine learning to ETL for data on top of Hadoop, right? But again, what we focus a lot on is the integration. When we started on HDP, the first version of HDP had about nine open source projects — literally just nine. The last one we shipped was 2.5, and HDP 2.5 had about 27, I think. It's a huge sort of Cambrian explosion, right? But the problem with that is not just that we have 27 projects. The problem is you've got to make sure each of the 27 works with the 26 others, right?

It's a QA nightmare.

Exactly. So that integration is really key. Same thing with Spark: we want to make sure you have the security and governance you need, like you saw in the demo today. You can now run Spark SQL, but also make sure you get row-level filtering, column masking — all of the enterprise capabilities that you need. I was at a financial services vendor three, four weeks ago in Chicago. Today, to do the equivalent of what I showed in the demo, they have a classic EDW and they have to maintain anywhere between 1,500 and 2,500 views of the same database. It's a nightmare, as you can imagine, right? Now the fact that you can do this on the raw data — whether with Hive or Spark or Pig or MapReduce, it doesn't really matter — is really key, and that's the thing we push: to make sure things like governance and security work across all the open source stacks.

So that makes life better. It's a simplification use case, if you will. What are some of the other use cases that you're seeing things like Spark enable?

Machine learning is a really big one, right?
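The row-level filtering and column masking Arun mentions above are what replace those 1,500-plus hand-maintained views: one policy, applied at query time over the raw data. As a conceptual sketch only — in plain Python rather than the actual Ranger enforcement inside Spark SQL or Hive, with illustrative rules and data — it looks like this:

```python
# Conceptual sketch of row-level filtering and column masking.
# Plain-Python stand-in for policy enforcement at query time;
# the rows, filter, and masking rule are illustrative.

ROWS = [
    {"account": "1234567890", "region": "US", "balance": 100},
    {"account": "9876543210", "region": "EU", "balance": 250},
]

def apply_policy(rows, row_filter, masked_columns):
    """Keep only rows passing `row_filter`, then mask the listed columns."""
    out = []
    for row in rows:
        if not row_filter(row):
            continue  # row-level filter: this user never sees the row
        masked = dict(row)
        for col in masked_columns:
            # column masking: expose only the last four characters
            masked[col] = "****" + str(masked[col])[-4:]
        out.append(masked)
    return out

# A hypothetical analyst may only see US rows, with account numbers masked.
visible = apply_policy(ROWS, lambda r: r["region"] == "US", ["account"])
print(visible)  # [{'account': '****7890', 'region': 'US', 'balance': 100}]
```

The same policy evaluated for a different user would yield different rows and masks, which is the one-policy-many-views simplification being described.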
Increasingly, every product is going to have some of it — people call it machine learning, AI, deep learning; there are a lot of techniques out there — but the key part is you want to build a predictive model. In the past we called it predictive analytics, right? You want to build a model and score what's happening in the real world against the model, but equally importantly, make sure the model gets updated as more data comes in; as the model scores, it has to get smarter over time. So that's something we see all over. For example, even within our own product — it's not just us enabling this for the customer — at Hortonworks we have a product called SmartSense, which allows you to optimize how people use Hadoop, right? What are the opportunities for you to exploit efficiencies within your own Hadoop system, whether it's Spark or Hive, right? So we now put machine learning into SmartSense, and it can show you: Mr. Customer X, other customers who run queries like yours are tuning Hadoop this way, they're running these sorts of configs, they're using these sorts of features in Hadoop. That allows us to make the product itself better all the way down the pipeline.

So you're improving the scoring algorithm, or you're sort of replacing it with something better?

What we're doing there is just helping them optimize their Hadoop deployments, right? Configuration and tuning, kernel settings, network settings — we do that automatically with SmartSense.

But the customer — you talked about scoring — they're tuning that, or improving it and increasing the probability of its accuracy, or is it...?

It's both. So the thing is, what they do is you initially come with a hypothesis; you have some amount of data, right?
I'm a big believer that over time you're better off spending more effort getting more data into the system than tuning that algorithm infinitely, right?

Interesting, okay.

Right? So, for example, go talk to any of the big guys at Facebook; they'll tell you the same. What they'll say is it's much better to spend your time getting 10x the data through the system and improving the model that way, rather than spending 10x the time improving the model itself on day one.

Yeah, but that's a key choice, because you've got to spend money on doing either, and you're saying go for the data.

Go for the data. At least now, yeah, go for the data. And the good part of that is it's not just the model — what you've really got to get through is the entire end-to-end flow, right? All the way from data aggregation to ingestion to collection to scoring, all of that, right? So you're better off walking through the paces, building the entire end-to-end product, rather than spending time in a silo trying to make a local change.

We've talked to a lot of machine learning tool vendors, application vendors, and it seems like we got to the point with big data where we put it in a repository, then we started doing better at curating it and understanding it, and then starting to do a little bit of exploration with business intelligence. But with machine learning, we don't have something that does this end to end — from acquiring the data, to building the model, to operationalizing it. Where are we on that? Who should we look to for that?

It's definitely very early. I mean, look at even the EDW space — what is an EDW? EDW is ingestion, ETL, and then a sort of fast query layer: OLAP, BI, on and on, right? So that's the full EDW flow. It's really early in this space, right?
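The score-then-update loop described a little earlier — score each event against the current model, then fold the event back in so the model gets smarter over time — can be sketched minimally. The running-mean "model" and anomaly threshold below are illustrative stand-ins, not anything from SmartSense or a specific Hortonworks product:

```python
# Minimal sketch of a score-then-update loop: the model scores each
# incoming event, then updates itself as more data arrives.
# The running-mean model and threshold are illustrative only.

class OnlineAnomalyScorer:
    """Flag readings far from the running mean, updating the mean as data flows in."""

    def __init__(self, threshold: float):
        self.threshold = threshold
        self.mean = 0.0
        self.count = 0

    def score_and_update(self, value: float) -> bool:
        # Score against the current model first...
        is_anomaly = self.count > 0 and abs(value - self.mean) > self.threshold
        # ...then fold the new observation into the model.
        self.count += 1
        self.mean += (value - self.mean) / self.count
        return is_anomaly

scorer = OnlineAnomalyScorer(threshold=10.0)
readings = [20.0, 21.0, 19.0, 55.0, 20.5]
flags = [scorer.score_and_update(r) for r in readings]
print(flags)  # [False, False, False, True, False]
```

It also illustrates the "go for the data" point: the scorer's quality here comes entirely from how many readings have flowed through it, not from tuning the update rule.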
I don't think as an overall industry we have that end-to-end sort of industrialized design concept yet, right? It's going to take time. A lot of people are ahead — the Googles of the world are ahead — and over time a lot of people will catch up.

We've got to go. I wish we had more time; I had so many other questions for you, but I know time is tight on our schedule, so thanks so much for coming on.

I appreciate it.

All right, keep it right there, we'll be back with our next guest. This is theCUBE, we're live from Spark Summit East in Boston. Right back.