Live from San Jose, it's theCUBE, presenting Big Data Silicon Valley, brought to you by SiliconANGLE Media and its ecosystem partners.

Welcome back to theCUBE's continuing coverage of Big Data SV. This is our 10th Big Data event, our fifth here in San Jose. We are down the street from the Strata Data Conference, and we invite you to come down and join us. Come on down, we are at Forager Tasting Room & Eatery, a super cool place. We've got a cocktail event tonight and an analyst briefing tomorrow morning. We are excited to welcome back to theCUBE Scott, now the CTO of Hortonworks. Hey Scott, welcome back.

Thanks for having me, and I really love what you've done with the place. I think there's as much energy here as I've seen in the entire show. So thanks for having me over.

We have done a pretty good thing with this place that we're renting for the day. So thanks for stopping by and talking with George and me. So in February, Hortonworks announced some news about Hortonworks DataFlow. What was in that announcement? What does that do to help customers simplify data in motion? What industries is it going to be most impactful for? I'm thinking GDPR is a couple months away. What's new there?

Well yeah, and there are a couple of topics in there, right? So obviously we're very committed to, what I think is one of our unique value propositions, really creating an easy-to-use data management platform for the entire life cycle of data: from when data are created at the edge, as data are streaming from one place to another, to when they end up at rest, analytics get run, and analytics get pushed back out to the edge. So that entire life cycle is really the footprint that we're looking at. And when you dig a level into that, obviously, the data in motion piece is hugely important. And so I think one of the things that we've looked at is we don't want to be just a streaming engine or just a tool for creating pipes and data flows and so on.
We really want to create that entire experience around what needs to happen for data that's moving, whether it be acquisition at the edge in a protected way with provenance and encryption, or applying streaming analytics as the data are flowing, and everywhere in between. That's what HDF represents. And what we released in our latest release, which to your point was just a few weeks ago, is a way for our customers to go build their data-in-motion applications using a very simple drag-and-drop GUI interface, so they don't have to understand all of the different animals in the zoo and the different technologies that are at play. It's like, I want to do this, okay, here's a GUI tool. You have all of the different operators that are represented by the different underlying technologies that we provide as Hortonworks DataFlow, and you can string them together, and then you can make those applications and test those applications. And one of the biggest enhancements we made is that once those things are built in a laptop environment or in a dev environment, it's very easy for them to be published out to production, or to be published out to other developers who might want to enhance them, and so on. So the idea is to make it consumable inside of an enterprise. And when you think about data in motion and IoT and all of those use cases, it's not going to be one department, one organization, or one person that's doing it. It's going to be a team of people that are distributed, just like the data and the sensors. And so being able to have that sharing capability is what we've enhanced in the experience.

So you were just saying before we went live that you're here having speed dates with customers. What are some of the things?

It's a little bit more sincere than that, but yeah.

Isn't speed dating sincere? Am I missing something? It's 2018, I'm not sure.
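Conceptually, the drag-and-drop experience Scott describes strings independent operators together into a flow. Purely as an illustration (this is not HDF's or NiFi's actual API, and the operator names are invented), a flow like that can be modeled as a chain of processor functions:

```python
# Illustrative sketch only: a toy "flow" of chained processors, loosely
# mirroring how HDF/NiFi lets you wire operators together. The operators
# (ingest, encrypt, route) are hypothetical stand-ins, not real HDF processors.

def ingest(record):
    # Acquire data at the edge and tag it with provenance metadata.
    return {"payload": record, "provenance": ["edge-sensor-01"]}

def encrypt(event):
    # Stand-in for encryption: reverse the payload so it is no longer plaintext.
    event["payload"] = event["payload"][::-1]
    event["provenance"].append("encrypted")
    return event

def route(event):
    # Decide a destination based on a simple payload property.
    event["destination"] = "analytics" if len(event["payload"]) > 5 else "archive"
    return event

def run_flow(record, *processors):
    # String the operators together, as in the GUI.
    event = record
    for step in processors:
        event = step(event)
    return event

result = run_flow("temperature=72F", ingest, encrypt, route)
```

The point of the sketch is the composition: each operator is independent, so a flow built and tested in a dev environment can be republished elsewhere with the same pieces.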
What are some of the things that you're hearing from customers, and how is that helping to drive what's coming out?

So there are two things that I'm hearing. Number one, certainly, is that they really appreciate our approach to the entire lifecycle of data, because customers are really experiencing huge data volume increases, and data comes from everywhere. It's no longer just from the ERP system inside the firewall; it's from third parties, it's from sensors, it's from mobile devices. And so they really do appreciate the territory that we cover with the tools and technologies that we bring to market. And so that's been very rewarding. Clearly, customers who are now well into this path are starting to think about data governance in this new world. And data governance, I know I just took all of the energy out of the room, it sounds hard. What I mean by data governance really is: customers need to understand, with all of this diverse, connected data everywhere, in the cloud, on-prem, in sensors, from third parties and partners, that frankly, they need a trail of breadcrumbs that says what is it, where did it come from, who had access to it, and what did they do with it? If you start to piece that together, that's what they really need to understand about the data estate that belongs to them, so they can turn that into a refined product. And so when you then segue into one of your earlier questions about GDPR, GDPR is certainly a triggering point where it's like, okay, the penalties are huge, oh my God, it's a whole new set of regulations that I have to comply with. And when you think about that trail of breadcrumbs that I just described, it actually becomes a roadmap for compliance under regulations like GDPR, where if a European customer calls up and says, forget my data, the only way that you can guarantee that you forgot that person's data is to actually understand where it all is. And that requires proper governance tools and techniques.
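The "trail of breadcrumbs" Scott describes is essentially a lineage catalog: every copy of a dataset records what it is, where it was derived from, and whose data it holds. A minimal sketch, with hypothetical record shapes rather than any particular governance product's schema, shows why that trail is the roadmap for a GDPR erasure request:

```python
# Hypothetical lineage records: each dataset remembers what it is, where it
# came from, and which data subjects appear in it. Dataset names are invented.
breadcrumbs = [
    {"dataset": "crm.customers",  "derived_from": None,            "subjects": {"alice", "bob"}},
    {"dataset": "mktg.segments",  "derived_from": "crm.customers", "subjects": {"alice"}},
    {"dataset": "cloud.backup01", "derived_from": "crm.customers", "subjects": {"alice", "bob"}},
]

def locations_of(subject):
    """Every dataset holding a subject's data: the roadmap for a GDPR
    'forget me' request, since you can only erase what you can find."""
    return sorted(b["dataset"] for b in breadcrumbs if subject in b["subjects"])
```

If the derived copies (`mktg.segments`, `cloud.backup01`) were missing from the trail, a "forget my data" request could never be guaranteed complete, which is the argument being made above.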
And so when I say governance, it's really that aspect; it's not like the governor and the government and all that. That's an aspect, but the real important part is: how do I keep all of that connectivity so that I can understand the landscape of data that I've got access to? And I'm hearing a lot of energy around that. And when you think about an IoT kind of world, distributed processing, multiple hybrid cloud footprints, data is just everywhere. And so the perimeter is no longer fixed, it's variable. And being able to keep track of that is a very important thing for our customers.

So continuing on that theme, Scott, data lakes seem to be the first major new repository we added after we had data warehouses and data marts. And it looked like the governance solutions were built around the perimeter of the data lake. You were alluding to how many more repositories there are for data, whether at rest or in motion. Do we have to solve the governance problem end-to-end before we can build meaningful applications?

So I would argue, personally, that governance is one of the most strategic things for us as an industry, collectively, to go solve in a universal way. And what I mean by that is, throughout my career, which is probably longer than I'd like to admit, in an EDW-centric world where things were somewhat easier in terms of the perimeter and where the data came from, data sources were much more controlled, typically ERP systems owned wholly by a company. Even in that era, true data governance, metadata management, and provenance were never really solved adequately. There were 300 different solutions, none of which really won. They were all different and non-compatible, and the problem was easier. In this new world with connected data, the problem is infinitely more difficult to solve. And so that same kind of approach, 300 different proprietary solutions, I don't think is going to work.
So tell us, how does that approach have to change, and who can make that change?

So one of the things, obviously, that we're driving is we're leveraging our position in the open community to try to use the community to create that common infrastructure, that common set of APIs for metadata management, and of course we call that Apache Atlas. And we work with a lot of partners, some of whom are customers, some of whom are other vendors, even some of whom could be considered competitors, to try to drive an Apache open source project to become that standard layer that's common, into which vendors can bring their applications. So now, if I have a common API for tracking metadata and that trail of breadcrumbs that's commonly understood, I can bring in an application that helps customers develop the taxonomy of the rules that they want to implement, that helps visualize it, and all of the other functionality, which is also extremely important, and that's where I think specialization comes into play. But having that common infrastructure, I think, is a really important thing, because that's going to enable data, data lakes, and IoT to be trusted, and if it's not trusted, it's not going to be successful.

Okay, there's a chicken and an egg there, it sounds like.

Am I the chicken or the egg?

Well, you're the CTO.

Okay.

The thing I was thinking of was the scope: the broader the scope of trust that you're trying to achieve at first, the more difficult the problem. Do you see customers wanting to pick off one high-value application first? Not necessarily one that's about managing what's in Atlas, in the metadata, so much as they want to do an IoT app, and they'll implement some amount of governance to solve that app. In other words, which comes first? Do they have to do the end-to-end metadata management and governance, or do they pick a problem off first?

In this case, I think it's chicken or egg, but not...
I mean, you could start from either point. I see customers who are implementing applications in the IoT space, and they're saying, hey, this requires a new way to think about governance, so I'm going to go build that out, but I'm going to think about it being pluggable into the next app. I also see a lot of customers, especially in highly regulated industries and especially in highly regulated jurisdictions, who are stepping back and saying, forget the applications, this is a data opportunity. And so I want to go solve my data fabric, and I want to have some consistency across that data fabric, into which I can publish data for specific applications and guarantee that, holistically, I'm compliant and I'm sitting inside of our corporate mission and all of those things.

Okay. So one of the things you mentioned, and we talk about this a lot, is the proliferation of data. There are so many different sources, and companies have an opportunity; you mentioned the phrase data opportunity. There is massive opportunity there, but you said, even from a GDPR perspective alone, I can't remove the data if I don't have the breadcrumbs to know where it is. How is it possible? As a marketer, we use terms like, get a 360-degree view of your customer. Is that actually something that customers can achieve? Leveraging a data lake, can a retailer, say, really get a 360, a complete view of their customer?

All right, 358. It's pretty good. And we're getting there. Yeah, I mean, obviously the idea is to get a much broader view, and 360 is a marketing term. I'm not a marketing person, but it certainly creates a much broader view of highly personalized information that helps you interact with your customer better. And yes, we're seeing customers do that today and have great success with it, and actually change and build new business models based on that capability, for sure.
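Mechanically, the "broader view" of a customer comes from joining records from many sources on a shared customer key. A toy sketch of that idea, with invented source names and fields, and none of the governance controls a real deployment would need:

```python
# Toy sketch of a broader ("360-ish") customer view: merge records from
# several sources on a shared customer key. Sources and fields are invented.
sources = {
    "crm":     [{"cust": "c1", "name": "Alice"}],
    "web":     [{"cust": "c1", "last_page": "/pricing"}],
    "support": [{"cust": "c1", "open_tickets": 2}],
}

def customer_view(cust_id):
    """Fold every source's records for one customer into a single profile."""
    view = {"cust": cust_id}
    for records in sources.values():
        for r in records:
            if r["cust"] == cust_id:
                view.update({k: v for k, v in r.items() if k != "cust"})
    return view

profile = customer_view("c1")
```

The sketch also shows why the governance discussion above matters: the value comes precisely from combining sources, which is exactly what makes the lineage and access questions harder.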
And the folks who've done that have realized that in this new world, the way that works is you have to have a lot of people with access to a lot of data. And that's scary, because that's not the way it used to be, right? It used to be you'd go to the DBA and ask for access, and then your boss would have to sign off, and you'd get exactly what you asked for. In this world, you need to have access to all of it. So when you think about this new governance capability, where as part of the governance, integrated with security, personalized information can be encrypted, it can be kind of blurred out, but you still have access to the data, and to the relationships to be found in the data, to build out those sophisticated models. And so not only is this a new opportunity for governance because of the sources, the variety, the different landscape, it's ultimately very much required, because if you're the CSO, you're not going to give the marketing team access to all this customer data unless you understand that, right? It has to be "I'm giving it to you, and I know that it's automatically protected," versus "I'm going to make you ask for it," for this to be successful.

I guess, following up on that, and it sounds like what we were talking about with chicken or egg: are you seeing an accelerating shift from data being collected centrally from applications to, as we hear from Amazon, the amount coming off the edge accelerating?

It is, and I think that that is a big driver of, frankly, faster cloud adoption. The analytics space in particular has been a laggard in cloud adoption, for many reasons, and we've talked about it previously. But one of the biggest reasons, obviously, is that data has gravity; data movement is expensive.
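The pattern described above, where personal fields are encrypted or "blurred out" but the relationships in the data stay analyzable, is often implemented as deterministic pseudonymization: the same input always maps to the same token, so joins and models still work without exposing the raw value. A hedged sketch (the salt handling here is simplified; in practice the secret lives in a key store):

```python
import hashlib

SALT = b"demo-salt"  # illustration only; a real system protects this secret

def pseudonymize(value):
    """Deterministic token: the same input always yields the same token,
    so records can still be joined and modeled without revealing the value."""
    return hashlib.sha256(SALT + value.encode()).hexdigest()[:12]

record = {"email": "alice@example.com", "spend": 420}
safe = {"email": pseudonymize(record["email"]), "spend": record["spend"]}
# An analyst can group or join on the token but never sees the address itself.
```

This is why the CSO trade-off Scott mentions can work: access is granted to the protected form by default, rather than gated request by request.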
And so now, when you think about where data is being created and where it lives being further out on the edge, and data that may live its entire life cycle in the cloud, you're seeing a reversal of gravity, more towards the cloud. And that again creates more opportunities in terms of driving a more varied perimeter and just keeping track of where all the assets are. Finally, I think it also leads to this notion of managing the entire life cycle of data. One of the implications is that if data is not going to be centralized, it's going to live in different places, and applications have to be portable to move to where the data exists. And so when I think about creating ubiquitous data management within the Hortonworks portfolio, that's one of the big values that we can create for our customers. Not only can we be an on-ramp to their hybrid architecture, but as we become that on-ramp, we can also guarantee the portability of the applications that they've built, out to those cloud footprints and ultimately even out to the edge.

So a quick question then, to clarify or drill down on that: would that mean you could see scenarios where Hortonworks is managing, say, the distribution of models that do the inferencing on the edge, and you're collecting and bringing back the relevant data, however that's defined, to do the retraining of models or the creation of new models?

Absolutely. That's one of the key things about the NiFi project in general, and Hortonworks DataFlow specifically: the ability to selectively move data, and the selectivity can be based on analytic models as well. So the easiest case to think about is self-driving cars. We all understand how that works, right? A self-driving car has cameras, and it's looking at things going on. It's making decisions locally, based on models that have been delivered, and they have to be made locally because of latency, right?
But selectively: hey, here's something that I saw, an image I didn't recognize. I need to send that up so that it can be added to my lexicon of what images are and what action should be taken. Of course, that's all very futuristic, but we understand how that works, and it has application in things that are very relevant today. Think about jet engines that have diagnostics running. Do I need to send that terabyte of data an hour over an expensive connection? No, but I have a model that runs locally that says, wow, this thing looks interesting; let me send a gigabyte now for immediate action, right? So that decision-making capability is extremely important.

Well, Scott, thanks so much for taking some time to come chat with us once again on theCUBE. We appreciate your insights.

Appreciate it, time flies.

Doesn't it, when you're having fun? All right, we want to thank you for watching theCUBE. I'm Lisa Martin with George Gilbert. We are live at Forager Tasting Room in downtown San Jose at our own event, Big Data SV. We'd love for you to come on down and join us today, tonight, and tomorrow. Stick around, we'll be right back with our next guest after a short break.
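The jet-engine example in the interview boils down to a simple edge pattern: run a lightweight model locally and forward only the readings it flags as interesting, trading a terabyte of raw telemetry for a gigabyte of anomalies. A minimal sketch, with an invented threshold model standing in for a real trained one:

```python
# Edge-side selective forwarding, sketched. The "model" here is just a
# threshold on a vibration reading, a stand-in for a real anomaly model.
def anomaly_score(reading):
    return reading["vibration"] / 10.0

def filter_for_upload(readings, threshold=0.8):
    """Keep only the readings the local model flags as worth sending upstream."""
    return [r for r in readings if anomaly_score(r) >= threshold]

telemetry = [
    {"engine": "e1", "vibration": 3},   # normal, stays local
    {"engine": "e1", "vibration": 9},   # anomalous, uploaded
    {"engine": "e2", "vibration": 12},  # anomalous, uploaded
]
to_upload = filter_for_upload(telemetry)
```

The uploaded subset is also what would feed the retraining loop mentioned above: interesting cases go back to the center, and updated models come back out to the edge.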