from Berlin, Germany. It's theCUBE, covering DataWorks Summit Europe 2018. Brought to you by Hortonworks.

Well, hello, welcome to theCUBE. We're here on day two of DataWorks Summit 2018 in Berlin, Germany. I'm James Kobielus, lead analyst for Big Data Analytics on the Wikibon team of SiliconANGLE Media. And who we have here today is Alan Gates, one of the founders of Hortonworks, and Hortonworks, of course, is the host of DataWorks Summit. Well, hello, Alan. Welcome to theCUBE.

Hello, thank you.

Yeah, so Alan, you and I go way back. Essentially, what we'd like you to do, first of all, is explain a little bit of the genesis of Hortonworks: where it came from, your role as a founder from the beginning, how that's evolved over time, but really how the company has evolved, specifically with its focus on the community — the Hadoop community, the open source community. You have a deepening open source stack that you build upon, with Atlas and Ranger and so forth. Give us a sense for all of that, Alan.

Sure. So, as I think is well known, we started as the team at Yahoo that really was driving a lot of the development of Hadoop — one of the major players in the Hadoop community. I was on that team for four years; I think the team itself was going for about five, and it became clear that there was an opportunity to build a business around this. Some others had already started to do so, and we wanted to participate in that. We worked with Yahoo to spin out Hortonworks, and actually they were a great partner in that; they helped us get that spun out. The leadership team of the Hadoop team at Yahoo became the founders of Hortonworks, and brought along a bunch of the other engineers to help get started. And really at the beginning it was Hadoop, Pig, Hive, HBase — the very beginning projects. So, a pretty small toolkit.
I mean, our early customers were very engineering-heavy people — companies who knew how to take those tools and build something directly on those tools.

Well, you started off — the Hadoop community as a whole started off — with a focus on the data engineers of the world. And I think it's shifted, and it's been confirmed for me at this show, that you focus increasingly with your solutions on the data scientists who are doing the development of the applications, and on the data stewards, from what I can see.

I think it's really just part of the adoption curve, right? When you're early on that curve, you have people who are very into the technology, understand how it works, and want to dive in there. So those tend to be, as you said, the data engineering types in this space. As that curve grows out, it becomes wider and wider. There are still plenty of data engineers that are our customers, that are working with us, but as you said, the data analysts, the BI people, the data scientists, the data stewards — all those people are now starting to adopt it as well. And they need different tools than the data engineers do. They don't want to sit down and write Java code. Some of the data scientists might want to work in Python in a notebook like Zeppelin or Jupyter, but many want to use SQL, or even a Tableau or something on top of SQL, to do the presentation. Of course, data stewards want tools more like Atlas to help manage all their stuff. So that does drive us to, one, pull more things into the toolkit — you see the addition of projects like Apache Atlas, and Ranger for security, and all that. Another area of growth, I would say, is the kind of data that we're focused on. Early on, we were focused on data at rest: we're going to store all this stuff in HDFS. And as the data scene has evolved, there's a lot more focus now on a couple of things.
One is what we call data in motion, for our HDF product, where you've got it in a stream manager like Kafka or something like that, so there's processing that kind of data. But now we also see a lot of data in various places. It's not just, okay, I have a Hadoop cluster on-premise at my company; I might have some here, some on-premise somewhere else, and I might have it in several clouds as well.

Yeah, your focus has shifted, like the industry's in general, towards streaming data in multi-clouds, where there are more stateful interactions and so forth. I think you've made investments in Apache NiFi, so give us a sense for NiFi versus Kafka and so forth inside your product strategy.

Sure. So NiFi is really focused on that data at the edge. You're bringing data in from sensors, connected cars, airplane engines — all those sorts of things that are out there generating data — and you need to figure out what parts of the data to move upstream, what parts not to, and what processing I can do here so that I don't have to move it upstream. When I have an error event or a warning event, can I turn up the amount of data I'm sending in? If, say, this airplane engine is suddenly heating up maybe a little more than it's supposed to, maybe I should ship more of the logs upstream when the plane lands and connects than I would otherwise. That's the kind of thing that Apache NiFi focuses on. I'm not saying it runs in all those places, but my point is it's that kind of edge processing. Kafka is still going to be running in a data center somewhere; it's still a pretty heavyweight technology in terms of memory and disk space and all that, so it's not going to be running on some sensor somewhere. But it is that data in motion, right?
I've got millions of events streaming through a set of Kafka topics, watching all that sensor data that's coming in from NiFi and reacting to it, maybe putting some of it in the data warehouse for later analysis — all those sorts of things. So that's the differentiation there between Kafka and NiFi.

Right, right, right. So going forward, do you see more of your customers working on internet of things projects? In the industry's popular mind, at least, we don't often associate Hortonworks with edge computing and so forth. Is that changing?

I think we will have more and more customers in that space. I mean, our goal is to help our customers with their data wherever it is — when it's on the edge, when it's in the data center, when it's moving in between, when it's in the cloud. All those places are where we want to help our customers store and process their data. So I wouldn't want to say that we're going to focus on just the edge or the internet of things, but it certainly has to be part of our strategy, because it has to be part of what our customers are doing.

When I think about the Hortonworks community, now we have to broaden our understanding, because you have a tight partnership with IBM, which obviously is well established, huge, and global. Give us a sense, as you've teamed more closely with IBM, for how your community has changed or broadened or shifted in its focus. Or has it?

I don't know that it's shifted the focus. I mean, IBM was already part of the Hadoop community. They were already contributing — obviously they've contributed very heavily on projects like Spark and some of those — and they continue some of that contribution. So I wouldn't say it's shifted it. It's just that we are working more closely together as we both contribute to those communities, working more closely together to present solutions to our mutual customer base. But I wouldn't say it's really shifted the focus for us.

Right, right.
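The edge pattern Alan describes — keeping full detail local and turning up what you ship upstream when a warning event fires — can be sketched roughly as follows. This is an illustrative sketch only, not NiFi code; the field names and the 650 °C threshold are hypothetical.

```python
# Hypothetical sketch of adaptive edge telemetry: buffer detailed
# readings locally, forward only lightweight summaries upstream, and
# flush the full detailed buffer upstream when a reading crosses a
# warning threshold. All names and thresholds are illustrative.

WARN_TEMP_C = 650.0  # hypothetical engine-temperature warning threshold

def process_reading(reading, buffer, upstream):
    """Decide, per reading, how much data to ship upstream."""
    buffer.append(reading)                 # always keep full detail locally
    if reading["temp_c"] >= WARN_TEMP_C:
        # Warning event: ship the recent detailed buffer upstream.
        upstream.extend(buffer)
        buffer.clear()
    else:
        # Normal operation: ship only a lightweight summary.
        upstream.append({"ts": reading["ts"], "ok": True})

# Tiny demo with three readings, the last one over the threshold.
buffer, upstream = [], []
for r in [{"ts": 1, "temp_c": 600.0},
          {"ts": 2, "temp_c": 610.0},
          {"ts": 3, "temp_c": 700.0}]:
    process_reading(r, buffer, upstream)
```

The key design point, as in the interview: the detailed data exists at the edge the whole time, but it only travels upstream when an anomaly makes it worth the bandwidth.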
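The Kafka side of that pipeline — events streaming through topics, with the consumer reacting to some and archiving a sample for later analysis — might look roughly like the loop below. This is a self-contained stand-in, not real Kafka client code: the topic name, event shape, and alert/warehouse sinks are hypothetical, and a real deployment would read from a Kafka consumer subscribed to those topics instead of an in-memory list.

```python
# Stand-in for the consumer loop described above: sensor events stream
# in, anything flagged as a warning triggers an immediate reaction, and
# every Nth event is routed to a warehouse for later analysis.

def dispatch(events, alerts, warehouse, sample_every=2):
    """React to warnings immediately; archive every Nth event."""
    for i, event in enumerate(events):
        if event["level"] == "warning":
            alerts.append(event)           # react to the anomaly now
        if i % sample_every == 0:
            warehouse.append(event)        # keep a sample for analysis

events = [
    {"topic": "engine-temps", "level": "info",    "temp_c": 601.0},
    {"topic": "engine-temps", "level": "warning", "temp_c": 702.5},
    {"topic": "engine-temps", "level": "info",    "temp_c": 598.3},
]
alerts, warehouse = [], []
dispatch(events, alerts, warehouse)
```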
Now at this show — we're in Europe right now, but it doesn't matter that we're in Europe — GDPR is coming down fast and furious. Now, Data Steward Studio: we had the demonstration today, it was announced yesterday, and it looks like a really good tool for the main requirements for compliance, which are to discover and inventory your data and to set up what I'd refer to as a consent portal, so the data subject can go and make a request to have their data forgotten, and so forth. Give us a sense going forward for how, or whether, Hortonworks, IBM, and others in your community are going to work towards greater standardization in the functional capabilities of the tools and platforms for enabling GDPR compliance. It seems to me that the industry is going to need some reference architecture for these kinds of capabilities, so that going forward, partners in your ecosystem can build add-on tools in some common framework. The framework that was laid out today looks like a good basis. Is there anything you're doing in terms of pushing towards more open source standardization in that area?

Yes, there is. So actually, one of my responsibilities is the technical management of our relationship with ODPi, which Mandy Chessell referenced yesterday in her keynote, and that is where we're working with IBM, with ING, with other companies to build exactly those standards. We do want to build it around Apache Atlas — we feel like that's a good tool for the basis of that — but we know that some people are going to want to bring their own tools to it. They're not necessarily going to want to use that one platform, so we want to do it in an open way, so they can still plug in their metadata repositories and communicate with others. And we want to build the standards on top of that: how do you properly implement these features that GDPR requires, like the right to be forgotten? What are the protocols around PII data?
How do you prevent a breach? How do you respond to a breach?

Will that be under the umbrella of ODPi, that initiative of the partnership, or will it be a separate group?

Well, Apache Atlas is certainly part of Apache and remains there. What ODPi is really focused on is that next layer up: how do we engage not the programmers — because programmers can engage really well at the Apache level — but, the next level up, the data professionals, the people whose job it is, the compliance officers, the people who don't sit and write code? Because frankly, if you connect them to the engineers, there's just going to be an impedance mismatch in that conversation.

You've got policy wonks and you've got tech wonks. They understand each other at the wonk level.

That's a good way to put it. So that's where ODPi is really coming in: that group of compliance people speaks a completely different language, but we still need to get them all talking to each other, as you said, so that there are specifications around how we do this, and what compliance is.

Well, Alan, thank you very much. We're at the end of our time for this segment. This has been great — it's been great to catch up with you. Hortonworks has been evolving very rapidly, and it seems to me that, going forward, you're well positioned for the new GDPR age to take your overall solution portfolio, your partnerships, and your capabilities to the next level, really in an open source framework. Though in many ways, nobody is purely 100% open source; you are still very much focused on open frameworks for building very scalable solutions for enterprise deployment.

This has been James Kobielus with Alan Gates of Hortonworks, on theCUBE at DataWorks Summit 2018 in Berlin. We'll be back fairly quickly with another guest. Thank you very much for watching our segment.