Live from Dublin, Ireland, it's theCUBE, covering Hadoop Summit Europe 2016, brought to you by Hortonworks. Now your hosts, John Furrier and Dave Vellante. Hey, welcome back everyone. We are here live in Dublin, Ireland for Hadoop Summit 2016 in Europe. It's theCUBE, SiliconANGLE's flagship program. We go out to the events and extract the signal from the noise. I'm John Furrier, with my co-host Dave Vellante. Our next guest is Alan Gates, co-founder of Hortonworks, ODPI Steering Committee, giving us the update on what's going on in the open data world. Welcome to theCUBE. Thank you very much. First, congratulations, Hortonworks is doing well. You guys went public, the company's growing, the market's changing, a lot of partnering going on, a lot of community action. Yep, a lot of changes, a lot of growth. It's been very good, it's been a great year for us. One of the things we were talking about yesterday pretty hardcore was community. Community right now is more important than ever in this ecosystem because of all the changes. It's not just Hadoop anymore. That's one element of the overall Hadoop ecosystem, which is now just a label for the big data ecosystem of technologies. A lot of new technologies coming in, new projects starting, a lot of big vendors with installed base and big money participating. Cloud on top of that provides more power. It's like Star Trek: Scotty, more horsepower. So we've got that going on right now, it's a good thing. How does that affect what's going on in the community and ODPI, what's the update? So ODPI, really what we saw is, well, let me kind of back up and say where do we think our value is and what do we see as the need here, right? As you said, there's a lot of power in Hadoop. There's a lot of different things you can do with it. It's a platform, but how are people going to use that? As Hortonworks, we talk to our users, and some of the other partners in ODPI talk to theirs.
One of the things they realized is this thing can be used in so many different ways that it's actually hard to build on top of it. If you're an application provider and you want to build an application to sell to people, you get it to work in one environment, you take it to a different distribution, doesn't work at all. And it becomes a test nightmare for them and a support nightmare, right? So one of the things that we wanted to address in ODPI is how do we bring specifications and best practices and testing to this ecosystem so that these users can, or sorry, these application writers can write it once, test it somewhere and have confidence it's going to work everywhere. Because what do we want, what does Hortonworks want? What do the other distribution providers want? We want Hadoop to accelerate. We want it to be adopted and we want it to be successful. The more we can make it easy for application providers for that next layer up to write on top of it, the better it is for all of us. This is the focus we've been hearing about automation, the assembly stuff, the effort to reduce complexities for not only just implementing interoperability's huge, dealing with other open APIs, kind of a unification message. You see, is that something that's happening? I think that's true. I think it's, yeah, how do we make this so that it's just bonehead simple and it'll never quite be bonehead simple, but compared to where it is now, it can be bonehead simple to make this stuff go, right? So we were talking at our Open, it's kind of what the cloud guys, the public cloud guys are trying to do, make it bonehead simple, but not as robust as what's coming, you know, but they get there and things are moving fast. We were making a comparison to the early UNIX days where there was tons of fragmentation. I'm sure you hear this a lot, but where does that metaphor sort of line up and where does it break down in your opinion? I don't know, I haven't thought about it. 
Sorry, UNIX, okay, so let me think about it. No, I think it's okay. I'm basically saying, you know, HP, IBM, blah, blah, blah. I remember those days, I am old enough to remember that. That's why we figure okay. No, I just had to think about it for a second. It's okay, so now, and then Linux comes in, and we were sort of saying, well, that's kind of what ODPI is. Except that Linux sort of did it differently, right? Linux came in and just crushed those, right? It's not that it standardized them, it just made them irrelevant, and I don't think that's what ODPI is going to do at all. I think what it's going to do is say, here's the way to set these things up so that everybody can use them. Because if you look at our specification, it isn't about have these features, do these things. It's about when you install it, don't screw with the directory structure. When you set it up, make sure these environment variables are set so that as an application, I can find where you put things. Don't change public APIs, which you would think people should just know, but surprisingly they don't. So it's a set of standards of how you would lay this stuff out and make it usable. So we're not trying to supersede anybody else or add new features, we're just trying to make it usable. That's where I would say the analogy breaks down. I think to a first approximation, it is a good analogy. Yeah, so that's what you're getting. Let me jump in. That's really helpful. That was my major in undergraduate, operating systems, so I remember that vividly. But it's a metaphor to try to put a framework around it. So I agree with it, it's not a pure metaphor. It's just like "Red Hat for Linux": for Hadoop, that is not a good metaphor for Hortonworks. It can never be the same, ever. But it was an operating system contained within hardware, minicomputer or whatnot, and fragmented around.
So the fragmentation is challenging, but the unification, the thing about Linux at the time that we're comparing to was the solidarity around Linux. The community said, hey, let's not, let's think about the bigger picture. The bigger picture is this fragmentation and these market forces. So if we stick together, we can go provide an alternative and grow and they crushed everyone and made everyone irrelevant, which is totally true. So now we live in an era of APIs. We live in an era of distributed operating system, if you will. So in a way, it's kind of like just a metaphor, but if you take that API base, universal API, so you can get away with standards possibly, but the community coming together. I agree. The community has to come together around this for it to be successful. And I think in that sense, you are right. This does have to be, we start this out as when we go to application providers and end users and pitch this, everybody says, I love that. Can you give me that yesterday, please? Right? And so as we start to prove out our value and as application providers start to say, yeah, we want this, that's how we believe we can grow that community. But I agree. You're right. So I wonder if I could just go one more follow up. So in the Linux world, there was and still is multiple distros, one obviously major one. So there was some initial fragmentation. Are there some similarities there as well? I think there will continue to be multiple distributions. I don't think, at least, well, okay, let me put it this way. I don't think ODPI is going to create a single distribution. That's not our goal. That's, I don't think that we would have any way to have that happen. And I would not be surprised to continue to see multiple distributions. Right, which again, in Linux today, you have multiple distributions. Which is always a room for two or three. I think it's appropriate. So let's get that, you mentioned, everyone wants that yesterday. 
I love that line, because that's pretty much go faster, pedal faster, catch up, fall behind. So catch up and lead is really the industry's imperative. Everyone kind of wants that, right? Come on, go faster. So what is that pitch where everyone says, I want that yesterday? And what are the updates? Where's the progress bar? What's been done? Can you just take us through that real quick? So here's what we've done. We've split our specifications into a couple of tracks. So we have a runtime. First the pitch. First the pitch. Okay, sorry, the pitch, yes. What's the pitch, what are we going to give? The pitch is kind of three-fold. We're hitting three audiences, right? To the distribution providers, the pitch is, we can give you a set of tests that can prove that you meet what we're doing, right? That you get the little gold seal that says you're approved. To the application providers, which is really who we're pitching to hard, you can write your application. You can test it against our reference build, or test it against one of the distributions that are certified, and it should run anywhere. Test once, deploy anywhere, just like Java or one of those, right? Which to them is music to their ears. To the end users, it's similar because they tend to be writing their own software. They may be doing their own internal things. I think that's a market where we can maybe do better in the future. We kind of focused on the application providers first. It's just step one, yeah. Yeah, you've got to start somewhere, right? But I think that same pitch is going to work for them because everybody customizes this when they bring it in house, right? Everybody writes their own software and they want the same thing. They want it because, honestly, you look at a lot of the big players, they have multiple different distributions. It's not like they just buy from us or just buy from Cloudera, whatever. They've got one of each.
But two and three are really a developer play, whether they're an ISV or somebody inside the Bank of America. They want reliability. They want to be able to write their own apps, build it in and then push a button, like a DevOps agile way, make it work, infrastructure as code, all that good stuff. Exactly. So okay, so where's the update on that? So we split our efforts into runtime and operations. So runtime is Hadoop proper: HDFS, YARN, MapReduce, just core old-style Hadoop. And with that we released last month a specification of, okay, here's how you should lay this out if you're an ODPI-compliant distribution. Here are the environment variables you should have set. Here are the best practices you should be following. Included in there are some best practices for ISVs, for app writers, so that they can know that they conform. We also released a set of tests so that the distributions can run these tests against their stuff and assert, yes, I am compliant. And we've released a reference build so that people can use that to test with. That's out there, ready to download and use. The distribution providers that are part of ODPI are now running their tests against it and we hope to see them release compliant versions relatively quickly. And the outcome there is you'll see some gold stars and some certification, fully approved Hadoop, blah, blah. Yep, seal of approval. Got it, seal of approval, perfect. And then on the operations side, which is focused on Ambari and how you actually install this software and manage it, that's in a draft state. So we have a team working on writing up a spec for that, with the plan that by this summer we'll also have specs and tests and all that for that. And that's where some of the friction came in in the marketplace, the requirement for Ambari, right? Wasn't that the... And you know, we addressed that, that was one of the friction points. It wasn't the only one, but it's one. Sure, herding cats.
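[Editor's illustration: the kind of runtime check described above (expected environment variables are set and point at real install locations) can be pictured as a short script. This is a minimal sketch and not ODPI's actual test suite. The variable names are the conventional Apache Hadoop ones and are an assumption here, since the interview does not list what the specification requires, and `check_runtime_env` is a hypothetical helper.]

```python
import os

# Assumption: conventional Apache Hadoop variable names, used for illustration.
# The interview does not enumerate the variables the specification requires.
ASSUMED_REQUIRED_VARS = ("HADOOP_HOME", "HADOOP_CONF_DIR")


def check_runtime_env(env, required=ASSUMED_REQUIRED_VARS):
    """Return (compliant, problems) for a mapping of environment variables.

    A variable passes only if it is set and points at an existing directory,
    which is how an application would locate what the distribution installed.
    """
    problems = []
    for name in required:
        path = env.get(name)
        if not path:
            problems.append(f"{name} is not set")
        elif not os.path.isdir(path):
            problems.append(f"{name} points to a missing directory: {path}")
    # The result is pass/fail: any problem means the check fails.
    return (not problems, problems)


if __name__ == "__main__":
    ok, problems = check_runtime_env(os.environ)
    print("runtime checks passed" if ok else "runtime checks failed")
    for problem in problems:
        print(" -", problem)
```

[Run on a cluster node, the script prints whether the assumed variables resolve to real directories. A real compliance suite would cover far more, such as directory layout and public API behavior.]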
Sure, but we did address that by just recognizing that, you know what, not all members are going to be compliant with the operations spec. Some are only going to be compliant with the runtime spec and that's cool. Okay, so that's not a deal breaker then? No, it's not. We have members already in there that have just flat out said we don't care about Ambari, we don't use it. We're not going to comply with that. How much are you guys looking at market forces? I mean, how aware are you guys? I'm not saying you're not aware, but there's a lot of stuff going on that's vectoring in from the market. Cost of ownership, complexity reduction, vendors' own specific moves. For instance, I was just talking earlier about how you've got Oracle running Big Data SQL now with Hortonworks in a huge deal no one's talking about, but you can actually run 12c with Hortonworks without Exadata. I mean, that is a huge deal because, you know, Oracle used to have a deal with Cloudera and now you see the multi-vendor kind of playbook again. We've seen that movie in the 80s and 90s. So, you know, this is kind of what's happening now. So that's a pretty big deal, small in comparison to the ecosystem, but it's Oracle, it's got an installed base. Sure, it's important. I mean, these are the moves that are going on. How does that impact your role and your team? And... As part of ODPI. So I think it drives home all the more that the better job we can do of making these distributions low friction for people like Oracle, the easier we make their life, right? Because the reality for them, and for all these kinds of players, is they need to work with all the major distributions. They all have customers that are using all the distributions and we recognize that and want to make that... And that's the whole effort of the whole ODPI.
And the thing that, you know, I saw yesterday, I didn't have time to talk about it in the intro, Dave, was there's a huge undercurrent of data management push. Now the data management market is like banging on this ecosystem, like we need more hardened, you know, relevant real-time data management stuff now. Right. Like yesterday. Right. So, you know, I love that line. And because there's an installed base of data warehousing, there's an installed base of data management industries. So this is a big deal. How does all that weave into the mix? So that we haven't really addressed yet. I think that for us is kind of an unaddressed market and that is a good question: we need to figure out how we can make that metadata available to them as well. And all the things that kind of go around that. And the reason I would say we haven't addressed that yet is we started out with a simpler, easier vision just so we could get up and going, which was just to do core and just Ambari. So we really haven't gotten to a point yet where we're touching those things. Now, we want to bring other projects in, right? And our approach is let's bring in what we think has the biggest bang for the buck. An obvious candidate there is Hive, which I'm a little biased toward because that's what I look after in most of my day job at Hortonworks. But, you know, just going by the 80-20 rule, that's going to be an obvious thing to look at. As we start to bring that in, your question of these metadata managers, security systems, all those things become even more relevant because that's going to interact a lot more. And the security work is interesting now. You guys are setting the table for that with the security stuff now. Right, but so I think that's a question we haven't answered yet, but we'll start to answer over the next six to 12 months. So the concept basically is to have the Energy Star sticker: it's okay, it's compliant.
But you're flexible. We talked about Ambari: you're okay with "I'm not going to follow that." So what is it, you're 75% compliant, 80% compliant? No, that's why we split it up, so you can say I'm compliant on the runtime. So I'm runtime compliant, but I'm not compliant on the ops side. Exactly. Okay, period. That's where we- It's a binary, right? Yes, that's where we would go. Yeah, because otherwise, 75% compliant. That's like when a software engineer tells you he's 75% done with his company. That just means nothing, right? It means you're in trouble. Okay, good. So yeah, that's very much the goal. And as we bring in things like Hive, we'll probably put that in the runtime pile, right? Because that will make sense. Okay, great. So to summarize, how's it going? I mean, what's the reaction from both the developer community and the developers within the customers? So all the customers we've talked to have been very excited about it. One thing we feel like we need is to start to get even stronger feedback from the application developer community, because so far, I mean, this was started by Hortonworks, Pivotal, some of those. IBM's in there, right? The big early members were all distribution providers. We know what we want to do with Hadoop, right? So a big area of focus for us is to get, and we do have some application providers, SAS and DataTorrent and guys like that, that are giving us good feedback. But I think the next thing for us is to really start to grow that part of the community, because that's where we really need the feedback. We need those people telling us, you're making our life hard because of X, fix that, right? And that to us is kind of- And they're under pressure to operationalize their business. And that's where the growth is kind of hitting that glass ceiling right now. Exactly, and so I think we've done an okay job of getting those people in.
I think we need to grow that even more. That's what I would say is our next area of focus. All right, so what's next? Give us the roadmap on your next milestone, what are you guys going to knock down next? So our next milestone is getting the operations stuff out this summer. Then we want to have a six-month cadence on the specs, because as everybody knows, this world moves at kind of an insane pace. And we do want to bring a little bit of order to that. We're not going to slow down the communities. Obviously we have no desire to do that, but we want to bring some order for these end users. So we want to be really measured about how often we do those updates. And our goal would be then to have another update in the fall and keep up with how the technology is changing. Is Microsoft involved? They have not been involved to date. They're not coming in. What's your take generally? 'Cause, you know, as a co-founder of Hortonworks, you've been in the ecosystem from day one. What's the view of the cloud? 'Cause the cloud, on-prem, hybrid clouds, the cloud generally as an operating model, seems to be really a forcing function on accelerating a lot of the big data apps or companies, which are apps at the same time. Very data-driven, native data across the board. Do you agree with that view? Can you share some color on your view of what cloud will do for the ecosystem? I think, so this is all guesswork, right? Yeah, let's be here in the future. Predicting the future is a dangerous business, but I'll take my crack at it. No, I think cloud is a forcing function. More and more we're seeing people say, you know, either we're going to the cloud first, or we've been told by our CIO that we have to be in the cloud in two years, five years, whatever. So it's definitely a forcing function. I also think it's a real opportunity for providers like Hortonworks because we can start to really simplify things for you, right?
Instead of having to say, oh, go buy a bunch of machines, go install all this stuff, go work through all your processes to get all that installed, it can be, hey, we provide you a set of images for a particular application. We make sure that works well with the other images. We give you kind of a Lego format like Arun was talking about today with the containers, right? I think that's a real opportunity for us to enable our customers. So to us, it's really exciting. I think it is going to be a good accelerator for the ecosystem. So in the panel this morning, did you hear the panel this morning? I did not. Okay, but so you were riding bikes. So, but I'll summarize and get your take. So the end customers in the panel were essentially saying, public data, public cloud, proprietary data, we're going to do that on-prem. And that's kind of how they broke it down. And I was struck by that saying, is that sustainable? Because from a cost standpoint and just an operations standpoint, it didn't strike me as the long-term strategy. What's your take on that? Well, here's what I've seen. Anytime, the word move is a four-letter word in data. If you have to move your data back and forth, you're in a bad place, I think. And so if you're going to set it up as, yeah, some works on-prem, some's in the cloud, and we have to move data back and forth. I haven't seen that one. But I think that's what they were saying in fairness. Because they were saying basically, public data that's in the public cloud, that's where we're going to run that. But then our data's on-prem. And they did say, well, we're going to only move the jewels back. You know, it becomes that. There was some moving. And I'm like, oh, move it, right? Yeah, everybody says I'm not going to move it. And then all of a sudden I need to do a join. So. Right. But you're working on that speed of light problem right here. Yeah, we got that down, yeah, no problem. Changing the laws of physics, always a good business model. 
So that's where I see that kind of bounce up against me. More and more, we do see people saying they're okay with the security measures in the cloud. I think that is something that we're going to continue to see push on. I think there are people that are going to conclude that they just can't put their data in the cloud and they're going to end up pulling it back or maybe building their own cloud. Or rolling in something like a Snowball and stuffing it with data and then shipping it on a truck, is that crazy? It's not crazy one time. It's crazy as a standard, I think, right? I don't think you can do that every weekend. That's not going to give you the latency you want, right? Most of my users. Yeah, it's going to be a one-time seeding. Right, most of my users complain to me because they can only load their data every five minutes. If it's, I can only load it every weekend when the truck stops, that's just not going to work, right? Okay, so what's the bottom line then? This is a hybrid world in perpetuity? Or are you saying that it's going to be leaning toward public? I think we're seeing more and more shifting toward the cloud, but I don't think we're going to see proprietary go away, right? I mean, people are still running mainframes, so I don't think it's going to completely go away. I do think the center of gravity will shift a bit. Alan, I want to get your take. Final question, for me, is obviously there's been great success in what this 10-year run has been with Hadoop as an early idea. Now, I'm grocery commercializing in real time. A lot of new stuff happening, so good progress, heartbeat solid across the board. Yeah, it's the fragmentation. It's got to work on cleaning that up. What do you hope the ecosystem is in the next one to three years? What do you hope it will get to? And at what point do you want to see hit very fast?
I want to see it get to a point where people are less focused on the platform where people are more focused on, you know, I want to do analytics, I want to do data science, I want to do ingesting of all my data. And I know this is the place to do all those things. And I know that whatever tools I like on there, they all work well together. That's my kind of dream vision, I guess, is that people become less and less focused on the tech, more and more focused on what tools they want to use and that those tools interoperate. Because I think the real promise of Hadoop isn't that we're bigger or faster or anything, it's really that we can bring that data all together and do the different things. Because you look back in the 80s, we had databases, we had statistical modeling packages, we had all these things, but you had to move the data in between them all, right? And so I really think this promise of, hey, I can put it in the data lake or whatever, whatever bingo word you want to use this. We say ocean. Okay, the data lake. All right, I've seen lake, ocean, barn. We hate lake, I hate the word data lake, it's my personal thing. Sorry, I won't say it again. Okay, that's okay. On the cube, I won't say it. All right, that's okay, there you go. But you put it all there and you can get to it and your data scientists can run Spark or R or whatever on it, you can run your cubing and reporting solutions off it through Hive, all that's just there. Make it an enabling platform so that new people can discover stuff and new use cases that will emerge of good data. Yeah, and whatever is the next big tool, because there'll be a next big tool that somebody thinks of, make that work on there too. That's the dream, that's the opportunity and certainly within reach. Thanks for sharing your insight. 
Co-founder of Hortonworks here inside the cube extracting the signal from the noise, giving a little vision, also predicting of the future, connecting the dots, giving us an update on ODPI and among other things. Thanks, Alan, for taking the time. This is the cube, we'll be right back with more after this short break.