John Furrier: Back inside theCUBE, we're here with SiliconANGLE.tv. I'm John Furrier, the founder of SiliconANGLE.com. I'm joined by my co-host this week, Jeff Kelly from Wikibon.org, the lead analyst in big data. This is theCUBE, our flagship program: we go out to the events and talk to the smartest people we can find, even the founders of the companies that are making it happen. And we're here with Owen O'Malley, the co-founder of Hortonworks. Ex-Yahoo, big data guy, with the Apache Foundation for many years, one of the main members and contributors on Apache, right?

Owen O'Malley: Exactly.

John Furrier: Computer science degree, master's degree, PhD?

Owen O'Malley: Yes, I got a PhD.

John Furrier: Okay, good. You're a smart guy. So our job is to extract a lot of signal out of your head and share that with the audience. First of all, congratulations on co-founding the company and spinning it out.

Owen O'Malley: Thank you.

John Furrier: It's very exciting.

Owen O'Malley: It's been amazing. We've grown really fast over the last year, and now we've announced general availability of our product, so it's really been exciting to watch.

John Furrier: You guys worked hard and you came out with it. It's hard to get a startup up and running. Granted, a lot of people came over from Yahoo as part of that deal, but it's hard to get an actual business up and running: the mechanics of operations, HR issues, just booting up the structure.

Owen O'Malley: Well, we've grown by a factor of three since we spun out of Yahoo.

John Furrier: In one year.

Owen O'Malley: In only a year.

John Furrier: And you've got your GA release shipping. So let's talk about some of the computer science; I'd like to geek out a little bit. It's a complicated market right now. You've announced your intention to work with VMware on the infrastructure side, Microsoft among others, and some relationship with IBM, at least right now. And on the business side, there's a lot of science involved in the analytics. That's the app side; that's where the bread is going to be buttered. So you've got two theaters exploding.
John Furrier: How do you look at that from a tech perspective, a science perspective? You've got to run like the wind to get the tech up on the infrastructure side to enable all of this.

Owen O'Malley: Exactly. I've been working on Hadoop for six and a half years now, and we spent a lot of time over those years getting the base platform up so that we could run things at scale: huge numbers of machines, huge amounts of data. We can manage petabytes of data at a time. Now we really need to bring it to the next level, where we bring the computation to the data for the analytics specialists, the people who specialize not in systems programming but in analyzing the data and extracting value from it.

John Furrier: So let's talk about a couple of different things. We'll come back to the infrastructure in a second, the whole VMware thing and the tech challenges with virtualization and cloud in general. On the business side, analytics is all the rage. The business analysts, the people without a PhD or a master's or even a CS degree, want the data out, and exporting is a huge issue. How do you deal with that? Say we have an HBase table and we've got to get that data out.

Owen O'Malley: Well, again, the biggest win is if you can move the computation into the cluster, right? Hadoop really specializes in pulling the computation to where the data is already located. So yes, exporting is a big deal. It's very easy to take down systems that aren't built to the same scale as Hadoop. We've had people accidentally reference filers in their MapReduce jobs and take down the filer, just because it can't scale to the same level, and the same with HTTP servers or any service that's not built at that scale. So transferring data out is in fact a huge problem, as is transferring data in.
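Owen's "move the computation into the cluster" point can be sketched in miniature. This is a hypothetical illustration in plain Python, not Hadoop code: each list stands in for the records stored locally on one node, each node computes a small partial aggregate where its data lives, and only those small partials ever cross the network.

```python
from collections import Counter

# Hypothetical shards: each list stands in for the records local to one node.
shards = [
    ["error", "info", "error"],
    ["warn", "error", "info"],
    ["info", "info", "warn"],
]

def local_count(shard):
    # Runs "on" the node that stores the shard; only this small
    # Counter of partial results ever leaves the machine.
    return Counter(shard)

def cluster_count(shards):
    # Reduce step: merge the small per-node partials into a total.
    total = Counter()
    for partial in map(local_count, shards):
        total += partial
    return total

counts = cluster_count(shards)
print(dict(counts))
```

The contrast with the filer-meltdown scenario Owen mentions is that the naive approach ships every raw record to one machine before counting; here the bulky work stays where the data is.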
Owen O'Malley: But what we see is that when organizations start using Hadoop as a service, they end up pulling more and more of their data into the Hadoop cluster, so they have less and less need to push data out. They can run the computations on the cluster, and that works best because that's where all the data is; it's much more efficient than trying to pull it out piece by piece.

Jeff Kelly: So you run analytics, you run queries that boil the data down to a manageable size, and then you pull that out of the cluster. Let's dig into that a bit more: what kind of innovation are you seeing at that level, the analytics level, from some of the vendors here, embedding the analytics inside Hadoop rather than, as you say, pulling the data out? We agree completely that one of the big tenets of big data is to move data as little as possible and bring as much of the processing and compute to where the data is.

Owen O'Malley: Absolutely. There's a lot of innovation, and that's in fact where the ecosystem is just blowing up. You see a lot of people working on the visualization side; Pentaho and Datameer have been pushing the computation into the cluster for a long time and putting nice interfaces on top. Part of the deal with Microsoft, for instance, is exactly that: they want to make an Excel front end for the results coming out of Hadoop. You put your query in at your workstation, in the user interface you're used to, it gets sent out to the Hadoop cluster to do the computation, and the results come back the way you want to see them.

John Furrier: Let me ask you a question. Let's take a step back into your personal life for a second, away from the big data industry. You went to UCLA in the 80s. You and I are about the same age; I graduated around the same time with my CS degree.
John Furrier: Back then we were just starting to see network-based programming. C++ was right around the corner; Sun hit the market with its tools and workstations, all great stuff, right? I want you to talk about what happened between those years and now, from a science perspective. Ontologies, as a random example, or AI: that was all voodoo and high-end academic stuff. What, in your mind, was once academic that's now mainstream with the new data paradigm that's available out there?

Owen O'Malley: The big thing is really just how important those analytics have become to the mainstream. We saw a keynote today from the CTO of Sears saying how important analytics is to Sears, which was traditionally a bricks-and-mortar company, and yet they're driving huge amounts of value out of the data they're able to process. We're seeing that across the board, in a wide range of industries: we've got customers with huge inflows of data, and the more they analyze that data, instead of throwing it away like they did previously, the more they can take advantage of it and monetize it. That makes the company more money, and that's huge.

John Furrier: From a discipline standpoint in computer science, you've got Brown and all these schools with well-known CS departments, obviously going back to the invention of the internet. There have always been two types of CS programs, right? There's the academic high end, pie in the sky, you're going to be a professor; that's Berkeley, basically. And then the more practical, practitioner track: compiler design, database design. Those tracks took shape back in the 80s. What's happening now? What changes in the computer science curriculum really represent things that were once far-fetched?

Owen O'Malley: Well, there's a wide range of them. One of the guys I went to grad school with, for example, founded a course at UC Santa Cruz on game design.
Owen O'Malley: So making video games is now a valid piece of study in a university setting. That was just unheard of back in the day, even though everyone would have signed up for it; sign me up. But we're seeing a lot of the stuff that lived in the AI labs with the very academic researchers go mainstream. Machine learning for most of the 90s was all very academic: you used Prolog, you used technologies that weren't in the mainstream at all. Now, instead, it's becoming critical to the Fortune 500. All those companies want to figure out how to use the data they have, how to do machine learning to take advantage of their information and serve their customers better.

John Furrier: What about streaming engines? I read a tweet yesterday saying streaming engines are all the rage right now. You've essentially got activity streams of data, all kinds of new sources. Geoffrey Moore was saying machine log data is like the new currency. We're living in a streaming market where you've got streaming data, essentially distributed computing. It's a network, right?

Owen O'Malley: Well, you have both timescales, right? You have log processing that you do longer term, but then you need a feedback loop where the results go back to the customer and change how your site interacts with the customer.

John Furrier: Aka real time.

Owen O'Malley: Aka real time, exactly. Hadoop has always had that feedback cycle, and it all depends on how much processing you want to do. There are short-term feedback loops, but there are longer ones where it can take a week or more to process the logs. Based on those logs, you generate models that you can then apply in real time.

John Furrier: We've been talking with Jeff Jonas at IBM. I don't know if you know Jeff; he's one of the scientists there.

Owen O'Malley: I haven't met him.

John Furrier: Great guy. He's been an entrepreneur; he started a company in his car, as he said.
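The split Owen describes, a slow batch loop that builds models from logs and a fast loop that applies them per request, can be sketched as follows. Everything here is hypothetical (the log format, the click-through-rate "model"); the point is only that the expensive work happens offline while the online side is a cheap lookup.

```python
# Batch side: churn through a backlog of (item, was_clicked) log events and
# build a click-through-rate model. This can take as long as it needs to.
def build_model(log_events):
    shown, clicked = {}, {}
    for item, was_clicked in log_events:
        shown[item] = shown.get(item, 0) + 1
        if was_clicked:
            clicked[item] = clicked.get(item, 0) + 1
    return {item: clicked.get(item, 0) / n for item, n in shown.items()}

# Real-time side: applying the model is one dictionary lookup per candidate,
# cheap enough to run on every user request.
def rank(model, candidates):
    return sorted(candidates, key=lambda item: model.get(item, 0.0), reverse=True)

logs = [("a", True), ("a", False), ("b", True), ("b", True), ("c", False)]
model = build_model(logs)            # e.g. regenerated nightly from the logs
print(rank(model, ["a", "b", "c"]))
```

The week-long log crunch Owen mentions corresponds to `build_model`; the "apply in real time" half is `rank`, which never touches the raw logs.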
John Furrier: But we're talking about some of the projects he's doing; he wants to get to a certain millisecond performance time. Hadoop has been fantastic on batch, but it's been near real time at best, right? You run a Hive job, come back in 15 minutes, have lunch. Now, with some of the stuff we're seeing with HBase, it's approaching real time. What level of real-time performance will we get to on Hadoop soon?

Owen O'Malley: Well, you already have millisecond response times off of HBase. Facebook, for example, does all of its messaging traffic on it. Any time I send a message to someone on Facebook, that's really a row in an HBase table, and it's served out of HBase. Those guys have pushed HBase really hard.

John Furrier: So they're getting millisecond performance on that?

Owen O'Malley: Yeah, they are.

John Furrier: So we're there?

Owen O'Malley: Well, Facebook's there. Now, granted, they've got a huge HBase team working on HBase, so it'll take a while for everyone else.

John Furrier: That debunks the myth that it's always 15 minutes, get a cup of coffee, have lunch, and come back.

Owen O'Malley: Maybe for a MapReduce job. For MapReduce, absolutely, it's going to take 15 minutes to run. But with HBase, you can get there in real time, basically.

Jeff Kelly: I want to talk a little bit about the product. First of all, why a product? Why did Hortonworks decide it needed its own distribution? There was some talk when you emerged that maybe you wouldn't go that route. And how does that fit into your overall business model?

Owen O'Malley: What we were really interested in was making Apache Hadoop very easy to consume. We wanted to make Apache Hadoop and the projects around it usable, because it's not just Apache Hadoop, right? There's a whole set of projects, mostly Apache, that you need all working together to make the whole product usable.
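The messages-as-rows design Owen credits to Facebook works because HBase keeps rows sorted by key, so one user's messages sit next to each other and a short range scan answers an inbox read in milliseconds. The toy store below is a stand-in, not the real HBase client API; the part that carries over is the row-key design, recipient id plus timestamp.

```python
from bisect import bisect_left, bisect_right

class ToyRowStore:
    # Stand-in for a sorted key-value store in the spirit of HBase
    # (not its real API): rows are kept ordered by row key.
    def __init__(self):
        self.keys, self.values = [], []

    def put(self, row_key, value):
        i = bisect_left(self.keys, row_key)
        self.keys.insert(i, row_key)
        self.values.insert(i, value)

    def scan(self, prefix):
        # Range scan over one key prefix: the matching rows are adjacent,
        # so no full-table read is needed.
        lo = bisect_left(self.keys, prefix)
        hi = bisect_right(self.keys, prefix + "\xff")
        return self.values[lo:hi]

store = ToyRowStore()
# Row key = recipient id + timestamp, so one user's inbox is contiguous.
store.put("user42#0001", "hi there")
store.put("user99#0001", "unrelated")
store.put("user42#0002", "lunch?")
print(store.scan("user42"))
```

A MapReduce job would read every row to answer the same question; the sorted key layout is what turns an inbox fetch into a millisecond operation.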
Owen O'Malley: And there weren't any releases that were strictly Apache releases, so we needed to go there, because no one else was doing it. We've been responsible for every stable version of Apache Hadoop that's come out of Apache, and we're just continuing that trend, making it easier for corporations other than Yahoo to consume. That's basically why: we wanted to focus on open source. We didn't want to fragment the market further by making a distro that was different from Apache. We just wanted to take the Apache code, make all the projects work together, and make that available to users, very easy to use.

John Furrier: One comment Geoffrey Moore made about crossing the chasm was around domain expertise, and we're in that stage of use cases and domain expertise. We were at the HBase conference, and we called HBase the tailored suit: you take your use case and tailor it, but don't try to put it on someone else, because it may not fit, right? It'll work great, high performance, but to get to the bigger market you've got to make it more general. So here's my question, knowing your background in CS: data is about semantics, right? The semantic web, Tim Berners-Lee's project. I'd like to get your personal perspective on the semantic web, because that vision is very search and website specific. But with the social web and the quote-unquote social exhaust, as Todd from Continuuity comments on all the time, you have new data types: machine data, people data, application data, all funneling in from the edge. What's your vision for how that's going to change the semantic web, and what kind of thinking do we need to get our heads around?

Owen O'Malley: The important part of analyzing all that data is getting it into a consistent set of formats, right?
Owen O'Malley: You need to get it into systems that let you process it efficiently on a large number of computers, because traditionally the semantic web processors ran on small numbers of nodes. Pushing the scale out will let us do really exciting analysis and push the boundaries much further.

John Furrier: A question about formats, then. We've been talking this week about data sets; obviously Avro is a hot project that Doug Cutting is on, and Google has protocol buffers. What's your take on protocol buffers versus Avro?

Owen O'Malley: They're both very exciting. In the Hadoop project, we've started using protocol buffers for our RPC.

John Furrier: But one's an Apache project and one's not, right?

Owen O'Malley: That's right.

John Furrier: You can say it, come on. You can be aggressive: it's ten times better.

Owen O'Malley: No, the different projects are good for different contexts. Google runs protocol buffers directly; they manage it. On the one hand, that means it's very stable and very well documented, and you know exactly what you're going to get. On the other hand, if you need to make a change, it's very unlikely to go in, because Google depends on it and controls it absolutely. Avro is an Apache project, so it's open to everyone. It's not a single company dictating what goes in; it's a community effort.

John Furrier: But the effort behind Avro is aimed more at data sets, right? As a programmer, is it more elegant from a coding standpoint?

Owen O'Malley: For RPC, protocol buffers tend to work better. In Hadoop we've gone to protocol buffers for our RPC to get version compatibility, so you can have backward and forward compatibility. But for storage, Avro is a much tighter format: you move more of the metadata into the header of the file, so you can have smaller records. It's much more efficient, and the community is much more diverse; it's not just one company, it's a wide range of companies working on it.

John Furrier: Owen O'Malley, co-founder of Hortonworks; we're getting the hook here.
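Owen's storage argument, metadata written once in the file header so individual records stay small, can be illustrated with a toy comparison. This is not the actual Avro wire format; it just contrasts repeating field names inside every record with naming them once up front.

```python
import json

records = [{"user": "a", "clicks": i} for i in range(1000)]

# Per-record metadata: every record repeats the field names "user" and "clicks".
per_record = "\n".join(json.dumps(r) for r in records)

# Header-style layout: field names are written once, then bare value rows,
# loosely analogous to a schema stored in a file header.
fields = ["user", "clicks"]
header_style = json.dumps(fields) + "\n" + "\n".join(
    json.dumps([r[f] for f in fields]) for r in records
)

print(len(per_record), len(header_style))
```

The gap grows with record count, which is the "tighter format" effect Owen describes for large stored datasets; for RPC, where messages travel alone, that header amortization buys less.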
John Furrier: As a final comment, share with the folks out there: what's it been like to be an entrepreneur leaving Yahoo, leaving the mothership, and doing something on your own with your cohorts? And what is Hortonworks all about these days?

Owen O'Malley: It's really exciting. First of all, we're growing very fast; as I said, we've tripled in size over the last year, and hiring in the Hadoop space is very challenging. As I'm sure you've heard, almost every company you talk to out here on the floor says "we're hiring," and it's true. So it's just very exciting. I'm really jazzed that our release is out there and that we're shipping to the public starting tomorrow. It's been a wild ride. It's been great.

John Furrier: Product hitting the market. Owen, thanks for coming into theCUBE. We appreciate it, and congratulations. We'll see you next time.

Owen O'Malley: Thank you.

John Furrier: We'll be right back with our next guest after this short break.