We're back inside theCUBE. This is day two of SiliconANGLE's coverage of big data. We've been covering big data since the original Hadoop World, going back three years to when my relationship with Cloudera and Amr was formalized and I got some space over there. I had the great pleasure of getting to know Eli Collins, who's our next guest. I'm joined by Jeff Kelly, my co-host. Eli, I don't know what your title is. You're, like, employee number seven? Ten?

I work on the platform team. I'm one of the leads that builds CDH.

You were in the first video I did with Cloudera, in the back of the office with a little bulb from Ikea.

It might even still be on the website.

We're talking about back at the time of the recruiting video. You guys have now grown massively. You've been there from the beginning. Congratulations.

Thanks.

So we've been talking about Hortonworks and Cloudera in the context of Apache, and clearing that up. This week we've also been talking about analytics, because that's all the rage in the marketplace in terms of business value. The marketplace is really exploding, both on the infrastructure side and on the business side with analytics and the use cases. So the first thing I want to talk to you about is high availability, which is all the rage because the Hortonworks Data Platform is out there. You guys announced CDH4, and you were part of that release. Talk about why high availability is so important, and when you guys started working on that problem in your shipping code, CDH4.

Yeah, we've been working on high availability for over a year now. In terms of why it's important for customers, some businesses just have strict availability requirements for their software. If you want to go into production with a given workload, you have to have a high availability story. So it's part of the basic enterprise feature set that customers expect.
The other thing with high availability is that more and more people are using Hadoop less in batch mode and more in real-time application-serving environments. In those environments you really need high availability, because downtime means your app is down, which means your user-facing site may be down. So high availability is really critical as Hadoop expands into use cases that are much lower latency than what it's been used for in the past. We've been primarily working on HDFS high availability. Since all the pieces of the Hadoop stack depend on HDFS, it's kind of the linchpin of the system, and making sure your data is highly available so that the rest of the stack can always get to it has really been the focus of our effort. So we've been working on upstream Hadoop at Apache to make Hadoop highly available, and that's what shipped in CDH4.

What are the big component areas that are most in demand right now? Is it the workflow engine? Is it MapReduce? Is it HBase? What are you seeing within the framework?

We're seeing really high attach rates. Obviously everyone is using Hadoop itself: HDFS, the core, and MapReduce. We're seeing really high attach rates for all the analytics frameworks, obviously Pig and Hive. HBase has really taken off. You were at HBaseCon and saw how much activity and uptake there was. That's going to be huge. So HBase has been, I think, one of the strongest areas of growth in the platform. I would say 70 to 80% of our users are using Hue, which is an end-user interface project on Hadoop that makes it more accessible. Sqoop and Flume are the two key projects for data ingest in Hadoop: Sqoop for getting your data from a relational database or an EDW into Hadoop, and Flume for getting log data into Hadoop. Both of those have continued to grow quite a bit too.
Hadoop has always been co-located with existing systems, and the more we can connect to existing systems, the bigger the adoption you get with Hadoop, and so it just snowballs.

Why is HBase... I mean, HBase, it's well documented that we were at HBaseCon. Excited to be there. Thanks to you guys for supporting us. I just want to say for the folks out there watching, Cloudera's been a real big underwriter and supporter of our mission to provide independent coverage, and I want to thank them for that. They've always been a big supporter. It's looking at an angle. But HBase, I mean, just insane adoption and excitement. Why? Why is HBase so integrated with HDFS and MapReduce?

Yeah, part of it is that it's a new piece of infrastructure, and there's not a lot of existing technology that solves that problem. If you wanted a high-performance scale-out database, so far people have hacked together a bunch of MySQL databases and hired some really bright people to figure out how to stitch them together. They come up with something that really only works in their environment, that you can't productize and stamp out. So the spread of that type of technology was limited purely by everyone having to go and build a big uber-database by stitching together a lot of other databases. HBase, as a product, solves that problem. Rather than trying to stitch together your own solution, you can just start with a scale-out database. So that's one piece. The other is that it's highly integrated with the Hadoop stack. If you already have a lot of your data in Hadoop and you want to do low-latency use cases with it, HBase is the obvious solution because it's integrated. It already runs on HDFS. It's integrated with MapReduce and the rest of the stack. I think those are the two driving threads.

So what are some examples of the low-latency applications that people can now build by combining HBase with the confidence of high availability?
What use cases specifically are you seeing? What does that open up that's now possible?

Really a ton. HBase is a database, and it's a key-value store, so it's way more low level. You can really think of it more as a storage engine than a database. So for example, we have a SQL interface. We see lots of people writing applications that are being hit by smartphones or web backends, and they're using it as the primary data store for those applications. So you get big users like Facebook Messages, where they've architected all of it on top of HBase. But there are also smaller-scale applications where you're doing fraud detection or storing binary objects, like images of checks, that you want to do retrieval on. People who have smaller pieces of content that HDFS is not well designed for have been using HBase as an alternative. So if I have a bunch of 3-megabyte objects that I want to store and get back, HBase is a great system for that. You can really think of it as a storage engine, so it opens up a lot. Basically anything can run on a storage engine.

Do you think this really opens the floodgates when it comes to building big data applications? Your CEO, Mike Olson, has talked about how we need to see more investment in the application layer. Why do you think we haven't seen as much activity as we had hoped by this point in the evolution of Hadoop? And what's it going to take? Is it what we were just talking about, HBase and high availability? What are some of the keys to kicking off a real flood of big data applications? Because that's where the value ends up coming from in the end.

That's a great question. I think it's a question of use cases. Right now, if you look at people doing advanced analytics or data pipelines, they're really treating Hadoop like a data warehouse or an ETL tool.
And there are way more applications written on Oracle than on Teradata, for example. So that's not a use case where you see thousands and thousands of people writing apps against it. I do believe that is the future of Hadoop, though, and HBase is a great example of that: if you stand up HBase, there really will be thousands of applications written against it. But if you're doing Pig and Hive analytics, you're stopping with the analysts. You're not going to see programs generating Hive queries. You see analysts writing Hive queries, or reporting tools generating them. And so that's just not a use case where you have a huge application developer base. But that's coming. I think HBase is going to be one of the key components that unlocks that. People writing the next generation of applications, who don't want to use traditional data sources and who want access to your big data, will be the people who really make that world blossom. But so far, I don't think that's where the Hadoop technology has been focused, pre-HBase.

Couple of questions. One is, I want you to share with the folks out there, because you've been in this industry a while now. You guys have helped build this. I'm calling it an industry, like the PC revolution created an industry. I think you guys have created a whole other industry, not just an extension of the cloud or the data center. Share the vibe with the folks. You've been through a bunch of these. This show is a technical show. Talk about what it's like here. What are the top conversations you're seeing, and the conversations you like here?

Yeah. So the vibe is really one of a platform shift. I think one of the exciting things about this industry is that every decade or every 15 years there's one of these big platform shifts, and everyone can kind of feel it coming.
You start to feel the ground shake. Three years ago only so many people felt the ground shake, and now you can actually graph Hadoop Summit and Hadoop World attendance as kind of the Richter scale of the ecosystem. And now I think, on that Richter scale, the ground is really shaking. You can really see that it's no longer 5% of the people in this ecosystem who believe that this is going to take over the world of data management. It's pretty much anyone you talk to here. And as more and more people come and get inculcated in this world, you see more people adopt that mindset. So the big thing I see is that the platform shift is happening and it's getting bigger. Now we're past 2,000 people; next year I'm sure we'll be past 3,000. And people are starting to take this stuff for granted, so they're having the more interesting conversations. Of course you should write a whole application ecosystem, and your question's a great example of that. Of course you write a whole application ecosystem against this stuff. Not, "Oh, is this a viable technology? Will the rest of the world use it?" The questions themselves have changed. People are taking it for granted. I think in general they're talking more about crossing the chasm and trying to pedal as fast as they can to do that, which is great.

My other question is about disruption. This is a great environment for disruption. In my lifetime, at least, I've seen open source really become disruptive, but I think there's still a whole other level of disruption coming around open source. You're seeing it here. But I want you to talk about Amazon, a big enabler of cloud. They own the developer community in terms of spinning up web servers and whatnot. They have Elastic MapReduce, and they announced the deal with MapR, right?
We're trying to figure that out, so help me understand what that means. Because with Hadoop, one of the big things, as we all know, is trying to make it easy to configure, right? And Amazon is not that easy to work with, right? I mean, it's easy to work with, I'm just saying, but most people know there's a lot of work involved, cost of ownership and all that stuff. So what does that deal mean, and what about Elastic MapReduce?

Right. So fundamentally Amazon is infrastructure as a service, and for the AWS business it's about selling infrastructure, and the more of the ecosystem that works on your infrastructure, the better, right? We've been running CDH on EC2 for literally over three years now, because as a platform provider you want to work on all the infrastructure out there that people are using, and as an infrastructure provider you want all of the ecosystem to run on your platform. So I think Amazon's going to continue to do that. They have their own database, and they also just announced that HBase is running on EC2, and MapR. So I think you're going to see Amazon try to get really the full suite of everybody. Obviously Oracle, Windows; they're going to try to get all possible things running on AWS. So I see the news here as just further reinforcement of that. I was excited by the HBase one, because they have their own DB, and they have SimpleDB and whatnot, so that was a real validator of HBase for us. There are 40,000 NoSQL data stores, and we really threw a lot of our weight behind HBase because we think that's the one that's going to stick. So it was really interesting to see Amazon validate that as well, and I think they'll continue to do that. I think we'll see more platform technologies.

Just in terms of the cloud as a viable way to deploy and take advantage of big data: we talk a lot about the Fortune 50 looking into big data and running Cloudera. What about the smaller companies, the SMB market? The amount of data you might have or want to work with isn't necessarily related to how big your company is in terms of headcount or revenue. So how do smaller companies leverage this, and is the cloud really the answer to that question?

It helps, yeah. I think there are two interesting themes with the cloud. One is that there's the cloud itself, running this infrastructure as a service, with cloud providers offering their infrastructure at a cost and scale that lets people of pretty much any size consume it. The other is that we're taking a bunch of cloud technology and, in parallel, enabling people who have their own computers or their own data centers. If you look at someone who's running HBase themselves, they're not running it that differently than someone who's running it in the cloud. These technologies were fundamentally designed for the cloud, and so your on-prem is getting more and more cloud-like, and similarly the cloud infrastructure is getting more and more accessible. Both of those, I think, enable SMBs, and anyone at large, to kind of...

OK. Eli Collins of Cloudera, early Cloudera employee. Congratulations on all your success. You guys got CDH4 going. CDH5 is coming around the corner; you were telling me that's your new release that's coming.

Yeah, we're working on it, exactly. We do major releases every year, and then we do quarterly updates. So we still have CDH3, and with CDH4 we'll do quarterly updates to that. Now we're going to start working on CDH5, which will be the next major release.

We love using sports analogies. We call this the NASCAR race. Now Hortonworks has got their car on the track with their data platform. Looks good. They've got HCatalog, some interesting features, all kind of going in the right direction. We talked to Doug Cutting about Avro. Just great development. Congratulations. We'll be right back with our next guest after this short break.