Okay, we're back live at Hortonworks Hadoop Summit 2012. This is theCUBE, siliconangle.com's event coverage. We go out to the events, talk to the smartest people we can find, and extract the signal from the noise. And this show is about Hadoop. This is about the center of big data. It's a technical, engineering-focused show, geek developers meeting business use cases. Not so much a developer conference. I'm John Furrier with siliconangle.com. I'm joined by Jeff Kelly, my co-host for today. And Todd Lipcon from Cloudera is on theCUBE again. An alumni, he's been on a number of times — five, six times we've had Cloudera folks on. Just want to say to the folks out there, Cloudera was really the first company to commercialize Hadoop and be funded by venture capital. I think almost four years ago now, about four years ago. Yeah, a little less than four years. Coming up on four years. Really the first pioneers in Hadoop, and clearly the market leader today. Hortonworks, the number two, coming up fast on Cloudera. A little bit different approaches, but Todd, you guys are working well with those guys. There was a big brouhaha, even a year ago, when Hortonworks spun out of Yahoo — a lot of conversations, silly conversations around a Cold War, who's contributing the most. But for the most part, it's been really positive, kind of a peace treaty, because you guys all know each other, right? And so the community's growing, a real robust community in Apache. Everyone kind of sees the big picture. There's plenty of beachhead for everybody, and the beaches all have coconut trees with a lot of fruit on them.
So I want you to share with the audience your view of this conference, because we've been covering Strata, we're covering Hadoop World, which you guys used to run. Now O'Reilly's going to pick that up and take it off your shoulders, which is good for your company, because you're not in the event business. But you've seen the community grow from a kernel to now exploding, where these shows are selling out. Yeah, I think this is my fourth summit. My first summit, four years ago, we were over in Sunnyvale. I think it was 300 people, maybe 320 people, and now today we're 2,100, 2,200 people. Yeah, an amazing number of tracks, a lot of interest, new entrants into the market, into the ecosystem, and the community, right? The coding community. So what's your vibe? Tell the folks out there, what's happening at this show? Why is this show different than the other ones? So I think as the event has gotten bigger, the amount of business content in the show has definitely expanded. The first Hadoop Summit I went to was basically Hadoop developers talking about their recent projects, new features they'd added to Hadoop, in Hadoop 0.18 or something like that. Now there are a lot more companies talking about what they're actually doing with Hadoop, new higher-level frameworks people are building on Hadoop, and new stuff where they integrate with Hadoop. That's sort of the major shift in focus I've noticed. Certainly there are still some talks about new features from the core committers, but of the six tracks, only one of them is really like that, and the other five are mostly data science and vendors talking about their products. Some segmentation going on, right? Yeah, certainly, yeah. So let's talk about — because you're an alpha geek, as I say, but also a tech athlete — you've been on theCUBE a bunch of times, you've done a lot of work on HBase, well documented, we've talked about that, we'll get to that in a second.
But HDFS is your new sweet spot. We talked at HBaseCon about this, that's on YouTube. A lot of discussion around the advancements, security — so I'd like to ask you specifically: what new features and developments are going on in HDFS that are helping you guys get faster than near real time, and all the market demand? Yes, I actually gave a talk about this at HBaseCon, so if you were out there, look on the HBaseCon website — there are slides and video from that. The talk was, basically, what has HDFS done for HBase lately? So HBase is basically our real-time play: real-time key-value access, random access, very low latency. And we've added a lot to HDFS in the last year for HBase in particular. The two key areas of focus: one is performance. Because you need this low-latency fast access, we really improved the random read performance of HDFS in particular, and really reduced the CPU overhead for a lot of operations. That's gotten between a 2x and 4x measured performance improvement for HBase over the last couple of years running on HDFS. And then high availability is the other really big one. It used to be that HDFS had a single point of failure in the NameNode, the master node. If that node crashed, then the whole thing would kind of be down until you restarted it, perhaps on a different node. You didn't lose any data, but if you're running a real-time website or something off this system, that's obviously downtime you didn't want to take. How about robustness of HDFS? What areas are going on there? Well, we've actually always been very robust in the sense that we never lost data. There are very, very few documented cases of losing data on HDFS. All the ones I'm aware of have been either long-since-fixed bugs from like 2006, or complete administrator error, where they accidentally delete everything on the node and then delete the backup, or they never took a backup, right?
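The single point of failure he describes can be made concrete with a toy model. This is illustrative Python only — not HDFS code, and every name in it is hypothetical: the metadata lives on one master, the blocks live on the datanodes, so a master crash costs availability, not data.

```python
# Toy sketch (NOT HDFS code): one metadata master, many block stores.
# Losing the master makes the cluster unavailable, but the data itself
# survives on the datanodes and is readable once a master comes back.

class DataNode:
    def __init__(self):
        self.blocks = {}          # block_id -> bytes

class NameNode:
    def __init__(self):
        self.namespace = {}       # path -> list of (block_id, datanode)

class Cluster:
    def __init__(self, n_datanodes=3):
        self.namenode = NameNode()
        self.datanodes = [DataNode() for _ in range(n_datanodes)]

    def write(self, path, data):
        # Store the block on a datanode; record metadata on the master.
        dn = self.datanodes[hash(path) % len(self.datanodes)]
        block_id = "blk_%d" % len(self.namenode.namespace)
        dn.blocks[block_id] = data
        self.namenode.namespace[path] = [(block_id, dn)]

    def read(self, path):
        if self.namenode is None:
            raise RuntimeError("NameNode down: cluster unavailable")
        return b"".join(dn.blocks[b] for b, dn in self.namenode.namespace[path])

cluster = Cluster()
cluster.write("/logs/day1", b"hello")
assert cluster.read("/logs/day1") == b"hello"

saved_metadata = dict(cluster.namenode.namespace)
cluster.namenode = None           # the single point of failure crashes

try:
    cluster.read("/logs/day1")    # availability is lost...
except RuntimeError:
    pass

cluster.namenode = NameNode()     # "restart it on a different node"
cluster.namenode.namespace = saved_metadata
assert cluster.read("/logs/day1") == b"hello"   # ...but no data was lost
```

The HA work he goes on to describe removes the restart gap by keeping a second master warm, rather than changing anything about where the blocks live.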
So can you talk about security in HDFS? And also, we just heard from David, who knows data. He's putting the analytic engines in HDFS versus pulling the data out. That's a big concern. So security first, just a quick one, but then talk about the analytics side, because that's what everyone wants. HBase is showing great use cases as a store, but getting data out of HBase is what everyone wants to do. Yeah, so on the security front, HDFS since about a year and a half ago has had Kerberos-based security. It integrates with Active Directory and all the enterprise-friendly stuff. Cloudera Manager helps you set that up really easily and integrate with the enterprise systems. So that gives you basically access control to the data on HDFS. Security in HBase, since it's another layer on top, is newer. That came in CDH4, which we released just last week in GA. So that allows you to, within a table, specify different columns in that table and allow different users or groups access to those columns. You know, distinct read or write access. So that's sort of the security thing. It's a newer feature, now available. In CDH4? In CDH4; it's in the upstream HBase 0.92 release. All of the CDH components, we build off the Apache open source. CDH itself is 100% Apache 2 licensed as well. There's no secret sauce in our HBase. It's the community's work, and we contribute to it quite a lot. Yeah. Analytics engine? On the analytics side, I wouldn't say there's any new major feature on the open source side in HBase for analytics yet. There's a new feature called co-processors, which allows you to essentially build extra code into the HBase servers themselves. You can think of it a little bit like stored procedures in a traditional database. So I don't think your average user is going to go out and write a co-processor. Is that open source? Is that modifiable by...
So this co-processor framework is open source, and then you have to write — it's basically like plug-ins you can write. And once you write one of these plug-ins, it sort of hooks into the core of HBase and can do wildly varying things. It's just Java code, so you can do anything. So I don't think users are going to go out and write their own co-processors tomorrow, but we're seeing companies like WibiData, for example, and Continuuity, who are building their software on top of HBase. And they want to be able to really hook into the core and do useful analytic things or machine learning stuff inside the database itself. And I think they'll find this new co-processors feature pretty useful for that. We've seen interest from them. Also, I anticipate users will be using it through another vendor or through another library which is using these lower-level components. It's kind of an abstraction, so they're not exposed to that complexity. It'll be kind of under the covers. So you mentioned the high availability. I wanted to go back to that for a second, because we hear a lot about the single point of failure, and it's one of the key... We used to hear a lot about it. It's one of the key issues. Right, so can we put that to bed? Explain a little bit about HA and how that works, and how you approach that issue. And again, is it a subject we won't be talking about anymore? I hope we're done with it. I'm tired of talking about it. This is the last time I'll ask you... Great, great. Yeah, so we built a basically active-passive standby system. So there are two NameNodes. One of them is the active NameNode; it handles all the requests going through the namespace. The reason why we have a single master like that is it really simplifies the system, both for operations and for the complexity of the code base, rather than trying to distribute it. So it's been very, very robust. There have been very few bugs that caused any kind of issues.
And then we have the standby master, which is essentially keeping completely up to date with the whole namespace, talking to the active master. And if the active master crashes, the standby can promote itself to become active within a matter of seconds. So there's no data unavailability. Any clients who are accessing the cluster will seamlessly fail over to the new active master. And then of course you can repair your node — you can replace the RAM, whatever went wrong on it, replace the power supply — bring it back up and fail back at your leisure, whatever you need to do. So yeah, talk about just overall — I mean, that's a key component of being enterprise-grade or enterprise-ready. Where is Hadoop on that spectrum? Is it safe for the enterprises, as they say? I mean, we have about 50% of the Fortune 50 using CDH in production, or not necessarily production, but in some sort of serious project. So it definitely is enterprise-ready, in essence. Like, the Fortune 50 — how much more enterprise can you get? Right, right, absolutely. So you mentioned security. So, I mean, what are some of the other major issues we have to tackle from a community perspective? I think ease of use is still an issue. Originally, Hadoop was very much for the Java programmers out there who could sit down and write a bunch of Java code, write a MapReduce job, and understand, for one, how do I take my algorithm and make it MapReduce? Which is not a trivial thing to do for most people. And then, a couple of years after MapReduce, we saw Hive and Pig come out. And they made it easier for your average data scientist — sort of Python hacker kind of programmer — to put together SQL queries or Pig scripts: much easier-to-consume, easier-to-write languages. And that really expanded the boundaries a lot. And I think the next wave, that we've seen in the last year and continuing this year, is more vendors building on top of these layers. So you have a nice point-and-click interface.
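The active/passive pair he walks through can be sketched as a toy state machine. This is hypothetical Python for illustration only — the real HDFS HA implementation involves shared edit logs, fencing, and failover controllers, none of which appear here. The point is just the mechanism: every edit is mirrored to the standby, so promotion loses nothing.

```python
# Toy sketch of an active/passive master pair (NOT the HDFS implementation).
# Edits go to the active and are mirrored to the standby, so the standby
# can promote itself in seconds with no loss of namespace state.

class NameNodeReplica:
    def __init__(self, name):
        self.name = name
        self.state = "standby"
        self.namespace = {}       # path -> list of block ids

    def apply(self, path, blocks):
        self.namespace[path] = blocks

class HAPair:
    def __init__(self):
        self.active = NameNodeReplica("nn1")
        self.active.state = "active"
        self.standby = NameNodeReplica("nn2")

    def write(self, path, blocks):
        # Mirror every edit so the standby stays "completely up to date".
        self.active.apply(path, blocks)
        self.standby.apply(path, blocks)

    def failover(self):
        # The active crashes; the standby promotes itself.
        self.active, self.standby = self.standby, None
        self.active.state = "active"

pair = HAPair()
pair.write("/t1", ["blk_1"])
pair.failover()
assert pair.active.name == "nn2" and pair.active.state == "active"
assert pair.active.namespace["/t1"] == ["blk_1"]   # nothing was lost
```

Clients in this sketch would simply retry against whichever replica reports itself active, which mirrors the seamless client failover described above.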
So like Talend or Tableau or Pentaho — all of these different business intelligence vendors and data visualization vendors are now integrating as another layer on top. So that expands even out to the business users, people who never wrote a line of code in their life, but they can get at all the data that's stored in Hadoop. Let's talk about the application and the analytics side for a minute. Because again, when you go outside the little bubble of the kernel, the alpha geeks like yourself who are doing all the hardcore stuff in the community — everyone else on the business side wants the data out. They want analytics. They want real time, near real time. What can you share with the folks out there — observations or anecdotes or examples — about ways to do that? And let's take HBase in particular. So HBase has become, as I said on Twitter last night, part of the holy trinity of big data: HBase, HDFS, and MapReduce. Those three things working together. And we talked about this at HBaseCon. This is a really nice configuration, kind of working together. What advice do you give the people who say, hey, I love this, I'm going to deploy with it. Great batch processing. And we'll roll with the community on innovation. We're good for now. But what I need right now is analytics. What do they do? Is there a roadmap? Is there a playbook? I think one of the mistakes I've seen some companies make is they go tool-up instead of use-case-down. So they say, oh, this HBase thing is really cool. Or, I've heard Storm is really great, and I want to use that. And they don't think about what problem they're actually attempting to solve. I think people have much better success with Hadoop if they start with: here's a problem I have. Here's some data I'm collecting. I'd like to do more with it. And then talk to some experts in the community. Talk to Cloudera, do some reading.
Take some training courses and understand, OK, well, my problem — even though I think it's real time, actually a five-minute batch is good enough to be real time, and it'll actually be 10 times cheaper for me. And make those sorts of distinctions, with the right engineering trade-offs, once you understand the problem. So take me through that. They're going to call Cloudera — say, we have a problem, call Cloudera. They call you guys up, and you guys don't answer? I mean, so we're not a consulting shop. We don't do full beginning-to-end project consulting, like some people do. On the other hand, for a new customer, if we think there's a great Hadoop use case, we do have some people who can go on site who are Hadoop experts, and they've solved use cases like this before. I think our usual technique is we go in and we talk to the people proposing the use cases, and we say, what are the 10 problems you think might be applicable? And we look at all of them and say, well, let's start here. This is one that we've seen be successful before. We'll start with that, get you on the right path, find one really killer use case, and then expand to the 10 or 20 use cases you might have later. We've seen enough of them that we can pick out the good ones. We were calling HBase the tailored suit — tailored for the use case, it fits great, but you try to do something else and it doesn't fit very well. That being said, what would you say are the pros and cons of HBase right now? I mean, obviously it's booming, it's growing, so there's a lot of validation, but from your perspective, pros and cons right now with HBase? So I think the pro is also its con. HBase is — it's not really a database, it's very, very close to the actual underlying disks and metal of the machine. So the con of that is you have to understand: when I put data, how is that happening? How does the write-ahead log work? How does that actually get recovered if there's a crash?
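The write-ahead-log questions he's raising have a classic shape that a few lines of toy Python can sketch. This is purely illustrative — HBase's actual WAL and recovery path are far more involved — but the core discipline is: log the edit durably first, apply it to memory second, and recover after a crash by replaying the log.

```python
# Toy write-ahead log sketch (illustrative only; not HBase's HLog).
# Every put is appended to a durable log before it touches the
# in-memory store, so a crash is recovered by replaying the log.
import json
import os
import tempfile

class ToyStore:
    def __init__(self, wal_path):
        self.wal_path = wal_path
        self.memstore = {}
        if os.path.exists(wal_path):          # crash recovery: replay
            with open(wal_path) as f:
                for line in f:
                    key, value = json.loads(line)
                    self.memstore[key] = value

    def put(self, key, value):
        with open(self.wal_path, "a") as f:
            f.write(json.dumps([key, value]) + "\n")   # log first...
        self.memstore[key] = value                      # ...apply second

wal = os.path.join(tempfile.mkdtemp(), "wal.jsonl")
s1 = ToyStore(wal)
s1.put("row1", "a")
s1.put("row2", "b")
del s1                         # simulate a crash: all memory state gone

s2 = ToyStore(wal)             # a fresh process replays the WAL
assert s2.memstore == {"row1": "a", "row2": "b"}
```

Understanding exactly this ordering — log before apply — is the kind of "close to the metal" knowledge he's saying HBase asks of its users.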
How does it get laid out on disk, in the indexing scheme? So that sounds awful — you don't really want to have to understand how databases work. That's big-time plumbing. Right, but then the huge pro is that when you actually understand all of this stuff, you can write incredibly efficient applications. It's like assembly language. Very, very — yeah, I wouldn't go quite to the level of assembly language, but... It's like machine code. It's like machine code, or like writing C. And most people don't want to write C; they want to write against a database or something. Maybe they're using C underneath, but they're using higher-level APIs. So that sounds like a big knock against HBase, but it also means that people can write applications on it — people like WibiData, for example, or Continuuity, or shops that have a bigger engineering team, and they say, I need to build a search indexing pipeline, I've got five engineers working on this, and we can take the month it takes to understand HBase in great detail. And then they get extreme power out of it. And they get extreme power — that's the exact point. It's a huge investment in skill, but the upside's massive. The upside is huge. So talk about the innovations that are coming, because obviously that's an opportunity: when you have massive performance increases with HBase, and given how early it is, there's opportunity to abstract away the complexities. So you mentioned co-processors. What else is coming around the corner to make that easier? So I think, again, we're waiting to see the next layer of libraries. So where MapReduce was four years ago — maybe HBase is where MapReduce was three years ago or something. There's some sort of... Early stage. Development, yeah, it's at an early stage. We haven't seen a ton of libraries built on top of it yet. There are just maybe three or four I can think of. I mentioned WibiData as one of them.
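The "library on top" pattern he keeps returning to — a friendlier layer hiding the low-level cell plumbing — can be sketched in toy form. The entity and column names below are invented for illustration; this is not WibiData's API or HBase's, just the layering idea.

```python
# Toy sketch of a higher-level library over a low-level key-value store.
# All names here are hypothetical; the point is the layering, where
# callers deal in entities while the library handles row/column plumbing.

class LowLevelKV:
    """Stand-in for a close-to-the-metal store: flat cells, no schema."""
    def __init__(self):
        self.cells = {}          # (row_key, column) -> value

    def put(self, row, column, value):
        self.cells[(row, column)] = value

    def get(self, row, column):
        return self.cells.get((row, column))

class UserTable:
    """The 'library' layer: a simple entity API mapped onto cells."""
    def __init__(self, kv):
        self.kv = kv

    def save(self, user_id, name, email):
        row = "user:%d" % user_id
        self.kv.put(row, "info:name", name)
        self.kv.put(row, "info:email", email)

    def load(self, user_id):
        row = "user:%d" % user_id
        return {"name": self.kv.get(row, "info:name"),
                "email": self.kv.get(row, "info:email")}

users = UserTable(LowLevelKV())
users.save(42, "Ada", "ada@example.com")
assert users.load(42) == {"name": "Ada", "email": "ada@example.com"}
```

As he notes, you give up some raw power at this layer, but your average developer never has to think about how the cells land on disk.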
They basically built a higher-level API where you can describe your business entities and the kind of schema on top of HBase. It maps them to these lower-level constructs and gives you a REST API to put and get data, and then simpler jobs. And then you use JSON as well? Yeah, it's all JSON. It's all very friendly-feeling to your average developer. You probably give up some of the power. Is WibiData open-sourcing those libraries? No, it's not open source, unfortunately. So, proprietary. Yeah. But I think we'll probably see equivalent things in open source — maybe not equivalent, but the same kind of idea. They'll come out eventually. Don't forget VDP Finder out there, too. Another startup, it's got the libraries. That's our startup, by the way. Right. Yeah, there are a couple of open source things on GitHub. Like Sam Pullara, whom you might know, used to be at Cloudera Labs as well — built a thing called HAvroBase, which stores Avro objects on top of HBase. So there are sort of fledgling libraries that people throw around. Did Sam open source that? Yeah, that's fun. He's at Twitter now, right? Yeah, yeah. Okay, cool. All right, so next up for you — what's on your agenda? HDFS is good, but what's on your roadmap? Share with the folks out there what you're working on, and any kind of coordinates to get a hold of you to collaborate on code. Yeah, so right now my major project is working on the next update of CDH4. So we finished the HA feature; we did the automatic failover, as I mentioned earlier. Currently we have a dependency on an enterprise filer for that, which most of the enterprise customers we have are completely okay with. I think 95% of enterprise software does HA with a dependency on enterprise hardware like that. But some of our customers, especially startups who are running on AWS or other cloud services — they don't have filers, they don't want to buy a filer.
So the project I'm working on now is essentially a distributed system that takes the place of that filer and can redundantly store the state on multiple machines using a Paxos protocol. So, as we discussed at HBaseCon, SiliconANGLE spun out a new company called VDP Finder, where we're doing some stuff on HBase, and you met our data scientist. It's challenging right now to work on top of HBase. What we've found is, it's great, but we need some more horsepower. So what is HBase doing around the corner for performance? Where do you see HBase going at the next level? If it was where MapReduce was a couple of years ago, what do you see happening with HBase? So on performance, we actually have a new release that recently came out, 0.94, which had some pretty big performance improvements over 0.92, the previous release. So we'll be pulling that into CDH very soon, I think. And then, I think it'll never be the same as a relational database, where some people think about performance as: I can do a complex query and it'll use indexes and stuff to find the right rows. HBase doesn't have indexes currently — maybe we'll get them eventually, but it's not really our sweet spot. We're more about the raw horsepower, so you do have to code a little closer to the metal to get that. Okay, so what about MongoDB and some of these other databases? Where are they kind of shaking out in the ecosystem map in terms of functionality? Are they finding a home? I think basically the contrast between MongoDB and HBase, if you want to make one distinguishing difference, is that MongoDB optimizes for features and is a pretty crappy database — like, its internals are dismal — whereas HBase is a pretty good database but has very few features. So depending on what your scale is: if you have very little scale and you want something you can get going super fast, Mongo is probably way better than HBase.
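The filer-replacement project he describes rests on majority quorums. A toy sketch can show why a write acknowledged by a majority of journal replicas survives any single failure — this is illustrative Python only, skipping the leader election, epoch numbering, and recovery that a real Paxos-style journal requires.

```python
# Toy majority-quorum replication sketch (NOT real Paxos or HDFS's
# quorum journal). An edit is durable once a majority of journal
# replicas have stored it, so any single machine can die safely.

class Journal:
    def __init__(self):
        self.entries = []
        self.up = True

def replicate(journals, entry):
    """Append the entry everywhere possible; require a majority of acks."""
    acks = 0
    for j in journals:
        if j.up:
            j.entries.append(entry)
            acks += 1
    if acks <= len(journals) // 2:
        raise RuntimeError("no majority: write not durable")
    return acks

journals = [Journal() for _ in range(3)]
replicate(journals, "edit-1")
journals[0].up = False            # one journal machine dies
replicate(journals, "edit-2")     # still succeeds: 2 of 3 is a majority

# Every edit survives on at least one live replica:
surviving = [j for j in journals if j.up]
assert any("edit-1" in j.entries for j in surviving)
assert any("edit-2" in j.entries for j in surviving)
```

The same majority-overlap argument is why such a system can tolerate the loss of any minority of its machines without losing the NameNode's edit history.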
But if you want to run on 100 nodes, there's not a single Mongo cluster in the world that big, right? And there are upwards of 50 clusters I'm aware of that are over 100 nodes on HBase. And I just want to give a hat tip to Cloudera, because like I mentioned, we run VDP Finder, our big data project we've been running for our media businesses, now going to be spun out as a separate company. We use Cloudera Manager, and it literally saved us six months of development time. So thanks a lot for that. And that's the free version. I'll pass the thanks on to the other team. Yeah, I want the upgrade. Tell Mike Olson I want the upgrade. But I want to ask you about — that being said, that's a real advantage for you guys, and for the WibiDatas of the world, to have these libraries to accelerate the development process. So what has been the reaction to CDH4 with your customer base? You guys are talking to federal clients, financial, all the top verticals. What's been the feedback on CDH4? So we definitely had people trying out the early betas. Performance-wise it's been great. We've had good feedback that it's significantly faster than the CDH3 versions they were using. We're competitive with other distributions out there that claim to be way faster — we're actually neck and neck now, which is good. In terms of the feature set, HA has been widely anticipated, so people are very happy about that. And the NameNode stuff. Yeah, the HA NameNode. So those are the two big ones — and MapReduce2 was the big one? Yeah, MapReduce2 is still early. I consider it kind of alpha/beta quality at this point. So we made the decision a few months back that we would ship MapReduce2 and give people the option of running that, but we were also continuing to ship the prior version of MapReduce, the stable version, MapReduce1.
Because many people wanted to upgrade to take advantage of the new NameNode and the performance improvements, but they didn't want to chance moving entirely to the new MapReduce framework. So I think that is the way forward. MapReduce2 is the future. But it's the future, not the now and the yesterday. So some of our customers are more conservative there. Todd's the LeBron James of HDFS. He's a tech athlete. Thanks for coming on theCUBE again. I don't know if you're Durant or LeBron — certainly not Kevin Garnett, he's going to be leaving the Celtics. Maybe Muggsy Bogues, that kind of story. Final question. Share with the audience how they get involved — for the developers out there who want to code, who want to get involved, who want to be a contributor in the community. How do they get involved? Most developers are kind of shy. Share with them how to get involved and what to do. I think the best way to get involved is to find an itch to scratch. So essentially, if you're using Hadoop already and you've seen an error message that you don't think is clear, or you think the documentation is vague in some area — just pick something really simple, even stupid, first, that won't have any detailed design problems, right? The first thing you do shouldn't be trying to add a major new feature. Just learn how the process works, get the mechanics down. And once you've done that, then you can go from there really quickly to fixing big bugs or adding new features, et cetera. It's all in the open source. We do all of our development co-developed with the community. All the design discussions are out in the open, et cetera. Todd Lipcon here inside theCUBE, sharing his knowledge. He's great, here on theCUBE. Brown alum. Yep. So put a plug in for Brown, the computer science program. Great to have you on theCUBE. We'll be right back with Arun Murthy, the co-founder of Hortonworks. Next, another tech athlete. We'll be right back after this short break. Thanks.