management and high availability facilities that are key to managing large data sets in general, and the performance, well, it speaks for itself. We get the data out, we get the answers back faster, much faster. Only MapR extends Hadoop beyond batch to support real-time applications and streaming data. So I see MapR as the perfect Hadoop distribution for the enterprise. The MapR enterprise-grade Hadoop distribution is now available as an Amazon cloud service. Customers can run MapR exclusively in the cloud or augment existing on-premise deployments with cloud clusters for additional capacity or disaster recovery needs. When it comes to big data, MapR is the answer. I'm John Furrier, the founder of SiliconANGLE.com. I'm joined by my co-host for this segment, Jeff Kelly, the lead analyst at Wikibon.org on big data, the best big data analyst on the planet. Obviously Dave Vellante can't be here, Jeff, so you're going to step in for the spot. I'm playing Dave today. I'm super excited to be here because, one, I love this ecosystem of Hadoop, and it's just a lot of the friendly faces that we've seen over the years. And our next guest, Doug Cutting, is one of those friendly faces from my time when I was sitting in the Cloudera office, where Doug would come in a couple of times a week. Doug, welcome back to theCUBE. John, good to be here. You've been on many times. The founder and inventor of Hadoop, you're becoming known as a celebrity. I knew you when. You're very humble. I know you take it with grace and you're very considerate. A lot of folks want your autograph as the ecosystem grows up, but you've been a big part of this whole citizenship model of open source. We've talked many times in interviews about this. I wanted to get your perspective on the future of Hadoop. You've been involved from the beginning. You're in the community. You're at Cloudera.
We just got to see you at Hortonworks. Very friendly collaboration, which is great to see. It didn't go cold war; it went collaborative. What's going on? I mean, what's your view right now of Hadoop as it is, and where is it going? I mean, we're seeing tremendous growth. We're seeing industry after industry start to realize that this is a way they can improve their businesses. They have data that's passing through their hands that they can benefit from if they could get a handle on it, if they could save it and analyze it effectively, and Hadoop can help them with that, can provide them the tools. So it's pretty exciting to see that, and the predictions, the projections are huge. I think it's our job as members of the community, the open source software community as well as vendors, to fulfill that promise. Can you talk about some of the dynamics going on right now? Obviously, the environment has changed. All the usual suspects are around the table. We talked to Todd at the HBase conference, Todd Lipcon; he's a contributor. A lot of people still there. Good folks. No bad politics going on that's worthy of reporting. But what's going on in the dynamics as the ecosystem is growing? More people want to be involved. What are some of the dynamics in the Apache community right now? I think there's so much to be done, but the only way to really do it effectively is to collaborate. You could think about trying to compete, trying to get a larger piece of the pie, but I think everybody's really rightly focused on growing the pie and not trying to steal chunks from your neighbor, not looking back but looking forward. There's plenty of beachhead to camp out on. I mean, there's a big range of beachhead. Yep. And that's just a more productive thing. What areas do you see? Listen to what our customers want and try to make sure that we're making them happy, and not look to competitors. What areas do you see right now as the future of Hadoop is evolving?
We just talked with Rob about some of the challenges. Infrastructure is hard to do. A lot of cloud. You've got solid state changing the equation on the economics of storage, latency, batch to real time, near real time moving very rapidly, high availability, a ton of infrastructure stuff, virtualization, you name it; there's a laundry list of things to do, as you mentioned. And then you've got the business benefits of analytics. Okay, tremendous business value on the app side. Where's the action right now? And just tell the folks out there who are jumping into the ecosystem, where do they pick up a weapon or a shovel and get started digging? I mean, I think it's really across the board. In the basic platform, we're adding really needed features; the high availability stuff makes the real-time nature of HBase, as an online store, useful, if you can rely on it being up 24/7. And now you can with the current releases. So from those fundamental core layers, there's still a lot of fit and finish work at the outside: making it really easy to incorporate new data sets, to visualize results, to deploy and monitor these clusters. All these things need a lot of work. I mean, it's a young technology still, and it's getting more mature. It's a lot more mature than it was a couple years ago, the first time we talked. But it's still got a ways to go. And, you know, we're starting to see verticals. So actually, you're a laid-back guy. We talked in the past; we know you're out in wine country these days a lot, and you've got a great life out there. But you work hard. You're a hard worker. I've been living up north for the entire duration of this project, so it's not like I'm retired. No, I know, but it's a good life. It's a great place. It's great. I just live in my hometown where I grew up.
And I enjoy working from home; it works for me. And so you actually work for a leader, Cloudera, which, you know, was at the time the first commercial company. They've got a huge lead, and I talked to Mike Olson, I was talking with Ping Li at Accel. Cloudera has just done such great work, and they're so ahead of the competition in terms of talking to customers there. I mean, the employees are now over 200, maybe 250; it's growing like crazy. And business is good. But they're out there talking to customers. How much time do you spend talking with customers out there? You know, Cloudera is actively engaging with a lot of the federal, financial, and all the big verticals. How much time do you spend with customers? And what are you hearing? You know, that's a decent component of my time, spent out in the field talking to folks. And what I hear is that they're loving this stuff, mostly. They want to learn more. Some of them are at fairly advanced stages. More of them are just getting started, getting their toes wet. They're doing experiments and they like what they see. And the sort of problems we hear, the kinds of areas we're addressing, are integration with existing systems, addressing security concerns, addressing reliability concerns, and really nailing all those. That's what we're focusing on. Do you think Hadoop will be a primary data warehouse in many organizations in the future? In the future, yes. I mean, I think it's becoming a primary repository for bulk data of all sorts, and a home for analysis of that data, and over time becoming an online database where that data can be searched and accessed from. I mean, I think we're seeing better and better tools for that sort of online access, and I think we'll see that as an ongoing direction. Right now you're doing Hive queries that can take minutes or longer to run.
And obviously, we'd like to address that as an ecosystem. We had a great time at the HBase conference that Cloudera put on; HBaseCon was great. All the alpha geeks were there talking about HBase, and Facebook gave a great presentation about HBase in production. And last night on the plane coming back from New York, I tweeted: HDFS, MapReduce, and HBase is the Holy Trinity of big data. Okay, a little religious twist from the Catholic that I am, bad Catholic I should say. Do you agree that those three are really a nice combination? Because HBase is evolving very rapidly as a database of choice on the unstructured side within that Holy Trinity, as I call it. And for the folks out there who are trying to grok between HBase, Mongo, all the different approaches on the unstructured database side: why is HBase such a nice balance between HDFS and MapReduce? I mean, there are a number of distinctions between HBase and other, quote unquote, NoSQL data stores, but I think the key advantage of HBase is just this degree of integration with the rest of the Hadoop ecosystem. That it co-resides in HDFS allows for lots of opportunities for better interaction with MapReduce, better interaction with HDFS, the same security model. A lot of these things just work out more easily, and it's more seamless. And I think that's what makes it a success in this ecosystem. I mean, there's nothing wrong with the other solutions out there. They're just less well integrated, so they're going to have a harder time living alongside HDFS and MapReduce. So that's something I look for when we're trying to figure out what the next major components to join the ecosystem will be: how well can they integrate with what's there already? Because you want to make things seamless. You want to make moving from one tool to another as easy as possible. You don't want to have to be importing and exporting your data.
You want to be able to access it natively from one tool to another. So that's the direction I think we really ought to be pushing the ecosystem. You know, so the previous question was about whether you see Hadoop evolving into the data warehouse of the future, where it'll live as the main repository for data inside an organization. So what is your take on integrating Hadoop within existing environments? Is it a situation where, you know, we see a lot of connectors being built, and every database vendor kind of has a connector now to Hadoop; is that a viable long-term data management strategy? And do you see Hadoop kind of subsuming the relational data world and incorporating structured data a little bit more into the Hadoop platform? Yes. I mean, I think connectors are definitely a short-term strategy. They're a great way to go. I think there's a degree to which they're a long-term strategy; there are certain applications which are going to be around for a long time, which aren't going to live in Hadoop. But I think there are also probably a number of things which aren't great fits in the current infrastructure they're in, where Hadoop would provide either a more economical or more scalable solution. It may or may not today, but we'll see migration of applications wholesale over. That takes time, though. I mean, you've got a lot of investment in using a particular piece of infrastructure before moving over to another one. But companies do revisit things. Things do evolve. So I think over time we'll start to see more things get replaced with Hadoop-based solutions. What are some examples of those applications? You mentioned some that are not maybe a great fit and some that are. What are some of the applications or use cases you see gradually making their way to the Hadoop ecosystem?
Today I think we're really focused on the fact that there are enough new applications, things which just really don't work in existing tools, that between the connectors and moving these new workloads onto this new technology, we've got our hands full. Longer term, we read about Google moving all of its advertising data into a big-data-style database that's able to actually handle all their transactions across a distributed global database, and I think that's an exciting direction. The Hadoop ecosystem is a ways from that, and that's not our immediate concern. There are enough applications today that are new, that just don't work with existing tools, to keep us busy for a while. So I do see that as a long-term direction for lots of things. It's hard to identify particular ones. But short term, we're not focused on that as a community. So I've got to ask you, we've talked about Avro before. What's the future of that project? Talk about it for the folks, introduce why it was created. And then I'd like to ask you how it differs from protocol buffers. Sure. There's been some conversation comparing the two, and which one's more cumbersome. And what's your path forward? So Avro is a serialization format. Sounds pretty sexy, doesn't it? That's... we're tech athletes. You know, run the 50-yard dash. So what it is, it's a format for data interchange. You've got different applications and different systems: you've got Pig, you've got Hive, you've got HBase, all these different components, and you want to share your data across them. And data can be fairly complex. Classic relational database data is rows and columns with types on those. We tend to see more complicated structures than that, with nested data structures. So you need to have a way to interchange it. Protocol buffers is a solution from Google for something like this.
And Avro is a different solution that has a lot in common with it. It's an Apache project. An Apache project, just to be clear, yeah. And so protocol buffers has a couple of deficiencies that Avro tries to remedy. For one thing, there isn't a standard file format to contain protocol buffers that includes a description of the data in it. Avro has a self-contained file format where the data is completely described. So you can write data from a Java program, read it from a C program, an entirely different application, and make total sense of what was there. So that's one layer. The other thing that's different is that the way Avro is implemented permits you to generate new data sets on the fly. So let's say you've got a scripting language and you're composing some sort of query. So it's transport speed. One of the things you're trying to... It's dynamic. The schemas can be generated dynamically, and you can write data sets in new formats easily. It's really written this way to support interactive construction of data sets and reading of data sets. The idea is sort of browsing data sets and saying, oh, I want to run a query over this one, and being able to immediately do that. Not even within seconds; within milliseconds, ideally, you should be able to start interacting with a data set, because it should be self-describing. Whereas the protocol buffer approach is, you need to generate code. First, you need to find the description of the data set. Then you need to generate code to read it in whatever language. Then you need to compile that code, link that code in, build your program, start it. There's a lot of steps. It's not really this on-the-fly, just look at the data and start using it. And then moreover, if you want to combine a couple of data sets and generate a new one, you have the ability to do that on the fly. I mean, we're geeking out here with... Sorry. With the... No, this is good. I don't mind.
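To make the self-describing idea concrete, here's a toy pure-Python sketch. It is not Avro's real binary object container format; the container layout and function names here are hypothetical, standing in for the idea Doug describes: the schema travels with the data, so a reader needs no pregenerated, compiled code.

```python
import io
import json

def write_container(buf, schema, records):
    # Embed the schema as the first line so any reader can interpret
    # the records that follow with no code generation (hypothetical toy
    # format, not Avro's actual binary object container).
    buf.write(json.dumps(schema) + "\n")
    for rec in records:
        buf.write(json.dumps(rec) + "\n")

def read_container(buf):
    # Read the embedded schema first, then the records it describes.
    buf.seek(0)
    schema = json.loads(buf.readline())
    records = [json.loads(line) for line in buf]
    return schema, records

# A schema composed on the fly: no generate/compile/link/build steps,
# in contrast to the protocol buffers workflow described above.
schema = {"type": "record", "name": "Click",
          "fields": [{"name": "user", "type": "string"},
                     {"name": "count", "type": "int"}]}

buf = io.StringIO()
write_container(buf, schema, [{"user": "doug", "count": 3},
                              {"user": "jeff", "count": 5}])
loaded_schema, recs = read_container(buf)
```

The reader recovers both the schema and the records from the file alone, which is the property that lets a browsing tool start querying a data set within milliseconds.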
I mean, but for the audience, let's talk about what this means. Why is Avro important? I wanted to bring it up because, one, we're having a conversation in our communities within Wikibon and SiliconANGLE around some of the hardcore development. Share with the folks why Avro is important. I mean, I think what we want to do is power the sort of spreadsheets of the future. Spreadsheets provide this power to people who aren't programmers to do some kind of data analysis, to do some numerical computation. It's a power tool, but it's a power tool that just about anybody can use. And now we've got this big data application, and we want people to be able to dynamically flick things around and analyze them. And so we need a data format that's really designed to support that. Hadoop comes out of this sort of batch computing methodology, and it's becoming more and more dynamic. So the traditional formats... There is a big focus on real time. Right. And so the traditional formats we've used in Hadoop work really well when you're doing batch things, but not so well when you're trying to do interactive things. And so it's really trying to focus on that: being very dynamic, but also giving you this interoperability. Hadoop is originally a very Java-focused system. I think long term we need to embrace other programming languages, so we need to have a language-independent format. So it's trying to attack all of those. So tell us, tell the folks out there, Doug, now that you're a big-time celebrity and getting bigger every day, and you're tall too as well: what are you working on right now, primarily, in terms of your focus, and what are you excited about right now? You know, I've got three things that I tend to spend my time on. I'm the chairman of the Apache Software Foundation.
So Cloudera donates my time, roughly a third of my time, to volunteering at Apache and trying to keep things running smoothly there as best I can. I do a lot of work as a spokesman for Cloudera and for Apache, so I spend time out on the road talking with folks. You know, and if you spend a day on the road, there are days on either side preparing for and recovering from that. So that's a big time sink. And then I'm still working on code. Still, you know, still hacking. Which code are you really focused on now? So I focus on Avro, trying to keep that project going and responsive, and I do a lot of reviewing and incorporating contributions from others as that community grows, as well as developing things there. Recently I've been working on trying to build a good column format for Avro so that you can query Avro-format data much more quickly. Having a column format should make orders of magnitude difference in query speed for a lot of cases. Beyond that, we'll see where it goes. I try to keep my head pretty low in the development community. There are different ways that open source communities can be run. Linux is very much run on the benevolent dictator model, and that's not something that I want to do or be. Versus the Apache collaborative... Yeah, yeah, in a social good sense. Yeah, so I don't want to be the leader of the Hadoop project, and so I've mostly drifted away from day-to-day development there. And long term, I'm hoping to be able to do that with Avro: that it will develop enough community around it that it will become a standalone independent thing supported by a deep community. So, yeah, I don't like to toot my trumpet too loudly, and I want Hadoop to be something that's independent of me, very much. So, I wonder, you know, Wikibon just put out a report around kind of the enterprise readiness of Hadoop.
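The column-format point is worth unpacking. Here is a minimal sketch, with hypothetical data and function names, of why pivoting row-oriented records into per-field columns speeds up single-field queries: the query scans one contiguous list rather than deserializing every whole record.

```python
from collections import defaultdict

def to_columns(records):
    # Pivot row-oriented records into one list per field. A query that
    # touches a single field then reads only that field's values,
    # skipping all the others entirely.
    cols = defaultdict(list)
    for rec in records:
        for field, value in rec.items():
            cols[field].append(value)
    return dict(cols)

# Hypothetical sample records.
rows = [{"user": "a", "bytes": 10, "url": "/x"},
        {"user": "b", "bytes": 20, "url": "/y"},
        {"user": "a", "bytes": 30, "url": "/z"}]

cols = to_columns(rows)
total_bytes = sum(cols["bytes"])   # touches only the "bytes" column
```

On disk, a real columnar format adds compression and skipping on top of this layout, which is where the orders-of-magnitude query speedups come from.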
So, if you could, what are maybe the one or two key areas of improvement that you've seen over the last, I don't know, six months to a year around making the system ready: uptime, security, ease of use? What are the key barriers? What are the key things you're hearing from customers, that, hey, we need Hadoop to tick these boxes before we're comfortable deploying mission-critical applications and workflows on it? What are some of those key issues, and what have Cloudera, and also the community at large, been doing to address those? Well, I mean, Cloudera's been working in lots of areas, contributing to lots of projects, building commercial products to help folks run Hadoop in production and make that really seamless and smooth and easy. In the community at large, I think probably the largest single thing is the high availability in HDFS. HDFS is the most central component of the Hadoop ecosystem. In a lot of ways, the degree to which something interoperates with HDFS is the degree to which it's a member of the ecosystem, I'd argue. So having automatic failover, there being no single point of failure, is a big advance. It really changes the whole nature of the ecosystem, because HDFS is the centerpiece, and that's a pretty fundamental feature that enables all sorts of online applications as opposed to more batch-style ones. So I'd say that's the biggest single advance we've seen recently. Beyond that, I think it's just the breadth of tools growing out there to allow integration with more and more applications, to work with more and more kinds of workloads, you know, machine learning, more SQL-type queries, you name it, so that when someone comes along and says, how do I do this, there is a ready answer: oh, somebody's done that before, here are the tools they used. And getting that know-how out there.
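The no-single-point-of-failure idea can be sketched in a few lines. This is a conceptual toy, not the real HDFS HA protocol or client API; the class and function names are hypothetical. A client configured with more than one metadata server simply fails over to the next one when the active one is down.

```python
class NameNodeStub:
    # Hypothetical stand-in for a metadata server, for illustration only.
    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy

    def resolve(self, path):
        if not self.healthy:
            raise ConnectionError(self.name + " is down")
        return "blocks-for:" + path

def resolve_with_failover(namenodes, path):
    # Try each configured node in turn, so losing one node does not
    # take the whole service down for the client.
    last_err = None
    for node in namenodes:
        try:
            return node.resolve(path)
        except ConnectionError as err:
            last_err = err
    raise last_err

active = NameNodeStub("nn1", healthy=False)   # simulate a failed active node
standby = NameNodeStub("nn2")
located = resolve_with_failover([active, standby], "/data/events")
```

The real system also has to keep the standby's metadata in sync and fence off the failed node, which is the hard part this sketch leaves out.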
Exit question: this next year. We'll probably see you at Hadoop World in New York, but between now and that event, what's your key goal, and how do you see the Hadoop ecosystem in your preferred future? What is the Hadoop ecosystem going to look like? I mean, I just see it really trying to fulfill this promise that's out there. People have these great expectations, and so we need to meet them. We need to meet the customers, the users, find out what their problems are, how this isn't working for them, and make that happen. You know, we've got the Hadoop 2.0-based CDH4 out in the field; Cloudera released that last week, and I think over the next six months we'll see widespread adoption of that in production, and that's very exciting. You see HBase exploding? HBase is going to continue to explode. I think the 2.0 stuff really helps HBase a lot. There's a whole lot of performance work that went into HDFS and MapReduce that we'll see the benefit of. Yeah, HBase is just incredible. It's taken off, and we'll see more of that. It's fun to watch, too. Yeah. Well, thanks for all your help on theCUBE. You've been a great citizen. You've been great to come on. We love having you on. We knew you back in the day, and also Cloudera has been a great supporter of my mission at SiliconANGLE, and Mike Olson and Amr have enabled that, and you guys have been very good on that. So I want to thank you for that. Doug Cutting on theCUBE. We'll be right back with more news after this short break.