 Prepare for the extraction point. We've been briefed on all the important stories and events in the world of emerging information. Now it's time to extract the data and turn it into action. Live from the SiliconANGLE Studios in the heart of Silicon Valley, this is Extraction Point with John Furrier. Welcome to Extraction Point. I'm John Furrier, the founder of SiliconANGLE.com and SiliconANGLE.tv. My guest today on the Extraction Point is Doug Cutting, the founder of Hadoop. We're here at the SiliconANGLE Studios at Cloudera, in the heart of Hadoop, the company that commercializes Hadoop. Doug Cutting is going to talk about Hadoop, the founding of Hadoop, and where it's going. Just a general update on the community. Doug, welcome to Extraction Point with John Furrier. Thanks very much, John. It's great to be here. We sit next to each other at the same table when you're in town. You work for Cloudera as an employee, and you're also the founder of Hadoop, one of the hottest trends in tech right now. You also live in the vineyards in Napa Valley. I live up north. I try not to get too specific about it. I don't consider myself the founder. I'm a founder of the project. I was there when we were starting it out. It's really a collaborative effort. Lots of people involved. I was one guy there. I think I'm identified with the project because I was there from the start. I had the privilege of naming it after my kid's stuffed elephant, which gives me a certain amount of... You're a figurehead for the whole project, and there are other founders. Let's go through and talk about the early days of Hadoop. I'm not sure you were a co-founder of Hadoop, but one of the principals behind it. There were a lot of contributors. Talk about the early days of Hadoop for the folks out there and what Hadoop has become. Take us through the evolution of Hadoop, Hadoop being the software that is powering this big data trend and the unstructured data trend.
Take us through the origins of Hadoop. What was going on at the time? I got started on this when I was working on a project called Nutch. We started Nutch in 2002, 2003, something like that, trying to build an open source web search engine. What Google and Bing and folks like that have, but all open source. It was an ambitious effort because those are major works of software that take a lot of work to maintain. We were like, what the heck? Why not do one in open source? We knew from the start that in order to build something that big... at the time, I think people were saying there was 1 billion web pages. Now we say 20 or 100 billion or more, maybe a few hundred billion. The numbers keep going up. Still, a lot of data. More than you could store on a single hard drive. More than you can really store on a single server, much less process on a single server. We knew it had to be a distributed solution from the outset. I also knew from my work on Lucene doing full-text search that processing bulk data at that kind of rate wasn't necessarily the forte of relational databases. In fact, in the early days of Nutch, we tried using relational kinds of technologies, using B-trees and whatnot to keep track of the web pages that we were grabbing. They just couldn't keep up. So the current tech was not there for you? It wasn't there. Basically, relational databases are great: structured data, a lot of tables, but a lot of overhead. We just couldn't get the performance we needed. So if you're crawling the web, for example, every page has a large number of outgoing links. For each of those outgoing links, you have to look up and find out whether you've seen that page before, or whether you need to crawl it, or when you last saw it. You need to do a database lookup. At the rate you can pull pages down from the net, you do the math, and you look at the number of database accesses you need to do, and it gets to be really huge.
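The back-of-the-envelope math Cutting alludes to here ("you do the math") can be sketched roughly as follows. The crawl rate, average link count, and seek time below are illustrative assumptions, not figures from the interview:

```python
# Rough, illustrative math for why per-link database lookups overwhelm
# a crawler. All numbers here are assumptions for illustration.

pages_per_second = 1_000        # assumed crawl rate for a crawler fleet
avg_outlinks_per_page = 50      # assumed average outgoing links per page

# Each outlink needs a lookup: have we seen this URL, when, etc.
lookups_per_second = pages_per_second * avg_outlinks_per_page

# A random disk seek on hardware of that era took on the order of
# 10 ms, so one spindle serves roughly 100 random lookups per second.
seeks_per_second_per_disk = 100
disks_needed = lookups_per_second / seeks_per_second_per_disk

print(lookups_per_second)  # 50000 lookups per second
print(disks_needed)        # 500.0 disks just to keep up with seeks
```

Sequential, batch-oriented processing of the whole dataset sidesteps this random-access bottleneck, which is the direction Nutch and later Hadoop took.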
So we needed some alternate way of handling this big pipe of data that we were looking at. What year was Hadoop coming together at this point? Was it 2005, six? So we worked on it for a couple of years, and we had a way of getting everything distributed, and we could run it on maybe four machines, and we sort of worked out these problems, and it would in theory all scale to hundreds or thousands of machines, but in practice it was really hard because there were a lot of manual steps. It wasn't all fully automated. The algorithms and the data structures were designed to be distributed, but having a real distributed system is another step. And about that time, so this would be 2004-ish, maybe 2005, probably somewhere in there, Google published a couple of papers. They published first one about a distributed file system, the Google File System, GFS, and then about a year later they published the MapReduce paper. And you put the two of those together, and they perfectly solved the problems we were having. They used the same algorithms, the same way of distributing that we were already doing in Nutch, but they had a framework which automated all the hard parts and just made it so that you could easily scale without adding more people manually monitoring things and moving things around, which is the way we were doing it with Nutch. And so to me it was like, that's what we need. This is obviously it. So me and another fellow, Mike Cafarella, primarily the two of us at this point, set about implementing those, and it took us a year or two, and we had something that we could run. Did Google contribute code at that time? No, it was simply papers. They published two papers. And they didn't say a lot more about it than that, but the papers were pretty clear about what was going on. So we started doing this, and we got it to a point where we'd run on 20 to 40 machines.
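The programming model from the MapReduce paper can be illustrated with the canonical word-count example. This single-process sketch shows only the model — a map function emitting key-value pairs, a shuffle grouping by key, and a reduce function per key — not the distributed framework that Hadoop actually provides:

```python
from collections import defaultdict

# Minimal single-process sketch of the MapReduce programming model.
# The real framework distributes map tasks across machines, shuffles
# intermediate pairs by key, and runs reduce tasks in parallel,
# handling machine failures along the way.

def map_fn(document):
    # map: emit (word, 1) for every word in the input
    for word in document.split():
        yield word, 1

def reduce_fn(word, counts):
    # reduce: sum the counts for one word
    return word, sum(counts)

def mapreduce(documents):
    # "shuffle": group all mapped values by key
    grouped = defaultdict(list)
    for doc in documents:
        for key, value in map_fn(doc):
            grouped[key].append(value)
    return dict(reduce_fn(k, v) for k, v in grouped.items())

docs = ["the quick brown fox", "the lazy dog", "the fox"]
print(mapreduce(docs))
# {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1}
```

The appeal Cutting describes is that user code stays this simple while the framework takes over the hard distributed-systems work.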
I was working with the Internet Archive, and they had some clusters of 20 to 40 machines that I would run things on. Mike was working at the University of Washington, doing some research for his education. Where's he at now? He's at... Oh boy. Trying to remember the name of it. Is he still at the university? No, he's at a different university. So he's in academia. He's an assistant professor. I'm embarrassed that I'm blanking on the name. That's okay. I just wanted to know if he's in the company. Is he at Google? So did Google just kind of throw the papers out there? Did they get involved? We got to this point, and it worked, kind of, but you started to realize that every point where you do something that's distributed, every time you have one machine touch another, there's an opportunity for things to fail, and you need to successfully handle each of those failures. And it turns out in a system... MapReduce is not horribly complicated, but it's complicated enough, and the file system is designed to be very simple, but it's still complicated enough, because there are lots of little ways things can fail. And making sure that you handle all of those failures correctly is a lot of work. And I began to realize that it was a bigger job than two guys working part-time would ever be able to... It would take us 10 years to get to the point where you could really run it on hundreds or thousands of machines. So Yahoo then approached me and said, we like this. It looks like a great starting point. We have this same problem. We need a platform to do this kind of computing on. And you look like you've got a good start to one here. Would you like to come work with us? And I said, great, as long as they knew it was open source. It was open source at that point? It was open source. It was at Apache at that point. And it was still part of Nutch. So in January of 2006, I started working at Yahoo. We split this part of Nutch out into a new project, which we called Hadoop.
A number of Yahoo engineers got involved. Most of them worked for a guy named Eric Baldeschwieler. Eric 14, as he's known. Owen O'Malley. Arun Murthy. Folks like that, who are still very involved in the project. And they're still at Yahoo. And those three are still at Yahoo. There are a lot of other people who... We've got to get you guys on here on theCUBE. Come into the show here in Palo Alto. We'll get you on. So they had a good, impressive team of engineers that they set onto this. And over the course of the next year, two years, it really did mature. We, you know, together got all these different ways that things could fail and made sure that it did something reasonable. And before you know it, Yahoo was actually using it to process the whole web. You know, to process tens of billions of pages. How fast did it move from coming on board to processing the web at that point? I think it was under two years before it was there. And we also, at the same time, had it... I think Owen worked on this as well as Arun, getting it to compete in this international competition for sorting data. Sounds exciting, doesn't it? Yeah, exactly. The next Olympic sport. It's better than quicksort, right? That's right. All you comp sci students out there. It won. It won. It was the fastest system to be able to sort a terabyte that year. And so then it was really on the map by 2008 as being a technology that could solve problems. It indexed the whole web at that point. By 2008, it was really there, scaling, running on thousands of machines and being able to process the entire web. There were still some other things that were missing. And I think Cloudera got funding. Hammerbacher bolted out and became an EIR at Accel. I think Jeff Hammerbacher, was he still at Facebook at the time? And then you guys got funded. Cloudera was founded, I believe, in 2008 as well, around the same time. Got it. So right when he bolted out. I believe that's the case.
And so then we started to see... Where did Google come into all of this? Because Google puts the papers out there. That's the catalyst for the innovation. You guys were working on Nutch. Google puts out the GFS and then the MapReduce papers. It's basically like, wow, this is like an inspiration. Things started clicking for you guys. And then Yahoo picks up the ball, innovates from there, really gets it stable and growing as a core product. And Cloudera's formed. Does Google step into the equation at all during that time? Google... Google encouraged its use at universities through various efforts. They helped universities teaching courses. They set up clusters for universities to be able to use to teach courses. And so they certainly helped promote it. Were they contributing code at the time? They didn't really... They had one intern who contributed some code, but not much. I think they were concerned... I think there were some legal concerns about having their engineers, who could see everything that they had, contributing to this project, which was close to areas where they held some patents and so on. I don't know the details. But they definitely supported the project. They like having it out there. What I've heard from people at Google, what they really appreciate is that now they can hire people who are already familiar with this family of technologies and this way of thinking. And they don't have to train everybody from scratch. It's really amazing, intoxicating, new computer science. Think about it. So we hand it out to the universities, get them playing with unstructured data and big clusters with Hadoop. It's good for business for them. So what I've heard from Google is... I don't know. Again, I don't speak for Google. So I'm surmising some things here. I can speak for Google. Google wants to get this out in front of everyone because it's good for their business. Good for computer science. Students get trained.
Bill Gates talks about it at Microsoft. We need more computer science guys out there coding away. I don't know how much they hold this kind of technology as something that is a critical advantage, that they have their own implementation. But they also have some practical reasons that people have told me why it's difficult for them to open source things, just the way their software is organized. And what are those reasons? There are a lot of interdependencies in the way they structure things. And they get a lot of benefit from that in their engineering, to have everything... You mean kind of like building an OS? Kind of like... Yeah. It's just that it's hard for them to extract GFS and MapReduce, a specific problem in this case, and give those away without giving away Gmail and web search. Which they don't want to give away, because those are things which they consider proprietary technologies. And it's just technically hard. It would be a big investment and not worth it to them. Whether, even if it were easy, they would, I don't know; that question doesn't come up, because it's not something they can do. It makes sense. They should protect the crown jewels. Outside of search, Gmail and Android are the two hottest products for them. But they've been very encouraging about this work. I just wrote a comment to a blog post about Google, how a lot of ex-Googlers now are getting in and funding and doing startups. So they're taking over Silicon Valley at the startup level, with both funding and execution. Google has a very intrapreneurial mindset within the company. A lot of people are doing entrepreneurial things. Which is a double-edged sword for Google. It creates more chaos. But it provides more energy and innovation. Cool. So Hadoop, great movement. A lot of people involved helping along the way. You had a wingman in Mike. Yahoo came in. Google initiated with the papers. Gets it going. Yahoo picks it up. Boom. You're at Yahoo. Cloudera gets formed.
Now you have Cloudera commercializing aspects of Hadoop. You have the Apache Software Foundation hosting the Hadoop projects, with all that contribution and contributors coming in across the board. So great history. Right. What's going on today? So what's the current situation? We've got lots of people still contributing. Yahoo, Facebook, and lots of other companies are involved in the project today. eBay, Twitter, you name it. There's a suite of other technologies being built up around this kernel of the distributed file system and MapReduce. And on top of that, we're getting all sorts of query engines. It's just a huge family of technologies growing up. And I think what we're seeing is people realizing this is a new way of processing data that lets them do things they couldn't do at all before. They can afford... Like what examples would you say? They can afford to save data that they couldn't afford to save before, that was just prohibitively expensive with conventional enterprise storage solutions. If you want to just reliably be able to save, for example, every transaction... Just in physical media and resources. People, right? Both people and... Yes. I mean, if you wanted to save every transaction that you saw and be able to have it online and ready to process, ready to analyze at all times, and save all the details of those transactions, for a lot of businesses that was just impractical. And then to be able to do... Having low latency, too, is a whole other question, right? Not only storing it cost-wise, but the latency issues of getting it back and processing it. I mean, you can store it on tape, maybe, or something like that, but that's not very practical for... I mean, so what you're saying basically is open source is great, good stuff happening with Hadoop on the open source side, and we'll drill more down into that. But on the market side, we're talking about a real-time environment.
Things like Facebook, Twitter, and mobility are creating all this new data. So in addition to the data that businesses could capture from a legacy standpoint, there's all new data. So the requirements are high. We've got banks and things like that. And banks have a lot of transactional data. They've got ATMs, they've got credit cards, they've got all kinds of financial information which they can track and correlate and analyze, and learn about their customers, learn about markets, learn about credit risks. It's all sorts of different industries. You know, in healthcare you can... There's lots of data that comes in that used to either be discarded or not kept in an easy-to-use online form. Yeah, I mean, a lot of companies... I talked with a CEO, Marco Pacelli of ClickFox, out of New York. What he's done is he's used unstructured data and kind of this data model to integrate all this data, calling data, call center data, web data, process it in real time, and identify business value for his customers, to change processes. Simple things like: this person called in on the phone because they thought they forgot their password, but they really didn't forget their password, yet they're still calling, and they're getting hung up on. So all these customer satisfaction things are amazingly being identified by data that was once stored and locked away on tape. So this ClickFox is doing extremely well, and his customers are saying, we've never had this kind of business engine before. Like a real dashboard of, hey, this is what our customers are doing. And he was telling me that they could never do that before. So that is a pretty solid example of what you're saying, which is these new engines can come in and identify these problems. The other thing that it facilitates is doing ad hoc analyses. So you can afford to save the data, but you can also afford to save it all and run some computation over all of it.
And it may not be hugely fast, but you can do it and get it done, and maybe it takes half an hour or an hour or even overnight. And that's tremendously enabling for people doing research, and you don't always know what you want to do with the data when you gather it. Or you might change your mind. You might say, well, this is what we're doing with the data, but somebody has another idea tomorrow: can we do this? And if you can afford to save it all, and you've got a sort of general purpose engine that can go and process it all, then you can figure out other things to do with it after the fact. And we see people taking advantage of that all the time. That's a huge advantage. I think the old school way of doing things was you'd spend a lot more time upfront designing. You'd sort of say, what is the question we're going to ask at the end of the day for our data? What are the set of questions that we want to be able to answer? And so we've got the data coming in, so let's transform the data before we load it in, and filter it down to just those things which are required to answer these questions. And then we'll index it in just the way that we can quickly answer these questions later. And you build this very specific system for this one set of questions. So the constraints are amazing. You've got structural constraints, syntax. So then if you change the question you want to ask, then, oops, you've got to go back, and if there's something you want to query in what you threw out, you know, you're screwed, if you weren't even able to save the original data. So highly inefficient, which is why this ClickFox example is interesting to me, because what it opens up is exactly your point: these business intelligence or data warehouse systems were purpose-built for a certain set of questions. Right, right, exactly. And a lot of this value is in predictive-type things that are ad hoc. Identifying trends quickly, and with the real-time web, it becomes very interesting.
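The contrast Cutting draws here, between a pipeline purpose-built for a fixed set of questions and keeping the raw data so you can ask new questions later, can be sketched like this. The event data and question are invented for illustration:

```python
# Sketch of the contrast: a pre-aggregated answer serves only the
# question it was designed for; raw events can answer tomorrow's
# question too. All data here is invented for illustration.

raw_events = [
    {"user": "a", "action": "login", "ok": False},
    {"user": "a", "action": "login", "ok": True},
    {"user": "b", "action": "search", "ok": True},
    {"user": "a", "action": "call_support", "ok": True},
]

# Old-school approach: decide the question upfront ("logins per
# user"), keep only the aggregate, discard the raw events.
logins_per_user = {}
for e in raw_events:
    if e["action"] == "login":
        logins_per_user[e["user"]] = logins_per_user.get(e["user"], 0) + 1

# New question tomorrow: "which users called support after a failed
# login?" The aggregate can't answer it; the raw log can.
failed = set()
callers_after_failed_login = set()
for e in raw_events:
    if e["action"] == "login" and not e["ok"]:
        failed.add(e["user"])
    elif e["action"] == "call_support" and e["user"] in failed:
        callers_after_failed_login.add(e["user"])

print(logins_per_user)             # {'a': 2}
print(callers_after_failed_login)  # {'a'}
```

Saving everything makes the second pass possible at all; the general purpose engine makes it affordable.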
We're here at Cloudera in the SiliconANGLE Studios here in Palo Alto, California with Doug Cutting, one of the founders of the Hadoop project. A great inspiration to many folks in the computer science field. Hadoop is one of the most popular emerging technologies, fueling a new innovation and revolution in business value, changing the data warehouse business as we know it and changing, quite frankly, society and benefits to society. Doug, let's talk about computer science for a minute. You've been an inspiration to a lot of folks out there, young and old, around what you've done. I mean, you've taken some hacking, an ambitious goal of building an open-source search engine, and transformed that with a partner, in a classic open-source success story where it got momentum and you had stakeholders like Yahoo and Google contributing. Massive companies that had the same need, and it grows into a big project. A lot of folks, younger folks in particular that I talk to, want to know: how do you do that? You've been around the block. You've done a few things. Hadoop, and you're working on more. What's your advice to the folks out there who want to know, how do I get involved? How can I pull off something like that? I mean, I think working on open source is a huge advantage. I worked writing software at proprietary software companies for a good 15 years before I started doing open-source work, and I don't think the quality of the software, or even how innovative it was, was any different in those contexts, but fewer people got to see it. In some cases I worked for companies that went bankrupt, and the software disappeared into an intellectual property black hole. When you're doing things that everyone can see, then more people can use it, and it lives longer; it gets more exposure. And moreover, people like free. They can try it out. You get sort of skunkworks kinds of projects. People don't go to their manager.
They just download something because they can; it's free. They don't have to go and sign some sales agreement or do something that requires a lot of approval and thought. They can just download something. They can try something. They can build a prototype and evaluate it and see, does this actually solve a need? And a lot of times what you find is it solves 90% of a need, but there's a little thing: if they added this, or they fixed this one thing here, then it would be 100%. And so then they add that, and if they're smart they give it back to the project, so that it will stay there, so it will be there for other people and for them in the future. The other thing I want to get to, another advantage of open source for them, is when they have problems with it, and they try something and it dies in some horrible way, they can easily go and see why it died, because they've got all the code. And if they've got a question about it, there's a bunch of people out there who will answer the question for free. And with proprietary software I think it's a little harder to do that; you usually don't have the source. Someone could quit and leave the job, right? And they may not be around for a lot of reasons. So I think people find it easier just to get started, and then once they're going, to modify it and get it to do what they want. But to me it's been a huge improvement working on open source over proprietary software. What would you say to the folks out there from lessons learned? I mean, you've had an interesting road; you've probably hit your head against some stumbling blocks, and challenges always happen when you have growth. It's like a crying baby, and you've got to fix things, right? So what would you share with folks out there, from where you are now and where you've come from? Just some lessons learned, best practices.
One of the things that I've come to appreciate... I mean, all the projects I work on are at the Apache Software Foundation. I'm currently the chair of the Apache Software Foundation, and I've come to appreciate through the years the value that Apache brings. Apache's worked out a lot of systems and ways of operating open source projects that work well, where you can get people who may have differences to work together and resolve those differences, or decide to split up. And it gives you a legal structure. I think of it as sort of a civil society. It's like a government for software, and I think it's better. They've done a good job over the years. They've created some good products, pretty stable products, a good community, very active. They have a good track record, obviously, from the web server on up, all the way from the web server days. But the reason that there are these high quality projects is because it's bottom up. It's not top down; there hasn't been somebody cracking the whip and saying you will produce great software. Rather, there's a system that enables people to produce good software and encourages that in a bottom-up manner. It's very bottom up. What's going on that gets you excited these days? I see you're very busy. You have a day job at Cloudera, but you're actively working at Apache on the new stuff. What's exciting for you these days? What are the projects you're getting your fingers in and playing with? Well, it's exciting to see new versions of Hadoop roll out, and all the projects that surround it, and all the new things that people are able to do. We're working on getting out another major release of Hadoop, 0.22. Some day maybe we'll have 1.0. I spend most of my time, when I'm writing software these days, working on a project called Avro, trying to establish a standard data format that permits a little more introspection of the data. It's a language-independent format, so that the data can
describe itself and people can process it from lots of different applications. I think there's not quite the right data format out there yet, and I'm hoping Avro can prove to be that. Another related problem we have is that the systems these different components communicate with, the ones we currently use in Hadoop, are fragile. If you change things on one side, then the other side may not be able to talk to it any longer, and we need to fix that. We need to get to the point where you can smoothly upgrade different components independently in these distributed systems, and that's an ongoing project, to figure out how to do that. And then it's going to be a massive effort to work that through all these projects. Let me ask you a question. It's more a philosophical question, so, you know, answer it however you like. What has surprised you the most about what's going on today? Is there anything in particular where you go, wow, that is an absolute, you know, surprise? In a good way, it could be, say, I didn't think it was going to happen, or, well, I didn't think it would blow up and be that big. The success has been, you know... I didn't anticipate that this was going to be more than a technology to support an open source web search engine. That was what I was interested in; that's what we built it for. The fact that it's of general utility, you know, isn't a total surprise, but, you know, I didn't think it would be this much general utility, that it would be used in, you know, insurance companies and banks and, you know, in science and all kinds of things like that. That wasn't something I anticipated. There are a lot of developers out there, and, you know, from time to time we talk about things like what's going on with OpenStack and this, that, and the other thing, and Hadoop, and, you know, all these other proprietary vendor technologies and whatnot. And then you've got people who are decision makers in these big banks and whatnot, but the developer community, it's a little bit more fickle.
People want to know, hey, you know, I want to be like an open source person; people like those areas that are just kind of swept up with momentum. Is there anything that you could point to, for the folks out there who are developers, like, here are some projects that are getting a lot of momentum that are worth getting behind? Because a lot of times developers just want some stability around the community. Is it HBase? Is it, you know, Flume? You know, all these other momentum points. Is there anything out there that you could point to, or worth pointing to? I always think that developers should think more about the application they want to build, and then look at the technologies that are out there, and look at which ones are closest to satisfying it. And part of that is, do they have an active community? If you've got a project which satisfies 80% of it but it's dead, then that's a lot of effort to bring it back to life if there's nobody working on it. If you've got one that only does 70% but it's got a group of people who are working together well, you know, it's got a future. So that does play into it, the popularity and the activity of the community and all that sort of stuff. But mostly I think you want to scratch your own itch. I think the best changes come from people who have a real problem and want to solve it. And then what open source tends to enforce is that you have to think a little bit more generally. I saw a lot more, in my proprietary software days, of people doing short-term fixes for a specific problem, because they need to get it done, they need to stay on a schedule, and they don't care if it's general purpose, because they're in a rush and nobody's going to criticize them, potentially. Yeah. Whereas on an open source project, you've got other people at other companies who don't necessarily share that problem, and so they think about, well, with this change, could I take advantage of it? And even if I can't, how can I
make it so that it doesn't impact me? And so you spend a little more time, and you get a lot more QA, more eyes looking at it from different angles, different perspectives, different uses. So it's not just QA; I mean, it's a particular kind of QA, where people are trying to... I mean, it's because you're trying to build something collaboratively that's shared, you have more perspectives, because there's a diverse set of users who actually have to run it. Okay, we're here with Doug Cutting. Thank you for coming on the Extraction Point. Really appreciate it. Thanks for your insight. I look forward to talking further about Apache and the projects, and getting more folks in here to talk about Hadoop. Appreciate your time. It's fun to be here. I look forward to doing it again. Okay, that's a wrap, Ricky.