This is in the house. Doug Cutting in the house. All right. Doug, how you doing? He's been through surgery a few times. The original Hadoop. I don't know if he's been through a lot of surgery. He was getting a share of abuse before I rescued him. We'll make room for it. The toilet paper canisters. The socket are moving up right there. Perfect. It looks close to the mic. We'll take it a little snooze.
Welcome back to theCUBE. Doug, you're a CUBE alum. You were at the Cloudera offices. You were on theCUBE. I don't think we were live that day, but now you're live. Welcome. You're inside theCUBE, where we want to extract some knowledge out of your head and share it with the folks out there. Hadoop movement. It's officially a movement.
This is the CUBE?
This is called the CUBE. Virtual CUBE. You can't see behind those cameras, but it's actually a virtual CUBE.
I'm in the CUBE.
And we're broadcasting out live to the world, open source. So just to share the story, there's some folks out there watching who are interested in Hadoop, new to Hadoop, and obviously there's alpha geeks out there as well, but describe what Hadoop is and how it all started.
Well, it started a long time ago. I was working on a project called Nutch, which was sort of an ambitious open source project, trying to build the open source equivalent of what Google and Yahoo and Bing have, which is a web search engine that crawls and indexes the entire web. And it was pretty clear you couldn't do that on one machine, that it's lots and lots of data.
A little bottleneck there.
At the time we started, we figured it was a billion pages, maybe a couple of billion. And then by the time we got very far along, it was tens to 100 billion pages of content that you needed to be able to collect. And you needed to refresh it pretty regularly, every week or at least every month or so, because things changed. You need to go ahead and check to see if they've changed.
So we figured out a way to do this on a bunch of computers, sort of got the algorithms straight. And we were able to run it on four machines and do 100 million pages. But it took one guy full-time there, copying files and shuffling things around between these four machines that were running in parallel. And around this time, Google published some papers about how they were doing this. This was just a couple of us working part-time. And they had this paper on the Google File System they published first, and we were like, oh, that would be handy. That would get rid of a lot of the copying. Let's do a date on that. And so we started looking into how we'd implement something like that in Nutch. Then a year or so later, they published a paper about this MapReduce system, which was a way to actually process the data that was stored in this distributed system. And we're like, that's exactly what we need. It was the same family of algorithms that we were already using, but directly supported by the framework and all automated. If one disk was getting full, some machine's crashing, stuff like that, all the stuff that we had to do by hand was done automatically for you. You just say go, and then you go have some coffee or go to sleep and come back. And you're done.
Nice gift. That's a nice design.
So we set about implementing it. This was mostly a guy named Mike Cafarella and I. And we worked on it for at least a year, a couple of years maybe. And we got to the point where it was running on 20 machines and it would run along. And it wasn't perfect, but it was a heck of a lot better than doing it all by hand. And it became clear that we weren't gonna get to running on thousands of machines without a lot of work. It's a tricky kind of programming to do, to build something that will do this reliably and run all day and not die and not need any babysitting. So around that time, Yahoo came and talked to me and Yahoo said, we need something like this.
Amar was there at the time.
It was Eric Baldeschwieler, who's the manager of the group there, who was the person I talked to. Also, the person I ended up working for was a guy named Raymie Stata, who's now the CTO, I believe, of Yahoo. And they were interested in this kind of technology. And they had a team of people they wanted to apply to it. And I was like, well, that's what we need.
So, that's a gift too.
Yeah, and so I was like, okay, I'll come work with you because we need to do this. But they weren't interested at Yahoo in a lot of the parts that were search-specific, because they already had all that. And they didn't want to replace any of that. They just wanted the stuff that did the distributed file system and the MapReduce part. So we split that up into a new project and called it Hadoop. And I'd had the name sitting in my pocket there waiting for the next open source project that I was gonna do, since my son had coined it to name this guy.
I said, ooh, that is the original Hadoop right here.
Yeah, when my son named his elephant that, I thought, well, that would be a good project name. It's a short word. It's easy to pronounce and it doesn't mean anything.
Now, how did he come up with the name?
It just sort of popped into his head.
This was just a little imaginary friend.
As far as I know, yeah.
Kind of looks like I had to.
He's never been able to explain it. He's 10 now, but he was only about two at the time that he coined the name.
He was two at the time.
Yeah, I think around two.
Precocious. Is he writing code at that age, too?
Not then, he does it a little now. Went to a programming summer camp this last summer.
Awesome.
Enjoyed that. I don't know. I don't know. He's very happy about it.
You're right, good deal.
But, so I also like that there was an obvious mascot for the name.
Gotta have a mascot. The Yellow Elephant. That's perfect. We need that. We need the logo. We need the mascot.
So we split that technology, the distributed file system and the MapReduce, out of Nutch, and started this new project, a new Apache project called Hadoop. And set going on that and brought on more developers. Owen O'Malley was one of the first people. Tom White. Owen O'Malley's at Yahoo. Tom White, who was independent at the time, now is at Cloudera. A lot of people joined. Arun Murthy, who's at Yahoo, is here today. More and more people were joining.
Hello. Hello. So congratulations. I mean, I'm like the proud papa here, but really in reality this movement is going mainstream and it's exciting. I mean, how do you feel?
It's a little unbelievable. It's nothing that I would have ever predicted. I mean, from my point of view, I was interested in building web search and the technologies you needed to support that. But I think there's kind of a moral there, in that I started with this one application in mind and saw this as a general-purpose tool to help me build that kind of application. And I think that's the way a lot of people get started using Hadoop: they have this one problem where they have a huge amount of data and it's a critical problem. They say, okay, we're going to invest, we're going to get this big thing. We've got this general-purpose technology which will make our life a lot easier. And they get it in there and then they start finding all these other problems. Oh, you know, since we have all the data there, I wonder if we could load this other data set in.
It's rolling organically within the big companies.
Yes, yes, and they start solving all kinds of problems that they hadn't realized they were even really having.
Yeah, I hope you take the survey next year. Did you take the survey last year?
Did I take the survey?
No, no, at Hadoop World '09, did you guys do the survey or no? Did you see the survey you guys did?
Oh, it's fantastic.
Sorry. No, Mike mentioned it, but I'm interested in how fast it's going to grow because it's exploding, right?
It's definitely exploding.
The average, I think it was 115 terabytes was the average instance.
Yeah.
All right, I mean, five years from now, what's that going to look like?
I mean, a lot of it is driven by hardware economies. Hardware has gotten so cheap. And in some ways it's about time that the software caught up to where you could really use all that hardware. You use the hardware you can afford to buy to effectively process your data. If you can afford to buy 1,000 processors, wouldn't it be nice if you could run them all together on all your data? And before, that was pretty hard to do, to really use all that power.
The CEO of Cloudera, Mike Olson, was saying on stage, he mentions the word proprietary vendors. I was talking about the guys in the marketplace. We talked about the concept that your data will always live, because no matter what happens, here's the source code. I've heard you talking in our office, your office in SiliconANGLE's office, which is in Palo Alto, with our director, Michael Strong, right, privately about just some experiences you've had where you've written some code and the company has gone under and your code dies.
Yeah, no, no, I worked at Excite, the company I was probably talking about, which went bankrupt in around 2000. And I have no idea what happened to that software. There were a lot of smart people at Excite, a lot of new software written and poof, as far as I know, it's disappeared into something.
You got paid for it, but the reality is code is like a baby, right? You've done a lot of good work and now your work, your art, has come. It's like almost a heist of stuff, right?
And the other thing is that because so many people get to use open source software, it gets to be a huge success. I mean, I think there's been lots of software written that's not open source that's every bit as good as the open source projects that are out there, but not everybody gets to use them.
And so they don't get this groundswell of developers, groundswell of users, and they don't have these explosions. So I mean, I think open source is a real secret sauce to success here.
Open source is also maturing too. I mean, we were just talking to Mike and me now, we're all kind of the same age and we've seen, you know, gen one, and there's a lot of religion in open source, like, you know, and you're for profit, but now it's becoming acceptable, right? I'd say it's third generation open source that's growing. And Apache's been around for a long time, it's been very successful. It's changing. So how's the community changing? I mean, there's pretty much a mindset that commercialization is okay, there's some proof points of use cases where it's worked out, and there's some where it hasn't. How is the community evolving to the new open source, what I saw balanced between the benefit of a rising tide for the right reasons? Because there's always the right reasons and the wrong reasons. It's always good and bad, but how is the community evolving? Because there's more opportunities.
Apache historically has had, and continues to have, a very liberal idea of open source, which is that the Apache license lets people pretty much do what they want with the software. We're not trying to tell people what they can and can't do with software.
Not just free of charge, but freedom.
Well, people use that word in different ways. So I'm going to.
It's an implicit contract, right? In a way, there's like an implicit.
Well, we don't require that people give things back if they change it. We understand people might want to change things.
I'm not gonna get that.
And that's fine. The important thing is that you have a community that's collaborating. And it's in their interests, I think, usually, to give changes they make back so that the community can help maintain it. But if they have something they want to do and they want to sell it, that's fine.
One area that we run into.
They have to support that. The community won't support a non.
Yeah, should be right. The one area where we sometimes run into conflict is trademarks, where we don't want there to be confusion. You know, what is Hadoop? You want it to be one thing. You don't want there to be a lot of confusion. And so the Apache Software Foundation says, you know, we own the name Hadoop and we get to control what you can call Hadoop and have rules around that. But anybody could take it and call it something else and do whatever they want with it. The code itself, you know, we're not proprietary about it. So that's sort of the line.
So you don't want to fork the code base.
Well, you can fork it.
But you can't. But you won't have to fork the mindset.
It would be confusing to have two forks called the same thing. You could even fork it within Apache, as long as, you know, one of the forks would have to change its name.
Yeah, call it Apache. Then you might have some of it. It's a sales job too at that point. I mean, forking is a nice little checks and balances. If there's momentum, it's kind of a vote. Vote for your code.
I mean, if the community has become stagnated somehow, then we want people to have the freedom to go and fork it and say, you know, we're going to do something new and call it Blee Blop. And it's better than Hadoop and it's compatible and so on.
So Doug, how do you spend your time these days? What's your focus?
These days I'm mostly working on a project called Avro, when I manage to sit down and write code, which is not as much as I'd like. It's trying to establish a standard data representation for, for example, data files that people can interchange, read and write from different programming languages, that's fairly rich.
I mean, today the real sort of de facto standard data format is, you know, CSV, comma-separated values, or tab-separated values and things like that, which are pretty poor. So we're trying to build a layer that works well for the wide spectrum of Hadoop applications and the Hadoop family of applications.
Which ultimately leads to greater adoption.
So I spend a fair amount of time on that. I spend some time working with the Apache Software Foundation. I'm on the board of directors. I'm actually the chair of the board of directors this year. And so that takes a little time. You know, get ready for the meetings and try to put the agenda together. And, you know, sort of, I'm giving back in my own way. And then I spend time talking to people about Hadoop and about other things. So I split my time among those. But I still try to spend at least a third or, ideally, half of my time coding. Yeah, which I'm not ready to give up yet.
Well, we want to recruit you for our open source project. A storage backend for all this HD video we're trying to archive. We really need a good architect to help us out here. We need a good architect. So thanks for volunteering. Talk to Mike. We really appreciate your time donating to SiliconANGLE TV's backend. We'll talk to you about that later. Doug Cutting, the founder of Hadoop, and we're excited to have him here because now he's a great guy. Really believes in open source, gives back. Creator of Hadoop, recruited a team, commercialized it, now part of the Cloudera team and built this entire universe, this movement with his friends. And this has now evolved into a full-on movement. Congratulations on all your success and your collaboration.
Thank you very much.
And thanks for coming in session with you.
Oh yeah, thanks for having me.
Thanks, Doug.
Is that okay if I take the guy with me here?
Sure, sure. Thanks for bringing him. Doug Cutting. Thank you.