This is Research Computing and Engineering, episode 4. And today we have a special treat. We have Jeff Squyres as a co-host here. You may remember Jeff from the Open MPI project. Jeff, thanks a lot for helping me out here. Hey, no problem. Glad to help. It's interesting stuff. Yeah, no problem. We actually have something interesting today. We have a free implementation of the Google File System and Google MapReduce available from Apache. It's called Hadoop. We have one of the main developers with us today. And this is quite a bit different than anything we've looked at before, Jeff. Yes, it is. It's a little different than traditional HPC. And as you know, I'm on the Open MPI project. And so I kind of do what's, quote unquote, traditional HPC, where people do parallel programming and whatnot. But data mining, and I think some of the things that Christoph is going to talk to us about today, is also a very exciting field and very difficult technologically to do. So I'm very interested to hear what he says and how they work and how the technology works and all this kind of stuff. OK, well, that's cool. Let's go ahead and have Christoph get on the line. Christoph? Hey there. Good to talk to you, Brock, Jeff. No problem. Thanks for coming. No, not at all. First off, Hadoop, is it a file system? Is it an application? Well, actually, Brock, let me ask something before this. Brock, how do you pronounce Hadoop? Is it Hadoop? Hey, Hadoop? Hadoop? You know, what's the right way? It depends on who you talk to. Hadoop is a very common pronunciation. Hadoop is what the project founder calls it. And that's what his son says. And Hadoop is actually named after his son's yellow stuffed elephant. So I would argue that the way that Doug Cutting says it, Hadoop, probably carries more weight than what any of us geeks call it. OK, well, I'll try to remember that throughout here. But don't shoot me if I get it wrong. That also explains the logo. That explains the elephant. It does. It does. OK, cool. So let's jump right in. Hadoop, what is it? Yeah, so it depends on what you look at this for. Hadoop itself is an Apache sub-project. I mean, it's an Apache top-level project. And there are a couple very key sub-projects that are relevant. There's Hadoop Core. And I think that's what people generally mean when they refer to Hadoop by itself. And Hadoop Core is the combination of the Hadoop Distributed File System and the Hadoop MapReduce processing engine. So between these two core components, you get a very large-scale distributed storage system and a very large-scale distributed processing framework. So one of the things that we jokingly refer to Hadoop as is a gaping maw of bits. Meaning you can dump huge volumes of any type of data into the file system and have it store that data reliably. It handles fault tolerance. You don't have to worry about any of this. And then MapReduce is a processing framework that allows you to write relatively simple code to process this data at scale. And the MapReduce system takes care of scheduling each of your jobs, each component of your computation, to run as local to the data as possible. So in a very common case, you're processing the data on the same machine that is physically hosting the data. And you're only using the network for the aggregation and the summaries. OK, now one thing I noticed there in your description. You said you can dump any type of data into this. 
Now is this modeled after a traditional file system? So when you say any type of data, you mean just files? Or are there different ways of storing this? How does this work? What do you mean by that? Yeah, so HDFS is a file system. And the difference between the Hadoop file system and a traditional file system like ext2, ext3, or any others is that Hadoop is optimized for extremely large files. It's optimized for extremely large files that have the characteristic that they tend to be write-once or append-only. And with these types of files, it excels at streaming reads and writes. It's very performant in that regard. What it does a horrible job of is storing lots of files that you want to access randomly. So it's like a file system in the sense that you can store any type of data you want, structured, unstructured, it doesn't care. It's a block file system. What's different from a traditional file system is that it's highly optimized to work with very large files where the primary access mode is streaming reads and streaming writes. So when using this file system, do you actually have to format your entire cluster as HDFS, or does it sit on top of another file system? Yeah, so the way this works in practice is you deploy Hadoop to your cluster. There are two components, basically, to the file system. There's the name node, which manages all the metadata for your file system. You might call that your master. And then there are an arbitrary number of data nodes. For a small cluster, this might be five or six; for a large cluster, this might be 500 or 600 or 2,000. And what these data nodes do is they just store blocks. So the formatting, so to speak: when you first bring a cluster up, you do do something called formatting, which is initializing the metadata on the master to a sort of clean, known state. What the master then does primarily is keep a mapping of files to their blocks and the data nodes that hold them. So Christoph, what you just described there sounds to me like, and I'm kind of a newbie, I'm not a file system guy, it sounds like a parallel file system where you're spreading the data across multiple servers. So you can have clients talking directly to multiple servers, not just one server, like a traditional NFS kind of model. How is Hadoop different than that? How is, I'm sorry, I think it's HDFS, different than that, right? Yeah, so the real sort of defining characteristic of HDFS is the type of file and the type of operation that it's optimized for. It performs much better with, as I said, these streaming reads and writes and working with very large files. Gotcha, gotcha. And that is the fundamental difference. You know, there's nothing to stop you from storing a small file in HDFS, but the block size is 64 megs. Ah, that's big. It's big. But when you're working with terabytes of data, that's actually a very reasonable block size. Because the common case, you gotta think about what the common case is. So let's walk through a very common case usage of Hadoop. Say you have a web crawl going on and you've got lots of workers crawling the web and dumping their results out. All right, so each of these might open a file and just pour all of the data that it sucks across the network into that file. And then when you finish this process, maybe you'll have a couple terabytes of data spread over a couple very large files. Then what you'll wanna do is index this data and process it for the web. 
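To make that crawl-dump step a bit more concrete, here is a minimal sketch, not code from the project or from the interview, of how one of those workers might stream its output into a single large HDFS file using Hadoop's public Java FileSystem API; the paths and the fetchPages() helper are hypothetical stand-ins for real crawler logic.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class CrawlDump {
      public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();      // reads the cluster's site configuration
        FileSystem fs = FileSystem.get(conf);          // talks to the name node
        Path out = new Path("/crawl/part-00000");      // hypothetical output file
        FSDataOutputStream stream = fs.create(out);    // HDFS places and replicates the blocks
        for (byte[] page : fetchPages()) {             // fetchPages() stands in for the crawler
          stream.write(page);                          // write-once, streaming, append-style output
        }
        stream.close();
      }

      // Placeholder so the sketch is self-contained; a real crawler would fetch pages over the network.
      private static Iterable<byte[]> fetchPages() {
        return java.util.Collections.<byte[]>emptyList();
      }
    }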
So what that would look like is you create a MapReduce job to do your indexing. And the first step of that index might be to build a hit list or something like that, where you take each word occurrence and map it to the documents that it occurs in. So you would create a MapReduce job, and what will happen then is your data will get broken into chunks on the order of the size of your blocks. So maybe 64 to 128 megs, depending on how you configure your cluster. And then each node will process the data blocks that it has using this same map, this code from the map phase of the MapReduce. And what allows MapReduce to work so well is there's no shared state between the compute nodes. So each compute node is able to look at its data and process it independently of any other data on the cluster. So when a task fails, for example, it's relatively trivial to handle: you just rerun the task. You don't have to worry about there being shared state. The only synchronization point in the process is in the reduce phase. So what happens is, in the map phase, you emit as many arbitrary key-value pairs as you want, and then in the reduce phase... Let me ask you, I'm sorry, let me interrupt and ask you about one thing. So I'm in the parallel computing field, and one of the big things we are concerned about is the overlap of data: for me to do some kind of local computation, I need my data and a little bit of overlap of what my neighbors have. Is this something that's also doable in the Hadoop kind of space? Or is that just not really an issue because you're doing different kinds of computations that are fully local? So there's nothing to stop you from accessing data on a neighbor, so to speak. You just need to change the way that you think about it. So what I was saying is the only sort of synchronization point that you have in a MapReduce computation is how you move between the map and the reduce. And so what the map does is it emits arbitrary key-value pairs. And then what happens in the reduce is all of the values for a given key are aggregated together so you can process them together. So if you want to process data that's sort of on your neighbor, so to speak, you need to make sure that you key your data the same way that your neighbor keys the data. And if you both do that, you'll get to look at that data together in the reduce step. But only if you key them together in the map. It seems to me that the map part is what we would normally call an embarrassingly parallel problem. It doesn't have anything to do with what the other CPU is doing, what the other computer is doing, what the other rack is doing. You could almost say what the other cluster is doing. And then you just reduce the final key-value pairs at the end, and that would actually involve communication. That's exactly correct. The map step is embarrassingly parallel. And the sort of magic, so to speak, happens with how you define what your intermediate keys and values mean and how you aggregate those values in the reduce step. And there's so much flexibility in how you do that. I sort of grew up at Google, and I was taught MapReduce fresh out of college, and it completely changed my perspective on computation. I was 22 years old, three months out of school, and writing code that ran on 4,000 machines and processed 10 terabytes of data. It took some adjusting; you had to wrap your head around it. But it teaches you that with MapReduce you can do anything, from a sort of data processing and transformation perspective. 
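A minimal sketch of the word-to-documents indexing job described above, using the Java MapReduce API of that era (org.apache.hadoop.mapreduce); this is illustrative rather than anyone's production code, and deriving the document ID from the input split is an assumption made just to keep the example self-contained.

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class InvertedIndex {

      // Map: for each word in the input value, emit (word, documentId).
      public static class IndexMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
          String docId = context.getInputSplit().toString();  // stand-in for a real document ID
          for (String word : line.toString().split("\\s+")) {
            if (!word.isEmpty()) {
              context.write(new Text(word), new Text(docId));
            }
          }
        }
      }

      // Reduce: all documents for one word arrive together; concatenate them into a hit list.
      public static class IndexReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text word, Iterable<Text> docs, Context context)
            throws IOException, InterruptedException {
          StringBuilder hitList = new StringBuilder();
          for (Text doc : docs) {
            hitList.append(doc.toString()).append(' ');
          }
          context.write(word, new Text(hitList.toString().trim()));
        }
      }
    }

The map side is the embarrassingly parallel part discussed above; the shuffle that groups every value for a given word onto one reducer is the only synchronization point.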
One of the things I would joke about is when you have a really big hammer, everything's a nail. So there are lots of things that you can do with MapReduce where MapReduce might not be the optimal way to solve the problem, but the fact that it's the same framework that you solve all your other problems with, and it makes it very easy to interact with all the data that was generated by lots of other different processes, often makes it good enough. And the real resource saving is the developer's time, which in many ways is more expensive and harder to come by than CPU cycles or hard disks. So the map part, you know, the nice thing about being embarrassingly parallel is it pretty much scales with machines and the amount of data you have. Do you do any type of special magic to try to make the reduce phase scale well? I mean, is there a lot more work being done on trying to improve the reduction than the mapping part? Or is it mostly just trying to massage everything into MapReduce? So the reduce phase, again, can be very parallel. Where the magic comes in with the reduce phase is more in the model and the way you think about the computation, because what really matters in the reduce phase is how many unique values you have for a given key. So let me give you two examples here where we can start to illustrate this. If you're indexing the web, okay, and you're trying to do a mapping of a word to all the documents that contain that word, you have lots and lots of keys, okay? The work that's done per key isn't dependent on any other key. So that can always be spread out. And in this example, your values are just going to be a series of documents. So there are a lot of values, but for the computation that you're doing, you don't ever have to store them all in memory at once. So the thing that really matters is: is your intermediate key space wide enough that you can spread it over a lot of machines? I mean, sometimes you might have a MapReduce that really only has one intermediate key. And then you can do the map in parallel, but the reduce really isn't parallel. So I wonder if you could clarify one thing for me here. So you're talking about keys and data values and things like this, but I'm trying to reconcile that in my head with the file system. So how do you map a key and a value to the back-end HDFS? Is the key the file name, or how does that work? No, keys and values are sort of abstract concepts that mean what you want them to mean. I guess, so think of it as you have a file, and a file will be broken into chunks. And you can basically define input types for a file that tell MapReduce how to process it. A couple very common input types are text files, where every new line is essentially a new record. Sequence files are a way for you to pack arbitrary blobs of data into the notion of a record. And when you process one of these files in the map phase, the key usually has something to do with... in the map phase the key itself isn't very important. It might be the line number, it might be the record number, it might be the offset into the file that you're processing. But what you're doing in that map phase is you're taking the value, whether that's a line of text or whether that's a blob of data that you're going to deserialize into a data structure that you're familiar with, and you process that. 
And in the process of processing that input record, you generate arbitrary key-value pairs. And those mean whatever you want them to mean. Because what's gonna happen is... So where do the key names come from? Like, an example you gave earlier was, you're doing a web crawl and you're dumping all these files onto the file system. Is it during that dump to the file system that you generate these keys that are meaningful to you, and you say, all right, this chunk that I'm putting there is key foo? No, that's not where the keys are generated. That data is just data. As I said before, a gaping maw of bits. You just think of it as a place to pour a bunch of data. Then what's gonna happen is, let's do this example where you and I are processing the same file. So say we both have this web crawl data as input, and let's say that you wanna build an index and I want to compute PageRank. So it's the same input data. When you process that data, in your map phase, you're gonna generate intermediate key-value pairs where the key is the word and the value is the document. And then you're gonna aggregate those in the reduce and you're gonna output your index. So for you, you defined a key as a word and the value as the document, and that means something to you and to your program and to your data transformation. Now what I'm gonna do is I'm going to look at all of that same input, but when I'm processing the input, I'm looking for links, and every time I find a link, I'm gonna look at where that link goes, and the destination of that link is going to be my key, and the value is going to be what fraction of my page's rank I'm going to transfer to the page that was linked to. And that depends on... So your keys are kind of dynamically generated during the map. It's not an attribute of the file. That's right, exactly. It's an attribute of the computation, the transformation. So one of the things that I say a lot is that MapReduce gives you the ability to dynamically index and aggregate your data. You don't have to pre-define any indices on your data, because in reality, you define indices over your data to make it easier to ask certain types of questions. The reality is, if you knew all the questions that you wanted to ask about your data upfront, you probably wouldn't need the data. So like in your case, when you're emitting these links and what page you're coming from, the reduce of that would probably be totaling up all of those for a given link to a page. How many times does it appear? Because you want to know which page is more popular? Well, what we're doing is, remember, we use the destination site as the key. So what's going to be happening in the reduce is, for each destination site, it's going to be aggregating the PageRank that was transferred to it from all incoming links to that site. So basically you're doing a very, very big sort | uniq -c? Kind of. I mean, sort | uniq -c is the simplest case; you can do much more complicated things than that. That was a polite way of saying, not really. You have to forgive our ignorance here. Well yeah, I appreciate you explaining this stuff because it's huge. But what I'm saying is, sort | uniq -c, yes, you can absolutely implement that in MapReduce, it's a very common thing to do. It absolutely works with the model. You see, MapReduce was originally developed by Jeff Dean and Sanjay Ghemawat at Google, two of their Fellows now. 
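The sort | uniq -c analogy can be made almost literal with Hadoop Streaming, which ships with Hadoop and lets shell commands serve as the map and reduce steps. A rough sketch, assuming simple tab-free, line-oriented input; the streaming jar's name and path vary by version and install:

    hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-*-streaming.jar \
        -input   /crawl/part-*       \
        -output  /line-counts        \
        -mapper  /bin/cat            \
        -reducer "/usr/bin/uniq -c"

The framework's shuffle and sort between the map and reduce phases plays the role of sort; uniq -c then counts each distinct line inside the reducers.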
And MapReduce was developed to help with indexing. Google does these giant web crawls, and they end up with terabytes and terabytes of data that they need to have indexed, massaged, analyzed, prepped, and shipped out to the data centers on a pretty regular basis. So you can't have every engineer that works on indexing, or any part of this process, thinking about all the challenges of a large-scale distributed data processing system. That's just a very poor way of aligning their expertise with the output they need to produce. So the MapReduce system really provides a layer, an insulator: the systems programmers need to focus on locality, network IO, synchronization, concurrency, fault tolerance, and they think about all of that underneath the covers, and then above, when you're using the MapReduce API to do your data processing, you really don't think about any of that. You just think, this is the data I'm starting with and this is the data I wanna end with, and now how do I use the MapReduce model to achieve this transformation? Absolutely, good middleware design. Yeah. So MapReduce started off as this, building an index, web searching, but what are some of the other things that people are using it for besides search? What are some of the more interesting things you've seen? I ran across the machine learning project while researching before this. What other things have you seen? Yeah, so as I said, it started out primarily for indexing and whatnot. One of the things that we quickly realized you could do beyond that is graph analysis. It generates a lot of intermediate data, but it's pretty good at doing graph analysis. Data mining and artificial intelligence, again, it works really well because it allows you to process so much more input, so many more training examples. The challenges come, though, in thinking about the way to transform these algorithms into a MapReduce model. But the MapReduce model itself tends to be flexible enough to allow for this. There's just a thought exercise in converting them, and sometimes what you'll see people doing is they'll run iterative MapReduce jobs. Maybe it's not one job anymore, maybe now it's two or three jobs. Are you referring to the Mahout project? Yeah, it's another Apache project. Exactly, so that's, I believe, I don't know if it's a sub-project of Hadoop or if it's a side project or how that works exactly, but that's a really powerful and compelling machine learning library that takes a lot of the common machine learning algorithms and converts them into this MapReduce framework for you. So what's the biggest scale that you've seen Hadoop deployed at? You were talking about at Google, you were running at 4,000 machines. Is this what people do with Hadoop, or bigger or smaller? Well, one thing we need to make clear: Google published papers on MapReduce and GFS. They did not publish any implementation. The implementation for Hadoop was primarily developed by Yahoo, with major contributions from Facebook and other companies that contribute to the community. I am personally aware of Yahoo Hadoop clusters that exceed 8,000 cores. So depending on how those machines are configured, that's 8,000 cores and two, three, four petabytes of data. Google uses these same models at the same sort of scale. Google's version of MapReduce and GFS goes up significantly larger than that. 
And Google's very tight-lipped about their machines and how big their clusters are, but it's considered to be a couple of years ahead in development. One interesting thing that came out recently is some folks over at Yahoo recently used Hadoop to take the TeraSort benchmark. And they did it in 200-and-some-odd seconds. Google then used their own system to do it a couple months later, and they did it in about, I think, 60 or 70 seconds. And then what they did is they did a petabyte sort after that. And that came out at, I think it was four to six hours, or something like that. I don't have the data right in front of me. That's nutty. Is that a standard benchmark available somewhere? Well, TeraSort is a very standard benchmark that's available. And in the Hadoop code, there's some instrumentation stuff in the Hadoop project that allows you to just run it so you can measure the performance of your own cluster. Okay. The petabyte sort was Google's, I think, cheeky answer to Yahoo winning the TeraSort. Our clusters are bigger than your clusters. There's a lot of that going on. And it's always a fun position for me, coming out of Google and working so closely with the Hadoop project, where many of the competitors are at Yahoo. We quite frequently joke about our corporate overlords and punt on certain political issues; it's all in good fun though. Let the sales guys figure all that stuff out. Honestly, one of the things that I really like about this project is that it's an open source project, and you really do see developers from lots of different companies coming together and solving their problems together. And for that reason, Hadoop is quirky. There are lots of little add-ons that someone developed because it would have been cool and really good for their application, and now it's available to everybody. And this is sort of one of the double-edged swords of open source projects. On one hand, it's great that everybody adds whatever they want to the project. On the other hand, sometimes that can make it really confusing for end users. So that's one of the things that we're doing here at Cloudera now: we are going to be removing some of the confusion around how the components fit together and what components you need to do what task. And you'll see us producing releases and summaries of the software that are well-suited for specific tasks. So actually that leads right into a good question here. What exactly is your role in the project? I mean, Brock introduced you as one of the core developers, but what do you do? Yeah, that was a horrible way to introduce me. I wasn't sure we were recording yet, so I didn't want to stop and correct it. So my hands are a little bit personally tied from a development perspective, because I spent a lot of time at Google working on our internal versions of MapReduce and GFS. It just creates problems if I go start contributing code to the Apache version. It's easier now that I'm not at Google. But no, sort of my role at Google was I built this academic cloud computing initiative, where we took, depending on how you count, like how you measure a cluster, probably about the second or third largest Hadoop cluster and made this available to the National Science Foundation. 
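For reference, the TeraSort benchmark mentioned a little earlier ships in Hadoop's examples jar, so you can run it against your own cluster. A rough sketch of the usual sequence; the jar name and the data size vary by version and by how much disk you have (TeraGen writes synthetic 100-byte rows, TeraSort sorts them, TeraValidate checks the result):

    hadoop jar hadoop-*-examples.jar teragen      10000000000 /bench/terasort-in      # ~1 TB of input rows
    hadoop jar hadoop-*-examples.jar terasort     /bench/terasort-in  /bench/terasort-out
    hadoop jar hadoop-*-examples.jar teravalidate /bench/terasort-out /bench/terasort-report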
And so I did a lot of sort of operational legwork: carving out a data center in Google, working with the Apache Foundation to get Hadoop up and running, working with IBM on a lot of operational issues with the cluster, and then working with the National Science Foundation to explain what Hadoop is and why you care about big data. I worked a lot with the University of Washington to get courseware developed, teaching, honestly, both students and faculty what MapReduce is, why it matters, and how Hadoop can help you do this stuff, and then really bringing this together in a large research program where the NSF solicited proposals from people that wanted to use a large Hadoop cluster. And those are in now, and those are gonna be announced very soon. And so this cluster that Google maintains and that IBM helps out with is gonna be used for a lot of really cool data-intensive research. A lot of it is just outside, one degree away from, computer science. For the first time ever now, non-computer scientists have a viable way to work with terabytes of data. This used to be a problem that really required computer scientists on your research team. Hadoop gives social scientists, geologists, people that are really deep into biotech, a very easy, manageable framework to process terabytes or petabytes of data. You know, that's a fascinating point you make in what you just said, and something you said earlier, that sometimes MapReduce is not the best tool to do it, but it's a great tool, and the fact that you can deploy it on such a scale makes it so attractive. I hear similar things in my own community. Sure, we'll take a five or 10% performance hit, but if you can give me a tool that will either scale better or have better features or something else that interests me instead of the performance, I'll take that. And it's actually quite refreshing to hear that from an entirely different community. Yeah, no, I mean, Hadoop is extremely well suited for the workload it was designed for, which is streaming reads and writes and processing data in an IO-bound environment. It's very good at that. It is also capable of doing many other things with, you know, more or less optimality. But the fact of the matter is, if you get everyone familiar with using Hadoop and sort of comfortable with this, you have a virtually infinitely scalable file system. And if you write a Hadoop job and it works on 10 gigs of data, it's probably gonna work on 10 terabytes of data without rewriting it. The only difference is you're gonna require, you know, a thousand times more machines, or however the math works out. But yeah, I mean, that's the only difference: you just need more hardware. Again, this goes back to separating the application developers and the analysts that are designing these data processing workloads from the system engineers that maintain Hadoop and the operations team that keeps it running. These all become very distinct roles now in an organization, without a lot of dependency on each other. So how does Hadoop handle fault tolerance? I notice you've mentioned that a lot. And of course, the mean time to failure compounds across the number of machines, and you're mentioning multi-thousand machine clusters. Something's gonna break. Absolutely. Again, this is one of the motivations: Hadoop is designed to run on relatively cheap commodity hardware. 
One of the things that Google realized when they were developing their systems is, even if you buy really, really expensive, reliable hardware, if you have 10,000 of them, you know, a couple of them are gonna fail every day. And once you have to deal with a couple machines failing every day, it's not that much harder to deal with five or 10 times that many failing every day. So the way Hadoop handles fault tolerance is that every block of data is replicated to at least three distinct machines. And there are a couple of reasons for that. Losing a machine is really not a problem, because the master takes care to make sure that you've got one copy of the data on a machine, maybe one copy in the same rack, and one copy in a different rack. It distributes it so rack failures aren't even a problem. Can I interrupt one thing there? So I'm a network guy here, and what you just implied to me is that Hadoop is aware of the network topology. Is that something that it goes and figures out itself, or something you have to tell it? Right now you have to tell it. Okay. You tell it what your racks look like, tell it where your switches are and stuff like that. And it will then use that; there are a couple things it does with that information. First, when it's placing blocks for replication, it takes care to do that in a fault-tolerant way. And the other place it's useful is when MapReduce is scheduling a job: the first thing it's gonna try and do is schedule the computation on the same physical machine that contains the data. If that's not possible, it's gonna try and get it in the same rack. Only if it can't get it in the same rack will a map process pull data across your core switch. Wait a minute. So let me get something straight here. On the clusters we're used to working on, disk tends to be provided by a dedicated set of servers providing a file system. This sounds like the workers, what in our case would be a compute node, are also storing data. You're just using the local drive. There's a very intimate marriage between the Hadoop file system and the MapReduce processing engine. Every node is both a data node and a processing node. So every node that comes with a 250 gig hard drive effectively for free from a vendor is contributing to the total storage of HDFS. But realistically, the machines that we build out for customers, we're usually putting like four one-terabyte disks in like an eight-core machine. No RAID? A common configuration. You know, you can have only 250 gigs in there, but remember, there's a replication factor of three. So at that point, you're only getting, you know, I don't know, a little less than 75 gigs or so per node. So it sounds also like, since you're basically doing your own software RAID, there's no need to use hardware RAID, and you would actually get... There's absolutely no need for hardware RAID. We just use cheap IDE, SATA disks. Do you get your full four terabytes that you talked about in that box? Well, sort of, except for the replication factor, so it's more like 1.3 terabytes. But the thing is, though, the replication factor isn't just about fault tolerance. There's another angle on this that you need to consider. And that is, it makes your data more available. Because that data is replicated three times, that means that your indexing job can be running at the same time that my PageRank computation is running, and we're both accessing the data, and we're both accessing it locally. 
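As a sketch of the "tell it what your racks look like" step: in Hadoop of this era, rack awareness is configured by pointing the cluster at an executable that maps host names or IP addresses to rack paths. The property below is the classic name from that generation of Hadoop (it changed in later releases), and the script path is a made-up example:

    <!-- in the cluster's site configuration (e.g. hadoop-site.xml) -->
    <property>
      <name>topology.script.file.name</name>
      <value>/etc/hadoop/rack-map.sh</value>
    </property>

The script is handed one or more host names or IPs as arguments and must print a rack path such as /dc1/rack07 for each one; the name node uses those paths for replica placement, and the job scheduler uses them to keep map tasks node-local or at least rack-local.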
And, you know, for certain types of data that might have a lot of analysts working with it all at once, for subsets of your directory tree, you can just increase the replication factor to five or 10 or 12 or 20 or whatever makes sense. And you could do that because, again, this data is so sensitive that, you know, four nines isn't good enough for me, or this data is gonna have so many people accessing it that I wanna increase the replication factor. You have this type of control with the file system. And since the file system and the processing engine are intimately married, you know, that's why it's so good at these large streaming reads and writes with, you know, these data transformations. It really doesn't make sense to separate MapReduce from the file system. It's like an active storage system. Now, that was redundancy and, you know, reliability for the data. What about the actual computation part? It sounds like Hadoop is already running. Do you kind of just submit work to Hadoop, and does Hadoop handle failures? Yeah, you submit a job. So from the time the client submits the job, the master manages the execution of that job. So let's take a job. Let's say there are a thousand nodes on your cluster, and you submit a job and it cuts it up into 1,500 pieces. All right, so the first thing it's gonna do is try and schedule a thousand of those processes to go finish their map phase. And, you know, some are gonna finish faster than others. And as it starts to finish tasks, it'll take those remaining 500 and stick them on the nodes that finish first. And then the rest of them will start to finish. And then there'll be some stragglers, where maybe the disk was bad in that machine, or maybe your motherboard's flaky, or maybe your network card's flaky. Who knows what's wrong with it? It's just taking a long time. Since it's a shared-nothing architecture, what the MapReduce master will do at that point is say, you know, that task's taking really long, I'm just gonna go try it on a different machine to see if it finishes faster. And so it will speculatively execute that task on another machine, and whoever finishes first, it says, your results count, yours don't. So it's very much a manager-worker kind of thing. It just farms out work. And if it happens to farm out multiple copies of the same work, whoever answers first wins. Exactly. And this is completely transparent to the developer who submitted the job. Cool, very cool. So let me ask you this, again reflecting my own natural bias as a network guy here. We're talking about replicating, you know, terabytes of data here. What kind of networking is typically deployed in Hadoop and other scenarios like this? So a very common configuration is 40 1U servers in a rack, each with a one gigabit uplink to the rack switch. Sometimes, if you wanna put two switches in there, you can do that for redundancy. And then those switches generally are patched through to the core switch at somewhere between two to eight gigabits of connectivity. Again, depending on your load, how much money you have to spend, et cetera. But that's a very common setup. Honestly, it's nothing fancy. Interesting. So you actually don't even have, I mean, you can still do all this processing, streaming massive amounts of data, with just one gigabit down to the... Yeah, because remember what we said before, that the file system is inherently married to the processing engine. 
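Coming back to job submission for a moment: the client-side driver that hands a job to the master might look roughly like the sketch below, wiring together the mapper and reducer sketched earlier. Class names, paths, and the job name are illustrative; speculative execution, the "re-run the straggler somewhere else" behavior just described, is on by default and governed by configuration rather than by anything in this code.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class IndexDriver {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "inverted-index");               // 0.20-era constructor
        job.setJarByClass(IndexDriver.class);
        job.setMapperClass(InvertedIndex.IndexMapper.class);
        job.setReducerClass(InvertedIndex.IndexReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path("/crawl"));   // hypothetical input directory
        FileOutputFormat.setOutputPath(job, new Path("/index")); // hypothetical output directory
        // The master takes over from here: splitting the input, scheduling tasks near the data,
        // re-running failed tasks and speculatively re-running stragglers.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }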
Every node both stores data and does computation. So rather than, I think when you talked about how you normally have your storage system separate, what you're doing is you're bringing the data to the computation. With Hadoop, we are pushing the computation to the data. And in reality, if you're processing 10 terabytes of data, what you care about in your reduce phase is probably only a small subset of that. If there's 10 terabytes of input data, maybe a terabyte or half a terabyte of that is interesting, so to speak, for the computation that you're doing. That one terabyte or half terabyte that goes into the reduce phase, that's the only thing that's gonna cross the network. The rest of those 10 terabytes you're able to basically analyze and search through and map over locally. Gotcha. And is it a general characteristic that the output, because you do need to get the output off the nodes, right? Is the output generally much smaller and therefore not... It doesn't matter. You write it back to HDFS. Oh, okay. You write it back to the same file system. So as you're writing your output, it's getting distributed back over the file system. So when that's gonna be the input to your next job, it's already distributed over the file system, it's already spread out; you just submit your job, and the master will take care of pushing your job, your computation, out to the data nodes that are storing that data, so we can start this whole process over again. I see. Now what if you've done a 10-stage job and the output of the 10th stage is now in HDFS and you wanna take it off the cluster? Then I assume that you do have to pay the... Yeah, you do a little copy. There's an HDFS client which will read from all of the data nodes that store your data, in parallel, and will write that file out to a different file system. It's okay, it's just not as efficient as working with the data inside that file system. But a very common use case is maybe you have your database server that you dump into Hadoop, you have all your web servers aggregate their logs into Hadoop, and any other interesting new data you have, you dump all of that into Hadoop too. And maybe you have your BI systems dump into Hadoop. Then what'll happen is you'll see a bunch of Hadoop jobs do a bunch of interesting extract, transform, load type computations over this, where they transform the data in interesting ways and maybe augment it and join it with some other data and then do some further computation. And then your results, now, of all of this data mashing you did, are relatively small and actually fit back into a regular database. But all of that intermediate data, there might have been 10 terabytes of intermediate data, which you could never have stored in a database. So quick question here. Hadoop does scheduling, resource management, everything that we tend to have separate dedicated pieces of software for on our HPC-style clusters. Has anyone ever done work to make Hadoop live inside, say, the Torque resource manager or SLURM or SGE or one of these others? So there has been some work to make Hadoop work with Torque and Condor-like systems. It's called Hadoop On Demand. But let me explain to you how that works and what the problem is. So you absolutely could use a traditional system like Condor or Torque to provision and deploy Hadoop. What would happen then is you would deploy these nodes, you'd create the file system, and it would bring up the computation nodes, but the thing is, your disk state now isn't persistent. 
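For the copy-in and copy-out steps mentioned a moment ago, the stock HDFS command-line client covers the common cases; the paths here are hypothetical:

    hadoop fs -put /var/log/web/access.log.0 /logs/web/      # load local data into HDFS
    hadoop fs -ls /index                                      # inspect a job's output (part-* files)
    hadoop fs -getmerge /index /tmp/index.txt                 # merge the per-reducer outputs into one local file
    hadoop fs -get /index/part-00000 /tmp/part-00000          # or copy a single file off the cluster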
So the file system itself would need to be persistent. So the compromise that people have come up with is they deploy the HDFS file system on all the nodes, and they'll use something like Torque to provision the MapReduce workers to do the data processing job. The problem with that is all those guarantees about locality now are gone, because Torque's not gonna be aware of which file has which blocks on which server. So what it's gonna do is provision 10 random nodes to do your MapReduce. Those nodes that it chooses may or may not be local to your data. Right, so you're losing a lot of the benefits of what you just described a few minutes ago. And the thing is that when those nodes are provisioned and allocated for that user, the MapReduce jobs tend to have sort of a long tail, where maybe one or two of the tasks will take a lot longer to finish. In this world, those eight nodes that aren't doing work while those two are finishing are wasted. If it's a common shared cluster, now somebody else could be running on those already. Gotcha. But this lets, for example, our computer science department here, if they wanted to try out MapReduce and Hadoop, they could actually come use our system, which has Torque and Moab on it already. And we wouldn't have to do anything too crazy. They could get Hadoop going and at least test out their algorithm. They absolutely could get Hadoop going. They absolutely could run jobs. It would not be able to take advantage of many of the performance benefits that come from Hadoop's design and intimate awareness of the file system and the MapReduce engine. Okay. But at least it's possible. It absolutely will work. It absolutely will work. But if you then, for example, had someone go do a performance benchmark on that, it would be very embarrassing for Hadoop. Sounds like a good proving ground though. A cheap way to try it out before you buy it. Absolutely. A lot of classes, what they'll do is, we've got some tools that make it really easy to deploy a Hadoop cluster on EC2. And Amazon's now getting very good about giving students and teaching faculty free credits on EC2. So with the push of a button, you can have a cluster deployed on EC2 and up and running in a matter of minutes. Oh, that's cool. I didn't know that, actually. Yeah, no, it's very, very simple. The Amazon folks are really helpful to academics. We've been doing some cool stuff with the AMI for Hadoop. And we really wanna remove all of the barriers to anyone deploying a Hadoop cluster, whether it's locally or on EC2 or on some other cloud. I'd love to make it easier for it to work on systems like Condor, except it just makes Hadoop look really bad when you deploy it in environments like that, because it can't take advantage of any of the design decisions, and the compromises, frankly, that were made to get that type of performance. Okay, let me change tack here a little bit. Since I'm an open source developer myself, it always fascinates me to hear about other open source projects. Because, as you kind of alluded to earlier, every open source project has its own quirks and its good things and bad things and whatnot. Tell us a little bit about the Hadoop project. Who's involved, who does what? How do you guys operate on a day-to-day basis, this kind of thing? Sure, so Hadoop is an Apache project. So Hadoop, like any other Apache project, has a PMC, the Project Management Committee. 
And if we just start by looking at the Project Management Committee, what you'll see is almost half of those people come from Yahoo. Yahoo is the single largest contributor to Hadoop and has the largest deployments of Hadoop. It's used for many, many of their production systems. It builds the web indices at Yahoo. It builds the web map. The research team uses it extensively for research; the ads teams use it to analyze and optimize stuff. Many of the same things that Google developed MapReduce for and uses MapReduce for. So that's the largest user. I would say the next largest user is probably Facebook. They have similar data requirements to the Googles and the Yahoos. And one of my co-founders, Jeff Hammerbacher, is the guy who actually built the data team at Facebook and built the data warehousing and analytics tools that they use on top of that. There are contributors also from a few smaller companies. And I would say, behind Facebook and Yahoo, Cloudera is probably one of the next largest contributors. We have a handful of committers as well. And then there are, I wouldn't say a bunch, a few smaller companies that have maybe one developer, one committer. But then there are lots and lots and lots more users of the software. And the usage we see ranges everywhere from academic, for research and instructional use, to lots of web 2.0 companies using this to process large volumes of log data or user-generated content, or maybe they're doing a crawl. The thing that Hadoop does really well is it deals with semi-structured data, or, you know, unstructured data. You no longer have to cram this into a relational database and define your indices and know how your data's gonna scale and grow and how you're gonna wanna interact with it in the future. You just dump your data into Hadoop and you write your sort of imperative processing jobs. You know, SQL's really good at finding data. It's a declarative language. If you can specify your query wholly, it will then compile down to a plan that will pull that out of a database faster than Hadoop ever would. But if you want the imperative ability to analyze your data in depth and at that scale, that isn't really what SQL does. Okay, cool. So we're about out of time here. So where's the Hadoop website, and where can people find information about it? Sure, so the Hadoop website is at hadoop.apache.org, and then Cloudera's website is just at cloudera.com. Okay, cool. Well, Chris, thanks a lot for giving your time to speak with us today. I'm sure for a lot of people who are gonna listen to this, it's a very different kind of application than we're used to working with. Oh, it's fascinating though. I appreciate your time, sir. So do me one last favor and let me get one plug in here explaining what Cloudera specifically does right now. And right now what we do is we provide commercial support. So if you are at a company where you want to run Hadoop, much in the way that Red Hat will support Linux, we will support you in running Hadoop, providing you with nice clean bits and a nice clean distribution and all the operational and technical support you need to do that. Okay, that's very handy to have. Good, all right. Thanks a lot for your time. All right, thanks a lot. It was great talking to you. No problem, thanks, Chris. That was, yeah... I'm stressing how different it is than what we're used to normally working with. 
So this was quite different, but still very large scale; not something you do on your desktop. Yeah, well, it certainly made me excited to actually just go try this stuff myself. I'm like, you know, I must have some large data sets around here somewhere that I could deploy on a couple hundred machines and give it a whirl. Sounds fun. Yeah, if you Google it, I found a bunch of examples online and was able to get it going simply, just on my regular old Mac laptop. Aha, you Google for Hadoop, haha. Yeah, whoops, whoops. Don't use it as a verb. That's right. So what else have you got coming up on RCE? What's next? Well, right now the next interview is going to be the VisIt project out of Lawrence Livermore. That's the big distributed data visualization tool; it runs on MPI, on large systems. They've done a couple billion-cell meshes, you know, hundreds-of-gigabytes-style visualization. So it should be something very different from what we normally talk about. All right, well hey, MPI, I always like that. I don't know anything about visualization tools, but I'm all for parallelization. Well, on top of that, just have people go to our website, www.rce-cast.com, and fill out the nomination form for other things that people would like to hear about. We're always looking for more people to get in contact with. Also, if you have contact information for anybody you see on the list that I don't have contact information for, please send that on to us also. Okay, sounds good. Okay, Jeff, thanks a lot for helping me out this time. We'll definitely have to try it again. All right, thanks everybody. See ya. No problem, bye.