All right. Just like to say we're the Hacked Existence team, and we're here today to present on Hadoop, Apache's open source implementation of Google's MapReduce framework. Just a few quick greets: shout out to The Dark Tangent for making DEF CON possible for all of us, to Nikita for project selection and getting us up here on stage so we can present to all of you, and last but not least, Steve Argun.

So let's get into it. We're going to give you a brief overview of clouds: definitions of Hadoop clouds, things that they do, things of that nature. Then a MapReduce walkthrough: what a mapper is, what a reducer is, what they're doing, and how we're utilizing them. A little bit about Hadoop's backend infrastructure, like the master node, job tracker, and distributed cache. The streaming interface, standard input and output, and how things you create in the Hadoop framework are portable to other languages. HDFS, which is the file system Hadoop uses, and HBase, which is the database it uses. Then we're going to talk a little bit about the Netflix Prize and the sample code our team has generated for it, which will give you a chance to see some code examples where we've utilized the Hadoop framework to generate some data. And last but not least, we have a few other special select projects just for you guys today at DEF CON to check out. So right now I'm going to hand it over to Joey Calca and Ryan Anguiano.

All right, so a few things most people should already know about clouds. They're big piles of other people's hardware. There's some element of virtualization built into them. They're scalable: you can drop nodes in and out, and it's not going to affect how the cloud runs. You've got a high-level API, so with Hadoop we really don't have to deal with moving data around or with networking or anything like that; you just write a mapper and a reducer, and the framework takes care of all the rest for you. And Hadoop really utilizes coarse-grained data, processed in parallel.

So how much data are we talking about? Well, the Wayback Machine has about two petabytes of data total, and they're adding about 20 terabytes a month. Google processes 20 petabytes a day. What would you do if someone came to you with 20 petabytes and asked you to get information out of it? That's just a massive amount of information. Or CERN's Large Hadron Collider: 15 petabytes a year when it's up and running. So these are large amounts of data that you need to be able to sift through and analyze.

All right, so a lot of the work we did was on the Saguaro cluster at ASU, which has 4,560 processor cores. We got a small carved-off portion of that: 50 hardware nodes, where each node was two quad-core processors, so eight cores per node. When we get into our mappers: you create one mapper per processor, so with 50 hardware nodes we would create 100 mappers, running two mappers at the same time on each hardware node. So for the whole rest of the talk, when we refer to nodes, it's processors. And then there are just a few other stats on the rest of the cluster.

Now, Google's MapReduce. In 2004, Google released a paper outlining their MapReduce framework. They had a huge problem processing and generating large data sets, so they invented MapReduce to solve it, and many real-world tasks are expressible in this model.
Programs written in this model are automatically parallelized across a large cluster of commodity machines. Here's basically the workflow of MapReduce: you get your input data, you run it through your mapper, the mapper outputs some intermediate key/value pairs, which get passed to a reducer, and the reducer spits out your output. We'll talk about this in a second. It's really easy to utilize a large distributed system without any experience; you don't have to mess with the network or anything, you just have to write your mapper and your reducer. It's highly scalable, meant to scale from a couple of gigabytes of data up to a couple of terabytes, so any data you want to throw at it, it'll be able to handle. And it's so useful that Google runs about 1,000 MapReduce jobs a day.

Now here's Hadoop. Hadoop is the Apache project's open-source implementation of MapReduce. It's Java-based, and right now it's not that stable: the latest version is 0.20, and the cloud we compute on is running 0.19. It's been demonstrated on clusters of 2,000 nodes, but for production they're aiming for a target of 10,000-node clusters. If you want more information about that, you can check out the website right there.

And so here's a mapper. It's basically a special function that applies a function f to each element of the data. Here's the algebraic expression: map(f, [a, b, c]) = [f(a), f(b), f(c)]. As you can see in the graphic right here, it just takes the function and applies it to every single piece of data. In your mapper, you map all of your input values to a key. The map function is called one time for each input line, and it outputs key/value pairs, as a hash map, to the reducer.

All right, so after you get the key/value hash map from your mapper, there's a whole intermediate phase that you don't really interface with: copy and sort. Map runs in parallel across a bunch of nodes (we're running 100 mappers at a time), and all that data is then copied to a single reduce node, because reduce happens in serial. After it's copied, it's sorted by key. Everything with a common key in the hash map gets its values dumped into an array, and the reducer is passed the key and an array of all the values that had that common key coming out of map.

All right, so the reducer takes your function f, an initial value x, and the rest of the array list. It sets an accumulator, computes f on the initial value and the first element, and then applies f to every remaining element in the list plus the accumulated value. Your final result is what you output from your reducer. Algebraically it looks like reduce(f, x, [a, b, c]) = f(f(f(x, a), b), c). Here's a picture of what it looks like: the block on the left is your initial value, and then it runs recursively, keeping the accumulated values. So your input to the reducer is the key/value hash map output from the copy and sort after your mapper, f is performed on every value with a common key, and the output from your reducer is another hash map, which is what you get as your text output.
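To make that concrete, here's the whole flow sketched in plain Python, with no Hadoop involved. The averaging example and all the names are ours, purely for illustration:

```python
from itertools import groupby
from operator import itemgetter

# Map phase: called once per input record, emits a (key, value) pair.
def mapper(record):
    movie, rating = record              # toy records: (movie, rating)
    yield movie, rating

# Copy and sort: everything with a common key is gathered into one array.
def shuffle(pairs):
    for key, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield key, [value for _, value in group]

# Reduce phase: fold f over the array, carrying an accumulator.
def reducer(key, values):
    acc = values[0]                     # initial value x
    for v in values[1:]:
        acc = acc + v                   # f applied to accumulator and element
    return key, acc / float(len(values))   # average rating for this key

records = [("m1", 4), ("m2", 5), ("m1", 2)]
pairs = [kv for r in records for kv in mapper(r)]
print(dict(reducer(k, vs) for k, vs in shuffle(pairs)))
# {'m1': 3.0, 'm2': 5.0}
```

The point of the framework is that the shuffle step and all the plumbing around it are done for you, across machines.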
Given this model, map is implicitly parallel: you can take your whole input data set, break it up into a bunch of pieces, run them all in parallel, and bring it back together for reduce. The order of application of the function does not matter in the mapper, because you're breaking the data up and doing it in parallel, so you don't have to worry about doing things out of order. Reduce is executed in serial on a single node, because it's computing across the entire data set, whereas map is just computing on pieces of it. And Hadoop takes care of that huge list of stuff at the bottom, so you don't have to worry about any of it, which is really nice.

All right, here's a picture of what the workflow looks like. You start with your data source, you break it up, you do the whole map phase, then you bring it all back together. This picture is kind of misleading because it shows three reducers; in reality there's only one serial reducer.

So when you run a job in Hadoop, you basically upload your job to a master node. The master node keeps track of all the nodes in the cluster and assigns tasks and data to each node. It also hosts an HTTP job tracker, so you can follow your jobs, watch the progress, and go back and view previous jobs. And it queries each node and kills any task that does not respond, re-batching the killed task out to the next available node. Here's an example of the job tracker. We've spent hours right here watching these progress bars go by slowly. You can see there are 104 map tasks, because we had 100 processors and wanted to get as close as possible to 100. That's basically what the job tracker looks like.

Okay, so what we've talked about so far is taking an input data set, running it through the two phases, and spitting out an output data set. There's also a facility called distributed cache, where your mapper and reducer can compute against a second data set. Distributed cache is pretty much a network file share: you store large read-only files there, and your mappers and reducers receive a pointer to where the files are stored in distributed cache. This is a really important one: you have to copy the file out of distributed cache into RAM on each node, so you create a hash map in your configure method, which runs before your mapper or reducer. I had a job where I didn't do that. It ran for like five and a half hours and then crashed the whole cloud. It's a bad idea. You want to try to avoid that.

So now we're going to talk about streaming. Streaming is really cool. It's an interface that uses standard in and standard out to stream the input and output to each node, so it gives you the ability to port mappers and reducers to any language that can read from standard in and write to standard out. Input is read from standard in; here's an example in Python, where the input variable is sys.stdin run through a read_input function. Output is written to standard out, and it has to be output as a hash map, which is basically a string in the form key, tab, value. Here's another example: it's a string variable, tab, string variable, passing in the key and the value. The streaming utility packages all your files into a single jar, which is sent to all the nodes, and any distributed cache files you want accessible to your streaming job are symlinked into the current working directory on each node, so you have direct access to any distributed cache file.
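Put together, the smallest possible streaming mapper is just this shape. It's a sketch of the idiom, assuming tab-delimited input, not any of our actual jobs:

```python
import sys

# The streaming contract: lines in on standard in,
# "key<TAB>value" lines out on standard out.
def read_input(stream):
    for line in stream:
        yield line.rstrip("\n")

for line in read_input(sys.stdin):
    parts = line.split("\t", 1)         # assumes tab-delimited input
    if len(parts) == 2:
        key, value = parts
        print("%s\t%s" % (key, value))  # identity mapper: re-emit as key/value
```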
Here's how you would run a streaming job. You call the Hadoop jar, you call up the streaming interface, and you pass all these flags: input, output, mapper, reducer, any files you want packaged into the jar, any cache files you want symlinked into the working directory, and a whole bunch of job configuration flags. Here's an example. The training set is the input data set, and NetflixOutput is the output directory. The mapper is called from the current working directory on each node, so in order to call the Python mapper, you have to include that file in the jar so it's accessible from right here. And you can access any executable: if there's grep on each node you can access grep, you can access awk, you can access basically any executable from right here. And for a cache file, you can see we access a file in HDFS and symlink it to movie_titles.txt, so it's right there, available from the code.

Reporting in streaming: since you're using standard in and standard out to handle your data, you have to use standard error to communicate with the master node. Since the master node kills any job that doesn't report back, you have to emit a counter after every line you process, and you do that by writing a "reporter:counter:" call to standard error, followed by your job name, the phase (mapper, in this case), and then a comma and a one. You can also use status messages to track errors in the log files.

All right, so since Hadoop is an implementation of Google's MapReduce framework, everything in Hadoop has a Google equivalent. Hadoop uses HDFS, the Hadoop Distributed File System, which is the equivalent of the Google File System. It's highly fault tolerant and runs on low-cost hardware, and you get high-throughput streaming access to your data, which is really nice for this framework. This one's really important: data is split into 64 MB blocks and replicated in storage. That 64 megs is important because of how mapper tasks get created. If you have a 65 meg file, it's going to get split into a 64 meg block and a one meg block, and you're going to create two mappers: one runs at full capacity, and the other finishes really fast and is really inefficient. So you really have to consider the input data that you're looking at, how many files to split it into, and what size you want to make each file.

HBase. All right, this is the equivalent of Google's BigTable, and it's a non-relational database. That's a really tough concept for all of us to grasp; we grew up with relational databases. It's not built for real-time querying: you're not trying to run your website and pull users' login information out of HBase. You're moving away from per-user actions, where you say "take this user, link it to a bunch of tables, and pull me a field out of it," toward per-action data sets, where you say "show me all of the users who did this, give me that huge giant data set, and run it through this MapReduce program." HBase is distributed and multi-dimensional, and the data is denormalized, so you're not worrying about BCNF or anything like that. You're not trying to strip it down; you just store everything you need, with links to anything else, everywhere. That's why it's really hard for us to grasp, because we grew up with relational databases.

Okay, so the table schema defines your column families. HBase is like a database of a bunch of other databases. You have a row key and then a column family, and inside your column family, at that cell, you have a bunch more columns. Then coming down off of that on the Z axis you have versions, so you can have timestamps. For Google, when they scrape a web page and dump it into their search engine, they grab it today; when they scrape it again tomorrow, they drop it down the Z axis to the next version. So they can go back through time on the same data set and see the changes over time. Everything in the table, except for the table name, is stored as a byte array, which works with HDFS for efficiency.
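One way to picture that schema is as a three-level map. This is just a toy model in Python to show the shape of it, not the actual HBase API:

```python
# Toy model of an HBase table:
#   row key -> "family:qualifier" -> timestamp -> bytes
# Real HBase stores everything except the table name as byte arrays.
table = {}

def put(row, column, value, ts):
    table.setdefault(row, {}).setdefault(column, {})[ts] = value.encode()

def get_latest(row, column):
    versions = table[row][column]
    return versions[max(versions)]      # newest timestamp on the Z axis wins

# Scrape the same page on two different days: same cell, two versions.
put("com.example/index", "content:html", "<html>old</html>", ts=20090401)
put("com.example/index", "content:html", "<html>new</html>", ts=20090402)
print(get_latest("com.example/index", "content:html"))   # b'<html>new</html>'
```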
This is Amazon's Elastic Compute Cloud, EC2. They have a web service where you can go and purchase a resizable compute cloud in their data center, and Hadoop is packaged as a public EC2 image, so it's really easy to set up. You go to the web page, choose how many nodes you want on your cluster, click Hadoop, click load, and it basically dumps you into Hadoop and you can just start running jobs. If you want to learn more about that, you can go to their website; there's a link right there. And here's some of their pricing, just as an example: they charge 10 cents per hour for a basic small Linux node, and that's per node, so if you wanted 100 of them, it'd be 10 cents times 100. You also have to pay for storage and for accessing your data.

All right. So the first project we used to really learn about Hadoop was the Netflix Prize. It was a competition that Netflix put out: if you could beat the movie recommendation algorithm Netflix was using in 2006 by a 10% or better RMSE, they'd give you a million dollars cash. They gave you a public two-gig data set of movie user ratings. Here are some stats on the data set we used. There are 17,770 movies in the data set, and each movie got its own text file, so you've got that many files. The movie IDs ranged from 1 to 17,770 sequentially. Customer IDs had gaps in them; there are about 480,000 customers giving ratings for movies. Ratings were an integer from one to five, and it also gave you the date rated. Down here in the bottom left: the first line of every file was the movie ID followed by a colon, and the whole rest of the file was customer ID, rating, and date rated.

Okay. So with the default input data set, if you take that straight off the download and dump it into Hadoop, you're going to create 17,770 mapper tasks, because it creates one per file, which is horribly inefficient on a 100-node cluster. So we needed to optimize the number of files to the number of mappers available. Go back again to that 64 meg split: since the whole set is only two gigs, splitting wasn't really an issue for us, because two gigs spread over a hundred-odd files is well under 64 megs a file. So we just made 104 files for use on 100 procs; none of the 104 got split, because they were all less than 64 megs. That ensured all the mappers were utilized, optimizing file input and output.

All right. So this is the reorg script we used. Hadoop is all text based; most everything you do with it is just giant text files, so awk is really useful, and it's really easy to write. Three lines of code reorganized our entire 17,770 files, and then it was real simple to dump them all into the 104 files.
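For anyone who doesn't read awk, the same reorg idea looks like this in Python. This is a sketch of the transformation, not our actual three-line script:

```python
import sys

# The raw Netflix files start with "movieID:" on the first line, then
# "customerID,rating,date" lines. Fold the movie ID into every line so
# each line is self-contained by the time a mapper sees it.
movie_id = None
for line in sys.stdin:
    line = line.rstrip("\n")
    if line.endswith(":"):                  # header line, e.g. "8:"
        movie_id = line[:-1]
    elif movie_id is not None:
        print("%s,%s" % (movie_id, line))   # movieID,customerID,rating,date
```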
So here's the efficiency gained by reorganizing. This is for Netflix 1, which was one of the smaller programs that we wrote. It took 43 minutes and 27 seconds to run with the default data set; just by reorganizing the input data set, you see a 400 percent increase in efficiency. And even Python and awk against the reorganized data still run faster than against the original data set. We'll talk in a minute about how Python and awk incur a bunch of overhead because they use the streaming interface, whereas the Netflix 1 on the reorg was implemented in Java, so you don't have that overhead.

So the first program we wrote for Netflix was a program to produce statistical information about each movie in the data set. It took every movie from the reorganized data set and produced the first date rated, the last date rated, the total rating count, and the average rating for each movie. So as you can see, we took the reorganized training set, and from the mapper we output the movie ID as the key and the rating and date rated as the value. It produces one key/value pair for each individual movie rating: for each rating a customer gave a movie, you get one key/value pair under that movie ID key. And now we're going to pull up some code.

Here's what a mapper looks like in Java. All right, so here's a mapper that I wrote in Java for Netflix 1, and all this code is available on our website. I normally don't do this much commenting, but it helps when you have no idea what's going on, like we did when we started; there really wasn't any sample code for Hadoop out on the internet back then, so we just kind of tried it and figured it out. Your input comes in as a string variable: you get a line. When Hadoop passes something to a mapper, it breaks it on line breaks, so your map method gets single lines out of the text file. You want to write your map method to deal with single lines of your input data set, which means your whole input data set has to be schemed out so everything you need is there line by line. Then you just make a tokenizer. We tokenized on the comma, because the Netflix data set had all the values separated by commas, and then you can iterate through, grab all the different tokens, and put them in individual variables. So you take the first token, which is the movie ID, and set that as the key for our hash map here. You use output.collect to pass your key/value pairs out of the mapper to the reducer. And then your value is this ratingAndDate variable, which took the rating and date tokens out of the line: the rating, comma, the date, dumped into one variable, made the value for each movie ID.

And here's what that same program looks like written in Python. Python is a little bit different because you're using the streaming interface. So I wrote a method to read the input: you get your input as standard in, which is basically the whole file, so I wrote a generator to go through it and just yield one line at a time. Right here you can see I pass sys.stdin to my read_input function, so I get one line at a time. Then for each line in the input, you split it by the comma, and right here I'm just checking to make sure I have the right number of values. The key is going to be your movie ID, and you get the rating and date from the line split. Same thing right here: you print string, tab, string, passing in the movie ID and the value.
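Reconstructed from that walkthrough, the Python mapper is roughly this. It's a sketch rather than the exact file on our site, and the counter group name is a guess:

```python
import sys

def read_input(stream):
    # Generator over standard in: yield one line at a time.
    for line in stream:
        yield line.rstrip("\n")

def main():
    for line in read_input(sys.stdin):
        fields = line.split(",")          # movieID,customerID,rating,date
        if len(fields) != 4:              # make sure we have the right values
            continue
        movie_id, _customer_id, rating, date = fields
        # Key is the movie ID; value is "rating,date".
        print("%s\t%s,%s" % (movie_id, rating, date))
        # Keep the master node happy (explained next).
        sys.stderr.write("reporter:counter:pyNetflix1,mapper,1\n")

if __name__ == "__main__":
    main()
```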
And after every line, you have to print to standard error the reporter counter: "reporter:counter:pyNetflix1,mapper,1".

Now it gets real crazy when we do it in awk. Okay, so what we should have explained before those last two examples: the standard data set has the movie ID and a colon on the first line, and then every following line is user ID, rating, date rated. What our awk reorg script did was take that movie ID from the first line, add it as the first token on every line after it, and get rid of that first line. So every line was now movie ID, comma, user, comma, rating, comma, date rated. You really have to do that with streaming, because with streaming you can't access other lines in the file: you're taking input from standard in, one line at a time.

And here's that same mapper written in awk. Because awk is locally executable on each of the mapper nodes, the whole thing is two lines: tokenize on the comma, then print $1, a tab, and $3 and $4.

Here are some mapper comparison times. Java's best was eight seconds a node, average twelve. Python and awk incur a streaming overhead, because streaming has to add all these functions to translate a MapReduce job into standard in and standard out, so Python really doesn't do that great compared to Java. But awk, since it's a really fast executable, had a best of nine seconds and an average of fifteen. So that two-line awk file is pretty comparable to that large Java file you saw. And there's another really good trade-off with streaming: as you saw by how small those files were, we can write a mapper in under a minute, compared to Java, where we might sit down for an hour to write a mapper. So if you just have a job that you need to get out the door, and you don't care about the extra couple of minutes of efficiency, you can write a mapper and reducer in just a couple of minutes and get your job in and out the door. That's really the big trade-off with streaming.

All right, so then we wrote another program, Netflix 2, to calculate a lot of statistics based on the users in the data set instead of the movies, which was Netflix 1. For the Netflix 2 mapper output, user ID was the key, and the value was movie ID, rating, and date rated, with colons added in to separate them, so all you have to do is tokenize on the colon to pull those values back apart. You get one key/value pair per unique user ID, movie ID, and rating, and out the other end you get this huge hash map of all the statistics on each user: user ID as the key, and the value is the rating count, the average rating, the rating delay, and the movie/rating/date list. When you're outputting in the reducer, everything that came from the mapper keyed by user ID is your key, and your reducer gets a whole array of all the values for that specific user: all the movies they've rated, the dates they rated them on, and the ratings they've given. Then, in your reduce method, you can do all kinds of averaging, because of the recursive nature of it.
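The Netflix 2 mapper side of that is tiny. A sketch, under the same assumptions as the last one:

```python
import sys

# Netflix 2 mapper sketch: key on the user instead of the movie.
for line in sys.stdin:
    fields = line.rstrip("\n").split(",")   # movieID,customerID,rating,date
    if len(fields) == 4:
        movie_id, customer_id, rating, date = fields
        # Colons separate the value fields so the reducer can re-tokenize.
        print("%s\t%s:%s:%s" % (customer_id, movie_id, rating, date))
        sys.stderr.write("reporter:counter:pyNetflix2,mapper,1\n")
```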
So here's a Java reducer for Netflix 2. Okay, so you declare these variables up here, inside your reduce method: a rating total, a rating count, and a string line for your input. Then, while values.hasNext(), you're iterating through that array you get of all the values with a common key: tokenize each value, split it apart, pull the fields out, and assign them to variables. You add to the count each time, so every time reduce is called it increments the count, and at the end you have how many movies were rated by that single user. And the rating delay was just something we did to compute the difference between the date the movie came out and the date the user rated it, so you can tell whether users like old movies versus new movies based on their average rating delay. Then you call output.collect again with your key and value, and that writes it out to a text file. So what you get is a text file that's just a giant hash map, and since we used commas to separate all the fields in our value, you get a giant spreadsheet you can dump into Excel and turn into a graph or something like that.

All right, so next we're basically going to run through the same program again, but in Python. Python is a little different here: streaming still sorts your reducer's input by key, but it doesn't hand you the values grouped per key the way the Java API does, so you have to do that yourself. You need a couple of tools for that: groupby and itemgetter. I wrote the same kind of function for reading the input, just called read_mapper_output. Then, for the lines from the mapper, you have to group all the values together by each user key, and that's what this does: it gives you a user and an array of values grouped by the user key, basically the same thing Java gives you, except you have to do it yourself. Then you do the same thing: set some initializers, iterate through each value in the group, split it apart, add one to the rating count, add to the total ratings and the total delay, and after you've added them all up, you can do some averages and print the final output. And again, you have to write the "reporter:counter:" call to standard error, with the reducer phase and a one. If you don't do that, your jobs will get killed by the master node, because you have to report back every so often or the master node will think your node dropped off the network and will just re-batch your job out to the next available node.
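Here's the shape of that reducer in streaming Python. This is our sketch of the pattern, with the rating delay left out since it needs the movie dates from the distributed cache:

```python
import sys
from itertools import groupby
from operator import itemgetter

# Streaming hands the reducer sorted "key<TAB>value" lines; the grouping
# that Java reducers get for free has to be rebuilt by hand.
def read_mapper_output(stream):
    for line in stream:
        yield line.rstrip("\n").split("\t", 1)

def main():
    data = read_mapper_output(sys.stdin)
    for user_id, group in groupby(data, key=itemgetter(0)):
        rating_count, rating_total = 0, 0
        for _, value in group:
            _movie_id, rating, _date = value.split(":")
            rating_count += 1
            rating_total += int(rating)
            sys.stderr.write("reporter:counter:pyNetflix2,reducer,1\n")
        average = float(rating_total) / rating_count
        print("%s\t%d,%.2f" % (user_id, rating_count, average))

if __name__ == "__main__":
    main()
```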
Here's a reducer comparison: you've got Java running in 2 minutes 58 seconds and Python running in 8 minutes 45, so you can see there's a decent amount of overhead you incur using the streaming interface.

All right, so a couple of the other projects. These are all results of a class I took at ASU on cloud computing. One of the other students did an image processing algorithm, which is really interesting, because Hadoop is completely based around text files. It's not built for images, because all your data gets split on line breaks, and that's really not good for working with binary files. So Jeff Connor and Douglas Fuller extended the file input format interface to deal with images. Here are some of the results of their project running through Hadoop. The first image they did used a Canny edge detection algorithm; you can't see it too well in here because of how bright it is, but all these lines in here are outlined in the image. And this one was just a different edge detection algorithm. What they were working toward was a blob detection algorithm, so you could take massive data sets, like NASA imagery of the surface of the Earth, and find things that match a picture of something else. Hadoop's a really good framework for doing that.

So we needed a really good project to present on at DEF CON for MapReduce, so we went and got 405 users' complete walls from Facebook, and we ran them through a few different MapReduce programs. All the users were part of a college network, so you're really looking at a target group of college kids. The date ranges ran from November 1st, 2004, and we cut everything off at March 30th, 2009, so you're looking at about four years of Facebook wall posts. Across the entire data set you have 227,000 unique posts, and about 76,000 of those posts were status updates.

Here's a really cool graph. This right here is November of 2004, that's March of 2009, and this is the number of posts per day: the red is total posts, and the green is the number of status updates overlaid on that. You can see the giant ramp-up here in August of 2006, where a ton of people joined the network. When Facebook introduced the ability to post statuses, it didn't really catch on; then they introduced the ability to comment on people's statuses, and all of a sudden people went crazy with it. And this giant line here, where it breaks 600: this was really fun, just trying to figure out what all these different numbers mean. What we found was that two days before that, Facebook released the Facebook for iPhone app. So that's what we believe this line is: it took two days to hit network saturation, for all the users in our data set to grab that app; they used it for a day, and then it dropped off really fast.

So if you were to take all the people on your friends list, and you wanted to get the maximum amount of exposure for something, let's say you made a video of your DEF CON speech and you want the most people on your Facebook friends list to see it, where would you post it? Here's the posts-per-day giant pie chart. From our network, all the people we grabbed, you can see that 49% of activity happens on Monday, Tuesday, and Wednesday, so you'd probably want to post something right in here and ride this curve around, rather than this curve. So we tried to figure out exactly when we would post it. Here's zero through 23 hours of the day, and then you have your days of the week, so this is an hourly breakdown, by day of the week, of the number of posts. What you really want to ride is this giant red curve up here: if you post about here, you're going to ride that all the way through and grab the most exposure; if you post something here, like Thursday at 4 p.m., you're going to ride this curve down and miss out on all that exposure. So a lot of the implications of this are really about micro-marketing: knowing exactly who your target audience is and how they interact with people. At ASU they do a lot of human-computer interaction studies; what we're trying to do is human-computer-human interaction studies, how people interact with each other through a computer.
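The jobs behind those graphs are simple once the walls are in flat text. Here's a sketch of the kind of mapper we mean; the input layout is invented for illustration, since our scraped format isn't published:

```python
import sys
from datetime import datetime

# Mapper sketch: bucket each wall post by day-of-week and hour.
# Assumes one post per line, with an ISO timestamp as the first
# comma-separated field, e.g. "2009-03-30T16:05:00,<post text>".
for line in sys.stdin:
    stamp = line.split(",", 1)[0]
    try:
        posted = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S")
    except ValueError:
        continue                        # skip malformed lines
    # Key like "Mon,16"; a reducer that sums the 1s gives you the
    # day-of-week and hour-of-day breakdowns.
    print("%s,%02d\t1" % (posted.strftime("%a"), posted.hour))
```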
You want to talk about it? Sure. So we decided to run word count, a fun little program. If you go look up all the documentation on Hadoop, like we did when we were in the class learning about it, word count is the only example you'll find on the internet anywhere; it's the only piece of code you'll find. And all it does is map every word in an input data set to the value one, and then the reducer just adds up all the values, so you literally just count the number of times each word was used. Here's the output of word count from our data set. As you can see, it's basically "is a happy happy birthday": you could open any file and you'd see "happy birthday" at least ten times. So we went through and ran word count. We were going to try some regular expressions to find every single phone number in the data set, but as it turns out, once we ran word count, since the output is already sorted, they were all right there, all lined up. We basically had like 40 or 50 numbers that you can't even find on people's info pages, so that's a real cool way to get a bunch of numbers and emails and stuff like that. We found a bunch of crazy words. We found one post that had a /bin/bash in it. Just a bunch of random things.

So yeah, we just want to say thanks to ASU for letting us use their cluster and teaching us about all this stuff. The class was taught as a joint collaboration by three people: the CTO of ASU, Dr. Adrian Sannier; Dr. Dan Stanzione, who runs the whole cluster for the school of engineering; and Dr. Raghu Santanam, an associate professor in the business school. The class was really interesting because it was a giant mixture: there were about 20 grad students and 10 undergrads, and half of them were business students and half of them were engineers, so it was just a really interesting forum of a bunch of people with a bunch of different ideas trying to figure out how to deal with Hadoop. So thanks for coming, enjoy the rest of DEF CON, and check out our website.
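For reference, the word count described above fits in a few lines of streaming Python. A sketch of the idea, not the canonical Hadoop example itself:

```python
import sys
from itertools import groupby
from operator import itemgetter

# Usage sketch: wordcount.py map | sort | wordcount.py reduce
# (the shell's sort stands in for the framework's copy-and-sort phase)
def mapper(stream):
    for line in stream:
        for word in line.split():
            print("%s\t1" % word.lower())   # every word maps to a one

def reducer(stream):
    pairs = (line.rstrip("\n").split("\t") for line in stream)
    for word, group in groupby(pairs, key=itemgetter(0)):
        print("%s\t%d" % (word, sum(int(n) for _, n in group)))

if __name__ == "__main__":
    (mapper if sys.argv[1:] == ["map"] else reducer)(sys.stdin)
```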