So, this first talk for today covers MapReduce and a little bit of Hadoop, which is an implementation of MapReduce. I told you there is this thing called MapReduce, a cool technology for doing computation in parallel whose underlying ideas were invented quite long ago. By the way, as I mention here, some of the slides are taken from talks which are available on the web; I have acknowledged a few of those here. So, what is MapReduce? It is basically a platform for reliable, scalable parallel computing. When MapReduce was first introduced, reliability was not one of the requirements; it was just a simple way of expressing how to do something in parallel. There have been many, many attempts over the years at specifying how to do things in parallel; parallel computers have been in use for at least 40 years now, so there were many different paradigms for expressing how to parallelize a task. One paradigm, for example, was to have a Unix process which forked; each child process did some work, and eventually they joined again, meaning that when a child was done it told the parent process "I am done". As an extension, the processes could be started on remote machines: one machine starts off all these processes on remote machines, the central process pushes data to them, they work on that local data, and then the results are gathered and output. In fact, this is what actually happens today as well, but one of the important things is that all the details of partitioning the data, repartitioning it, and starting up these processes are hidden by the MapReduce infrastructure, which lets you specify just what you want computed; everything else is done by the prebuilt infrastructure. The infrastructure not only simplifies the programming, it also allows effective handling of failures: when you have thousands of machines running in parallel, something will fail.
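As a toy illustration of that fork-and-join pattern (a hypothetical sketch, not part of MapReduce itself), here is the same idea using Python's multiprocessing module: the parent partitions the data, forks workers, and joins their partial results.

```python
from multiprocessing import Pool

def work(chunk):
    # Each forked worker operates on its own partition of the data.
    return sum(chunk)

if __name__ == "__main__":
    data = list(range(100))
    # "Fork": the parent starts workers and pushes one partition to each.
    with Pool(4) as pool:
        partials = pool.map(work, [data[i::4] for i in range(4)])
    # "Join": the parent gathers the partial results when the workers finish.
    total = sum(partials)
    assert total == sum(data)
```

The point of the abstraction in MapReduce is that exactly this partition-push-gather machinery is what the infrastructure hides from you.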
The infrastructure will mask this failure and let some other machine do the computation the failed machine was doing, so the computation as a whole succeeds even though underneath the hardware has failed. So, that is the idea: it abstracts away issues of the distributed and parallel environment, including failure. Now, the current implementations of MapReduce actually work on what is called a distributed file system. What are distributed file systems? If you use a machine today, it has a disk, and whatever data is on the local disk is part of your file system. There are some extensions which let you share the file system, meaning you can mount a file system from a file server and access it locally, but this is not very scalable: there is a single file server, and every request for a file goes to that file server and comes back. That file server will get beaten up if you have thousands of machines asking it for files; it does not scale. Instead, what these distributed file systems do is spread the files across thousands of machines: each file itself is distributed across multiple machines. So, if I want to access a file, it could be on any one of these thousands of machines, and the distributed file system is in charge of figuring out where the file is and fetching it for me when I read from it. That is the basic idea. Again, the idea of distributed file systems is very old. There were projects way back in the early 80s which showed how to build such file systems. That era was actually really cool, because much of the technology we are using today, what MapReduce is, how you build distributed file systems, how they operate in spite of failures, comes from it.
A lot of that technology was invented in the late 70s to mid 80s. So, what we are doing today is using many concepts from back then, but at a scale which was never attempted back then, which also raises some new issues. Today there are several such file systems which are very widely used. The Google file system is a distributed file system used within Google; it is not available outside. The Hadoop file system is a similar distributed file system which is actually available in the public domain: it is open source, you can go get it. There are a few more; there is the Kosmos file system (KFS), which is also like the Google file system. So, let me tell you a little bit about these file systems before I get into the MapReduce paradigm itself, so that you can understand better how these things work. First of all, these are highly scalable distributed systems: Google and Yahoo routinely run computations with 10,000 processors. There is a data center with many racks, each rack holding, let us say, 40 to 100 machines, and there are tens, maybe hundreds, of these racks in the center. So you have 10,000 machines running in parallel, with the files distributed across all of them. How many files can you store at this level? 100 million files is nothing out of the ordinary at this scale. How much data? 10 petabytes is nothing out of the ordinary. How much is 10 petabytes? A petabyte is 1000 terabytes, so 10 petabytes is 10,000 disks of a terabyte each. Each of these 10,000 nodes has multiple disks aggregating to a few terabytes, and you have 10,000 times that much. So, the scale of data is absolutely enormous. The next thing to note is that in an earlier era, anything which had to be up a lot meant that you bought a very reliable mainframe system from IBM; that was the original era.
In the next era you bought a very reliable computer from Sun, or for that matter IBM; mainframes were still around, but these were regular Unix servers from Sun and IBM, very well engineered products. We have had some of those run nonstop; I personally have seen several of them which never gave a single problem in about 7 or 8 years of service. Eventually they died, not because their hardware was bad, but because the dust level we have here ruined their fans and cooling systems. If they had been in a less hostile environment than a room in IIT Bombay, they would probably have lasted another 10 years; they are very good machines. But they are very expensive, and if you need 10,000 such machines you are going to go broke. So what Google did, and their engineering really is fabulous, is their teams figured: let us not do this; let us buy the cheapest machines out there and put up 10,000 of them. Well, cheap machines are cheap for a reason: they are not as well engineered as those highly reliable systems, and they are going to fail. You know that if you buy a cheap thing in the market today, you expect it to fail very soon; in fact, there is a good chance it will fail even before you use it. Somebody I know said that in an earlier era it was "buy it, use it once, and throw it"; in the current era of really cheap goods you buy it, it does not work, you throw it, and then you buy another one. So it is buy-and-throw, not use-and-throw. Well, Google did not use such really bad computers; they bought reasonable ones, but those too are going to fail, and Google's engineers' brilliant idea was: let them fail. We will build a software layer on top which deals with all these failures and handles them.
And the Google file system is an example: it is spread over tens of thousands of machines, and yes, some of them are guaranteed to be down at any point in time. How can the file system work if a machine is down? What about the data on that machine? The simple answer is that the Google file system keeps every piece of data on at least three machines, so even if two of them are dead, the third one is going to be alive and can process your request. Replication is really core to this: just as we saw with parallel databases, with the distributed file system too the files are replicated. Now, a distributed file system is a platform on which we can run these parallel operations, because each of those 10,000 computers is not just storing data, it is also running code; it is not sitting idle. When a piece of code runs on a machine, it can access data which is local, but that is not a constraint: it can very well access data from anywhere in the file system, which means that if any of the other 10,000 machines has the data it needs, it can fetch it. Basically, there is a single namespace. Think of a UNIX file system: there is a namespace where you say /home/yourusername/somedirectory and so forth. Here they have a single namespace for all the files in the entire 10,000-node cluster, so it is easy to refer to files, and they all look like they are locally available, but they are actually brought in on demand as required. There are some issues of coherency: what if two people write at the same time to a single file? This can cause problems, and they have a very nice solution. Basically, there is usually a master for a file, or for a piece of a file, and every request to write to that file is sent to that master, so it becomes the serialization point for writes to it.
Moreover, if you have many people writing to the same location of the same file, they will clobber each other all the time. So, the basic way in which these files are written is by appending data, which makes a lot of sense. In fact, if you look at most of these applications, files are of one of two forms. One kind is something which is created once and never modified, like a document or a web page which is crawled: it is read from the web, stored in a file, and never ever updated. That is write-once. The other kind, like log files, is where you keep appending data. These are pretty much the only two kinds of files in these systems, write-once or keep-appending, so those are the two operations which are supported. Now, a file itself may be extremely large, so a file system like Google's breaks up the file into pieces. The pieces are big, not small: each piece is something like 128 megabytes; for really large files you need such big pieces. This chunk of 128 megabytes is the unit which is replicated on several machines. Now, when a client wants to read a file, it will talk to some servers to find out where all the pieces of that file, the different blocks, reside, and then it will go fetch them directly from those machines. So it is very parallel, except for one part, which is finding out where the file resides; for that you have to go somewhere, and even that is parallelized in some clever ways. Overall, everything runs in parallel without going through any central location for the data itself. Now, the Hadoop file system, HDFS, is like the Google file system, and for this I managed to get a figure from the web which shows how it works. You have all these data nodes, which are all individual machines, and if somebody needs to read data, the first step is to go to a master called the name node and say: here is a file name, tell me where it is located.
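To make the chunking concrete, here is a small sketch, assuming the 128-megabyte chunk size mentioned above, of how a byte offset in a large file maps to a block index and an offset within that block, and how many blocks a file needs. The numbers and function names are just for illustration.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # the 128 MB chunk size from the lecture

def locate(offset):
    """Return (block_index, offset_within_block) for a byte offset."""
    return offset // BLOCK_SIZE, offset % BLOCK_SIZE

def num_blocks(file_size):
    """Number of chunks needed to store a file of the given size
    (the last chunk may be only partially full)."""
    return (file_size + BLOCK_SIZE - 1) // BLOCK_SIZE

# A 1 GB file occupies 8 chunks of 128 MB each.
assert num_blocks(1024 ** 3) == 8
```

Each of those chunks is then replicated independently, which is why a single huge file can be read from many machines in parallel.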
What the name node replies with are the block IDs and the machines where the blocks are located. The client gets this metadata, which is a small amount of data even for a very large file, at which point it goes and fetches all the blocks from the corresponding nodes: very parallel access. So, the name node is the master; it basically keeps metadata: what the file names are, where the files are located, and where the replicas of each file are located. The client gets this metadata and then does the reads from the data nodes. What the figure shows is that the data sits on many machines in a rack, and there are multiple racks. Now, racks are often subject to coordinated failure; sometimes, say, the switch dies. So if a node here has a certain block of data, the replica of that block is actually placed on a different rack. Now, this works for certain kinds of computations, but for your email, what if the entire data center storing your email goes down? You log into Google mail or Yahoo mail, you cannot access it, and you are not going to be happy. In fact, what they do is keep a complete replica of the data center itself somewhere else, so even if this data center dies you can still read your email. As a result, they give very high availability. This is very interesting: high availability was once the preserve of a few critical applications, like banking; the rest of the people did not have high availability. Today we take it for granted. If Gmail is down for one hour, it is in the newspapers, which is incredible: here is a service which runs 24 by 7, 24 hours a day, 7 days a week, every week of the year, and in the whole year it probably fails for a few hours, and even that is reported in the papers; that itself is viewed as surprising. The same goes for Yahoo mail or any of the other mail services: they have tremendously high availability.
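The rack-aware placement just described can be sketched as follows. This is a toy model of the policy (one copy on the writer's rack, the remaining copies on a different rack, so a whole-rack failure such as a dead switch cannot take out every copy), not the actual HDFS placement code; all names here are made up for the sketch.

```python
import random

def place_replicas(racks, local_rack, copies=3):
    """Pick machines for `copies` replicas of a block.

    `racks` maps a rack id to the list of machine names on that rack.
    One replica goes on the writer's own rack (cheap, local write);
    the remaining replicas go together on one different rack, so no
    single rack failure loses all copies of the block.
    """
    first = random.choice(racks[local_rack])
    other_racks = [r for r in racks if r != local_rack]
    remote = random.choice(other_racks)
    rest = random.sample(racks[remote], copies - 1)
    return [first] + rest
```

With three copies placed this way, even losing an entire rack still leaves at least one live replica, which is exactly the property the lecture's figure is illustrating.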
And what is interesting is that all of these run on tremendously unreliable hardware underneath, and all of them have a software layer above which masks all these hardware failures and gives this very high availability. So, that was a detour into distributed file systems; now let us look at MapReduce. Initially, whatever data you wanted, a huge amount of data, all the web crawl or whatever else, is stored in multiple files across this distributed file system. Now, let us take a very, very simple task to illustrate what MapReduce does. The task is this: given a number of documents, I want to find how many times each word occurs across the entire collection. This is a toy; it is not very useful normally, but because it is a toy we can show how to do it very easily. So, first of all, how do you do it in parallel? It is a count problem. If you remember, we saw how to do a count on a relation in SQL: we had the rows partitioned, we computed the counts per shop ID locally, and then gathered them and added them up. The same idea works here. You divide the documents amongst the workers. The documents are physically on the file system, but they are processed in parallel by all the workers: each worker reads whatever documents are assigned to it, finds all the words, and outputs word-count pairs, that is, how many times each word occurs in that document. In fact, in the MapReduce example the computation is simplified even further, to something really dumb which they would not do in reality, just to make it easier to see how it is parallelized. The idea is that instead of outputting one word-count pair per document, each time they see a word they output the word and 1. A count of 1 is a bit silly; the obvious optimization is, for each document, to output how many times the word occurs in it. But even without that optimization, "see a word, output (word, 1)" is very easy to write.
Now, the output of the first phase is a set of word-and-count pairs, which are per document, or just (word, 1) pairs, one for each occurrence. Next we have to group by word and add up these counts, an obvious SQL operation, and we saw how to do that: we partition the data and do it in parallel across many machines. That is what the second part of MapReduce does. In this particular example, the word-count pairs are partitioned by word; the group-by key, the partitioning key, is the word. Partitioned by word means all pairs for a particular word will land on one machine, and you do this partitioning across 10,000 machines. Now each machine has its collection of data, and it does a local group-by and then sums up the counts. So, this is actually an operation which you could pretty much write in SQL, except for one thing: given a document, how do you parse it and get the word counts? That is not something you can do in SQL easily. Although, if you add the ability to call user-defined functions from SQL, you could actually do it with an extended SQL. In fact, some systems, like the Aster Data system which I mentioned earlier, have a MapReduce built inside SQL using user-defined functions. So MapReduce and parallel query processing in SQL are actually very closely related; we will come back to this later. So, that was the intuition for how to do the word count. MapReduce is a paradigm, meaning the infrastructure lets you do two operations, one called map and one called reduce, but these are not actual operations; they are kind of meta-operations. You, the programmer, have to specify what the map function does: you have to give an implementation of the map function, and you have to give an implementation of the reduce function.
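The partitioning step above can be sketched like this: a stable hash of the word, modulo the number of machines, guarantees that every pair for a given word lands on the same machine. This is a toy sketch; real systems plug in their own hash partitioners. (Python's built-in hash() is randomized across runs, so the sketch rolls its own tiny stable hash.)

```python
def partition(word, num_machines):
    """Route a word to a machine index in [0, num_machines).
    Every occurrence of the same word hashes to the same machine,
    so its counts can all be summed locally there."""
    h = 0
    for ch in word:
        h = (h * 31 + ord(ch)) % (2 ** 32)  # simple stable string hash
    return h % num_machines
```

Because the function is deterministic, the 10,000 mappers do not need to coordinate at all: each one independently computes the same destination for the same word.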
The map function is essentially preprocessing: it outputs the groups along with some associated value; the output of map is a group and a value. The reduce operation takes all the elements in a particular group and aggregates them; reduce is basically aggregation. So, that is basically MapReduce, but the difference from SQL is that you, the programmer, write the map and the reduce functions, and you can write them in C code or Java code; in Hadoop it is Java. I think Google's MapReduce may be C++ based, I am not sure. This came originally from functional programming languages like Lisp, long ago. The input to this is a set of key-value pairs, and there are two functions. The map function takes a key-value pair and outputs a list of other key-value pairs. In our word counting, the initial key-value pair is a document ID and the document: the document ID is the key, and the value is the actual document content. The map function takes the document ID and the document content and outputs a list of (k1, v1) pairs. What is k1? k1 is a word, and v1 is the count of that word in the document. So for one input document it outputs a number of (k1, v1) pairs; it is a list of those pairs. The reduce function is basically the aggregation function: for a particular word, it is given a list of all the values. Think of it as if the grouping has been done, and for a group you have the multiset of values; here it is called a list, so you have a list of values. What the reduce function outputs is a single value for that particular key. In our word counting, the group-by is done on the word, and the reduce function adds up the counts it has got for that word and gives one final count. And of course, this is done in parallel for different words on different machines.
So, map and reduce can both be parallelized. In between the map and the reduce, the infrastructure has to do something very important: each machine's output list has many different words in it, but all the counts for a given word have to end up on one machine. So in between these two steps the infrastructure has to repartition the data in those lists, and that is an important part of the MapReduce infrastructure. The infrastructure takes care of everything underneath; you, the programmer, have to define these two functions, plus a little bit more, such as how the initial data is stored and who should work on what part of the data. Once you have done that, the infrastructure runs it. What if there is a failure? If a particular machine handling a set of documents failed in the map phase, what do you do? Well, you already know what map tasks that machine was supposed to do, so some other machine does that work: once this one fails, whatever partial work it did is thrown out, and you start afresh on another machine. So you can deal with failures in map, and similarly for reduce; we will come to this later. The paradigm works out very nicely with respect to handling failures. So, let me show the map and the reduce tasks again, this time pictorially. What do we have here? The initial keys and values, k1 through kn with their values; these were the document IDs and the document contents. The map produced a list of word-and-count pairs from each of these documents, so there is a whole collection of these pairs distributed across all the machines. Now, in the reduce step, the infrastructure takes this collection of intermediate key-value pairs, which in our example are word and local per-document word count, and does the grouping.
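Putting the pieces together, here is a toy single-process sketch of the whole map-shuffle-reduce pipeline for word count. All the function names are made up for the sketch; the real infrastructure performs the same three steps, but spread across thousands of machines and with failure handling.

```python
from collections import defaultdict
from itertools import chain

def map_fn(doc_id, contents):
    # The per-occurrence map from the lecture: emit (word, 1) each time.
    return [(word, 1) for word in contents.split()]

def reduce_fn(word, counts):
    # Sum up all the partial counts gathered for one word.
    return word, sum(counts)

def map_reduce(inputs, mapper, reducer):
    """Toy single-process MapReduce: run the mapper over every input
    pair, shuffle (group the intermediate pairs by key), then run the
    reducer once per distinct key."""
    # Map phase: each (key, value) input produces a list of pairs.
    intermediate = chain.from_iterable(mapper(k, v) for k, v in inputs)
    # Shuffle: the repartitioning the infrastructure does in between.
    groups = defaultdict(list)
    for k1, v1 in intermediate:
        groups[k1].append(v1)
    # Reduce phase: one reducer call per key, over that key's value list.
    return dict(reducer(k1, vs) for k1, vs in groups.items())

docs = [("d1", "to be or not to be"), ("d2", "to do")]
counts = map_reduce(docs, map_fn, reduce_fn)
# counts == {"to": 3, "be": 2, "or": 1, "not": 1, "do": 1}
```

Note that the only problem-specific code is map_fn and reduce_fn; everything inside map_reduce, the part a real system distributes and makes fault tolerant, is generic.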
So, for a particular word, say the purple one, it gets all the counts together and creates a list; similarly for the yellowish word here it gets its set of values, and so forth. For each word it collects together all the partial counts and then calls the reduce function. This grouping is done by the system; you do not control it, well, you control it only in the sense that you have specified what the key is, and the grouping is by that key. The reduce function now takes all the values collected together, does the aggregation, and finally outputs the result. That reduce function is provided by you, the programmer. So, in a nutshell, that is what MapReduce is, but the reduce function can do very complex work beyond what the SQL aggregation functions do. Again, like I said, the SQL people have not sat quiet. In fact, for a long time SQL databases have provided user-defined aggregate functions which let you do aggregation in your own C code; PostgreSQL supports this, for example. So it is actually fairly easy to take a reduce function and stick it into the database, and thereby you can do MapReduce in a database today, at least on Aster; others will catch up for sure, they are not yet there. So, I have been talking about map and reduce in an abstract sense; for the word count problem, here is exactly the specification of the map function. The map function takes an input key and an input value, both strings. Many of the MapReduce systems just treat data as strings for simplicity, and of course data as strings is not a good idea for many applications, so these will evolve to allow more complex data types. But assuming they are just strings, what the map function does is: for each word w in the input value (the input value here is a document), you go through the contents of the document, breaking it up into words as you go along, and every time you find a word you EmitIntermediate(w, "1"), where 1 is the count.
So, this step is not even bothering to count how many times the word occurs in a document; the moment it finds a word, it outputs (w, 1), which creates more network load. This is not a good idea from a performance perspective, but it is correct. Now there are all these collections of words, each paired with just the value 1. The reduce function then gets a word with all the associated counts: for each v in the intermediate values. What are these intermediate values? In our particular example, a list of counts. It is an iterator, which means you can step through those counts, but again the data type is a string. So what it has to do is take each count, which is stored as a string, call ParseInt, meaning convert the string to an int, and add that value to the result. Once this for-loop has gone over all the counts for a particular word, it does Emit(AsString(result)): the result is an integer, so it converts it to a string and emits it. Now the system has a collection of words with the final count for each word, again represented as strings. Where does this output go, and where does the input come from? These are two important questions, and all of these MapReduce systems take their input from the distributed file system and put their final results back in the distributed file system. That is how they work. The intermediate results, however, they typically do not put in the file system; they keep them locally, repartition them, and then make them available to the reduce function.
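The obvious optimization mentioned above, counting within each document before emitting anything over the network, can be sketched like this. (Hadoop generalizes this idea as a "combiner" that runs after each map task; the function name here is made up for the sketch.)

```python
from collections import Counter

def map_with_combiner(doc_id, contents):
    # Count locally within the document and emit one (word, n) pair
    # per distinct word, instead of (word, 1) once per occurrence.
    # The reduce function is unchanged; it just sums fewer, larger values.
    return list(Counter(contents.split()).items())

# "to be or not to be" now yields 4 pairs instead of 6.
```

On real document collections the savings are much bigger than 6 versus 4, since common words repeat many times per document, and it is exactly the repartitioned intermediate data that crosses the network.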
So, here is roughly how this whole thing is architected. You have a user program, which is actually your map and reduce functions linked with the MapReduce library of Hadoop or whatever, and a master process. The user program starts off worker processes on all the machines, and each of these workers reads part of the input data, which it gets from the distributed file system. The master process has told each worker which part of the data it is responsible for, so the worker collects that part of the input and runs the map function on it; the result is kept in local files temporarily. Then that result is repartitioned. Remember, whenever we did a join we had to repartition, and when we wanted to do a distributed aggregation we had to repartition; this is a distributed aggregation, so repartitioning is required based on the key, which in our example is the word. So the repartitioning here is on the word, and you have another set of workers. This set could be smaller if you expect the intermediate data to be small; you could have fewer workers here, which is why it is shown as 3 and 2. Each of these workers collects its data, the reduce function defined by the programmer is called on each piece of data it has got, and it computes the aggregate result and outputs it to a file. These files are also in the distributed file system, which means the master can then collect all these output files, reading them just like local files, and output them; it is done. An important thing is that MapReduce allows imperative code written in C or Java to do the work of map and reduce; that gives you power which SQL does not. SQL is fantastic for certain tasks, but for certain other tasks it is a limited language, let us accept it.
So, that is the kind of thing this let them do: computing PageRank from a collection of crawled web documents, or building keyword indices on those documents; it turns out even indexing is very effectively parallelized, as you build the indices in parallel and then do some merging. If you want to do any analysis on all those web click logs, it is inherently parallel: you have the log files distributed on the distributed file system, you start a MapReduce job, and you can specify all the analysis you need to do on the raw log files in MapReduce. Today, apparently, engineers at Google and Yahoo run a lot of their production jobs which need to be highly parallel, essentially everything, on MapReduce; this is what we have heard in recent talks from people at Google and Yahoo. Moreover, it is not just regular jobs: if an engineer wants to do some analysis, say to see whether to tweak the ranking function somehow, they can do all this analysis in parallel on huge volumes of queries using MapReduce. So it is used day in and day out in these places; it has been a tremendous success. There was a paper from Google on MapReduce which started off this whole thing, and later others realized that this is a fantastic way of doing things, and many people have implemented it. Hadoop is one such implementation, which was built at Yahoo and open sourced. As I was saying, there was a big debate, in fact a war, between database researchers and the MapReduce people, with the database researchers saying: you guys are idiots, you are reinventing stuff, and you are not even doing it that well; and then telling them: you can't do this, you can't do that.
So, there is a little bit of truth in both camps. The first camp probably ignored all the parallel database work; they should have said, yes, parallel databases do exactly the same thing we are doing, but they did not say that, and they claimed a little too much. Once you analyze it more soberly, you find that yes, some of what they did was known, but an actual implementation that runs on thousands of machines, handling failures seamlessly and allowing procedural code, which gives tremendous power, is all new. So, who all use MapReduce? Google has an implementation; it is not available publicly, but it is used all the time within Google. Hadoop is an open-source implementation in Java which uses the Hadoop file system, which we saw earlier, as its storage. You can actually download Hadoop, set it up, and run it in your institute; you can run it on just a few machines if you want, you do not need 10,000 machines. None of us has 10,000 machines to run it on, but we do have a lot of machines: we have labs with hundreds of machines, and if you think about it, most of those machines are sitting idle most of the time, so you can actually run distributed file systems and Hadoop across all of them, especially when nobody is using them. Microsoft has Dryad. Aster has MapReduce embedded inside an SQL database, so they claim you can get the best of both worlds: you can do procedural stuff if you want, and you can also get the full power of SQL if you want. And then there are pointers to what you should be looking up; you can take these later: the original MapReduce paper and the GFS paper. For the other ones, Hadoop, HDFS and so on, there are no papers, but there is a lot of material on the web; instead of giving you pointers, I am just suggesting you do a Google (or any other search engine) search and then dig deeper. For a change, I am done exactly on time for the break; I will try to keep it up after this as well. Thank you, and we will take questions after the break.