So, this session we are going to cover the topic of big data. What is big data? People define it in whatever way they want, and there is a lot of buzz around the term, but basically I do not care about the buzz. To me, big data is anything which is large enough that it needs massively parallel systems to do anything useful with it. That is a more useful definition if you look at the tools people actually use to deal with big data; I do not care about its volume, its velocity, or any of the other V's associated with it. If I can do it on my desktop, it is not big data. If I can do it on a shared-memory machine with 64 cores and, say, 1 terabyte of main memory, it is not big data, it is small data. But if it inherently needs a massive degree of parallelism to do querying, updates, and so forth, then to me that is big data. This is an implementer's viewpoint. From a user's viewpoint there are the industry buzzwords centering around the 3 Vs. If you have not heard of them, like I said, they are technically not a big deal, but do read up on them so that you know the buzzwords. So big data is anything which is inherently so large that we need a lot of parallelism.

The questions, then: where are the sources of such data? What kinds of things do you need to do with such data? And where do databases play a role in dealing with it? These are some questions which need to be answered, and we are going to answer them by starting with the MapReduce paradigm and then moving on to big data storage systems, because historically, that is how these systems evolved.

The current wave of big data was driven initially by people outside the database world, although you could argue database people started it: they have been looking at big data for a very long time. There were parallel databases built in the mid 80s; there were major projects. A company called Teradata came out in the late 80s and is still around, still one of the largest parallel database companies, playing a very important role. These people had systems with 300 or 400 nodes way back in the 1980s.

So who had that kind of data in the 1980s? Not many companies. There was Walmart, which you may have heard of; it was already a very big retailer back then, with a huge amount of transaction data. There was the US phone company AT&T, which also had a very large amount of data. There were oil companies with seismic exploration data. There were other such companies, but the number with really large data was relatively small. And Teradata was a fairly expensive database; you had to pay a lot of money to buy it. So first, only the really large players had access to such data, and second, only the really large players could afford to buy a Teradata machine.

What has changed since then is that the web exploded, and now many, many companies have huge amounts of data. The big web companies often started small: Facebook was small, and in a few years it exploded and became really huge. They needed solutions for handling the data which were scalable. What do I mean by this?
Things where you can add nodes into the system in a fairly easy way, without shutting down the system, keeping things running all the time, and yet you can grow and handle larger and larger volumes of data as time progresses. So there was a need for such systems. Now, parallel databases could already do that to a reasonable degree; Teradata systems could. But they never looked at 100% expansion in a year: companies like Walmart probably grew at 20% or 30%, and that is the kind of growth those systems were targeted at, and some downtime was acceptable. In the new world, there are a lot more demands.

The other thing which happened is that in the old world, Walmart got all its data from cash registers. It was already coming into a relational database in relational format. In today's world, on the web, some data is actually pretty structured, but a lot of data is not. What was one of the biggest sources of data on the web? The web pages themselves. They are documents: unstructured, textual data with hyperlinks and some formatting, but not relational data.

Among the earliest motivations for the MapReduce system, which we will look at in more detail, was taking the web crawl and running some computation on it. What is a web crawl? It is basically a system which goes and finds web pages. How does it find them? It starts from some initial set of web pages. Each web page has hyperlinks to other web pages, so the crawler finds those, puts them into its set, visits them, finds the links out from those, and puts them in. Of course, sometimes the links loop back; those are ignored because they have already been processed. You keep growing until either you have exhausted your budget or there are no new pages to be found. With a well-chosen initial set of pages, you would crawl a pretty good fraction of the web which is out there, and in particular you would manage to reach all the important sites, because people generally have links to the important sites.

That crawl is huge. It was over 1 billion pages more than 10 years ago, and that is just the part which Google decided was important enough to visit; the actual web is even larger. By now it must be much, much bigger, though I don't know the current size. That is a lot of data.

Now, what kind of computation do you want to do on it? One of the things Google did, which was very important for ranking, is something called PageRank. It was a major factor in Google's initial success, although of late it probably plays a smaller and smaller role, for various reasons. But it required a lot of computation on a very, very large graph: a graph with a billion nodes, each node with maybe 20 or 30 outgoing edges, so 20 or 30 billion edges. A really huge graph, and they needed to do computation on it. You cannot run such a computation on a single machine. You cannot run it on 10 machines.

And at some point they realized there is a lot more; the crawl is just one part of it. They had huge amounts of other data. When people search on Google, that is a data point. When Google returns some pages and people click on the results, that is another data point. When you search on Google, it displays ads, and if you click on one, that is a data point.
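Since the crawl loop described above is the canonical example of a simple algorithm that must run at web scale, here is a minimal sketch of it in Java. This is only an illustration: fetchLinks is a hypothetical placeholder for downloading a page and extracting its hyperlinks, and everything a production crawler needs (politeness delays, robots.txt, distribution across machines) is omitted.

import java.util.*;

// Minimal crawl loop: start from seed pages, follow out-links, skip
// pages already seen, stop when the budget is exhausted or there is
// nothing new to visit.
class CrawlSketch {
    // Hypothetical helper: download the page at `url` and return the
    // URLs it links to. A real implementation would do HTTP + parsing.
    static List<String> fetchLinks(String url) {
        return Collections.emptyList(); // placeholder
    }

    static Set<String> crawl(Collection<String> seeds, int budget) {
        Set<String> visited = new HashSet<>();
        Deque<String> frontier = new ArrayDeque<>(seeds);
        while (!frontier.isEmpty() && visited.size() < budget) {
            String url = frontier.poll();
            if (!visited.add(url)) continue;  // link looped back: ignore it
            frontier.addAll(fetchLinks(url)); // grow the set of pages to visit
        }
        return visited;
    }
}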
Coming back to the advertising data: if you go to a website which has chosen Google as its advertising partner, Google displays the ads. Google is the middleman between the advertiser who wants to display the ads and the website; Google collects money from the advertiser and pays some part of it to the website. So every time you visit such a website, Google is collecting data. Google has an enormous amount of such data, and they need to analyze it to make various business decisions.

And it is not just Google. Yahoo is of comparable size, a little smaller overall and much smaller on search, but comparable to Microsoft on many other aspects. In India, we had a few very large players, such as Rediff, but they seem to have been clobbered by the big daddies. However, there are other companies in India. There is InMobi, a mobile advertising company which was started in India and hit the big time; they have been very successful. And now there is Flipkart, and so many other companies which have a lot of data.

So how do you process this data? What kind of computation do you want to do? The bottom line is that you want to do parallel computing on it. This is where a lot of the big data technologies of the recent past started. In recent years, databases have come back into the picture, but in the early days it was not really database specific. It was data specific, but not database specific. What is the difference? When I say database, I mean things like relational databases and relational data, as opposed to documents, which are also data but not relational.

So let us see the origins of MapReduce. Google pushed this paradigm, but the MapReduce approach to parallel computing is very, very old, probably 40 years old. Many people think Google invented it, but no, Google did not invent the paradigm, the way of doing things. What Google built was a very robust, extremely scalable (meaning it can run on tens of thousands of nodes), fault-tolerant system implementing MapReduce. That was their technical achievement. And then they used it on a huge variety of projects. It became the standard for anything you do at Google: any analysis had to run on the MapReduce platform, because the scale is so big. It proved an invaluable tool; pretty much every Google employee involved in doing something with data ends up working on Google's MapReduce system.

Now, Yahoo saw this, and meanwhile somebody was building a MapReduce system in open source, called Nutch at that time. So Yahoo decided they needed it, hired him, and said: continue working in open source, because that is what motivates you, but work with us, and we will be major users of this system. That system became Hadoop. So Hadoop is open source, but it was pushed by Yahoo, which was a nice situation: Yahoo got some very smart people to work on it, and the open source community got something which anyone can download for free and run, and it is a very robust system. A win-win situation. There have been several more stories like this, which have kept the open source community growing.

So earlier today, you heard about FOSS. The expansion of FOSS was left out, I think: it is Free and Open Source Software. There is an economic angle to it. Who on earth is sitting there writing all these programs for you to download and use for free? We are freeloaders in some sense; we are getting things for free without paying for them. Somebody else has done all the hard work of building them.
We are doing the tutorials on top, but the underlying systems were built by somebody else. Well, we should be part of it, too. Today's talk discussed open source from the consumer's side, but I want to point out that we should also be producers of open source software. In fact, a lot of the software we build at IIT we make available: if it is niche, we make it available on request to anybody who wants it, and other systems have been open-sourced and put out there for anybody who wants to use them. So that is something important, that if we are consuming, we contribute back, and society as a whole benefits. That is the principle of open source.

But it had a particular impact here, because very few people could afford a product like Teradata. All the products which offered parallelism were aimed at large companies, costing millions of dollars even for a license. Suddenly, here was this free thing. It resulted in an explosion of interest in building big data systems which span a lot of nodes.

But of course, nothing is free. The software may be free, but what about the hardware? Nobody is giving you free hardware, right? I see a lot of you with laptops; I am sure each of them costs on the order of 40 to 60 thousand rupees, and then you get free software on top. Somebody is collecting money from all of us, maybe it is China, maybe Japan or Korea; it is just no longer for the software, perhaps. And here is another consideration: suppose the free software requires 1,000 nodes to run, while software you pay for can do the job on 500 nodes. Maybe you saved money by paying for the software, because 500 computers is a lot of money. So the point I want to make is that free can be nice, and free is great for learning, but for a company, free could be great and paid could also be good. If we turn everything into free, we are out of jobs; our graduates are out of jobs.

OK, so coming back: Hadoop is, luckily, free, and that is why we are able to actually use it and try things out. This has also driven companies like Oracle to make their products available for free for testing, education, and development, if not for commercial use.

OK, so let us come back to the technical detail. As the slide says, the paradigm is old, but fault-tolerant systems running on 1,000 to 10,000 nodes are what Google pioneered and Hadoop made open source. Now, where is all this data residing? That is the bottom line; we have to start from where the data is. Maybe it is the web crawl, the data on the web, but to process it you have to pull it in. How do you store it? Can you store it on one machine? Not going to happen. There is so much data that just putting it into one machine, or taking it out, would take forever. So the data itself has to be stored in parallel across a very large number of nodes. That is the starting point.

So in fact, the first project at Google in this space was the Google File System, GFS; MapReduce, I think, followed it or was developed in parallel. So what is a distributed file system? Again, this is a pretty old idea. It dates back to the 80s: when I was doing my undergraduate degree, I heard a talk by Satyanarayanan of CMU, who was building the Coda file system, if I remember right. So distributed file systems date back to the mid 80s at least. And the idea then was that you have a lot of machines in a department.
Can we store files across those machines and make use of the space on the disks of all the local machines? That idea, again, is pretty old, but those systems operated with hundreds of machines. Google pushed the idea much further: the Google File System could handle, say, 10,000 nodes, 100 million files, and 10 petabytes. These are indicative numbers, of course, and they vary. How much is 10 petabytes? 10,000 terabytes: 10,000 disks, if you buy 1-terabyte disks. That is a lot of data.

So how do you store all this in a system which gives you uniform access to any file in there? That is the first issue. You want the data to be stored on thousands of machines, or rather their disks, but you want to be able to say: open /home/blah/foo/whatever. You give a path, and the system should retrieve the file for you. You do not care how many nodes there are in the system, and you do not care if a node fails; you still want the data. And machines will fail: if you have 1,000 machines, at any given time some number of them are guaranteed to be dead. They will never all be up at once. So you cannot say "one machine is down, I can't give you this data." The system has to be fault tolerant.

So files are replicated. If a machine is down, its data is on some other machine, which can serve it. The software should be able to detect that a machine is down, and if somebody asks for data which is on that machine, the software should say: ah, this data is replicated at this other place, so if this one is down, I will go fetch it from there. It also has to recover. Suppose a machine is dead for a long time; now there is one less copy of the data. If I had three copies and one is dead, I have two copies. If one more machine fails, I have one copy, and if that fails, I cannot access the data. So I have to rebuild, to make a fresh copy of the data, and manage all of this. All of that is done by the distributed file system. There has been older work, but in recent times GFS is one of the big ones, and the Hadoop file system is an open source clone of it. Hadoop has done a great job of copying everything Google did and making it available in open source, which is very nice. So the Hadoop file system is basically a distributed file system.

I won't get into details, but as you can imagine, somebody has to control all of this. That control of the distributed file system can itself be distributed, but typically, in the Google file system and the Hadoop file system, there are a few nodes which are called the masters. They keep track of what data is where, but they do not actually serve any data. So if you want data, you ask the master: I want to open this file. The master will say: this file is broken up into these pieces at these nodes, go fetch it from there. So you go to that node and fetch it. If the node is down, the master has already told you that copies are here and here; you try the first one, it is not responding, you go to the second one and get it. All that code is provided by GFS or by the Hadoop file system, HDFS. But the central master keeps track of what is where: you create a new file, the master knows. So the master is a centralized repository of metadata. The data is replicated, but the master is a single machine. What if it fails? Well, there is actually a replica of that machine, an exact copy. Everything the master does is replicated on another copy, so if the master machine fails, the other machine takes over.
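As a toy illustration of the master's role, here is a sketch of its bookkeeping. The class and method names here are invented for illustration; they are not the actual GFS or HDFS interfaces. The key point it shows is that the master returns only metadata, namely which live nodes hold which block, and the client then fetches the data from those nodes directly, trying replicas in order.

import java.util.*;

// Toy sketch of a GFS/HDFS-style master: pure metadata, no file data.
class MasterSketch {
    // file path -> ordered list of blocks; each block -> nodes holding a replica
    Map<String, List<List<String>>> blockLocations = new HashMap<>();
    // nodes currently known to be alive (maintained via heartbeats)
    Set<String> liveNodes = new HashSet<>();

    // "Open" a file (assumed to exist): for each of its blocks, return
    // the replicas on live nodes. The client tries these in order.
    List<List<String>> open(String path) {
        List<List<String>> result = new ArrayList<>();
        for (List<String> replicas : blockLocations.get(path)) {
            List<String> alive = new ArrayList<>();
            for (String node : replicas)
                if (liveNodes.contains(node)) alive.add(node);
            result.add(alive);
        }
        return result;
    }
}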
So that is how these systems have been built. OK. So now let us look at what to do with the data in these files. Take a very simple application. Oh, sorry, that is not the example which has today become the standard example for MapReduce programming; that one is called the word count example, and I will come to it later. This is a different one.

I have a log file. Each line of the log file has some date, some time, and then some file name. What could this file be? What do you think it is? It is a stylized example of something. Where is the log file coming from? What server? Yes, I heard somebody say web server. This is a typical example of a log file from a web server. Actually, a web server log has far more detail than this, such as who accessed the file, and other fields which are omitted here, but this is the critical information.

So it turns out that one of the biggest sources of big data out there is web server logs. The way all these sites handle requests is that they parallelize across many, many machines. If you send a request to Google and I send a request, they are most probably not both going to the same machine. Google has hundreds of thousands of machines, and our requests are routed to one of those machines, which in turn may talk to other machines, get the data it needs, and send it back. Now, each machine is collecting logs of this kind, and each second, each machine may be generating thousands of lines like this. Do this across 10,000 machines, and that is a lot of log data.

And what is the use of this log data? There are plenty of uses for it. You know who queried for what. That will help you target advertising, it will help you optimize your website, and it can help you decide what information to display to whom. When you go to yahoo.com, a lot of the page is personalized, because the site knows your history. It will generate a web page suited to you, not to somebody in the US. It is localized, it is personalized, and you make these decisions primarily through information in log files. There is also a user ID in these logs: if you are logged in, the site knows who you are, so your user ID will be in the log file. So there is a lot of useful information, and log files were one of the biggest motivations for big data, in addition to the web crawl.

So now, let us say I have a simple goal: find out how many times each of the files in the slide directory was accessed between these dates. The slide shows a small snippet, but this could be a very large file. In fact, this file could have been obtained by concatenating log files from many sources. Or the data may be stored partitioned: you get log files from 1,000 machines and put them into the file system as 1,000 different files, and I want to run this query across all 1,000 log files. How do I do that? If I have to write the code for doing all of this each time, it is very, very difficult. So what the MapReduce paradigm does is provide a way for you to specify the logic of your computation, while the underlying system takes care of all the rest of the work.

So what are the options? As the slide says, a sequential program is of course too slow on such massive data. You can load the data into a parallel database and then run an SQL query for this, which is easy enough, but the loading itself is going to take time.
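To see why, here is a sketch of what the straightforward sequential version of this task might look like: scan the log line by line, filter by date range and directory, and count with a hash map. The per-line logic is trivial; the problem is purely one of scale. The date strings and the "slidedir/" prefix are placeholders of my choosing, and I assume whitespace-separated "date time filename" lines with dates that compare lexicographically.

import java.io.*;
import java.util.*;

// Sequential file-access count: fine for megabytes of logs on a
// laptop, hopeless for terabytes spread over thousands of machines.
class SequentialCount {
    public static void main(String[] args) throws IOException {
        String from = "2013-01-01", to = "2013-01-31"; // placeholder dates
        Map<String, Integer> counts = new HashMap<>();
        try (BufferedReader in = new BufferedReader(new FileReader(args[0]))) {
            String line;
            while ((line = in.readLine()) != null) {
                String[] attr = line.split("\\s+"); // date, time, filename
                if (attr[0].compareTo(from) >= 0 && attr[0].compareTo(to) <= 0
                        && attr[2].startsWith("slidedir/"))
                    counts.merge(attr[2], 1, Integer::sum); // count accesses
            }
        }
        counts.forEach((file, c) -> System.out.println(file + " " + c));
    }
}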
So direct operation on the log files is a lot cheaper. So how do you do this? You could have a custom-built parallel program for this task. You can write a parallel program; it is possible, and underneath, Hadoop and similar systems are doing exactly that kind of parallel programming. But somebody has done it once for you, so you do not have to do it yourself if you use Hadoop. It is actually very tedious: you have to deal with failure and so on. It is a lot of work which you should never do; leave it to Hadoop.

So what is the programming model for Hadoop? The basic idea comes from the very old map-reduce paradigm of functional programming, which dates back to, I think, the 1960s and Lisp. So it is older than most of us in the room, probably. The approach has two functions, Map and Reduce, which you specify. You have a lot of input data. In the Lisp days, the input data was a list; in the current avatar, it is individual pieces of data, such as the lines of a log file, where each line is a data item.

So now there are two functions which the user has to provide. The first is a map function, which is given a key and a value. For the log file, what are the key and the value? The key could be one file and the value the content of that file, or the key could be something else and the value one line of the file; one record can be considered a value. And the map function outputs something. What is that output? A list of key-value pairs, and these output keys and values can have different types from the input; we will see. That is what the paradigm specifies; what exactly to output depends on how you write the map function. You are going to write the map function: you are taking input of a certain form, working on a certain set of files, so the input key k and value v to the map function are defined by the application, and you write the map function and decide what to output.

Now the system takes what you have output and sorts it on the key values. You had a huge number of records; the map function is run on each of them and outputs some keys. The system sorts all of this in parallel across many machines, and then it gets together all the pairs with a common key, say k1, with their values collected, conceptually, as a list. So what does this operation correspond to in the SQL world? You are doing a group-by on the key value. In SQL, group by comes with an aggregate: sum, min, max, whatever. Here, it simply collects all the values into a list, and the programmer again provides a reduce function which can iterate through all the elements of that list and compute something: a sum, a count, or anything else the application needs. It may not be a database application at all. That is the bottom line of the paradigm. And then the reduce function outputs a value, and that value is the final result for that key. There can be many such values, because there may be many keys; for each key there is a value, and that is the output of the MapReduce.

So when you have data in files, it has to be broken into records. Again, a system like Hadoop provides built-in functionality for breaking files into lines, and Hadoop will call the map function on each line; the key, by default here, would be the offset of the line in the file. So now, with that background, let us see how this file access count would work. So look at the map function. This is not formal syntax, it is pseudocode. You see here, the map function is taking two strings, a key and a record, that is, a key and a value.
We called it record here. And there is an array of strings, attribute[3]. So I am breaking up that line into three parts. I am not showing how to do it; I just put a comment here saying: break the record into tokens and store them in attribute[0], attribute[1], attribute[2]. So if you see here: date, time, and file name. Those are the three fields, and it pulls them out into date, time, and file name. This is just to understand what is going on. Then it checks whether the date is between the two given dates and whether the file name starts with "slidedir", because that was the goal: I wanted to find how many times files in slidedir were accessed between those two dates. So it is just running a simple selection. And then it emits (filename, 1).

What is going on here? This is an intermediate result which is going to be passed on to the reduce function. The map function has to emit a key and a value. What is the key here? The file name. Why is the file name the key? Because it is the group-by attribute: I want to count how many times each file name was accessed, so I group by file name. And the value in this case is just 1, because one log line is one access. There will be other examples, coming up shortly, where the value is something other than 1. So this emit produces a very large number of (filename, 1) pairs. Then the system takes all of this and sorts it in parallel, and all occurrences of a particular file name are brought together for the grouping, and that list of values, in this case just 1s, is given to the reduce function.

So here: reduce(String key, List recordList). And the reduce function is very simple here: count = 0; for each record in recordList, count = count + 1. It just steps through the list, adding 1 to the count, and when it hits the end of the list, it outputs (filename, count).

So it is a long program to do something very simple: to find a count with group-by on file name. Why take all this trouble? There are two answers. First, if you are doing something which is not a relational database operation, you cannot do it in SQL, but you can do it here. Second, early on, SQL systems which let you run such queries in parallel were very, very expensive, and therefore you had to go to Hadoop and write all this, even for something which is natural for SQL. That situation has changed today; I will come to it at the end, since I am going historically. Today, there are systems which will let you specify this in a variant of SQL and run it on Hadoop. There is a system called Hive, which was built at Facebook. Of the bunch of people who built it, half were Indians, and some of them have come back and are running a startup in Bangalore. So they built this system which can run SQL queries on top of a MapReduce system. Anyway, I will come to that later.

So here is a diagram which shows how the keys flow through the system. Suppose the input records are (k1, v1), (k2, v2), and so on. These are in files, typically in the distributed file system. The map function takes each input record and outputs reduce-key, reduce-value pairs. Now this is sorted, so that every occurrence of rk1, one here, one here, and maybe more, is brought together.
So rk1 is associated with rv1, rv7, and so on, and all of these are given to the reduce function. The reduce key, and the reduce value is the list: (key, recordList) for reduce. That is the basic idea.

OK, so that was pseudocode; we will come to the Hadoop syntax later. But before that, let us take another example, which is less of a database example. This is a common example used to illustrate MapReduce: I have a bunch of documents, and I want to know how many times each word occurs in the documents. Now, if I had a record per word occurrence in a document, it would be a simple SQL group-by query: group by word, count(*). That would easily tell me how many times each word occurred in the set of documents. But my input is not like that. My input is a set of documents. I cannot directly run an SQL query on it; I have to process it, to go over each document and do something with it. And that is where MapReduce is a very natural fit.

So what is going on here? How do you do it in parallel? This is the implementation side of MapReduce. You want to divide the documents among a large number of worker machines. How do you do that? First of all, HDFS, the distributed file system, is already running on a very large number of machines, so you store the files in HDFS. That is step one. Step two, you are going to tell a bunch of workers to process the documents, with each worker processing some set of them. So if I have 10,000 documents and 1,000 machines, each machine will process 10 documents.

Now, what does each worker do? It parses the document to find the words. This is actually a very simple tokenization. And the map function could go over the document and output each word and the count of that word in the document. But what if the document is very big? It can be huge (log files can be very huge), whereas the memory available is limited. How do you compute the count of each word in the file? If it is small, you can have an array or a hash map or some other data structure which lets you keep track of the count for each word. As soon as you find a word, see if it is already in the hash map. If it is, add one to the count; if not, add a new entry with count one. So you can do it with such data structures, and it is fine if the document is small. But if the document is huge, this data structure may run out of memory, so you may want to do it differently.

So, in fact, the standard word count program does something very simple at this point. Instead of counting the actual number of times a word occurs, it outputs, just as we saw before, (word, 1), (word, 1): each time it finds a word, it immediately emits (word, 1). No matter how big the document is, this is OK. It outputs a lot of such pairs, but Hadoop or Google's MapReduce can handle that; it does not require memory.

Now, in the next step, I need to group by word and bring all occurrences of a particular word together at one spot, so that I can finish the aggregation there. The reduce function has to do this. How do I do it? I sort. I can sort locally and get a list of sorted words on each node. But now I have 1,000 nodes, each with a list of many words, so I need to partition these words again amongst a large number of reduce machines. So I might say that all words between AA and BAD go to machine 1, all words just after BAD but before CAX go to machine 2, and so on.
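Range partitioning of that kind is one option. Hadoop's default is actually hash partitioning, which gives the same guarantee, namely that every occurrence of a word goes to the same reducer, without anyone having to choose range boundaries. The essence of Hadoop's default HashPartitioner fits in one method; this standalone sketch just illustrates the formula.

// Essence of hash partitioning: hash the key, mask off the sign bit so
// the result is non-negative, and take it modulo the number of reduce
// tasks. All occurrences of the same word map to the same reducer.
class PartitionSketch {
    static int partitionFor(String word, int numReducers) {
        return (word.hashCode() & Integer.MAX_VALUE) % numReducers;
    }
}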
So the Hadoop system will decide how this partitioning is done, but what will happen is that all occurrences of a word land on one machine. They may have started on any number of machines, since the map function runs highly parallel, but each word is now on one machine. And then you do the sorting to bring the occurrences of each word together, and then apply the reduce function.

As an example, if I have the input line "one a penny, two a penny", the map function could emit (one, 1), (a, 1), (penny, 1), (two, 1). Now "a" is found again, so it again outputs (a, 1). It could have said (a, 2), but I am keeping it simple, outputting immediately. So (a, 1), (penny, 1), and so on. Then reduce brings the "a"s together, the "penny"s together, and so on, adds them up, and gets the output.

So now, again, pseudocode for the word count. This is very simple. map(String inputKey, String inputValue): for each word w in inputValue, emit(w, "1"). Note, it is a string "1" here for simplicity; it could be an integer also. And reduce, what does it do? reduce(String outputKey, Iterator intermediateValues): for each value v in intermediateValues, result += parseInt(v). Why parseInt? Because it is a string; some overhead, which could be avoided, but for simplicity we are working with strings. And then it outputs the result along with the key, which is not shown here. This is the pseudocode. Any questions?

So you are asking: how will Hadoop interface with a relational database? That is actually a good question, and something which many people are working on. For example, Microsoft is working on a project which will let you run Hadoop programs straight from SQL Server. There are other companies too. Quite a few years back, five years or so, there was another company, founded by one of my students, a B.Tech from here who later went to Stanford: a company called Aster Data. They were one of the first companies to integrate MapReduce with relational data, so you could mix the two paradigms.

And you are asking how the data is divided, horizontally or vertically, into small pieces? Well, parallel databases are an old area. Take this company, Aster Data. They did not start with MapReduce; they started by building a parallel database. Interestingly, they chose to use PostgreSQL as the underlying database. So they would run PostgreSQL on hundreds of nodes, with a layer on top which would partition the data across all these PostgreSQL instances. So when a query arrives, that layer partitions the query. And they did very well; they were bought out by Teradata recently. So my student is very rich; at some point I have to tap him for money. But anyway, the point is that it is a very important kind of product, given the needs of the market today. So that came from the relational world, but they soon realized that there are non-relational cases, so they integrated MapReduce into it.

So now, how does MapReduce execute? There are some more details here. You have your user program, meaning you write code for map and reduce. What language do you write this in? We are doing it in Java, but in fact the same underlying system can take code in many other languages; it is fairly language agnostic. So the system copies the program to all the machines which are going to run the map tasks, and the reduce part of it is copied to all the machines which are going to run the reduce. And then there is a master which controls the whole operation. The master will initiate all of this: it will start the map tasks on some number of machines. The input files themselves are in the distributed file system.
Some of the files can be very big. So what Hadoop does is it can actually break up a file into pieces: you give it a 1-terabyte file, and it will break it up into smaller parts as required. If you give it many files, it will distribute the files across machines. So the files go to the various map tasks, possibly mixed up in some order. Each map task runs on whatever data it is given, which could be part of a file, and writes the output of map to local disk. And then the reduce workers get the data from the local disks of the various map workers. Partitioning and sorting are already done on the map side, so that one part of the data goes to this reduce worker, another part to that one, and so on. Each reduce worker takes all the keys coming in and merges them into one single sorted result, and on that sorted result it runs the reduce program. The reduce program outputs something which is written to a file, and each reduce task creates a separate file, so you get a number of output files.

Is this architecture clear? You will be seeing it in the lab exercises tomorrow. You can give multiple input files, and you get multiple output files. How do you say which files to use? It is actually simple. In Hadoop, you give a directory, and any file inside the directory gets used. If you have 10 files in there, MapReduce will be run on those 10 files. The output, similarly, goes to an output directory, and depending on how many reduce tasks there were, you may get multiple output files. Probably in your case, it will just run with one map and one reduce task, because we are not really running it in parallel. But this exact same program can be run across a large number of machines, and it will work.

OK, so what is Hadoop? It is this implementation, written in Java, and it runs on top of the Hadoop file system, HDFS. In the lab exercise, we are not actually running HDFS; Hadoop simply runs off the local Unix file system. But HDFS is another component which you can also download and run, and then actually get the data partitioned across a large number of machines and run the whole thing. It is used in production on thousands of machines. We do not have thousands of machines, so we are just going to run it on one machine. I already told you about replication and about a central name node which provides metadata, such as which block is in which file, or rather, which block is on which machine.

OK, so that general principle is done; now we come to the programming details. Now, so far we avoided types. I just said string, but when you build a system, you have to worry about types. So in Hadoop, you have to provide four types: the types of the input key, the input value, the output key, and the output value. This is the first thing you need to provide. Then you have to define the Map class, which extends Hadoop's Mapper class. If you are familiar with Java, this will make sense; if you do not know what this is, it is OK. Basically, the Map class has to provide a certain set of functions, map being the key function. And in the next slide, we are going to assume that the map input key is of type LongWritable, a long integer. The map input value, which is all of or part of a document, is of type Text. What does it mean, all of or part of? If it is a small file, Hadoop will give the whole file; if it is a very large file, Hadoop will break the file into pieces and give each piece to a separate map task. And then the map output key type, for our program, is Text.
We are doing word count, so the key is a word, which is the group-by attribute; therefore it is of type Text. And the output value is an integer value, so it uses IntWritable. It could have been a string (in the other version we said parseInt), but here we do not have to do that once we say it is an IntWritable.

So now here is the map function in Hadoop: public static class Map extends Mapper<LongWritable, Text, Text, IntWritable>. Those are the types of the four things: input key, input value, output key, output value. Then there is a member variable, one, created as new IntWritable(1). It is just an efficiency thing: a variable called one, which contains the value 1. This is all in the Map class. And then there is a Text variable called word, which also belongs in the class.

So what is happening in the map function? The map function takes a key and a value, and then there is a Context, which holds other Hadoop settings; if you need to access them, they are available. For example, how do you know which file this data came from? The map is running on some data; if I need to know the file, I can get it from the context. There are a lot of other functions on the context.

So what is it doing? The first step: value is a Text, so it calls value.toString(), and that gives a line. Why is it a line? Because somewhere else I am going to specify that Hadoop should break the file into lines and give the mapper one line at a time. That is coming up later. Now what does it do? StringTokenizer tokenizer = new StringTokenizer(line). What is a tokenizer in Java? It is a class which lets you break a string into pieces, based on what? The default is space and some other characters, but if you want specific delimiters, you can list them. Now what does it do? While there are more tokens, it does word.set(tokenizer.nextToken()), getting the next token, and then context.write(word, one); the context came from the function parameters. So the group-by key is the word, and the value is one, the variable which we have up here, containing the value 1.

Now we come to the reduce function. The Reduce class extends Reducer. Here, what are these four type parameters? They are the reduce input key, the reduce input value, and then the final output key and the final output value, which the overall program outputs. So if you see here, we are getting a word with a count as IntWritable, and it is going to output a word with a count, so that is the IntWritable there. Now, coming down to the actual reduce function, it has three parameters, as before, like map. The first is a key. The second is not a single value but a list of values, so the Hadoop interface specifies an Iterable. And this is a generic, by the way; the angle brackets are Java generics. If you are familiar with Java generics, well and good. If you are not, I do not have time to get into it, but basically, you can define a class which is parameterized by the types of various parts of that class. The idea here is that Reducer is a generic class, defined by the Hadoop framework, which defines a bunch of functions but does not know the actual types of the map key, map value, reduce key, and reduce value; you bind them here. And our Reduce class extends that: it is a subclass of the Reducer class. That class is defined by the Hadoop framework; this class we are defining, and the reduce function we are defining inside this class. Now, what is this function doing? It is actually very simple.
for (IntWritable val : values), where values is an Iterable. This is, by the way, Java's for-each syntax: when you have an Iterable, you can write a for loop like this. IntWritable val, that is the type of val, colon values, which is an Iterable. So the for loop will iterate over each element of the Iterable. Iterable itself is a generic type, which allows you to obtain iterators, and we have given it the value type IntWritable, so each time the loop asks the iterator for the next value, it gets back an IntWritable. That is why we have this for loop here. And then sum is initially 0, and sum += val.get() extracts the actual int value and adds it up. Finally, context.write(key, new IntWritable(sum)): sum is an int, and the output type is IntWritable. Is this clear?

So that is it for the map and reduce functions. But to get the program running, you have to do a bit more. There is a main function which has to set up various parameters. So first of all, the most important: you can have many classes which extend the Mapper and Reducer classes, so in one program you can have many, many different map and reduce functions, and the program can actually run them one after another. It can run a map, reduce, map, reduce, and so on, in multiple stages. So you have to tell it, for a particular job (you create a job, one MapReduce job which runs): run this map function and this reduce function. That is one job.

And the job itself has a number of parameters. For example, I said everything is in parallel. How parallel? Should it run on five nodes? Should it run on 1,000 nodes? How does it know? You have control over that; you can specify it. Again, we are not going to specify it, because we are just running a very simple instance on a single node, but you can specify it. And then you have to specify the types of the job's output keys and values. Oh, sorry, I am jumping ahead; let me finish this before coming to that.

The first thing is, you use the methods setMapperClass and setReducerClass on the job to set the respective classes. Then the output types are set by setOutputKeyClass and setOutputValueClass. Then there is the input format for the job. What do I mean by input format? I said that this is a text file which should be broken into lines. Where did I specify that? Here. It could be different: I may have my own program which reads the text file and breaks it into records based on something else, and there are other predefined formats which break input into records based on other conditions, not just one line per record. So the default input format is TextInputFormat, where the key is a byte offset into the file and the map value is the contents of one single line of the file. That is the default. And the offset is a long. Why? Integers are 32 bits by default on most systems (64 on some), and 32 bits gives only 4 gigabytes, while files can be much, much bigger than 4 gigabytes. So the offset has to be a long. That is used inside there.

And then there are the directories where the input files are stored and where the output files must be created. These are set by the addInputPath and setOutputPath functions: where to get the data, where to put the results. And there are many more parameters. Like I said, the number of nodes which should run the map tasks in parallel, and the number of nodes which should run the reduce tasks in parallel; these need not be the same. In some situations, map is very time consuming.
It does a lot of work while reduce is cheap, so I may run map on 1,000 nodes and reduce on 15 nodes; in other cases it may be the other way around.

So here is the main function, which sets all of this up. First of all, it creates a new Configuration, and then it creates a new Job; a single job will do a map and then a reduce. Then: job.setOutputKeyClass, job.setOutputValueClass, job.setMapperClass(Map.class). Note that it is Map.class, the class we defined; I did not have to call it Map, I could have called it anything, X, Y, Z, but here I have to give the name of that class when setting the mapper class. Similarly, I set the reducer class; I called it Reduce, so Reduce.class. And then the input format is TextInputFormat.class, and the output format is TextOutputFormat.class, which is again plain text lines. And then FileInputFormat.addInputPath(job, new Path(args[0])). What is this doing? This program takes command line arguments. When we run it, I have to give a directory name for the input files first; that is the first argument. The second argument is the directory for the output files, so the output path for this job is new Path(args[1]), the second command line argument. And then I say job.waitForCompletion(true), which submits the job and waits for it to complete before continuing (the boolean just controls whether progress is printed). You can also submit the job with job.submit(), which returns immediately while the job runs in the background, and then you can check later whether it is done. So that is a quick summary of MapReduce.
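To tie the walkthrough together, here is the whole word count program assembled in one place. This is a sketch written against the standard org.apache.hadoop.mapreduce API; details such as Job.getInstance and setJarByClass are my additions in the stock Hadoop tutorial style and may differ slightly from the version on the slides.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {

    // Input key: byte offset of the line (LongWritable).
    // Input value: one line of text. Output: (word, 1) pairs.
    public static class Map
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, one);   // emit (word, 1)
            }
        }
    }

    // Receives (word, [1, 1, ...]) after the sort/group step and
    // emits (word, total count).
    public static class Reduce
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values,
                           Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values)
                sum += val.get();
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "wordcount");
        job.setJarByClass(WordCount.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        // args[0]: input directory, args[1]: output directory
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Submit and block until the job finishes (true = report progress).
        job.waitForCompletion(true);
    }
}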