So, now let us look at the Hadoop specifics for what we have just seen. As I said, Google pioneered MapReduce for big data and used it for a very large number of tasks. In fact, their initial Google search index implementation was hand coded, but once MapReduce was developed they soon realized that it was best to do even the indexing using MapReduce, because the MapReduce paradigm deals with failures and with slow machines, and it can be configured to run on different numbers of machines just by changing a parameter somewhere; you do not have to do any recoding. So, it is actually a very convenient paradigm, and they moved their indexing to MapReduce. They also did a lot of analysis on MapReduce. In fact, pretty much all the analysis that analysts at Google do these days runs on it; Google has a huge number of employees doing different kinds of analysis, and the inputs to these analyses are enormous, because every single thing Google does has anywhere from tens of millions to hundreds of millions or billions of users and billions of queries per day. It is gigantic. So, everything they do runs on the MapReduce platform. Now, when Yahoo saw this, they noticed that an open implementation was already being built by a couple of people. It started off as a small project built by two people and was called Nutch in those days. Yahoo saw the potential and hired one of the two people involved in Nutch; the other decided to become an academic and is now on the faculty at Michigan. The one who joined Yahoo continued developing it, and it came to be called Hadoop. Yahoo put a lot of resources and people into this, and thanks to Yahoo, Hadoop is available. Yahoo also decided to open source it, which was a very good idea because many people started working on it even outside Yahoo. So, Yahoo benefited and the world has benefited. Now, the original Google implementation ran on top of the Google File System, GFS, a distributed file system. So, Yahoo built an equivalent called the Hadoop Distributed File System, HDFS. HDFS runs on a number of machines; the underlying storage on each machine is the local file system, whatever Linux or other file system you may be using. The Hadoop file system runs a bunch of servers, and it lets you store files in and fetch files from the Hadoop file system. So, HDFS is a service provided by code that comes with Hadoop. Now, in your assignment today, you are not going to be using HDFS. Actual production runs of Hadoop will use HDFS, but it takes effort to set it up, so to simplify your life you are going to be running Hadoop over local files on a single machine. But the principles are the same: whatever you do can, in principle, easily be retargeted to run on files which are stored in the distributed file system, HDFS. In HDFS, data is replicated, and there is a central name node which provides metadata. So, if you want to open a file, you talk to the name node, and the name node will tell you which blocks the file contains and where they are replicated. Now, the name node is central; if it fails, you are in trouble. So, HDFS keeps one backup replica of the name node which is kept up to date; it has all the data that the live name node has, so if the live name node crashes, the backup can take over.
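As a small aside (not part of today's lab, which uses local files), here is a rough sketch of what reading a file through the HDFS Java API looks like; the path used is made up purely for illustration, and the same code runs unchanged over the local file system:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsReadSketch {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up fs.default.name / fs.defaultFS from the config files
        FileSystem fs = FileSystem.get(conf);       // HDFS if so configured, local file system otherwise
        FSDataInputStream in = fs.open(new Path("/data/docs/part-00000"));  // hypothetical path
        // Under the covers, the client asks the name node for the block locations
        // and then reads the blocks from the data nodes holding the replicas.
        byte[] buf = new byte[4096];
        int n;
        while ((n = in.read(buf)) > 0) {
          System.out.write(buf, 0, n);
        }
        in.close();
      }
    }

This is why running today's lab over local files does not change the programming model at all.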
Now, let us get to the specifics of MapReduce in Hadoop. We looked at abstract MapReduce with types as strings and so forth. In Hadoop, you have to give types for all of these. So, in Hadoop, there is a Mapper and a Reducer class which you have to extend. These are generics. What are generics in Java? If you are not familiar with them, a generic is a class or an interface which can take types as parameters. In the context of Hadoop, the Mapper and Reducer classes take type parameters: the types of the input key and input value, and the types of the output key and output value. Now, what do I mean by input key and input value? For the Mapper, the input key and input value are the map input key and value. The map also produces an output key and output value, so their types are needed as well. It turns out that the output of map is the input to reduce, so the output key and value types of map must match the input key and value types of reduce. And reduce in turn has an output key and value whose types can also be specified. So, each actually takes four type parameters. I will show you the code in the next slide, but in our example the map input key type is going to be LongWritable, which is a long integer. Why a long integer? Files can be very big, more than a gigabyte, more than 4 gigabytes, so a 32-bit integer is not enough; we use a long. And in our word count example, the key is an offset into the file. We are not actually going to use the key, but the Hadoop system provides it. Like I said, Hadoop can break large files into pieces, and it is going to call map on each record in the file. We are going to set things up such that each line is a record, so Hadoop will identify the line by giving the offset of the start of the line in the file. That is the key. We are not actually going to use it, but it has to be there. The value is all or part of a document; in our context, the value is a record which is one line of a file, so its type is Text, which is the string type for Hadoop. Similarly, the output key is of type Text because it is a word, and the output value is of type IntWritable, which is Hadoop's version of an integer. So, with that in mind, let us look at the Map class, which extends the Mapper class. It takes four types: LongWritable, which is the offset in the file; Text, which in our case would be a single line; Text, which is the output key, that is, the word; and IntWritable, which is the count of that word. This class has a final static IntWritable variable called one, initialized to new IntWritable(1); final static means it cannot be updated. This is actually an optimization: instead of creating more and more objects of type IntWritable, it uses the same object many times. It is not essential; the code could create a new copy each time. Then it has a private Text field called word, initialized to new Text(); that will hold a word. Now, the map function takes a key, a Text value, and a Context, which provides other information. In particular, the context tells us where to output the key-value pairs which are going to be handed to the reduce function. The context provides many other things as well.
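Here is a sketch of the Map class roughly as described; it follows the standard WordCount example, so the exact lab code may differ in small details:

    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
      // Reused output value: every word is emitted with a count of 1.
      private final static IntWritable one = new IntWritable(1);
      // Reused output key: holds the current word.
      private Text word = new Text();

      @Override
      public void map(LongWritable key, Text value, Context context)
          throws java.io.IOException, InterruptedException {
        String line = value.toString();                         // one line of the input file
        StringTokenizer tokenizer = new StringTokenizer(line);  // splits on whitespace
        while (tokenizer.hasMoreTokens()) {
          word.set(tokenizer.nextToken());                      // next word in the line
          context.write(word, one);                             // emit (word, 1)
        }
      }
    }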
It throws two types of exceptions; we are not going to worry about those here. So, the first thing this function does is line = value.toString(). Value is a Text, and we are converting it to a String because String is the type we can deal with in normal Java code. Next, we build a StringTokenizer, new StringTokenizer(line), on this input line; that is going to break it up into words. The default tokenizer breaks the line up based on spaces and other whitespace characters, new lines and so forth. The tokenizer is basically an iterator, so you can call next on it to get tokens. So, the while loop says: while tokenizer.hasMoreTokens(), do word.set(tokenizer.nextToken()). What is this doing? nextToken() gives us the next word in that input line. So, the tokenizer breaks up the line into words, and while it has more tokens we pull the next word and call word.set. What is word here? Word is the Text field, so the next word in the input line is saved in word, and then we output (word, one). What is going on here? The one is simply the count 1; it is an optimization to create it once as an object, and that word with the value one is output here. So, that is the map function. Similarly, the Reduce class extends the Reducer class; that is a requirement. By the way, I want to mention that there can be many map functions which need to be executed in the course of one large job, so there can be many such classes. The name Map is not important here; the class can have any name. Inside it, there is a function map; that function has to be called map, but the class can be called anything you want, and we will see later how to tell a job to use one map class or another. Similarly for the class Reduce here: this is one class, and you can have more classes with different names. It extends Reducer, and again the Reducer takes four parameters, four types rather: the reduce key, reduce value, output key, and output value. We have decided that the reduce key is the word, the reduce value is the count, and the output is the word with its final count. So, Text and IntWritable are the types. Now, the reduce function in Hadoop is called with the following parameters. The first is the reduce key, which we have said is of type Text, so that is Text here. Next, the reduce value is IntWritable, but the reduce function is called with a number of these values, so the type it is called with has to be Iterable parameterized by IntWritable. This is again Hadoop specific. Iterable is basically something you can iterate over, and it is itself parameterized by the element type, and that same type which is declared for the class has to be specified here. So, this is Iterable<IntWritable> values, the second parameter to the reduce function. The third parameter again is the context, which provides various things; here it is used to output the final key-value pair, the final count for the word. Again, it throws a couple of exceptions which we are ignoring here. The reduce function does something very simple. sum is set to 0. Then it says for (IntWritable val : values); this is the Java syntax for stepping over something iterable. If you say "for something colon values", values is the thing being iterated over, and here I declare a loop variable val whose type is IntWritable. Note that values is a collection of IntWritables, so val had better be of type IntWritable. That is part of the for loop; this is standard Java syntax, the for-each loop available since Java 5.
If you are not familiar with it, this is how you can step over all the values in the iterable; values is what we iterate over. So, now it says val.get(). val is an IntWritable, and IntWritable's get() returns an integer value, and I am just adding that to sum. sum starts at 0, and I keep adding. The idea here is that many map functions might have output the same word with different counts; all those counts are added up here to get the final count for a particular word. And finally, I write the key, that is, the word, with a new IntWritable of sum; that is the total count for that word. So, that completes the map and reduce functions. The next set of things is how to set up the job parameters. But before I get into that, I would like to take some questions in case people have not understood the map and reduce functions. We have Pandai Peria, please go ahead. In what languages can we write MapReduce programs, only in Java, or in other languages too? Let me take the question as I understood it. The question, I think, was: what languages can you write MapReduce in? The example I have shown uses Java, but the Hadoop platform actually has plugins to handle many other languages, so you can write your MapReduce code in one of many other languages. What happens is that the Hadoop system is built using Java, and there are many languages which can be compiled down to Java bytecode. Any of those languages can be used with Hadoop, because the map and reduce functions can be written in those languages but compiled down to the same bytecode, and the Hadoop infrastructure simply executes the bytecode; it does not care what language the bytecode was written in, as long as the type system matches the Hadoop type system. So, there are MapReduce bindings available for many languages in the Hadoop framework itself. Any other question, please go ahead. Sir, we have another question: what is the equivalent command in PostgreSQL for ls, which we use in Linux? So, the question is, in Linux we use ls to see what files are in a directory; what is the equivalent in PostgreSQL? This is not big data related, but a quick answer: if you are using the psql command line prompt, there are backslash commands; type \h or \? and you will see a list of commands. I think there is a \dt or something which shows the list of tables, and there are variants of that to show different things. Or, if you use a graphical interface such as pgAdmin, it provides that for you. But let us focus on big data questions today, if you do not mind. Let me take some chat questions here. One of the questions asked is: do we have to use "for each" or "for" in the MapReduce code? Maybe my pseudo code might have confused you, so let me make this clear. Hadoop is described here, and the Java syntax which I have shown here works with Hadoop. If you go back to before Hadoop was introduced, we used a lot of pseudo code. None of that will work in Hadoop; it is pure pseudo code, not meant to work on any system. The "for each" loop there is not a real construct; it is not a Java construct, it is pseudo code. I used some other code which looks more like Java, but that is also pseudo code and is not going to run on Hadoop. I cannot say "if date between ..."; that is not valid Java syntax, it is pseudo code. So, do not use any of those things in Hadoop; use only the Java constructs which you are familiar with, or the constructs which are used here.
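For reference, here is a sketch of the Reduce class as just described, again following the standard WordCount example; the lab handout may differ in minor details:

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
      @Override
      public void reduce(Text key, Iterable<IntWritable> values, Context context)
          throws java.io.IOException, InterruptedException {
        int sum = 0;
        // Add up all the counts emitted for this word by the various map tasks.
        for (IntWritable val : values) {
          sum += val.get();
        }
        context.write(key, new IntWritable(sum));   // emit (word, total count)
      }
    }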
This is how Hadoop works with iterables; even if you are not familiar with the syntax, I think from this example you can get the hang of it, and you are most welcome to use it. Now, let me see if there is anything else. People are asking about guidelines for partitioning in big data; again, Hadoop has a lot of configuration parameters and we are not going to touch most of them, but we do have to touch a few to get your program running, and that is the next thing I am going to talk about. There are no other questions directly related to today's Hadoop code, so let me go ahead and tell you the next part of Hadoop, which is the job parameters and how you set up an overall program in Hadoop. These are the next two slides. So far I only showed you the map and reduce functions; you have to do a bit more to get your job running; well, a little bit more, not a lot more. The first thing is that you have to tell Hadoop which classes contain the map and reduce functions. I sort of mentioned this earlier: I told you that this class is called Map and this class is called Reduce, but your program can have many such classes with different names, and it does not matter what they are called. So, you have to set up a Hadoop job, and you can run multiple Hadoop jobs one after another; each Hadoop job needs to be told that this is the map class and this is the reduce class. There are two methods, setMapperClass and setReducerClass, which you have to call, and we will show that in the next slide, to tell it which map function and which reduce function to use in this job. You also have to tell the job the types of the output key and value; that you do using setOutputKeyClass and setOutputValueClass. This is for the final output; the types of the intermediate outputs are provided by the generic classes themselves, so those provide the other types, but the Hadoop job has to be told about the final output types. Now, the input format is another important thing. I kind of glossed over how the system gives you the input key and input value from a set of files; how does it decide what the keys are and what the values are? These are functions which you can also provide implementations of, but the simplest thing is to use one of the built-in formats; the default format in Hadoop is the text input format. The text input format says: I have a number of files, and each line of each file is a record by itself. The newline character terminates a line, so each line is a separate record. The value is the contents of one line, and the key is the byte offset into the file. The file name is not part of the key, but the context parameter which is passed to the map function has the file name accessible from it, so if you need it, that is also available. You set this with setInputFormatClass; there are other formats available which break the input into records based on some other record terminator, or you can even create your own, which lets you parse the file and break it up into records however you like. Then the next thing is where the input data is and where the output data goes. The way Hadoop does this is that you provide directories. So, you say addInputPath and setOutputPath. The input path is a directory which contains files; Hadoop will process all the files in that directory. The directory may contain a large number of files; it will process all of them, and they are all part of the input.
The output path tells it where to store the output. Now, if you saw the earlier figure, each reduce function outputs a reduce key and value. The reduce functions run on a number of machines, M different machines here, so each machine is going to create one or more files, and the output directory is where all these files will be stored. They will have various names, and you will see that in the lab today. So, setOutputPath says which directory the output files have to be stored in. When you run a program, you first have to create a directory and store all your input files in that directory, and in the program you have to call addInputPath with that directory. There are many more parameters; we are not going to cover all of them, and most have default values. So, here is our overall Hadoop program for word count. The main function is in another class, class WordCount; you can call this whatever you want, and it should have a public static void main(String[] args). That has to be defined, and it can throw some exceptions. This is what sets up the job and executes it. What is it doing? Configuration conf = new Configuration(); the Configuration is the thing which holds all the parameters. Then we say Job job = new Job(conf, "wordcount"), passing in the previous configuration and a name, "wordcount", to identify the job. Now, for the job we set various parameters. In this case we have not really altered any configuration parameters, but if we wished to, if it were really running in parallel, we could set the number of reduce tasks, the number of map tasks and so on; a task is like a machine. We have not set them, so all the configuration parameters take their defaults. In fact, the way we are running it is on just a single machine, so the defaults are for the single machine version. Next, we have to tell the job that the output key class is Text and the output value class is IntWritable: setOutputKeyClass(Text.class) and setOutputValueClass(IntWritable.class). This is the way of telling it the classes of the output key and value. Similarly, we have to tell it which classes contain the map and reduce functions: job.setMapperClass(Map.class) and job.setReducerClass(Reduce.class). Note that Map and Reduce are the classes we defined earlier; we could have called them whatever we wanted, and the names Map and Reduce are simply what we use here. Then we set the input format to TextInputFormat.class; we could also have defined our own format and used that instead. Similarly, for the output format we say TextOutputFormat.class, which says that the output is a file with one line per record, and the record will have the key and the value. Then we say FileInputFormat.addInputPath(job, new Path(args[0])). What is going on here? First of all, this word count program is going to be invoked with two command line arguments. The first argument is the input directory which contains the input files. So, when I say new Path(args[0]), args[0] is the command line argument with the directory name; Path with that name basically creates a path, and we call addInputPath with it. And the next one is for the output: the second command line argument is the output directory.
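Putting the pieces just described together (waitForCompletion is explained next), the driver looks roughly like this; it is a sketch matching the lecture's description, so the actual lab code may differ in small details:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

    public class WordCount {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();          // default parameters, single-machine mode
        Job job = new Job(conf, "wordcount");              // the name identifies the job

        job.setOutputKeyClass(Text.class);                 // final output key type
        job.setOutputValueClass(IntWritable.class);        // final output value type

        job.setMapperClass(Map.class);                     // the Map class shown earlier
        job.setReducerClass(Reduce.class);                 // the Reduce class shown earlier

        job.setInputFormatClass(TextInputFormat.class);    // each line of each input file is a record
        job.setOutputFormatClass(TextOutputFormat.class);  // one (key, value) pair per output line

        FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory

        job.waitForCompletion(true);                       // start the job and wait for it to finish
      }
    }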
Now, in general you should delete this output directory beforehand; Hadoop will create it automatically. So, we say setOutputPath with that directory name, and when the job runs it will create the directory if it is not present and fill it with as many files as required. Now we have set up the whole thing, and then we say job.waitForCompletion(true). waitForCompletion means start the job and then wait for it to complete, because I have passed true. If I said job.waitForCompletion(false), it would return immediately and let me do other stuff while the job is executing. The job may take some time to execute; it is potentially running on big data and may take a long time, so meanwhile I could do other things, but in this case my program is just going to wait, and when the job finishes the program will return. So, that completes the whole program which we have for today's lab. The first thing you will be doing in today's lab is running this program and making sure it runs. The second thing you will be doing is modifying this program to make a very simple change in the functionality. The bulk of your lab will actually be in setting up the environment. The environment is a little finicky; there is a bunch of stuff you have to do, so please follow the instructions in the lab carefully today. If you miss some steps in those instructions, you will get into trouble and get error messages. So, please follow the steps religiously; in particular, there are some environment variables and so on which you should not miss. All of this has to run on Linux. Now, when your workshop coordinators were here, some of them said that they have a problem: they do not have Linux, they have only Windows. A few people tried to set up Hadoop on Windows; that is actually not trivial. There is a Hadoop port which runs on Windows Server, but I do not think that port runs on normal Windows 7 or Windows 8 non-server versions. So, if you have only Windows in your lab, you have a problem; you may not be able to do today's lab at all. I do hope that there are not too many centres in this soup. I hope all centres have been able to install Linux and run Hadoop, but those centres which could not, please report this to us and we can follow up. Those of you who are in such centres, please go back later, get access to a Linux machine, and do this lab on your own. If you have trouble, you can use Piazza to post your questions, and there will be somebody answering them. We have Apex Institute Bhubaneshwar; if you have some questions, please go ahead. Sir, what are the disadvantages of replication, and what techniques are used to replicate files? The question was: what are the disadvantages of replication? The disadvantage is obvious: you have to have more disk space. But it is essential in such large systems, because there is a high chance that some disk somewhere will fail, so you absolutely need replication to build any large system with thousands or tens of thousands of nodes. The main thing you can control is how much replication to use. Should you use 2-way, 3-way or more? Or, instead of full replication, maybe you can use some version of RAID, with parity bits stored in other places rather than full copies. All of these are actually used; I am not getting into the details, but many of these things are under the control of the configuration for, say, HDFS or other similar systems. There are other distributed file systems out there. Any follow up question?
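(A small aside on that replication question: the HDFS replication factor is just a configuration property. A minimal sketch, assuming the standard dfs.replication property and a hypothetical file path:)

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReplicationSketch {
      public static void main(String[] args) throws Exception {
        // dfs.replication (default 3) controls how many copies HDFS keeps of each block.
        Configuration conf = new Configuration();
        conf.setInt("dfs.replication", 2);             // request 2-way replication for new files
        FileSystem fs = FileSystem.get(conf);
        // The factor can also be changed for an existing file:
        fs.setReplication(new Path("/data/docs/part-00000"), (short) 3);  // hypothetical path
      }
    }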
Okay, so the next question: in big data, Hadoop technology is generally used for splitting the file and parallel processing; what is the basic difference between Hadoop technology and the slicing technology in data mining? I am not sure what you mean by slicing in data mining, but there are many different projects and approaches to parallel processing, and at the core they are all similar: you have to partition the data in some way, then process it, and then maybe combine the results with some repartitioning and so on. Whether it is parallel databases or Hadoop, the basic principles are the same. We can take some other questions from another centre. Sree Ramakrishna, Coimbatore, please go ahead. Sir, one question: does Hadoop replace databases or other existing systems? The question is: does Hadoop replace databases or existing systems? That is a good question. How does Hadoop compare with databases, and should it replace them? I am going to talk about that in the last part of my big data presentation, so I will defer the question for now; I do have slides on this and I will talk about it. Selvam College, Namakkal. Sir, good morning, sir. Sir, how can big data be used in cloud computing? Several people have asked this question about big data and cloud computing. First of all, let me explain to those of you who are not familiar: what is cloud computing? In cloud computing, you do not have machines sitting in your location; somebody else has a large number of machines out there and is willing to rent them out to you, either at the level of machines, or they will rent out the use of some service, maybe the use of a platform. For example, Google Docs lets you use documents online; that is an example of cloud computing. There are others, but Amazon is one of the big players here. You can go to Amazon and rent machines by the month, by the year, even by the hour. Actually, it is not a full machine but a virtual machine, which might be sharing hardware with a few other virtual machines, and you can rent it by the hour. So, cloud computing encompasses all of these kinds of things. Now, what is the connection between big data and cloud computing? You could run a platform like Hadoop locally on machines which you have bought, which is probably okay for academic institutes because we tend to have labs with 40, 50, 60 machines. I can see on the video that many of you are sitting in labs of that kind of size, and you can very well run Hadoop across 40 or 50 machines in the lab and do pretty useful things; 50 is not bad, you can do a lot of parallel processing with that. But if you want to run Hadoop on really large data with 1000 machines, setting up your own cluster of 1000 machines for occasional use is crazy. You are much better off going and renting those machines by the hour from Amazon. So, there are many people who use Amazon to run Hadoop. In fact, Amazon now offers a service where you can basically get an instance of Hadoop running; you do not even have to rent a machine. You rent the Hadoop service, and you have to, of course, give them the map and reduce functions, the job configuration and so forth, and the data, and they will run it for you and give you the results. So, that is the connection between big data and cloud computing. Okay, thank you sir.
Government Engineering College Bikaner, you have a question, please go ahead. Can we do MapReduce on video data also, sir? In the MapReduce paradigm, you can do anything you want in the map and reduce functions. You can also have map-only jobs: you just do a map and nothing else. So, there are people who use this infrastructure to do all kinds of things which are unrelated to databases. For example, there are people who want to do video transcoding. Let us say you have recorded video for this lecture at a certain frame rate; you also want to convert it to other frame rates for people who have a lower bandwidth connection. In the case of viewing, this is, I think, done on the fly, but there are YouTube-like sites which let you upload a file which then has to be converted, and there are a lot of these jobs which need to be done. So, people have used MapReduce where the map function simply does video transcoding, for example. What is the benefit of doing it this way? Well, you get fault tolerance, and you have the ability to store these files in a distributed file system, HDFS. So, yes, this infrastructure is used for many kinds of things which have nothing to do with databases, because it provides a fault tolerant parallel processing setup; it is used for that aspect, not for the data. Now, the reduce part of it is more data related, and things which use both map and reduce tend to be database related. In fact, somebody was asking about the connection between Hadoop and databases. Hadoop is a very low level infrastructure; databases provide you a much higher level view. So, people have tried to provide the high level view that databases provide, including SQL, on top of an underlying Hadoop implementation. People have also noted that the Hadoop implementation is not very good for database query processing; it is actually not very good, to be honest. You can do things much more efficiently if you build something specifically for data processing rather than a general purpose platform, and so there are now several platforms which are tailored to these kinds of things. I will have a few slides on that coming up. I will take a couple of questions from chat. One of the questions says: can you explain Hadoop with a practical implementation? We do have this word count program, and in today's lab you will actually be running a Hadoop program, so it is very much part of the syllabus. The next question is: can we use Hadoop for implementation of machine learning algorithms which are recursive in nature? That is a good question. A Hadoop job consists of a map step and then a reduce step. Now, what if you want to do something more complex which has to iterate (recursion and iteration can replace each other)? If you have a machine learning task which needs multiple iterations, how do you do it on Hadoop? The answer is that in the main program which we saw, we set up a job and ran it; it is possible to set up a sequence of jobs and run them one after another iteratively. So, it is very much possible in Hadoop, but possible is not equal to efficient. People have noted that Hadoop is not very efficient for such iterative things; there are better ways of doing it. First of all, for Hadoop itself there is a new version called YARN, which is in beta right now but is already quite usable and has better support for iteration.
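To make the "sequence of jobs" idea mentioned above concrete, here is a rough sketch of a driver that runs two jobs one after another; the WordCount Map and Reduce classes are reused purely to keep the sketch self-contained (a real iterative algorithm would have its own mapper and reducer per pass), and the intermediate directory name is arbitrary. The second pass is shown as a map-only job via setNumReduceTasks(0):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class TwoPassDriver {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Pass 1: an ordinary map + reduce job.
        Job pass1 = new Job(conf, "pass1");
        pass1.setOutputKeyClass(Text.class);
        pass1.setOutputValueClass(IntWritable.class);
        pass1.setMapperClass(Map.class);
        pass1.setReducerClass(Reduce.class);
        FileInputFormat.addInputPath(pass1, new Path(args[0]));
        FileOutputFormat.setOutputPath(pass1, new Path("pass1-out"));   // intermediate directory
        if (!pass1.waitForCompletion(true)) {
          System.exit(1);                       // stop if the first pass failed
        }

        // Pass 2: reads pass 1's output; shown here as a map-only job.
        Job pass2 = new Job(conf, "pass2");
        pass2.setOutputKeyClass(Text.class);
        pass2.setOutputValueClass(IntWritable.class);
        pass2.setMapperClass(Map.class);
        pass2.setNumReduceTasks(0);             // map-only: map output is written directly
        FileInputFormat.addInputPath(pass2, new Path("pass1-out"));
        FileOutputFormat.setOutputPath(pass2, new Path(args[1]));
        pass2.waitForCompletion(true);
      }
    }

Each pass is a full MapReduce job with its own shuffle, which is exactly why iterating this way is workable but not especially efficient.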
And then, there are many other platforms which people have built for specific purposes. For example, for graph algorithms I mentioned Pregel; that is much more efficient for iterative processing of graph algorithms. There are other things people have built for iteration too. The next question on chat is about a memory error while running the Hadoop test examples given on the Moodle website; please help. I am not sure about this particular problem; I have not seen it myself. Maybe you can post it on Piazza and somebody can take a look at it, but we did not see this ourselves. One possibility is that your machine has very little memory, or you have many other tasks running on your machine, which means it is out of virtual memory and then you are in trouble. So, maybe you should close other processes and run only Hadoop, nothing else. The next question is: how can we use Hadoop in information retrieval, especially web information retrieval? Your lab assignment is to modify the Hadoop WordCount program we have given, making a small change which is essentially like one step in building an index, an inverted index as it is called. Inverted indices are at the core of web search systems. So, you are going to do an assignment which does not really build an index, but takes one step towards building one; when you do your assignment, that will become clearer. It is not a very hard assignment, but it is a first step in building an index. The other part of the question was about web crawling. Yes, Hadoop, I believe, is also used for web crawling, but there were some limitations. It was used extensively for web crawling and probably still is in many places, but Google, which pioneered using MapReduce even for processing web pages and indexing, has actually moved to an alternative architecture whose name I forget; they have a new system which is tailored to web indexing. You can use Hadoop or the general Google MapReduce for this, but their new system has some performance improvements over raw MapReduce, and that is what they are using internally, I believe. The last question, which I will take from chat: after range partitioning, what does the reduce function do? The idea is as follows. Range partitioning is used to ensure that all records with a particular key land up on one machine, but that machine is getting such records from many other machines. So, if we go to the whiteboard: this was a range partition, which means that a particular machine here, P0, gets everything with key less than 10. Now, there are many mappers which have generated values with key less than 10, and let us say all of that comes to this machine; but for a particular key value, let us say 7, it has to collect all the records with key equal to 7. So, what happens is that after the range partitioning, each machine does a merge; the things which it gets are already pre-sorted, and it does a merge to get a final sorted list of all key values. Once it has that, it can invoke the reduce function on specific key values: the final sorted list has everything with key value 7 together, so it can call the reduce function on all those values with key 7. That is how that step works. Maybe I will take another one or two live questions, then I will move back to my slides. We have BSA College. Good morning sir. Sir, can we use MapReduce for business intelligence? Yes, certainly. This is an active area; a lot of companies in the BI space are working on MapReduce implementations.
Now, the thing is that BI was focused on systems which use SQL as the back end. The initial MapReduce systems were not SQL systems; they were different, and Hadoop is not SQL. So, interfacing an existing BI system to such a Hadoop framework was not easy. But in the last few years there is a new layer on top of Hadoop called Hive. It is actually not just a layer; the implementation is currently on Hadoop, but now there are new implementations of Hive on other infrastructures. What does Hive provide? It provides SQL on top of such big data. Hive can run on top of a MapReduce platform: it takes SQL queries, runs them using Hadoop underneath, and gives back results, so you can run these queries on very large data sets. Now that Hive is available, people have started building business intelligence tools using Hive as the back end: they generate SQL and let Hive execute it. So, that is a very active area right now. Any follow up? Good. So, let us move back to the slides and let me wrap up big data. Before we move to a short description of Hive and so forth, let me cover one other technical thing; let me show the figure for this. Here, what happened is that each map task ran map on a number of values, collected those output values, and then partitioned them to all the reducers. Now, if you look at our word count program, map outputs a word with a count of 1. Suppose a particular machine has a large number of documents. The same word may occur many times in one document, and many more times across documents. If you send this output across the network to a reducer, the network overhead is going to be very high, and you do not want that; the network is a bottleneck today. It turns out that typical machines today use Gigabit Ethernet; 10 Gigabit Ethernet is still considered very expensive, whereas a machine typically has several hard disks and each hard disk can pump out 50 to 60 megabytes of data per second. So, you can generate data much faster than the network can handle, and you want to reduce network traffic. The idea is that we can do some kind of local reduce within the machine: do some sorting, collect all occurrences of a word together, and add up their counts. That idea is implemented directly in Hadoop through what is called local pre-aggregation, to minimize network traffic. In Hadoop there is a call in the job configuration to use a combiner. If you turn on this option, it will run the reduce function as-is locally once, and the output of that local reduce is fed to the main reduce function. This works fine as long as the input and output types of the reduce function are the same. If the input and output types of the reduce function are different, you run into trouble, in which case you can create your own combiner function with its own input and output types, while reduce keeps its own input and output types. I am not showing the details here; you can look them up. In fact, the combiner can work at two levels: one is within a machine, and, if you look at the architecture of most data centres, there are a number of machines in racks; let me use the whiteboard to explain this. A typical data centre has a number of racks; this is what a rack kind of looks like, and the data centre has a number of them. Each rack has a number of servers; they are very thin, like an inch or so high. So, a rack has a number of these.
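(Before going on with the rack picture: turning on the combiner just mentioned is a single extra call in the WordCount driver shown earlier; a sketch, valid here only because our Reduce class has matching input and output types:)

    // In the WordCount driver, alongside setMapperClass and setReducerClass:
    job.setCombinerClass(Reduce.class);   // run reduce locally on each map task's output
    // This works because Reduce maps (Text, IntWritable) to (Text, IntWritable);
    // if the reduce input and output types differed, you would write a separate combiner class.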
A very common number is something like 40 or so machines in a rack, and in the rack there is a switch, typically a 48 port switch. Then there are many more such racks, and there is another switch which is connected to each rack. Typically these inter-rack links are 10 gigabits per second, while the network connections inside the rack are 1 gigabit per second. This is a very typical setup; it may change over time to faster links, but this is what is common today. The point is that there are 40 machines here, and if they all start throwing traffic across to other racks, you are going to generate 40 gigabits per second, whereas this network can only handle 10 gigabits per second. So, it is good to reduce the traffic coming out of a machine, and it is also good to reduce the traffic going out of a rack. In fact, you can set up Hadoop to run a combiner for all the data in a machine; then you send the data to one of the machines in the rack (actually not just one, you can parallelize this also) and do a combining step inside the rack to further reduce the data volume, and then you send the data across to machines in other racks. So, at each step you do a combiner pass which reduces the amount of data. That is supported in Hadoop. Now, what about implementations? MapReduce at this scale was pioneered by Google, but they do not open source very much. Hadoop is open source, thanks to Yahoo, and there are other implementations which combine databases with MapReduce. Somebody had asked how databases and MapReduce combine; there are several different answers. One of the answers was: take an existing parallel database system and add support for the MapReduce paradigm in it. Aster Data, which, if you recall, I had mentioned had a co-founder from IIT Bombay called Mayank Bawa, pioneered MapReduce within parallel databases, and there are others. But there is a completely different approach which is now gaining traction. To understand that, let us compare MapReduce versus parallel databases; there is a lot of discussion about this. MapReduce is widely used; lots and lots of people are using it, and even in India there are many companies using it (Hadoop in particular, I should say): computing PageRank, building keyword indices, doing data analysis of web click logs, and so on and so forth, many, many uses. But if you look at many of these things, not all, maybe not keyword indices, but analysis of web click logs, it is actually easier to do in SQL than in Hadoop. And the database people said: look, we have been doing this for a long time, and we know better how to handle this kind of data for simple query processing. And the Hadoop people came back and said: look, you may say you have built parallel databases, but they are very expensive, hardly anybody actually owns them, and they run at the scale of hundreds, not thousands, of machines; we are much better at handling failures, and we allow procedural code in MapReduce and data of any type. That is true. On the other hand, many of the uses of MapReduce turned out to actually involve structured data, or data which can easily be converted to structured form, followed by aggregates. For this, SQL is actually a much better language for expressing what you want than hand coding using Hadoop.
It is actually crazy how much effort you have to put in to create a Hadoop program, as you will see today, to do something simple which you could write as a two-line SQL query. The word count you cannot do exactly in SQL, but some of the other things are much easier to write in SQL. So, once people realized this, they built other interfaces which provide declarative languages on top of MapReduce. The first project in this space built a language called Pig Latin, which is not SQL but is more like relational algebra, with a number of operations on data. This came from Yahoo and was open sourced by Yahoo, so you can actually download and use it. But soon after this, Facebook had a different project called Hive, in which they decided to forget new languages and use SQL itself as the language, maybe a subset of SQL, but it is SQL. The other thing they did was to say: look, Hadoop says files can be in any old text format, but let us put a restriction that the files have to have a schema associated with them, and a way to take the raw data in the files and output fields in some structured format which can then be consumed by the SQL system. Otherwise, for every file you have to build your own parsing routine, and that is not good; let us have a standard format for files. Of course, if you have a web server which generates a log in its own format, then you have to add to the formats that Hive supports in order to understand it, but there are only a few such cases. So, Hive has a notion of a schema for every piece of data it has; it is much closer to SQL in that sense, and it supports the SQL language. This is now very, very widely used. Hive has really taken over from raw Hadoop: many people who were using raw Hadoop for data analysis have moved to Hive. Of course, Hadoop continues to be very widely used for things which cannot be expressed in SQL, and there are many such uses, but those which can be done in SQL are quickly migrating to Hive. There is also a system called Scope from Microsoft which does similar things. And finally, there are many extensions of the MapReduce paradigm itself to add joins, pipelining of data and so on, to make it more efficient; I will not get into the details. So, that is the state of the world: Hadoop is still used as the underlying infrastructure, but there is a layer on top called Hive which is widely used. Now, there is a last part which I want to wrap up big data with. So far, we have been focusing on decision support: we have a lot of data and we want to do analysis using the data. Now, the other part is that if you want to store, update and retrieve individual pieces of data, files are not the right format for it. Files are fine if you just have logs and so on which do not get updated, but there are many uses which are more OLTP-like. If you want to store records which can be individually accessed, updated and so forth, files are not the answer. So, again Google built the first system which addressed this particular issue, called BigTable. This is a really cool system because it gives you the abstraction of a table, actually with more features than a basic table. You can retrieve records and you can store records in this table. The records themselves can have further structure; they are not necessarily in first normal form. And the key thing is that these tables can be enormous: many terabytes, even petabytes.
And that logical view of a table is provided, but the actual physical table is stored across potentially tens of thousands of machines. The nice thing is that as the table grows, you can incrementally grow the system by adding more processors, and BigTable will smoothly migrate data, little by little, onto the new processors; it never has to bring the system to a halt. So, this was a really nice system, and once Google published it and people heard about it, they wanted to build their own. Yahoo built a system called PNUTS, and more recently Apache built a system called HBase, which is open source; PNUTS is not. So, HBase is open source and widely used now; many people are using HBase. All of these provide what is called a key-value store. It is not a full-fledged database. You can give a key, like the primary key of a record, together with the content of the record, and say store this; and you can give the primary key and say retrieve the record. What these systems lack is support for indices other than the index on the primary key. Again, after these were built, people started adding support for secondary indices, transactions and so on; that is still an active area which is under development. So, if you use HBase, you are not going to get too many features in terms of secondary indices or transactions, but people are developing such systems, and a few years from now open source versions may be available. Google has already built these in-house: they have actually built massively parallel, scalable database systems which have SQL and most of the bells and whistles of databases, but which run across enormous numbers of machines. That is getting popular within Google, and eventually you will get publicly available open source clones. Last few questions. Bansal College Bhopal. Sir, my question is, what is the basic difference between RDBMS and Hadoop? The question is: what is the difference between an RDBMS and Hadoop? Hadoop is a layer which provides just a little functionality: it provides map and reduce, which you can run in parallel across many machines with fault tolerance. It is not a database at all. And the point I was making in the last few slides is that you can build a database on top of Hadoop. In particular, if the database is focused on decision support, where you want parallel query processing, you can build it, and Hive is exactly such a system: it is a database which supports SQL and is built on the Hadoop infrastructure. So, Hive is the... sorry, Hadoop is the underlying execution engine for Hive. That is the basic relationship between Hadoop and a database system. Now, Hive itself provides an SQL layer, and people are actually building Hive implementations which sit not on Hadoop but on other infrastructure which is more suited to database operations; that is the next wave in Hive. There are some commercial systems already available which claim much better performance than Hive on Hadoop: it is Hive on some other infrastructure which is much faster than Hive on Hadoop. This is from GRIET, Kukatpally. Sir, we would like to know, what is the difference between HBase and HDFS, sir? HBase and HDFS; okay, that is a good question. Let me just use the slide here. So, the question is, what is the difference between HBase and HDFS? HDFS is a file system.
You can store files, just like on your local machine: you can create files, you can write to files, you can read files. HDFS provides exactly this interface. The only difference is that the files are stored across a large number of machines, and it has replication, fault tolerance and all of that built in. So, if a machine dies, HDFS can continue to live and provide you access to those files. HBase provides similar functionality, but for records; it is what is called a key-value store. By key-value, what we mean is that a record must have a key, like the primary key of the record, and the value is the value of the record. The only way to access a record is by specifying a key value, and the system will retrieve the record; or you can give a key and a value, and it will store the record. So, it supports just two basic operations, get and put. There are some extensions, but these are the two basic operations. This is what HBase provides; it is not a full-fledged database on its own. Now, by the way, Hadoop is the MapReduce part; so far I have been talking of Hadoop running on top of HDFS, but Hadoop can also run on top of HBase. You can store records in HBase and then run Hadoop programs directly over the HBase storage; it does not have to be HDFS. Does that answer your question? Back to you. Thank you, sir. And one more question is there: what is the function of mappers? Okay, the question is, what is the function of mappers? I already told you that you have to write a map function and a reduce function. A mapper is basically a task which runs these map functions on multiple key-value pairs; that is a map task. Similarly, a reducer is a task which runs the reduce functions. We have Fisheela Adanshan, Godavat, please go ahead. Good morning, sir. Sir, my question is whether we can have Hadoop on our desktop computers, with Linux or Windows; and also, other than Hive, is there any database system which can work over the underlying Hadoop file system, sir? Thank you. Good question. First of all, Hadoop was built on Linux systems, and it can run on your desktop very easily; the resource requirements are very small, particularly if you just run it in single machine mode, which is the default mode we run it in. Hadoop can also be configured to run across many machines; then there is some more work to set it up. We have not put that in your lab, but you can certainly follow the instructions which are available and set it up across all the machines in your lab. I can see your lab has lots of machines, so you can run Hadoop in parallel across all those machines. Now, coming to the Windows part. Microsoft realized that Hadoop is very widely used. Initially, they built their own things called Dryad and Scope; Scope is a database layer on top of Dryad, and Dryad is their equivalent of Hadoop. Unfortunately, in the market, Dryad did not do that well. So, Microsoft has now also paid a company, what is it called, I forget the name, to build a Windows port of Hadoop. That port is currently available for Windows Server, not yet for the Windows desktop. So, as of now, I believe it is difficult to run Hadoop on a Windows desktop. Some people have used Cygwin, which provides Linux functionality on Windows, but that is more complex; you cannot do it in a more straightforward manner right now. I am sure that will change.
What was the last part of your question? Are there any database systems other than Hive which can work over the underlying Hadoop system? So, many people have been developing projects which do several different things. One is to replace Hadoop by extensions of Hadoop, or by new systems which support Hadoop's operations as well as other relational operations such as join, which Hadoop does not support natively; you can code joins in Hadoop, but it is more cumbersome and slower. So, many people have been working on layers underneath and changing the underlying implementation. And also, many people, including some of the same people, have been building other systems which run SQL on top of Hadoop or other similar systems. So, there are quite a few projects out there, but the Hive project has really taken off and it is kind of becoming the accepted standard, so you might as well go with that as of now. We have RBS College. Please go ahead, RBS. How to handle unstructured data using Hadoop, with respect to the MapReduce algorithm? So, let me answer that question: how to handle unstructured data using Hadoop. The word count program which we just saw operates on unstructured data: there are a number of files, and we are breaking up the files into lines and counting the words in each line. You might alternatively not break it up into lines; you could treat the whole file as one record and do a word count per file. But the bottom line is that that example directly shows how we can take unstructured data and run Hadoop programs on it. And your lab exercise is, like I said, to take the first steps towards building a keyword index, which is also functionality on unstructured data that you will be using Hadoop for. So, our goal here was not to have you do SQL-style processing on Hadoop, for which, like I said, you might as well use Hive, but to get exposure to doing other kinds of things on Hadoop, which you will still need. Even if you have Hive, for things like this, word count is a toy, but for other operations on unstructured data Hive is not going to solve the problem for you; you will have to go back to Hadoop. That is our goal in the Hadoop assignment. I will take maybe a couple of questions from chat. This one is about the difference between MapReduce databases and parallel databases. Like I said, we can build parallel databases on top of the MapReduce infrastructure, or we can build them on some other parallel relational engine. That is, traditional parallel databases had an underlying engine which could run all the relational operations in parallel: join, group-by aggregate, outer join, duplicate elimination, sorting. For all of these there are parallel versions, and the parallel database chapter in the textbook talks about how to do all this. So, traditional parallel databases used this as the underlying infrastructure and ran SQL queries on top of the parallel relational operations. The Hadoop infrastructure does not directly support these operations, but you can work around that and implement them; that is what Hive did initially. But more recently, people have built other implementations and put Hive on top of them, so what you get is essentially back to a parallel database after removing Hadoop from underneath. People have asked about how to check the status and health of a cluster in Hadoop.
So, I do not have the details, but there are tools which let you monitor what a Hadoop cluster is doing; you can search on the web to find them, and I think some come with Hadoop itself if you download the Hadoop sources. Last one or two questions. One question is, what is fault tolerance in HDFS? Fault tolerance means that in spite of something failing, your system should be able to continue executing. In the context of HDFS, when one of the machines which contains data fails, it is okay: there is a replica somewhere else, and HDFS will tell you where the replicas are located, so you can go read the data from there. HDFS provides an API, and the read function in it will do all of that underneath; you do not have to worry about it. You just say read, and it will get you the data from whichever copy is live. That is fault tolerance in HDFS. There is also fault tolerance in Hadoop itself: if a map task fails, Hadoop will automatically re-execute that map task on some other machine. The original machine may be dead; the task will be executed somewhere else. Even if one of the machines is merely slow, say all the other machines have finished their map tasks but one machine has not yet finished because it is live but running very slowly, Hadoop will automatically execute that map task on some other machine in parallel, and whichever copy finishes first will be taken. So, if this machine was very slow for some reason, the other machine will finish and provide the result; if it was a temporary glitch and this machine actually caught up, well, this one can be used. Similarly, if a reduce task fails, Hadoop will re-execute the reduce task and regenerate the output files. So, there is a lot of fault tolerance built into Hadoop, and similarly in Google's MapReduce. I think I will stop on Hadoop and big data here.