Good morning everyone. Welcome to day 9 of our workshop. Today we are going to cover several topics. The first topic is big data, and after that we will move on to transactions and a little bit of concurrency control. This afternoon the lab is on big data; in particular, we will be using the Hadoop software to do MapReduce programming. Today's lab on Hadoop will run on individual machines: you will just be running Hadoop locally on one machine. But exactly the same thing can be configured to run on thousands or tens of thousands of machines, which is how things are done in the real world when there are really big data needs. There are, of course, smaller big data needs which may run on 20 or 40 machines, but these things scale up to tens of thousands of machines.

So, that brings us to all this noise about big data. First, what is big data, and is it really new? Is it different from what was there before? Maybe it is not all that different, but there is a lot of hype these days because there is a great demand for processing very large amounts of data, well beyond the scope of a single machine. So, what we are going to do is look at the MapReduce paradigm: first cover the concepts, then talk about a specific implementation called Hadoop, which you will be using in the lab today, and then briefly look at other aspects of big data, including big data storage systems.

So, what is the MapReduce paradigm and why is it needed? Like I said, first of all, what is big data? There has always been a need for processing volumes of data beyond what can be done on a single machine. Now, that need was felt by only a few big companies two decades back, in fact more like two and a half decades back. There were a few very large companies which were gathering enough data that they had to worry about how many machines to run on; one or two machines were not going to cut it, and they had to run computations on literally hundreds of machines. So a company called Teradata was formed back then, and even before that there were parallel database research projects from the University of Wisconsin and the University of Tokyo, both of which showed how the normal relational operations can be nicely parallelized across many machines. So the underpinnings were laid back then, in the mid-1980s; the work took off around probably 1983 or so, quite a long time back, almost three decades now. And in the late 80s this company called Teradata was launched, and it initially had a few very big customers like Walmart, which you have heard of, probably the biggest retail firm in the world. Back then it was more USA-centric, but even then it had literally hundreds of major stores across the US and terabytes of data, which in that era was even more scary than it is today. Today a terabyte is something you have on one disk in your machine; in that era your typical disk was maybe 100 megabytes if you pushed it. So a terabyte already meant 10,000 disks, and they were talking of terabytes, with tens of thousands of disks.

So, in that era parallel databases started growing, but the customer base was relatively small, and there was a steady growth which exploded in the late 1990s when the web exploded. So, what was the influence of the web? There were lots of people going online and doing things. Now, what were the things they were doing?
It started off with just going and looking at web pages, but soon e-commerce grew. So people were buying things on websites like Amazon, and these days you have Flipkart in India. They were also running search queries on the web; in that era, even before Google, there were many very successful search companies. There was a company called AltaVista, another called Inktomi. You may not hear about them anymore, but they were very successful search companies, and there was Yahoo, and there was Hotmail: any number of websites collecting a lot of data. And this data was stored primarily in log files. They were not using relational databases to store this kind of data; relational databases were never designed to handle the volume these sites were experiencing, and they were not really fast enough. It was a lot faster to have a large number of machines writing in parallel into local log files whenever something happened, and then to collect the log files and do interesting stuff with them: analyze them to see what you should be doing.

For example, an analysis of the queries on a search engine might help it decide whether it was doing the right thing. Were people getting the answers they wanted, or were they searching repeatedly with slightly different keywords because they were not getting the answers? Were people searching for certain products? A shopping website would want to make sure that things which people frequently search for are available. A particular product may not be stocked, but the searches might tell the site that maybe it should be stocking it. So there is a lot of use for taking these web logs and making business decisions based on them. This is more or less where the whole big data paradigm started, but there were many more uses.

If you look at Google, how was it going to build its web search index? That is a big index out there. It is an enormous index because it indexes billions of web pages, and each web page has thousands of keywords, so there are trillions of keywords in that index. Clearly that index cannot sit on one machine. It has to be partitioned across probably thousands of machines, and it has to be very fast in order to respond to queries. But you also have to build that index, and building the index itself has to be done in parallel. Now, there was no database which could handle these kinds of things, so the initial versions of these parallel systems were built by hand. What do I mean by built by hand? Every time people wanted to do a new thing, they would have to code up a system which would, let us assume we are running Linux machines, SSH to a large number of machines, start up processes on those machines which would participate in that job, and then, when the job was done, those processes would quit. So this is a hand-coded parallel system, and this is how the early systems were built. If you wanted to build a parallel web crawl and web indexing system, this is how it would have been built in the early days. Parallelism was already there, but hand-crafted parallelism. But after some time people realized that there was a lot of pain in doing these things. First of all, if you have to build this whole thing, starting processes and shutting them down, for every single little thing you do, the overhead is far more than what you really want to pay for every small task.
It is not really a small task, since it runs across many machines, but from a human viewpoint it is a simple task, and you do not want such high overheads for doing a simple task. The second thing is that when you have a large number of machines, there are failures. Machines die, and if you have a thousand machines, the probability that one of the machines is dead to start with is already significant; and then, after you start a computation, there is a significant probability that one of the machines will die. Or it may not die, but one of the machines may be very slow because something is wrong: somebody has got an uncontrolled process running on that machine and the machine is not able to service your requests. So it can be very slow, which is more or less as good as dead in this context. So you have to deal with failures such as these, and how do you deal with them? Conceptually, you can say: if a machine is dead, run whatever it was supposed to do somewhere else. But somebody has to do all these things; somebody has to maintain the infrastructure. Somebody has to paint the buildings, lay the roads, clear the sewers when they are clogged; all that is part of the infrastructure. We cannot take it for granted, and the MapReduce paradigm is basically something which provides a nice high-level view, hiding all these lower-level infrastructural details from the user.

So, MapReduce is a paradigm for reliable, scalable, parallel computing. That is the very first bullet on this slide. It abstracts the issues of distributed and parallel environments from the programmer. Now, you might think that MapReduce was introduced with the growth of big data, but no, it was not. By the way, I think I gave you the impression that big data equals web logs, but that is not so; there is a lot of other big data out there, not just web logs. Web logs were the motivating factor in the recent rise of big data, but you could just as well argue that the transactions which Walmart executed long ago were the big data of that era, and they are still pretty big even now. Or phone calls: every time you make a phone call, a log record is created. Back then the US phone company AT&T had one of the biggest logs of phone calls, and that was big data too. Today, our Indian phone companies probably have far bigger call logs than any US company; the tables have turned. So, coming back, there are a lot of different kinds of big data, some of which are very old in terms of the motivating applications, some of which are new.

Parallelism, correspondingly, is a very old idea; it is at least 40 or 50 years old now. And this particular paradigm called MapReduce was introduced, I think, in the 1960s, if I am not mistaken. It came from Lisp, which was also a language introduced in the early 1960s, late 50s when its development started. So back then this paradigm was introduced, and Google used this paradigm to build a very nice system, its MapReduce implementation, and that fired the imagination of everybody else; Hadoop is an open-source clone of the Google MapReduce. Now, when I say parallel system, you have to figure out where to store data. One way is to store data locally on each of thousands of machines, but what if a machine dies? What about the data which is on that machine? How do you get access to that data? The machine is dead, the data is gone. You cannot afford that. So, you need redundancy.
With RAID systems, how did we achieve redundancy? We stored a copy of the data on another disk, so if one disk died the data was still available. But if that disk is within a machine and the whole machine dies, you are in trouble. What do you do? If you look at businesses which use RAID systems extensively, they buy very expensive storage systems which have multi-ported RAID controllers and very reliable hardware; they have internal redundancy, so if something fails, some other part of the machine takes over. So they have a lot of features built in to provide very high availability, but these systems are expensive, and you cannot scale them up to thousands or tens of thousands of machines.

So, the way this was done in the context of modern big data was to first build what are called distributed file systems. This is a file system which runs across potentially thousands or tens of thousands of machines; it could be as small as even 20 machines. The idea is that if you store data on this distributed file system, it is stored redundantly across many machines. What do I mean by redundantly? If I have a particular disk block, a block of a file, it is not just stored on one machine; it is stored on at least two, and typically three, machines. Three is kind of the default: with only two copies, if two machines fail you might lose data, so three is viewed as giving good enough reliability. So each file block is stored on, let us say, three machines. If two machines die, the third one will still be available; if all three machines die, then yes, you do not have access to the data.

Now, the distributed file system provides the same file system interface as any other file system, NFS for example: an interface to open a file, read from it, write to it and so on, and the file system takes care of all the details of replication. So, if I want to read a file, what happens in a file system like the Google File System, which again was copied in the Hadoop file system? HDFS, the Hadoop file system, is the open-source version of the Google File System; there are others too. What happens is that the same block is stored in three places, and there is a master which tells the client where. Supposing I open a file: the master tells me, here are the blocks in the file that you asked for, and here are the places where each block is stored; these are the machines which store these blocks. So now I can go to any of these machines and read the block. If I go to one of the machines and it is dead and I cannot reach it, I can always go to another machine and read the block from there. So the death of one or two machines will not prevent me from reading data. If I want to update data in an existing file, I have to go write all the copies of that block. If I want to add a block to a file, I have to tell the master; the master will say, here are the machines which will store copies of your new block, and then I have to go store the block on all those machines. All those actions are done by the client; the master merely tells the client where the data should be stored. In other words, the master keeps the metadata, that is, what the files are, where the blocks of each file are stored, and so forth, while the actual data is spread across all the machines participating in the file system. So, that is conceptually how a distributed file system works.
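To make the read path concrete, here is a purely illustrative sketch in Java. The names here (MetadataMaster, BlockLocation, fetchFromDataNode) are invented for the example; they are not the actual GFS or HDFS API, just the shape of the protocol I described.

import java.io.IOException;
import java.util.List;

// Hypothetical sketch of a distributed-file-system read; all names are
// illustrative, not the real GFS/HDFS interfaces.
interface MetadataMaster {
    BlockLocation lookup(String path, int blockIndex);  // master holds only metadata
}

record BlockLocation(long blockId, List<String> replicaHosts) {}

abstract class DfsClientSketch {
    byte[] readBlock(MetadataMaster master, String path, int blockIndex)
            throws IOException {
        // One metadata lookup: where do the replicas of this block live?
        BlockLocation loc = master.lookup(path, blockIndex);
        for (String host : loc.replicaHosts()) {           // typically 3 replicas
            try {
                return fetchFromDataNode(host, loc.blockId()); // read from any copy
            } catch (IOException deadReplica) {
                // that machine is down or slow: try the next copy
            }
        }
        throw new IOException("all replicas unavailable");
    }

    // The actual data transfer goes directly to the machine holding the block.
    abstract byte[] fetchFromDataNode(String host, long blockId) throws IOException;
}

The point of the sketch is the division of labour: one metadata lookup at the master, then direct data transfer from whichever replica is alive.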
Practically speaking, you can open a file from any of thousands of machines, and regardless of where the data is stored, the file system will fetch the block and return it to you, and will let you write to a block or append to a block. So, that is the underpinning of all parallel processing. Like I said, if you do not have a distributed file system, you have to manage the replication yourself. In fact, if you go back before the Google File System era, which is now maybe 14 years or so old, what did people do? Parallel databases existed for a very, very long time. These old parallel databases did not have such a file system; instead they stored data on their own and did the replication themselves. You would have many machines, and the parallel database system managed all these machines and decided which machine stored copies of which parts of the data. It did exactly the same thing, except it was not a file system; it was a database. So you do not worry about files, you worry about relations, and about how to partition a relation.

So, let me show a slide here. It shows the following: you have a relation R and a relation S, and they have been partitioned. Now, we have seen a slide almost exactly like this in the context of hash join. The idea there was that you partition the relation such that each piece fits in memory; at least S0, S1, S2 and so on fit in memory. Here the goal is different. Here also the partitioning is done, in this case for a join, but in reality the partitioning is done up front. I am showing the whole relation R here, but it is actually stored partitioned: it is already broken up into pieces R0, R1, R2, R3 and stored across many machines. I have not shown that storage partitioning here, but imagine that R itself is already partitioned on something, and if I want to access R, I have to access all the partitions.

Now, if I want to do a join in parallel, what happens? This step here, where the pointer is, is essentially called repartitioning. R is stored partitioned in some way, and in order to do the join, I want to partition it on the join attributes. This is exactly what we did for hash join, but now the goal is parallelism, and the partitioning is across different machines. So each of R0, R1, R2, R3 is going to be on a different machine. There are a lot of machines in the system, and the data is broken up across all these machines. In this case the partitioning is done on the join attribute, and similarly S is also partitioned on the join attribute. After we have done this, in the case of hash join, if you recall, what did we do? We said we can join R0 with S0, and we do not have to join R0 with any other partition. So this machine, processor P0, can do the join of R0 with S0. Similarly, R1 and S1 are sent to processor P1, and it can do the join of R1 with S1, and so forth. The key thing to note is that processors P0, P1, P2, P3 and so on can run in parallel; there is no need for them to talk to each other after the partitioning has been done. So, first you may do some repartitioning, then you do the join locally on each machine; then you might take the result, store it locally, and then repartition the result to do something else. So, this is a small example of a parallel join. We have shown you two things here. One is parallel storage.
I did not show you the details, but relations would be stored partitioned in some way, and on top of that I did a parallel join. I can also do other operations. For example, if I want to do aggregation, I can do the same thing. There is no S here, just R, so the right half of the figure can be ignored; just take the left half. I would take R and partition it on which attributes? I would partition it on the group-by attributes. We saw this: if you want to do aggregation, you can sort on the group-by attributes, or you can partition on the group-by attributes and do the aggregation locally on each machine. If I partition R on the group-by attribute, what do I get? All the R tuples for a particular group will land up in R0; all the tuples for another group may land up in R2. The important thing to note is that because I have partitioned on the group, everything in one group will land up in the same place.

Now, that leaves the question of how you do the partitioning. So, let me go to the whiteboard and explain this briefly. One of the simple ways of doing it is called range partitioning. In range partitioning, I create a partitioning vector. Let us say I am partitioning numbers, and I have a range-partitioning vector which says 10, 33, 47, 66, 83; it has to be sorted. Essentially this says that anything less than 10 goes to partition P0. Let us say I am partitioning on some group-by column G. Then P1 gets values with 10 <= G < 33; P2 gets 33 <= G < 47; P3 gets 47 <= G < 66; P4 gets 66 <= G < 83; and P5 gets G >= 83. So, those are the partitions I get with this range-partitioning vector, and the basic idea is that I will break up the tuples in the relation based on the G value. Supposing I have a relation with the group-by attribute G and some other attributes, say X and Y. If G is 95, where will the tuple go? It will go to the last partition, P5; it will go to that processor. There are 6 processors here. If I have another tuple with G equal to 50, where will it go? It will go to P3, because it is in that range. So I will take the tuples of the relation and partition them across these processors; there is a small code sketch of this function right after this discussion.

Now, R might itself be stored partitioned in some arbitrary way, maybe just in the order in which tuples were received. So a repartitioning involves a number of machines which store the data and a number of machines to which the data needs to be sent in order to do the group-by or join or whatever. Each machine might send data to all the other machines, and similarly for every other machine. So there is a lot of data movement any time I need to repartition data, but it is a required step for this kind of parallel processing. So, that is the idea of doing parallel processing for relational operations. I have just shown you join and group-by, but you can do this for all the other operations. Sorting can also be done in parallel, hashing can be done in parallel. Pretty much everything can be easily parallelized.
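Here is that sketch: the range-partitioning function for the whiteboard example, written in Java. The vector and the partition numbering P0 to P5 are the ones from the example above.

import java.util.Arrays;

class RangePartitioner {
    // The sorted partitioning vector from the whiteboard example.
    private final int[] splits = {10, 33, 47, 66, 83};

    // Partition number for a key: keys < 10 go to P0, 10 <= key < 33 to P1,
    // ..., and keys >= 83 to P5.
    int partitionOf(int key) {
        int pos = Arrays.binarySearch(splits, key);
        // binarySearch returns the index if the key is found,
        // else -(insertionPoint) - 1.
        return pos >= 0 ? pos + 1 : -(pos + 1);
    }
}

With this, a tuple with G = 95 lands in partition 5 and one with G = 50 in partition 3, exactly as on the whiteboard.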
I am not getting into the details, but it is not too hard; there is a detailed description in chapter 18 of the book on parallel processing. So, coming back, MapReduce is based on a somewhat similar paradigm, and I will show you the details coming up. But before that, let me motivate MapReduce as a paradigm which goes beyond purely relational data.

Supposing we have a log file like this. This is a simplified version of a web server's log. What does it say? On this date, at this time, somebody accessed this file, slidedir/11.ppt. Maybe this is the log for a book website, and it has this information. Actually, a real log has a lot more: it tracks which IP the request came from, what the referring page was, and a bunch of other stuff which we have removed to simplify the example. Now I have all these log files and I want to do some analysis. Here is a very simplified example of an analysis: I want to know how many times each of the files in this directory was accessed. By the way, there are obviously many other files on the website; not all of them are in slidedir, some may be in other directories. So my goal here is to find how many times each of the files in the slidedir directory was accessed in this range of dates, from January 1st 2013 to January 31st 2013. So I need the date field too.

So, how do I do this? For our book site there is just one machine and the logs are pretty small, so it is not really an issue there. But supposing you were Yahoo, or a big newspaper company, and so on: you have lots and lots of machines serving data, and each of them has its own log file. So you may have thousands or tens of thousands of log files, and each of those log files may have some relevant data. Now, if I run a sequential program going through all those log files, one log file at a time, it might take forever if I have a very big website. One option is to load all these log files into a parallel database and then run this query. It is a very simple query: a select followed by a group-by, because I want to know the number of accesses for each file. Actually, the select is on two things: on the date, and on the prefix of the file name, since I only want accesses to files in the slidedir directory. So there are two selects, on the date and on this prefix, followed by a group-by on the file name, and the aggregate is count: how many times each file was accessed. So it is a very simple SQL query, easy to parallelize, and parallel databases handle this kind of query very well. But the issue here is that loading the data into a database is expensive, and parallel databases were expensive at that time too. So people preferred to build their own parallel systems to do this. You can build your own custom parallel program for just this one task, but then if you want to do a second task, you have to do all of that again; very expensive. The MapReduce paradigm provides a clean way of doing this.

So, let me explain what the paradigm provides and then how it is implemented. It was inspired by the map and reduce operations which were initially proposed for languages like Lisp. Back when they were proposed, parallelism was not really the goal; the goal was to be able to declaratively specify certain operations which could be done on lists.
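Incidentally, these list operations survive in today's languages. A tiny Java illustration, just as an aside (this is ordinary Java streams, nothing to do with Hadoop):

import java.util.List;

class ListMapReduce {
    public static void main(String[] args) {
        List<Integer> xs = List.of(1, 2, 3, 4);
        int sumOfSquares = xs.stream()
                .map(x -> x * x)           // map: apply a function to each element
                .reduce(0, Integer::sum);  // reduce: combine across the elements
        System.out.println(sumOfSquares);  // prints 30
    }
}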
So, you map, that is, you apply a function to each element of a list, and reduce combines results across the elements of a list. Soon people realized that this paradigm was very nice for parallel processing, and indeed it was used back in that era for parallel processing. The Google people realized that this was a good paradigm for their work too. So, let us see what this paradigm does.

In this paradigm the user has to provide two functions, map and reduce; the user defines these functions. The user also has to provide some other configuration information: what the input is, where the output goes, and so forth. We will see all that later, but assuming the data is already available, the user provides two functions. Map takes a key and a value (k1, v1) and returns a list of key-value pairs (k2, v2). Reduce takes a key k2 and a list of v2 values, and returns a single value. So, if you look at reduce, it is like a group-by operation: all the values output by map have been grouped. Map is like taking some input and creating a number of tuples, where k2 is essentially the group-by attribute and v2 holds the remaining attributes. Once the grouping has been done, all the values in a particular group are made available to the reduce function. So the reduce function's parameters are a key, that is, a group, and a list of all the values in that group, and the job of the reduce function is essentially to do the aggregation; it returns the aggregate value. That is intuitively what the reduce function does, and the map function generates the input to this grouping from whatever input it has. It may take files and generate tuples from them; that is a typical use.

So, for our example, the one which we just saw, counting how many times each file is accessed: we will assume that the system takes a file like this and breaks it up into records, where each line is a record. The system does not know the internal format of the record, but it has been told that each line is a record. So the system breaks up files into lines and calls the map function with each line. The key is typically the line number (the file name is also available, if required) and the value is the contents of that line. Those are the parameters to the map function. And what does the map function do? It is going to return key-value pairs, and the key of its output is not necessarily the line number; in our case we will see what it is.

So, here is the MapReduce code for the motivating example we just saw. What is the map function doing? It takes a key and a record; the key is the line number, the record is the contents of one line. I am going to break that line into three attributes, so I have a String array, attribute[3]. I am breaking up the record into tokens based on the space character. I have not shown how to do this, but in Java there are simple tokenization functions which can break up a line into multiple fields. Essentially, on this data the first field will be the date, because there is a space after the date; the second field will be the time, and there is a space after that; and the third field will be the file name. So I have broken the record up into date, time and file name, in attribute[0], attribute[1] and attribute[2]. Now, what does the map function do? This is pseudo code: if the date is between January 1st and January 31st, 2013, and the file name starts with slidedir,
it calls a function emit(filename, 1). That 1 is just an initial count which will get added up later; the file name is the group-by value, since I want to find the count per file name. So the map function is emitting a number of these, and they are collected together; we will see an example coming up.

There are many log files, and the map function is being called in parallel on many machines on different records. The output of the map function on each machine is collected, then some sorting is done in parallel, and the result of all that parallel sorting is that the records for a particular file name are all brought together, and the reduce function is called for each file name. This also runs in parallel, and I will show you how in just a moment. For a given file name, many things are emitted here; in fact, many copies of the same thing: the same file name is emitted with the value 1 many, many times, once for every access. When all of those are collected together, you get a string key, which is the file name, and a record list, which is basically a long list containing the value 1 many times. And what does the reduce function do? It takes the key and stores it in filename, and then, for each record in the record list, count = count + 1. It is just counting how many entries there are in the list, which is exactly the number of times the file was accessed, and at the end it outputs (filename, count). So it has computed the aggregate count, and it emits the file name, which is the group-by attribute, and the count value. This again is done in parallel on many different file names, and the outputs are all collected together and returned to the user.

So, here is a diagram which shows what is going on. There are the map inputs, key-value pairs; in our case the key is the line number and the value is the line content. The map function takes these and outputs intermediate key-value pairs. In our case the map function simply returned the file name and the count 1, just one output, but in general each input record can result in many outputs; we will see that coming up. All of this is done essentially in parallel across many machines, and each machine is going to generate a number of such key-value pairs. Now they have to be partitioned: all the pairs with, say, reduce key rk1, which is generated here and also generated there, have to be collected in one place. In this case the values associated with rk1 were rv1 and rv7, and maybe more, so the reduce function is called with the list (rv1, rv7, ...). Similarly, reduce key rk2 was generated here and also here, and rk2 has the associated list starting with rv8. This step requires a lot of repartitioning, because the map outputs are produced on many machines: reduce key rk1 is going to be handled on one of the machines, so all the pairs with key rk1 have to land up on that machine. Key rk2 may also be on the same machine, so all the pairs with rk2 land up there too, but key rk7 may be on another machine, so everything with rk7 has to go to that machine. Now, how do we do that? On the whiteboard we saw exactly how to do that: we had a range partition. Before recapping that, let me put the map and reduce functions for this example together in one place.
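Here is the whole dataflow for the log example as a minimal single-machine sketch in Java. The log lines and the sortable yyyy-mm-dd date format are assumptions for illustration, and a sorted map stands in for the parallel sort-and-group step:

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Single-machine simulation of the log-counting dataflow; real MapReduce
// spreads these phases across many machines.
class LogCountSimulation {
    public static void main(String[] args) {
        String[] logLines = {
            "2013-01-02 10:14 slidedir/11.ppt",
            "2013-01-05 11:40 slidedir/11.ppt",
            "2013-01-07 09:03 otherdir/notes.pdf",
        };
        // Map phase: emit (filename, 1) for each qualifying line.
        // The sorted map groups all values for the same key, standing in
        // for the parallel sort/partition step.
        Map<String, List<Integer>> groups = new TreeMap<>();
        for (String line : logLines) {
            String[] attribute = line.split(" ");
            String date = attribute[0], filename = attribute[2];
            if (date.compareTo("2013-01-01") >= 0
                    && date.compareTo("2013-01-31") <= 0
                    && filename.startsWith("slidedir/")) {
                groups.computeIfAbsent(filename, k -> new ArrayList<>()).add(1);
            }
        }
        // Reduce phase: count the entries in each group's list.
        for (Map.Entry<String, List<Integer>> e : groups.entrySet()) {
            System.out.println(e.getKey() + " " + e.getValue().size());
        }
    }
}

Running this prints "slidedir/11.ppt 2": the map phase emitted (slidedir/11.ppt, 1) twice, and the reduce phase counted the entries in that group.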
Coming back to the whiteboard: we had different values, and they land up in different partitions. The MapReduce paradigm does range partitioning on the reduce key, so that one machine is responsible for a given range of keys. Machine 0 has all the keys with value less than 10, the next one has keys between 10 and 33, and so forth. These were numbers; in our example the keys are not numbers, they are file names, but the same principle holds: we may have an alphabetical sorting, and partitioning on alphabetical ranges of file names.

I think this is a good point to take questions before we move on to another MapReduce example. So, let us take questions. Dronacharya? Hello sir, am I audible? Yes, please go ahead. Sir, my question is, how do we handle security constraints related to big data? Security constraints on big data: that is a completely different topic. There are issues of privacy, of who is allowed to see what, and that is a complex issue. So let us not get into it at this point, but yes, it is a serious issue. For example, there is a company called America Online, which used to be one of the biggest internet companies long ago; it is now a fraction of its old size. That company decided to release some search logs, thinking it would help researchers. And it turned out that the search logs had enough information to identify which person had issued which query. It turns out most people do a search on their own name at some time or other; I am sure all of us have done that. It also turns out that most people query their own locality, because they want to see what is nearby, so queries mentioning pin codes or place names would be in there. So, although the logs were supposed to be anonymous, people found they could actually track down which person had issued which queries in many cases, and that led to a big mess. The person who released the logs, who was actually a very nice person who thought he would help the research community, got fired from the company as a result. It was very unfortunate; he was a nice guy, an Indian incidentally, I think. But that is life. So, there are security issues, but they are not our focus here. If you have any other question, please go ahead.

Sir, the next question is, can you brief us about big data analytics? Yeah, big data analytics. Essentially, what we are doing here is describing the infrastructure for big data analytics. Big data analytics is basically running more complex queries on big data, and what I am doing now is laying the groundwork to understand how the MapReduce paradigm works. On top of this, you do many things. You can write queries which are the equivalent of the SQL queries used in OLAP systems; OLAP is online analytical processing. Here it is not quite online, but similar queries can be executed on big data. That is one kind. Another part of analytics is data mining: data mining algorithms have been coded to run on the MapReduce platform, so you can do data mining on large volumes of data. These are all part of it. I am not going to be able to cover the upper layers, the details of analysis and data mining, but my goal is to familiarize you with the lower layers so that you know what is going on, and then you can read up on your own about the upper layer, the actual analysis part. Does that answer your question? Okay. Thank you. Thank you, sir. We can take a question from another center. We have Om Institute, Haryana. Yeah. Good morning, sir.
My question is, what is the difference between an integer value and a string field in indexing? What is the preferred data type for indexing? Okay, the question is on indexing integers versus strings. We all know about type systems in programming languages; databases also have types, and you can index integers, you can index strings. It does not matter: the type is a function of the data you are modeling. If you have a salary field, it had better be an integer or a numeric; if you have a name, it cannot be an integer, it had better be a string field. And you can index either one. B+-trees do not depend on the type, as long as the type provides a very simple function, which is a comparison function. More specifically, B+-trees depend on having a total ordering on the values of whatever type they are indexing. Integers are totally ordered: given any two integers, I can see if one is less than the other. Similarly, strings are totally ordered in alphabetical order: given two strings, I can compare them alphabetically and say if one is less than the other; of course, they may be equal. There are certain other domains which do not permit this kind of total ordering. For example, if I have spatial data with latitude and longitude, there are ways to force this into an order: I can say, first sort on latitude, then on longitude. But this is a very artificial ordering. I can still use it in some cases, but in general it is not meaningful to order points like this, first latitude, then longitude. There is no physical notion of ordering of points in two-dimensional space; they have lat-long coordinates, but you cannot order them totally in any meaningful way. So you can order them artificially and then answer certain queries, such as what item is at exactly this coordinate; but if you want to find what is near this coordinate, you cannot do it. And that is where R-trees come in. If you do not have a single total ordering, that is okay: R-trees allow multiple attributes, each with its own ordering. Latitudes are ordered, longitudes are ordered, and R-trees can be built on two dimensions like that. So, that is the difference between a B+-tree and an R-tree.

Thank you, sir. My next question is, how does the size of the index affect performance? The B+-tree's height is logarithmic in its size, and in fact, for all practical sizes, the height of a B+-tree would be maybe four or five at most. So it functions pretty well even on very large data. The only issue is that if you have extremely large data, you probably have an extremely large number of queries too, so you might have to partition the B+-tree for efficiency of querying. And in fact, there are parallel versions of B+-trees which are used in these big data systems. It turns out the solution is actually very simple: just as we range partition the data, we also range partition the index. What I mean is, if you think of a B+-tree, let me show a diagram: it is a tree, but if you look at the details, there is a node with a number of children, maybe hundreds of children. So what you do is, for example, store each of these children on a different machine. And this root node, with its partitioning information, is copied onto every machine; of course, if it is updated, you have to update all the copies.
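To make the routing concrete before I walk through a search, here is a tiny sketch; the separator keys and machine names are invented for illustration. Notice that the replicated root's separator keys play exactly the role of the range-partitioning vector we saw earlier.

import java.util.Arrays;

// Illustrative sketch of routing a lookup in a range-partitioned B+-tree.
class PartitionedBTreeRouter {
    // Replicated copy of the root node: separator keys plus the machine
    // that holds each child subtree.
    private final int[] separators = {1000, 2000, 3000};
    private final String[] subtreeHost = {"m0", "m1", "m2", "m3"};

    // Decide which machine's subtree should continue the search for key.
    String hostFor(int key) {
        int pos = Arrays.binarySearch(separators, key);
        int child = pos >= 0 ? pos + 1 : -(pos + 1);
        return subtreeHost[child];   // forward the lookup to this machine
    }
}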
So now, if I want to search for a key value from any particular machine, I have a copy of this root node locally, so I know which child to go to. These children are stored on different machines, so I actually send a request to the machine which holds the child I need: I find out which machine has this child, and I send a request to that machine. That machine has the subtree, and it will do the rest of the querying for me. Now, if you partition at one level, maybe you can have hundreds of machines, but typically the partitioning may be done at the second level. So this whole thing, the first two levels, you keep a copy of at each machine, and then, given any key, you know where to go. At the second level there may be, say, 10,000 nodes for very large data, and these 10,000 nodes can be stored over 10,000 machines, or maybe over 1,000 machines, roughly 10 nodes per machine. I keep a mapping of which node is at which machine, so I can actually do querying on this B+-tree in parallel. I can also do inserts and updates in parallel; there is some extra work to be done, but these systems can handle that. The exact details of how they do it actually differ between systems, so maybe, if time permits, I will discuss this at the end of the big data session.

We have Mount Zion. Please go ahead, we can hear you. Sir, we have heard that MapReduce suits only homogeneous environments like clusters. How efficient is it in heterogeneous environments? And if it is not very efficient, how does Google manage in its cloud implementation? So, to explain that question to others: the MapReduce paradigm breaks up a job into many small tasks and runs parts of the job on a number of different machines. In fact, it is not true that MapReduce works only on homogeneous environments. What do I mean by that? I want all the machines to run the same kind of software: if I am running Hadoop, all the machines should run Java. That is really all that I need. I do not need all the machines to be of the same speed. In fact, because the map work is broken into many tasks, it is possible for the Hadoop system to take into account the speeds of different machines and give fewer tasks to some machines and more tasks to others. Conceptually it is straightforward; maybe early implementations were not good at it. But even with identical hardware, at runtime you may find that a machine is slow, so the environment is not homogeneous anyway, and the MapReduce paradigm can deal with this. A machine which is not responding, or is slow, or is dead for that matter: its assigned tasks can be given to other machines which can complete them. So, it can work in environments where machines are not homogeneous. Does that answer your question? Yes, sir. Thank you. Thank you. We will go to another center. We have Bharati Vidyapeeth, Navi Mumbai. Hello. Sir, can we use object-oriented databases to store big data? Object-oriented databases to store big data: not really. I do not know of any object-oriented massively parallel system. The key thing is that for parallelism, you need to do a lot of partitioning, and the system has to understand the attributes on which the partitioning is done. So, I am not aware of any object-oriented database that supports this.
Now, in the MapReduce paradigm, you can do whatever you want inside the map and reduce functions. So, if your inputs are objects and you want to run functions on those objects, you are welcome to; the paradigm does not constrain you. All it needs is the keys on which it does the partitioning. So, as long as you provide the keys, you can have a parallel object-oriented system built on top of MapReduce. Any follow-up questions? Yes, sir. Sir, how do we get a big data sample? How to get a big data sample? So, I am not sure what you mean by sample; I think what you mean is, how can we get a bit of big data to run something on. See, any data can potentially be big, but if your focus is on log files or something like that, how do you get access to log files? There are sources on the web which provide log files, or you can create artificial data of your own. There is also an effort going on to create benchmarks for big data. If you take traditional relational data, there are several benchmarks which are widely used. Let me use the whiteboard: there is something called the TPC series of benchmarks, which were designed for relational data. Many people who are working on big data take these benchmarks and use them as the source for benchmarking big data systems, because it is relational data on which you can run MapReduce or whatever else you want. So, there is a series of TPC benchmarks, of which the relevant ones are TPC-H, then something called TPC-DS, and some others; there are some older ones also. These are popularly used even for benchmarking big data systems. But many people say that they are not representative of real-world big data, so there is an active community working on benchmarking of big data systems. There is a workshop which runs, I think, twice a year now; last year it was in Pune, actually. It is an international workshop. So, there are benchmarks coming up, and if you search for them, I am sure you will find several big data benchmarks available on the web. They are not accepted as standard yet; the workshops are trying to evolve towards a standard. But if you are just looking for data to try out on your own, yes, there are plenty of data sets available; you just have to search on the web.

We have MIT University, Haryana. Please go ahead, we can hear you. Sir, how can you see the behind-the-scenes work in a MapReduce program, what is actually happening? Hello, sir. Yeah: how can we see the behind-the-scenes work when executing a MapReduce program? How can you see what is happening behind the scenes? That is not something you can see easily. There may be some consoles for Hadoop which give you some idea. But the good thing is that you do not really need to see what is going on behind the scenes. It is not exactly declarative, but the parallelism part is not controlled by you; the system does it. As long as you have written your original MapReduce program properly, if it works on a single system, then unless you have done something weird, it should give you exactly the same result when run on a parallel system. When I say something weird, I mean something like global variables which are accessed by the map function or the reduce function; those will cause trouble when you move to a really parallel platform.
But as long as you stick to functions which do not use global variables, which only take the parameters passed to the map and reduce functions and do stuff only with those, nothing else, they will give you the same result in a parallel environment. Now, how exactly things are parallelized, I am going to explain to you. But if your question is whether you can observe it: I have not seen any platforms for doing this, but it is likely that there are some platforms which will let you. Please, Shankaracharya. Hello, good morning, sir. How can we remove duplicates for analysis in big data? How can you remove duplicates? If by reduce you mean the reduce function: the reduce function will get duplicates as input, just like any aggregate function; there will be duplicates, and that is not an issue. If your point is that you have duplicate inputs which you want to remove before you do analysis, because the duplicates are meaningless, that is a different issue. If there are exact duplicates, the equivalent of a select distinct, the reduce function can take multiple copies and replace them by a single copy. So, a first round of MapReduce can remove the duplicates, and then you can do your analysis afterwards. But if they are not exact duplicates, then life is a bit harder. For example, in data warehousing there are often duplicate records, meaning records which refer to the same address, but the address has been written a little differently in each input; they are actually duplicates. So, how do you do approximate matching of addresses to remove duplicate addresses? That is an issue similar to the Aadhaar duplicate detection on biometrics, which was briefly discussed yesterday. Now, how can you do such elimination of approximate duplicates in parallel? I know there has been work on it, but I am not familiar with the details. Did that answer your question, or was it something else? Thank you, sir.

So far, what I showed you is a simple map and reduce function and the schematic flow of keys across the map and reduce functions. Now I want to show you one more example of MapReduce, and then we will see how this is parallelized. So, let us take a different problem which does not look like a database problem at all; in fact, it is not: given a large number of documents, count the number of occurrences of each word in this collection of documents. So, how would you do this in parallel? I think the intuition should be obvious. You want to divide the documents among workers; workers meaning there are many computers doing this task. Each worker will parse its documents, break them up into tokens to find all the words, and the map function will output (word, count) pairs: this word occurs so many times in this document. Then you partition the word-count pairs across workers based on the word; this is the grouping, so that all the pairs for a particular word, which obviously occurs in many documents, come together at one task for the reduce, and the reduce function adds up the word counts for each word. As an example here, given one particular input, "One a penny, two a penny, hot cross buns": that is a document. What is the map task output? In this case, we have simplified the work of the map task even further.
The map task outputs the following: (one, 1), (a, 1), (penny, 1), (two, 1), (a, 1), and so on. Now, you will notice that "a" has already appeared once. The map function could have output (a, 2), meaning that "a" occurred 2 times, but that requires a little extra work in the map function. We could have done it, but we do not actually need to, because the counts will get added up later; so it outputs (a, 1) twice. Then (penny, 1) once again; penny has also occurred twice, but it is output two separate times; then (hot, 1), (cross, 1), and (buns, 1). So, that is the map function, and the same thing would be done for each document. In this case, supposing this is the only document: the word "one" occurs only once, so the reduce function would be called once with the key "one" and the list being just that single 1, and it will output ("one", 1). For the word "a", the list will have two items, 1 and 1, and the reduce function will just add up these numbers to get 2. Similarly, the reduce function is called for the word "penny" with the list (1, 1) and adds it up to get 2, and so forth. So, that is the final output of the reduce function. This is a toy example, of course.

Now, here is the pseudo code for word count. The map function takes two strings, an input key and an input value, and then, for each word w in the input value, it emits (w, "1"). Note that the 1 is in quotes because it is outputting a string; this is a very simplified function. The reduce function gets an output key and an iterator over intermediate values. Earlier I was talking of lists; this is getting closer to the Hadoop paradigm, where instead of a list I have an iterator. If you are not familiar with iterators, an iterator is a construct on which you can ask for the next element, or, in current versions of Java, write a for-each loop: for each v in intermediateValues, where intermediateValues is the variable of iterator type. In current Java the syntax is a little different, but conceptually this is what it is: v gets bound to each element of the list or set. And what is this reduce function doing? It does ParseInt(v), because v was a string, converting it to an integer, adds it to result, and finally it emits result. That is the pseudo code. I will show you the parallel processing and then come back to the Hadoop implementation.

So, what happens in the actual implementation is that the input files are broken up into partitions. Why? Because some of the files may be very big, so a big file may be broken into a number of smaller pieces. These partitions are mapped to different map tasks: I have a number of map tasks running on n different machines. Now, where are these inputs stored? They are stored on a distributed file system, and different partitions of different files are given to different map tasks. Here I have shown partition 1 going to map task 1 and partition 2 going to map task 2; maybe these two were pieces of one big file. Partition 3 may be one file by itself that is sent here, partition 4 is sent there, and so forth. Now, the user program, consisting of a map and a reduce function, is taken by the MapReduce master, and it sends a copy of the map function to each map task; so each map task has a copy of the map function. In Hadoop, it is a Java jar file which is sent to all the map tasks, and the master also tells each map task which partitions are assigned to it.
So, the map task will actually read its partitions from the distributed file system. The file system is just sitting there, passive: it can be asked for a part of a file, and it will return that part of the file. The master has told map task 1: these are the files, and in some cases pieces of files, that you have to process. So map task 1 will go fetch the files or pieces of files it has to process, apply the map function, and sort the output, the intermediate data, locally. So all the outputs of the map calls which it ran, it will sort; this is part of the overall sorting. Then it will partition these outputs to the reduce tasks. How is this partitioning done? It is going to be a range partitioning, and that range vector is available to each of the map tasks, so each map task range partitions its keys to the reduce tasks.

Now, I have shown n map tasks and m reduce tasks. These are parameters to the MapReduce job, which tell it how many map tasks and how many reduce tasks should run. That depends on how expensive the map task is and how expensive the reduce task is; if the map work is very expensive, you would have more map tasks and maybe fewer reduce tasks. So this range partitioning is used to send each key range to the right reduce task. Now each reduce task has a piece of each map task's output coming in, and it is going to merge all these pieces. How does it merge them? Well, we already saw how to do a multi-way merge yesterday. So it merges everything into one merged, sorted list, and then locally it calls the reduce function on each key and its list of values, and its output is written to a file. So, if there are m reduce tasks, there are m output files. So, this is what is going on behind the scenes.
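Since the lab uses Hadoop, here is roughly what the word-count pseudo code looks like in the actual Hadoop Java API. This is a sketch along the lines of the standard Hadoop word-count example: in real Hadoop the counts are IntWritable values rather than strings, the iterator shows up as a Java Iterable, and the input and output paths and the number of reduce tasks below are illustrative.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            StringTokenizer tok = new StringTokenizer(value.toString());
            while (tok.hasMoreTokens()) {          // for each word w in the value
                word.set(tok.nextToken());
                ctx.write(word, ONE);              // emit (w, 1)
            }
        }
    }

    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> values, Context ctx)
                throws IOException, InterruptedException {
            int result = 0;
            for (IntWritable v : values) result += v.get();  // add up the 1s
            ctx.write(word, new IntWritable(result));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);    // this jar is shipped to the tasks
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setNumReduceTasks(4);              // m reduce tasks => m output files
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

You would compile this into a jar and submit it with the hadoop jar command; the output directory should then contain one part file per reduce task.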