There is time for questions. The question is about fragmentation of data. In the relational database literature, there is something called horizontal fragmentation versus vertical fragmentation. What does this mean? Suppose I have a big table like this, with a lot of rows, and then there are columns like this, maybe lots of columns also. What is horizontal fragmentation? In horizontal fragmentation, the table is broken up into pieces; let us say there are 5 machines, so this part goes to machine 1, then machine 2, machine 3, machine 4 and machine 5. That is horizontal fragmentation. Is this notation clear? This is what Hadoop runs on. It is based extensively on horizontal fragmentation, because it breaks up data into files, and the files are broken up into lines or records or whatever. In the relational database literature, there is another concept called vertical fragmentation, which says these two columns are processed at machine 1 and these two at machine 2; I am breaking up the relation by columns, not by rows. In fact, you can have a combination: I can say this part is on machine 1, this part is on machine 2, this part is on machine 3, and so on. You can combine horizontal and vertical fragmentation. But Hadoop focuses on horizontal fragmentation; in fact, Hadoop has no idea of a schema. All it knows is records; it does not know what is inside a record. That part is up to you to deal with when you write the map and reduce functions. So Hadoop is very low level. It does not know anything about schemas, and that is also the reason it is a pain to program in Hadoop. We just saw a lot of code for doing something very simple. It is very nice for many things, but it is a lot of work to do simple things.

That motivates why SQL is making its way back into this community. There was a period when people said NoSQL, forget SQL, SQL is old: we can write programs in Hadoop which can run on thousands, tens of thousands of nodes, while the database people are struggling with parallel databases running on 200, 300 nodes. That is where it started. But eventually people realized that, sure, you can do it, but it is a lot of work to write the programs, and after all that work it is still very inefficient. They said, we can run programs on 1000 machines; today the database people are coming back and saying, why are you using 1000 machines? We can do it better on 100 machines; we don't need 1000 machines; our code is better, it is faster. That is what is happening in the world today. Of course it is not the original database vendors; there are other companies coming in and doing these things, and databases and SQL are actually making a comeback.

There are a couple more small topics in Hadoop before I move on to some other topics here. The first is what is called local pre-aggregation. I said that each mapper outputs all these values, which are then sent to a reduce function. Suppose I need to send all this data over a network; that takes a lot of time. If you look at the way technology has gone, CPU speeds kept increasing for a long time; now CPU speeds are not increasing that much, or not at all, but the number of cores per CPU has kept going up and the number of CPUs per machine has gone up. Today it is routine to have a single machine with 32 cores, and internally the bandwidth is very good. What about the disks? Disk bandwidth has been going up steadily.
It used to be maybe 10 or 20 megabytes per second; these days you can get 300 megabytes per second even from a single disk. It has gone up a lot. What about the network? Network speeds reached 1 gigabit long ago. All your laptops have gigabit Ethernet cards, and the servers running at Google probably also have 1 gigabit Ethernet cards. That has become a big bottleneck. 10 gigabit Ethernet is available, but it is more expensive; I am sure it is used, but it is still fairly expensive. I don't know why; maybe the networking people have been lazy, I don't know. The bottom line is that the network is a bottleneck today for many things.

So what local pre-aggregation says is: wait, why am I sending all this? On my one machine I have processed, say, 100 documents. Can I group together and do a local group-by on those 100 documents and output the word counts, so that each word goes out only once with a count across these 100 documents? That makes sense. Why ship all this stuff across the network and create a bottleneck in the network? That is what the combiner does, and you can run it within a machine. You can even run it at another level.

So how are these things organized, how do you organize a system with a thousand machines? The way you do it is you have racks; here is one rack (I am not very good at 3D drawings), and here is another rack. What does each rack contain? Each rack has a number of machines; this is one machine. A typical server these days is in the 1U format, a standard unit which is about 4.5 centimeters (1.75 inches) high. So one rack like this, about say 6 feet high, can take about 40 machines, plus some switches on top and a few other things. That is a typical rack. Within this rack there is a network switch, and this network switch in turn is connected to another network switch. So each rack has its own network switch; all the machines in the rack are connected to it, so let us say this switch has 40 connections to the 40 machines and then one or more connections to the main switch. Typically the links to the machines are 1 gigabit and the uplinks are 10 gigabit ports. This is typical.

So first of all you can do pre-aggregation within each machine to avoid overloading the 1 gigabit links. Then at the rack level you can send all of these to one of the machines which can do another round of aggregation, or even that can be parallelized: across the 40 machines in the rack it can do a sum again, so you don't send out the same word 40 times but only once with a combined count. That is the rack-level combiner. So, within the machine and at the rack level. The Hadoop framework lets you define a combiner function and you can say "use the combiner"; that is an option you can give it, and then it will run the combiner at each machine, and if it knows about the machine layout on the racks you can also guide it to run it at each rack level. In Hadoop, the reduce function which you have defined can itself be used as the combiner, or you can specify your own combiner. My suggestion is: don't do this; it gets you into trouble if you don't know what you are doing. Initially in your programming, don't bother about combiners; afterwards, if you have time, you can play around with it.

So what are the implementations? As I said, Google's is not open source, so we can't use it. Hadoop is what everybody uses and it is freely available; there are others such as Aster Data and so on, and then there are some others which are used in other places, but Hadoop is today the de facto standard, and that is why we are using it.
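To make the combiner option concrete, here is a minimal sketch in Java against the Hadoop MapReduce API, assuming the word-count job we saw earlier; the class names WordCountWithCombiner and SumReducer are made up for illustration, while Job.setCombinerClass is the actual hook Hadoop provides for this.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountWithCombiner {

  // Sums the counts for one word; the same class can serve as combiner and as reducer.
  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable c : counts) {
        sum += c.get();
      }
      context.write(word, new IntWritable(sum));
    }
  }

  // Job setup; the mapper class and input/output paths are omitted here,
  // they are exactly as in the word-count code seen earlier.
  static void configure(Job job) {
    job.setReducerClass(SumReducer.class);
    job.setCombinerClass(SumReducer.class);  // turn on local pre-aggregation
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
  }
}

This works only because summing is associative and the combiner's output types match the reducer's input types; for logic that does not have this property (a naive average, for instance), turning on a combiner can silently change the answer, which is exactly the kind of trouble mentioned above.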
So now, MapReduce versus parallel databases; I have been mentioning this throughout the talk. There was a lot of buzz about MapReduce; then the database people said, but we have been doing this for a long time; and the MapReduce people said, hey, you never ran at this scale, and with fault tolerance. Where does fault tolerance come in?

Back to this picture: a lot of map tasks, a lot of reduce tasks. What if the machine running one of the map tasks fails, what do we do? The map task is not replicated ahead of time, but the master can detect that this mapper is not generating data, or that it is dead entirely, and the same work it told this machine to do it will run on another machine; the exact same map task is rerun elsewhere. What about the output of this task? Its output will not be consumed until it is fully generated, so if the task fails in the middle, before its output is consumed, the output can be thrown away, and the reduce tasks will wait for a newly created copy of the task to finish before they start. The way it works is: first all the map tasks are run; if a map task does not finish or does not generate data, it is effectively rerun; and only when all the map tasks are done and have sent their data to the reducers is the map phase finished. The reducers can collect data in the meantime, but they do not run the reduce function yet: they wait until all the map tasks are finished and they have got all the data they need, then they do the final sorting and run the reduce tasks. What if a reducer fails? No problem; you can get the data again from the map tasks and rerun the reduce task. All of this is done by the master, so even if there are failures it can recover. In fact, even if a map task does not fail, sometimes it runs slowly. Why? Because that machine may be doing some other work; somebody is running some very intensive job on it and it becomes very slow. Even in that case the master will rerun the map task, so stragglers, that is, slow machines, can also be handled; not just outright failures but even slowdowns can be dealt with. So this is the kind of fault tolerance that it gives, plus it uses the fault tolerance of the distributed file system itself. So failures are nicely handled.

And then this was the major thing: procedural code in the map and reduce functions. For example, PageRank and so on can be done in a relational database, but you first have to pre-process the data, which is a lot of work; in MapReduce all of that can be done procedurally. On the other hand, MapReduce is very cumbersome, and in fact execution is slow, for very simple tasks for which SQL is actually a much better fit. And the second part is that programming in MapReduce is very painful. So two things happened in parallel. The first was to provide a better interface to MapReduce for tasks which are more naturally database tasks, and there were three such systems. One is a system called Pig, developed at Yahoo; Pig is open source (I was mixing it up with another system that was kept internal). What Pig does is provide an algebra, essentially, and let you specify these operations and run them on the data; it has predefined a number of operations and you can use those. You don't have to write map jobs from scratch, so it is declarative.
Then there is Hive, which came from Facebook, which actually lets you directly write SQL, a dialect of SQL; in fact Hive also specifies a schema for each input source, so although the data may be in files, Hive knows the format of the data and can read it properly. And the third one is SCOPE from Microsoft; this one is again not open source, it is used internally. So these are three systems which allow declarative queries, SQL or variants of it. And then there have been many extensions of MapReduce, for example to allow joins by partitioning of the data; in fact they have come to the point where they are now like a parallel database implementation with MapReduce on top of it. There are several such systems available. The combination of these has had a huge impact: there are a lot of companies building these, some open source, some not, and there are a lot of products out there, which some people call NewSQL, which are really massively parallel systems that can run SQL queries directly.

So far, what is MapReduce for? It is for analyzing data. What if you want to store data, what if you want to do transaction processing at really large scale, what can you do? MapReduce is not the answer to that; MapReduce is only for querying. But databases are not only about querying. Databases have two aspects: there is online transaction processing and there is decision support, and so far in big data all we have been talking about is decision support. Well, in fact there are systems for the other side, and these are called big data storage systems. Again they started from the ground up, meaning the initial systems do very simple jobs, they don't do a lot; in fact, all that they do is store keys and values, so they are called key-value stores.

Looking at the history of it: you need to store huge amounts of data, you need scalability. Distributed file systems, like the ones underlying Hadoop, have been around, but the problem is that they assume each file is very big. GFS and HDFS will die if you throw one billion files at them; they just cannot handle that kind of scale. They can handle hundreds of thousands of very large files, maybe millions of very large files, but not billions. But a lot of the data that comes from data processing applications is very small: each piece of data is small, and you have billions of such pieces. Take Facebook as an example: what kind of data does Facebook generate? Each time somebody posts a comment, that is data; each time someone posts a photo, that is data; each time somebody clicks on "like", that is a piece of data. Each piece of data is really small: it identifies who clicked, whose page it was on, the time, and a few other things, and the number of such things generated is huge, absolutely huge. What about the friends data? Each person has friends on Facebook; how many friends does the average person have? I don't know, maybe a few hundred. That data is also pretty small, it is not very big, but there are billions of users on Facebook. So if Facebook tried to store all its data with, say, one file created per user on GFS or HDFS, it would die; these systems cannot handle so many files.

So what happened? Again, Google took the lead in this. It built a system called Bigtable, which is built on top of the Google File System. Underneath, it creates a number of large files, each file a few hundred megabytes, but within those files it stores much smaller records. The primary interface is: you can put a record with a key, you can retrieve the record for a given key, and you can scan all records in a given key range. Just three operations; there are a few minor ones, but these are the three main ones: put a record with a key, get a record by key, and a range scan.
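To make the interface concrete, here is a small, purely illustrative Java sketch of the three primitives backed by an in-memory sorted map; all the names here are invented for the example. Real systems such as Bigtable or HBase expose the same put/get/scan shape, but on top of distributed, persistent storage.

import java.util.NavigableMap;
import java.util.SortedMap;
import java.util.TreeMap;

// Toy key-value store: the three primitives, backed by an in-memory sorted map.
public class ToyKeyValueStore {
  private final NavigableMap<String, byte[]> data = new TreeMap<>();

  // put: store (or overwrite) the value associated with a key
  public synchronized void put(String key, byte[] value) {
    data.put(key, value);
  }

  // get: retrieve the value for a given key, or null if absent
  public synchronized byte[] get(String key) {
    return data.get(key);
  }

  // scan: return all records whose keys fall in [startKey, endKey)
  public synchronized SortedMap<String, byte[]> scan(String startKey, String endKey) {
    return new TreeMap<>(data.subMap(startKey, true, endKey, false));
  }
}

The point is only the shape of the interface; everything interesting in the real systems is about how these calls are served at scale, which is what comes next.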
Now what do you do with these three operations? Actually, you can do a lot with just these three primitives, and a lot of applications are built on exactly this kind of infrastructure; it is very powerful. It is not a full-fledged relational database. Why? There is no notion of integrity constraints; well, there is an equivalent of a relation, but full-fledged integrity constraints are not there. If you want SQL, you need one more layer: these systems do not provide SQL, they do not provide transactions, they do not provide concurrency control, they do not provide many things. What they do provide is scale, at a level which the traditional databases do not reach. These have been used extensively: first Bigtable from Google, and then Yahoo came up with its own version called PNUTS; that one they did not open source, it is internal. More recently the Hadoop project has also built something called HBase, which is a clone of Bigtable, so that is also available. We are not doing anything with it in this course, but if you have students who want to do something new with the latest tools, HBase is a nice thing to try out; we have had several student projects at IIT this year using HBase.

So what is a key-value store? You have two main operations, put a key with an associated value and get a key, plus maybe range scans, and the system may store multiple versions of the data with a timestamp. A very simple interface. How is it implemented? This is a schematic which comes from the PNUTS system from Yahoo. What you have are what are called tablets. Why tablets? It has nothing to do with medicines; it is a piece of a table. A piece of a cigar is a cigarette, a piece of a table is a tablet. So you break up tables horizontally into a lot of pieces and store them across many nodes. Note that these boxes here are machines, and each machine contains many tablets; it is not one tablet per machine, it is potentially hundreds of tablets per machine. Why? Because it gives flexibility. Suppose this machine is overloaded, there is a lot of demand on the tablets on this machine; what do you do? You can move one tablet, or two tablets, or some small number of tablets, from here to there and rebalance the load. Or you can purchase a new machine, put it in, take one tablet from each of the existing machines and put them on the new machine; that way you have grown the system by one machine without seriously affecting the work on the existing machines, since each is copying out only a small amount of data. Each tablet is typically a few hundred megabytes. So that is how the data is partitioned.

Now what about queries? When a request comes in, you have to know which tablet contains the data and where that tablet resides. So an index is kept: the master copy of the index is at the tablet controller, and there are router machines which have a copy of that information. Given a key value, the router can look up that index; by the way, the index is on ranges, so it says, for example, that any key value between one and twenty thousand is in this tablet, or that keys whose hash falls in a certain range are on this machine. This partition-to-tablet mapping is actually fairly small and it fits in memory: although the data is huge, the mapping is small. So given a key value, I know which tablet it is in and where that tablet resides; I can find that quickly, and the query is sent to the appropriate tablet.
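As a rough illustration of why this lookup is cheap, here is a sketch in Java of a range-partitioned routing table: a sorted in-memory map from the start key of each range to the tablet (and the machine) holding it, consulted with a floor search. The names TabletInfo, tabletId and serverAddress are invented for the example; PNUTS and Bigtable keep richer metadata, but the lookup is the same idea.

import java.util.Map;
import java.util.TreeMap;

// Illustrative router: maps the start key of each key range to the tablet holding it.
public class TabletRouter {
  public static class TabletInfo {
    final String tabletId;
    final String serverAddress;  // machine currently hosting this tablet
    TabletInfo(String tabletId, String serverAddress) {
      this.tabletId = tabletId;
      this.serverAddress = serverAddress;
    }
  }

  // Sorted by range start key; even with tens of millions of tablets this fits in memory.
  private final TreeMap<Long, TabletInfo> rangeStartToTablet = new TreeMap<>();

  public void addRange(long startKey, TabletInfo tablet) {
    rangeStartToTablet.put(startKey, tablet);
  }

  // Find the tablet whose range contains the key: the entry with the
  // largest start key that is less than or equal to the key.
  public TabletInfo lookup(long key) {
    Map.Entry<Long, TabletInfo> entry = rangeStartToTablet.floorEntry(key);
    return (entry == null) ? null : entry.getValue();
  }
}

For example, if keys 1 to 20,000 live in tablet T1 on machine A and keys from 20,001 onward in tablet T2 on machine B, the router holds just two entries, and a lookup for key 15,000 returns T1 on machine A.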
Now what about replication? The master also keeps track of tablet replicas, and updates have to be sent to all the replicas; I have not shown that in this figure, but that is how the system works. Bigtable is essentially the same architecture, although some details differ. And how is each tablet managed? In PNUTS the tablets are stored as relations in a MySQL database or in a Berkeley DB store; in Bigtable they built their own infrastructure for storing tablets, which is highly optimized for writes, and they have some interesting techniques there. So that is the underlying system. Again, you cannot get PNUTS or Bigtable, but HBase is a clone of Bigtable and uses a similar architecture, so I would encourage you to get your students to do some projects with these. Many, many companies are taking to this.

Coming back: why are companies bothered about this now, who cares? Thirty years back, or twenty years back, it was only Walmart and a few other companies, AT&T. Then the web came, and many companies, Yahoo, Amazon, maybe Flipkart, Rediff, many companies have websites. But if only those companies were interested in these technologies, it would again be a small market. What is happening is that every company today is interested: when people are searching for their products on the net, they want to know about it. I want to know who is searching, and I want to advertise; that is another side of it. The search companies will happily sell information about who searched for what, and companies will buy that data; it is a lot of data which is available to be bought, and they want to analyze that data to make decisions. So even companies which do not operate a major website have a huge amount of data and they are analyzing it. There are a lot of opportunities in the job market in this area, so it is good to have students look at it, at least the more advanced students.

So, time for a few questions. The question is: if the size of the tablet coordinator becomes large, how do we manage the access time for the tablet coordinator? Which one, the master copy or the router? The master copy. So, suppose you have tablets of 100 megabytes each and, let us say, a petabyte of data; how many tablets are there? At 100 megabytes each, a gigabyte is 10 tablets, a terabyte is 10,000 tablets, and a petabyte is 10 million tablets. So you have to store information for 10 million tablets. But what is it that you need to store? You need some hash function and range boundaries and so on; 10 million entries can easily be stored in memory, that is nothing. It is very easy to fit this in the memory of one machine. Does that answer your question? What if the size keeps increasing day by day? Okay, I think your point is that eventually this is going to become very big, and then what? That is a good question. This is an engineering decision that people have taken: today, with the memory sizes that are available, scaling to a petabyte is easy, and the biggest deployments out there today are a few petabytes, three or four petabytes. That is nothing: three petabytes is 30 million entries, and even at 20 bytes, or 50 bytes, per entry, that is about 1.5 gigabytes, which is small. So as an engineer you would say that for now this is good enough; maybe five years from now, when it has grown to five times this, we may need to rethink this.
In fact this has already happened, not for Bigtable but for GFS; GFS has been re-engineered. GFS also did something like this: there is a master which tracks which blocks of which files are on which machine, and that runs into trouble when you have more than a hundred million files. So, like I said, 50 million tablets or 100 million files is fine, but if you go to a billion files they die. Now, you do not need to go to a billion tablets, because tablets are under your control: even if the records are small, you are grouping many records into a tablet. For GFS it is not in their control; if an application comes and creates a billion files, GFS will die. So GFS has been re-architected: they have a new system, called Colossus I think, in which the master information is spread across multiple machines, so it can handle much larger sizes.

Okay, so the next question is: what is this request-to-tablet mapping? I think I skipped something here; I did not tell you what a request is. A request is a get or put call, which we just saw; it is not a web request at all. It is: put a key and a value pair, or get a key, or maybe a range scan. So the request specifies a key value or a range of key values, and with this mapping the router says that this particular key value is at this tablet. That is done with range partitioning or hash partitioning of some kind, and it needs essentially one range per tablet; that is why I said that about, say, 50 bytes per tablet is enough. Okay, good. So now we will have a coffee break here.