So, let me start off with a little bit of bhashan, as you say, on what has been happening in the database community over the last few years. Back in the 60s, databases were the preserve of large enterprises: if you were a multinational company, you would have a database; if you were a small shop, you certainly would not, because you needed a very expensive mainframe. Then there was a democratization of the whole thing, which meant a lot of focus on small databases — databases you could run if you had a shop and wanted to keep your inventory in one. A number of PC products were built around these small databases. Meanwhile, large companies kept increasing the number of people using their databases; more or less everyone in the company became a user over time, and the load these large companies had to handle kept growing. This load, by the way, was something that mainframes of that era could handle: with Moore's law, CPU speeds kept increasing, memory sizes kept increasing, and as a result a single machine could handle the needs of many enterprises. Of course, it need not have been a single core — they ended up with multiple CPUs and multiple disks — but one installation with a few machines could handle the needs. This state of affairs went along, and at one point people said that if we could reach a thousand transactions per second on a particular benchmark from TPC, then pretty much everybody's needs would be met. And a thousand transactions a second was almost unimaginable — who would need a thousand transactions a second? Well, along came some really big retail chains. One of the drivers of this was a company called Walmart in the US, which opened hundreds and hundreds of shops. Each shop had dozens of people checking out items, so every second there were items being checked out across all of these places. And what Walmart did was pioneer the collection of every single piece of data about what was sold: which item, at what time, even down to the customer — if they knew who the customer was, they would record which customer bought it. Other companies could have collected all of this; they didn't. Walmart pioneered collecting all of this data and putting it in a central data warehouse. Remember, this was in an era before the web really exploded, and they had already started doing all this. Every night they would upload the previous day's information about who bought what into the data warehouse. And one of the key things they did was analyze this data. What did they do with the analysis? They could see what products were selling and what products were not selling, and they could keep their inventory really tight. Most shops would buy enough material to last them a few months. Walmart could tell their suppliers, look, we sold 40 boxes of this yesterday; tomorrow, supply 40 more boxes to replace those 40 — or maybe supply 60 boxes, because the trend is going up. So they could do all kinds of analysis on this data and manage their inventory very, very effectively, and for a retail operation, managing the inventory is a very important part of the business. In fact, they even started giving manufacturers access to their data analysis systems, so manufacturers could see what products were selling where.
So as a result, if you had a company, let's say Hindustan Lever, they could decide what to produce in the next week based on what had been selling in Walmart over the last few days. That helped the manufacturers, and it actually set off a very virtuous cycle, because a lot of efficiency was brought into production this way. There was tremendous value, and there was also tremendous scale. Walmart managed to create the largest databases in the world of that period. Their databases were of the order of terabytes of data, back when a one gigabyte disk was normal and even 10 gigabytes was considered very big. They had databases with a few terabytes of data, hundreds of machines, each with many disks. They bought these systems from a company called Teradata, which was a pioneer of parallel database systems. Teradata is a company that is doing very well even today — they are still the leaders in parallel databases for data analysis — and Walmart was one of their driving forces. Now over time, other companies realized that such analysis was very important, and today the number of companies with such large volumes of data has exploded. If you went back 20 years, you would say, okay, there's Walmart with so much data; who else has data? Maybe one or two other companies, so why do we care? Today, if I ask you to name companies with this kind of scale in terms of the number of shops and the number of things being bought in those shops, right here in India we have a fantastic example: Big Bazaar and the other shops in the Future Group chain. They are probably as big as Walmart is in the US. And in addition, there is this other enormous thing, which is telecom. Every second person in India has a mobile phone now — hundreds of millions of active phones, and telecom companies processing the calls. One of the major factors for telecom companies is that people switch: they go with one company now, and if they find somebody else giving a better rate, they go with that company. In fact, there is a new trend, if you have been observing the ads, of dual SIM phones. The simple reason is that people do not want to abandon their old phone number, but they want outgoing calls on the cheaper plan, so they keep two SIM cards. So it is important for companies to stay competitive and to do what it takes to keep their customers, and the major factor in this is analyzing customer data. The Indian telecom companies are doing it these days, but in the US the long distance phone companies had the same problem about 15 or 20 years ago, and they pioneered a lot of work in analyzing their data and figuring out how to keep customers. So the bottom line is that there are now many, many different industries where the amounts of data are very large and you need to analyze huge volumes of data. So parallel databases have become a very important factor. And it is not just parallel databases of regular, traditional data. Now there is a whole slew of web-facing companies where the amount of data they get makes Walmart look like peanuts. They collect terabytes — and that is not the total; their databases are not terabytes, they are many petabytes, thousands of times what Walmart's was and is. That scale is something which was unimaginable some time ago for operational data. People had this kind of scale for image data: if you take all the satellites that are sending images, some institutions like NASA collect that kind of data and store it.
But they are not doing processing on this on the fly, so that is very different. The scale here is enormous, and a major challenge is how to deal with data at this scale. So today I am going to cover two aspects of this. The first aspect is: assuming the data has been acquired and is now stored in some kind of storage system — files or whatever — how do you process this data to do analysis of various kinds in parallel? One part of the answer was to have parallel database systems. They run SQL: you just run SQL, and they will parallelize it and do what it takes. In fact, there are many companies which provide this today. As an example, I mentioned Teradata. Let me put down a few names. So there are these companies: I mentioned Teradata, and then there is Greenplum, and Aster, which was the company I had mentioned as being founded by an alumnus of here. And these, as I said, use PostgreSQL as the underlying database. What they do is they have hundreds of machines, each machine running a copy of PostgreSQL, and whatever data there is gets partitioned among all the machines. Whenever you submit a query, the query itself is broken up and executed on the different machines; then there is some amount of work to exchange data between the machines as required to complete the query processing, a little more work in PostgreSQL, and lo and behold, your answer is there. In fact, what is interesting is that there was a project called affordable databases — affordable parallel databases, I should say — which Professor Phatak was carrying out a few years ago. Interestingly, in parallel with these companies, he also had an architecture where there were a number of copies of PostgreSQL on separate machines, and then a layer on top which could parallelize queries amongst them. Of course, that particular project was never commercial; it was a research project. But the others do have commercial offerings which are quite well known and popular on the market today. So this is one whole class of companies which handle very large scale query processing. But these are all SQL. Now, about 10 or 15 years back at least, web companies such as Google — Google was a pioneer in this — realized that they were getting a huge amount of data which they needed to process in some way, and most of this data was not traditional database data. Amongst the kinds of data they had to process: they collect web pages which they crawl from across the world and then build an index on them — that is one example. Another example is that they have all these logs of what people are querying, and they want to analyze those logs, to look at what answers Google has been giving and to check who clicked on what. Is Google giving the good answers at the top, or is quality going down because the answers at the top are ones nobody clicks on? So what they now have is huge amounts of click data: query logs of what people searched for, and click logs of what, in the query results, people clicked on. They have hundreds of millions of such queries and clicks every day, and they want to analyze those to see what is happening. So there is a lot of analysis on data which is not traditional relational database data, and which has to be parallelized. As you can imagine, there is no way to analyze this click data on a single machine, and there is no way you can build a keyword index on billions of documents on a single machine.
All of this has to be parallelized; that much is obvious. What is not obvious is how much parallelism you need to handle such jobs. And the answer, it turned out, was not tens of machines — tens of machines was a piece of cake for Teradata. Hundreds of machines was the upper end of Teradata's installations some years ago; today they are probably much larger, but the largest installation 15 years back was probably several hundred machines. Now, companies like Google realized that several hundred machines were not going to cut it. What they needed was thousands, in some cases tens of thousands, of machines to analyze the amount of data that they have. They really have to parallelize a task across thousands of machines. Now, this was a completely new ball game. One of the important things you realize: if you have one machine, it is up most of the time. If you have ten machines working in parallel, they are probably going to be up most of the time too. If you have a rack with a hundred machines, probably one or two of them will die every few months, and at any given time one or two are likely to be dead. But when you go to thousands of machines, at any point in time some of those machines are surely dead. So you can never actually run a computation across all those thousands of machines; you cannot assume that all the machines are alive. Worse still, while you are running your computation, one of those machines may die under you — it does happen. Or even if it does not die, it may have some problem with its disk, which means it is still working but it is much slower than the others. So here is a computation which finished in two minutes on all the other machines, and one particular machine is chugging along; after five minutes it is still not finished, and its part of the job is holding up all the others. So the moral of the story is that when you get to such a large scale, you have to deal with failures. You have to deal with recovering automatically from the failures, in the sense that if something was supposed to be done by a failed node, it should be done by somebody else; or if a node is slow, somebody else should take up its work, to finish it in case this node does not finish in time. So there is a whole lot of fault tolerance which needs to be built in. When you build applications at this scale, fault tolerance becomes very, very important. In fact, companies like Teradata knew about this, and they did in fact solve the problem: their SQL engines could continue working even if a machine died. How do you do a parallel computation if a machine dies? Well, first of all, if you have many parallel machines, the data is split amongst all these machines. This is called shared-nothing parallelism: you have many machines, all interconnected, and each machine has one or more disks internally on which it stores its data. So what do you do if a machine fails? You cannot access its data; it is gone. Therefore, one key part is replication: whatever data is on a machine, you have to keep a copy of it at one or more of the other machines. In fact, there are some very clever tricks here. One simple way is to do something like RAID 1, where you just keep a copy of the same data on another machine: you pair up machines and keep a copy of the data within every pair, so that if one goes down, the other has all the data. It turns out that simply doing replication in this fashion causes a problem when one of the machines fails. Think about what happens if this machine fails. Its data is there on the paired machine, so the data is not lost, and the paired machine can now do whatever computation the dead machine was supposed to do. However, that machine was not sitting idle — you cannot afford to keep half your computers idle — it was already doing some work. So now you have a situation where you have 100 people doing work, one of them dies, and his partner is responsible for doing both their work. What is going to happen? It should be very obvious that that partner is going to be overloaded: he is doing the work of two people and cannot keep up with the workload. So instead, one of the key things that is done is to think of each machine's data as being split into, say, 10 small parts. The first part is replicated here, the second part is replicated at the next machine, the third part at the next one, and so on. So you are still doing replication, but you have partitioned the data at each machine into small pieces, and the replica of each piece is on a different machine. Now if this machine dies, guess what: each of the other 10 machines has one tenth extra work. Whatever work they would have finished in, say, 10 minutes, they will now take 11 minutes to finish, which is not so bad. It is not as bad as eight of them finishing in 10 minutes and the one whose partner died taking 20 minutes. It is a lot cleaner this way. So replication is a key property which any parallel database uses in order to spread the load around when things fail.
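To make this placement idea concrete, here is a minimal sketch of one way to scatter the replicas. The modulo scheme, the function name and the numbers are purely illustrative assumptions, not how Teradata or any particular product actually lays out its copies.

```python
# Hypothetical replica placement: each machine's data is cut into small pieces,
# and the backup copy of piece j belonging to machine `owner` is placed on one
# of the other machines, so that when `owner` dies its load is spread out.

NUM_MACHINES = 11          # one "owner" plus ten others, as in the example above
PIECES_PER_MACHINE = 10

def replica_machine(owner: int, piece: int) -> int:
    """Machine that holds the backup copy of `piece` belonging to `owner`."""
    # Never map a piece back onto its owner, which already has the primary copy.
    return (owner + 1 + piece % (NUM_MACHINES - 1)) % NUM_MACHINES

# If machine 3 fails, see how much extra work lands on each surviving machine.
extra = {m: 0 for m in range(NUM_MACHINES) if m != 3}
for piece in range(PIECES_PER_MACHINE):
    extra[replica_machine(3, piece)] += 1
print(extra)   # about one extra piece per survivor, instead of one machine doubling its load
```

With pairing, the failed machine's partner would have picked up all ten pieces; here each survivor picks up roughly one.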
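As a quick aside, here is a minimal sketch of the count query we just parallelized, assuming the relation r has already been split arbitrarily across the machines; the lists below stand in for machines, and in a real system each local count would run on a separate node.

```python
# Parallel COUNT(*): each machine counts its own tuples, a coordinator adds them up.

partitions = [
    [("item1", 100), ("item2", 250)],                  # tuples stored at machine 1
    [("item3", 80)],                                   # tuples stored at machine 2
    [("item4", 10), ("item5", 20), ("item6", 5)],      # tuples stored at machine 3
]

# Step 1: each machine counts locally, all of them in parallel.
local_counts = [len(part) for part in partitions]      # [2, 1, 3]

# Step 2: the central query site adds up the local counts.
total = sum(local_counts)
print(total)   # 6, the same answer COUNT(*) would give over the whole relation
```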
Now let us go one step above. Suppose the query is not just a simple count, but something like: select shop ID comma count star — or it could be a sum, for that matter — from sales, group by shop ID. So what do we have here? Oh, what happened? I should look at the screen when I am writing. I am sorry — if you are wondering why I am doing this: when I take out a new sheet of paper, the paper is effectively erased because I have a fresh one, but the whiteboard does not get erased; it still has the old stuff. So today I am writing everything twice. Let me erase the whiteboard. I cannot actually see the whiteboard while I am writing — I could, but I end up looking at the paper. So, what I was saying was that this is called independent, and the query I had in mind was: select shop ID, count star from sales, group by shop ID. The question is, how will you parallelize this query? If I did not have a group by, parallelizing it was very easy: each machine locally computed the count, sent it to a common machine, and that machine just added up the counts. Now, with the group by, guess what you need to do. It is actually not that hard. With a group by, each machine is going to run exactly the same query on whatever records it has locally. So its result is not going to be a single count; it is actually a set of (shop ID, count) values. Maybe the first machine gives something like shop ID 1, count 5, and shop ID 2, count 10 — this is at machine 1 — while at the same time the next machine gives shop ID 1, count 3, and shop ID 2, count 1, and so forth. So now you have these relations which you have got from each of the machines. How big are these relations at each machine? They are probably a lot, lot smaller than the original data. How many shops do you have? Even a big chain like Walmart has only thousands of shops, whereas the amount of data in there is probably millions of sales records each day — each shop sells thousands or tens of thousands of units. So the amount of data you get as a result of executing this group by query locally at each machine is much smaller than the original data. Now, the trick is that in a simple setup the results of each of these machines can be collected at one machine, which then runs a local sum query. It is again going to group by shop ID, so the two records with shop ID 1 form one group, and this time, instead of counting, it is summing: it adds 5 plus 3 and gets 8. So if there were just two machines in this parallel setup, we just have to add up these two and we get 8. But what if you have a hundred machines in this parallel setup, each with a thousand shops? Now you have a hundred thousand records which you have to again group by and aggregate. At this scale it may actually make sense to do this second level of grouping and aggregation in parallel as well. So I hope you understood the problem. First of all, each machine has locally done the group by and aggregate on whatever data it has. After this, one option is that you send all of it to a central machine, which again does a group by and aggregate. The other option is that you divide up these records across the machines, such that all the records for a particular group — say all the records for shop ID 1, and maybe not just 1 but a few other shop IDs as well — go to the same machine. Let us say all shop IDs in the range 1 to 10 go to machine 1, 11 through 20 go to machine 2, and so forth. So what we have done is divide up the shop IDs amongst the machines. Each machine has computed a local table of (shop ID, count) pairs, and now it is going to distribute these tuples to the respective machines.
So it will say: my first record is for shop ID 1, I will send it to machine 1; the second one is also for shop ID 1, I send it to machine 1 again; here is one with shop ID 11, which goes to machine 2; and here is one with shop ID 123, which it will send to machine 13 in this scheme. So the idea is that each machine is going to get a part of this. Now machine 1 can do a group by and sum locally on whatever it receives, and what does it give finally? For each shop ID in its range it gets a final count: for shop ID 1, let us say, that is 8; for shop ID 2, its final count; and so on. So this one machine has done all the computation required for its range of shop IDs: for shop IDs 1 to 10, machine 1 has done its computation and it is finished. Similarly, for shop IDs 11 through 20, machine 2 will do its job and be finished. What have we just done? This job of the second level of aggregation we have actually parallelized across all the machines in our cluster, and each of them is again doing a part of the work. At the end, each has finished its work locally and has got all the data for its share. So the final result is now stored locally in each of these machines, and the only remaining job is to gather all the local results and output them to the person who submitted the query in the first place. So what I have just shown you is that aggregation can be parallelized very, very easily. It is a very natural thing to parallelize.
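Putting the two levels together, here is a minimal sketch of the whole parallel group by for select shop_id, count(*) from sales group by shop_id. The data, the number of machines, and the use of a hash function instead of shop ID ranges to assign responsibility are all illustrative assumptions.

```python
# Two-level parallel GROUP BY:
#   phase 1: each machine groups and counts its own sales records;
#   shuffle:  partial (shop_id, count) pairs go to the machine responsible for that shop_id;
#   phase 2: each machine sums the partial counts it receives.
from collections import Counter, defaultdict

machine_data = [
    [1, 1, 2, 11, 123],        # shop_ids of the sales records stored at machine 1
    [1, 2, 2, 11, 11, 123],    # shop_ids of the sales records stored at machine 2
]
NUM_MACHINES = 2

def responsible_machine(shop_id: int) -> int:
    # ranges (1-10, 11-20, ...) would work equally well; hashing is used here for brevity
    return shop_id % NUM_MACHINES

# Phase 1: local group by and count at every machine.
local_results = [Counter(rows) for rows in machine_data]

# Shuffle + phase 2: ship each partial count to the responsible machine and sum there.
final_counts = [defaultdict(int) for _ in range(NUM_MACHINES)]
for partial in local_results:
    for shop_id, cnt in partial.items():
        final_counts[responsible_machine(shop_id)][shop_id] += cnt

for m, result in enumerate(final_counts):
    print(f"machine {m} holds final counts: {dict(result)}")
```

Each machine ends up holding the final counts for its own share of the shop IDs, exactly as described above.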
What about other operations in SQL — joins and so on? In fact, you can parallelize joins, and we have already seen how to do it, although I did not tell you at the time that it was parallelism. Think about what we did for the hash join: we partitioned the data and then we joined each partition. So if you want to parallelize joins, a very natural thing is to break up the data. Now, it turns out the data is not initially at one machine; the data itself is already partitioned somehow across many machines, but it may not be partitioned on the join attribute. So what we do is apply hash join, but initially we are only going to do the partitioning step of hash join. What does the partitioning step of hash join do? It computes a hash function on the join attribute and partitions. So this is one relation, R; similarly we have to do something for S, but it is the same thing, so we will postpone that. What do we do for R? Here are the hash partitions, and these are machines 1, 2, 3, 4 and so on. What machine 1 does is hash its tuples and send each one to the right place. These hash partitions are again going to be on machines: machine 1 will correspond to the first hash bucket, machine 2 to the second hash bucket, and so on — the number of hash buckets you create is equal to the number of machines. So the data is going to get partitioned, and machine 1 is going to receive the tuples which hash to the value 1 from each of these machines. Each machine goes through its tuples, computes the hash value, and sends whatever has hash value 1 to machine 1, which collects that into its hash partition; similarly, those with hash value 2 go to machine 2, and so forth. So what have we achieved at the end? We have repartitioned one relation, R, on the join attribute — that is the goal of this phase. Now we do the same thing with S, which is also stored on the same set of machines. We repartition S on the join attribute, and what do we have now? On one side we have R1, R2, up to Rn, and on the other side S1, S2, up to Sn, and at this point the joins can be done locally. Each machine has all the tuples it needs to complete the join locally on whatever it has. What do I mean? The i-th machine here has hash partition i of R and hash partition i of S, and all we need to do now is join the corresponding hash partitions of R and S. So each machine locally does the join of whatever Ri and Si it has, and outputs the results locally. So what have we just done? We have taken the join operation and shown how to parallelize it across machines. There is a cost: all these lines which you see going from here to here are edges which are actually network traffic — tuples going across the network to repartition the relation — and this is actually fairly expensive. Nothing is free. But if your network is fast enough, this will work reasonably fast, and then you do the join locally and you are done. So what I have just shown you is that it is actually quite easy to partition and parallelize the join operation.
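Here is a minimal sketch of this partitioned join, with the hash function, the table contents, and the three machines as purely illustrative assumptions; the nested loop over the buckets stands in for whatever local join method each machine would really use.

```python
# Partitioned parallel join: repartition R and S on the join attribute so that
# matching tuples land on the same machine, then join locally at each machine.

NUM_MACHINES = 3

def h(join_key: int) -> int:
    return join_key % NUM_MACHINES      # one hash bucket per machine

# R and S as initially stored: one list of (join_key, payload) tuples per machine.
r_at_machine = [[(1, "a"), (4, "b")], [(2, "c")], [(3, "d"), (6, "e")]]
s_at_machine = [[(4, "x")], [(1, "y"), (3, "z")], [(2, "w")]]

def repartition(relation_at_machine):
    """Every machine scans its tuples and ships each one to machine h(key)."""
    buckets = [[] for _ in range(NUM_MACHINES)]
    for local_tuples in relation_at_machine:        # this shipping is the network cost
        for key, payload in local_tuples:
            buckets[h(key)].append((key, payload))
    return buckets

r_buckets = repartition(r_at_machine)
s_buckets = repartition(s_at_machine)

# Each machine i joins its own Ri and Si locally and outputs its part of the result.
for i in range(NUM_MACHINES):
    local_result = [(rk, rp, sp) for rk, rp in r_buckets[i]
                                 for sk, sp in s_buckets[i] if rk == sk]
    print(f"machine {i}: {local_result}")
```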
The moral of the story is that you can continue with the other operations — even sort and so on can be parallelized. You can take an SQL query, take the standard relational algebra operations we have seen so far, and parallelize them very effectively, whether it is join, selection, group by, sort, whatever; we can parallelize all of them. As a result, parallel databases succeeded in many applications. Back in the 80s and 90s, before the web era, people built a lot of parallel computers, and they were used for two kinds of tasks. One was scientific computation, like weather prediction or simulating nuclear bombs or whatever. The other was business applications which had to store and analyze large volumes of data, and parallel databases did very well in that market. Now fast forward to the current era of web-scale computing: these companies also needed parallel processing, but they had a different kind of job. SQL is a language which was designed for storing and processing data in the ways typical commercial applications needed, and it made complete sense to parallelize access to data through it: you have a declarative language for querying data. The declarative language was invented not for parallelism, but to make the programmer's job easy; but a fantastic side effect was that you can parallelize SQL queries very, very easily. Think about it: if, instead of SQL, the same thing had been written in C code — it is notoriously difficult to parallelize C code. Simply by using a declarative language, life was much easier and it could be parallelized. So that is where the world was pre-web. But when the web era dawned, people realized that they had a lot of tasks, many of which could not be expressed very easily in SQL. There were much more complicated tasks which actually needed to be parallelized and processed on very large volumes of data, and the data, again, did not necessarily have any good fit with relational databases. As an example, Google has crawled the web and has a local copy of all the web pages it found, and now it needs to do some analysis on this: it wants to compute PageRank. If you have not heard of PageRank, it is a way of ranking web pages by computing some statistics based on which web pages link to which other web pages, and it was the major reason that Google came into prominence — about 15 years back, or rather 12 or 13 years back, I should say. Google became an overnight hit because they had a new way of ranking websites, or web pages, which gave much better, much more intuitive results than the other search engines of that era. In fact, if you think about it, PageRank was described in one paper and can actually be expressed in a few lines; but that was the major difference which catapulted Google above all the other companies — that was the key thing which let them give better search results. Then, of course, many other things followed. Once it was clear that they had some really cool technology which gave better results, people were attracted to it, they got funding, they could do many more things, and they built thousands of applications which all of us use today. But the key step initially was this PageRank. Now, it turns out that the PageRank computation at web scale is actually not that easy. It is very easy to give an equation for computing PageRank, but how do you compute it on billions of documents? Carrying out a matrix computation with billions and billions of entries is mind-boggling. So obviously you have to parallelize this work, and the question was how to parallelize this kind of processing across not one but thousands of machines: each machine has some documents, and you want to do some work on each of those machines. What Google did was go back to an old paradigm for parallel processing called map-reduce. Map and reduce were introduced long ago — how long now? maybe 35 or 40 years ago — in the parallel programming language community, and it turned out to be a very, very nice way of expressing how to parallelize certain computations which did not fit at all with SQL. So what I am going to do next in today's lecture is switch away from these parallel databases, which have certainly been very successful and very useful. In fact, we have a chapter on them in the book; feel free to read the slides and then read the chapter. What I gave you here on the whiteboard was a small peek at what is there in that chapter; there are many more details. And talking of whiteboards and slides: for the past nine days I have been using a lot of slides. Part of the reason for using slides is that it makes it easier to cover material and to go through material very fast. It turns out, though, that students do not necessarily like slides. What happens is that we tend to go very fast through the material. In fact, I am almost sure that many of you found the pace of some of these topics extremely fast and that things zipped by. Part of the reason I did this in this course is that the time is limited, there are many topics I wanted to cover, and moreover the assumption is that many of you have taught a database course already and are familiar with the fundamentals. So I hope it was not too bad for most of you. But when I did the same thing with students, they kind of lost it. So, of late, I use the whiteboard a lot more with students, and I keep slides as a backup for certain things which are too tedious to write or draw on the board.
But for everything else I use the whiteboard, just as I did with you, drawing a few diagrams. It reduces the pace and, more importantly, it reduces the amount of material I cover in a class. But it turns out there is a big benefit: the students actually understand the material in a lot more depth than if I put it on a slide and zip by — here is something, hope you understood it, move on to the next slide — and guess what, they did not understand it; they did not get a chance to recover. When I do it on the board and take some time explaining it, they have some chance to catch up, assuming they are attentive. Of course, there is another problem which everybody has. I have heard this from other faculty in IIT, I have heard it from faculty in other IITs, and I have heard it from faculty in other universities across the world: capturing the attention of students these days is very difficult, because they feel that all the information they need they can get from Google. They just search Google or whatever — Bing or Yahoo or Reddit, you name it — they can get it from their favorite search engine, which is kind of true, actually. If you just take raw information, yes, they can get it. But the point of teaching a course like this is to curate that information: to decide what, out of that enormous amount of information, makes sense, and to teach it in a proper flow, so that from one topic to the next you have the background and can understand the next topic, and so on. Our job as teachers is to organize this flow, to have examples, and to help students learn. Some of you probably use only blackboards; to you there is no point saying this, you are already doing the right thing. For those of you who, like me, use slides — and we do make slides available for the book, and they are very useful — it is quite nice to back off from slides and use the whiteboard wherever possible.