So, some announcements. The take-home final is going to be Saturday and Sunday. Please plan to stay at home and work on it. I just finished entering it on Catalyst, so you're going to enjoy it, because you have to input the answers on the web. But it's very difficult to insert LaTeX snippets into Catalyst, so they might look funny. So I will send you separately the complete PDF document generated from LaTeX with the entire final. You can print it, look at it, it's nicely formatted, and you can work from that one, and ignore the formatting glitches that you find on the web quiz. But the web quiz is where you need to input your answers. It's the first time I'm using this technology, and being a take-home final, there are lots of questions, so I had a lot of work to do to insert everything on the web. I spent a long time, but I'm very happy with the result, so I hope you'll enjoy it. Next Wednesday is also the new date for the last homework. I think most of you are already busy with Pig Latin, right? We will discuss Pig Latin a little bit today during the lecture, but not too much. I'm going to focus more on the principles of parallel databases rather than on the details of the language. OK, and again, I'm encouraging you to come next week to the last lecture. So last time, I gave a reading assignment. How many people have read the paper? Oh, no, an overwhelming minority. It's a fun paper to read. It's written by Surajit Chaudhuri. He really writes very fun papers. Tell you what, put it on your to-do list to read the first two pages of the paper. It's a very quick overview of the main concepts of query optimization. I think I'd like to discuss this diagram. In query optimization, you know that one of the steps, and the most difficult step, is selectivity estimation. We need to be able to estimate the sizes of the various intermediate results, because those sizes dictate the cost of a particular plan. And this is the weakest part of the optimizer, computing the selectivity estimates. But there is actually another wrinkle: if the query is issued from within an application, then the optimizer doesn't get to see the actual values of the parameters. They are bound late, so the parameters are missing. And what this diagram in the paper shows is how heavily the optimal plan depends on the choice of those parameters. There are only two parameters in this query. It's one of the queries in the TPC benchmark, the industry benchmark for testing database systems. And one of the queries here (the paper says exactly which query, but I cut that information) has two parameters. And depending on what those parameters are, the query optimizer chooses up to 68 different plans. Quite amazing. Remember the query that I refer to quite often: select movies where the year equals 1995 or 1905. That's one parameter; the year is a value, either 1995 or 1905. Imagine a query with two such parameters. That is this query. And depending on how you set the parameters, there are 68 different plans. And this graph illustrates the distribution of those plans. Quite amazing.
And some of the research work that I'm aware of that has been happening in Surajit's group (I'm not sure he was directly involved) was to try to capture that space somehow, such that at runtime, when the parameters are finally known, the optimizer can quickly switch to a better plan. It's a very difficult problem to characterize this space. So what we'll discuss today is a critical lecture, and I'm not sure yet how much from this lecture will be on the final. We will discuss parallel databases; this is covered in sections 21.1 to 22.1, 22.5 in the book. We will discuss the basics of MapReduce. I have some fun slides about Pig Latin. I will not discuss the details of Pig Latin; I have some slides that present the basics, but I will skip some because I want to discuss Bloom filters, a key technique for processing data in a distributed or parallel fashion that is not covered in any textbook that I'm aware of. And I think the slides present it pretty well; I distilled these slides from several papers. OK, so let's go directly to parallel databases. First, a discussion of the distinction between parallel databases and distributed databases. The goal of a parallel database is purely to improve performance. You're throwing in more servers in order to get more performance. And the goal is to improve the query processor's performance or the transaction manager's performance by exploiting parallelism. A distributed database system is different. This is when different databases come together and now you want to treat them as a single database. The reason for the distribution is not efficiency, but more like business logic. And the goal is how to integrate these distributed data in a unified fashion. So keep in mind this distinction in terminology. I'm not going to discuss distributed databases; they are covered in the book to some extent. So again, in a parallel database, the goal is to improve performance. And the idea is that it's cheaper to scale by buying more processors and more servers than to buy a more powerful single processor. And the key challenge is to actually benefit from the increased number of processors. That's the difficulty. A few critical terms in parallel databases: speed up and scale up. Speed up is the idea that if you add more processors, then your speed should increase. So if you double your processors, what do you expect from your speed? You expect it to double. And if you get that, it's called a linear speed up. Usually you don't get that; you get less than linear speed up, for various reasons. By speed, we actually mean two things. Either an individual query should run faster, if the queries are expensive, or, if it's an OLTP database, then maybe your expectation is that you will run more transactions per second, so the number of transactions per second should increase. Scale up is different. Scale up says that your data increases, and to deal with that, you are willing to buy more processors. So scale up is the ability of the system to keep the same performance when the data increases by increasing the number of processors. And same thing here: performance can be the speed of an individual query, which should stay the same when the database increases and we increase the number of processors, or the number of transactions per second, the transaction throughput. So let me show you the graphs. This is the speed up graph: as we increase the number of processors, we track the speed.
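As an aside, one standard way to write down these two definitions (this is my own summary; the lecture gives them only in words) is the following, where T(P) is the running time of a fixed workload on P processors and T_k(P) is the time to process k units of data on P processors:

    speedup(P) = T(1) / T(P)          linear speed up means speedup(P) = P
    scaleup(P) = T_1(1) / T_P(P)      linear scale up means scaleup(P) = 1

Time here can equally be read as the inverse of transaction throughput, so the same definitions cover both the expensive-query case and the OLTP case.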
The speed here can be transactions per second or the inverse of the query running time. We expect it to increase linearly, but for various reasons the speed does not actually increase linearly, and at some point it may even start going down, because you throw in too many processors and they spend more time communicating than doing actual work. Scale up is like this: you expect that if your data increases by a factor of 5 or 10 or 20, and you also increase the number of servers by a factor of 5 or 10 or 20, then the performance should stay the same. The more data you have, the more servers you use, and you should get the same performance. But this doesn't happen. Instead, you get a scale up which is less than linear; it decreases. So what are the challenges to getting a linear scale up or a linear speed up? Many of you probably know this. Whenever you're using more processors, there is a startup cost you need to pay: you need to get the data to these processors, and you need to start the processes on these processors. Yes? Can you get superlinear speed up? For instance, if you have greater overall communication bandwidth available? Sure, so that's another bottleneck. If your communication bandwidth is limited, then if you increase it, you can also improve the running time. But you cannot exceed the linear speed up; this is the best you can get. The lack of a perfect communication network might cause this sub-linear behavior, so as you increase your network capacity, you might get closer to the linear speed up, but you can't exceed it. OK, so why can't we get there? Why isn't it the case that if we use 10 times more servers, we get 10 times higher speed? Well, if you have 10 times more servers, then for one thing, it takes longer to start them; if you want to execute a single query, you're paying some startup cost. Then there is interference, which can mean lots of things. It could mean that they communicate and they use the same network, and now, with 10 times more processors, the contention on the network is higher and they get slowed down during communication. But the most important problem is skew. The idea is that if you use 10 times more processors, it's very difficult to make sure that each of them handles 10 times less data, because, as you will see in a few minutes, it's difficult to partition the data correctly and uniformly across the processors. So data skew is perhaps the biggest impediment to a linear speed up or linear scale up. OK, more terminology. There are three classical architectures for parallel databases, and I'm going to discuss them: shared memory, shared disk, and shared nothing. In shared memory, you have multiple processors, and they share a common memory and the same collection of disks. Give me an example of a place where you find a shared memory architecture on your desk. Your laptop. Your laptop is now a shared memory architecture, because it has probably two or four or eight processors and they see the same memory. So if you're using any kind of software, you know the major challenge in software engineering is how to use those extra processors. Well, databases seem to have a solution: databases know how to parallelize queries. So one of the first things you can parallelize on your newest laptop is the database server.
Your database server can use those extra processors and run queries in parallel. So this is shared memory. Shared disk is actually still a very popular architecture. Here the idea is that you have separate servers, each with its own memory, but they share a common network, and through this network they access the same collection of disks. So that's called shared disk. And the other extreme is shared nothing, where you have a collection of servers, each with its own processor or processors, its own memory, and its own disk, and when they need to communicate, they communicate through a network. OK, so now think about parallelizing, and think about trying to increase the number of processors from four or eight to dozens or hundreds or thousands or tens of thousands. Which of these architectures is best suited for the various ranges of processors? Yes? It depends on your data. It depends on your data, but I think it also depends much more heavily on the number of processors. So if you only have four processors, if you want to build a parallel database system that scales to four or eight processors, which of these three architectures would you favor? Shared memory. It's feasible to let four or eight or 16 processors access the same memory, and the technology is such that the contention is minimal. Why do we like shared memory? What's the advantage of shared memory over the other architectures? Cheap. There is something that is especially cheap here. What is especially cheap? The programming. It's much easier to program; it's like conventional programming. You just need to synchronize sometimes with a parallel thread. If you know how to program threads in Java, then you have no problem programming a shared memory architecture. It's the same thing. What about shared nothing? Let's go to the other extreme. What's the advantage of shared nothing, and what's the disadvantage? The number of nodes can increase; it can increase as much as we want. All we need is a network that allows them to communicate. But what about programming? We actually don't know; it's not so clear how to do joins. Not clear at all. That's what I have in a few slides, because it's not obvious. So programming becomes more difficult. You need to program taking into account the fact that you have a shared nothing architecture. And shared disk is somewhere in between. You benefit from seeing a common disk space, but every processor has its own internal data structures and main memory. So these are the three architectures; really nothing surprising for you here. We will only discuss shared nothing, because this is the architecture that scales, and we will discuss some physical operators on shared nothing architectures. Throughout our discussion, P denotes the number of processors, the number of servers, and we expect the operators to have a linear scale up or linear speed up as a function of P. So now, thinking about shared nothing architectures, there are several places where we can apply parallelism, and these are called interquery parallelism, interoperator parallelism, and intraoperator parallelism. Let's discuss them. Interquery parallelism is when you use your multiple servers by running each query on a different server. It's kind of simple. If each server has a copy of the database, this works wonderfully for shared disk architectures, but you can also make it work for shared nothing architectures.
Then every server has a copy of the database and runs a query in isolation from the others. This is interquery parallelism. It allows you to increase the number of transactions per second using parallelism, but a single query will not run faster because of the multiple servers. The next one is interoperator parallelism. What this says is: think about a big query that has a complicated query plan. There are selections, there are joins, more joins, group bys, more selections, more joins. What you can do, if you have multiple processors, is assign each of these operators to a different processor. And if they pipeline their outputs in the query plan, then you can run these operators in parallel and take advantage of the parallelism. And the last one is intraoperator parallelism. This says that regardless of what the plan is, every operator in the plan we are going to evaluate in parallel using all the servers that we have in the parallel database system. So every single join, when we evaluate that join, we use all the servers to evaluate it. Of these, of course, the last one is the most scalable. The second one, interoperator parallelism, depends on the complexity of the plan. If you have a very simple plan, then there isn't much interoperator parallelism available; we need a complex plan to benefit from it. And the first one, interquery parallelism, has the problem that it does not speed up individual queries; it only increases the number of transactions per second. OK, so this is what we will discuss: intraoperator parallelism. The question is, how do we evaluate individual operators on a shared nothing architecture? And the rule that we apply here is that every table will be partitioned across all the processors that we have. Ideally, this is how we store the database to start with. So think about a persistent database. Every single table will be partitioned into chunks, and every server stores a chunk of each table. But the question is, how should we partition the data? And there are, essentially, no more than three alternatives that people have come up with. One is round robin. Round robin is very simple. It says: go through the tuples of your table in whatever order they come. The first tuple you send to the first server, the second tuple to the second server, until you reach the last server, and then you start again from the beginning. So you partition the tuples in a round robin fashion across all the servers. The second method uses a hash function. It picks one attribute of the tuple t, maybe the key, maybe some other important attribute in that application, and applies a hash function. The hash function is always taken modulo P; I deleted the modulo P from some of the slides because it's too long, but keep in mind that the hash function will always return a number between 1 and P, or between 0 and P minus 1. So the hash function tells us where to send that tuple. That's the second choice. And the third choice is to use range-based partitioning. Same thing, we fix one attribute that's important, and then we partition the values of that attribute into P ranges, and every tuple will be sent to the server that corresponds to the range to which that attribute value belongs. It's pretty straightforward; there is a small sketch of the three schemes right after this paragraph. Now, let's see how this works, and to understand the trade-offs, let's start with a simple case: selection.
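Here is that sketch, in Python. This is purely my own illustration: the server count and helper names are made up, a real system partitions blocks of tuples rather than calling a function per tuple, and Python's built-in hash stands in for whatever hash function the system actually uses.

P = 4  # hypothetical number of servers, numbered 0 .. P-1

def round_robin_server(i):
    # Tuple number i goes to server i mod P, regardless of its content.
    return i % P

def hash_server(value):
    # A tuple goes to server h(v) mod P, where v is the value of the chosen attribute.
    return hash(value) % P

def range_server(value, boundaries):
    # boundaries = [v1, ..., v_{P-1}] split the (sorted) domain into P ranges.
    for server, bound in enumerate(boundaries):
        if value <= bound:
            return server
    return P - 1

For example, with boundaries [10, 20, 30], a tuple whose partitioning attribute is 25 goes to server 2, and anything above 30 goes to server 3.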
And actually, there are two kinds of selections: a selection with an equality predicate and a selection with a range predicate. So let's discuss in some detail how each of these selections needs to be implemented on a shared nothing parallel architecture. Imagine a round robin partition of the table. How do we evaluate the first one? Which servers are responsible for answering the query, and what do they need to do? So let me write here: one is round robin. Do all servers, or one particular server, need to answer the query? All servers. With round robin, we have no guarantee about where the tuples ended up, so all of them need to contribute. Each of them touches only a fragment of the data, so we still win over a non-parallel database, but we need to keep all of them busy for the selection. What about hash-based partitioning? What do we write for number two? Hash partition: only one server. Which one? We take V, we apply the hash function to V, and that's the one. So the server h(V) is the only one that needs to answer this query. Why does round robin then have to be all? Can't you figure out, in round robin's case, which server would have gotten the data? Can't you do V modulo the number of servers and find out where the data went? Well, in round robin, you have no guarantee on the order in which the tuples arrive, so you don't know where A equals V might be. And by the way, the attribute A might not be unique, so you might have multiple records with the value V, and with a round robin partitioning they can be anywhere; they can be on any server. Does this make sense? Maybe a picture would be useful here. So imagine this is our table, and the values V are what I'm going to write in the records. So it's 5, 9, 3, 9, 1, 2, 4, and so on. Round robin over three servers does this: it sends them to servers 1, 2, 3 in this order. Can you see this diagram? So now suppose we look for the attribute A equals 9. Which servers hold 9? The blue ones, 2 and 1. It just happened that these two servers have the value 9. So there is no guarantee about who holds 9. Now, I see what you have in mind there at Microsoft. You want to be clever, and you would like to order the table before applying the round robin. Is this what you had in mind? Yes, effectively. Yeah, if you are clever, then it's not called round robin; then it's called something else. Maybe it's called range partitioning. How would range partitioning work? Well, in range partitioning, we somehow sort the domain. The trouble is, we can't actually sort the domain, but we imagine the domain being sorted and we choose these partition points, like in a histogram; we need to choose the values that divide the domain into P intervals. If we have that, then how many servers do we need to answer this query, a selection on V? Just one, so just one server. OK, for a range query, everything remains unchanged, except that even for a hash partition we need to ask all the servers for the answer; we can't use a hash function for an interval. Yes? There are some additional overheads when we go to range partitioning. Yeah, so maybe we should discuss this a little bit. Let me see where this discussion comes up. We have this discussion here, but maybe we should discuss it without looking at the answers. Let's discuss the pros and cons of each partitioning; the little routing sketch below summarizes the cases we just went through.
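Here is that routing summary, reusing the helpers from the previous sketch (again just an illustration with made-up names, not anybody's real API):

def servers_for_equality(v, scheme, boundaries=None):
    # Which servers have to participate in the selection A = v.
    if scheme == "round_robin":
        return list(range(P))                  # no idea where v ended up: ask everyone
    if scheme == "hash":
        return [hash_server(v)]                # only one server can hold A = v
    if scheme == "range":
        return [range_server(v, boundaries)]   # only the range containing v

def servers_for_interval(lo, hi, scheme, boundaries=None):
    # Which servers have to participate in the selection lo <= A <= hi.
    if scheme in ("round_robin", "hash"):
        return list(range(P))                  # hashing destroys order: ask everyone
    return list(range(range_server(lo, boundaries),
                      range_server(hi, boundaries) + 1))   # only the overlapping ranges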
So the pros and cons would be: which of these partitioning methods gives better load balance, and which is better at answering queries? Let me write here load balance, and here queries, and I'm going to put a plus or a minus depending on whether the method is good or not so good at each. Yes? Are these hash and range assignments static, or are they adjustable? It's a good question. Let's make them static first, and then we'll discuss what it takes to make them adjustable. So round robin: does it ensure a good load balance? Yes, very good. What about queries? No, it doesn't help us with queries in any way. What about hash-based partitioning? Does it ensure a good load balance? I'll put a question mark. If you have a good hash function, then you shouldn't expect skew because of the hash function; if it's chosen correctly, it's good. But what can happen is that you have skew in your data. Maybe one value of that attribute is extremely popular; if you have a Zipf distribution, then the most popular value is extremely popular. So then, no matter how good your hash function is, you'll get skew. So actually, maybe a minus is better here, because hash functions, unless they are applied to keys, can't deal with skewed values. Yes? Well, what if you hash to a large number of buckets and you can move the hot bucket? Well, I think that doesn't address the problem, because it's difficult to get a handle on it. Here is what the problem is. The problem is not that we don't hash into enough buckets; the problem is that these records have the same value, so nothing will distinguish them. Let me take an example. Suppose you hash on the attribute category, and you have, say, 100 categories. But guess what, the first category, for some reason, is very popular: half of the records have the first category, and the other half have the remaining 99 categories. Then no matter how many buckets you hash into, half of the records will end up in the same bucket. What you can do is hash on multiple attributes to alleviate this, but then you lose the benefits for queries. Because the hash function is good for querying: for selections, we can go directly to the server that holds that value. What about range partitioning? How does it deal with load balance? Minus, minus. It's very difficult to find the partition values that ensure load balance unless we can sort the entire table, and we can't sort it. Even if you could, that's only at the beginning; during updates, the load balance will suffer. But it's very good for queries, actually even better than the hash function, because we can even support range queries here. Now, what about updates? What if I want to change my mind about the hash function? I had one hash function, it worked fine, but now I want to replace it with a better one in mid-flight. What happens in that case? You have to move all the existing data. It's an expensive operation. Plus, it's often the case that you have other tables that are also partitioned based on the same hash function, because you're using this during query processing, so now you also have to worry about what you do with the other tables. So it's expensive. Same thing with the bucket boundaries: if you want to adjust them, it's very expensive. Yes? Are you still on a shared memory architecture? No, I'm assuming shared nothing. Yeah, in shared memory we don't have this issue so much.
In a shared memory architecture, you also need to do partitioning, but there partitioning is easier, because all processors can see the same table and you just need to agree on which part of the table each processor handles; you don't have to do a physical partitioning. Well, I mean, it depends what you're doing, but if your data is given, then it's given. The data may not have very much shelf life. Sure, if it's transient data, then you have other options. But the typical data, the first kind of data that we worry about in databases, is the traditional inventory, payroll, banking data that is persistent: it has a long life, and you can't throw it away. That's the data you have to deal with. OK, so this was actually a pretty simple discussion; I think we went over the pros and cons of the various partitioning strategies. Any questions about these partitioning strategies? Because I want to start showing you distributed, parallel algorithms for computing some of the operators. They are not surprising, they are actually simple too, but we need to go over them. And I'd like to start with group by, which is the typical query that you need to run in a decision support environment. And this is the main environment in which massively parallel databases are used today. They are often not called parallel databases; they are called MapReduce or some kind of Hadoop, but there are still multiple servers that process a large collection of data. And as we will discuss later, these technologies now inch closer and closer to traditional database techniques. And what they need to do quite often is to group by something and then compute some aggregate for each group. So that's our goal. Imagine one table R; it's a huge table, it has gazillions of tuples. And what you want to do is group by the attribute A and compute some aggregate for each group, let's say the sum of some attribute B. That's a typical group by. So how can we do this on a shared nothing architecture? It's actually quite straightforward. We use a hash function; hash functions are winners on shared nothing architectures. If we are lucky and the table is already partitioned the way we want it, then it's fine. Otherwise, we need to go through step one and repartition the table according to the A attribute. So it's quite straightforward: every tuple t is sent to the chunk given by the hash function on its A attribute, h(t.A). Then every server sends its j-th chunk to server j, and every server collects the chunks coming from the other servers, all of which hash to it. So now it can do a local group by and compute the sums locally. Is it clear? Should we go over this again? It's a very simple algorithm, but let's go over it again, because I don't want to lose anyone at this point. The way to think about this is that initially the table is partitioned somehow over these P servers; every server holds a chunk. Maybe it was partitioned round robin, maybe it uses a hash-based partition, but unfortunately not on our A attribute; it used a different hash partition or round robin. So now you are server i. What do you do? You have this chunk, this fragment R sub i. What do we do with it during step one? Server i will go over the tuples in this chunk and do what? Apply the hash function and send each tuple to the corresponding server. Or maybe it will first collect them locally.
The way I described it on the previous slide is that it's going to collect them locally first. So it's going to construct chunks R_i1, R_i2, ..., R_iP. And then in step two, this is when the communication happens: this goes here, and that goes here, and this goes here. So now, if you look at what server i has received, because server i is going to receive from various other places, what does it receive? Well, it's going to receive something from R_1, which is R_1i, something from R_2, which is R_2i, and so on, up to something from R_P, which is R_Pi. And all these fragments have the property that their A attribute hashes to the same server i. So now all server i does is compute a group by here: it applies the group-by operator on this collection of tuples. Let's compute what we know how to do best, the I/O cost. We have here the picture, and I hope you can see the temporal sequence of these steps. The I/O cost is the number of blocks that each server needs to read and write. Oh, by the way, how big is this chunk? B(R) denotes the number of blocks of the entire table R, which is huge. But this chunk is not B(R); how big is it? B(R)/P. So this is B(R)/P, because we assume that the table was uniformly partitioned at the beginning. So step one, what does it need to do? It needs to read and send. So it's going to be B(R)/P to read, and then send. Now, depending on what happens at step two, we might have to write: if what arrives is too big to fit in memory, then we need to write it. How much do we need to write here at step two? How big is the data that arrives? You see, these analyses are always deceptively simple. The data that left had size B(R)/P. So the data that we now collect is how big? There are P servers, so we can receive from P minus 1 of them. Are each of those pieces B(R)/P, with P minus 1 of them? No. They only receive these small chunks, sub-chunks, chunks of chunks: they are actually B(R)/P squared each. This entire thing was B(R)/P, so each of these pieces is B(R)/P squared. But I don't think of it that way; there is a much simpler way to get to the same result. Look at the data that's being collected at step two. It's the same table, just redistributed over the same servers, and we still assume we are lucky, we should always be optimistic, so it's uniformly distributed. So how much does each server get? Again, B(R)/P. That's how much we get here. So I'm going to add B(R)/P for writing it, and then we need to read it, because we then need to do the group by, so plus B(R)/P to read it. That's our cost. Now, if we are lucky and it fits in main memory, then we don't need the step two and step three cost. That's the analysis, and I have it here. Let's go over it. At step one, we had to read B(R)/P. At step three, we had to write and then read again. So the total cost is 3 B(R)/P. What we didn't discuss is the communication cost. Let me use blue for the communication cost. This is not I/O cost; this is the total amount of data that is sent over the network. And that's pretty large, unfortunately, because essentially the entire table needs to go over the network. However, networks are increasingly faster than disks, and actually, at the speed at which they evolve, it's often cheaper to get data from another server's main memory than it is to get data from your own disk. So the bottom line for our analysis is that the communication cost is negligible.
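Before moving on from the group by, here is the whole two-phase algorithm as a toy, in-memory simulation in Python. This is my own sketch: the list-based shuffle stands in for the network, attribute names A and B follow the example, and there is no disk I/O at all, whereas the 3 B(R)/P accounting above assumes the received data is written and re-read.

from collections import defaultdict

def parallel_group_by_sum(chunks, P):
    # chunks[i] is server i's fragment of R; each tuple is a dict with keys 'A' and 'B'.
    # Computes, per server, a piece of: SELECT A, sum(B) FROM R GROUP BY A.

    # Step 1: every server scans its own chunk and buckets each tuple by h(t.A).
    outgoing = [[[] for _ in range(P)] for _ in range(P)]   # outgoing[i][j] plays the role of R_ij
    for i, chunk in enumerate(chunks):
        for t in chunk:
            outgoing[i][hash(t['A']) % P].append(t)

    # Step 2: the shuffle -- server j collects R_1j, R_2j, ..., R_Pj.
    received = [[t for i in range(P) for t in outgoing[i][j]] for j in range(P)]

    # Step 3: each server groups and sums locally; no further communication is needed,
    # because all tuples with the same A value now sit on the same server.
    results = []
    for j in range(P):
        sums = defaultdict(int)
        for t in received[j]:
            sums[t['A']] += t['B']
        results.append(dict(sums))
    return results

The parallel join discussed next follows exactly the same pattern, except that both tables are shuffled, R on its A attribute and S on its B attribute, and each server then joins what it receives locally.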
So the communication cost is small, and we will always ignore it; we focus on the disk. We like disks. Are you not including the initial I/O to get the database into memory and divvy it up among all those processors? Is that not part of this calculation? That's a good question: do we ignore the initial partitioning step? And the answer is yes, we do ignore it, because we assume that this is how the parallel database is stored; we assume the tables already live on these servers. Now, this is not true for MapReduce tasks. In that case, the data needs to be shipped initially; it needs to be sent to the servers. But those tasks are usually so massive that the initial distribution is a tiny fraction of the total cost, so you can ignore it. OK, so that was group by. Let's see our favorite operator: how do we do a parallel join? I took this from one of the textbooks, which assumes that when we join R and S, R is partitioned on some servers and S is partitioned on a different set of servers. That's not the only option; you can also assume that both R and S are partitioned across all the servers, and then each server holds a chunk of R and a chunk of S. But that's not what the book does: the book puts R on some servers and S on other servers. Good. So servers from 0 to k hold R, and servers from k plus 1 to P hold S. And now we want to join such that the A attribute of R is equal to the B attribute of S. How do we do this? Well, obviously, we need to hash R on A and S on B, send the tuples to the common servers based on this hash function, and then each server joins locally what it receives. Very simple. And this is what the slide says. Step one: every server that holds R partitions its chunk according to the hash function, let me use green for the hash function, on the A attribute. Here I have found an error. And the servers that hold S will partition on which attribute? On the B attribute. Here is the book. And then they send, they really redistribute the data according to this hash function. Now they switch roles; now they are receivers. So server u receives both R fragments and S fragments, and server u computes the join of all the fragments it receives. It's actually not just one R_u: it's R_1u union R_2u and so on, joined with S_1u, S_2u, and so on; this first union actually goes up to k, and the second up to P minus k, whatever. All the servers receive their fragments, and then they do the local join. Should we go through a picture? I want everyone to understand the parallel join down to the tiniest details. Should we go through a picture? Yep. OK, so let me use a new picture here. I don't like it when they are on different servers; I'm going to put them on the same servers. Let me draw the servers in black, the table R in blue, and S in red. OK. So this is the initial picture. And now, during step one, the first goal is to redistribute these fragments according to the hash function. Yes? What's k? Oh, this is the discrepancy between what's on the slides and what's here. In the book, the first servers, up to k, are the blue ones holding R, and the remaining ones are the red ones holding S. Yeah. In the book, and I think it's our textbook by Ramakrishnan, they use this alternative in which R is partitioned on one subset of the servers and S on a different subset. I actually don't know if there is any advantage to doing it one way or the other, but it's much harder to draw.
And you also need to worry now about how many servers you allocate for R and how many for S; you can shoot yourself in the foot if you allocate the wrong numbers. So I prefer to draw the picture showing that each server holds both R and S. Good. So now let's focus on the server in the middle. What does it do? Yeah, it's almost identical to before. It will apply the hash function to R's A attribute and to S's B attribute, and every tuple will be sent out: the R tuples will be sent to the other servers, and similarly the S tuples will be sent to the other servers. And likewise, it will collect tuples: it will collect S tuples and it will collect R tuples. That was step one, and the collecting would be step two, right? So let me move this higher up: this is step one, and this is step two. And then in step three, what happens? Everything that's here gets joined. And that's it; the result is now sitting here on the server. Yes? Does it end up such that each of those servers has the complete set of R and the complete set of S? No. You mean here, during step three? If that server in the middle is sending the R and S tuples to each of the other servers, and receiving R's and S's from all of the other servers, are all of the other servers doing the same thing? Are they also giving to all others and receiving from all others? Yes. And so none of them actually has a complete copy of R and S at the end? No. Here is the question to ask. Look at the sending server, and think about a blue tuple, a tuple in R. That tuple is sent to only one server. I drew many arrows here, but it goes along only one of these arrows, not all of them, just one. And to which one? It depends on the hash; the hash function tells you. You apply it to the A attribute of that record, and that tells you where to send it. Yes? The number of records that server i sends to the other servers need not be the same as the number of records it receives, right? Well, it need not be the same, but the expected values are the same, if the hash function does its job and distributes the data uniformly. Initially, or before coming to step one? I mean, I'm assuming both initially and during step one. So meaning all servers have an equal number of R records and an equal number of S records? Yes, they each have an equal number of R records, and separately also an equal number of S records. And the table R might be of a completely different size than table S. That's only at step one, right? Because once they start hashing things out among the other servers... During all steps; that's the assumption that we make about the hash function, that it will do its job correctly. And we have to keep in mind that it's a major assumption: the hash function might not do its job correctly, for various reasons, because there is data skew. But the worst possible case would be if all the values that get hashed are the same value, right? Statistically it could happen, and then they all end up on one server. Exactly. It's not impossible. Actually, it's one of the research projects that I'm doing with a student, to analyze this from a theoretical point of view. And then you need to hash on other attributes; you need to detect those cases. And actually, I have slides to show you what Pig does in that case, yes. Well, suppose they hash everything and it all shows up on a single server.
That means that, throughput-wise, you basically have all the other servers available for other work, right? Well, but that's not your goal. Your goal is that you paid the money for all these servers, and now you want to improve performance. Latency, but you're potentially improving throughput. You're absolutely correct: there are two different parameters that you might want to improve, throughput and latency. But think about your own situation. You are now customers of Amazon Web Services. And by the way, I need to remind you, you only have credit for $100, and the big problem is that if you forget to turn off the server, then Amazon will continue to charge you. It will exhaust the $100 and then get into your credit card. We once had a case where a student had a bill of about $500 at the end of this homework. So please remember to turn the server off. So that's the mindset: you don't care whether Amazon can use your servers for other customers, which it will do, that is exactly what they do; what you care about is your own query running faster. And this is why we really care about good load balance, because that is what ensures that the queries run faster. Let me skip the cost analysis, because it's exactly the same as before, and it's boring. It's right here. The cost here says four. Beats me why we got four here; it should be three. We read, we write when we receive, and then we read again and do the join, so it's only three. I don't know why I wrote four. Yes? Do the results ever get combined after that final join? Do they get sent back to a master node and combined into a final result table? That's what we leave out of the analysis; we don't assume we do that, because you don't know what happens next. Maybe the result is a single number at the end, and then that's cheap. Maybe you store it for future use. You don't know. So this should be three, not four. The rest is kind of straightforward. I mean, it's very difficult engineering, but there is nothing conceptually new. If you have a more complicated query that consists of multiple joins, group bys, selections, the same query plan idea applies in a parallel database; you just need to be aware that between operators there might be redistribution, these split and merge operators that redistribute the data the way the next operator needs it. So that's how complex queries are evaluated on parallel database systems. Good. And on that note, let me switch to the next topic and tell you what's hot today. So in 2004, people from Google published a paper that people who had not taken database classes found extremely exciting, because it holds the promise of massive parallel processing. What they did is come up with a simple abstraction, which actually they didn't invent, and they didn't claim to invent, but many people credit them for inventing it: the MapReduce abstraction. How many people know Lisp? Not that many. It's a fundamental language in computer science, invented in the 60s. It's no longer much used today, but it really had a major impact on computer science in the 60s, 70s, 80s. Later, it evolved into ML and F#; these are the children and grandchildren and great-grandchildren of Lisp, of functional programming languages. Two key functions in Lisp were map, which allows you to apply a function to each element of a list, and reduce, which allows you to combine the elements of a list into a single value.
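Just to make the Lisp connection concrete, here is the same pair of operations in Python, which inherited them from the functional languages; this is only a tiny illustration of map and reduce on an ordinary list, not of the distributed system.

from functools import reduce

words = ["romeo", "romeo", "wherefore"]

# map: apply a function to every element of a list
lengths = list(map(len, words))                  # [5, 5, 9]

# reduce: combine the elements of a list into a single value
total = reduce(lambda x, y: x + y, lengths, 0)   # 19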
These are exactly the abstractions that Google's MapReduce approach uses to do massive parallel data processing on hundreds and thousands and tens of thousands of servers. What they really excelled at was the engineering aspect, making this really work. The way you should think about MapReduce is as a pair of physical operators. You can use these physical operators, they are implemented brilliantly, they work on large collections of servers, and the idea is that you combine them. This is what Pig does for you: Pig combines them to compute more complicated queries. So let me describe the MapReduce model in a little more detail. It processes data. What kind of data? Files. MapReduce knows files, and it thinks of these files as bags of key-value pairs. So the input is a file, a bag of input key and value pairs, and the MapReduce job does something and outputs a file of output key and value pairs. In order to write a MapReduce program, you need to write two things: a map function and a reduce function. The map function that you write takes as input a single key-value pair; this is what you consume. And your function emits a set, or a bag, of intermediate key and value pairs. The idea is that the system will ensure that this map operation is done in parallel: many servers hold the chunks of the file, and the map function is applied by these servers in parallel on all the chunks. Now, at the end of the map phase, the system has these intermediate key and value pairs. What happens to them? They are redistributed, as we saw in traditional parallel databases, according to the intermediate key. And now comes your second function, which you also need to provide. It takes an intermediate key and a set of values; this is the result of the group by. The system will collect together all the values that belong to this intermediate key, you do whatever you want with them, and you emit a set of values, which is what the system collects as the output. So let me show you an example, the example from the original MapReduce paper. It goes as follows. Suppose you have a huge collection of documents. Every key-value pair consists of a document ID, that's the key, and the value is the text, a text of words. Your task is to count, for every word, the number of occurrences. Big problem, because this file is huge; it doesn't fit on one single machine, it's distributed across many machines, and the same word occurs everywhere. How do you count them? How do you make this communication happen using these two simple functions? Map receives the document key, which you're going to ignore because you don't care about it, and receives a value. This is what you take as input when you write a map function, and you write it in some programming language. What you do is, for every word, you emit one output. And here it gets interesting: the word becomes the key, and its value is one. This is the map function. So the MapReduce system will apply this map function in parallel on all the servers, and each server computes a set of words and ones. Now comes the communication phase, where equal words end up on the same server, because the communication is done based on this intermediate key.
And MapReduce also uses a hash function here, as we saw in the other examples. In the second phase, it applies your reduce function. Here is the reduce function. It takes a key, in this case a word, this word here, and it takes a set of values, a set of numbers, the ones. They could be numbers other than one, but for now imagine you get a bunch of ones. And what it does is simply add them: it adds these ones and emits a single value for this key, which is the sum. So let's see this in action. Here is how it works. The input consists of key-value pairs; in our example, the key is a document ID and the value is the document, a string of words, but in general it can be any value. This file is, of course, partitioned across many servers. During the map phase, the map function that you provided is applied to every key-value pair, and your map function computes a set of intermediate key and value pairs. During the second phase, these intermediate values are redistributed, such that everything contributed for a key i1 is sent to the same server, and the values are accumulated there; this is now a bag. And the reduce function that you wrote is applied to this bag of values. I think I made it sound more complicated than it is; it's actually a very simple abstraction. The thing that I'd like you to remember is this equality: the map phase is a group by. This is where you essentially group by something that you want; in our case it was by words, you want to group by words. The reduce phase is the aggregate. So the same task as before can be written as you see here. We have a relation consisting of the document key and the word; I'm assuming it's normalized, so the document is repeated once for every word. And what we do is group by word, that's the map phase, we do the grouping by word, and then we sum the ones, which is the same as count star, and that's the reduce phase. Yes? So this is similar to the parallel database algorithm that we saw on the slide. Where does step two occur here? Step one was each server sending data to other servers, and in step two the other servers were also receiving. Does that not happen here, because map only sends to the reduce function, and the reduce function does not send to other reduce functions? No, the analogy is complete. Here too, every reduce server receives data from multiple map servers. So it's exactly like we had here; I'm glad I didn't have to erase it. Imagine this is the map step, and this is the reduce step, when the data has been received and now you're doing your local computation. That's exactly the analogy. So MapReduce is exactly, let me be precise here: each of the physical operators that we saw can be implemented using MapReduce. The converse is not quite true, for subtle reasons; there are some subtle differences that I want to discuss. But first I need to tell you a little bit more about the internals of the MapReduce implementation that actually made it so successful. What you see on the next two slides is only engineering, and then we'll discuss a little bit how the abstraction differs from parallel database processing. So in the MapReduce implementation, there is always a master node that is in charge of synchronizing all the workers.
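Before the implementation details, here is the word-count example written out end to end, roughly in the shape the MapReduce paper describes. This is my own sketch run sequentially in ordinary Python, not the real API: the real system runs map in parallel on many servers, shuffles by the intermediate key using a hash function, and runs reduce per key.

from collections import defaultdict

def map_fn(doc_id, text):
    # map(key = document id, value = document text) -> bag of (word, 1) pairs
    return [(word, 1) for word in text.split()]

def reduce_fn(word, counts):
    # reduce(key = word, values = bag of counts) -> total occurrences of the word
    return sum(counts)

def word_count(documents):
    # 'Map phase': apply map_fn to every (doc_id, text) pair.
    intermediate = []
    for doc_id, text in documents.items():
        intermediate.extend(map_fn(doc_id, text))

    # 'Shuffle': collect all values by intermediate key -- this is the group by.
    groups = defaultdict(list)
    for word, one in intermediate:
        groups[word].append(one)

    # 'Reduce phase': apply reduce_fn to each key and its bag of values -- the aggregate.
    return {word: reduce_fn(word, ones) for word, ones in groups.items()}

# word_count({"d1": "romeo romeo wherefore art thou romeo"}) returns {"romeo": 3, ...}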
In terms of terminology, the input file is not partitioned per server; it's partitioned into finer partitions that are called splits. So if you have 10,000 servers, you might want to partition your data into one million splits, many more splits than servers. So these are the splits. Now you have a number of workers, the servers, and the master node assigns these splits to the workers. It's not going to shuffle the data around unnecessarily, but it keeps track, from the very beginning, of which splits are processed by which servers. Very important, and you'll see why in a second. Then the second complication they have is that for the reduce phase, they apply the same principle: instead of partitioning the data of the reduce phase per server, they partition it into a larger number of smaller chunks, called regions, and then they assign these regions to workers; these are the reduce tasks. In other words, the granularity of the partition is finer than the number of servers: you will have more partitions than servers. Good. So I have here a picture that I invite you to look at at home. I want to show you now the key thing that I think makes MapReduce scale to a large number of servers, the thing that didn't happen in parallel databases but happened at Google's scale. Once you cross this boundary of, say, 16 servers, like the massively parallel databases of the past, and you get to hundreds or thousands of servers, it is always the case that during an hour or two some servers will fail. Something will go wrong with them: they will either be much slower than the others or will fail outright; they will crash for some reason. This is what happens if you do the math on fault tolerance as you increase the number of servers: at some point you reach the threshold where failures will certainly happen, with high probability, during an hour or two. So the MapReduce system needs to deal with server failure. And the way it does it is essentially brute force, but they did it, and this is what made it work. The master monitors all the workers, and if a worker doesn't make progress or doesn't respond, then it reassigns, let me make sure I have the terminology right, the splits that belong to that worker; it reassigns them to other workers. And this is why it's best that the server that failed is not responsible for one indivisible chunk; instead, it's responsible for several splits. Now, once it fails, these splits can be distributed to other servers, and the remaining workers can process them much faster, because you're already late, as it were, so you want to catch up. Yes, I had a question waiting here. I was under the impression that Google was moving off MapReduce for some of its index builds, specifically because of this problem with stragglers, and I'm not really clear on whether this solves the problem or just reduces its impact. Well, I don't know the details at that level. What I know is that they also realized that the MapReduce abstraction is insufficient for data analysis, and now they have a newer system that had been a prototype for many years, but is now published and has an official name: it's called Dremel, like the tool, the high precision tool Dremel. And they support group bys, but not yet joins; they are still working on the join issue.
But I know for sure that the major challenge in Dremel is still the stragglers, these workers that fail or take forever. And by the way, the author of Dremel will give a talk in our database group; we invited him to give a talk here on Friday. If anyone is interested in hearing Sergey Melnik's talk on Dremel this Friday at 12:30, you can either come directly to the database lab in 4.05 or send me an email if you forget, and I'll tell you at what time and where to come. Should I send the announcement to the entire class? Would people be interested in getting the announcement? OK, I hope I won't forget. So that is going to be a quite interesting talk, and we are going to drill him on their lack of support for joins. Yes. OK, so this was the major achievement: they dealt well with stragglers. Here is the definition: a straggler is a machine that takes an unusually long time to execute its task, for a variety of reasons. And what MapReduce did early on, in their first paper, is that they looked for stragglers aggressively, and if a server looked like it was going to take forever, they aborted it and redistributed its work. And this is what made them scale to a large number of servers. Good. But the problem is, if you want a complicated query with three joins and two group bys, then raw MapReduce is not your solution. You need a higher level language, which eventually is going to be a query language, and people now move towards SQL (Dremel is SQL) to express such queries, and you need an optimizer to compile the query into a sequence of MapReduce tasks. And I think now is a good time to take a break. But I want to take a short break, only three minutes, because I still have lots of material to cover. It's fascinating, it's part of the final, and we want to cover it. And then I want to leave about 20 or 25 minutes at the end for the evaluation forms; I will ask you to fill out the evaluation forms. Good. So it's a three-minute break. OK. So let's start. In the remainder of this lecture, I want to give you the high level view of Pig Latin. I'm not going to discuss the details of the language. I do get questions sometimes about this homework. I must confess, this is the homework that I know least about. It was designed by my colleague, Magda Balazinska, and she did a wonderful job, but I never got my hands dirty on this particular homework, as I did on all the others. So questions, please send them to Jessica and copy me; she sees them first. Good. The slides I'm going to show you now were actually provided by Alan Gates, who is on the Pig team at Yahoo Research. He gave a talk in our database group last spring, and he was kind enough to allow me to use them in the lectures. And he had a wonderful spiel about Pig Latin that I want to share with you. So first of all, Pig is open source. It was one of the earliest high level declarative languages that exploited the MapReduce architecture. Now there are others available, but they are not open source, which is a problem. Pig is not very efficient; its compiler is not that great, and we will discuss this. But it's free, it's available, it's open source, and that's what we use. I'm not going to tell you again why we use MapReduce, but I have here some slides about the MapReduce architecture that essentially show the same query as the one before: given a collection of documents, count, for each word, the number of occurrences of that word.
And it's done exactly as we see here. There are a number of parallel map tasks followed by a number of parallel reduce tasks. For example, the first map task is applied to this document, "Romeo, Romeo, wherefore art thou Romeo", and the second is applied to this other document here. Then the first map task will produce these words, each with count one, and the other map task will produce these words with count one. And now the MapReduce infrastructure understands the keys, which are the words, and redistributes them; it sends them to the reduce tasks. And here is actually a wonderful picture: now you can see very clearly that this reduce task has a collection of values for the key Romeo. They were collected together by the MapReduce infrastructure, and then the reduce function provided by the user consolidates them into a single value; in this case, we just count them, we replace this one, one, one by three. There is an important point that I realize I didn't bring up. MapReduce, the way it is implemented, and please remember this, is that at the end of the map phase, the result is stored on disk. Forget pipelining. If you think pipelining is smart, MapReduce does not do it. It stores the result of the map phase on disk; then the reduce phase happens, and the result of the reduce phase is also stored on disk. But it's the result of the map phase that it is really critical to store on disk. Why? Why isn't the MapReduce implementation smarter? We know pipelining is great: if you could apply your map function and directly send the output to the next server, it would be so much more efficient. That's the reason, that is exactly the reason: because you're sending it to another reducer, and that reducer might fail, and then you would need to resend it to somebody else. So MapReduce is great for fault tolerance; it handles these failures very, very smoothly. It's not the most efficient implementation, you could implement it much more efficiently, but then if you have failures, you're in trouble. And actually, my colleague Magda Balazinska's research is in this space; she works on fault tolerance, and she's looking at intermediate architectures: how can you get the best of both worlds, how can you have fault tolerance but at the same time do some pipelining, which is so much more efficient than materializing the intermediate results? Is the data from the map step stored indefinitely on the disk, or only until the reduce step is done, and then it is deleted? So the question is whether the data is permanent or temporary. As far as I know, it's temporary; it will disappear at the end of the MapReduce job. So this is MapReduce. But then Alan Gates asks, what happens if you have a more complicated query? And this is the way he describes a more complicated query. This query, essentially, joins two tables. One is called users, and I'm going to use this example several times in the following slides. So let's go over it. It joins users and pages, because every page has a user; a page here is essentially a page access, and one of its attributes is the user who accessed that page. And users is a different table where we keep information about the users. So we join them by the name, because each of them has a name. And then we want to group by the URL; the URL comes from pages.
And then there is also a filter here: you only want users in a certain age range. And then we do an aggregate -- we count. And then we also want to store the result and return the top five. Complicated. But hey, MapReduce is powerful; you can do this. And here is a great slide. This is what it takes to write it in MapReduce: he says 170 lines of code, four hours to write. You write this in MapReduce, and you can solve that task. Or you can write nine lines of Pig Latin and solve exactly the same query. Now, I would add that you can write only three lines of SQL and solve exactly the same query. But they decided on this language, and this is a debate that I don't want to get into: what do programmers prefer? Do they prefer SQL, or do they prefer a language that is more operational? That is what Pig Latin is -- a more operational language -- and they claim that users prefer a more operational language. On the other hand, Dremel says that at Google their customers want SQL; they want the declarative language. Yes? I'm wondering if you could write it in Perl and fit it inside an SMS message. But you would still need to parallelize it.

OK, so I'm going to go over this query a little bit, because this is my only mini tutorial on Pig Latin, and you need to finish your homework in Pig Latin. Load: we don't have any load in SQL. Why is there no load in SQL? Because the data already has a schema. Load essentially says: read this file and process it according to this schema -- parse it and import the data according to the schema. We don't need this in a relational database; the data is already stored according to a schema. Filter: this is what we call select in relational algebra. So filter only those users -- and you see, users come from here -- filter only those users whose age is between 18 and 25. Now we move to pages. Pages we get from a different file, and their schema is user and URL. And now we join. We join the filtered users and the pages, and here we give the join attributes: the filtered users' name is equal to the pages' user. Then we group: we group by URL. And then for each group, we count something. We count -- I don't know why we count; I don't know what else we could count -- we count star. And finally, we can order, and there is also a limit. And then we can store the result in a file. So this is Pig Latin.

Now, a quick question: of all the languages -- not all the languages that you know, that would be too much, but the languages that we discussed in this course -- which is closest to Pig Latin? SQL, relational algebra, XQuery? Is this close to SQL? The main property of SQL is that it says what we want, not how we want it computed. In Pig Latin, do we write what we want, or do we write how we want to get it? We write how we want to get it. We say: you first filter, then you join, then you group by. This is not what; this is how. So Pig Latin is essentially relational algebra -- it's very close to relational algebra. It's also somewhat close to XQuery, although XQuery does have the ability to express declarative queries: you can write a join there, and then it is supposed to find the best join order. But not in Pig Latin. In Pig Latin, you are in charge of the optimization. He had a nice way to put it: he said that the optimizer in Pig Latin is between your eyes and your keyboard. That's where the optimizer is -- you have to optimize yourself. Good, so this is Pig Latin. Now, this is how they brag: Pig Latin is compiled into MapReduce.
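To put the whole walkthrough together, the script on the slide looks roughly like this -- I'm reconstructing it from memory, so the relation names, file names, and the exact age range are illustrative rather than exact -- and it is this whole pipeline that gets compiled into a sequence of MapReduce jobs:

```pig
-- Roughly the nine-line example from Alan Gates's slide (names illustrative)
Users = LOAD 'users' AS (name, age);
Fltrd = FILTER Users BY age >= 18 AND age <= 25;            -- the selection on age
Pages = LOAD 'pages' AS (user, url);
Jnd   = JOIN Fltrd BY name, Pages BY user;                  -- join on the user name
Grpd  = GROUP Jnd BY url;                                   -- group by URL
Smmd  = FOREACH Grpd GENERATE group, COUNT(Jnd) AS clicks;  -- count per group
Srtd  = ORDER Smmd BY clicks DESC;
Top5  = LIMIT Srtd 5;
STORE Top5 INTO 'top5sites';                                -- write the result to a file
```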
They reuse all that clever MapReduce infrastructure, with its fault tolerance, with its load balancing, with everything that comes with it. And they compile every Pig Latin query, or program, into a sequence of MapReduce tasks. When they started in 2008, they were almost 10 times slower than a manually written sequence of MapReduce tasks. And by the time he gave the talk -- actually even earlier -- they claimed that they are as fast as your best MapReduce implementation. That's their claim to fame.

OK, so let me skip over some slides now. I want to show you the three types of joins that they support in Pig Latin. And often they don't make the decision between these types of joins automatically; you have to give them a hint. The first join that I'm going to show you we haven't discussed in the context of parallel databases, but it's an important join. I would call it a broadcast join; they call it a replicated join. Same thing. And here it is. You want to join pages with users, but they are asymmetric. The pages table is huge, and you have to partition it, because it's too big. But the users table is small: every server could hold a copy of the users table. In the replicated join, that's exactly what you do. You don't worry about distributing the users; you only worry about distributing the pages. So it goes like this. You load both of them, and then -- look, this join is a physical operator. This is not a declarative language; you have to tell Pig that you want a replicated join. So what happens? The pages are partitioned according to some hash function -- actually, they can be partitioned according to anything. But the users -- and this is the important bit -- the users are replicated. Every server receives a copy of all the users, because the assumption is that users is small and can be replicated across all servers. And keep in mind, you make this decision: you say here if you want this join to be a replicated join. Yes? So here we gave a hint that it has to be a replicated join, because we knew in advance that the pages table is much, much bigger than the users table. But what if it is dynamic -- pages can grow and shrink, and users also grow and shrink? Then we will not be in a position to know that we can use a replicated join, and we might spend much more effort replicating users to many map tasks. Yeah, it's true. Unfortunately, in Pig Latin the philosophy is that you are in charge. You are in charge of deciding what the most efficient physical plan is. They are slowly moving towards a full-fledged query optimizer, but this is where they started: from a language that lets you express physical plans.

The second join that they support is a hash join. I'm going to go much quicker here, because it's exactly the hash join that we discussed, and it is the default: if you just say join, then it's a hash join. Both tables are hashed on the join attribute, and the map task does the partitioning. But you see here a one and a two: one means the tuple comes from the users table, and two means it comes from the pages table, so it tells the receiver where each tuple is coming from. So these are users, and these are pages. OK, and then there is a really interesting join, this one: the skewed join. This is the one that you need if you have data skew.
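Before we look at the skewed join, here is roughly how the first two show up in the Pig syntax, using the relation names from our running example; treat the exact spelling as approximate.

```pig
-- Default join: a parallel hash join; both inputs are hash-partitioned on the join key
Jnd = JOIN Fltrd BY name, Pages BY user;

-- Replicated ("broadcast") join: you give the hint yourself; the relation listed last
-- is the one assumed small enough to be copied in full to every server
Jnd = JOIN Pages BY user, Fltrd BY name USING 'replicated';
```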
So let me see what kind of data skew they have -- I need to look ahead and see in which direction the skew goes. I think it's the pages that are skewed. So what happens here is that in users, the name is like a key. So if your hash function is good, then when you distribute the users according to a good hash function, they will be uniformly distributed, and that's good. But the pages have an unfortunate property: the same user might visit many more pages than other users. Now, all the pages visited by that user have to be sent to the same server. So if you have skewed data, in which one user or a small number of users visit a lot more pages than the others, then even that single skewed user might make one chunk of the data much bigger than the others. In that case, they use the skewed join. You have to tell the system to use a skewed join. And what it does is detect the outliers -- the skewed elements -- and then the tuples with those keys are handled specially. So let me see if this is visible. Yeah, it is visible. So they find out that Fred is a very popular user. Imagine, at an extreme, that half of the pages are visited by Fred, and the others are distributed among the other millions and millions of users. Fred is maybe, I don't know, a spammer, whatever. So no matter how smart your hash function is, all of Fred's pages will end up being sent to the same reducer if you're not careful. So what the skewed join does is, first, it detects the outliers: it finds out that Fred is going to cause problems. And it spreads his tuples separately. All the tuples that are not Fred's are distributed normally, according to the hash function. But all the tuples that belong to Fred are distributed according to something else, another hash function. And then the user tuple for Fred needs to be broadcast. So pages, which is labeled one here: the entries coming from Fred's pages are going to be partitioned according to something else -- could be round robin, could be something else. And the user Fred, which is labeled two, needs to be broadcast to all of those reducers. It's actually quite subtle; I hope you understood this.

Remember the broadcast join? Which one was it -- I'm having difficulties navigating here. In the broadcast join, users were replicated to all servers. So in particular, if all the pages are visited by the same user, Fred -- he is the only user active in this database -- will you suffer? Will you get skew in this case? No. But you have to be careful to partition the pages in a way that is safe against skew; maybe you partition them round robin. It doesn't matter how you partition the pages, because the users are broadcast anyway. So now fast forward to the skewed join. In the skewed join, users is big enough that you do want to partition it, but Fred is your only outlier -- the only user that occurs in way too many pages. So then you treat the tuples differently. All the tuples that are not Fred's, you handle like in a normal hash join: both the pages and the users are hashed according to the user name. Tuples whose user name is Fred are handled differently in pages and in users. What happens to the Fred tuples in pages -- how do we partition them? Round robin, for example. What happens to the Fred tuples in users -- how do we partition them? Broadcast, exactly. Those we broadcast to everyone. So they will meet.
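In Pig, the hint for this one looks roughly like the following; the detection of the heavy keys and the special routing all happen inside Pig (we'll see in a moment how it finds them), so there is nothing more for you to write. Again, names are from our running example and the exact spelling is approximate.

```pig
-- Skewed join: you give the hint; Pig samples the skewed input (here, Pages)
-- to find heavy keys like Fred, spreads those Pages tuples over several reducers,
-- and replicates the matching Users tuples to those same reducers; all other keys
-- are handled like a normal hash join
Jnd = JOIN Pages BY user, Users BY name USING 'skewed';
```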
The Fred tuples in pages end up anywhere -- every tuple ends up at some server. The Fred tuples in users are broadcast, so they will definitely meet. That's the point. It's actually quite clever. Question: how do we know? How do we learn about Fred? The user only told us to use a skewed join; they didn't tell us about Fred. How do we find these outliers? Maybe there is one, maybe ten -- there can't be too many, because you can't have too many values that are each repeated too many times. How do we learn about these outliers? Sampling. Every server samples some number of tuples from its own pages, and you centralize these samples. If you find a value that is repeated often enough, that is what you suspect is an outlier -- that value you suspect is skewed. Good. So this is the skewed join. Pig also supports a merge join, but they say you can only do this if the relations happen to be sorted already; it doesn't pay off to sort first, because sorting in parallel is more difficult. So I'm going to skip this one, because I have nothing interesting to show you here.

Good. So these are all the slides that I wanted to show about Pig. One more thing about the slides: if you already downloaded them from the web, please download them again, because I made some last-minute changes that I didn't have a chance to ask Jessica to post. So please re-download the slides. And if you need a mini tutorial, starting at slide 82 there is a mini tutorial on Pig that works through the syntax. But I suspect that you are already past that point; you probably know more than I included on these slides.

What I want to show you now is actually a topic that's quite difficult, and I'm planning to spend about half an hour on it; I hope I can cover it in that half hour. And it is both very relevant and increasingly relevant. It's called Bloom filters. This is a technique that may dramatically reduce the number of communication steps -- the amount of data that you need to send -- in parallel processing. Bloom filters were actually invented in the 70s. They have been a well-known, great technique, but people didn't teach them much. I think they are now becoming extremely important because of this massive parallel query processing. This slide just tells you where I got the material from; it's not in the textbook. And I'm going to motivate it using this query from the Pig Latin slides, in which we join users and pages. Here is the join. And the thing is that pages is a big file, a huge file distributed over many servers. Users is also big, but once you do the selection, very few users participate in the join. So here is the problem: why redistribute all the pages, if only a very small number of them will ever join with a user satisfying this condition? Bloom filters allow you to figure out a superset of the pages that need to be redistributed. They will give you some false positives, but no false negatives, so you will always be on the safe side. The Pig compilation of that query will send all the pages, but we want to reduce that. So the technique that allows us to reduce the amount of communication is based on a data structure I'll call a hash map. It's a very simple data structure: an array of bits. So we have m bits. It's like a hash table, but even simpler, because we only store 0 and 1 -- we really use one bit per entry.
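Just to fix notation before we use it, here is the basic structure written out (my notation, purely for reference):

$$B[1..m] \in \{0,1\}^m, \qquad h:\ \text{keys} \to \{1,\dots,m\}$$
$$\textit{insert}(x):\ B[h(x)] := 1 \qquad\qquad \textit{lookup}(y):\ \text{report "maybe present" iff } B[h(y)] = 1$$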
So now, if these are, for example, all the user names, then what we do is apply the hash function to all the user names -- we can do this in parallel, and we can already filter the users by their age -- and we put a 1 in this hash map everywhere a user that passes the filter hashes to. So now, how can we use this? Let me see whether I have the slide where we use it. Yeah -- how can you use it? Well, initially we insert into the hash map -- we set to 1 -- the bits of all the users that are in this age group. Every server does this for its own users. Then these hash maps are sent to a central server, where we apply a big OR and find all the entries where at least one user hashed. And now, in the second step, we send this hash map back -- we broadcast it to all the servers. And now the pages, before they are redistributed, can be checked against this hash map. And if a page's user is not in this hash map, then we don't have to send that page at all. Is it clear how we use the hash map? Good.

So this is not the Bloom filter yet. The Bloom filter improves on this, but first we need to do the analysis for the basic hash map. And the basic question we ask is: how many false positives will we have? We have this large collection of pages, and some of the entries in the hash map are 1. The question is: how many pages will we think are in the collection even though they are not there -- there is just a collision? That's what we want to analyze. OK, so the first question I'm going to ask you is the following. The size of the hash table is m, so the bits go from 1 to m. And suppose I have a single user -- a single user in that age bracket -- and I insert that single user; it lands somewhere in this hash table. Now I have a huge collection of pages, and I check one of these random pages to see if it is in the hash map. That page maps to some bit j. My question is: what is the probability that this bit j is 0 -- the probability that I successfully reject that page? Well, it depends where that single 1 landed. If it landed exactly on my j, then it's bad; I'm not lucky. But if it landed somewhere else, then I'm in luck. So what's the probability that a given bit j is 0? How many choices do you have for placing the 1 and still declaring victory? m minus 1. So the probability is (m minus 1) over m, right -- or 1 minus 1/m. In other words, if you have, say, 1 million pages, then this fraction of the pages will be rejected by this filter, and only 1/m of them will be sent to the next phase -- and your false positives are among that 1/m. OK, but that assumed a single inserted element, right? So here is my analysis: the probability is 1 minus 1/m. But we have more users that are inserted in the hash table -- we have n users inserted in this dictionary. Now my question is: what is the probability that my bit j is still 0? What should I do to this formula? Well, you insert the first element, x1, and after you insert it, the probability that your bit is still 0 is this. Now, if we insert the second element, x2, that's an independent choice, right? It also wants to go somewhere in this hash map. So what is the probability that the bit j remains 0 after all these n insertions? This to the power n, right? Of course, as n increases, this probability decreases, and that's bad -- the table will start filling up with ones.
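To have it in one place, the probability we just derived is:

$$\Pr\big[\,B[j] = 0 \text{ after one insertion}\,\big] \;=\; \frac{m-1}{m} \;=\; 1 - \frac{1}{m}$$
$$\Pr\big[\,B[j] = 0 \text{ after } n \text{ independent insertions}\,\big] \;=\; \left(1 - \frac{1}{m}\right)^{n}$$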
And at some point, most of my pages will be selected, and I will think that they might have a matching user. OK, so this is this value, and it can be approximated. Now I'm looking at the complement: what is the probability that you get a hit -- that a page you test looks like it is in this dictionary even though it isn't? And the answer is 1 minus this. This is exactly the rate of false positives: if you take a page that shouldn't be included, then with this probability you will include it anyway. And you probably remember that 1 minus x, when x is small, behaves like e to the minus x. Therefore (1 minus 1/m) to the power n is like e to the minus n/m, and the rate of false positives is like 1 minus e to the minus n/m. When n goes to infinity -- when we keep increasing the number of elements we insert -- what happens to this value? When n goes to infinity, f goes to 1, which is exactly what's bad: the false positive rate increases to 1, and I end up selecting all the pages. Good. As a reference point, the typical setting is to take the hash map to be as large, in bytes, as our set has elements -- so m is 8 times bigger than n -- and in that case the rate of false positives is about 11%. So if you have 1 trillion pages, then roughly 10% of them will come back as false positives -- well, including the true positives, but most of them are probably false positives.

Good. So that is the analysis. Now enter Bloom filters, which improve this with some magic. And here is the magic: instead of having a single hash function, they use k hash functions. So far, so good -- of course, if we use more hash functions, we can avoid some of the false positives. But what's mind-boggling is that they put these hash functions in the same bit array. So it looks like this. For the first element, x1, you apply the k hash functions, and we set to 1 all the bits where those hash functions land. These k hash functions can also collide -- that's OK; we just set the same bit to 1 twice. x2 will activate a different set of bits. And we only ever set bits to 1; we never set them back to 0. It's a one-way journey. So now we want to test: is y1 in the hash map? How do we test this? We compute the k hash functions, and then we look at those positions. And you should tell me: what should the answer be for y1 versus y2? What should the answer be for y1 -- is it in the hash map or not? It's not, because if y1 had been hashed into this hash map -- into this Bloom filter, I should say -- then all these positions would be 1. All the green positions should be 1. But we found some positions that are 0, so it can't be there. But if you look at the red one, y2, there we don't know. We have to answer: yeah, it might be in the hash map. So again, we might have some false positives. Let's analyze the rate of false positives now. We expect the analysis to be more difficult, because of this trade-off: if we choose a large number of hash functions, then in some sense it's good, because the probability that all k positions of an element happen to be 1 is very low -- you decrease the probability of a false positive because you test more positions per element. But the problem is that you're forcing all these k hash functions to go into the same bit array.
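Written out, what we have so far: for the single-hash map, the false positive rate is

$$f \;=\; 1 - \left(1 - \tfrac{1}{m}\right)^{n} \;\approx\; 1 - e^{-n/m} \qquad (\approx 0.117 \text{ when } m = 8n),$$

and the Bloom filter operations we just described are

$$\textit{insert}(x):\ B[h_1(x)] := 1,\ \dots,\ B[h_k(x)] := 1$$
$$\textit{lookup}(y):\ \text{report "maybe present" iff } B[h_1(y)] = 1 \;\wedge\; \dots \;\wedge\; B[h_k(y)] = 1$$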
So if you use too many of them, you'll get lots of ones, and then you're in trouble. These are the two competing forces. So let me show you the analysis and how to interpret it. We do exactly the same analysis as before. We now have k hash functions, and the question is: if we insert only a single element, it is going to be sent to k positions -- what is the probability that a fixed bit j remains 0? What's the probability that we don't get a false positive for just one single page? I asked the question without thinking, so I actually don't know the answer off-hand. So what did you say -- (m minus k) over m? That assumes that these k hash positions are distinct, and they are not necessarily distinct; they might overlap. So what should the answer be? We had this for a single hash function; for k of them it's that to the power k, no? To the k -- yeah, that should be the answer, because each of these hash functions has an independent opportunity to hit our bit j. And this is if we only insert a single element into the Bloom filter. Now let's insert n of them. What's the probability that the bit still remains 0? Just continue in the same way: it is (1 minus 1/m) to the power kn, right? I hope that's correct -- yeah, it's right. And in fact, if you set k equal to 1, then we get exactly what we had previously.

Good. So now we want to compute the rate of false positives. And now it gets interesting, because when we check for membership, we don't check a single bit -- we check k bits. So k occurs a second time. These are the two competing forces. And here is the formula; let me not try to reinvent it, we'll just read it. This was the probability that one fixed bit j is 1. But we check k bits, right? So this is raised to the power k, because to get a false positive, all of them must be 1 -- and only then are we unsure and say that maybe this element is in the set. OK, so this is the interesting dependence on k. You see, if we increase k, then this inner number goes to 0, and that's not good: the false positive rate goes towards 1. On the other hand, if you ignore that and just look at this quantity, which is less than 1: when k increases, then because of this exponent -- you're exponentiating a number which is less than 1 -- it decreases; it goes towards 0. So one force pushes it towards 1, the other wants to push it towards 0, and we need to do an analysis and see where the optimum is. The analysis -- which I actually cut and pasted from a paper -- is done by taking the derivative of that expression in k to see where it is optimized. And it turns out that k has to be ln 2 times the ratio m over n, which tells you how many more bits you are using than elements you are inserting -- how many more bits than users in that age bracket -- times ln 2. So actually, I tried this evening to reconstruct the argument. It is possible to derive it, but I couldn't reconstruct it. So if you ask me right now why this is the optimum, I can't answer. But I'm sure that if you spend some time with this particular expression, you will find there is a simple way to derive that k: it's ln 2 times m over n. OK, so now let's put things together and see the analysis, because this is where Bloom filters start to look very interesting -- they have a very interesting structure.
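Putting the formulas we just read off the slide in one place:

$$\Pr\big[\,B[j]=0 \text{ after inserting } n \text{ elements with } k \text{ hash functions}\,\big] \;=\; \left(1 - \tfrac{1}{m}\right)^{kn} \;\approx\; e^{-kn/m}$$
$$f \;=\; \left(1 - \left(1 - \tfrac{1}{m}\right)^{kn}\right)^{k} \;\approx\; \left(1 - e^{-kn/m}\right)^{k}$$
$$k_{\text{opt}} \;=\; \ln 2 \cdot \frac{m}{n} \quad\Longrightarrow\quad f \;\approx\; \left(\tfrac{1}{2}\right)^{k_{\text{opt}}} \;=\; \left(2^{-\ln 2}\right)^{m/n} \;\approx\; 0.6185^{\,m/n}$$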
So again, n is the number of elements you want to insert, and m is the number of bits you reserve in your hash table. k is up to you -- you should choose it to be ln 2 times the increase factor, that is, how many more bits you're using than elements. Once you choose that -- I'll skip the derivation -- it turns out that on average half of the bits will be 1 and half will be 0. That's what that magic k does: it's chosen so that about half of your table is 1 and half is 0. And the rate of false positives is then 1/2 to the power ln 2, raised to that ratio -- the increase factor, which I'll denote c on the next slide: how many more bits you're using than elements in your set. And this behaves like this: 1/2 to the power ln 2 is about 0.62, and you take that to the power m/n. Now let's compare this to the naive hash map. Suppose we use eight times more bits than elements in the set. Then the rate of false positives for the Bloom filter is 0.62 to the power 8. Compare this to the hash map, where the rate of false positives is 1 minus e to the minus 1/8. And the point is that the Bloom filter is much better, and it improves much faster when we increase m: when we increase m, the reward we get from the Bloom filter is much higher than the reward we get from the naive hash map. And that is perhaps the most important property of the Bloom filter.

Now I want you to go with me through a simple exercise. I want to revisit the example from before. Let's suppose we have 1 billion users, and the number of pages is 1 trillion. And I want to ask you what these parameters are. How big is n? It's really tricky -- I got confused every time I prepared these slides. Who is n? In our concrete example, remember the query: we filtered the users by age, to join them with the pages. So n is who? Is it 1 billion, or is it 1 trillion? It's none of the above. It is the number of users that satisfy the filter, which is much smaller than 1 billion. Let me say, just for the purpose of the discussion, that n is 1,000. My point is that it's much smaller than your table; it's what comes back from the filter. So how big should your hash map be, if you want as many bytes as n has elements -- if you want m to be 8 times bigger? How big is m? Well, 8,000 bits, which is tiny -- very small. Because we are actually fishing here for a small number of users that satisfy the filter, and we don't know which pages match. So we give this container of 8,000 bits to every server. Each server sets the bits corresponding to the users in its own collection that happen to be in that age bracket. We centralize the bitmaps, we compute the OR, and we re-broadcast this table. And now the pages are probed against this table. How many pages do we get after the probing? We had 1 tera -- 10 to the power 12 -- pages, and now we probe them. How many do we expect to get? Of course, you'll get all the right pages, but I'm not concerned about those; I'm much more concerned about the false positives, the pages that we shouldn't get and that we get anyway. How many will we get? Exactly: 2% of that number. Yeah, it's about 2%. Well, it's not that great: 2 times 10 to the power 10, which is like 20 billion pages. That is the number of false positives.
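Just to check the arithmetic on that exercise, using the formulas from before:

$$n = 10^{3},\qquad m = 8n = 8{,}000 \text{ bits},\qquad k \approx \ln 2 \cdot 8 \approx 5.5 \;\;(\text{so 5 or 6 hash functions})$$
$$f \;\approx\; 0.6185^{\,8} \;\approx\; 0.02, \qquad \text{false positives} \;\approx\; 0.02 \times 10^{12} \;=\; 2\times 10^{10} \text{ pages}$$
$$\text{versus the naive one-hash map: } f \;\approx\; 1 - e^{-1/8} \;\approx\; 0.117 \;\Rightarrow\; \sim 1.2\times 10^{11} \text{ false positives}$$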
A tiny, tiny fraction of them are actually true positives, but they are lost in the noise; the vast majority are false positives. This gives you a little bit of a sense of these numbers. And of course, you can afford to increase m -- you can afford to make it, say, 100,000 -- and then the analysis becomes much more favorable: you reduce the number of false positives dramatically. Any questions about Bloom filters? Because I would like to stop the lecture here and ask you to participate in the survey. And we have all the infrastructure now in place: at Microsoft, Huma Qureshi -- is Huma there? Yeah. Hello? Hi, now I see you. So Huma will distribute the evaluation forms and then collect them, and she knows what to do with them. And here, Tom will distribute the evaluation forms; he also has some pencils. There are two sheets: a white sheet that's going to be read by a machine, and a yellow sheet that arrives in my mailbox in January, so I can read your comments on the yellow sheets. So thank you for participating in the survey. And I guess we'll communicate by email for the final, and I'll see you next week in class to discuss fun topics. Yes? One more question: so where are Bloom filters used? They're used every time you need to do this kind of filtering. But they're not used automatically in database systems, because the decision to use them essentially has to be made by hand. But there is great, great potential for using them in MapReduce scenarios.