Hello everyone, I'm Joy Arulraj. I'm a final-year PhD candidate working with Andy on the Peloton project, and I'm filling in for Andy today. Today I'll be presenting a few frequently used sorting and join algorithms in database systems. Feel free to interrupt me at any point during the lecture; I would prefer to make it more interactive, if you will.

Before we get started, a couple of logistical announcements. Homework number three is due today, and homework number four has been released and will be due on Wednesday next week.

Today we'll be continuing our discussion of how the database system executes queries, and we will focus on a couple of relational operators: the sorting and join operators. These are frequently used in queries, and we'll look at how the database system actually executes them. So today's agenda is sorting and join algorithms, and I will start with the sorting algorithms.

First of all, why do we need to sort? Why exactly do we need to sort the tuples? The relational model that we covered in the earlier lectures does not specify an order for the tuples. However, users often want to retrieve tuples from a table in a specific order. It is also trivial to eliminate duplicate tuples by making use of the sorted order, so you can use sorting to execute queries with a DISTINCT clause. And lastly, you can bulk load a batch of sorted tuples into a B plus tree much faster than doing repeated inserts; there's a clever algorithm for bulk loading tuples that are in sorted order. So those are some reasons why sorting is actually used in practice. There is also the join operation, which I will cover in the later part of this lecture, and which can be simplified significantly if the input tables are in sorted order. So that's the general motivation for sorting in database systems, and it has to do with the relational model explicitly specifying that the tuples need not be in any order.

So there are two cases. What if the data actually fits in main memory, the entire table? This is the simple case: you can just employ a standard in-memory sorting algorithm like quicksort. The more interesting and challenging scenario is when the data doesn't entirely fit in main memory. We refer to this problem as external sorting, and in this lecture I'll be presenting a commonly used external sorting algorithm called external merge sort.

The high-level idea of external merge sort is very simple: it employs a hybrid sort-and-merge strategy. During the sort phase, you sort small chunks of data after bringing them into main memory, and then you write the sorted data back to a file on disk. The merge phase then recursively combines smaller sorted subfiles into a larger file. So it's a two-phase approach.

Let's start with a simple example to illustrate the algorithm: the two-way external merge sort. We will later generalize this design to a K-way external merge sort algorithm. To give some context, let's assume that files are broken up into N pages, and that our database system has B fixed-size buffers for processing the operation. That's the problem setup. So in a two-way external merge sort, in the zeroth pass, the preprocessing step, you first read in every B pages of the table.
You then sort them and write them back to disk. Each sorted collection of pages is referred to as a run. In the subsequent passes of the sort-merge algorithm, we recursively merge pairs of runs into runs that are twice as long. This is illustrated here: we sort in memory, we create runs, and then we merge them back into larger runs. You can see that at each step of this algorithm we make use of three buffer pages: two for scanning the input runs and one for the output.

So this picture illustrates the two-way external merge sort algorithm. In the zeroth pass, we sort the contents of each page. In this example, each page has two entries; you sort them, and that's referred to as a one-page run. In the first pass, you pick two one-page runs and use a merge algorithm to merge them into a larger two-page run. In the second pass, you merge a couple of two-page runs into a larger four-page run. You repeat this process until you end up with a single sorted run that comprises all the tuples in the table.

You can see that in each pass, we read and write each page in the file once. Because we shrink the number of runs by a factor of two in each pass, the number of merge passes is ceil(log2 N), and then you still have the preprocessing pass, so the total number of passes is 1 + ceil(log2 N). And because in each pass you read and write each page once, the total IO cost is 2N times the number of passes. That's the total IO cost of the operation. So that's the high-level idea of the two-way external merge sort. Any questions? Yes. That's a good question. You can stream in the input runs and write out the output run in a streaming fashion, but at any given point in time you only need three active buffers. So that's a good question, yes. Exactly, exactly, you only need three buffers at any point in time, so in this case B equals three. Yes, that is correct: the number of buffers is three, and because of that, you shrink the number of runs by a factor of two in each step.

We will next generalize this design to make use of all the buffers we have in memory. The high-level design of this algorithm is a divide-and-conquer strategy: you recursively sort the subfiles and merge them back, and you repeat this process until you converge. The two-way algorithm only requires three buffer pages, as Aaron just mentioned. But if we generalize this to a scenario where we have more than three buffers, B greater than three, how can we actually make use of these extra buffers? Can we use them to reduce the number of disk accesses? Because we want to reduce the number of IOs in the algorithm. That's the goal of the general external merge sort algorithm. In this case, in the zeroth pass, we make use of all B buffers: we produce a set of sorted runs of B pages each, so since the number of pages in the table is N, you will have ceil(N/B) sorted runs. You have bigger runs to start with, compared to when B is three. And in the later passes, we use a K-way merge algorithm, which generalizes the two-way, binary, merge algorithm; a sketch of the whole procedure follows below.
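To make the two phases concrete, here is a minimal Python sketch of the general algorithm. It is my own illustration, not any system's actual code, and it makes a simplifying assumption: "disk pages" are just lists of keys, and runs live in memory, whereas a real system would stream runs through the buffer pool.

```python
import heapq

def external_merge_sort(pages, B):
    """Sketch of external merge sort: 'pages' is a list of disk pages, each a
    list of keys; B is the number of available buffer pages. Disk runs are
    simulated with in-memory lists here."""
    # Pass 0: read B pages at a time, sort them in memory, write out a run.
    runs = []
    for i in range(0, len(pages), B):
        chunk = [key for page in pages[i:i + B] for key in page]
        runs.append(sorted(chunk))               # one sorted run of <= B pages

    # Later passes: merge groups of B-1 runs (one buffer reserved for output)
    # until a single sorted run remains.
    while len(runs) > 1:
        runs = [list(heapq.merge(*runs[i:i + B - 1]))   # K-way merge via heap
                for i in range(0, len(runs), B - 1)]
    return runs[0] if runs else []

print(external_merge_sort([[3, 1], [7, 2], [5, 4]], B=3))  # [1, 2, 3, 4, 5, 7]
```

Note that `heapq.merge` is doing the K-way merge work here: it keeps a small heap over the heads of the K runs and repeatedly pops the minimum, which is exactly the subroutine discussed next.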
So instead of merging just two runs, you merge K runs simultaneously. Because of that, you can merge B minus 1 runs at a time, so the shrinking factor is higher here: instead of two, it's B minus 1. It's a generalization of the two-way algorithm: we make use of all the buffers, and we use a K-way merge algorithm. You can see that the number of runs, instead of dropping by a factor of two per pass, drops by a factor of B minus 1, so the log base is B minus 1 instead of 2. In the previous case B was three buffers, and three minus one is two; that's why it was log base two. So this generalizes the design. And again, the total IO cost is 2N times the number of passes. Any questions about the general external merge sort algorithm? Cool, let's move forward.

I mentioned briefly that we make use of a K-way merge subroutine in the general external merge sort algorithm. What exactly does this subroutine do? It merges a set of K runs. The input is K sorted subarrays, and we want to merge them all into a single sorted array. So what do we need to do? We need to efficiently compute the minimum element across all K subarrays, so that at any point in time we can find the minimum and append it to the output array. If you keep doing this, you end up with a sorted array. The idea is to internally maintain a heap, or a tournament tree, to efficiently compute the minimum element. The time complexity of this subroutine is O(N log2 K). So this is the subroutine used inside the higher-level external merge sort algorithm.

Let me illustrate this with an example. Say we want to sort a file that contains 108 pages, and we have five buffers. So N is 108 and B is 5, and we want to make use of all five buffers. In the zeroth pass, we first construct a set of sorted runs where each run contains five pages: every five pages you read in, you sort them and write them back. So we construct ceil(108/5) = 22 sorted runs, each containing five pages (ignore the last, shorter run). In the next pass, we merge these sorted runs, and because we merge B minus 1 of them at a time, the number of runs decreases by a factor of B minus 1. Keep doing this and you end up with a sorted file that has all 108 pages in order. If you plug the number of pages and the number of buffers into the formula we constructed for the number of passes, you end up with four passes; a quick sanity check of that arithmetic is sketched below. So that's the design of the external merge sort algorithm, and database systems typically use it for sorting tables that don't entirely fit in main memory.
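As a quick check of the pass-count formula on this example (a throwaway helper of my own, not DBMS code):

```python
import math

def num_passes(N, B):
    runs = math.ceil(N / B)                        # pass 0 builds ceil(N/B) runs
    return 1 + math.ceil(math.log(runs, B - 1))    # then the merge fan-in is B-1

print(num_passes(108, 5))             # 1 + ceil(log4(22)) = 1 + 3 = 4 passes
print(2 * 108 * num_passes(108, 5))   # total IO cost: 2N * passes = 864 IOs
```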
So what if you're trying to sort a table and there already exists an index on that table, constructed on the very attributes you want to sort on? Can we accelerate the sorting step by making use of this index? Any suggestions from anyone? Then we don't have to do the scan, right? We don't have to scan the entire table and do this merge sort. Yes? Just scan the leaf level? Exactly, that's correct. The idea is to simply retrieve the tuples in the desired sort order by traversing the leaf pages of the tree, because the tree has already sorted all the tuples, right?

There are two cases to consider here: the B plus tree can be either clustered or unclustered. In a clustered B plus tree, the order of the data records is the same as the desired sort order. So let me illustrate. In the clustered case it's trivial: we traverse to the leftmost leaf page and then retrieve tuples from all the leaf pages in order, and this gives you all the tuples in the desired sort order. So if there is already a clustered index, you can just make use of it instead of doing an external sort; it's always a good idea to use a clustered B plus tree. However, if the B plus tree is unclustered, the ordering of the leaf entries is not the same as the ordering of the data. Here, if we chase each pointer to the page containing the data, we will be doing a bunch of random accesses, in general one IO per data record, and that's a bad idea because you will be doing a lot of random disk accesses. So if you have an unclustered B plus tree, it is usually not a good idea to use it for the sorting operation. That's the takeaway.

What about alternatives to sorting? What if you don't actually need the data to be ordered? Say we want to form groups to satisfy a GROUP BY clause, but we don't actually need an ordering. Another example: you want to remove the duplicates in a table, but you don't care about ordering; all you want is to get rid of the duplicates. Can you remove duplicates without sorting the entire table? Yep, you learned about hashing in the earlier lectures: just use hashing. If you don't care about ordering, you can use a hash table and do unordered data access. The idea is to construct a hash table and remove duplicates by hashing each tuple into the table, making sure that we only return one tuple per key instead of returning multiple duplicates; a minimal sketch of this follows below. And as we covered in the earlier lectures, hashing can be computationally cheaper than an explicit sort. So hashing is an alternative to sorting when the query doesn't care about ordering; in that case you can simply use a hash table instead of actually sorting the data.

Question: if you wanted to use hashing for GROUP BY and your groups were really, really large, wouldn't you end up with lots of chaining in your hash table? That is correct; it depends on the query. The query optimizer, which we will look into in next week's lecture, decides whether to do sorting or hashing depending on the query. It's a very good question.

So to summarize sorting: we just showed that the goal of external merge sort is to minimize disk IO. In the first pass, we produce sorted runs of size B, and in the later passes we do a recursive merge to construct the sorted table. It's a recursive divide-and-conquer strategy. Next week we will illustrate how the query optimizer picks a sorting or hashing operator depending on the ordering requirements in the query plan. If the nodes higher up in the query plan require the input data to be ordered, then it makes sense to do sorting; if not, just do hashing.
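Here is a minimal in-memory Python sketch of duplicate elimination via hashing, my own illustration of the idea; a real DBMS would use an external hashing scheme when the table doesn't fit in memory.

```python
def distinct(tuples, key=lambda t: t):
    """Duplicate elimination without sorting: hash each tuple's key and emit
    the tuple only the first time that key is seen."""
    seen = set()                      # the hash table
    for t in tuples:
        k = key(t)
        if k not in seen:             # O(1) expected probe per tuple
            seen.add(k)
            yield t

# e.g. SELECT DISTINCT on the sid column: list(distinct(rows, key=lambda r: r[0]))
```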
And again, you have to take into consideration other factors, like the number of attributes you want to group by, and so on and so forth. Cool. So we just covered sorting algorithms. In Postgres, for instance, they use quicksort and external merge sort: if the data completely fits in main memory, they just do quicksort, and if it doesn't fit, they use external merge sort. So external merge sort is a frequently used algorithm in most database systems. Any questions on sorting algorithms so far? Cool.

Let's then move on to the next part of this lecture, where we will focus on join algorithms. Join is another frequently observed operator in queries, and again this has to do with the relational model: the relational model inherently avoids unnecessary repetition of information where possible. We don't want to store the same data several times; that's why you try to use higher normal forms. The high-level idea is that we decompose tables using normalization theory in order to avoid unnecessary repetition of information. But you may still want to reconstruct the original tables that you decomposed, so you do a join on the decomposed tables to rebuild the original table, and we don't want to lose any information by doing the join. So joins are frequently used in database systems, and in this part of the lecture I'll present a whole set of join algorithms that can be used to support this operator.

For this, we will use a sailing database to illustrate the algorithms. Anybody here into sailing? If not, you should consider joining a sailing club; it's a fun activity. This is from the Hoofer Sailing Club at Wisconsin; I guess Andy was a member of this club. So we use this data set for presenting the join algorithms. We have two tables here: a list of sailors and a list of reserves. Ignore the other attributes; all you need to focus on is the first attribute, the sid, which we will use to join these two tables. Apparently there are a bunch of interesting sailors in this club. Each tuple has some size, and there is some number of tuples per page because the page size is fixed at 4 kilobytes. The main point is that the sailors table has 500 pages, which we will represent by N, so N is 500, and there are 80 tuples per page, which we represent by P_S, so P_S is 80. The second table, reserves, is bigger: it has 1,000 pages, so M is 1,000, and the number of tuples per page, P_R, is 100. So we have these two tables and we want to do a join.

I will first distinguish between joins and cross products so that the problem is very clear. R join S is a very commonly seen operation, and it therefore must be carefully optimized; you do joins all the time. However, sometimes people actually compute R cross S, the cross product or Cartesian product, where you pair up every tuple in one table with every tuple in the other, and this cross product can be really, really large: it contains m times n tuples, so it grows multiplicatively with the table sizes. A trivial algorithm for joins would therefore be to first construct the cross product and then filter: you enumerate all possible paired tuples and then apply the join predicate.
That's one feasible algorithm, but it's very expensive because constructing the cross product is computationally wasteful. That's why we need better algorithms. I will show that there are many different algorithms you can use to support the join operator, and also that no single algorithm works well under all scenarios. You need to pick different join algorithms for different scenarios, and this is where database system technology gets very sophisticated.

In today's lecture we'll cover four join algorithms; the hash join algorithm will be covered in the next lecture by Andy. The first three algorithms are different variants of the nested loop join: the simple one, the block one, and the index-based one. Then we will cover the sort-merge join algorithm, which is very closely related to the external merge sort algorithm.

Before we move on to the algorithms, let me set up the problem so that we can do some IO cost analysis. Assume that the first table, reserves, contains M pages, that P_R represents the number of tuples per page, and that small m is the total number of tuples in that table. In the same way, there are N pages in the sailors table, P_S tuples per page, and n total tuples. To keep things simple, we will focus on a simple natural join; we can consider more complex join conditions later. The cost metric in all these algorithms is the number of disk IOs; if you minimize this, you speed up the algorithm significantly, because disk IO is the long pole in the tent. We will also ignore the output cost, writing out the actual joined table, for simplicity; we only focus on constructing the join result in memory.

So here's an example join query, probably the simplest one: you just join these two tables on their common attribute, the sailor ID. Assume that we don't know anything about the actual tables, and also assume that there are no indexes on the tables, in particular on the sid attribute. How can we execute this join?

We'll start with a strawman algorithm, the simple nested loop join; this is the simplest thing you can do. You have these two tables R and S. You have a couple of loops: you iterate over each tuple in the outer table R, and for each one you iterate over each tuple in the inner table S. For each pair of tuples, you evaluate the join condition: you check whether the sid of the tuple from R equals the sid of the tuple from S, and if they match, you output a joined tuple. I will note that this algorithm is really slow, and we will see why in a moment. We refer to the table used in the outer loop as the outer table and the one used in the inner loop as the inner table; in this case, R is the outer table and S is the inner table. A minimal sketch of this strawman is below.

So why is this algorithm bad? Exactly: you end up rescanning the inner table for every single outer tuple, so you do a lot of scans. For every tuple in R, you scan the entire S table once, and S has N pages. So the number of disk accesses is M to read the outer table, plus a full scan of S for each of the small-m tuples in R: M + (m x N).
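Here is a minimal Python sketch of the strawman, again my own illustration: each table is a list of pages, and a page is a list of tuples whose join key (sid) sits in position 0.

```python
def simple_nested_loop_join(R, S):
    """Strawman join: for every tuple of the outer table R, scan all of the
    inner table S."""
    for r_page in R:                  # M page reads for the outer table
        for r in r_page:
            for s_page in S:          # N page reads, repeated for all m tuples
                for s in s_page:
                    if r[0] == s[0]:  # join predicate: r.sid = s.sid
                        yield r + s

# IO cost: M + (m * N) page reads -- every outer tuple rescans the inner table.
```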
So you have that many disk accesses, and that's really slow because you are doing it for every single tuple. Let's get some ballpark estimates by plugging in the numbers, M plus m times N: roughly 50 million IO operations for this simple nested loop join, and even on an SSD you will get the result of your query after roughly an hour. That's not so good; it's very slow. Even if you swap the tables and use S as the outer table, the equation changes a little, N + (n x M), but you still end up with about 40 million IO operations. That's slightly better, but it will still take close to an hour; it doesn't improve things significantly.

So what assumptions are being made here? We essentially used a couple of buffers for streaming the two tables R and S, and one buffer for storing the output; again we only used three buffers, but in reality you will have more. Does anyone have ideas for how to speed up this simple nested loop join? That's correct, you can optimize for page-level access. How would you do that? Okay, yes, yep: the idea is to amortize the IO cost by doing block-oriented access instead of tuple-oriented access. In the simple version, for every tuple in R you scan the entire table S once; what if you amortize that cost across a group of tuples in R?

That's the basic idea of the next algorithm, the block nested loop join. Here, instead of reading a particular tuple, you read an entire block: you read a block from R and a block from S, and whenever a pair of tuples matches, you output the joined tuple. Unlike the simple nested loop join, in the block nested loop join we scan the inner table once per block of R, not once per tuple. Because there are a bunch of tuples in every block, you cut down the overall IO cost. In this case, the number of disk accesses is M + (M x N), where instead of small m we use capital M: small m is the number of tuples, capital M is the number of blocks. So this is a significant drop in IO.

Any questions about the block? Yes. So we will generalize this in a moment; that's a very good question. It could be a page, or it could actually be multiple pages if you can make use of all the buffers. In general you can see the pattern: instead of using just three buffers, since main memory typically gives you more, you try to make use of all of them. But we'll start with the simple version and then generalize. Not really; we were still using three buffers earlier, but we only used a single tuple from the block we fetched instead of using all the tuples in it. So the question is: did the previous algorithm only require space for single tuples, as opposed to whole blocks in this algorithm? The answer is no. Even in the previous algorithm, because blocks are defined by the operating system to be something like four kilobytes in size, you end up fetching the full block.
Instead of using it, you were doing something not so smart: using a single tuple after going through all the effort of fetching that block from disk. Yes, you end up doing extra IOs in the previous algorithm because you are not making efficient use of all the tuples you fetch. So the issue is refetching the same block? Yeah, you were repeatedly fetching the same block, uselessly. That's what we are cutting down here: previously, for each tuple in a particular block of R, you scanned S once; here you do it only once for the entire block of tuples. That's correct; you cut the multiplication factor down by the block size, essentially. In the previous case it was m x N; in this case it's M x N, and capital M is roughly small m divided by the number of tuples per page, so it's very significant.

That's a very good question, and it's a nice segue to the next variant of this algorithm: what if one of the tables actually fits entirely in memory? Things can be accelerated, right? Before we get to that, let's think about one particular optimization: which table should be the outer table for this algorithm? Yeah, that's correct: in general the smaller table should be the outer table, because you reduce the number of IOs by making that choice.

So let's plug in the numbers. If you use capital M instead of small m, you will see that we have significantly shrunk the total number of IOs: earlier it was around 50 million IOs, now it's only about 0.5 million. You reduce the total time from roughly an hour to about 50 seconds, just by making use of the block abstraction.

So what if you have B buffers and want to use all of them while running this operation? One thing you can do is use just one buffer for scanning the inner table, one buffer for storing the output, and all the remaining B minus 2 buffers for scanning the outer table of the join. So instead of reading just one block from table R, you read B minus 2 blocks each time, because you have that many buffers, and then you use all of that in-memory data to do the join. Because this algorithm uses B minus 2 buffers for scanning the outer table, you significantly reduce the number of disk accesses: instead of M + (M x N), you end up with M + (ceil(M / (B - 2)) x N). In the prior case B was 3, and 3 minus 2 is 1, so earlier it was M + (M x N). In the prior example we only used three buffers; here we use all B buffers, with B minus 2 of them streaming the outer table, so you cut the factor down by B minus 2. Makes sense? Any questions? Cool. Yeah? You can actually do that, but I don't think it's any different from this particular algorithm; using multiple buffers for scanning the inner table as well as the outer table is equivalent to doing this. A minimal sketch of the buffered version is below.
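A minimal Python sketch of the B-buffer version, under the same page-list assumptions as before:

```python
def block_nested_loop_join(R, S, B):
    """Buffered nested loop join: B-2 buffers hold a chunk of the outer table,
    one buffer streams the inner table, one buffer holds the output."""
    for i in range(0, len(R), B - 2):
        outer_chunk = R[i:i + B - 2]          # pin B-2 outer pages in memory
        for s_page in S:                      # one scan of S per outer chunk
            for r_page in outer_chunk:
                for r in r_page:
                    for s in s_page:
                        if r[0] == s[0]:
                            yield r + s

# IO cost: M + ceil(M / (B-2)) * N page reads. With B = 3 this degenerates to
# M + (M * N); once B - 2 >= M, the whole outer table fits in memory and the
# cost bottoms out at M + N.
```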
So an extreme case is when the entire outer table fits in memory, that is, when B is at least M + 2. In that case you end up with just M + N IOs, because you can read the entire outer table into memory in one pass and then scan the inner table once. You can see it from the formula: if B minus 2 equals M, then M over M is 1, so that term disappears and you are left with M + (1 x N), which is M + N. So this is the extreme scenario where the entire outer table completely fits in memory, as he just mentioned, and in that case you can finish the entire operation with about 1,500 IO operations on our example tables.

Yes? Can you repeat the question, please? What if B is more than M plus N? Okay, so if B is really, really big and you can hold both tables in memory, then yes, you can simplify the algorithm further, but you will still need to bring the data into memory. So this is a lower bound: unless the data is already in memory from a prior operation, M + N is as good as it gets. That's a good question. You need M IOs to fetch the entire outer table, because each IO fetches one page, and then N IOs to fetch the inner table, and then everything is in memory, so you don't need any more IOs. We are not worrying about writing the output to disk, so everything else can be done in memory. The number of IOs is M + N; that's the best-case scenario, and it's a special case of the general equation. Yes, yes: to simplify things, as I mentioned earlier in the lecture, for all these algorithms we are not counting the cost of writing the join result back to disk; we only count the input cost, to keep the analysis simple.

So that's the general idea of block nested loop join: we make use of the block abstraction and try to use all the buffers in memory to speed up the join. These are two commonly used variants of the nested loop join. But these algorithms can still perform badly compared to what you can do with an index. The reason the basic nested loop joins are not so good is that for each tuple, or block, of the outer table, you do a sequential scan over the inner table. You can improve things with the block abstraction and multiple buffers, but at the end of the day you still do a scan. What if you actually have an index? How can you accelerate the join then? Any suggestions? Exactly, exactly: you can replace the sequential scan over the inner table with an index probe, if you have an index over the inner table on the join attribute. You replace the inner loop that was doing a scan with a single index probe to find whether the inner table has a match. So you use an index to find the inner-table matches instead of scanning, and you can use an existing index, or even build one on the fly, to accelerate the join operation.
And this is the reason you might want to use an index nested loop join. Yes, exactly: to keep things simple, we will not focus on the uniqueness aspect, but in general you do a single index probe and you are guided to the matching tuple if you have an index. That's why indexes can greatly accelerate the nested loop join: you get rid of the sequential scan over the inner table and instead do an index lookup on the inner table. So in this case, as I just mentioned, for each tuple in R you probe the index on S to find tuples with a matching sid. Instead of doing a scan, you do an index lookup, and that's logarithmic as opposed to linear in time. So the number of disk accesses in this scenario is M + (m x C), where C is the index lookup cost, and C tends to be about log N. That's much smaller than a sequential scan over the entire inner table.

So here's a summary of the different variants of the nested loop join algorithm: the simple one, the block one, and the index-based one. The general ideas are: pick the smaller table as the outer table; buffer as much of the outer table as possible in memory; and loop over the inner table, or leverage an existing index, to find matches in the inner table. That's the general design of nested loop join algorithms; a sketch of the index variant follows below. Any questions on nested loop joins? Yes, sure. Okay, I'll repeat it: why is the cost M + (m x C)? C is just the index lookup cost, which is roughly log N, where N is the number of pages in the inner table. For each of the m tuples in the outer table, you do an index lookup; that's the m x C term. Yes: we assume the index resides on disk, and you still do logarithmic accesses as you traverse down the B plus tree to find the match, but it's log N instead of N. That's a good question.

Any other questions? So the question is: this assumes you already have an index, right, or you build it on the fly? That's another interesting choice for the query optimizer: whether to build an index to accelerate this operation, or to use this algorithm only if an index already exists. Sometimes it actually makes sense to build an index just for this operation, but more often you use the index nested loop join only if an index already exists, because otherwise you would also have to factor the cost of building the index into this equation, and that can be high as well. So index nested loop join is typically used when you already have an index on the inner table on the join attributes. Good question. Yeah, you're doing a scan over the outer table, and that's what the capital M accounts for; then, for each tuple from the outer pages, you do an index lookup, so that's the m times C term; it's factored in. The equation represents all the IOs we will need to finish the join operation. Cool. So that summarizes the three different variants of nested loop join algorithms.
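Here is a minimal sketch of the index variant. As an assumption for illustration, a plain Python dict stands in for the B plus tree on S.sid; a real probe would traverse the tree on disk.

```python
def index_nested_loop_join(R, s_index):
    """For each outer tuple, probe an index on S.sid instead of scanning S.
    Here 's_index' is a dict {sid: [matching S tuples]} standing in for a
    B+tree."""
    for r_page in R:                          # M page reads for the outer table
        for r in r_page:
            for s in s_index.get(r[0], []):   # one probe per outer tuple, cost C
                yield r + s

# IO cost: M + (m * C), where C ~ log N for an on-disk B+tree probe. Building
# the index on the fly would add its construction cost to this total.
```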
So we will next move on to the last join algorithm that I will present in this lecture: the sort-merge join. The sort-merge join is closely related to the external merge sort algorithm that I presented earlier. Here you also have two phases. In the sort phase, you first sort both input tables, R and S, on the join attributes. Then, in the merge phase, you scan the two sorted tables in parallel and construct the output table. So you have two phases, sort and merge, and this algorithm makes particular sense when the tables are already sorted on the join attributes; in that case it's a nice algorithm to leverage instead of doing a nested loop join. I will present that in the next few slides.

So when is this algorithm most useful? Generally when one or both tables are already sorted on the join attributes, or when the output of this operation must be sorted for operators higher up in the query plan. If you need ordering in the output, or one of the two input tables is already ordered, it makes sense to use sort-merge join rather than blindly doing a nested loop join; you are leveraging the ordering of the inputs. Sorting can be achieved either by an explicit sort step, using the external merge sort algorithm we presented earlier in the lecture, or often by using an index on the join attribute so that you get the tuples in the desired sort order.

Let me illustrate this algorithm with an example, going back to our sailor database. You have the two input tables, the sailors table and the reserves table. The first thing you do in sort-merge join is sort them. The join predicate is r.sid = s.sid, so you first sort both tables on their sid attribute; that's the first step. Now that both input tables are sorted on the join key, you can just do a merge of the two sorted tables to construct the output table; you are exploiting the ordering property in this algorithm. I will note again that you can skip this explicit sort step if there is already an index on the join attributes of one of these tables.

The next phase of this algorithm is the merge. You have two sorted input tables; how do you merge them to construct the output table? You scan the two tables in parallel and output a joined tuple whenever the sids match. For each pair, if the sids don't match, you advance the pointer with the smaller value, and you eventually end up merging both tables; a minimal sketch of this merge phase is below.
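A minimal Python sketch of the merge phase, my own illustration, operating on flat lists of tuples that are already sorted on the join key in position 0:

```python
def sort_merge_join(R, S):
    """Merge phase over sorted inputs R and S. The sort phase, if needed,
    would use the external merge sort from the first half of the lecture."""
    i = j = 0
    while i < len(R) and j < len(S):
        if R[i][0] < S[j][0]:
            i += 1                      # advance the side with the smaller key
        elif R[i][0] > S[j][0]:
            j += 1
        else:
            mark = j                    # first S tuple with this key
            while i < len(R) and R[i][0] == S[mark][0]:
                j = mark                # backtrack for duplicate keys in R
                while j < len(S) and R[i][0] == S[j][0]:
                    yield R[i] + S[j]   # emit every match for this key
                    j += 1
                i += 1

# Merge cost is M + N page reads when join keys are mostly distinct; in the
# degenerate case where every tuple shares one key, this devolves to M * N.
```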
So you scan these two tables in parallel and output the tuples that match; there is no multiplication involved. It's not M x N in this algorithm; it's basically M + N, because you do just two passes, one over each input table. That's the nice thing about this algorithm: you are not examining every pair of tuples across the two tables. Instead you iterate over them independently, in parallel, and that significantly reduces the complexity of the overall algorithm.

So the number of disk accesses has two parts. The sort cost is, at a high level, on the order of M log M for R and N log N for S; more precisely, sorting R costs 2M x (1 + ceil(log_{B-1} ceil(M/B))), and likewise for S with N, where the log base B minus 1 accounts for the merge fan-in, as we derived earlier. Then there is the merge cost, M + N, because you are just scanning the two tables in parallel. So there is the sort cost and the merge cost. Oftentimes the tables are already sorted; if so, or if you have an index, you can skip the sort cost, and then only the merge cost remains, and M + N is as small as it gets, as I discussed earlier in the lecture: you are just scanning the two tables. That's why this algorithm is often used when the input tables are already sorted. Any questions about the cost equation or the algorithm in general? Cool.

So let me illustrate the sort-merge join algorithm with actual numbers. Say we have 100 buffer pages; then you can sort both tables in just two passes each, and the overall cost is about 7,500 IO operations. With a regular off-the-shelf SSD at about 0.1 milliseconds per IO, the total time to run this algorithm is about 0.75 seconds. So it's really, really fast, especially compared to the simple nested loop join algorithm.

So what's the worst-case scenario for the merge phase of the sort-merge join? The worst case happens when every tuple in both relations has the same value for the join attribute. In that case you are in bad shape, because you end up doing M x N instead of M + N: you keep iterating over every pair. This is a worst-case scenario, and at this point I should channel Andy: he wants to clarify that you don't have to worry about this case; it's very unlikely in practice. I would characterize it as just an aspect of the data distribution; you generally don't have a single value for the entire join column, and it also has a relationship with normalization theory: if most of the values were the same, you would probably introduce a layer of indirection, and that helps too. What is the extra cost? The extra sort cost is the cost of actually sorting the two tables if they are not already in sorted order; the M x N figure only represents the merge cost.
So in the worst case this can get as bad as a nested loop join; that's correct. But in the common case it's much better in practice, especially if you already have a layer of indirection, and especially if the sort cost is already paid. Yes, the underlying assumption is that the tables are already sorted, in which case this is a much better algorithm in the common case than the vanilla nested loop join algorithms. Good question. Yep? Yeah, in that scenario you can't do anything better, I guess; it's a worst-case scenario and kind of a trick question. It normally won't happen in practice; that's what I was telling Andy. You don't have to worry about it; it's unlikely.

So that brings us to the conclusion for the join algorithms. This table summarizes all the join algorithms we have focused on so far: the three variants of the nested loop join, namely the simple, block, and index nested loop joins, and the sort-merge join, which can be used when the input tables are sorted. As you can see, the IO cost gets better and better across these algorithms. We started with a really slow algorithm: the simple nested loop join could take an hour to execute this join query. By making use of the block abstraction, we dropped the time from an hour to about 50 seconds, and that makes a big difference for the users of the database system: they don't have to wait an entire hour, so it's more interactive. Things get even better with an index: if you have an index on the inner table, you can do index probes instead of sequential scans, so you get log N instead of N, and that shrinks it further, to around 20 seconds. And finally there's the sort-merge join: if the tables are already sorted, it's a really nice algorithm, and it constructs an output that is also sorted, which can be used by other operators higher up in the query plan. With the sort-merge join, you can run the query in about 0.75 seconds.

I will note that all these numbers are just order-of-magnitude approximations; they depend on the actual hardware and the size of the tables. But they give you a rough idea of why it is important to optimize the choice of algorithm: you are decreasing the query execution time from an hour to less than a second, and that can make a big difference for the end users of the database system. A rough back-of-the-envelope version of these numbers is sketched below.
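To make the summary concrete, here is a back-of-the-envelope calculation reproducing the quoted figures. Two values are assumptions of mine chosen to match the numbers in the lecture, not measurements: the index probe cost C of roughly 2 IOs, and the 0.1 ms per IO.

```python
import math

M, m = 1000, 100_000     # reserves: pages, tuples
N, n = 500, 40_000       # sailors: pages, tuples
B, io_ms = 100, 0.1      # buffer pages; assumed 0.1 ms per IO (SSD-ish)

def sort_cost(pages):    # 2P * (1 + ceil(log_{B-1} ceil(P/B)))
    return 2 * pages * (1 + math.ceil(math.log(math.ceil(pages / B), B - 1)))

costs = {
    "simple NLJ": M + m * N,              # ~50M IOs, over an hour
    "block NLJ":  M + M * N,              # with three buffers: ~0.5M IOs, ~50 s
    "index NLJ":  M + m * 2,              # assumed probe cost C ~ 2 IOs, ~20 s
    "sort-merge": sort_cost(M) + sort_cost(N) + M + N,   # 7,500 IOs, ~0.75 s
}
for name, ios in costs.items():
    print(f"{name:>10}: {ios:>10,} IOs ~ {ios * io_ms / 1000:.2f} s")
```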
Question: where does this fold into query optimization? In what case would the improvement from a block nested loop join to an index nested loop join not be worth it? Right now it roughly halves the time, but in absolute terms it's 50 seconds versus 20 seconds. Let me paraphrase: you can clearly appreciate the drop in cost between the simple and block nested loop joins, because you are cutting the time down dramatically, but the block and index nested loop joins are around 50 seconds and 20 seconds; does that gap matter as much as the gap between 1 hour and 50 seconds? I guess the answer is that it depends on the actual size of the data set. These numbers make sense for small tables, but if you are joining really, really big tables, then it does matter, and especially if you already have an index, why not make use of it? In those settings the gap between the two algorithms is more accentuated, more amplified. In general, optimizing the join algorithm is very important because joins are used so often in the queries we see in the real world; I have heard that some queries end up joining a hundred or even a thousand tables. So optimizing the join algorithm can go a long way in improving application performance, especially as the number of joined tables increases. Any other questions? Cool, so that summarizes all the different join algorithms we covered in this lecture.

I will quickly move on to a very different idea: this is not about join algorithms but about join types. SQL supports different kinds of join types: outer join, inner join, semi join, anti join, lateral join, and so on. This picture represents some SQL join types. For instance, in the first example on the top left, you are doing a left outer join of two tables A and B, and in the middle one you are doing an inner join, where you only get the tuples that have matches in both A and B. So these are different join types, different join operators that the user can make use of, and we implement these join types using the join algorithms. Make sure you grasp the difference between join algorithms, which are used inside the database system, and join types, which are used by the end user.

So let's do a case study. The Postgres database system internally employs a state machine, if you will, to track the state of the join algorithm. If we narrow down on one specific join algorithm, the merge join in Postgres (and this also holds for other join algorithms like hash join), each join algorithm has a state machine to keep track of the state of the algorithm so that it can support all the different SQL join types. The state transitions in these algorithms depend on the join type: in this picture, on the right side, we have a bunch of states, and these states have to do with whether it's an inner join, an outer join, et cetera. So this is how it's done in the real world, and you do this for each join algorithm.

To conclude: there are many join algorithms in common use in database systems, and this is a nice illustration of the sophistication of the technology underlying them. As a developer, if you were asked to join two tables, you would probably start with the simple nested loop join; it took several years of refinement to come up with better join algorithms like the sort-merge join and the index nested loop join, and reimplementing all these algorithms in a new software system can take many years. This is where it makes sense to use database technology as an end user instead of reinventing the wheel inside your own custom application. So join algorithms illustrate the sophistication of database technology, and as we discussed during the lecture, picking a join algorithm is actually very challenging for the query optimizer.
At a high level, a few heuristics are often used. One such heuristic: when the selectivity is very small, meaning the number of tuples matching the join predicate from the outer table is very small, you tend to do an index nested loop join; and when you are joining whole tables, you often end up doing a sort-merge join. But I am waving my hands here; things are tricky because there are so many different join algorithms that could be used for a particular SQL query. The query optimizer has to make all these decisions: it has to decide which join algorithm should be used for executing a particular query, and that's a very, very interesting problem. This whole area of research is called query optimization, so stay tuned for next week's query optimization lecture, where we will illustrate how the query optimizer makes these choices: which join algorithm to use, whether to use sorting or hashing, and so on and so forth.

Okay, so to recap: in this class I presented a couple of sorting algorithms and a bunch of join algorithms. Sorting and join operators appear frequently in SQL queries, and these algorithms are used inside database systems to support those relational operators. In the next class, Andy will cover another join algorithm, the hash join, and he will move on to the more exotic join types like the semi join, anti join, and lateral join that are often used in real applications; those are other interesting join types, and we will cover them in the next class. We will also move on to aggregation algorithms. And I will note that the sorting algorithm I presented in the first part of this lecture often simplifies things for these other operators, like join and aggregation: if the input table is already sorted, you can come up with more efficient algorithms for supporting these more advanced relational operators. Yep, thanks.