Today is the last lecture before the midterm, and it's going to be about query execution part two. Remember, last week we started query execution part one, so we're going to pick up with that today. But as usual, we'll start with the administrative stuff. Project number two is due this Sunday, October 17th, at 11:59 p.m. as usual. Homework number three is going to be released today; it's not out yet, but we'll post it by the end of the day, and it will be due on Sunday, October 24th. Both of those are due after the midterm, which is the next item on the agenda. The midterm is this Wednesday, October 13th, in just a couple of days, during regular class time in this room. It is open book, open notes, as I mentioned last time, so any paper material you want to use is fine. Technically you can walk in here with a wheelbarrow full of printed papers, but please don't kill the environment too much with this stuff. Any of the slides, the textbook, or any notes you have are fine; just no electronic devices, please, so no computers, iPads, phones, that kind of stuff. You are allowed to have a calculator, as I mentioned, because there may be some questions about computing logs on the exam, unless you can do logs in your head, which is also fine. The material covered is everything up to and including today, so everything in lectures one through twelve. And if you check out the Piazza post, there are more details about everything. I've also allocated some time at the end of today's lecture if you have questions about the exam, databases in general, or just life advice, whatever you want to talk about; there's some time built in today to do that. Okay, so are there any questions about administrative stuff or scheduling? And again, at the end there will be more time if you have specific questions about exam stuff.

Okay, jumping right in: query execution. Last class we talked about query execution and discussed how we can compose operators together into some kind of plan that allows our DBMS to execute an arbitrary query. Remember that we go from a SQL query, which gets translated into a logical plan, something with the abstract operators shown here, and then finally into a physical plan, which describes how we're going to specifically execute each operator: the physical implementation of each operator. Continuing the example from previous lectures, we have this SQL query here, which is just doing a simple join of two tables, R and S, on the id column, with a simple selection predicate on S.value, and then returning R.id and S.cdate. Translating that SQL query into a logical plan looks something like what's shown here: we have a base table scan over each of the relations R and S, there's some filtering going on on S with a selection operator, and both of those are fed into the join operator. Again, here we don't say which join operator we're going to use; it could be a hash join, a sort-merge join, or a block nested loop join. This is just at the logical level. And then finally, we apply the projection at the end to filter down to just the attributes we want to return to the user.
Throughout this discussion of how to execute the plan we have here, we've so far been assuming a single worker, which could be implemented as just one thread. In today's class, we're going to discuss how we can generalize this execution to multiple workers. In our system, as we've discussed, we potentially have multiple CPU cores and a bunch of concurrency available to us. We want to take advantage of it by figuring out how to parallelize this plan, which so far we've only discussed sequentially, and make it run concurrently to utilize all of the CPU resources we have. To start, you might wonder why we care about parallel execution. There are a few different parts to that answer. The first comes at the system level: inside our DBMS, parallel execution can give us increased performance, which can be measured in one of two ways. We can get increased throughput, meaning we can execute more queries in our system because we're taking advantage of all of the concurrent resources we have: more queries executing concurrently. On the other hand, if we're talking about speeding up a single query, we can reduce the latency of that query: the time from submitting the query until we start getting back the answers we're interested in. So there are these two dimensions, depending on the specifics of your application. If you have a transactional workload, you might care more about throughput: you have a lot of concurrent transactions coming in and want to execute as many as you can. If you're more on the analytical side, like online analytical processing (OLAP), then you might care about query latency: if you have a really long-running query, you want to complete it as quickly as possible. That's the view from inside the system. From the external perspective, what the users of your application might care about is increased responsiveness and availability. If you have a web page that's an online storefront for your application, users see increased responsiveness: you can handle higher user load, page load times might be faster, that kind of stuff. So increased performance is visible inside the system, but from the external perspective we care about what the user sees. And finally, the last piece, which doesn't get talked about a whole lot but I also think is important, is that if you're able to effectively leverage parallelism, you might end up with a lower total cost of ownership: the total money you spend on your DBMS deployment. If you can effectively leverage all of the parallelism the hardware provides, you may be able to use fewer machines to support your application, and you may be able to be more energy efficient, for example, if you can complete queries faster. All of these things factor into how much money you're spending on your database deployment. One thing we need to define up front is the difference between parallel execution and distributed execution.
Parallel execution, broadly speaking, is how we take a single query that executes sequentially or serially and split it up to make it run concurrently. Distributed execution refers to one specific case of that. But first I want to point out that they share a lot of similarities in the way we think about them. In both a parallel and a distributed setting, the database is spread out across multiple resources. We'll define exactly what resources means, but think about hardware resources: CPUs, disks, machines; they could be servers in a rack, they could be multiple data centers, those sorts of things. You have this pool of resources, and in both cases, parallel and distributed, the database is somehow spread out across multiple of them. The important piece is that, in both cases, we want the database to appear as a single logical database instance to the application. Regardless of what's going on in the background, whether it's spread across multiple disks, multiple machines, multiple data centers, whatever, we want this to be completely transparent to the application. We don't want the application to have to know or care about the physical organization behind the scenes in the DBMS. If you think back to one of the first lectures, we talked about the nice properties of SQL being declarative and abstract to some extent: you don't have to care about low-level implementation details, you just write a query that specifies what you want rather than how exactly to go about getting it. So we don't want the application or the SQL queries we write to have to know about the parallel or distributed setting behind the scenes; that should all be abstracted away and managed by the DBMS. Those are the similarities at a high level, the abstraction similarities. Here are a few concrete differences between the two settings. With parallel DBMSs, when we think about parallel execution on a single machine, we're typically thinking about resources that are physically close to each other: multiple cores or multiple CPUs in a multi-socket machine, all physically located close together in one hardware deployment. The resources typically communicate over high-speed interconnects, and they can all access the same shared memory or the same disk. Communication is assumed to be cheap: relative to making a remote call to another machine in another data center, inter-core or local communication in a parallel DBMS is extremely cheap. The other important point is reliability: if I read or write some memory region, I expect that to be safe and managed at the level of the OS or the hardware. If there is local communication between cores, I expect that communication to happen as I'm told it should. On the other hand, with distributed DBMSs, we typically think about resources that are potentially very far from each other. At a smaller scale, machines in the same rack still have to communicate over the network; at the other extreme, you could have data centers located in different regions around the world.
So you could have one data center here, one data center in Europe, one in Asia, and so on, and all of these distributed calls to each other are very expensive: the latency just in the network round trip is large compared to local communication. The other big differentiating factor, besides slow communication, is the problems that come up: faults that occur in communication. It's important for our applications and the DBMS to be fault tolerant. For example, if you have network partitions or dropped or lost packets in inter-data-center or inter-node communication, you can run into problems with the correctness of your application. So you need to be careful in constructing systems that can tolerate faults at that scale, whereas in parallel DBMSs we're typically not thinking about faults like that. Fault tolerance and the avoidance of expensive distributed communication are going to be a big piece of the distributed DBMS material, which we're going to talk about later in the semester. For today's class, and for what you need to care about for the midterm, we're focusing on parallel execution on a single hardware platform. I just wanted to set up the logical similarities between parallel and distributed databases, as well as these important differences; again, today we're focusing on the parallel execution case. So, today's agenda: we're going to talk about process models, how we handle the concurrent workers in our system; how we can achieve execution parallelism using a particular process model; and then, at the end, a little bit about how we can also achieve I/O parallelism. The first piece is really about how to break up a query's execution into multiple concurrent pieces. The second piece, I/O parallelism, is about how to leverage parallelism from disk or other storage media. Before we get into this, are there any questions about the high-level differences between parallel execution and distributed DBMSs? All right, so the first piece is the process model. What is a process model? The DBMS process model defines how the system we're building is architected to support concurrent requests from multiple users who can be concurrently accessing the DBMS. Specifically, I want to focus on the term worker. In past lectures I've sometimes been interchangeably saying worker thread; as we'll see in a second, a worker is not necessarily always a thread. The worker abstraction is just the component in our DBMS that's responsible for executing a particular task on behalf of the client and then returning the results to the client; we'll talk about what those tasks look like in a few slides. The process model defines how we think about or implement these workers in our system. There are three different approaches we're going to talk about. The first is one process per DBMS worker. The second is to use what's called a process pool. And the third is to use threads.
This is what I was getting at when I said I wanted to make a distinction between what I was loosely referring to as a thread or worker in the past and an actual worker component. In approach number one, process per worker, each worker is a separate OS process; really simple. Essentially, we rely on the OS scheduler to schedule our processes, and we use things like shared memory for global data structures and inter-process communication. The nice thing about this model is that if a process crashes, it doesn't break our entire system. If we have multiple processes running, one to handle each client connection, and one of them fails for some reason, maybe some really rare hardware failure, or a bug in our code (of course, we don't have bugs in our code, but if there were, say a divide-by-zero problem or a read of an invalid memory location), then having that one process crash won't break the entire rest of your system. A single worker process crashes, and everything else keeps running fine, as long as the others don't encounter the same error that caused the first process to crash. At a high level, it works like this: imagine we have some application over here on the left, and we have this layer called the dispatcher, where the application submits its request, the query it wants executed. The dispatcher then forks off a new process specifically to handle this application's query. That worker manages all of the logic needed to execute the query from the client and communicates the results back and forth to the client. So this worker process manages all of the reads and writes and the different query operators we need to execute against the database in order to get the answers back to the client. This is the process model taken by several of these systems here: IBM DB2, Oracle, Postgres. Does anyone have a guess why these systems use this process-per-worker approach rather than either of the other two, for example, the thread-based approach? It's because they're old. If you think about when these systems first came out, in the 80s or early 90s, there wasn't really a portable threading library like pthreads, which is the common standard today. If these systems wanted to run on a bunch of different platforms, they would have had to reimplement a threading layer one-off for each. So instead they went with this process-based approach, which made it easier to move between the different platforms they were interested in. That's why they have what you might think of as a legacy architecture compared to newer options. The second approach we have here is called the process pool approach.
Basically, it's the same as the previous one: we're still using processes to handle clients, but rather than forking a new process for each client that connects to the dispatcher, we allocate a pool of worker processes up front. We have a bunch of processes hanging out in our pool, and when a request comes in, the dispatcher can route it to any free worker process in the pool. This still relies on the OS scheduler and on shared memory or some inter-process communication mechanism to communicate between processes. It can also be bad for CPU cache locality, because we don't have any control over when processes are scheduled or swapped out; there's just this pool working on the different queries in our system, and there's no way for us to control what is running concurrently, so these different processes can essentially end up thrashing the CPU cache. As for examples using this approach: I mentioned DB2 on the previous slide, and more recent iterations have moved toward this architecture, as has Postgres; newer versions of Postgres, since around 2015 I believe, have something like this worker pool approach. The final approach, which I think has become the dominant one within the last ten years or so, and which most new systems built from scratch are taking, is this: rather than using processes, we use a single thread for each worker. We have one process running, the DBMS process, and we spawn multiple threads, one for each individual worker. The DBMS in this case manages its own scheduling: how many threads to use, which threads to use, that kind of stuff. The trade-off you're making is that a thread crash might cause your entire process to crash. There's no isolation anymore where, if one of our processes fails, the rest of the system might be able to continue executing; here, one thread crashing can cause your whole DBMS process to crash. All sorts of systems use this model; the slide shows just a few, and many systems coming out of academic research use it as well. There's a lot lower overhead to using threads than processes: the cost of a context switch between threads is much lower than switching between processes. So, to wrap up the discussion of the different process models: to reiterate what I just said, in the multi-threaded architecture there's less overhead per context switch when you're switching between threads rather than processes, and you don't have to set up a separate shared-memory mechanism, since the threads already share an address space and don't have to communicate only through that mechanism. The key thing to remember, though, is that the thread-per-worker model doesn't mean the DBMS necessarily supports what's called intra-query parallelism. If you think about a query, there are many different ways to parallelize it, and we're going to talk about the different options we have at the system design level.
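To picture the thread-per-worker model itself, here's a minimal sketch in Python; the dispatcher queue, the toy in-memory "database", and the execute() function are all made up for illustration, not how any real system is written. Each worker is a thread inside the single DBMS process, pulling client requests off a shared queue and running each one to completion.

```python
import queue
import threading

# Hypothetical in-memory "database": table name -> list of tuples.
DATABASE = {"r": [(1, "a"), (2, "b"), (3, "c")]}

request_queue = queue.Queue()   # dispatcher puts (query, reply_queue) here

def execute(query):
    """Toy 'query execution': scan the table named by the query string."""
    return list(DATABASE.get(query, []))

def worker_loop():
    # Each worker thread pulls the next client request and runs it to completion.
    while True:
        query, reply_queue = request_queue.get()
        if query is None:                 # shutdown sentinel
            break
        reply_queue.put(execute(query))   # send results back to the client

# Spawn a fixed pool of worker threads inside the single DBMS process.
workers = [threading.Thread(target=worker_loop, daemon=True) for _ in range(4)]
for w in workers:
    w.start()

# A client submits a query via the dispatcher and waits for its results.
reply = queue.Queue()
request_queue.put(("r", reply))
print(reply.get())                        # [(1, 'a'), (2, 'b'), (3, 'c')]

# Shut the pool down.
for _ in workers:
    request_queue.put((None, None))
```

Note that each query here still runs on a single worker from start to finish; multiple clients can run at the same time, but nothing splits one query apart. That's exactly the distinction the next point is about.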
But just because we can have multiple threads or multiple processes running in our system doesn't mean we can take a single query and split its execution into multiple parallel pieces. We may be able to support multiple concurrent queries from completely different users: say I issue a query to the system and you also issue a query, then we can have one thread or one worker managing each of those independently. But in terms of being able to split up or parallelize a single query, having this multi-threading or multi-worker model doesn't by itself mean we can parallelize that one query. Are there any questions at a high level about the different process model options we have? All right, one last note about scheduling. For each query plan, the DBMS has a lot of things to decide: where, when, and how it's going to execute that plan. That could mean how many tasks we're going to break the plan into, how many CPU cores we should use, and which CPU core a particular task should be executed on. For example, I mentioned that you can get cache thrashing if you're swapping between tasks that are competing for what's in the CPU cache, so it may be better to schedule two concurrently running tasks that can share resources on the same CPU core. If you have NUMA regions, which we're not going to cover in this class, and you have tasks that are primarily accessing data stored in one NUMA region of memory, it may be beneficial to schedule them on a particular CPU core. And finally, where a task should store its output; again, NUMA considerations, all that kind of stuff. But the high-level takeaway here, and this is the important piece that's been reiterated several times throughout the course, when we talked about disk I/O and memory management and all that: the DBMS always knows more about the application that's running and what it wants to execute than the OS does. So any time, including in this case of scheduling our tasks, that we can apply higher-level knowledge about what our application or our queries are doing, we can make better, more optimized decisions than if we just left scheduling up to the OS. That's the high-level takeaway from a system design perspective: the DBMS knows more about what it's trying to run, so it can do a better job scheduling than the OS can. And I apologize to any OS people, if your lifelong dream is to be an OS researcher; but in these particular applications I think it's pretty clear that the DBMS, by virtue of having more information, can do a better job on this sort of stuff. OSs are still very important, of course. The next thing I want to talk about is the different types of parallelism. I mentioned that we can have different types of parallelism when thinking about how to split up the sub-pieces of a query, and the two main ones we're going to talk about are what's called inter-query parallelism versus intra-query parallelism. The first, inter-query parallelism, means that different queries are executed concurrently.
Different queries can be executed concurrently in a multi-user environment. We have all these different queries coming at the system, and they can get executed at the same time: for example, if we're assigning one process or one thread to each query, we can have all these queries running at the same time, and they don't need to block each other. This increases the throughput of our system. If we have eight or sixteen cores, we could have potentially eight or sixteen, or maybe even more, concurrent workers running, which gives us a lot more parallelism than if we only had one. Similarly, we can reduce latency for individual query executions, because rather than having queries come in one after another and processing them all serially, where the nth query in the queue has to wait around, potentially for a long time, until it gets to the front of the queue and can execute, we can substantially reduce latency by allowing more queries to execute concurrently. So that's the basic one. The second piece, which is trickier to think about, is taking a single query, splitting it up into a bunch of smaller tasks, and having those execute in parallel. That's a lot harder to do, and just because your system supports concurrency or a multi-worker environment doesn't mean you get this second type of parallelism, intra-query parallelism. Just because you have the ability to run multiple threads or workers doesn't mean you can do this second piece, and it's going to take careful system design to be able to do it. The main benefit of this second kind is that you can decrease the latency of really long-running queries: if you have an analytical query or something, being able to split it into a bunch of smaller pieces can save you a lot of time. This slide is just a summary of everything I just said about inter-query parallelism; the interesting bits are at the bottom. If the queries are read-only, then it requires relatively little coordination between them. We still have to care about what pages we're reading from disk and what's in our buffer pool, that kind of stuff, but we don't have to worry about concurrent updates, and we don't have to worry about race conditions or any of that: if everyone is just reading, nothing changes while our query is executing. If multiple queries are updating the database at the same time, though, it's hard to do correctly, and we're going to start talking about that in lecture 15; I'm not sure how many lectures it takes, but it's multiple. That's after the midterm, so don't worry about it for Wednesday, but keep in the back of your mind that when we're doing these concurrent updates, we have to be careful about how we schedule them and the mechanisms we use to make sure errors don't come up during query processing.
So, intra-query parallelism, as I said, can improve the performance of a single query by taking each operator and breaking it into smaller tasks that we can execute in parallel. You can think about this paradigm in terms of a producer/consumer model: each child operator produces outputs that are consumed by its parent operator, and those parent operators in turn produce outputs that are consumed further up, all the way to the root node of the tree, where we return the result to the user. So you can think about each operator in this producer/consumer paradigm. Now, all of the operators we've talked about, selections, joins, projections, sorts, all that stuff, have a parallel version of the operator, a parallel algorithm we can use to break that operator into multiple smaller pieces. There are two different ways we can implement this. One is to have all of our concurrent workers access some centralized data structure. Imagine you're doing a hash join, for example: you build one big concurrent hash table that all of the workers we have running concurrently are writing to, updating, or reading from at the same time; a shared data structure that all of our workers can access. The other option, and we'll see several examples of how this looks, is to use some kind of partitioning scheme to divide up the work. There are a lot of cases where we don't necessarily need one shared data structure that all the workers access; we can split the work up, partition it into disjoint pieces that can be worked on completely separately in parallel, sort of an embarrassingly parallel setting. The problem is that at the end of the day we need to merge or combine whatever parallel outputs we produce. Say we want to do a group by or something. If we use the first approach, we have one hash table that all of our workers can concurrently update, and at the end we have one hash table to give the answer back to the application. If we take the partitioning approach, where we split everything up, we now have all these different partitions, and we can't tell our application, "well, we have 16 workers in our system, so you're going to get back 16 disjoint pieces." We need to abstract this away so the user or application doesn't care about the low-level implementation details; it should look to the user like everything ran with single-threaded, single-worker query execution. The user doesn't need to know about the parallelism. So in this partitioning case, we need to be able to recombine the results at the end of the query execution; we'll see examples of how that looks in a few slides. Just as an example, so you have some concrete thing to look at, an example of intra-query parallelism is the parallel grace hash join. Remember the grace hash join we talked about: we do a scan of R and a scan of S, and in the partitioning phase we split the tuples into disjoint partitions using our h1 hash function.
Then, for each one of these partitions, we only have to do a join between partition zero of R and partition zero of S, partition one of R and partition one of S, and so on. We don't have to look through all the partitions: each key is mapped to a disjoint partition from each of these tables. If you think about how this would look in some kind of parallel setting, we could do something like this: we have all of our separate workers, and each worker, let's say, takes a disjoint partition of this hash join. For example, our first worker up there could get partition zero, and all of the other workers can do their local joins without having to worry about what's going on in worker one with partition zero. These can all happen in parallel, and we have no coordination problems. Now again, as I mentioned, we need to figure out how we're going to recombine all these results at the end. Each worker does its own thing, its local join on its disjoint partition, but then we need to stitch them all together at the end to give an answer back to the user.
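As a rough sketch of that parallel grace hash join idea, here is some illustrative Python; the relations, the h1 function, and the worker pool are invented for the example, and a real system would be partitioning disk pages rather than Python lists. Each worker joins one (R_i, S_i) partition pair completely independently, and the only shared step is concatenating the per-partition outputs at the end.

```python
from concurrent.futures import ThreadPoolExecutor

# Toy relations R and S; column 0 is the join key.
R = [(1, "r1"), (2, "r2"), (3, "r3"), (4, "r4")]
S = [(2, "s2"), (3, "s3"), (3, "s3b"), (5, "s5")]

NUM_PARTITIONS = 3

def h1(key):
    # Stand-in partitioning hash function.
    return key % NUM_PARTITIONS

def partition(relation):
    """Phase 1: split tuples into disjoint partitions by h1(join key)."""
    parts = [[] for _ in range(NUM_PARTITIONS)]
    for tup in relation:
        parts[h1(tup[0])].append(tup)
    return parts

def join_partition(r_part, s_part):
    """Phase 2: build a hash table on the R partition, probe it with the S partition."""
    table = {}
    for tup in r_part:
        table.setdefault(tup[0], []).append(tup)
    return [r + s for s in s_part for r in table.get(s[0], [])]

r_parts, s_parts = partition(R), partition(S)

# Each worker joins one (R_i, S_i) pair independently; a final "gather"
# step just concatenates the per-partition outputs.
with ThreadPoolExecutor(max_workers=NUM_PARTITIONS) as pool:
    results = pool.map(join_partition, r_parts, s_parts)
output = [row for part in results for row in part]
print(output)   # the join matches for keys 2 and 3, in partition order
```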
At a high level, there are three different options for how we implement intra-query parallelism. Approach number one is what's called intra-operator, or horizontal, parallelism. Approach number two is inter-operator, or vertical, parallelism. It will make more sense in a second why they're called that. Approach number three is called bushy parallelism. As far as I can tell, bushy parallelism is essentially a combination of the first two; it shows up in the textbook, but I think you can just think of it as a hybrid of approach one and approach two. The important thing to remember here is that these techniques aren't mutually exclusive, and I guess that's what the bushy approach shows: the DBMS is left to figure out how it wants to execute any given query using any combination of these techniques. I think horizontal parallelism is probably the most common, so we'll talk about that one first. Basically, for each of the operators we have, a selection, a join, a projection, an aggregation, whatever, we're going to decompose that operator into independent fragments, or subtasks, that perform the same function on different disjoint subsets of the data. The key is disjoint subsets: we want each fragment to process some disjoint subset of the data the query is accessing. If we can set things up so they all execute on disjoint subsets, then we don't have to do any coordination between them while they're running. The high-level abstraction we're going to inject into the query plan is what's called an exchange operator, which is part of the Volcano processing model. Basically, it's a sort of dummy operator that the DBMS inserts into the query plan and that's responsible for coalescing or splitting up results from multiple child or parent operators. For example, if you have a bunch of child operators split into different fragments, we insert an exchange operator that coalesces them, groups them all together, for the next operator in the query plan. Similarly, if you need to split up the result of an individual operator so you can have parallelism across multiple workers, it's the same thing, just on the outgoing or output side. As a concrete example of this, let's look at this really simple query: we're just selecting everything from A where some value is greater than 99, and the query plan is really simple, a table scan of A followed by a selection operator. Thinking about what's happening here, let's say we have three workers we want to split this processing across. We're going to have three smaller table scans, A1, A2, A3; they read disjoint partitions of table A, the selections can proceed in parallel, and then they all get fed up to this exchange operator at the end. Imagine we have table A's disk pages stored down here, and each of these is a fragment, again split across three workers. What the exchange operator does is call next on each of its child fragments, so it has to call next three times here. It calls next on the first fragment; then the selection operator for the first fragment calls next on its table scan, and that proceeds as usual. To perform the table scan, this first fragment, the A1 table scan here, pulls the first page of table A from disk, and the same thing happens for the other two fragments: they pull the second and third pages from disk. This proceeds as usual; they keep processing, and when they're done working on whatever they're working on, say A1 and A2 finish while A3 is still working, A1 and A2 can grab the next pages from disk. Here, say A2 finished a little before A1, so it just grabs the next page on disk. So there is some coordination going on here: we want each fragment to get disjoint subsets, so we have to make sure, for example, that A1 and A2 don't both grab page four. Maybe there's some kind of counter that points to the next page. But at a high level, the different options are the same: we could use that mechanism, or we could use some kind of partitioning scheme, for example A1 takes the pages whose number is 1 mod 3, so it gets pages one, four, and so on, and A2 gets pages two, five, and so on; then you don't have to keep track of the current next page. For the purposes of the exchange operator, it doesn't really care about that low-level detail. We just need to make sure that, whatever our scan method is, we never have two fragments working on the same range or the same pages; they need to be disjoint subsets of pages, and there are multiple ways of doing that. So, are there any questions about how this approach works? Yes. So the question is, is this actually parallel, or is it concurrent? I think maybe the confusion is about whether or not the exchange operator blocks on the next call.
So no, in this case, calls to next on the exchange operator are going to be non-blocking. You can call next, and then this fragment here fills up whatever buffer it needs to fill, or maybe it's a single tuple, whatever; the exchange operator calls next, next, next on each of its children and then waits. In that sense, you have the child workers executing independently of each other. It is correct that if exchange were blocking on each next call, there would be no benefit to this; you'd just be executing three things serially here. So for that reason, exchange calls next for each child and doesn't block; it lets those execute in parallel. Thank you. Sure. Are there any other questions? Okay. So I mentioned there are different types of exchange operators, depending on whether you want to coalesce or split. The one we saw on the previous slide was this first type here, a gather, which basically combines the results from multiple workers into a single output stream. In that example, we want to collect all of the results from our different selection fragments and put them into one final output stream to return to the user application; that's a gather operation. The second type is a distribute operation, which basically takes a single input stream and splits it up into multiple output streams that can be handled in parallel. Imagine we have one particular scan happening and we want to split it out into multiple parallel scans; that's what the distribute operator would do. The repartition one is basically where we're shuffling multiple input streams across multiple output streams. Maybe, in the previous example with the selections, we have these three selections coming in and we want to partition them for some reason into different outputs based on their keys. We can split them up into different outputs, and then each worker in the next layer of the parallel execution has all of the keys associated with its particular partition. So that's type three, repartitioning: you have multiple inputs, they get partitioned based on some logic, and they get split up into multiple outputs.
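To ground the gather variant, here's a minimal, hypothetical sketch in Python of the "select everything from A where value is greater than 99" example from a couple of slides back: three scan-plus-filter fragments each work on disjoint pages and push matching tuples into the exchange's staging area, and the gather step merges them into one output stream. The page layout and names are invented for illustration; a real exchange would pull via next() rather than use a Python queue.

```python
import threading
import queue

# Toy table A, already split into disjoint "pages" of (id, value) tuples.
PAGES = [
    [(1, 50), (2, 120)],     # page read by fragment A1
    [(3, 99), (4, 200)],     # page read by fragment A2
    [(5, 101), (6, 7)],      # page read by fragment A3
]

gather_queue = queue.Queue()        # the gather exchange's staging area
DONE = object()                     # sentinel a fragment emits when finished

def fragment(pages):
    # Scan + selection fragment: SELECT * FROM A WHERE value > 99
    for page in pages:
        for tup in page:
            if tup[1] > 99:
                gather_queue.put(tup)
    gather_queue.put(DONE)

# Each fragment gets a disjoint subset of pages and runs on its own worker.
workers = [threading.Thread(target=fragment, args=([p],)) for p in PAGES]
for w in workers:
    w.start()

# Gather: combine the fragments' outputs into a single result stream.
finished, results = 0, []
while finished < len(workers):
    item = gather_queue.get()
    if item is DONE:
        finished += 1
    else:
        results.append(item)
print(results)    # tuples with value > 99, in whatever order the fragments emit them
```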
Things get a little more complicated when we have more complex operators in the plan, like a join. For example, here we have the scan on A, and we know how to split that one up: we can split A into these different worker fragments that scan in parallel and assign them to their different workers, and we can execute the selection in parallel; we saw how that works, because the disjoint sub-ranges or sub-pages we're working on aren't going to conflict with each other, so we can work on them completely in parallel. We can actually also do one other thing: we can take the projection operator, which is just filtering down to only the A.id and B.value that we need at the end, and push it all the way down. We'll see more about that when we talk about query optimization, which I think is the next lecture, next week. So basically it's pretty straightforward how to split up all of these different fragments: the scan, we saw, can work on different pages, that's fine; the selection is just filtering out values from each page, that's fine; and the projection is just cutting down to only the columns we care about. But when we get to this join, what are we going to do? The join is building some data structure that's going to be shared between all of these different fragments. One way to think about it is to build independent hash tables, which is what we're doing here: each pipeline, each fragment, builds up its own independent hash table, and then we use this exchange operator to combine them into one giant hash table. Another option is to implement some kind of concurrent data structure, a concurrent hash table, that all of these fragments can be updating and inserting into at the same time. So we need this exchange operator here, with whatever method we choose, to combine these different outputs from the scans and feed them into the join. The exchange operator has that staging area where it can get all these results together, implement the gather operator, and present one final unified stream that goes out to the join. Now, on the other side: that's fine, we've built up that whole side of the query tree, and now we have to do the probe side of the join. Again, we can execute this sub-piece in parallel, and that's great. When we get up here, we're just doing reads into our hash table, so those can all execute independently: they're all emitting tuples independently based on probes into the hash table. Then we do have to worry about combining the results here, so we put the exchange operator after the join: it takes the three streams coming out of the probe, each fragment produces its own output, that gets streamed to the exchange operator, and it combines them into the final output for the user. Are there any questions about this more complex example? Yes. So the question is, what does the exchange operator operate on? You need to implement a different exchange operator depending on what it is you're trying to exchange or combine. In this case, we need some kind of logic for combining together the output of the build side of the join, the build-hash-table side. On the other side, it's just the streaming output of the tuples; we're just combining those tuples, essentially merging them into a single output stream. So you need to have some kind of logic, built into an operator somewhere, that tells you, for each potential output of a query, how to combine the pieces. I guess technically you could skip building the hash table in parallel. Well, there are two options. One is to skip building the hash table in parallel and just use one concurrent hash table that everything writes into; then you don't have that exchange there, although I suppose the exchange logic is effectively implemented inside the concurrent data structure. The other option is not to parallelize the build-hash-table part at all, and just cut it off at the projection.
And now, similar to the right side, everything is being gathered into a single output stream, and then you have one operator that builds the hash table. So there are different options there. I think the current state of the art, although different people say different things, the biggest consensus, is around building these multiple local hash tables, partitioning them into separate hash tables, and then combining them that way. So does that answer the question? Are there any other questions? Okay, so that was intra-operator parallelism. Inter-operator parallelism, sometimes called vertical parallelism and sometimes also called pipeline parallelism (we'll see why in a second), is the idea that, rather than breaking the different pieces of the query plan into smaller tasks or fragments, the operators are overlapped in order to pipeline data from one stage to the next without materializing it. That's why it's sometimes called pipeline parallelism. Each worker executes an individual operator, or potentially multiple operators from a segment of the query plan, at the same time. Visually, what this looks like: again, let's say we have the same query here, and we're going to split it up into different pieces. Let's take the join: we split that off and assign one worker to process the join part, and it's just going to be running in a loop doing the join. Say it's a nested loop join here; it just keeps going through the loop and emitting the join matches. Then we have a separate worker that's just doing the projection: every time it gets a result from the join operator, it applies the projection and passes that out to the result set. Now, the challenge here is that since the join operator is probably a little more complicated than the projection operator, the projection worker might be idle or blocked a lot of the time: the projection is sitting around not doing very much, while the join is taking a lot longer. So we don't necessarily get equal utilization across our parallel workers, whereas in the other example all of the workers are always busy working on some sub-fragment. Here, if there are differing amounts of work for each worker to do, some might be idle or blocked waiting on others. One way around this is the bushy parallelism idea, a hybrid of intra- and inter-operator parallelism, where you can group things together in a way that potentially balances the amount of work each worker is doing. In this case, we still use the exchange operator here in order to combine intermediate results from each segment. As a simple example, this query looks something like this as a query plan, and we could split it up using a combination of intra- and inter-operator parallelism to get a partitioning like this: one worker does the join between A and B, another worker does the join between C and D, and then those go through the exchange and on to the final join at the end. So we want to partition up all of our work into these different groups where we get both inter-operator and intra-operator parallelism: we've broken down individual operators, and we've also broken up the workers across different pieces of the plan.
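Here's a rough illustration of the inter-operator (pipeline) case in Python: one worker runs a toy nested loop join and streams matches through a queue, while a second worker applies the projection as results arrive, without materializing the full join output first. The tables and column positions are made up for the example; a real system would pipeline batches of tuples rather than one tuple at a time.

```python
import threading
import queue

R = [(1, "a"), (2, "b"), (3, "c")]
S = [(2, "x"), (3, "y"), (3, "z")]

pipe = queue.Queue()          # pipeline between the join worker and the projection worker
DONE = object()
results = []

def join_worker():
    # Nested loop join of R and S on column 0; emits each match into the pipe.
    for r in R:
        for s in S:
            if r[0] == s[0]:
                pipe.put(r + s)
    pipe.put(DONE)

def projection_worker():
    # Consumes join matches as they arrive and keeps only (R.id, S.value).
    while True:
        tup = pipe.get()
        if tup is DONE:
            break
        results.append((tup[0], tup[3]))

producer = threading.Thread(target=join_worker)
consumer = threading.Thread(target=projection_worker)
producer.start(); consumer.start()
producer.join(); consumer.join()
print(results)   # [(2, 'x'), (3, 'y'), (3, 'z')]
```

Notice the imbalance the lecture points out: the projection worker spends most of its time blocked on pipe.get() because the join is the expensive part.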
Again, I think bushy parallelism is just a generalization of the other two: if you can do intra-operator and inter-operator parallelism, you can arrange them in arbitrary ways to get an arbitrary split across the different workers you have. Are there any questions about any of these parallel processing paradigms? Okay, one important thing to note: we've been talking about how to split up parallel execution, but using all of this parallelism, using these different workers to execute queries, won't necessarily help if the disk is always the main bottleneck. If you're doing something really expensive like an expensive aggregation or a join, where there's a lot of in-memory work going on, then sure, parallelism is going to help you; but if you're always waiting around on disk I/O, it's not going to make things any better. In fact, if you have competing workers trying to work on different segments of the disk at the same time, parallelizing the work can actually make things worse, because now you have disk I/O from all of these workers competing to get different things from disk, and that can lead to bigger problems than if you just executed things serially in a single thread. That leads us to the idea of I/O parallelism, where we split the DBMS across multiple storage devices. There are many, many different ways of doing this; I've listed a few here. For example, you can have multiple disks per database; if you have multiple disks, you can have one database per disk; you can have one relation per disk, so each table goes on a separate disk; or you can split a relation across multiple disks. There are many other versions of this, but basically the idea is that if you have multiple disks, there are different ways we can take advantage of them. One way is to configure either the OS or the hardware to store the DBMS's files across multiple storage devices. This can be implemented in some kind of hardware mechanism like a storage appliance, or through a RAID configuration. How many people are familiar with RAID? Okay, RAID stands for redundant array of inexpensive disks, I think that was the original term, or independent disks, whichever you prefer. Basically, the idea is that it's a kind of storage virtualization where we have multiple physical storage devices and they appear as a single logical device. Imagine multiple physical disks, but they appear as a single logical disk to the DBMS; everything is transparent to the DBMS. So again, we have our DBMS executing there, and there are different ways of splitting up the files. This is RAID 0, which is called striping: basically we're splitting up the pages of the file we have. Page one and page four go on our first disk, page two and page five go on the second, and page three and page six go on the third disk. So we're taking the pages in our file and, transparently to the DBMS (that's important), partitioning them across all of these different physical disks. That's RAID 0. There are a million RAID configurations; the other common one is RAID 1, which is mirroring, where you duplicate the pages across each of the disks. Imagine we have a file with two pages: we replicate page one and page two on every independent disk we have.
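Just to pin down the page-to-disk mapping implied by those two pictures, here's a tiny sketch with three hypothetical disks and 0-indexed pages (the slide's diagram numbers pages from one, so the arithmetic is shifted by one there): with striping, each page lives on exactly one disk; with mirroring, every page lives on every disk.

```python
NUM_DISKS = 3

def raid0_disk_for_page(page_no):
    """RAID 0 (striping): page i of the file lives on exactly one disk."""
    return page_no % NUM_DISKS

def raid1_disks_for_page(page_no):
    """RAID 1 (mirroring): every page is written to every disk."""
    return list(range(NUM_DISKS))

for page in range(6):
    print(page, raid0_disk_for_page(page), raid1_disks_for_page(page))
# RAID 0 spreads reads and writes across disks (more bandwidth, no redundancy);
# RAID 1 can read any copy but must write every copy (redundant, slower writes).
```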
So when do you think RAID 0 will help? What kinds of operations in our DBMS is a RAID 0 configuration going to help with? Yes, exactly. To summarize the answer: if you have a file or a table that's particularly important, that's hot, so the application accesses it frequently, and its pages are split across multiple disks, you get higher overall bandwidth; there's more bandwidth for us to read pages in from each of these disks at the same time. That's right. So how about writes? What's going to happen with writes under RAID 0: better or worse than if we have a single disk? Right, the answer is that writes will also be better, because you can write to multiple separate disks, so you get higher bandwidth for both reads and writes. How about RAID 1 here? What does it look like for reads and writes? Exactly. Reads will be good because we're getting more overall bandwidth; we can read a page from any disk we want. But writes are going to be slower, because now we have to replicate every write to all of the individual disks. So there are different trade-offs you have to consider. Also, with RAID 0, if we lose one of our disks, say a disk fails or one of our files or pages becomes corrupted, there's no redundancy and no way to recover it, because the file is partitioned that way. With RAID 1, if a disk fails, we're still okay, because we have two other disks with the file replicated. So there are different trade-offs to consider across all of these. As I said, there are a million of these RAID arrangements; you can combine RAID 0 and 1 so you get both striping and mirroring, all different things. But the specifics of all the different RAID configurations don't matter for our purposes. What is important is that all of this happens transparently to the DBMS: the DBMS doesn't know what's happening here. All of the advantages we're getting, from either striping or mirroring, happen at a layer below what the DBMS sees; the DBMS just sees one logical disk that has whatever throughput or latency advantages RAID gives you. Are there any questions about this? Okay, I want to talk quickly about partitioning, and then leave a few minutes for the midterm stuff. Database partitioning: some DBMSs allow you to specify the disk location of each individual database you have, and then the buffer pool manager maps pages to the specific disks. So you do have to think, at the application level, about where your individual files are stored, but the DBMS handles the mapping to these different partitions. It's easy to do at the file system level if your DBMS splits up files into different directories or something. The one thing we still need to care about is the recovery log: if a transaction can update multiple databases in one transaction, then we might have to care about recovering that from the recovery log. But don't worry about that for now; we're going to talk about the properties of transactions and recovery and all that stuff in future lectures.
Basically, the high-level idea of partitioning is that we're going to split a single logical table into disjoint physical segments, and we're going to store and manage them separately. Rather than something transparent like RAID that we're getting from the hardware, the DBMS itself, at the table level, takes a single logical table, splits it up into these different physical pieces, knows how it's storing them, and manages them independently. Where the RAID stuff was transparent to the DBMS, partitioning should ideally be transparent to the end user application: the DBMS can do whatever physical partitioning it wants, but you, writing your SQL query, shouldn't have to know that a table is partitioned in some particular way. It should all be abstracted away, and to the application or the SQL query it should look like all of the data is stored in one logical table. So there's an abstraction layer over how exactly the data is physically stored. I'll give a few quick examples of partitioning. One example is vertical partitioning. You can think about this like the column store model we talked about, the DSM storage model. Basically, we're storing a table's attributes in separate locations; it could be a separate file, disk volume, or disk, whatever, but just like in the column store example, we need some metadata that tells us how to reconstruct the original record. As an example: say we have three attributes, attribute 1, 2, and 3, that get read together frequently; they're ints, and they're easy to store at a fixed length. Then we have attribute 4, which is text string data; it's a lot harder to store, and it doesn't get accessed as frequently as the other three. What we can do is physically split, or partition, this table into two separate pieces, so partition 1 contains all of the attributes we access together frequently, and partition 2 contains the text data separately. Again, ideally this should all be transparent to the end user and the application; they don't need to know that the data is physically stored like this, and it can be done behind the scenes by the DBMS.
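A minimal sketch of that vertical split, with made-up data: the tuple id is the metadata that lets us stitch a record back together, and queries that only touch the hot integer attributes never have to read the text partition.

```python
# Logical table: (id, attr1, attr2, attr3, attr4)
rows = [
    (1, 10, 20, 30, "a long text blob ..."),
    (2, 11, 21, 31, "another text blob ..."),
]

# Vertical partitioning: the hot fixed-length attributes go in one physical
# partition, the rarely accessed text column in another, both keyed by tuple id.
partition1 = {r[0]: r[1:4] for r in rows}   # id -> (attr1, attr2, attr3)
partition2 = {r[0]: r[4] for r in rows}     # id -> attr4

def get_hot_attrs(tuple_id):
    # Queries touching only attr1..attr3 never read the text partition.
    return partition1[tuple_id]

def reconstruct(tuple_id):
    # The full logical record is rebuilt using the shared key.
    return (tuple_id,) + partition1[tuple_id] + (partition2[tuple_id],)

print(get_hot_attrs(1))     # (10, 20, 30)
print(reconstruct(2))       # (2, 11, 21, 31, 'another text blob ...')
```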
The other alternative is horizontal partitioning, and this gets done frequently in distributed DBMSs, but it also has advantages for splitting things up in local parallel settings. Basically, we're going to divide the table up into disjoint segments based on some partitioning strategy. We can use hash partitioning on the keys or the values, we can use range partitioning, or we can use predicate partitioning. So if the tuples match some particular predicate, or they're in some particular range or something, we split them up into disjoint subsets. So here, let's say we did range partitioning: partition number 1 has tuples 1 and 2 in it, and partition number 2 has tuples 3 and 4. Now those are stored separately, but again, they all exist in one logical table. We'll talk more about this when we talk about distributed databases.
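To make those partitioning strategies concrete, here is a minimal sketch in Python. The key values, the boundary, and the two-partition setup are assumptions chosen to mirror the tuples-1-through-4 example above, not anything a specific DBMS does.

```python
# A minimal, illustrative sketch of two horizontal partitioning strategies.

NUM_PARTITIONS = 2

def hash_partition(key):
    """Hash partitioning: spread keys across partitions by hashing them."""
    return hash(key) % NUM_PARTITIONS

def range_partition(key, boundaries=(3,)):
    """Range partitioning: keys below the first boundary go to partition 0, and so on."""
    for i, bound in enumerate(boundaries):
        if key < bound:
            return i
    return len(boundaries)

# Route four tuples keyed 1..4 with range partitioning: partition 0 gets tuples
# 1 and 2, partition 1 gets tuples 3 and 4, but logically it is still one table.
partitions = [[] for _ in range(NUM_PARTITIONS)]
for tup in [(1, "a"), (2, "b"), (3, "c"), (4, "d")]:
    partitions[range_partition(tup[0])].append(tup)
print(partitions)   # [[(1, 'a'), (2, 'b')], [(3, 'c'), (4, 'd')]]
```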
Okay, so are there any questions about partitioning or how any of that works? Okay, so to just wrap up what we talked about today: parallel execution is especially important, which is why almost every major DBMS supports it. They don't all necessarily support all types of parallelism. For example, you can have multiple queries running concurrently: MySQL, for example, lets you run multiple queries concurrently, and MySQL is very popular, but there's no intra-query parallelism, so it can't parallelize something like a scan or a join that's running inside a single query. But it does allow multiple queries to run at the same time. It's hard to get right for a lot of reasons. There's coordination overhead, there are scheduling problems, and there are concurrency issues: if you have multiple readers and writers going at the same time, you have to care about that sort of stuff. And there's also this problem of resource contention, figuring out how best to allocate the resources that you have in the parallel setting. So, okay, are there any final questions about parallel execution? And then the midterm.

Okay, so the midterm exam. Who is it for? All of you. What is it? The midterm exam. Where? Here, in this room. When? Wednesday, October 13th, so in a couple of days, during the usual class time, so an hour and 20 minutes or whatever that is. The why is a more existential question; your best bet is to watch this video, and that may have some of the answers you're after. And there's a Piazza post, and I think it links you to the midterm exam guide, where we have some general reminders, information about the exam, and a practice exam with solutions. So the exam will cover all of the lecture material up to and including today, so parallel query execution will be the last topic. It could be anything from lecture one, the first lecture, all the way up to lecture number 12 today. As I mentioned, it is open book, open notes, whatever paper material you want, but please, no electronic devices that have an internet connection. Calculators are fine, but not the calculator on your phone, because then you could just be using your phone. So specifically, what to bring: your CMU ID, since we're going to check it when we collect the exams, so you need your CMU ID; a calculator, unless, as I said, you're really good at mental math for logs; and a pen or pencil, with pencils recommended. What we're going to do is it's going to be all multiple choice, where you fill in the answers, and then we're going to scan them in and auto-grade them. So pencil is preferable, unless you're really confident in your answers and want to use a pen, I guess, but pencil and eraser if you want to change your answer during the exam.

So that's everything I have. Are there any specific questions about the exam, the format, the content, any of that kind of stuff? So the question is, for the multiple choice, do you have to fully fill in the box? Yes, there are going to be little boxes, and you need to fill them in as fully and darkly as possible so that the scanner will pick it up when we scan it in. And again, if there's a problem where you filled in an answer and it doesn't quite pick it up or whatever, then just let us know and we'll regrade it. Yes, the question is, will there be partial credit for scratch work, or will the final score be the final score? So I would suggest this: fill in whatever your answer is, and if you think there's some aspect of the question, some explanation you want to give about the answer, write it in. Don't do this for every question, but if there's one in particular that you think is, you know, extenuating, write in either notes or your scratch work or whatever explaining your answer, and if you ask for a regrade request on that particular question, we'll take a look. But again, don't write out all of your work for every question and assume you're going to get some kind of partial credit. Just fill in the answers as best you can, and if there's something in particular that you want to write out an explanation for, you can fill it in next to that question. Yes, so the question is, if there are multi-part questions that rely on answers from previous parts of the question, how will that work? Right, as far as I know, and I will double-check, there are no multi-part questions like that, so every question should be able to be answered independently. I will double-check and make sure that's the case, and if it's not, I will edit it so it is the case. But each sub-question or each part should be independent of all the other parts. So there won't be a case where there's a 10-part question and you get the first part wrong, so now you get the next nine wrong; all of them are going to be independent. Are there any other questions? Final chance. Okay. Oh, sorry, no, one more. So the question is, if all your notes and books are on a tablet, can you bring it? You can bring it, you just can't use it. So no electronic devices, nothing with an internet connection. If you want to print them out, or you want to hand-write notes based on them, then that's fine, but no electronic devices outside of a calculator during the exam. So if there are no more questions, we'll see you all on Wednesday.