Okay, let's get started! A lot to discuss today. So again, thank you DJ Drop Tables for having us out today. So you're flying out immediately after class, right? Where are you going? My boy, Stabby Steve, has got a hedge fund that I'm interviewing for. Your boy has a hedge fund? Yeah, Stabby Steve, he's trying to do like... It's like blockchain, you know? He's trying to do blockchain? Yeah, like he's trying to do it with like selling... You know, it's like something more... Some kind of multilevel marketing. That sounds sketchy as f***. Yeah, alright. Alright, so that's his problem. Alright, your problems are these things. So homework three is due today at midnight. The midterm exam will be in class a week from today, here. Again, at the end of this lecture we'll do a review of the outline of what's expected of you for the midterm. And then project two will be due after the midterm, on Sunday, October 20th. Any high-level questions about any of these things? Okay, so let's jump into it. So last class, we started talking about query execution. We said that we were going to have these operators in our query plan, and then we showed different ways to either move data from the top to the bottom or from the bottom to the top. And we also showed that sometimes you could send a single tuple, a batch of tuples, or all the tuples from one operator to the next. And during this discussion, we made a big assumption, or rather I didn't even talk about, how these operators are actually being executed. We just sort of said at a high level there are these functions that call next on each other and pass data around. So what we're going to talk about today is how the database system is actually going to execute these things. Right, we're going to have these workers execute the operators and produce some result. For the most part in the last class, we could just assume that we were talking about single-threaded execution, meaning there was going to be one thread that calls next at the root, assuming you're doing the iterator model going from top down, and keeps calling next going below until we've got some data, and then one thread did everything. But as we spent a bunch of other lectures talking about, we know how to build thread-safe concurrent data structures where we could have multiple workers, multiple threads or processes, executing these tasks for us simultaneously, and then at the end we combine it all together into a single result that we hand back to the application or the terminal, whoever invoked the query. So that's what we're going to talk about today: how do we actually execute queries in parallel. It's sort of obvious why you would care about parallel execution in the modern era of what CPUs look like today, or GPUs for that matter as well. We have a lot more cores available to us, and so we want to be able to take advantage of them. And the benefit we get in a database system if we can support parallel execution is obviously better performance. Not always, but usually that's the case, or at least you want it to be the case. And this can be in terms of better throughput, meaning we can run more queries per second or process more data per second.
We'll also get better latency, meaning the time it takes to execute a single query can be cut down because we can run things in parallel. The other advantage is that we'll get better responsiveness or availability of the system, meaning the system will feel more lively and respond more quickly to our requests. Again, think about it. Remember, we're talking about disk-oriented database systems. And remember I said that for all the data structures we've talked about so far, like the table heap, any time a query accesses a page that's not in memory, that's not in the buffer pool, it has to stall while we go fetch it from disk and bring it into our buffer pool. So if we only had one thread or one process for our entire system, every single time we had to touch data on disk, there would be this long pause while we go and get the data we need. The system would look unresponsive. But by allowing concurrent operations, concurrent execution, we can have one thread block because it's going to go get something from disk, while other threads keep running, hopefully operating on what's already in memory, and can still make forward progress. And at the end of the day, what this all adds up to is that it reduces the total cost of ownership, or TCO, of our database system. TCO is usually how people in the enterprise world think about the cost of a database system. It's not just the cost of buying the machine or the cost of paying for the software license. It's the total cost of actually running this thing for some period of time. So that includes the software license, it includes the hardware, it also includes the labor cost to actually set up the software and the machine, and the energy cost to actually run the servers. And so if we can do more work with less hardware, that cuts this down significantly. So this is a big win for us as well. It means that if we buy a new machine that has a lot more cores, we want our database system to be able to take advantage of it. The other thing we need to do before we start talking about parallel execution today is to distinguish it from distributed execution, or distributed databases. At a high level, they're both trying to do the same thing. In both a parallel database and a distributed database, the idea is that you have a database that's spread across multiple resources, which lets you improve different characteristics of the database system. Again, performance, cost, latency, things like that. And I'm highlighting the word resources here because I'm not necessarily saying this means another machine or multiple machines. It could be multiple CPUs, it could be multiple disks. All of these things would count as a distributed or parallel database system. So from the application's perspective, from the person actually opening a terminal, writing a SQL query, and sending it to our database system, they shouldn't know and shouldn't care whether we are a parallel database, a distributed database, or a single-node database system. Again, this is the beauty of a declarative language like SQL. I write my select statement. I don't care where my data is actually being stored. I don't care whether the join has to move data across the network or across different sockets. The SQL query is agnostic to these things.
So if we have a single-node database system and then we start scaling it out to make it parallel or distributed, we shouldn't have to go back and rewrite our application or rewrite all the SQL statements. Everything should still just work. That's the ultimate goal of what we're trying to do here. Again, by having the disconnection, the abstraction layer, between the logical and the physical, we can move the physical stuff around as needed and the logical part doesn't change. And that seems sort of obvious to us now, but it was a big deal. It came up again with the NoSQL systems a few years ago, but it was a big deal back in the 1970s. So the difference between a distributed and a parallel database is the following. This is my definition. I don't know what the textbook says, but to me this makes the most sense, and it follows along with what's in the academic literature. The terms parallel and distributed are often mixed together, but for the most part, here is what people mean. A parallel database is one where the resources available to the system are physically close to each other. Think of a single rack-unit machine that has two CPU sockets. The CPU sockets have the cores that can execute queries for us, and those things are really close together because they're talking over a really fast, high-bandwidth interconnect. That's how the resources communicate with each other. Again, whether it's CPU, compute, or storage doesn't matter for now. And the thing that matters the most in today's discussion is that we're going to assume that the communication between these different resources is not only fast and cheap, but also reliable. Meaning if I send a message from one CPU socket to another CPU socket, it's not going to get dropped. Because if it did, that would mean I'm losing cache traffic on my interconnect, and I'd have a whole bunch of other problems beyond losing database messages; the whole system would be falling apart. In a distributed database, the resources can be far from each other. That could mean different machines in the same rack, different machines in the same data center, or machines in different parts of the world, like East Coast versus West Coast data centers in the U.S. And therefore, in order to communicate between these different resources, we have to go through a slower communication channel, like the public wide-area network, whereas the parallel case can use interconnects between CPU sockets, which are way faster. And because we're going over this unreliable channel, we can't assume that our messages are going to show up quickly, show up in the order we expect, or even show up at all. So there's a whole bunch of other hard problems we'll have to deal with when we talk about distributed databases at the end of the semester. We're going to ignore all that for now and focus on parallel databases. For this, you can just assume it's a machine that has a bunch of sockets and a bunch of cores that can all operate at the same time, maybe also talking to the same local disk. So today, we're going to first talk about the process model. This is how we're going to organize the system to actually have workers that execute our queries.
Then we'll talk about how we actually support parallel execution for our query plans. And then we'll talk about another way to get parallelism, which is I/O parallelism; that's the distinction between compute parallelism and storage parallelism. And as I said, we'll finish up at the end with a quick review of what's expected on the midterm. So the database system's process model is how we're going to organize or architect the system to have multiple workers handling concurrent requests. The reason we have to do this is that the application could either send really big requests that we want to split up across multiple workers, or it could send multiple requests at the same time that, again, we want to divide up across different workers. In the case of OLTP, it's going to be a bunch of small requests, so we want to be able to run those in parallel. In OLAP, it's traditionally a small number of requests, but those are requests we want to break up and run in parallel at the same time. We'll talk about the distinction between those types of parallelism later on. But the general idea is that we want to take requests and run them across multiple workers. I'm using the term worker just to mean some component of the system that's able to take on tasks that some other part of the system tells it to do. Like, the network layer gets a request, we run it through the query optimizer, and now we have a query plan. And the query plan is a task that we want to hand off to a worker or workers and have them execute it. The reason I'm using the term worker is because it could be a process or it could be a thread. At a high level, the basic idea is the same. The worker is traditionally responsible for a given task, and something hands the result back to the application and says, here's the result of the query you executed. So there are three different process models we could have. Yes? Can you say a worker is a thread? Yes. The question is, can we say a worker is a thread? Yes. A worker is going to be either a process or a thread, depending on the process model you use. Does the application necessarily need to be multi-user? No, but it could be. Is that OLTP? Yes. Think of it like I have a web page, I have multiple users accessing the web page, every single page load fires off a bunch of code on the server side, like PHP, JavaScript, Python, and that code does a bunch of requests to get data from the database and then renders the HTML back to you. So I have multiple users accessing my web page, and each of those is firing off code that then fires off different requests. Or it could be a single dashboard or analytical application where one user is submitting queries one at a time, but we want to run those in parallel. It could be either one. So there are three different process models we could have in our database system. The first is that we'll have a single process per worker, then we could have a process pool, and then, spoiler, the most common one is the last one here, at least in newer systems, where we're actually a multi-threaded system and we have one thread per worker. So we'll go through each of these, one by one. Process per worker is the most basic approach, where a single worker is represented by a single OS process.
So what happens is your application sends a request that says, hey, I want to execute a query, and opens a connection to the database system. There's some centralized coordinator or dispatcher that gets that initial request and then forks off a worker, which is a separate process that's going to be responsible for handling this connection. So now the dispatcher says, all right, I got you a worker, here's the port number where you can communicate with it, and from then on the application communicates directly with the worker, and the worker is responsible for executing whatever requests the application sends. The issue is going to be that we have multiple workers that are separate processes, and, assuming we're a disk-oriented system, they could each have their own buffer pool, fetching pages from disk and bringing them into memory. Of course, we don't want multiple copies of the same page in these separate processes, because then we'd have to coordinate across them, and that's going to be expensive if you have the same messages going back and forth. And it's just wasting memory, because we'd have redundant copies of things. The way to get around that issue is to use shared memory, which allows these different processes, which normally have their own separate address spaces, to share access to these global data structures. The OS is what facilitates that. The one advantage you can also get from this approach, if you're worried about the resiliency of your system, is that if you have a bug in your worker code and it crashes, it doesn't take down the whole system, because just this one process crashed. The OS knows it was forked off from the dispatcher, but it doesn't take down the whole system if this one guy fails. Yes? It's not straightforward to me to understand the shared memory part. We have several different processes. How can we share memory? Right, so shared memory is a... Do they cover that in 15-213 now? Yeah, okay. So shared memory is a construct the operating system provides that says, here's some region of memory. Normally when I call malloc in my process, that is my private address space; only I can read and write to that memory. With shared memory, you tell the OS, hey, malloc a bunch of space, and then anybody in that shared-memory group that has permissions is allowed to read and write to it as well. Normally the OS would not let you do that; it's one of the protections the OS provides. But this allows you to have a block of memory be shared across multiple processes. So unless I have shared memory, every single worker is going to have its own buffer pool, and it's going to bring in pages that are just copies of pages other workers are also bringing in. This approach is used in pretty much every old database system. Every database system that was made in the 1970s, 1980s, maybe early 1990s, is using this approach. Anybody want to guess why? Why would you use processes over threads? Maybe there are no threads yet? Maybe there's no thread yet. Very close. So there were threads back then, not as good as the ones we have now, but there was no standard thread API. This is the 1980s, before POSIX, before pthreads. So you had all these different variants of Unix, all these different variants of operating systems.
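As a quick aside, here's a minimal sketch of what that process-per-worker setup with shared memory can look like at the OS level: a dispatcher creates a POSIX shared-memory region, forks a worker, and both processes read and write the same bytes. The BufferPoolShared struct, its counter, and the region name are all invented for illustration; a real system would lay out its buffer pool frames, page table, and latches in a region like this, and would check errors.

```cpp
// Minimal sketch (not a real DBMS): a dispatcher creates a POSIX shared
// memory region, then forks a worker process; both see the same bytes.
// Error handling is omitted for brevity.
#include <sys/mman.h>
#include <sys/wait.h>
#include <fcntl.h>
#include <unistd.h>
#include <atomic>
#include <cstdio>
#include <new>

// Hypothetical layout of the shared region; a real system would put its
// buffer pool frames, page table, and latches here.
struct BufferPoolShared {
  std::atomic<int> pages_read;   // toy stand-in for shared DBMS state
};

int main() {
  // Create a named shared-memory object and size it.
  int fd = shm_open("/demo_bufferpool", O_CREAT | O_RDWR, 0600);
  ftruncate(fd, sizeof(BufferPoolShared));

  // Map it into this process's address space; MAP_SHARED means the child
  // forked below (and any process that opens the same name) sees writes.
  auto *shared = static_cast<BufferPoolShared *>(
      mmap(nullptr, sizeof(BufferPoolShared), PROT_READ | PROT_WRITE,
           MAP_SHARED, fd, 0));
  new (&shared->pages_read) std::atomic<int>(0);

  pid_t pid = fork();
  if (pid == 0) {                      // worker process
    // Lock-free atomics on mainstream platforms work across processes.
    shared->pages_read.fetch_add(1);   // "I brought a page into the pool"
    _exit(0);
  }
  waitpid(pid, nullptr, 0);            // dispatcher waits for the worker
  std::printf("pages_read = %d\n", shared->pages_read.load());  // prints 1

  munmap(shared, sizeof(BufferPoolShared));
  shm_unlink("/demo_bufferpool");
  return 0;
}
```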
So, back to the history: if I had my database system and I wanted to have it run on VAX and all these other OSes, I had to rewrite my database system to use the threading API for each of those different operating systems. Now, with pthreads, and with Linux being the dominant Unix variant that everyone uses, that's good enough for everyone. But back in the day, it wasn't like it is now; everyone had their own threading package, there wasn't a standard API, and you had to rewrite everything. But everyone had fork and join. Those were the basic operating system primitives. So if you built your database system using this process model, it would work pretty much everywhere. An extension of this is called the process pool. In this case, we're still forking off processes, but the idea is that instead of forking off a process for every single connection that comes along, we just have a bunch of workers sitting around, and our dispatcher can pick one of them and say, all right, now you're in charge of executing this query. And what you can also do, because you have a pool and you're aware that there are other processes around that can help you do work, is actually get some query parallelism, because now you can say, well, I need to execute this query and it's going to be too much work for me to do alone, so maybe I'll give half of the work to another process and let it run. With the single process-per-worker model from the last slide, you're not aware of what else is running, and you don't want to fork a process while you're running because that's expensive, whereas here you have processes sitting around that you can readily reuse. So, yes? Is the worker pool part of the database system, or is it part of the application? So the question is, what is the database system here? The database system would be everything over here, all of this. This is the application; this is your website, or whatever desktop application talks to the database. So this is sending SQL queries, and the dispatcher is the one handing them off. Yeah, so think of it as a dividing line that says everything on this side is the database system. The important thing to understand, though, in the case of this slide and the previous slide, is that these are full-fledged OS processes. We're not doing any scheduling ourselves in the database system. The OS is responsible for doing all that scheduling. Now, we can give it nice values or priority flags and try to say this one should get higher priority or more run time than this other process. But at a high level, we can't control what gets scheduled. Once we hand off the work, it just runs. Yes? Is the idea that you have more than one worker pool? The question is, do you have more than one worker pool? No. You just have one. Then you have just a fixed number of processes in the worker pool? Correct. And the question is, do you have a fixed number of processes in the worker pool? Yeah. This is something you would define when you turn on the database: you say how many worker processes am I allowed to have. Because otherwise it'll just fork forever as connections come in, and then the system will get overwhelmed.
And then typically what you do, in real systems, is you always have one worker be a special worker, so that if the system gets locked up, there's always one worker that can take an incoming request from the administrative account so that you can start killing things and cleaning things up. What does a process do if it doesn't have any work to do? The question is, what does a process do if it doesn't have any work to do? It just waits, right? And actually, we were just talking about this in our developer meeting yesterday for the database system we're building. In our old system we were building at CMU, we had this issue before we threw all the code away. If there was no work to do, our CPU utilization would still be like 60%. It was just doing useless stuff, just polling on something. Ideally, if there's no work to do, you want the CPU utilization to be like 1-2%. And on the one machine I use for all the demos, I'm running SQL Server, MongoDB, Postgres, MySQL, MariaDB, and on some of them the CPU spikes to like 10% when they're doing nothing. Most of them are running at like 1%. So it's still doing something, because it's checking to say, hey, is there work for me to do? But you don't want to burn cycles. Yes? Do some worker pools use work stealing to try to improve the running time? Yeah, so the question is, do some worker pools use work stealing? In the high-end systems, typically what happens is, I should have made a slide for this, the dispatcher or the coordinator knows who's doing what work. So it can, on the fly, recognize, oh, this one worker is taking a long time to read a bunch of data and it has a bunch of stuff in its queue that it still needs to process, so maybe I'll take some of its work and hand it to somebody else. Yeah, the high-end systems can do that. All right, so this is the approach that's used in IBM DB2. Postgres switched to this model in 2015. And going back to the previous slide, again, Postgres, Oracle, and DB2, these are all older systems, from the 70s and 80s. I think Sybase and Informix might also work this way, again, also from the 1980s. Most of the modern systems do the last approach, which is the multi-threaded one. The basic idea now is that instead of having a bunch of different processes all doing different tasks, we just have one process for the database system, and inside it, it has its own threads and it can decide how to dispatch things as needed. Again, this is just using pthreads, or whatever the equivalent is on Windows. So in this environment, because we now know what the tasks are and what threads we have, we can do a better job; we have an easier, global view of what all the threads are doing and what tasks are available to us, and we can make scheduling decisions for individual threads. In the process model, we're giving that up to the OS and letting the OS figure things out. So in my opinion, the multi-threaded model is the way to go. From an engineering standpoint, it's easier to handle because you're not dealing with all these OS semantics for shared memory or dealing with process management. The overhead of doing a context switch in a multi-threaded environment is also much lower.
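Here's a minimal sketch of what this one-process, thread-per-worker arrangement can look like: the database process owns a pool of worker threads pulling query tasks off an internal queue, so the system itself decides what runs next instead of forking a new OS process per connection. The WorkerPool class and the task closures are invented for illustration; note that idle threads sleep on a condition variable instead of burning cycles, which is the behavior you want from the idle-CPU discussion above.

```cpp
// Minimal sketch of a multi-threaded worker pool inside one DBMS process.
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>
#include <cstdio>

class WorkerPool {
 public:
  explicit WorkerPool(size_t n) {
    for (size_t i = 0; i < n; i++) {
      workers_.emplace_back([this] { RunLoop(); });
    }
  }
  ~WorkerPool() {
    {
      std::lock_guard<std::mutex> lk(mu_);
      done_ = true;
    }
    cv_.notify_all();
    for (auto &t : workers_) t.join();
  }
  // The dispatcher hands a query task (here just a closure) to the pool.
  void Submit(std::function<void()> task) {
    {
      std::lock_guard<std::mutex> lk(mu_);
      tasks_.push(std::move(task));
    }
    cv_.notify_one();
  }

 private:
  void RunLoop() {
    for (;;) {
      std::function<void()> task;
      {
        std::unique_lock<std::mutex> lk(mu_);
        // Sleep until there is work; avoids burning CPU when idle.
        cv_.wait(lk, [this] { return done_ || !tasks_.empty(); });
        if (done_ && tasks_.empty()) return;
        task = std::move(tasks_.front());
        tasks_.pop();
      }
      task();  // execute the query plan fragment
    }
  }
  std::vector<std::thread> workers_;
  std::queue<std::function<void()>> tasks_;
  std::mutex mu_;
  std::condition_variable cv_;
  bool done_ = false;
};

int main() {
  WorkerPool pool(4);   // fixed number of workers, as discussed above
  for (int q = 0; q < 8; q++) {
    pool.Submit([q] { std::printf("executing query %d\n", q); });
  }
  // The pool drains remaining tasks and joins its threads in the destructor.
}
```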
Now, you still pay a penalty when the OS switches from one thread to the next within the same process, but it's not as heavyweight as going from one process to another process, because of all the security and protection mechanisms the OS needs for the in-memory state of each process. So I can't prove this scientifically, but in general, a multi-threaded database system will be faster than a multi-process one. The thing to point out, though, is that just because we're going with the multi-threaded process model does not mean we automatically get parallel query execution. In particular, we may not necessarily get intra-query parallelism, which we'll talk about in a few more slides. Meaning, there's no guarantee that even though our database system can run with multiple threads, if I give it a single query, it can break that query up across multiple threads and run those in parallel. So MySQL 5.7 is a multi-threaded database system, but it can't do intra-query parallelism. This might have been fixed in 8; I forgot to check before today. All right? And as far as I know, there's no database system that's been built in the last 10 years, either from an academic standpoint or from a commercial, startup, enterprise standpoint, that's gone with the multi-process model, unless it's a fork of Postgres, which is actually a very common approach. There are a lot of database systems that take Postgres, because it's BSD-licensed so you can do whatever you want with it, and it's actually pretty well written compared to MySQL, and then rewrite the parts of Postgres that are slow for their particular application and have that be their new database system; Vertica, Greenplum, Timescale all do this. And what happens is that they inherit the legacy Postgres process architecture if they go down this route. But anybody that's starting from scratch with a new code base is almost always going to end up being multi-threaded. All right. The other thing, which we briefly talked about and don't have time to go into today, is scheduling. We've talked a little bit about having a dispatcher or coordinator that understands what tasks need to be executed and what resources or workers are available, and then it can decide how many tasks to split a query into, what CPU cores should execute those tasks, when one thread should pause for another thread, and once a task produces output, where that output actually goes. All of these things we have to worry about if we want to build a parallel database system. In general, there's not one way that's better than another. It depends on the environment you're working in and what kind of target workload you want to support. But as I've said multiple times throughout the semester, the database system is always going to know better than the OS, so it can make better decisions about all of these things. Yes? Isn't the core on which a thread or process runs always decided by the OS? Say it again? Isn't the core on which a thread or process runs always decided by the OS? The question is, for a given thread, doesn't the OS decide what core it runs on? No. There's numactl, and taskset in Linux.
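Those are command-line tools, but a system can also pin its own threads from code. Here's a hedged sketch for Linux; pthread_setaffinity_np is a Linux-specific, non-portable call (hence the _np suffix), and the choice of core 2 is just for illustration.

```cpp
// Linux-only sketch: pin the calling worker thread to one CPU core so the
// DBMS, not the OS scheduler, decides where it runs (and which NUMA node's
// memory it stays close to).
#ifndef _GNU_SOURCE
#define _GNU_SOURCE            // needed for pthread_setaffinity_np
#endif
#include <pthread.h>
#include <sched.h>
#include <thread>
#include <cstdio>

static void PinCurrentThreadToCore(int core_id) {
  cpu_set_t cpuset;
  CPU_ZERO(&cpuset);
  CPU_SET(core_id, &cpuset);
  int rc = pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &cpuset);
  if (rc != 0) {
    std::fprintf(stderr, "failed to pin thread to core %d\n", core_id);
  }
}

int main() {
  // e.g., one worker thread per core, each pinned to its own core.
  std::thread worker([] {
    PinCurrentThreadToCore(2);   // hypothetical choice of core 2
    // ... run plan fragments that touch memory local to this socket ...
  });
  worker.join();
}
```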
With those, you can have complete control to say my thread is going to run on this core, or these cores, and the OS will enforce that for you. If you don't do anything, then the OS will try to figure it out. Say you're accessing memory on a multi-socket machine; it has two CPU sockets, and in modern NUMA systems, each CPU socket has local memory, the DIMMs that are close to it. So if your thread is running here and you're accessing memory on the other socket, the OS or the CPU could automatically migrate you over. But in a high-end system, we know exactly what data we're going to touch. We can pin ourselves ahead of time and say, all right, we know we have this thread running on this core, it can only read data from this memory location, so all tasks that touch that memory location go there. We can do all that ourselves, and we can do a better job than the OS can. All right, so let's talk about parallel queries. There are two types of parallelism we're going to want to support: inter-query parallelism and intra-query parallelism. I'll go through both of these, but we're going to spend most of our time today talking about the second one. Inter-query parallelism is where we execute multiple queries that are doing distinct things at the same time. Again, this improves the throughput and latency of our system. And for intra-query parallelism, we're going to take one query, break it up into subtasks or fragments, and run those in parallel on different resources at the same time. So again, inter-query parallelism is what I've already described. The idea is that we have multiple requests coming in from our application, and instead of running them one after another on a single thread, we have multiple workers and multiple threads run them simultaneously. That way we get a response back to the application more quickly with the result they were looking for. If all the queries we need to execute are read-only, meaning they're not doing inserts, updates, or deletes, they're just doing select statements, then this is super easy to do, because there aren't going to be any conflicts. There's no issue of, you know, I'm trying to update the same hash table while you're reading it. Everything just works very nicely. So this is super easy. But this isn't always that common. The thing that's going to be super hard for us is when we have multiple threads updating the database at the same time. Now we've got to worry about all the concurrency control stuff we talked about for the B+Tree and the hash table, but now for the actual data itself. If we have two queries trying to update the same tuple at the same time, what should happen? So the good news for you guys is that we're going to punt on this until after the midterm, because it's a whole other ball of wax we've got to deal with, which is super hard and super awesome. We're going to spend basically two weeks discussing it in exhaustive detail. This is the thing I'm super excited about; it's one of my favorite parts of database systems, that they can do these concurrent operations at the same time. But it's super hard to do. So we'll cover this after the midterm. All right, so for this class, like I said, we're focused on intra-query parallelism. This is going to be useful for analytical queries where we have multiple resources and multiple workers available to us.
And we're going to split the query up into fragments or subtasks and run them in parallel at the same time. For this discussion, we're going to focus on compute parallelism, meaning I have multiple workers, multiple threads or cores, available to me, and I'm going to use them for the same query. The way to think about how we organize this is that in our query plan, we have these operators. We've already discussed how they have this next function that can move data: you call next and it gives you back a chunk of data or a single tuple. So we can think of that in terms of a producer/consumer paradigm, where each operator is not only a producer of data, like if you call next on it, it produces some data for you, but it also potentially consumes data from some operator running below it. And so we can think about how we organize our query plan in this producer/consumer model and see how we can run these things in parallel in different ways. The first thing I'll say is that for all the operator algorithms we talked about, there are parallel versions of all of them. But they differ based on whether you have multiple threads updating some centralized data structure at the same time. Like, if I'm doing a hash join in parallel, I could have multiple threads build out my hash table, and then multiple threads could probe that hash table. Or I could partition the input data I'm consuming from the operators below me and have each worker operate on siloed, individualized chunks or partitions of data, and then I don't need to coordinate across these different workers running at the same time. Conceptually this is pretty easy to think about. This is the same hash join we talked about before, the partitioned Grace hash join. And before, what I said was we would apply the hash function on both sides, the inner and the outer table, and they would hash into these buckets, these partitions. So when we want to do the join to combine these buckets, we only have to examine the tuples in one partition against the tuples in the matching partition on the other side, because we've partitioned the data that way. So the way to run this in parallel is super easy: we just have a single worker take its own partition, do the join, and produce the output. And you can see how we can do this for all the different things we talked about before. The sort-merge join, any kind of sequential scan: we can break it up, divide the work, and have it run in parallel. The tricky thing, though, is putting the data back together, and there are different ways to do that. So that's what we're going to focus on. The three types of intra-query parallelism we could have are intra-operator parallelism, also known as horizontal parallelism; inter-operator parallelism, also known as vertical parallelism; and then bushy parallelism, which I think is in the textbook. It's really just an extension of the other ones, but I think it's worth showing quickly what it is in case you see it again. And what I'll say, too, is that these approaches are not mutually exclusive. If you want to run queries in parallel, you don't pick just one of these three; you can actually do a combination of all of them. And this is what the database system can figure out for you.
It can say, all right, my hardware looks like this, my data looks like this, my query looks like this; I can use some combination of these techniques to get the best performance for my workload. So again, let's go through these one by one. Intra-operator parallelism is where we decompose an operator into independent fragments, and each fragment does whatever the operator is supposed to do on some portion of the input data. So if I have a scan operator on a table, I could have multiple instances of that scan running as separate fragments on separate threads, each scanning a different portion of the table, and they all funnel the data up. The way we're going to combine this data is with what's called an exchange operator. An exchange operator is a location in the query plan that the database system injects artificially. As it produces the query plan, it says, all right, here are the points where I can have parallel fragments, and here's the exchange operator I need in order to combine the results together, because I need a single stream or single data flow going up to the next operator. The exchange operator was actually invented by the same guy who came up with the Volcano iterator model we talked about last class, Goetz Graefe, the same guy who wrote the B+Tree book I was raving about. He has a paper from 1990 that presents this exchange operator. And this approach is pretty much what every single database system that does parallel execution, or even distributed execution, is doing; they just may not always call it exactly the exchange operator. So let's look at a really simple example. We have a single select statement: SELECT * FROM A WHERE A.value > 99. The query plan is super simple: a sequential scan on A that feeds into our filter operator. To run this in parallel, what we do is divide the query plan up across different fragments, each with the scan and the filter. And then we split the database up, which in general it already is, because it's already divided into pages. So within a given plan fragment, we can have it operate on a distinct page. The exchange operator up above has its own next function, just like any other operator. So if we're doing the Volcano or iterator model where we call next going down, the exchange operator says, I call next on the operator here, which then calls next on the scan, and now it starts feeding up data that it retrieves from a particular page. And we do this for all the other fragments as well; they each operate on separate pages. Then the exchange operator coalesces the data it's getting from these three different fragments, these three different workers, and combines them into a single result that we spit out as the output to the application. Because the end result of the query always needs to be a single result. We can't say your data's over here, here, and here because it's at three different workers. We always have to produce a single result. Yes? So does exchange call next randomly on whichever worker? No, no, no. The question is, how is exchange calling next here? It's calling them in parallel.
It knows I have three fragments below me, and therefore I need to call next on all of them, and it fires them off on different workers. Yes? So you might call next on exchange only once, but then it calls next on all three of the fragments below it, and it's just accumulating, waiting for the next call it gets, and then emitting that data when it's asked again? Yes. So the question is, if I call next once here, how does that percolate down to these other ones? You could have a coordinator up above that says, I know I need to get data from all these other guys, and it keeps calling next on them until they produce nothing. But it may happen that one of them lags while the other two are providing data. Right. So think of the fragment here; this is running sort of separately, right? It's like a producer/consumer. This thing is asynchronous: it says, hey, give me some data, and then it fires off and produces the result, and then something else has to come back to it to go get more data. Yeah. Whenever it gets back the data, it'll do that. Yes. Think of these almost like streams that produce results and shove them up to the next guy. And depending on how it's implemented, this thing could know, I'm going to keep shoving up data until someone tells me to stop, or it only does it whenever it's invoked. Different systems do different things. Yes. Do the calls to the fragments also happen in parallel? The question is, do the calls to the fragments, this part here, happen in parallel? Yes. Because you want them to run in parallel. The first call from exchange to... Like this? The previous call here. Right. So again, they're all getting fired off in parallel. They're all doing work at the same time, simultaneously, on different cores. I mean, I was thinking, can it happen that they operate on the same page? How do they make sure... Oh, yeah. So the question is, how do I make sure they're not reading the same page? As part of this query plan, there are two ways to do this. You can say, here's a queue of the work I need to do. So let's say the first and second threads finish up these pages; they say, all right, let me go to the queue and get the next pages to read, and you just keep doing that until you run out of pages, and then you stop. Or you can do pre-partitioning, which we'll talk about later on. You could say, the first guy is going to operate on pages one, two, and three, the second guy on four, five, and six, and so forth. That's just sort of blindly grabbing different pages. Or you can understand the semantics of what's actually in the table, and say, this is a small example, but I want one thread to process all the data where the values are less than 1,000, and another to process all the data where the values are less than 2,000, right? And then they could be reading the same page, but they're processing different portions of the data. There are different ways to do all these things. The main takeaway I want you to have is to understand the exchange operator as a way to coalesce the data or break it up further. Yes? I guess this doesn't guarantee any ordering? Guarantee ordering on what? Yeah, I don't know, I was just trying to think of, like, whether you need to process pages one, two, three, four in order. We don't.
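To make the gather exchange concrete, here's a minimal sketch of that scan-and-filter example: several fragments pull the next unclaimed page off a shared counter (the work-queue idea above), apply the predicate, and push their output into one combined stream that the exchange hands up. The page and tuple types are toys invented for illustration.

```cpp
// Minimal sketch of a "gather" exchange: N scan/filter fragments run in
// parallel, each grabbing the next unclaimed page, and one combined output
// stream is produced for the operator above.
#include <atomic>
#include <mutex>
#include <thread>
#include <vector>
#include <cstdio>

using Tuple = int;                    // toy tuple: just a value
using Page  = std::vector<Tuple>;     // toy page: a batch of tuples

int main() {
  // Toy "table" of 6 pages, 4 tuples each.
  std::vector<Page> table;
  for (int p = 0; p < 6; p++) {
    table.push_back({p * 4, p * 4 + 1, p * 4 + 2, p * 4 + 3});
  }

  std::atomic<size_t> next_page{0};   // shared work queue: next page to scan
  std::mutex out_mu;
  std::vector<Tuple> gathered;        // the single output stream of the exchange

  auto fragment = [&] {               // scan -> filter -> emit into the exchange
    for (;;) {
      size_t p = next_page.fetch_add(1);
      if (p >= table.size()) return;  // no pages left to claim
      for (Tuple t : table[p]) {
        if (t % 2 == 0) {             // the predicate, e.g. WHERE val % 2 = 0
          std::lock_guard<std::mutex> lk(out_mu);
          gathered.push_back(t);
        }
      }
    }
  };

  std::vector<std::thread> workers;
  for (int i = 0; i < 3; i++) workers.emplace_back(fragment);  // 3 fragments
  for (auto &w : workers) w.join();

  std::printf("gathered %zu tuples\n", gathered.size());       // 12 tuples
  return 0;
}
```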
Going back to that ordering question: if the sort order matters, if you have a clustered index on your table and the order in which you process the data matters, then you wouldn't want to do this parallel stuff, right? Because if I can't process page two until page one is processed, that's single-threaded execution anyway. So I wouldn't want to set up all these threads, because that's just a waste of time, wasting resources to have one thing block on another. All right, so the exchange operator I showed you was the basic one, called a gather. The basic idea is that we're combining the results, the different outputs that the worker threads are generating, and producing a single output stream that we hand up above. So again, in my last example, this is the output that we send to the client, whoever invoked the query. That always needs to be combined into a single output. There are other times where maybe you want to take a bunch of different output streams, reshuffle them based on what the data looks like, and then hand them out to other worker threads. So let's say I'm doing the scan in parallel and I want to divide the data up based on the range of the values: I can run the scan in parallel and then put it through a repartition exchange, which splits it up based on the actual values that I'm seeing. And then the last one is a distribute, where we have a single input stream that we divide up and hand out to different output streams. That could be what we did for the Grace hash join: we started off with a single input stream from our table, we built the hash table, and now that's spread out, with different hash buckets going to different threads. The parlance I'm using here, this nomenclature, is actually what SQL Server uses, because SQL Server explicitly shows you the exchange operator in the query plan and in its documentation. For me, this is the easiest way to reason about it. Oracle, DB2, Postgres, all the high-end systems that support parallel execution, they all have something that looks like an exchange; they just may not use exactly this terminology. But it works the same way. Yes? Can you explain the repartition exchange? The question is, what is repartition? Say I had something above this in my query plan that wanted to do a group-by on the values. If this exchange puts out a single stream, then I'm going to have one worker thread do that group-by. But maybe instead I could split it up and say, if the value is even, go in this direction, if the value is odd, go in that direction, and now I have separate worker threads that can each do the group-by for their values. I don't need to coordinate, because each one is doing its own group-by, and then I have another exchange above them that combines the results into a single output. It's a way to take multiple streams and produce other multiple streams, but split up in different ways. So let's look at a slightly more complicated example. Now we're doing a two-way join between A and B. The first thing we want to do is the scan on A in parallel. So we'll assign it to three different worker threads, and inside each plan fragment, we do the scan, do the filter, and then build the hash table.
And this hash table, in this case here, would have to be a global hash table, because I don't know what values are going to be in the table as I'm scanning it. If I had a different hash table for each fragment, then when I do the join, I'd have to check all the hash tables, and that's going to be expensive, that's going to be slow. So these fragments are all building the same hash table, but then I have an exchange operator, and I wait until they've all finished updating my hash table. Then, to do the scan on B, I can run that on, say, two cores or two worker threads. They do the filter, and now they partition and split up the data, and they have their own exchange operator. And then I do the join, and this could be either single-threaded or multi-threaded; here, let's make it multi-threaded. So now I can split up the work after the join; I can have different threads do the probe for the different partitions over here. So you can see how you can compose these things together, where you have different workers generate outputs that are then split across multiple threads, and then you combine them together and split them back up, and you compose them into this giant tree structure that can run in parallel. So this is intra-operator parallelism. Again, the idea is that within a single operator, like a single scan on A, I can have it run in parallel in different fragments. Inter-operator parallelism is where we have different operators run in separate threads at the same time. This is also called vertical parallelism, because the idea is that, for every single operator in our tree, we could have it run as a separate worker, and they're feeding data to each other: the output of one is fed to the input of the other. So basically it works like this. For the join part here, I could have one core, one worker, just do the join. It's getting data from its children operators, it does the join, and as soon as it produces a match, it emits it up to another worker that just takes whatever this guy sends it, does the projection, and then sends that further up the query plan. So now these guys are just spinning, and this is the producer/consumer model. This guy is spinning on the input it's getting from the guys below it, and then it hands off any tuple that matches up to this guy, who is spinning, waiting for that. So this is where the coordination stuff actually matters a lot, because if the number of tuples that this thing is going to spit out is really low, then the other thing is basically going to sit there for a long time and do nothing. So I've assigned a task to a core; it's not going to waste cycles, because it can just block on its incoming queue, but it's taking up a worker and its resources, where it might have been better just to combine these two operators together into a single pipeline. Again, these approaches are not mutually exclusive. I can do horizontal and vertical together. I could have this join broken up with vertical parallelism from the projection here, but I could also have multiple workers all doing the join at the same time. The last one to talk about is bushy parallelism. Again, in my opinion, this is just an extension of inter-operator parallelism. It's not something distinct, but I think the textbook and other guides online talk about it.
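Before getting to bushy parallelism, here's a minimal sketch of that vertical, producer/consumer arrangement: one thread plays the join operator and pushes matching tuples into a queue, and a second thread plays the projection operator, consuming from that queue as tuples arrive instead of waiting for the join to finish. The TupleQueue class and the tuples themselves are invented for illustration.

```cpp
// Minimal sketch of inter-operator (pipeline) parallelism: the "join" thread
// produces tuples into a queue while the "projection" thread consumes them
// concurrently.
#include <condition_variable>
#include <mutex>
#include <optional>
#include <queue>
#include <thread>
#include <cstdio>

struct Tuple { int a; int b; };

class TupleQueue {
 public:
  void Push(Tuple t) {
    { std::lock_guard<std::mutex> lk(mu_); q_.push(t); }
    cv_.notify_one();
  }
  void Close() {                       // producer is done
    { std::lock_guard<std::mutex> lk(mu_); closed_ = true; }
    cv_.notify_all();
  }
  std::optional<Tuple> Pop() {         // empty result means drained and closed
    std::unique_lock<std::mutex> lk(mu_);
    cv_.wait(lk, [this] { return closed_ || !q_.empty(); });
    if (q_.empty()) return std::nullopt;
    Tuple t = q_.front();
    q_.pop();
    return t;
  }
 private:
  std::queue<Tuple> q_;
  std::mutex mu_;
  std::condition_variable cv_;
  bool closed_ = false;
};

int main() {
  TupleQueue pipe;
  std::thread join_op([&] {            // producer: emits "join matches"
    for (int i = 0; i < 5; i++) pipe.Push({i, i * 10});
    pipe.Close();
  });
  std::thread project_op([&] {         // consumer: projects column a
    while (auto t = pipe.Pop()) std::printf("a = %d\n", t->a);
  });
  join_op.join();
  project_op.join();
}
```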
So for bushy parallelism, the main idea is that we have different workers operating on different portions of the query plan at the same time, and we're still using exchange operators as the interchange to move data between them. Let's say I'm doing a join across four tables. If I compose my join algorithm like this, I could have this portion of the query plan, as a fragment, execute on one worker, and these two joins over here execute on another worker, and they're just running in parallel, shoving data up into the exchange operators above, and then we have other workers operating on that. Again, this is why I'm saying that, to me, this is just an extension of inter-operator parallelism, because it's just different portions of the query plan running at the same time. Okay? Yes? Yes, but the third and fourth workers up here still have to wait for the result of one or two, right? The question is, do the third and fourth workers up here still have to wait for the result of one or two? In this case, yes. It depends how the exchange is set up, right? So after you do the join, anything this generates gets shoved up to this guy, and it could start building the hash table with tuples as they come out. You could wait, but you don't have to. It depends how it's set up. Yes? Can we go back to the parallelism in this one? This one. Yeah. So the question is, where is the parallelism here? I have one worker that's running the join, and I have another worker that's running the projection. So this thing's spinning and doing the join, and every single time it finds a match, it hands off the tuple as its output to this guy, who can then start doing the projection. Now, projection is a super simple operation, so it's not that expensive, but the idea is that instead of having this one thing do the join, then do the projection, then go back and do the next join, it hands off to the projection up here and can go back and do the next join. Yes? For intra-operator parallelism, wouldn't access to the hash table itself be the bottleneck, because at any one time only one of the threads can update the hash table, right? So your question, going back here, is that only one thread can update the hash table at the same time? No, right, because they're accessing different pages, so I can do that in parallel. Yeah, okay, but like, okay, fine. Right, so again, you can either have everyone update the same hash table, or you can do the partitioned version, where you have multiple stages. In the first pass, you're still accessing the same hash table, but you have different threads update different buckets as you hash. Two threads might hash into the same bucket, and then you have to deal with that; that's unavoidable. But the next phase you can run in parallel without having to coordinate across any of the workers. Or alternatively, one thread could take a pass and build out the hash buckets, and then you parallelize the rest. Different systems do different things.
If you assume your disk is super slow and not everything fits in memory, then having a single thread build the hash table during the first scan is probably the better approach, because that way you're doing as much sequential I/O as possible and the disk head isn't jumping around. With an SSD, you can do multiple simultaneous requests, so there you do want to build the hash table in parallel. Yes? Is there any particular reason we're not building a hash table on B? So, is there any reason we're not building the hash table on B here? Yeah, it's like Grace: that's what the partitioning is. Partitioning is breaking this up, dividing it up, so that multiple streams can do the probe. So this is just breaking the data up into, say, four partitions; it's like the Grace hash join, but we're still accessing a single hash table. So what's the difference between inter-operator parallelism and bushy parallelism? That's what I'm saying: they can mean the same thing. The definition of bushy parallelism is that part of the tree turns bushy, because it's a bushy tree. This will make more sense when we talk about joins next class; there are right-deep and left-deep trees, but this is a bushy tree because I'm joining two tables over here and two tables over here, so I can have one thread do this join over here and another thread do that join over there, and I don't need to coordinate between the two of them until we get past the exchange. So yes, to me this is the same thing as what I showed for inter-operator parallelism, but it's called bushy parallelism. Correct, yeah. Or having one operator be its own worker. You see this approach a lot, where every operator is its own worker, in streaming systems like Spark Streaming, Apache NiFi, Flink, Storm, or Kafka; that's the architecture they typically use. In database systems, you're probably going to do something more bushy, because you want a single task to do as much work with a tuple as possible, as far up the tree as you can. Yes? So here, worker one could be emitting any given tuple to either worker above it? Correct, yes. How do you decide which one you're emitting to? So the question is, this one exchange operator could either be shoving data up into this operator or into that operator. How do I decide? This is something that would be baked into the query plan. You would say, I want to partition my data on this attribute, or just do round robin, or just do hashing. We'll talk a little bit about this at the end of class, but there's some logic in here to help you decide where to route the data. The easy thing to do is round robin. But actually, for this particular example, you wouldn't want to do round robin, because if there's some tuple coming in on this side that's going to match with a tuple over here, you want them to go to the same partition, not different ones, because otherwise you'll have false negatives. I'll finish this slide to make that more clear later. Okay, so that's it for compute parallelism. Again, at a high level, any database system that supports query parallelism is going to support the exchange operator in some form, and how sophisticated it is depends on how complex the system is.
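That routing rule is worth making concrete: in a repartition or distribute exchange feeding a join, both inputs have to be hashed on the join key with the same hash function, so matching tuples land in the same partition. Here's a minimal sketch of a partitioned, Grace-style parallel hash join that follows this rule; the row types and data are invented for illustration, and the output gathering uses a simple mutex where a real system would have another exchange.

```cpp
// Minimal sketch of a partitioned (Grace-style) parallel hash join: both
// inputs are hashed on the join key into the same number of partitions, then
// one worker per partition builds and probes independently.
#include <functional>
#include <mutex>
#include <thread>
#include <unordered_map>
#include <utility>
#include <vector>
#include <cstdio>

struct RowA { int key; int a; };
struct RowB { int key; int b; };
constexpr size_t kPartitions = 4;

size_t PartitionOf(int key) {                 // same hash for BOTH inputs,
  return std::hash<int>{}(key) % kPartitions; // or matching tuples get split apart
}

int main() {
  std::vector<RowA> A = {{1, 100}, {2, 200}, {3, 300}, {4, 400}};
  std::vector<RowB> B = {{2, -2}, {4, -4}, {5, -5}};

  // "Distribute" phase: route each tuple to its partition.
  std::vector<std::vector<RowA>> a_parts(kPartitions);
  std::vector<std::vector<RowB>> b_parts(kPartitions);
  for (const auto &r : A) a_parts[PartitionOf(r.key)].push_back(r);
  for (const auto &r : B) b_parts[PartitionOf(r.key)].push_back(r);

  std::mutex out_mu;
  std::vector<std::pair<int, int>> result;    // (a, b) join output

  // One worker per partition: build a hash table on its A slice, probe with B.
  std::vector<std::thread> workers;
  for (size_t p = 0; p < kPartitions; p++) {
    workers.emplace_back([&, p] {
      std::unordered_map<int, int> build;     // key -> a
      for (const auto &r : a_parts[p]) build[r.key] = r.a;
      for (const auto &r : b_parts[p]) {
        auto it = build.find(r.key);
        if (it != build.end()) {
          std::lock_guard<std::mutex> lk(out_mu);
          result.emplace_back(it->second, r.b);
        }
      }
    });
  }
  for (auto &w : workers) w.join();

  for (auto &[a, b] : result) std::printf("joined a=%d b=%d\n", a, b);  // keys 2 and 4 match
}
```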
The thing, though, that I mentioned is that if we're running on a slow disk and all of our threads are getting blocked because the things they need aren't in the buffer pool and we have to go to disk to get them, then all of these extra cores and all of these exchange operators aren't going to help us at all, because we're always going to bottleneck on those disk requests. The way to get around this is through I/O parallelism. The basic idea is that we're going to break up the database system's files and data across different locations on storage devices. And we can do this in a bunch of different ways. We can have multiple disks per database, we can do one database per disk, we can do one relation per disk, or we can split a relation across multiple disks. Again, from the SQL standpoint, from the application standpoint, we don't know and don't care how this is set up; the database system hides all of this from us. Is this like RAID? Is this like RAID? Yes. So who here has heard of RAID? Most of you. RAID stands for redundant array of independent disks, at least that's what it's called now. It used to be called redundant array of inexpensive disks, but the disk manufacturers wanted to use RAID and didn't like being called inexpensive, because they didn't want their products to look cheap, so it got changed to independent. So the idea here is that we can configure the system such that multiple storage devices appear as a single logical device to the database system. We can do this through a special hardware controller on the motherboard, we can do it in software, like the Linux kernel's support for RAID configurations, or we can have a storage appliance that provides this functionality for us over a fast interconnect to our system. But the main takeaway is that, for the most part, the RAID setup is completely transparent to the database system. It doesn't know and doesn't care that its storage is broken up across multiple devices. In this other case over here, this is something that the database system manages itself, and therefore it can be smart and make decisions about how it plans its queries, because it knows how the data is actually laid out on the different devices and it knows the speed of those devices. So let's take a really simple example. We have a database that has six pages. This is an example of RAID 0, which is just called striping. What happens is that as the database system creates these pages and writes to them, there's some RAID controller in here that decides, you go here, you go there, say in a round-robin fashion, which device each page gets written to. And it has its own internal metadata that says, oh, you need page one, I know it's on this disk, let me go get it. But again, the database system doesn't know any of this. The other most common approach, RAID 1, is mirroring, where every single device has a complete copy of every single page. And you can have erasure coding or other methods to make sure that if one disk goes down, you can recreate the data from the others. Yes? Does RAID 1 end up being a bit slow? The question is, does RAID 1 end up being a bit slow? For writes, yes; for reads, no. Because for reads, I can say, all right, well, I'll assume my hardware's okay.
So I can go to any one of these, and now I could have one thread reading page one and another thread reading page two on separate devices, and that's all fine. For the writes, I need to make sure they're propagated to all of the devices, and that makes them more expensive. That's the most basic thing you need to know about RAID. There are way more complicated setups, like RAID 5 or RAID 10, that do combinations of these different things. Yes? Round robin is just like dealing cards: everybody gets one, and when you reach the end you go back around and do it all over again. I don't know what the "robin" stands for. When you play games as kids growing up, you go round robin, you hand things out in that order. Yeah, I realize that's an American colloquialism. Okay, so the RAID stuff we just talked about is all transparent to the database system. The thing we can be smart about, though, is the partitioning stuff we've talked about a little bit so far. The idea with database partitioning is that we split the database up into disjoint subsets that can then be assigned to different disks. And now the database system's buffer pool manager knows, when it needs to read a page, which partition or which disk location has the data it's looking for. The easiest way to do this kind of partitioning, if your database system uses one file or one directory per database, is to just set up symlinks so that those directories point to different disks (there's a quick sketch of this below). The high-end systems actually know about the different devices and can do that mapping for you from a centralized place as an administrator. But the quick-and-dirty thing, for MySQL for example, is to just move the data around and set up symlinks. The log file, though, is the tricky part. We'll talk about what a log file is later on; it's basically the record of all the changes that were made, and it usually needs to be stored in a centralized location. If you now have different devices and you need to shard your log file, that's something the database system has to do for you; it's not something you can fake out with the file system. So let's quickly talk about partitioning a little bit. That will help us understand how we're dividing up the work for our exchange operators, and then we'll spend more time on it when we get to distributed databases, because this is the key idea they take advantage of. So the idea of partitioning is that there's a single logical table that we split into disjoint subsets that can be stored and managed separately on our different storage devices. And ideally, we want the partitioning to be transparent to the application. Some systems will let you tell them how you want to partition things; other systems will do it for you automatically. And we don't want somebody writing a SQL query to have to be cognizant of where their data is actually located. That's not always the case in distributed databases, because it can be good to know that if you're joining two tables and one of them is in a remote location, maybe I don't want to write that query, because it's going to take a long time to get the data I need to process it. But in general, we don't want our end users to have to know anything about where the data is actually stored. So there are two approaches to partitioning: vertical partitioning and horizontal partitioning.
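Before getting into those two, here is a minimal sketch of the symlink trick mentioned above, written with C++17's std::filesystem. This is not from the lecture; the paths and database names are hypothetical, and in practice you would probably just run `ln -s` from the shell before starting the DBMS.

```cpp
#include <filesystem>
#include <iostream>

namespace fs = std::filesystem;

int main() {
  // Hypothetical layout: the DBMS expects one directory per database
  // under data_root, but we want each database to live on a different
  // physical disk (mounted at /mnt/disk1 and /mnt/disk2 here).
  const fs::path data_root = "/var/lib/mydbms/data";
  const fs::path disk1 = "/mnt/disk1/sales_db";
  const fs::path disk2 = "/mnt/disk2/inventory_db";

  try {
    fs::create_directories(disk1);
    fs::create_directories(disk2);
    fs::create_directories(data_root);

    // The DBMS still opens data_root/<db name> as usual, but the bytes
    // land on whichever device the symlink points at. The DBMS never
    // knows or cares that multiple disks are involved.
    fs::create_directory_symlink(disk1, data_root / "sales_db");
    fs::create_directory_symlink(disk2, data_root / "inventory_db");
  } catch (const fs::filesystem_error& e) {
    std::cerr << "setup failed: " << e.what() << '\n';
    return 1;
  }
  return 0;
}
```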
Horizontal partitioning is what people are most familiar with. If you know about distributed databases, you know about sharding; that's what horizontal partitioning is. Vertical partitioning is just the column store stuff we've already talked about. So I have a table with four attributes. I could take this attribute here and store it in a separate partition, in a separate file, on a separate disk, on a separate storage device. And what happens is, if most of my queries only need to touch the data over here in these three attributes, things are super fast because I'm reading exactly the data I need. Any time I have a query that wants to combine the two, just like in a column store, I've got to go do fetches in these separate locations and stitch everything back together into its original form. So some systems support vertical partitioning as a way to sort of approximate a column store, but it's not exactly the same, because the systems that do this don't take advantage of the fact that all the values in a column are the same type, so they don't do compression, and the query execution isn't optimized for operating on a column store. It's a halfway point that gets you some of the benefits of a column store, but it's not entirely the same thing. Again, as I said, the most common approach is horizontal partitioning. This is where we split up the table based on some attribute, some value, so that all the data for a single tuple is located together in a single partition. Now, if a query says, go get me tuple one, I can go just to this partition and get it. If another query says, go get me tuple three, I go to that partition and get it. And now I can have multiple workers running in parallel, operating on these different partitions at the same time. How you do this horizontal partitioning varies across systems; again, we'll cover more of this when we talk about distributed databases. All right, just to finish up quickly. Parallel execution is important. It's everywhere. Every single major system supports some variant of parallel execution, whether that means running multiple queries at the same time or taking one query and dividing it up. And how you divide it up could be intra-operator parallelism, inter-operator parallelism, the bushy stuff. The things that are super hard to get right are things we've covered so far in the semester and will cover more going forward: how do we coordinate multiple threads operating on the same thing at the same time without ending up with incorrect results? Again, we'll focus on this way more when we talk about transactions and concurrency control. But for read-only stuff, it's not that big of a deal. All right, any high-level questions about parallel execution? Can you explain the difference between a column store and vertical partitioning? The question is, what is the difference between a column store and vertical partitioning? It's like... not every column is connected? Yeah, but you can still do that in a column store. At a high level it's the same thing. But usually what happens with vertical partitioning is... you can do it in a row store system, but when you actually process the queries, you're not doing it in a way that's efficient for a column store.
You may still be doing the iterator model, going one tuple at a time. Or you don't compress the data, even though you could because it's all the same type of value. It just says, here are my attributes: you go here, you go there, and everything else up above still looks the same. Is it also that at this stage we don't have any information about the table itself? So whatever we're partitioning, a single column might get split across two partitions if the column has very big data in it? Your question is... repeat your question. At this point, we don't have exact information about the table while partitioning it, so it may happen that a column itself gets partitioned across multiple partitions? No, no, no. I think your question is, how do I figure out where to divide this thing up? You know this. You know what the schema is. You know, oh, I want to partition on attribute four; I know for every tuple, here's the offset for it, so I know exactly how to split it up and move it over there. You're not going to split a tuple in half accidentally. You're doing this on a per-tuple basis; you're not blindly taking a chunk of data. Think of this as contiguous memory, a contiguous page: I know how to jump to this offset for tuple one, move it over here, and do the same thing for all the other ones. Yes? Is this user-defined, or is it something the database system can do automatically? This is typically user-defined. There's no reason it couldn't be automatic, and the high-end systems have tools to help with this, but in general it's user-defined. Same thing for this one here: usually user-defined, but it doesn't have to be. Right, the midterm. Let's talk about that. That's fun, right? So who needs to take it? You. What are you going to take? The midterm exam. When? Next Wednesday, 12 o'clock, in this room. And why? This video will answer all the questions in life. Okay. So the exam will cover everything up to and including what we talked about today, query execution part two. It will not include anything on query optimization, which we talk about on Monday. If you need special accommodations, please contact me as soon as possible. Some of you have already done this, and we'll take care of you. If you go to this URL here, it will take you to the same information I'm showing here, the study guide. It will also include a practice exam with solutions that I'll upload later tonight; I haven't done that yet. Okay? All right. So what do you need to bring? You've done the homework, so you know you'll have to do some basic arithmetic, compute the math. And you're allowed to have one standard 8.5 by 11 sheet of paper of handwritten notes, double-sided. No taking the slides and shrinking them down super small; everything has to be handwritten, and again, you can use both sides. Put anything you want on it. Okay? So this list keeps expanding every year. Here's what not to bring. The first year, somebody brought a live animal. Do not do that. Last year, or two years ago, somebody brought their wet laundry. It was kind of weird. It was like, oh, you have laundry, why did you bring this? Oh, because I washed my clothes, I didn't have time to put them in the dryer, and I didn't want to leave them there before the exam. And so he starts spreading out his clothes... don't do that. Last year, this kid brought, I didn't know this existed, it's like a holy candle, but it has Jennifer Lopez on it. Don't bring that.
He wasn't even trying to light it; he just liked the smell. Don't do that. Okay. So what do you need to do? You need to understand the basics of the relational model and relational algebra. We focused on the integrity constraints: what does it mean to have a foreign key, a primary key, a secondary key? Some basic things. For SQL, we're obviously not going to ask you to write raw SQL on the exam, because that's a pain in the ass to grade. But if we show you a SQL statement, you should understand what it does and what it means. The more complex operations we care about will be the joins and the aggregations; you don't need to worry about window functions, CTEs, subqueries, things like that. For storage, we talked about replacement policies for the buffer pool manager: LRU, MRU, and CLOCK. We talked about different ways to represent the heap file on disk; this is either going to be the page directory or the linked-list approach. And then for the page layout, it could be either slotted pages or log-structured. Again, these will be high-level questions about the implications of one versus the other, not "draw me a diagram of what a log-structured page or a slotted page looks like." For hash tables, we talked about static hashing: linear probe hashing, Robin Hood hashing, and cuckoo hashing. What are the implications of these? Why is one better than another? What problem are they trying to solve? Are they better for reads or for writes? The dynamic hashing schemes, extendible hashing, linear hashing, and chained hashing, should be in there as well; I'll fix that. Why would you want to use one of these versus another? When would you want to use one in a join? When would you want to use one for an index? High-level questions about these things. We talked a lot about tree indexes, in particular the B+tree: how to do insertions, deletions, splits and merges, and the difference between a B+tree and a B-tree. Again, what are the performance implications in a disk-oriented database system where everything may not fit in memory? How to do latch crabbing and coupling, how to do traversals or scans along the leaf nodes, how to deal with deadlocks. We talked a little bit about radix trees and tries. Again, we'll ask you high-level questions, not "draw me an exact diagram of something." For sorting, we talked about the different algorithms, the two-way external merge sort and the general external merge sort. Understand the cost: if I give you some number of buffer pages and some number of data pages, what is the cost of doing that sort? For the joins, the different variants of them, nested loop join, sort-merge join, hash join: what are the costs of doing these joins? When is one better than another? What are the extreme cases where one would be better than another? How do we handle multiple keys or composite keys and do joins on those? The thing we finished up with today would be the processing models: what are the advantages and disadvantages of the iterator model versus the materialization model versus the vectorization model, top-down versus bottom-up, and what are the different approaches to parallel query execution, intra-operator parallelism, inter-operator parallelism, and bushy parallelism? Okay? Any questions about the midterm? Yes? The question is, are we responsible for any C++ code? What do you think the answer is?
Right, because how am I going to grade that? Run handwritten code from 90-something kids through a compiler? I'm not doing that. I'm not that cruel. Okay. Any other questions? Yes? Do I care about ink versus pencil? No, as long as we can read it. Yes? I'll try to put the practice exam up later tonight. Yeah. Again, what do you need to bring? Your CMU ID, a calculator if you need it, not your phone, a regular standalone calculator, and one 8.5 by 11 sheet of handwritten notes; you can use both sides. What not to bring? Live animals, candles, wet laundry. You can bring food if you want, I don't care. All right. Next class, we're going to talk about query planning and query optimization. So now we're finally putting it all together, taking a SQL query and generating one of these query plans. We'll see how to do that. Okay? All right guys, enjoy your weekend. See you on Monday. Hit it. [outro music]