For today's class, we're going to focus on how to actually execute queries in parallel in the system. Real quick, before we get to that, a reminder that two days from now, on Wednesday, October 18th, we'll have the midterm here in class, and it will take up the entire hour and 20 minutes. Then project number two will be due a week from now, on Wednesday, October 25th. Everyone should have gotten an email last night about the last two homeworks on Gradescope; they've been graded and released. All the solutions for all the homeworks that are relevant to the midterm are available on the website. And again, I'll be having office hours today if you want to come and ask questions. So again, the exam will be here. You need to bring your CMU ID, a single sheet of double-sided handwritten notes, they cannot be typed, and a calculator to do the basic logarithms you need for the join cost calculations. Any questions at a high level about the midterm? Yes? It will not cover today's lecture. It's everything up until last Wednesday, inclusive of query optimization. It must be handwritten, right, because otherwise you could just copy and paste the slides, and that sort of defeats the purpose. Anything else? Okay. So all the discussions we've had so far about join algorithms, sorting algorithms, access methods, how to do scans, and things like that have all ignored threads or multi-process execution, right? We described all these things in terms of: you have a single thread and want to scan a table, here's what it's going to need to do. Now, in your first project, when you implemented your buffer pool, you did have to worry about concurrency issues. You had to worry that if one thread is modifying a page, you want to pin it to make sure that another thread in the buffer pool manager doesn't swap it out while you're making those changes. So now we can talk about how we actually want to build a database system to support multiple threads running at the same time, updating the database, running queries, accessing things. But to help motivate this, and maybe this is sort of obvious at this point because it's 2017, I'd like to ask: why do we care about parallel execution? Why do we want to spend time talking about what we need to do inside our database system to allow it to support multiple things running at the same time? What's an obvious answer to this? Sorry, what's that, say it again? He says the emergence of distributed systems. The first distributed database system was from 1979, I wasn't even born yet, so it's not that. That's right, you get better performance, obviously. And you'll get better performance in terms of throughput, how many queries or transactions you can execute per second, but you'll also get better latency, meaning the time from when your application submits a query until the time it gets the result back; that's going to be cut down quite significantly if you can run things in parallel. We're also going to get increased availability. This will come up more when we talk about distributed systems, and I'll explain what I mean by parallel systems versus distributed systems in the next slide. But the other advantage is that the way Intel is getting better performance on their chips now is not ratcheting up the clock speed, it's giving you more cores.
So now if your database system can only operate with a single thread at a time, then whenever the latest chip comes out from Intel, every new Xeon update, you're not going to get any faster. Maybe the cache size gets a little bigger, so that might help you, but you're not taking advantage of all the extra cores the CPU is providing. And if you can take advantage of everything, then a workload that would normally have to run on a distributed database system across multiple machines can now run on a single machine, and that means the total cost of deploying a database system is much less; you're being more efficient. So this is, at a high level, what we're going to care about here. And again, to what he said, yes, total cost of ownership. Think about it this way: MySQL is free, it's open source, you can download it and use it. Is that the total cost of what it takes to actually use it? No, because you have to buy a machine to run it. That machine needs energy, and you have to pay for that. Then if it's a complex application, you have to pay for a human DBA to maintain it. All of those things you would encompass in total cost of ownership. And this is typically how people can justify paying Oracle or IBM or SQL Server a lot of money, because on those enterprise systems the software license is usually a small portion of the total cost of operating the database; it's all the other stuff that starts to add up. A hard drive dies, you've got to buy another one. There are all those sorts of things. So TCO is typically the metric an enterprise will use to decide whether they want to go with one system versus another. So just because something's open source doesn't mean it's always free, right? Okay, so, as he sort of alluded to earlier, there's this notion of parallel versus distributed database systems. And you may think that they're interchangeable. I'm here to say that they're not, and I'll explain why they're actually different. But the basic idea of this class of database systems is that we're going to spread the logical database across multiple physical resources. And I'm saying resources and not machines because it could be multiple disks, multiple processors, multiple machines, or multiple data centers. By doing this, we're going to get better parallelism and achieve all the things I just mentioned on the last slide. The main thing people always point to for why you want a parallel or distributed database system is better throughput or higher capacity. And the key thing to understand about these distributed database systems, and one of the advantages of writing your application to use SQL queries, is that the database appears as a single logical instance to the application. You can write your SQL queries against that single logical database, and your application doesn't know and doesn't need to know where the data is actually stored, what machines can execute the query, or whether the hardware is configured this way versus another way. So that means in theory you could write your query against a test database running on a single box, and then if you wanted to try it out on the production database, the live database that's scaled across a thousand machines, the same SQL query that works on one machine should also work on a thousand machines.
And it's up to the database management system to take your SQL query, figure out what data you actually need to touch, and then route the query, or the subtasks of the query, the operators, to the right location. Now the tricky thing, and we'll cover this more when we get to distributed databases, is that just because your query runs in one minute on one machine doesn't mean that when you go to two machines the time is suddenly going to be cut in half. That's certainly not the case at all. There are a lot of tricky things to deal with when you go distributed. For our purposes here, we're going to focus on running on a single box, but running things in parallel on multiple cores or multiple hard drives. So the definition I like to use to differentiate parallel database systems from distributed databases is the following. A parallel database can be comprised of multiple nodes, and I'll use the term node loosely, but typically it's something like a machine or a blade in a rack server. These nodes are going to be really close to each other, on some high-speed LAN; they're either in the same rack, or if you have one of those blade servers, they're all on the same interconnect. In this environment, the communication cost between the different nodes is assumed to be really small. Think of the smallest parallel machine you could possibly have as a two-socket motherboard: the communication between the cores running on one socket and the other is really, really fast. As opposed to, if you have one machine in Amazon's data center in Virginia and another machine in Amazon's data center on the West Coast, the communication between those is really slow. Those kinds of environments are what I typically refer to as a distributed database system. So in a distributed database environment, the communication costs and the problems of, say, unreliable communication can't be ignored in the way they can in a parallel database. What I mean by that is that it's very often the case that the network may go down when you're talking between different data centers, and then your query that was running in two different locations has to get killed because you can't communicate with the other one. Whereas in a parallel database, as we'll see when we talk about some of these multi-threaded architectures, it's very unlikely that if the CPU on one socket crashes, the system can keep running on the other socket; the whole motherboard will freak out and the OS will probably kill itself with a kernel panic. So in that environment, we assume that we can always reliably talk with the other nodes, and we can design algorithms based on that assumption. Whereas in a distributed database environment, you assume that the communication is unreliable and you have to account for that in how you build things. So again, for this class we're going to focus on parallel database systems; at the end of the semester we'll talk about distributed database systems. Most of the time when people think of a NoSQL system or a Google-scale system, they're typically talking about distributed database systems, but most modern database systems, even MySQL, SQLite, anything that runs on a single node, are considered parallel systems, because it's not a single thread doing everything. So what kind of parallelism can we have? There are essentially two categories of query parallelism that we're going to target here.
The first is going to be inter-query parallelism. Yeah, sorry, in the back. So his question is: of these two categories of systems, would one be better for OLTP and one be better for analytical workloads? I would say the answer is yes. In a parallel system, say you have a big honking server with a lot of cores and a lot of memory; if all the data you need for a transaction, and I haven't defined what a transaction is yet, but just assume it's all the data you need to perform some operation in the application, if that's on a single box, then you can run it very quickly because you don't need to communicate with other nodes. If you have a distributed system and you want to do a transaction like send money from my account to your account, and your account is on the West Coast and my account is on the East Coast, then there's a bunch of back and forth that has to happen to make sure that transaction occurs correctly. You don't want to take money out of my account and then crash and lose that money. So there's a bunch of extra stuff, which we'll talk about in two weeks in concurrency control, that you have to do to make sure that happens, and that's really bad, or slow, if you have to go over a wide-area network. There are some systems that can do transactions in a distributed environment, but it's rare, and most of the time the databases aren't that big; you can probably fit them on a single box or a small number of boxes. For really large data sets, when people talk about big data and those kinds of environments, they're typically talking about distributed databases, because the size of the database exceeds the capacity of a single node. So you combine a bunch of nodes and federate them together, and now you have a distributed database that can run analytical queries. And because analytical queries are read-only, you don't have to do a bunch of extra work to make sure that everything's going to be correct and fail-safe when you run them. So yes, typically people want to run their transactional workloads on a small number of machines, which would constitute a parallel database, and for really, really large data sets, you would go to a distributed database environment. Now of course, there are always the Google, Facebook, Amazon, Microsoft outliers, which have really, really large applications that want to do transactions and have to run them in this kind of distributed environment. So his question is: in a distributed database environment, do I mean a setup where the database is broken up into shards or partitions, smaller disjoint subsets, and each node in the cluster or the configuration has just a portion of that data? Typically, when people talk about distributed databases, that is what they're talking about, yes. Again, we'll talk more about this later in the semester. Yes, so I would say the answer is yes. His statement is that it's probably not disjoint. Typically it is, though sometimes there will be replicas and things like that. Say it again? His question is: if you make copies that live on other machines, doesn't that mean you have overlapping shards and run into some of that trouble? So by disjoint, what I mean is: say I have ten tuples and I'm going to make five partitions or shards. The first one has the first two tuples and the next one has the second two, right? There's no overlap between the two of them, because overlap would make the bookkeeping tricky, right?
So partition one will have tuples one and two, and partition two will have tuples three and four. There's not going to be a partition that has tuples two and three. That's what I mean by disjoint. But again, you can replicate those shards on different machines. I'm saying there typically isn't a partition that spans the boundaries of two other partitions. There can be, but that's complicated, okay? All right, does that answer everyone's question? So again, another thing to point out is that these are not mutually exclusive. In your distributed database system, a node in the cluster could itself be a parallel system, and typically it is, right? Because you have all these extra cores and you want to run things in parallel. So they're not mutually exclusive. I just want to highlight the point that there's a bunch of design decisions we can make if we assume our communication is going to be fail-safe and very fast, and there are other ones we have to make if we assume the machines are going to have some distance between each other and we need to account for the fact that we may lose communication. Right, so the parallelism we're going to target in this class falls into these two categories for query execution. Inter-query parallelism means that we're going to allow the database system to run multiple queries concurrently at the same time. The idea here is, if you had a single-threaded system and you executed queries one after another on a single thread, then there would be this long queue where all these queries show up, and there's one thread that picks up the first query, executes it, then goes to the next one and executes it, right? That increases your latency, because now you have to wait for whatever's in front of you to finish. But if we have all these extra cores, then any thread can be executing any of the queries from the queue, and they don't need to wait for each other. The tricky thing about this, and again we'll focus on this later when we talk about concurrency control, is what happens if some queries actually need to update the database. Figuring out what one thread is allowed to see while other threads or other queries are updating the database and making changes is a very complex subject, and we'll spend two or three weeks on that topic. That's probably my favorite thing to talk about. The other type of parallelism is intra-query parallelism. Here a single query gets broken up into subtasks, and those subtasks are executed by multiple threads or multiple processes or multiple cores in parallel at the same time. The advantage of this is that if you have a really complex OLAP query that has to touch a lot of data, instead of having a single thread do the scan in serial, sequential order, you can have different threads scan different portions at the same time, and then the database system knows how to coalesce their individual results and compute the final answer to return to the application. So for today's class, we're going to focus first on how you design a database system, at a high-level architecture level, to support parallel execution or parallel operations. Then we'll talk about executing queries in parallel, and that'll be the two categories I mentioned before. And then we'll finish off talking about how to achieve IO parallelism.
Execution parallelism is about how we run things on different cores and take advantage of the actual computational power we have in a multi-core CPU. IO parallelism is about how we can take advantage of having multiple storage devices and read data very quickly in parallel, okay? All right, so every database system is going to have what's called a process model. This is the high-level organization of how the system is going to support multiple requests from an application. I didn't define this earlier, but I'm really talking about a database management system that supports a multi-user environment. Think of MySQL and Postgres and all the other database systems: you have some database process running, and you can open up different terminals, start connecting to the database, and interact with it. This is different from something like SQLite, which is an embedded database system that has to run inside the process that is going to invoke queries on it. There is no SQLite process running on your laptop or on your phone; whatever application needs to use the database is responsible for invoking that library inside itself, so it's all in the same address space. So we're really talking about an environment where there's some always-running database management system daemon somewhere, and we connect to it from multiple terminals. In this environment, we're going to define a logical execution component of our database system that we'll call a worker. The worker is responsible for executing tasks on behalf of the client. So if I want to execute a query, we turn it into a query plan, we have a bunch of operators, and that worker knows how to execute those operators using all the algorithms we talked about before. It then produces some result and sends it back to the application, to the client that asked for it. There are essentially three different process models you can have, and I'll go through each of them one by one: a process per database worker, a process pool, and a thread per database worker. The first approach is where, inside the database system, each worker executes as a separate process. When I say process, I mean an OS process: when you do an exec or a fork, it has a PID and is its own standalone process. What happens is there's one process called the dispatcher, or in Postgres it's called the postmaster, and as connections come in, the dispatcher figures out: all right, I need to execute this query, I'm going to hand it off to a worker. Now that worker is responsible for doing whatever it needs to do to execute that query on the database and then send back the result, right? So again, Postgres does this: when you make a connection to Postgres, you always go to the postmaster first, and then it hands you off to a worker, and that worker goes back to the client and says, all right, I'm the worker for your connection, here's the socket where you can send any additional requests to me. It then executes queries on behalf of that connection and operates on the database.
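To make the process-per-worker model concrete, here's a minimal sketch of a dispatcher that forks a dedicated OS worker process for each incoming connection. This is my own illustration, not Postgres' actual postmaster code; the port number is arbitrary, error handling is omitted, and run_worker is just a placeholder for the real parse/plan/execute loop.

```cpp
// Minimal sketch of the "process per worker" model: a dispatcher process
// accepts connections and fork()s a dedicated worker process for each one.
// (Postgres' postmaster is far more involved; this only shows the shape.)
#include <netinet/in.h>
#include <sys/socket.h>
#include <sys/wait.h>
#include <unistd.h>
#include <cstring>

static void run_worker(int client_fd) {
    // Placeholder for the real work: parse the query, plan it, execute it,
    // and stream results back over client_fd.
    const char *msg = "worker ready\n";
    write(client_fd, msg, strlen(msg));
    close(client_fd);
}

int main() {
    int listen_fd = socket(AF_INET, SOCK_STREAM, 0);
    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = INADDR_ANY;
    addr.sin_port = htons(54321);          // hypothetical port
    bind(listen_fd, (sockaddr *)&addr, sizeof(addr));
    listen(listen_fd, 16);

    while (true) {
        int client_fd = accept(listen_fd, nullptr, nullptr);
        if (client_fd < 0) continue;
        pid_t pid = fork();                // one OS process per connection
        if (pid == 0) {                    // child: this is the worker
            close(listen_fd);              // worker doesn't accept new connections
            run_worker(client_fd);
            _exit(0);
        }
        close(client_fd);                  // dispatcher keeps listening
        waitpid(-1, nullptr, WNOHANG);     // reap any finished workers
    }
}
```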
So because we're in a multi-process environment, in order to allow different workers to communicate with each other, we need to open up an IPC channel, like a pipe, or use shared memory to have a global address space that the different workers can write into and communicate through. The one advantage of this is that if a worker crashes, say something funky happens in your code, because it's multi-process, the OS will kill that one worker process, but everything else can keep on running, right? The downside, I would say, and I can't prove this yet, but it's something I think is worth investigating, is what the overhead actually is of using shared memory versus a multi-threaded environment. So I already said Postgres uses this, but this process model is also available in IBM DB2 and in Oracle. Yes? So his statement, and he's correct, is that if you're using shared memory between the different processes to communicate, and one worker goes errant and has to get killed, could it potentially corrupt the shared memory data structure it was writing to at the time, and therefore take down the entire system? Yes, that can happen. It obviously depends on what the worker was doing, right? There are some database systems, I think Sybase is most famous for this, that have a fail-fast policy, where they're careful about making sure that when they crash a process, they don't take down the whole thing. Again, I don't know the full details. And obviously, you try not to ship a database system that has faulty code, right? You try to avoid that problem entirely. So this is actually how most database systems were built in the 1980s and 1990s. If you're going to build a new system today, unless you're based on Postgres, which a lot of systems are, you would not want to choose this architecture. Let me phrase this as a question: why did people choose to use this model in the 1980s and 1990s? Pthreads. Pthreads, exactly right. We have POSIX threads now, and everyone knows there's a standard API that can be used on Linux and other Unix variants. Back then, there were all these different threading packages, and they weren't portable. So if you wrote your database system to use threads on one version of Unix or BSD, it wasn't guaranteed to work anywhere else, right? So back in the day, because threading packages weren't what they are today, a lot of systems went this route. An alternative is to use a process pool. It's just like before, where all the different workers run as separate processes, but now you go to the dispatcher, and the dispatcher is the go-between for the worker pool and the application. The dispatcher gets a request and says, all right, I'm going to hand this to one of my free workers; the worker executes it, does whatever it needs to do, and then sends the result back. The dispatcher can try to be kind of smart about this in order to ensure cache locality: if the same connection sends another query, it may try to use the same worker.
So one thing I didn't talk about before, but is relevant to this and to the process-per-worker model, is that all the scheduling of which worker process is going to wake up and execute, and for how long, is handled by the operating system, right? Because these processes are just sort of running on their own. We can provide hints to the OS and say this process may have a higher priority than another, but it's still left to the OS to decide who goes next. The other issue with the process pool approach that can be bad is that unless you always guarantee you're going to get the same worker process for every request on a connection, you may end up with bad CPU cache locality, because connections are jumping around between workers. So IBM DB2 also supports this. With Postgres, this is sort of how it works, but not quite. In its latest versions, actually just in recent years, Postgres now supports parallel query execution, the kind of thing we're talking about today. It used to be that when your query showed up, the postmaster would hand you off to a worker, and that worker was responsible for doing everything for that query: the parsing, the planning, the optimizing, then executing, getting the result, and sending the result back to the application. That worker was a single thread, which means you couldn't run your query in parallel. In the latest versions of Postgres, what they're able to do is recognize that if a query could take advantage of multiple threads or multiple processes, they can try to find a free worker in the worker pool, hand off a subtask of the query to that worker, have it execute on behalf of the main worker, and then the main worker knows how to coalesce the results to produce the correct answer. So this is not quite how Postgres works, but you can also configure DB2 to work this way. The modern implementations, modern database systems, actually don't use this process model. They instead use a multi-threaded model. What happens here is that every single worker executes as a separate thread inside the main process of the database system. It's all within the same address space; you don't need to worry about shared memory, and any thread can access anything. Of course, that has some downsides when we start updating the database, but we'll cover that later. So what happens here is there's a single process. There will still be a dispatcher or a gatekeeper in the networking front end that gets the request and knows how to hand it off to different threads to do whatever needs to be done. The database management system can now do all the scheduling itself: it can control which thread to wake up and let run. Of course, the downside is that if a thread crashes, it can take down the whole thing. But as we said before, that may be a problem in the process model as well. So I'm only showing four systems here, but I would say that essentially every database management system that has come out in the last ten years is based on this model, unless of course you're building your system on top of Postgres, which a lot of them do. When we started building our new database system here at CMU, we actually did what everyone else does: we took Postgres and hacked it up to make it do what we wanted.
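And here's a rough sketch of the thread-per-worker alternative under the same caveats: one process, a fixed pool of worker threads, and a shared queue of requests standing in for real client connections. The pool size and the queries are made up; the point is only that everything lives in one address space and the DBMS itself decides which worker picks up the next request.

```cpp
// Minimal sketch of the "thread per worker" model: one process, a fixed pool
// of worker threads, and a shared queue of incoming requests. The DBMS (not
// the OS) decides how many workers exist and what each one picks up next.
#include <condition_variable>
#include <iostream>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <vector>

class WorkerPool {
public:
    explicit WorkerPool(int n) {
        for (int i = 0; i < n; i++)
            workers_.emplace_back([this, i] { RunWorker(i); });
    }
    void Submit(std::string query) {
        { std::lock_guard<std::mutex> lk(mu_); queue_.push(std::move(query)); }
        cv_.notify_one();
    }
    void Shutdown() {
        { std::lock_guard<std::mutex> lk(mu_); done_ = true; }
        cv_.notify_all();
        for (auto &t : workers_) t.join();
    }
private:
    void RunWorker(int id) {
        while (true) {
            std::unique_lock<std::mutex> lk(mu_);
            cv_.wait(lk, [this] { return done_ || !queue_.empty(); });
            if (queue_.empty()) return;           // shutting down, nothing left
            std::string q = std::move(queue_.front());
            queue_.pop();
            lk.unlock();
            // Placeholder: parse/plan/execute the query, all in-process.
            std::cout << "worker " << id << " executing: " << q << "\n";
        }
    }
    std::mutex mu_;
    std::condition_variable cv_;
    std::queue<std::string> queue_;
    std::vector<std::thread> workers_;
    bool done_ = false;
};

int main() {
    WorkerPool pool(4);                       // hypothetical pool size
    pool.Submit("SELECT * FROM A");
    pool.Submit("SELECT * FROM B");
    pool.Shutdown();
}
```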
And we actually spent time getting rid of the per-process execution model and the shared memory, and switching everything over to a multi-threaded architecture. It was a lot of work, and we ended up throwing all that code away. But again, it's my opinion that this is the better approach; I don't have scientific evidence to say this is the right way to go, but this is typically what everyone does. The one thing I'll also point out here is that in all three process models, I showed IBM DB2. The reason is that IBM DB2 has to run not only on Linux and Windows on standard Xeon processors, or on the Power machines they sell, but also on their old-school mainframes running z/OS. In that environment, you have all sorts of things to deal with that you don't have in Linux, and they need to support any possible configuration you might be trying to run DB2 on. So of all of these, DB2 is in my opinion the most flexible; it can support all the different process models because it needs to run on mainframes as well as standard commodity hardware. Okay, so this sort of summarizes everything from before. The multi-threaded architecture has several advantages over the other ones. The key one, from a software engineering standpoint, is that we don't have to manage shared memory. I suspect shared memory would also be a performance bottleneck in the OS, but we haven't run that experiment. You also have less overhead doing a context switch, because a context switch between different processes is much more expensive for the OS than a context switch between threads within the same process. But I'll also point out that regardless of which of the three process or worker models I showed before you end up choosing, it doesn't mean you can't do all the parallel execution stuff we'll talk about here, right? It's just that the way you communicate between the different processes or threads for the different workers will be different. And just because you have a multi-threaded architecture doesn't mean you're going to get intra-query parallelism. In fact, MySQL works this way too, but MySQL can only execute each query with a single thread, even though there are multiple threads available to it. I haven't checked whether that's been fixed or changed in MySQL 8, but as of MySQL 5.7, they can't do parallel query execution in the way we're talking about here. Now, I've sort of glossed over this issue, I've mentioned it a little bit, but when it comes time to actually schedule the different workers to execute a query, there's a whole bunch of other questions we have to deal with beyond just: I have these tasks, I have these threads, I have these processes, go let them run. There's a whole bunch of extra stuff we have to figure out in our query optimizer or query planner about what level of parallelism we want when we execute a query, right? How many tasks or subtasks should we break our query plan up into? How many cores should be allowed to run for that query at the same time? If you're in an in-memory environment, or if you have a NUMA architecture where you have multiple sockets, then you can also make decisions about which task should run on which socket, and you want to do this to minimize the communication between the two sockets.
And then of course, when you execute a subtask in the query plan, it produces some intermediate result that you need to send to the next operator; where should you actually store that? So again, this is another example where the database management system is always in the best position to make these kinds of decisions. The OS can't figure these things out for us, right? The OS is just going to say, all right, here's my task, let's go run. Since we know what the query plan is, we know what we're trying to do, we know what our data looks like, and we know what resources are available to us, we can always make better decisions about these sorts of things, right? So the commercial systems expose different knobs and configuration parameters to you as the database administrator to tune all these various aspects of scheduling, right? It's not just simply letting the threads run however they want. You can set up things like transaction priorities, so that a query that's more important gets more cores or more CPU than a query that's not as important, right? All these things the commercial guys will expose to you, but we're not really going to cover that. All right, so any questions about process models? So now, regardless of what process model we're using, we can start to talk about how we want to execute queries in parallel, or get parallel query execution. As I said in the beginning, there are two categories of parallelism we're going to go for. The first is inter-query parallelism, where the database management system executes multiple queries simultaneously rather than having one core execute them in sequential order. Now, if everything's read-only, this is super easy to do, because the queries just run and don't interfere with each other. Maybe you can be careful about making sure that if you have two queries that execute the same scan, you do the scan sharing we talked about before, or maybe put them on the same socket so that you get better cache locality. All of those things are still relevant, but it's not as hard as dealing with multiple cores or multiple queries updating the database at the same time, because then the question is, again, what should one thread be allowed to see from another thread? Or what modifications from another worker running at the same time should one worker be allowed to see? All of this falls under the purview of what's called concurrency control. And again, that's a very, very hard topic. It's been studied for a long time. Not only is it hard to get it to run correctly, it's also very hard to get it to run fast. We'll start talking about this, I think, next week after the midterm. And again, the bulk of my research for several years has revolved around this. So I want to punt on this problem for now, but just keep in the back of your mind as we go along that the challenges here are easy if everything is read-only, and harder if you're updating things. All right, what I want to spend more time on is intra-query parallelism. The idea here is that we have a single query, ignoring other queries running at the same time, and we want to execute its operations in parallel. Instead of having one thread do all the work, we want to break it up across multiple threads. Within intra-query parallelism, we can have two approaches.
We can have intra-operator parallelism and inter-operator parallelism. These techniques are not mutually exclusive, meaning we can have any combination of the two together. But the spoiler is that most database systems implement the first one, and the second one is typically used in another class of systems that we'll talk about in a few slides. I'm also not going to talk about, in this course, the algorithms that let you execute the relational operators we've covered so far in parallel. Suffice it to say that external merge sort, hash join, sort-merge join, sequential scans, index scans, all those things we talked about before, have parallel versions. We're not going to focus on that in this course, but we'll cover it more in the advanced class in the spring. All right, so intra-operator parallelism is sometimes called horizontal parallelism. What we're going to do is take a query plan and decompose its operators into individual instances that execute the same function as the original operator, but on subsets of the data we're trying to access or operate on, producing subsets of the intermediate results. The way we're going to do this, and I'll show in the next slide what I mean, is that we're going to introduce a new operator into our query plan called the exchange operator. This has nothing to do with relational algebra; it wasn't defined in the original paper from the 1970s. It's something you add to what's called the physical plan of the query, where we actually define what algorithms we're going to use to execute the query plan. The way the exchange operator works, you can think of it as a barrier, a location where all your child operators send their results. The exchange operator knows that it can't send results to the next operator in the query plan until it gets all the results from its children. Again, it's an artificial barrier within the query plan that doesn't actually affect what the final result is; it's something you introduce to do parallel execution. So let's look at a simple example. We have a join query here: we're going to join A and B on their ID fields, and we have two filters in our WHERE clause, one on a.value and one on b.value. The logical plan, after doing predicate pushdown, essentially looks like this, right? We could execute this on a single thread by doing the scan on A, then building the hash table for the join, then doing the scan on B, applying the filter, and then probing the hash table, right? We could have a single thread do each of these steps one by one. But if we want to parallelize this, the first thing we're going to do is target this first scan operator here. Now let's say we can break up table A into three disjoint subsets, or three disjoint chunks. We called these partitions or shards earlier; that's typically the term people use in a distributed system, but for our purposes it's fine. The way to think about this is that this is the logical plan, and what I'm going to show over here is the physical plan. This is what the database system is actually going to execute.
So for each of these scans, I'm going to assign them to a distinct core in our system. Now we can execute all of them in parallel, right? One core does the scan on A1, the next one does the scan on A2, and the third one does the scan on A3. We also see that we have this filter here, so we can have the threads immediately pipeline their data from the scan into the filter operator and produce those results, right? Then we want to go up and do the join. We know we're going to do a hash join here, so all the threads now take the output of the filter and start building their hash table. This is where the exchange operator comes in, because the exchange operator is basically going to say: I can't start operating on my join until I get all the results from the individual subset scans on A, right? And typically you don't assign a core to the exchange itself; it's just a barrier, and the system knows it can't proceed up the query plan until all three of its children have produced whatever they need to produce. Now I can do the same thing on B. Let's say B is a smaller table and it only has two chunks, so we'll assign two cores to the scan, then the filter, then building the hash table, and it also has an exchange. Once we know that the exchange on the left and the exchange on the right have both been satisfied, meaning we've gotten all the data we needed from our children, then we can actually do the join and produce a result. And it's the same thing again above: we do the join, we probe the hash table, we do our projection, and we push that up to the final operator. So what I'm showing here is that the join operator can also be done in parallel: we assign different threads to execute the probe side of the join and produce our final result, and they all get fed into a single exchange operator where you coalesce the results and produce the final answer, right? The exchange at the top is basically saying: I can't produce the final result for the client until all my children have finished. Yes? So his question, let me rephrase it as a question, is: where is the exchange operator actually running? Typically the way this works is that there's a counting barrier inside the exchange operator. When a child operator completes whatever it needs to do, it says, I've done everything I needed to do, I'm not going to generate any more tuples for you, and it decrements that counter. Once that counter reaches zero, whatever thread last updated it can then update whatever else needs to be updated to say: go ahead and execute this, right? So typically you don't have a thread polling, asking, did I get everything? Did I get everything? It's the last one to finish that pokes whatever else needs to happen after that, right? And in this example, what I'm showing is also a partitioned hash join, which is sort of like the Grace hash join we talked about before, where we break things up into buckets. For each bucket of the hash table, you compare the bucket on the left with the bucket on the right. You can think of the query plan doing the same thing here.
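Here's a small sketch of the counting-barrier behavior just described, with my own made-up table partitions and filter: each partition scan runs on its own thread, deposits its filtered tuples into a shared buffer, and whichever scan finishes last is the one that releases the next stage of the plan. It's an illustration of the idea, not code from any real engine.

```cpp
// Sketch of intra-operator parallelism with an exchange operator implemented
// as a counting barrier: N partition scans run in parallel, and whichever
// worker decrements the counter to zero "pokes" the next stage of the plan.
#include <atomic>
#include <iostream>
#include <mutex>
#include <thread>
#include <vector>

int main() {
    // Hypothetical table A split into three horizontal partitions.
    std::vector<std::vector<int>> partitions = {
        {1, 5, 120, 7}, {200, 3, 150}, {90, 101, 2, 300}};

    std::vector<int> exchange_buffer;          // where children deposit results
    std::mutex buffer_mu;
    std::atomic<int> remaining{(int)partitions.size()};

    auto run_next_stage = [&] {
        // Placeholder for the parent operator (e.g., building the join's
        // hash table). It only runs once all scans have finished.
        std::cout << "exchange released: " << exchange_buffer.size()
                  << " tuples ready for the join\n";
    };

    std::vector<std::thread> scans;
    for (auto &part : partitions) {
        scans.emplace_back([&, part] {
            for (int val : part) {
                if (val > 100) {               // made-up filter: A.value > 100
                    std::lock_guard<std::mutex> lk(buffer_mu);
                    exchange_buffer.push_back(val);
                }
            }
            // This scan is done producing tuples; the last one to finish
            // triggers whatever comes next in the plan.
            if (remaining.fetch_sub(1) == 1) run_next_stage();
        });
    }
    for (auto &t : scans) t.join();
}
```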
Say that we produce four buckets in our hash table; each thread will just do the join between a corresponding pair of buckets, and you don't need to communicate between the different buckets because the pre-partitioning has already been done ahead of time. So his question is: why is this named exchange? I think of it like an exchange in the sense of a central location where you coalesce results. If you called it aggregation, that would be confusing, because there's also an aggregation operator for doing aggregations in SQL. The exchange operator was proposed in the Volcano paper from the late 80s, early 90s. Remember we talked about the iterator model before, and I said that it's also sometimes called the Volcano model; the Volcano model also includes the definition of this exchange operator. What that paper originally did was codify exactly how you want to do this, right? People had done parallel query execution before, but the author laid out and wrote down everything you need to know in order to do something like this. I'll also say that this will come up later when we do distributed databases, because this is essentially how they work as well. And again, there's no direct mapping of the exchange operator from relational algebra, right? It's something artificial that you introduce in your physical query plan to move data around and make sure you have everything you need before you go to the next step. Okay, so the other type of intra-query parallelism is called inter-operator parallelism. This is sometimes called vertical parallelism, or I think in your textbook they call it pipeline parallelism. The basic idea is that we're going to allow the different operators in our query plan to execute simultaneously, which lets us pipeline the output of one operator into the next operator and immediately do whatever processing we need to do on it. I think, again, the visualization is the best way to see this, right? Let's just focus on the join here. Ignoring exchange operators and ignoring running this join in parallel, let's assume we're going to assign a single core to execute this join operator. And to keep it really simple, we're just doing a nested loop join. This core is going to spin through and do the join between the outer table and the inner table, and every time it finds a match, it sends it up to the next operator, the projection. This projection can be running on another thread or another core, and it's just blocked on this input queue from its child. Every single time it gets a new result, it immediately does whatever it needs to do and then spits it up to the next part of the query plan. The way to think about this is that you have all these threads running simultaneously, and they're just pulling from their input queues for results, and every time they get something, they immediately operate on it, and then they go back and block on the queue again, right? So in this case, this top operator, every single time something comes in, it does whatever it needs to do and then comes back and waits for the next thing. And then there's some control message, some coordination, you need so that when the join actually finishes, you notify the operator above and say: all right, you've got everything, there's nothing else coming, so go ahead and clean yourself up and finish.
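As a sketch of this pipelined, inter-operator style, here's a toy producer/consumer pair: one thread plays the join and pushes tuples into a queue, another thread plays the projection and blocks on that queue, and a sentinel value stands in for the "nothing else is coming" control message. The tuples and the operator logic are invented for illustration.

```cpp
// Sketch of inter-operator (pipeline) parallelism: a producer operator pushes
// tuples into a queue, and a downstream operator thread blocks on that queue,
// processing each tuple as soon as it arrives. A sentinel marks end-of-stream
// so the consumer knows to clean up and finish.
#include <condition_variable>
#include <iostream>
#include <mutex>
#include <optional>
#include <queue>
#include <thread>

struct Channel {
    void Push(std::optional<int> v) {
        { std::lock_guard<std::mutex> lk(mu_); q_.push(v); }
        cv_.notify_one();
    }
    std::optional<int> Pop() {
        std::unique_lock<std::mutex> lk(mu_);
        cv_.wait(lk, [this] { return !q_.empty(); });
        auto v = q_.front(); q_.pop();
        return v;
    }
    std::mutex mu_;
    std::condition_variable cv_;
    std::queue<std::optional<int>> q_;   // std::nullopt == end of stream
};

int main() {
    Channel ch;

    // "Join" thread: emits matching tuples one at a time (made-up values).
    std::thread join_op([&] {
        for (int id : {3, 7, 42}) ch.Push(id);
        ch.Push(std::nullopt);           // control message: nothing else coming
    });

    // "Projection" thread: blocks on the queue, handles each tuple immediately.
    std::thread projection_op([&] {
        while (auto tuple = ch.Pop())
            std::cout << "projected tuple with id " << *tuple << "\n";
    });

    join_op.join();
    projection_op.join();
}
```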
So as far as I know, and I might be wrong about this, I don't think this is actually used in any traditional relational database management system, traditional meaning a system like Oracle, MySQL, or Postgres, where you submit a SQL query and it generates the query plan and executes the operators. I poked around, and I may be wrong, but I haven't seen anybody actually implement this. The reason is that not all operators can emit results until they get all the input from their children. For a hash join, you can't start probing until you've already built the hash table, because otherwise you may get a false negative. So it may be the case that in our query plan here, say we have different cores doing the scan on A and the scan on B, and a core at the top doing the projection; that core just sits there waiting until the join actually starts producing anything, so you don't really get any benefit from it. And again, there's nothing preventing us from running the filter in parallel; it's just that we don't have a thread dedicated to it, sitting there pulling from a queue waiting for input. When we know the join is producing input, then we go ahead and get it. Another way to think about this: before, when we talked about the iterator model, we talked about calling next, next, next, next down the plan, and that was a blocking model where you pass control of the thread or worker executing it from one operator to the next. In this case, it's more like every operator has some worker that's always calling next and waiting. Where this approach actually shows up is in what I'd call stream processing systems, or complex event processing systems; these are sometimes called continuous query systems. These are systems where you have an input stream from an outside data source, you're always getting new tuples, new data, and you want to process them in an online manner and produce results. Think of it like this: say you just want to count the number of tweets you see. Then the input stream is all the tweets, and every single time you see a new one, you add one to a counter, and maybe you aggregate across different time windows. In that environment, people set up these complex pipelines where you have different input streams coming along, and each operator does some kind of manipulation on that data and feeds it into the next operator. That's where you typically see this approach. Spark Streaming does this, and there are a whole bunch of other Apache projects, NiFi, Kafka, Storm, and Flink, that all fall into the category of stream processing systems that do this kind of thing, okay? All right, so one observation about everything we've talked about so far: I've only talked about how to execute queries in parallel at a computational level, having different CPU cores run different parts of the query plan at the same time. All of this doesn't matter if the disk is always going to be super slow. Right, I could have as many cores as I want trying to process the query's different tasks in parallel, but if they're all going to the same spinning-disk hard drive that can barely squeak out any data, then all those CPU cores are going to be blocked waiting for disk IO.
So the query parallelism stuff we talked about before is really great if everything's in memory, but not so great if you need to get things from disk. And actually, you can make performance worse by overwhelming the disk: if you have different cores or different workers executing different parts of scans, all trying to get data at the same time, then the disk arm is jumping around to different parts of the platter, everything becomes random access, and that makes things much worse. The way we solve this is through IO parallelism. The idea here is that we're going to take our database management system installation, and I chose those words very carefully: I'm saying database system installation and not database, because the database system can be comprised of multiple databases. We're going to take our installation and break up its storage across multiple devices, and there are a bunch of different ways we can do this, all right? I'll go through each of them one by one. The easiest way to get multi-disk parallelism, or IO parallelism, is just to configure either the OS or the actual storage device to have multiple drives or multiple hardware units. Typically the way people do this now is with a storage appliance, think of a box you buy that's dedicated to being a file system machine, like a NAS or a SAN; inside that appliance you have multiple disk drives and it can run things in parallel. You can also do this within a single machine using what's called RAID. Who here has heard of RAID? Most of you, okay, cool. There are a bunch of RAID categories. RAID was actually co-invented by Garth Gibson, who's a professor here in CSD, back when he was a grad student at Berkeley in the 1980s. RAID stands for Redundant Array of Inexpensive Disks. I think the I actually means something else now; back when they came up with RAID it was inexpensive disks, and then all the hardware manufacturers complained and said, don't make it sound inexpensive, because then people think we're overcharging. So what does it mean now? I forget. Independent, thank you, yes. So now it's a Redundant Array of Independent Disks; back then it was inexpensive. The basic idea is that you're going to have multiple disk drives, it could be SSDs, it could be spinning-disk hard drives, it doesn't matter, and there are a bunch of different configurations for how you set up storing pages or blocks across these devices. The first approach is called RAID 0, which is striping. The idea is basically that, say I have six distinct pages, each page gets placed on one of the drives, striped across them. So now let's say I do a sequential scan on my table and I need to read pages one, two, and three. I can issue multiple simultaneous requests to the storage layer, and the different drives can go read them in parallel because they're independent, and then shove them back up to the database management system, right? Another approach is RAID level 1, which is mirroring. Here I only have two pages, and every drive has an exact copy of each page. There are a whole bunch of other configurations that do parity blocks and other fault-tolerance schemes; RAID 0 and RAID 1 are the simplest, and there are more complex ones at the higher levels.
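To make the striping-versus-mirroring distinction concrete, here's a toy illustration (my own, not how any RAID controller actually works) of where a logical page ends up under each scheme, assuming a hypothetical three-drive array:

```cpp
// Toy illustration of RAID 0 (striping) vs. RAID 1 (mirroring) placement.
// With striping, logical page i lives on exactly one of the N drives, so
// reads of pages 1, 2, 3 can be serviced by different drives in parallel.
// With mirroring, every drive holds a full copy of every page.
#include <cstdio>

int main() {
    const int kNumDrives = 3;                 // hypothetical array size

    // RAID 0: round-robin striping of pages across drives.
    for (int page = 1; page <= 6; page++) {
        int drive = page % kNumDrives;        // which drive holds this page
        std::printf("RAID 0: page %d -> drive %d\n", page, drive);
    }

    // RAID 1: a write of any page goes to every drive; a read can use any one.
    for (int page = 1; page <= 2; page++)
        for (int drive = 0; drive < kNumDrives; drive++)
            std::printf("RAID 1: page %d replicated on drive %d\n", page, drive);
}
```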
But the key thing to point out about this is that all of it is transparent to the database management system, meaning we don't have to modify any part of our code to take advantage of it. The database system just thinks it has a faster storage device, right? You can read and write data much, much faster. And depending on whether you're doing mirroring or striping, some configurations might be better for OLTP workloads and some might be better for OLAP workloads, but in the end we don't have to modify anything in our system. So that's kind of nice. The next approach is what's called database partitioning. This is basically where we take an entire database and assign it to one storage device, and take another database and assign it to another storage device. And again, the database system may not necessarily know that it's dealing with different storage devices, because you can do all of this at the file system level without having to manage anything in the database system. Now, the commercial systems will support this at the database system level: they know exactly which storage device is storing which database, right? In something like MySQL, you can just go into the data directory, go into each database's directory, and make symlinks to different storage devices, and MySQL doesn't know and doesn't care, right? The tricky thing about this is that the log file is sometimes shared across multiple databases, so you can't shard or partition that per database, but you could put it on a separate hardware device as well. The last approach, and actually what he was referring to in the very beginning, is what's called partitioning. If you come from the NoSQL world, sometimes they call this sharding; it essentially means the same thing. The idea here is that we take a single logical table in a database and break it up into disjoint subsets that are each managed and stored separately from each other. Whether that's on different storage devices, or in different table heaps or buffer pools or whatever, doesn't matter; it's up to the database to decide how to do this. Ideally, partitioning at this level, table partitioning, should be completely transparent to the application. In the last two examples, when I talked about multi-disk for the database installation or putting different databases on different storage devices, all of that is transparent to the application: I write a SQL query, I don't know where the data is stored, and I don't need to do anything extra. With table partitioning, ideally it would be nice to have the application not know anything about partitions, but that's not always the case. A lot of times, when people roll their own distributed version of MySQL, they have to write their own middleware layer or router to say: all right, these partitions are at this node, go there. And then if you move things around, you have to change that routing information. So not all distributed or parallel databases can support this transparently, but typically the major vendors can. So there are two classes of partitioning. The first is vertical partitioning, where we split up the table so that some set of the attributes is stored in one partition and another set of the attributes is stored in another partition.
So let's say that we have a really simple table like this. It has four attributes, and say this last attribute here, attribute four, is a really, really large text field. That means that if I have a query that only wants the first three attributes and ignores the fourth one, and I'm using a row-based storage model, the NSM storage model, then I have to go fetch the entire page and get all that data. Now, you can store large attributes on a separate overflow page, but for now let's ignore that. This would be bad because I have to go get that fourth attribute every single time even though I'm not actually going to use it in my queries, and that's going to take up memory and slow things down. So what I can do is take this attribute, move it over here, and store it as a separate partition. Now, any time I have a query that only needs to access attributes one, two, and three, it can go directly to this partition, and if and when I actually need attribute four, I know how to jump over to the other partition and get it from there. And again, I don't have to change anything in my SQL query; the database management system knows that these things are split up this way and knows how to go get the data from the other partition. Yes? Sorry, let me fix the recording device real quick, because otherwise people are going to complain on the internet. I think it's okay. All right, sorry, repeat your question. So his question is: are there algorithms to decide which attributes to partition on? Yes, there are a ton of them. I've written one; my advisor wrote one in 1979. There are a ton of different approaches. Another way to think about this: what does what I'm describing here look like? We've already talked about it before. A column store, exactly, yes. In a column store, typically they'll store every column as its own partition. Usually the commercial systems will support something like this, done at the database system level, and they provide tools where you feed in a workload trace and they help you make decisions about where to do this partitioning. In other systems, this is usually done manually, and often at the application level. This is actually what Wikipedia does. The way Wikipedia stores the revisions for every article, they partition them between different tables: one table has the metadata about the revision, and another table has the actual text of the revision. That's essentially doing the same thing, but now in your application code you have to write a query that knows how to join them together if you ever need to get the text, right? With what I'm describing here, you wouldn't have to do that: everything is done for you internally, and the same queries can operate without worrying about whether the data is here versus there. So, to the original question: the commercial systems have tools, the open-source ones don't, and not all open-source systems can support this kind of partitioning at all. Okay, what is most common, and what most people think about when they say partitioning, is called horizontal partitioning. And again, when people refer to sharding in a NoSQL system, they're talking about exactly this.
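As a preview of the horizontal-partitioning example that follows, here's a minimal sketch of the most common scheme, hashing: hash the partitioning attribute and mod by the number of partitions, using the same function both to place tuples and to route a query. The partition count, the tuples, and the lookup key here are all made up.

```cpp
// Sketch of hash-based horizontal partitioning (sharding): pick a partitioning
// attribute, hash it, mod by the partition count. The same function is used
// both when loading tuples and when routing a query that filters on that key.
#include <functional>
#include <iostream>
#include <string>
#include <vector>

struct Tuple { int id; std::string payload; };

int PartitionFor(int key, int num_partitions) {
    return std::hash<int>{}(key) % num_partitions;
}

int main() {
    const int kNumPartitions = 2;                       // hypothetical
    std::vector<std::vector<Tuple>> partitions(kNumPartitions);

    // Load: each tuple goes to exactly one (disjoint) partition.
    for (int id = 1; id <= 4; id++)
        partitions[PartitionFor(id, kNumPartitions)].push_back(
            {id, "row " + std::to_string(id)});

    // Route: a query like "WHERE id = 3" only needs to touch one partition.
    int wanted = 3;
    for (const auto &t : partitions[PartitionFor(wanted, kNumPartitions)])
        if (t.id == wanted)
            std::cout << "found tuple " << t.id << " in partition "
                      << PartitionFor(wanted, kNumPartitions) << "\n";
}
```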
Okay, what is most common, and what most people mean when they say partitioning, is horizontal partitioning. And when people refer to sharding in a NoSQL system, this is exactly what they're talking about. Here, rather than storing one subset of the attributes in one location and another subset somewhere else, we keep all the attributes of each tuple together, but we split the tuples up so that some partitions hold some tuples and other partitions hold the rest. Say I have four tuples in my table and I just split it down the middle: partition one gets tuples one and two, and partition two gets tuples three and four. Now, up above in my database system, when a query comes in I have to figure out that it needs to touch tuple three, so I know to route that query to run on partition two.

There's a variety of approaches for deciding how to split a table into these horizontal partitions. Hashing is probably the most common: you pick some attribute, hash its value, mod by the number of partitions you have, and that tells you where the tuple goes. I'll show a tiny sketch of that routing in a second. You can also do range partitioning, where each partition covers an interval of values, or predicate partitioning if you want to route on a more complex predicate. There are a ton of techniques, and a ton of papers going back to the 1970s on how to do this. Again, the commercial vendors have tools to help with this; with the open source systems it's a manual process. We'll cover this in way more detail when we talk about distributed databases, because this is essentially what he was referring to at the beginning: when people say partition, this is typically what they mean. If you're doing a large table scan, you can run the scans in parallel on the different partitions. If you're running a transaction that updates the database, you can route it to one partition and not worry about coordinating with the others. There's a whole bunch of advantages to doing this in a distributed or parallel system, but it also brings up a whole bunch of complications, because if you ever have to communicate changes between partitions, you need to make that fail-safe. We'll worry about that later in the semester when we talk about distributed databases. But this is what I was talking about before when I showed the exchange operator and said table A would be broken up into three chunks: I meant these horizontal partitions. One scan operator instance scans partition one, another thread scans partition two, and the exchange operator coalesces the results.
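And here is a minimal sketch of the routing decision for hash partitioning: hash the partitioning attribute and mod by the number of partitions. The same computation decides where a tuple is stored on insert and which partition a point query on that attribute gets sent to. The key type and the partition count are made up for the example; range partitioning would instead compare the key against each partition's boundary values.

```cpp
// Minimal sketch of hash-partition routing: partition = hash(key) % N.
#include <cstddef>
#include <functional>
#include <iostream>
#include <string>

constexpr std::size_t kNumPartitions = 2;

std::size_t RouteToPartition(const std::string &key) {
  return std::hash<std::string>{}(key) % kNumPartitions;
}

int main() {
  // On insert, the tuple is stored in the partition its key hashes to; a
  // query with "WHERE key = ..." only needs to visit that one partition
  // instead of scanning all of them.
  const std::string keys[] = {"tuple1", "tuple2", "tuple3", "tuple4"};
  for (const auto &key : keys) {
    std::cout << key << " -> partition " << RouteToPartition(key) << "\n";
  }
  return 0;
}
```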
Okay, so to finish up: parallel execution is important, and parallel execution is hard, but almost every database system you can think of supports some variant of it. At the very least they'll support interquery parallelism, where multiple queries run at the same time, and the more sophisticated systems also support intraquery parallelism, where we take a single query, break it up into subtasks, and run those at the same time. And as I've been saying throughout this lecture, there's a whole bunch of other things we have to deal with to get this right. How do we coordinate things? How do we schedule the threads? How do we deal with threads updating the same tables at the same time? How do we deal with one thread wanting to grab a ton of memory when another thread needs memory and there isn't any more? There's a whole bunch of those issues that we're not going to cover in this course, but we will cover in the advanced course.

One thing to be mindful of for project three in this class, when you end up building a lock manager or a concurrency control algorithm for your storage manager in SQLite: SQLite is multi-threaded, so it supports interquery parallelism, but you can't exercise that for your virtual tables through the SQLite command-line shell. We'll provide an application, a test case, that spawns multiple threads and modifies your virtual table in parallel, but you can't test it in parallel from the command line, because there's no way to write SQL at the command line that invokes multiple threads. The other thing to be mindful of about SQLite is that it allows interquery parallelism for reads but not for writes: multiple threads can run read-only queries in parallel, but only one thread can write to the database at a time. They do this because it makes concurrency control, which is what we'll talk about next week, way, way easier, and it makes the system more portable. Okay? Any questions about parallel execution?

All right, so next class is the midterm. And I'm serious; one year we had someone say, "Oh, I thought you were joking about the midterm." No, it really is next class, so please come. I also have copies of the practice exam with solutions, so if you didn't get one, please come see me right afterwards. Then on Monday next week we'll have what I'll call a potpourri session: a whole bunch of interesting topics about things you need to have in a database system that's actually usable by people. I don't know that it's all covered in the textbook, but there's a whole bunch of things I want to cover. They're a bit of a hodgepodge of topics that maybe aren't connected by a single theme, but they're things you need if you want to build a modern system. So we'll do stored procedures, UDFs, or user-defined functions, user-defined types, triggers, views, and materialized views. That'll be Monday next week. Any questions? All right guys, see you on Wednesday.