As I mentioned on Thursday, I'm not going to be on campus this week before the midterm. As Mushi said, I'm out here in California dealing with the intervention for our friend Fat Faced Rick. He's somebody I used to know back in the day from slinging rocks, and Mushi knows him from the trap circuit. It's a mess out here. Who makes an intervention open bar? This is really stupid and crazy, so they might need to call me to deal with something out there in a second, but I just want to get through this lecture for you guys really quickly. So coming up on the docket: obviously this Thursday is the midterm exam. Again, that'll be in our main lecture hall at the regular time, and everything's posted on Piazza with the practice exam and the study guide. The first project checkpoint is due tomorrow, Tuesday, October 11th, and then checkpoint two will be due the first Wednesday after fall break when everyone gets back. So again, post on Piazza if you have any questions about any of these. Last class, we were talking about query execution. We talked about how we can take these different operators that implement the algorithms we've covered so far, convert them into a high-level logical plan, and then convert that into a physical plan that specifies the exact implementation of each operator, like what join algorithm we're going to use, what sorting algorithm we're going to use. And up until now, the entire semester, we've been assuming that queries are executed by a single worker, a single thread or process, right? There's one execution entity that is processing tuples. So in today's class, we're going to talk about how we actually execute queries in parallel using multiple workers at the same time within one single database instance. And it probably seems obvious why we want to do this, but just to make it concrete and understand why we're spending this time on parallel execution, I'm going to go over three main points of why this all matters. The first is that we can potentially get better performance out of our database system for the same amount of hardware that we have to procure or deploy. Nowadays you really can't buy single-core CPUs, especially at the server grade; they don't make single-threaded, single-core CPUs anymore. The way Intel gets better performance, in addition to SIMD and some of the other acceleration stuff we talked about before, is to give you multiple cores on a single socket. And for Intel, for most of their CPUs, the cores are all homogeneous, meaning one core has the same computational capacity as another. The M1 stuff from Apple certainly has heterogeneous cores on the CPU, where some cores are fast and some cores are slow; that's sort of where things are going next. But in general, they're just giving you more cores. And so if we can take advantage of these cores that they provide on the CPU, then we're going to potentially get higher throughput: we can run more queries and more transactions, and complete more of them within a certain amount of time. And we'll get lower latency, meaning the time from when a query shows up in the system to when it produces a result gets reduced as well. And as part of this, you're going to get increased responsiveness in the system.
Because now when we have multiple threads or multiple workers running at the same time, when, say, one worker stalls because it has to go to disk and get some page, and that's a really slow operation, another worker can keep running and process another query that has its stuff in memory. And so the system doesn't look like it stalls all the time and seem really sluggish. Related to all of this is that your TCO, or Total Cost of Ownership, will get reduced. Again, because we can take full advantage of the hardware that we're given, the CPU or even the disk, we'll need fewer machines, fewer parts to put in those machines and maintain, and fewer people to maintain them. Although if you're running in the cloud, Amazon, Google, and Microsoft handle that for you, but even then you pay them money to do it. And of course, all of this means you're potentially going to use less energy, because every machine has to maintain its own RAM and its own disk, and all of that takes power. But if I can have my system run really, really well on a single box, then maybe I don't need to spread out across multiple machines. Now, we'll get into a discussion later in the semester of cases where you do want to spread across multiple machines, and that will be called a distributed database. We may want to do that for performance reasons, availability reasons, reliability reasons, but for our purposes today we can ignore that. So I want to make the distinction now between parallel databases and distributed databases. For the most part, this is my definition, and I'd like to use it to compartmentalize what we're talking about today versus what I mean when we talk about distributed databases near Thanksgiving. The idea at a high level is the same: we want to have multiple workers running in our database system to process queries and run transactions. But for today's class, we are only going to focus on having multiple workers within a single server instance, right? Now, for both distributed databases and parallel databases, although there may be multiple workers, to the application it's going to appear as a single logical database. And this is that logical versus physical data independence that we talked about before, where I can write one single query and I don't care whether the data system is a column store or a row store; it's going to figure out the best way to execute that query and organize my data. In the same way, we want the application to be unaware that the data may be split across multiple partitions or multiple servers, so that when I scale the system up vertically or horizontally, a bigger machine or more machines, I don't have to rewrite my application. That's the high-level goal, and you'd want this for both a parallel database and a distributed database. The idea is that a single SQL query that runs on a single-threaded, single-worker database system will also work in a parallel or distributed database system. So let me define the distinction between parallel and distributed systems for this class.
For a parallel database system, we're going to assume that the resources, both the computational resources (CPUs) and the disks, are physically close to each other, and that the communication channels between them are both very fast and reliable. Meaning if we send a message to another worker running on the same CPU socket, for our purposes in this class we assume that message is not getting lost. It may show up in a different order, and we use latching to make sure that's handled correctly, but it's not like we pass a message from one thread to another and that message disappears. Because if that happens, you're having cataclysmic problems in your system and the whole node is going to crash and go down. So for the algorithms and things we'll talk about today, we assume we can communicate between different workers reliably and very, very fast. In some cases we can actually read the memory of another worker. And I keep doing this, I keep saying thread instead of worker. I'm trying to keep these words separated, and I'll explain why in a second, but just know that if I say thread, I really mean worker. In a distributed database system, which we'll cover later in the semester, the resources could potentially be very far apart from each other, meaning different racks in the same data center, different data centers in the same state or country, different sides of the globe, and so forth. And therefore the communication between them is going to be slower than what we assume in a parallel database, and it's also going to be unreliable, meaning if we send a message, even if we're using TCP, there's no guarantee that our message is actually going to show up. And likewise, there's no guarantee that we'll get a response from the server. So we have to account for that in our algorithms and in how we design our system. Again, for today's class, we're assuming the first one here, parallel databases, and that's going to make our lives a lot easier. The same techniques that we talk about today can get applied to the distributed database world; we just have to add extra safety protections to deal with lost messages and so forth. All right, so I'm going to begin by talking about the different process models you can have in your implementation of a database system. Then we'll talk about how to do query execution in parallel, and cover the high-level concepts of how you organize a query plan to move or shuffle data between different workers. And then we'll finish up talking about I/O parallelism, which is an important problem. If you take the storage systems class, I think they cover this in a bit more detail, but we'll talk about it from a database system perspective: what can the database know is going on versus what is external to it. So the first thing to define is the database system's process model. This is sometimes called the process architecture in different systems, but basically it determines how the system is going to be organized to support concurrent requests from a multi-user application. And by multi-user, what I mean is you could have multiple things or entities trying to issue different queries at the same time from our application.
Now, this could come from within the application itself, or the application could be taking its own input requests, like a web server taking requests from the outside and then passing them over to the database server. But for our purposes here, we don't really care. And as I said, I'm trying to be very careful in my language here, but we'll say that a worker is going to be a computational unit or component in the database system that's responsible for executing tasks on behalf of the application, on behalf of a client, and then returning those results. It may be the case that a single worker is responsible for getting the query request, and then it may split that query plan up into different pieces to execute in parallel and submit those to separate workers. But at the end of the day, something needs to produce a single result back to the application in response to that query. So we'll say that there's an assigned primary worker for the query, but it may use multiple workers at the same time. So there are basically three high-level approaches here of how we can organize the system. And the first two are very similar to each other; it's just that one's going to use processes and one's going to use native OS threads. The last one is an embedded database. This one is super common and I just want to cover it a little bit, but, how do I say this, whether this is actually a process model or not is debatable. In my opinion, the embedded approach is a common use case of a database system, so we're going to cover it and understand its implications. So process per worker is exactly as it sounds: each worker in our database system is going to be a separate OS process, meaning we're going to call fork when the system starts up and spawn a bunch of worker processes, and there's going to be some coordinator, what we'll call a dispatcher; in Postgres they call this the postmaster. It's responsible for knowing about all the processes it forked and keeping track of whether they're alive or not. So now the application, assuming it's external to the database system, submits a request to connect to the database system. Assume this gray box here is our database system; at a high level it may contain multiple processes. And for the database part here, I'm assuming it's running on the same box with a local disk, but it doesn't necessarily have to be; it could be a remote disk, but we don't care about that right now. So the application says I want to connect to the database system; it first hits the dispatcher process, and the dispatcher process is responsible for picking one of the worker processes and saying you're now going to be in charge of this query or this connection. The dispatcher hands that information back to the application, which then connects directly to the worker process and submits any query requests to it, and that worker executes those queries on behalf of that connection and does whatever it needs to do on the database system. So there are a couple of things about this approach and why I say it's less than ideal. This is sort of a remnant of the days before the dominance of Linux and POSIX and Pthreads and so forth. The first issue is that we rely entirely on the operating system to do scheduling.
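To make the process-per-worker idea concrete, here's a minimal sketch in C++. This is not how Postgres actually implements the postmaster; the pool size and the worker loop are made up for illustration, but it shows the basic pattern of a dispatcher forking worker processes at startup and tracking them.

```cpp
// Minimal sketch of process-per-worker: a dispatcher forks worker processes
// at startup and remembers their PIDs so it can hand connections off to them.
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <cstdio>
#include <vector>

static void WorkerLoop(int worker_id) {
  // A real worker would accept a connection handed off by the dispatcher
  // and execute queries on behalf of that client.
  std::printf("worker %d (pid %d) ready\n", worker_id, getpid());
  _exit(0);
}

int main() {
  const int kNumWorkers = 4;  // hypothetical pool size
  std::vector<pid_t> workers;
  for (int i = 0; i < kNumWorkers; i++) {
    pid_t pid = fork();
    if (pid == 0) {           // child: become a worker process
      WorkerLoop(i);
    }
    workers.push_back(pid);   // parent: the dispatcher tracks its workers
  }
  // The dispatcher would normally listen for client connections and assign
  // each one to a worker; here we just wait for the children to exit.
  for (pid_t pid : workers) waitpid(pid, nullptr, 0);
  return 0;
}
```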
We can play games with the priority flags in the OS so that maybe one process gets a lower or higher priority than another, but I don't think any system really does that, and you really can't control it. The dispatcher can't say, okay, everybody stop executing because I want this process to take over; everybody's sort of doing their own thing. And we have to use shared memory to maintain global data structures, or do some kind of IPC or message passing between processes, in order to keep track of global state like the buffer pool, for example. So there's a bit more overhead to do that unless you're using shared memory, which is what Postgres does and I think all of these other systems do. One advantage of this approach is that if some process goes off the rails and crashes, if it does something wrong, a divide by zero, and seg faults and kills itself, you don't end up taking down the entire database system, because that crash is isolated to just that process. So that aspect is kind of nice, but obviously if we were careful programmers we would avoid divide by zero in the entire database system. As I was saying, this is sort of a remnant of the old days, the 1980s and 1990s, before Linux became what it is today, because you typically see this in older systems like DB2 from the 80s, Oracle from the 70s, and Postgres, certainly the original version from the 80s. And the reason why these older systems used the process-per-worker model is because it was before Pthreads became standardized. All of these different variants of Unix that were out there, BSD, Solaris or SunOS, HP-UX, Tru64, there are all these different versions of Unix you may have never heard of, and they all had their own different native threading packages. So it was a huge mess, and they had different semantics for what would happen when you spawned a thread. If your system needs to support a bunch of different environments, you're better off using fork, because that's something everybody pretty much implemented, and then you didn't have to worry about the threading stuff, because that complicated things; it was different on one system versus the next. What is more common now is the thread-per-worker model, or a multi-threaded database system instance. That's where you have a single process with multiple worker threads inside of it, and these are typically implemented as preemptive native OS threads. For the most part this now means that our database system is going to manage its own scheduling, because it has full control over what thread is doing what, within reason. And you may not use a dispatcher thread in the front: you could have a single dispatcher thread take the request and then immediately forward it to another thread, so the outside application doesn't need to reconnect to something else, or you could do the similar approach I described before, where the application connects to the dispatcher, gets assigned a worker thread, and then communicates directly with that thread going forward. It's basically: do you want a thread pool where any thread could potentially execute any query for any connection, or do you want a dedicated thread per connection? There are pros and cons of each. And pretty much every database system that's been built in the last 20 years uses this approach. SQL Server does this.
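Here's a similarly minimal sketch of the thread-per-worker idea with a shared pool: a dispatcher pushes incoming query requests onto a queue inside one process, and any idle worker thread picks one up. The pool size, the queue, and the query strings are all made up for illustration.

```cpp
// Minimal sketch of thread-per-worker with a shared thread pool.
#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <vector>

std::mutex mtx;
std::condition_variable cv;
std::queue<std::string> requests;   // pending query requests
bool shutting_down = false;

void WorkerLoop(int id) {
  while (true) {
    std::unique_lock<std::mutex> lock(mtx);
    cv.wait(lock, [] { return !requests.empty() || shutting_down; });
    if (requests.empty()) return;   // shutdown and nothing left to do
    std::string query = requests.front();
    requests.pop();
    lock.unlock();
    // "Execute" the query; a real worker would run the query plan here.
    std::printf("worker %d executing: %s\n", id, query.c_str());
  }
}

int main() {
  std::vector<std::thread> pool;
  for (int i = 0; i < 4; i++) pool.emplace_back(WorkerLoop, i);

  // The dispatcher role: hand requests to the pool.
  for (std::string q : {"SELECT * FROM a", "SELECT * FROM b"}) {
    { std::lock_guard<std::mutex> lock(mtx); requests.push(q); }
    cv.notify_one();
  }

  { std::lock_guard<std::mutex> lock(mtx); shutting_down = true; }
  cv.notify_all();
  for (auto &t : pool) t.join();
  return 0;
}
```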
DB2 actually supports both the process-per-worker and thread-per-worker models, because they have to support all these various UNIX and Linux versions, plus mainframe stuff as well. Oracle added support for a multi-threaded instance in version 12, which came out around 2014. MySQL does this. At CMU, we did hack up Postgres in 2015 to make it multi-threaded. I don't recommend it; I forget why we did that. But it can be done. Postgres itself, though, is still strictly a process-per-worker model. The only systems in the last 20 years that don't take this multi-threaded approach are basically anything that is a fork of Postgres. The one system I want to talk about that's really interesting and that does this approach is SQL Server, and in particular I want to focus on how they do scheduling. So for each query plan that shows up, and this is not just for SQL Server, this is for any system trying to do fine-grained scheduling of its workers, the system has to decide how many tasks it should use, how much it should parallelize and break the query plan up, which CPU cores it should use for the workers, and where those CPU cores actually are. That last part only matters if you care about NUMA; not every system does, certainly MySQL and Postgres do not, but SQL Server definitely does. And then, where should the output of the operator go? We've talked about these intermediate results that we'd write out to a buffer as we emit them up the query plan, but it may be the case that, since we know the next operator will be executed on this core over here, maybe we want to write to memory that's local to that core instead of our own local memory. These are things we won't really talk about in this course, we talk about them in the advanced class, but to generalize: the data system is always going to know way more than the operating system, and so we don't want to allow the operating system to make any of these decisions if we can avoid it. And again, the commercial guys do much better at this than the open-source guys. So as I said, the one I want to call out is SQL Server, because I think their architecture is probably the most fascinating of all of these. Around 2005, SQL Server rewrote their entire execution substrate into a new abstraction layer called SQLOS. The idea was basically that they didn't want any part of the data system to make direct syscalls to the operating system, in this case Windows. Instead, everything makes calls to the SQLOS layer, and that layer is responsible for making the appropriate calls down to the operating system to do certain things. And what they ended up doing with this rewrite, as part of SQLOS, is that they switched the code base to a non-preemptive threading model, where you basically have the threads yield their execution contexts back to this centralized scheduler in SQLOS that Microsoft implements inside SQL Server. And then that SQLOS layer can decide who runs next, where they run, and where the data is actually going to be stored or where it's coming from. So this was a major undertaking for them. It's really, really fascinating.
As far as I know, there are some systems, like Fauna, that do a little bit of non-preemptive thread scheduling, but not at the level of sophistication that they do it here. And you can think of it this way: in the same way that the buffer pool is a re-implementation of virtual memory inside our database system, you can think of SQLOS as a re-implementation of almost the entire OS's thread scheduling. They originally did this rewrite because they wanted fine-grained control of where memory was being allocated and where memory was being written to. And they didn't want, as the hardware landscape changed over the years, to have to go back and rewrite or add custom code in all the different parts of the system to account for new types of hardware. If you just do it at the SQLOS layer, then everything above it can take advantage of it, and you implement that complexity in a single location. The unforeseen advantage of this effort back in 2005 was that around 2016, 2017, when Microsoft finally ported SQL Server to Linux, because prior to that it only ever ran on Windows, even though it's one of the most state-of-the-art data systems, it could only run on Windows for Steve Ballmer and Bill Gates reasons, legacy reasons. But when they switched over to support Linux, because of the SQLOS layer, they only had to change the syscalls that were being made to Windows to now go to Linux, and they didn't have to change anything else up above. Now, there's the tooling ecosystem, like GUIs and all that kind of stuff, that can't take advantage of this, and they had to port all of those things over separately, but the core kernel, the core engine of the system, was much easier to port because they had switched to SQLOS. So I want to talk about how they do the thread scheduling. In SQLOS, they maintain a programmer-controlled, programmer-defined four-millisecond quantum, and they have their own scheduler that says, okay, I want to run tasks in four-millisecond chunks. And again, this is not something they can strictly enforce, because they're doing non-preemptive threading, meaning it's up to the implementation, the data system itself, to say, okay, it's been four milliseconds, let me go back to the scheduler and let the scheduler figure things out. Because we can't send signals and do interrupts the way the operating system can at a lower level. And so the way you implement this is you add explicit yield calls in the source code, roughly wherever about four milliseconds of work has been done, and then you go back and yield to the SQLOS scheduler. Say we have a SQL query like this: select star from R where R dot val equals some input parameter. As we saw last class, if you're doing the iterator model, the query plan is roughly going to look like this: for every tuple in R, evaluate the predicate, and if it evaluates to true, then emit it. And what they now have to do for non-preemptive threading is add explicit yield calls in the source code like this. Like, literally they write it in the source code; it's not some compiler magic.
And this is a gross approximation, but basically you get the current timestamp, check to see whether it's been four milliseconds since you last checked, and if yes, you yield back to the SQLOS thread scheduler; otherwise you do whatever processing you want to do, right? And again, the idea here is that we have more fine-grained control of how to intermix threads. There's also additional metadata that Microsoft can maintain about who's waiting on what latch and what lock, and therefore when I yield, I can decide, should I even try to run you again, or do I know you're blocked on some lock up above? The operating system can't know this; it may know about mutexes and that you're blocked on those, but it doesn't know about the high-level locks that the database maintains, which we'll talk about after the midterm. So this is way better, because now you have complete control of when threads run and when they don't. And as I said, the other systems that do this, like Fauna, all they really do is put a blind yield after a disk read, or before they do a disk read. The idea there is just that most of the time you're blocking when you're getting something from disk, so let me just yield before or after I go do that. The one that does do this in a more sophisticated way is ScyllaDB. We had them give a talk at CMU during the pandemic, probably a year or two ago, and they talked about how they were doing this non-preemptive thread scheduling and how they instrumented their source code. So, like Microsoft, you have to put in explicit yields, but they also have infrastructure in place so that if they find a portion of the code, a task, that took more than their quantum time, it sends an alert to the database developers so they can go look and see, okay, why did we just spend more time than we should have on this task? And then they can recalibrate and move the yields to the right location so that they have consistent chunking, right? Because otherwise, without that sort of instrumentation, you're blindly putting in yields and hoping it all works out. So I thought that was a really clever idea. All right, so the last process model I want to talk about, which, how do I say this, I don't think the textbook mentions as an explicit process model, but I think it's worth talking about because it's a much different approach to how a data system runs than we've been assuming this semester. Pretty much we've been assuming our data system runs like Postgres, where it's a separate process either running on the same box or on a remote machine. Again, whether it's using process per worker or thread per worker doesn't matter, but it's some separate process from the application. We connect to it using TCP or Unix sockets or whatever we want, and then we can issue queries and get responses back. And when our application dies or goes away, the data system is still running, right? It's a client-server model. But another common approach is what is called an embedded database system, where the database system itself runs in the same process, the same address space, as the application.
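Here's a rough sketch of that quantum check in C++. This is my approximation of the pattern described above, not Microsoft's actual code: a scan loop checks how long it has been running and voluntarily yields once a roughly four-millisecond quantum has elapsed.

```cpp
// Cooperative (non-preemptive) scheduling sketch: explicit yield checks
// inserted into a scan loop, approximating the SQLOS-style quantum.
#include <chrono>
#include <thread>
#include <vector>

using Clock = std::chrono::steady_clock;
constexpr auto kQuantum = std::chrono::milliseconds(4);

// Stand-in for "go back to the scheduler"; here we just yield the OS thread,
// but a real non-preemptive scheduler would pick the next task to run.
void YieldToScheduler() { std::this_thread::yield(); }

// SELECT * FROM r WHERE r.val = key, with explicit yield checks inserted.
std::vector<int> ScanWithYields(const std::vector<int> &r, int key) {
  std::vector<int> out;
  auto last_check = Clock::now();
  for (int val : r) {
    if (Clock::now() - last_check > kQuantum) {   // quantum expired?
      YieldToScheduler();                         // let another task run
      last_check = Clock::now();
    }
    if (val == key) out.push_back(val);           // evaluate the predicate
  }
  return out;
}

int main() {
  std::vector<int> r(1'000'000, 7);
  auto matches = ScanWithYields(r, 7);
  return matches.empty() ? 1 : 0;
}
```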
And therefore, when the application makes calls to the database system, whatever thread makes the call is responsible for actually going and executing the query or whatever task it needs to do on the database system. Now, I say mostly responsible for threads and scheduling, because there are some embedded systems, I think like DuckDB, where it's an embedded system but it can spawn its own execution threads, so you get parallelism that way, separate from the application's threads. But for our purposes, the main thing we care about here is that it's running in the same address space. So the application could have multiple threads, and each thread could be responsible for executing some queries. I mentioned DuckDB; the other two most famous data systems that work this way are SQLite and RocksDB. When you connect to SQLite through the command line, that SQLite client is the application that's actually running the system; you just happen to be doing it through their terminal interface. But you can also embed SQLite in, for example, your web browser. The web browser is responsible for the threads, it knows how to make calls into SQLite, and SQLite uses those threads to execute queries. Again, there may be additional threads it spawns for parallel execution, like I said with DuckDB, or background tasks to do garbage collection and other things. SQLite is probably the most famous, the most widely deployed embedded database; it's actually the most widely deployed database in general. But it's not the first embedded database. The first one is probably BerkeleyDB, from the late 80s, early 90s; Oracle bought them in 2006. Think of it as an embedded key-value store, like a hash table, but they also had B-tree libraries. Embedded databases are also common in systems that then build a more traditional client-server architecture on top of them. MySQL, for example: their embedded engine is called InnoDB, and that's their B+tree index-organized table implementation. Then MySQL, the client-server layer above it, has the SQL parser, the optimizer, and all that stuff. And so InnoDB doesn't fork or spawn its own threads, as far as I know, but MySQL up above does that and gives them to InnoDB. CockroachDB did this with RocksDB. There are a bunch of systems that take an embedded system that doesn't have its own process model or maintain its own threads and build a larger system on top of it. Okay, so I've talked about the pros and cons of a multi-threaded versus a multi-process architecture. Again, pretty much every system in the last two decades, 15 years, is using native OS threads. There's less context-switching overhead between threads, and you have more complete control over what runs where. So if you're building a system from scratch today, the multi-threaded approach is the way to go. Now, just because your data system supports multiple threads or multiple workers, it doesn't necessarily mean it supports intra-query parallelism, which we'll talk about next. Meaning, just because you have multiple threads running doesn't mean that when a query shows up, the data system will be able to break it up and execute it in parallel across multiple threads. MySQL is the best example of this.
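As a concrete example of the embedded model, here's a minimal sketch using SQLite's C API: the application links the database into its own address space and the calling thread executes the queries itself; there's no separate server process to connect to. The table and queries are made up for illustration.

```cpp
// Minimal sketch of the embedded model with SQLite.
// Build with something like: g++ demo.cc -lsqlite3
#include <cstdio>
#include <sqlite3.h>

static int PrintRow(void *, int ncols, char **vals, char **names) {
  for (int i = 0; i < ncols; i++)
    std::printf("%s=%s ", names[i], vals[i] ? vals[i] : "NULL");
  std::printf("\n");
  return 0;
}

int main() {
  sqlite3 *db = nullptr;
  sqlite3_open(":memory:", &db);   // the "server" lives inside this process
  // These calls run on the application's own thread.
  sqlite3_exec(db, "CREATE TABLE r (id INT, val INT);", nullptr, nullptr, nullptr);
  sqlite3_exec(db, "INSERT INTO r VALUES (1, 100), (2, 50);", nullptr, nullptr, nullptr);
  sqlite3_exec(db, "SELECT * FROM r WHERE val > 99;", PrintRow, nullptr, nullptr);
  sqlite3_close(db);
  return 0;
}
```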
It's a multi-threaded data system architecture, but when one query shows up, it'll only be executed by one thread, right? Because that's the model they pursue. All right, so let's talk about the different types of parallelism we have at the query level. First there's intra-query parallelism: this is where a single query shows up and we can break up its operations across multiple workers and run those in parallel. This is super common in OLAP systems, where, when I want to scan a large table, I can break the scan up into separate workers and have them run in parallel at the same time. Then inter-query parallelism is the most common one; this is what pretty much everyone does. This is where the system can support multiple simultaneous active queries running on separate workers at the same time. And obviously, as always, we want to do this because we get increased throughput and reduced latency. So I'll briefly talk about inter-query parallelism now, but we'll spend more time on it after the midterm, because we need to worry about correctness: if two workers for two separate queries update the same thing at the same time, how do we deal with that? We'll spend most of today talking about intra-query parallelism. So inter-query parallelism, as I said, is multiple queries running at the same time: two different requests show up and we want to run them at the same time. If everything is read-only, this is really easy to do, because they're not going to change the state of the database itself. They may change the state of internal data structures like the buffer pool, where you update the page table and so forth, or any internal statistics we maintain, but the underlying database itself is, for the most part, not getting modified. So this is pretty easy: we use all the latching stuff to protect the data structures, but we don't have to use any higher-level locks to protect the database itself. But if the queries update the database at the same time, then things get really hard, equally as hard as maintaining the parallelism in the data structures. As I said, we're not going to cover this right now; we'll cover it in lecture 15 after we get back from the midterm. There's not really much more to say here: as I showed before, different application instances issue different connection requests, the dispatcher assigns them to different workers, whether it's thread per worker or process per worker doesn't matter, and those queries run in parallel. They have to contend for resources like the buffer pool and other parts of the system, and we make sure they do this in a safe manner that doesn't corrupt our data structures. So intra-query parallelism is what people mostly think about when they say parallel execution in a database system. The idea is that for a single query invocation, we're going to execute its operators in parallel at the same time.
And this is going to be similar to the processing models we talked about last class, the iterator model, the vectorized model, and the materialized model, where it's sort of a producer/consumer model: different operators produce results for different parts of the query plan, and there could either be multiple workers for each of those operators running in parallel, or the different operators could be assigned to different workers that feed into each other. The point I'm trying to make is that the producer/consumer model makes it reasonably easy to expand or extend the query plans to support parallel execution, because we have that decoupling and a well-defined API for what one operator sends to another operator. At a high level, we're just saying it emits tuples. Obviously, as in the SQLOS discussion, you might care about the location where the results are being sent, but for our purposes here we can ignore that. For all the operator algorithms we've talked about so far, there are parallel versions of all of them. Sometimes they're completely parallel: I can run a scan in parallel on a table and I don't need to pass information from one scan fragment to another, right? Other times you do care about the results of the different operators, like when you need to know the global state of something. Say you're trying to find distinct keys: maybe you need to know what the distinct keys coming out of the different operators are at the same time, so that you can coalesce them into a single distinct set. We'll get to that in a second, but the basic thing to point out is that the output of the operators can be produced in parallel, and they can either write to unique buffers or unique locations, or they can write to a single shared data structure. We'll see how to do this in both cases, but at the end of the day, the result of the query itself needs to be a single result that's sent back to the client. As I said, we write SQL; we don't care that the query is going to run in parallel. We just want to get a result back as a single set, a single relation. We don't want to be told, here's half the results, and here's the location where you can find the other half. Nobody would write their application code like that. So the database has to produce a single coalesced result. Now, it's easy to conceptualize how to do parallel versions of all the algorithms we've talked about so far, so I just want to show, at a high level, how you would do this for a parallel hash join. We talked before about how you'd have a first phase where the build side and probe side both get scanned, hashed, and written out into buckets based on the hash of the join key. An easy way to parallelize this is, for each pair of buckets, one from the build side and one from the probe side, I can have a worker be responsible for either doing a nested-loop join if it fits in memory, or building a hash table on the build-side bucket and then probing it with the probe-side bucket. So each worker can be doing this for a distinct pair of buckets in parallel.
Alternatively, if everything fits in memory, you can do the scan in parallel on the build side, build a single global hash table, and then on the probe side you probe that hash table in parallel with different workers. And because you're not modifying the state of the hash table during the probe phase, you can run that in parallel without any latching, and that'll be fast, right? So again, the tricky thing is the coordination between the different workers, but it's not hard to see how you could run these in parallel; the decision is whether you want to produce a single output result or split them up and maybe coalesce them later. All right, so for intra-query parallelism there are going to be three types. Well, in my opinion there are two types: there's intra-operator parallelism, which is horizontal parallelism, and then inter-operator parallelism, which is vertical parallelism. The textbook refers to a third type called bushy parallelism, and in my opinion it's just a combination of the two, so I don't really see it as a distinct type of intra-query parallelism, but we'll go through it and make sure we understand what's actually going on. And again, the advantage of SQL is that because it's declarative, we don't have to write anything in our SQL statement to make it use any of these things. The data system is responsible for figuring out: these are the resources I have, my data looks like this, my query looks like this, here's the optimal way to configure the query plan to use these different types of parallelism. And as a spoiler, the first approach, horizontal parallelism, is the most common one you'll see in parallel data systems. All right, so in the first one, the idea is that we're going to decompose an operator in our query plan into what I'll call independent fragments, query plan fragments, that perform the same function, the same operation or operator, as the other fragments, but on disjoint subsets of the data, whatever data is coming up from below. It could be a table scan, where we break the table up into shards or partitions and every fragment operates on just its shard, or it could be data coming up from an operator below us that we split up so everybody operates on a piece separately. Now, to make this all work, we're going to introduce a new sort of synthetic operator in the query plan called the exchange operator, which we'll use to either coalesce results from multiple children together and produce a single result, or take a single result and split it up across multiple parents above us. Again, this is not something that exists in relational algebra. This is something the database system introduces in the physical query plan to allow it to reason about what the degree of parallelism is, and where the data is actually coming in and going out at these different steps. Postgres calls this gather, and they really only do this first type of intra-operator parallelism; something like SQL Server, a more sophisticated system, supports basically all three types. So let's look at a simple example. Say I have a table A with a billion tuples, a ton of tuples, and I want to do select star from A where A dot val is greater than 99, and assume I don't have an index here.
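Here's a minimal sketch of that bucket-pair-per-worker parallel hash join in C++. The tuples, the number of buckets, and the modulo hash are all made up for illustration; the point is just that once both sides are partitioned on the join key, each worker can build and probe its own bucket pair independently.

```cpp
// Partitioned parallel hash join sketch: hash both inputs into buckets on
// the join key, then give each worker its own (build bucket, probe bucket).
#include <cstdio>
#include <thread>
#include <unordered_map>
#include <utility>
#include <vector>

using Tuple = std::pair<int, int>;   // (join key, payload)

int main() {
  std::vector<Tuple> build = {{1, 10}, {2, 20}, {3, 30}, {4, 40}};
  std::vector<Tuple> probe = {{2, 200}, {3, 300}, {5, 500}};

  const int kBuckets = 2;
  std::vector<std::vector<Tuple>> build_parts(kBuckets), probe_parts(kBuckets);
  for (auto &t : build) build_parts[t.first % kBuckets].push_back(t);
  for (auto &t : probe) probe_parts[t.first % kBuckets].push_back(t);

  std::vector<std::vector<std::pair<int, int>>> results(kBuckets);
  std::vector<std::thread> workers;
  for (int b = 0; b < kBuckets; b++) {
    workers.emplace_back([&, b] {
      // Build a local hash table over this worker's build-side bucket...
      std::unordered_map<int, int> ht;
      for (auto &t : build_parts[b]) ht[t.first] = t.second;
      // ...then probe it with the matching probe-side bucket.
      for (auto &t : probe_parts[b]) {
        auto it = ht.find(t.first);
        if (it != ht.end()) results[b].push_back({it->second, t.second});
      }
    });
  }
  for (auto &w : workers) w.join();

  for (auto &bucket : results)
    for (auto &r : bucket) std::printf("joined payloads: %d %d\n", r.first, r.second);
  return 0;
}
```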
So the query plan is pretty simple, right? At a logical level, it's scan A and then apply the filter on val. What I'm going to do is, for all the different pages I have in my table, divide the query plan for the scan operator with the filter into fragments. Each fragment is going to do a scan on a portion of table A, apply the filter, and then feed its output up into this exchange operator. The exchange operator acts like a barrier: it says, I can't proceed, I can't send anything up to the next level of the query plan tree until I get all the results from my children operators below me. And it's responsible for saying, okay, I have three workers and I have three plan fragments, so let me assign the plan fragments to these different workers and let them run at the same time. So now when I call next on the exchange, assuming we're doing the iterator model with top-down execution, the exchange calls next on a plan fragment's filter, which calls next on its scan on A, and that's responsible for going and fetching the first page. And again, there's some bookkeeping keeping track of which page each plan fragment is responsible for reading. But unlike in the basic iterator model, where it was a blocking call as I go down and call next on my children, the exchange operator is allowed to call next on its other children operators in parallel at the same time. So now my three plan fragments are running in parallel, all processing different pages of A. Let's say the first two plan fragments finish and start emitting tuples to the exchange; the exchange can say, well, I know there are two more pages in table A that should be scanned, so it calls next on fragments one and two again to have them start scanning pages four and five while the third fragment finishes up scanning page three. So again, the idea is that we're dividing the scan on a large table across multiple fragments that each run in parallel, so that we can potentially reduce the execution time of the scan. A worker does still have to block and wait when it needs a page from disk, but some other worker whose page is already in memory can process that and make forward progress. All right, so this type of exchange is the most common one. It's called the gather operation. In SQL Server I think they may call it the parallelism operator; Postgres calls it gather. The original paper from the 1990s that described this approach, it's actually the Volcano paper that we've talked about many times before, calls it exchange, and that's their term. The distinction between these different exchange operator types actually comes from SQL Server, because as an enterprise system they support all of them. So this is the most common one, gather: we're basically taking the results from multiple workers, coalescing them, and producing a single output stream to whoever needs it up above. So going back to my example here, since I need to produce a single output result to the application for the query, the exchange operator is going to combine them and then push them up. That's what gather does.
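Here's a minimal sketch of the gather idea in C++: three scan fragments filter disjoint chunks of table A in parallel, and the exchange coalesces their outputs into a single result before anything flows up the plan. The table contents and fragment count are made up for illustration, and a real system would hand out pages incrementally rather than fixed ranges.

```cpp
// Gather-exchange sketch: parallel scan fragments feed one coalesced output.
#include <cstdio>
#include <mutex>
#include <thread>
#include <vector>

int main() {
  std::vector<int> table_a(900);                 // pretend this is table A
  for (int i = 0; i < 900; i++) table_a[i] = i % 200;

  const int kFragments = 3;
  std::vector<int> gathered;                     // the exchange's output
  std::mutex gather_mtx;

  std::vector<std::thread> workers;
  for (int f = 0; f < kFragments; f++) {
    workers.emplace_back([&, f] {
      // Each fragment scans only its own disjoint range of the table
      // and applies the predicate val > 99 locally.
      size_t begin = f * table_a.size() / kFragments;
      size_t end = (f + 1) * table_a.size() / kFragments;
      std::vector<int> local;
      for (size_t i = begin; i < end; i++)
        if (table_a[i] > 99) local.push_back(table_a[i]);
      // Hand the fragment's output to the exchange operator.
      std::lock_guard<std::mutex> lock(gather_mtx);
      gathered.insert(gathered.end(), local.begin(), local.end());
    });
  }
  // The exchange acts as a barrier: nothing flows upward until all of the
  // child fragments have finished.
  for (auto &w : workers) w.join();
  std::printf("gathered %zu matching tuples\n", gathered.size());
  return 0;
}
```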
The next one is distribute, and this is where we take a single input stream from our child operator and break it up according to some partitioning scheme: round-robin, hash partitioning, range partitioning, we'll cover these later. You basically take the single input stream, break it up into disjoint subsets, and send those out to different consumers up above. And then repartition is the combination of the first two, where you take multiple input streams, coalesce them, and then produce multiple output streams. The key thing to point out with repartition is that the number of inputs into the exchange operator does not need to be the same as the number of outputs. I could have a lot of parallelism in my input, but maybe I only want two workers processing the output, so I divide it up across two workers. Again, as I said, the first one is the most common; the second two are typically seen in the enterprise systems, and any system that's really focused on OLAP performance will do these other two as well. So let's say now we want to do a join. We do our scan on A, and we have three workers doing this fragment here. We're also going to push down the filter below the join, for the obvious reasons we talked about before: we don't want to start doing hash table probes, or building a hash table, for tuples that are never going to match anyway, so we want to get the filter as close to the scan as possible. We're also going to do an optimization where we push down the projection, because we know these are the only attributes we're going to need at some later point in the query plan, so rather than copying the whole tuple, assuming we're doing early materialization, we just copy the subset of the data that we actually need. And then we're going to build our hash table here. We'll put an exchange operator above that, because we want to coalesce the hash tables together, or assume we're building one single giant hash table from these three plan fragments. We don't want to start doing the probe until we know our hash table is fully populated. So then this feeds into the build side of our hash join. Now we do the scan on B, same thing: the filter push-down and the projection push-down. And then we probe the same hash table that was built by the other side. Again, this will not get invoked until we complete the build side, which is blocked by the exchange operator. And then the output of these different threads doing the join, they're going to write to their own individual output buffers, so we can put another exchange operator above that to coalesce the results into a single output, right? And again, this exchange is just internal metadata that denotes in the query plan that nothing above it can run until the tasks below it complete. Now, you might ask, what determines the amount of parallelism you're going to use when you execute the query? That's typically decided before you start executing the query. The data system is not going to say, oh, well, I could use more threads while it's running, and start giving you more threads. Some systems do that; Postgres does not, SQL Server does not. Typically, you define the amount of parallelism you want when the query plan is actually generated.
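Here's a minimal sketch of the distribute exchange in C++: a single input stream is split by a hash partitioning function into one output stream per downstream consumer, so tuples with the same key always land on the same consumer. The keys and consumer count are made up for illustration; repartition would do the same thing after first merging several input streams.

```cpp
// Distribute-exchange sketch: hash-partition one input stream into
// multiple output streams for downstream consumers.
#include <cstdio>
#include <functional>
#include <vector>

int main() {
  std::vector<int> input = {7, 42, 13, 42, 99, 7, 18};   // incoming tuple keys
  const int kConsumers = 2;
  std::vector<std::vector<int>> outputs(kConsumers);     // one stream per consumer

  for (int key : input) {
    // Hash partitioning: tuples with the same key always go to the same
    // consumer, which is what downstream joins/aggregations rely on.
    size_t target = std::hash<int>{}(key) % kConsumers;
    outputs[target].push_back(key);
  }

  for (int c = 0; c < kConsumers; c++)
    std::printf("consumer %d receives %zu tuples\n", c, outputs[c].size());
  return 0;
}
```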
Some systems, like Snowflake, can recognize, oh, I'm running slower than I should, and they can actually borrow workers, not only from your cluster; they have this sort of global pool of available workers, and they can pull in workers to speed things up for you if things are going slower than they anticipated, which is kind of cool. All right, so inter-operator parallelism: this is vertical parallelism, and as I said, this is not as common in what I'd call traditional data warehouse OLAP systems. You typically see this in what are called streaming systems or continuous query systems, things like Spark Streaming, Apache Flink, Pulsar, and ksqlDB, which runs on Kafka. The idea here is that you're going to assign a worker to an operator, one worker per operator, so that operator can just run all the time on any tuple that arrives, and when it produces an output, it emits that to the next operator, where some other worker can pick it up and process it, because there's always going to be some other work for each worker to keep doing. The reason this is common in streaming systems is that I'm connected to some event source, some source of data that's always sending me new data, and so rather than running the query in batch mode, where I run it, produce a result, and then a new query shows up a second later and I run it again, these things are just running all the time. This is also called pipeline parallelism, because within a pipeline, you have multiple workers running in parallel at the same time. So in a really simple example like this, we have this join operator here, and let's ignore what we talked about before, where we parallelized it with multiple workers doing the scan. Assume there's just one worker responsible for taking any tuple it gets from the inner table, building the hash table for the hash join, and emitting matches, right? So that thing is always running on one worker and it's emitting tuples like this. And then up above, there's going to be another worker for the next operator, and it just sits and listens on some incoming stream, and for any tuple that shows up, it applies its projection and spits it up, right? If there's no new result on that incoming queue or input stream or socket, that worker is just idle and it waits. And again, the idea of a continuous query is that this thing is always running; say you're trying to do an alert on a stock price, you want this thing running all the time as new stock ticks come in. So each worker is doing a piece of the query plan and always running in parallel. All right, the last one is the bushy parallelism I talked about before, which, in my opinion, is just a combination of both intra- and inter-operator parallelism, where you have different workers executing multiple operators from different parts of the query plan at the same time. You still need exchange operators because you have to combine things and put things together, but at a high level it works the same way. So now say we have a four-way join between tables A, B, C, and D. With bushy parallelism, I could break it up so that the join of A and B is done by one worker, the join of C and D is done by another worker, and then these produce outputs that feed the worker doing the join on the intermediate results between them.
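Here's a minimal sketch of pipeline parallelism in C++: one worker is dedicated to the filter operator and another to the operator above it, running at the same time and connected by a queue of tuples. The tuple values and the end-of-stream sentinel are made up for illustration.

```cpp
// Inter-operator (pipeline) parallelism sketch: one worker per operator,
// connected by a shared queue of tuples.
#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <optional>
#include <queue>
#include <thread>

std::mutex mtx;
std::condition_variable cv;
std::queue<std::optional<int>> stream;   // nullopt = end of stream

void Push(std::optional<int> v) {
  { std::lock_guard<std::mutex> lock(mtx); stream.push(v); }
  cv.notify_one();
}

std::optional<int> Pop() {
  std::unique_lock<std::mutex> lock(mtx);
  cv.wait(lock, [] { return !stream.empty(); });
  auto v = stream.front();
  stream.pop();
  return v;
}

int main() {
  // Worker #1: the filter operator, always running over incoming tuples.
  std::thread filter_worker([] {
    for (int val = 0; val < 10; val++)          // pretend these arrive over time
      if (val % 2 == 0) Push(val);              // emit only tuples that pass
    Push(std::nullopt);                         // signal end of stream
  });

  // Worker #2: the next operator up, idle until tuples show up in its queue.
  std::thread upper_worker([] {
    while (auto v = Pop()) std::printf("emit %d\n", *v);
  });

  filter_worker.join();
  upper_worker.join();
  return 0;
}
```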
Everything we've talked about so far is about getting parallelism across CPU cores. The problem is that it may be the case that the CPU isn't the bottleneck; it's actually the disk. If I have a CPU socket with a thousand cores, it's great that I can run all my workers on different cores at the same time using the different types of parallelism we talked about, but if they're all, at the end of the day, going to the same disk, and that's super slow, then all my additional parallelism is for naught. It's a waste. I could actually get potentially worse performance, because now the disk is just thrashing, bringing things in and out of memory nonstop. So what we also potentially need in our database system, actually what we do need, is I/O parallelism, so that we can split the database itself, the things we're actually storing, across multiple devices: multiple non-volatile devices, multiple SSDs, or, if we're running in the cloud, multiple S3 buckets, because this is going to allow us to improve our disk bandwidth quite a bit. And there's a bunch of different ways we can do this. One database could be stored across multiple disks. You could have one database per disk if you have multiple databases. You could assign one table, one relation, per disk. Or you could split a single relation across multiple disks. All of these approaches have different trade-offs, not only in terms of performance but also reliability or durability: if one disk goes down, do I lose everything, or do I only lose part of it? We're not going to get too much into that now. I just want to talk about, from a database system perspective, what we need to think about here, and we'll talk a bit more about this when we discuss distributed databases. I will say that some systems support this notion of I/O parallelism natively, meaning the data system is aware that it has multiple distinct disk devices and that it can assign tables, relations, or indexes to different disks, and it knows in the disk manager or the dispatcher, when you go fetch a page, where you're actually going to get it, and it can potentially also do backups and recovery according to the topology of your disk layout. Other systems have no notion of different devices; they're just using the operating system's file system API to say, create me a file, create me directories, and they don't know what the actual physical storage behind that is. Postgres and MySQL work this way. And so it's up to you, as an administrator at the operating system level, to move directories to different disk devices, or set up symlinks and so forth, so that you get the benefit of the parallelism: the disk effectively got faster, but the data system's scheduler isn't aware that these are actually different storage devices, so it can't make decisions accordingly. So that's the first approach I'm talking about here: transparent I/O parallelism, where the data system isn't doing the I/O parallelism itself. This is probably the most common one; it's what most people do to get started. The enterprise systems can do more complicated things. The basic idea is that at the hardware or OS level there's this notion of multiple storage devices.
From the data system's perspective, I've got a bunch of pages on disk, in my files, but I don't know where they're actually being stored. This is what you would use something like RAID for. RAID 0 is striping, where you go round-robin: for every page that shows up, you assign it to one of the disk devices in your RAID array. Storage appliances are basically doing the same thing. Striping is one way to get better performance, because now, within what looks like one logical disk to the data system, requests can run in parallel and there are multiple devices dispatching and executing them. Of course, the problem in this case is that if one of those disks crashes, you lose all the pages stored on it. So there's also mirroring in RAID, RAID 1, where every page is stored on multiple devices in duplicate. This makes reads go faster, because now different queries can potentially read different pages from different disks, and it makes the data more reliable, because if one disk dies, you have copies on the other disks. And again, this is all transparent to the data system. Most systems are going to use some combination of RAID 0 plus 1, so you get the benefit of the striping, which is better for writes because not every write has to go to all the disks, and you also get the benefit of mirroring, so that if one disk dies you don't lose everything. And from our perspective, the data system doesn't know anything about this. Now, the other way, where the data system does natively support I/O parallelism, is through some amount of database partitioning. The idea here is that we take some level of division in our database and split the database up into disjoint chunks, which we then assign to different storage devices. As I said, you can do this at the per-database level or the per-table level. Within a table, you can do either horizontal partitioning or vertical partitioning. Again, it's up to the data system to decide what's the best way to do this and get better performance. In some cases, if you're storing everything as one database per file or one database per directory, this is really easy to do; you just move directories around. But again, this has pros and cons. The one that is super common, especially in distributed databases, which we're going to hold off talking about until later, is the lower-level table partitioning or tuple-level partitioning, where we split a single logical table into disjoint subsets or segments that are then treated separately in the system. The query planner then knows how to take a SQL query request and route it to the right location where the data is located. And ideally, we want this partitioning to be transparent to the application, because we always want it to appear as a single logical table, even though physically it's stored in different ways. Sometimes this information bleeds out into the application; we can talk about the implications of that later. I don't want to go into too much detail on this right now; we'll cover it when we talk about distributed databases, because we don't need to know this technique to get through query planning and transactions, but we do need to know about it for scaling onto multiple machines. So again, we'll cover that later in the semester.
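Here's a minimal sketch of what RAID-0-style striping does underneath the data system: pages go round-robin across devices, so sequential page requests can be serviced by different disks in parallel while the DBMS just sees one logical volume. The device paths are made up for illustration.

```cpp
// Round-robin page striping sketch: map each page id to one of the devices.
#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

int main() {
  std::vector<std::string> devices = {"/dev/sdb", "/dev/sdc", "/dev/sdd"};

  // page_id -> device via round-robin; the DBMS just sees one logical volume.
  auto DeviceForPage = [&](uint64_t page_id) {
    return devices[page_id % devices.size()];
  };

  for (uint64_t page_id = 0; page_id < 6; page_id++)
    std::printf("page %llu -> %s\n",
                (unsigned long long)page_id, DeviceForPage(page_id).c_str());
  return 0;
}
```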
All right, so that's everything I wanted to cover for parallel execution. As I said, the multi-threaded, thread-per-worker model is the most common approach. And inter-query parallelism, meaning running multiple queries at the same time, is also common. Multi-threaded execution of a single query, intra-query parallelism, is more common in OLAP systems, because you want to scan a lot of data and do things in parallel. This is, again, easy to think about but hard to get right, because there are a lot more moving parts in the system, and you have to worry about things we didn't have to worry about before, when one query showed up and we just ran it on a single thread. Now we have to worry about which fragment runs at what time, where they send data, who runs first versus others, and, if we're doing writes, how to handle updating the same parts of the same data at the same time. We've already talked about how to use latching to protect the data structures. And of course, now there's contention on resources: everyone's trying to access the disk at the same time, so how do you make that not be a major bottleneck? And again, this is why we don't want the operating system to do any of this, because the data system always knows what's going on and can always make a better decision for all of these things. All right, so coming up this Thursday is the midterm exam. As I said, it's going to cover everything that we discussed in class up to and including the lecture from last Thursday. So this lecture on parallel execution is not covered, but everything prior to that, including query execution part one, is covered. It'll be the hour and 20 minutes that we normally have for lecture, and it'll be in the location where we normally have lectures. The link here will take you to the midterm study guide, so please go look at that, and there are practice problems in the textbook you can look at if you want additional practice. All right guys, I gotta go. I gotta deal with Fat Faced Rick and this intervention that's not going well. I'm flying back tonight and I'll see everyone at the exam on Thursday. See ya.