So this lecture is a bit different from how we did things last year. This material used to be combined into a single lecture — we did scheduling and query processing all in one. I've now split it up so I can go more slowly over the scheduling aspects. The next class is the midterm, then spring break, then project proposals, and the lecture after that will be the query processing material. Before we get into that, I want to quickly make one correction to something I said last class. This is the second year in a row I've made this mistake. When we talked about DPDK — again, that's the kernel-bypass library that lets you read and write buffers directly on the NIC — I said that, as far as I knew, you can't get it on Amazon EC2. It turns out I'm completely wrong about this. Amazon has actually supported it since 2016. I found this out because somebody tweeted at me when I put the lecture up last week and said, you're wrong, here's the description of what Amazon does. They call it the Elastic Network Adapter. What makes me feel even more stupid is that the guy who runs ScyllaDB had already tweeted at me back in December that you can run DPDK on EC2. So again, DPDK is a very interesting library that I think systems should look at, but it's not easy to use. Apparently it works on Amazon, so it should be portable. And I love being corrected, because obviously I want to know when I'm wrong. All right, so today we're going to talk about scheduling. As I said when we looked at the diagram of the overall architecture of the system, where we are in the semester is that we've covered everything we need to know about storage, and we've covered some concurrency control and indexing, because that permeates the entire system. Now we're looking at how we actually execute queries. Last class was about how we take a request: what wire protocol we use to communicate from the client to the server so we can get a query and execute it. There's a step in between where you have to turn the SQL query into a query plan — that's the query optimizer, which we will unfortunately cover at the end of the semester. At this point we're assuming we already have a query plan with a bunch of operators that we want to invoke in our system. Those operators compute the query and generate a result that we send back to the client as part of the response. There are a bunch of terms I want to define going forward so we have a common vocabulary when we talk about different parts of the system. As I already said, the query plan is what the optimizer generates, and it's a tree structure: at the leaf nodes you have the access methods, where you actually read data from the storage system, and that data gets pushed up into other operators that do filtering, joins, and other things. All of the nodes in that query plan are called operators. We'll make a distinction between physical operators and logical operators later on, but for now it doesn't matter. One operator could be a join, another could be a group-by or a projection.
At runtime, though, when we actually execute the query, if we want to run it in parallel, then for each operator we're going to have multiple instances of that operator. You read about this in the morsel-driven parallelism paper from HyPer: if I'm going to scan my table, I'm not going to have one thread in charge of scanning the whole thing. I'm going to break it up into chunks, which they call morsels, and have multiple threads doing the scan in parallel. So an operator instance is a single invocation of an operator on a chunk of the data: it's part of the same operator in the query plan, but it's one thread executing it once. The higher-level construct is the notion of a task. This is where we can have pipelines: we take a sequence of operators, or operator instances, from our query plan and invoke them together as part of the same invocation on one of our workers. Thinking about a query plan — we'll see examples in a second — I don't want one task to be the scan and the next task to be the filter; I want to do those immediately one after the other. We'll call that combined unit a task, and that's the thing we hand off to some kind of worker queue and have the system schedule for execution. So the problem we're trying to solve today is: we have a query plan, we can break it up into operators and operator instances, and we need to figure out how we're going to run this in a multi-threaded, multi-core, or even multi-socket environment. In this class we're focusing on single-node systems, but in a distributed system you have to make the same decisions — a multi-socket, multi-core machine is basically just a fast distributed system. So what do we need to figure out? How many tasks we should use to divide up our query, how many CPU cores we should use to execute it, and where those tasks should actually execute — that last one is going to be a big issue we want to deal with. And when a task completes, it produces some intermediate result, and the question is where we want to store it. The operators feed each other: the output of one is the input of another, so we want to store the output of one task or operator somewhere the next operator can get to it efficiently. As I've said throughout the semester, at the end of the day we want to do as much as possible ourselves; we don't want to rely on the operating system to do our scheduling, because we know exactly what the query is trying to do: we have the query plan, we know what the operators are, and we know roughly what the data looks like based on our statistics. So we're in a better position to decide where to place these things than the operating system. The OS just sees some threads reading and writing data; it doesn't know the high-level semantics of what they're doing or that the output of one task will be used by another. That's why we want to manage everything ourselves rather than letting the OS do it for us.
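To make this vocabulary concrete, here's a minimal sketch in C++ of how a system might represent morsels, pipelines, and tasks. The names (Morsel, Task, TaskQueue) are my own illustration, not from the HyPer paper:

```cpp
#include <cstdint>
#include <functional>
#include <queue>

// A morsel is a small, fixed-size chunk of a table (HyPer uses ~100,000 tuples).
struct Morsel {
    uint64_t start_tuple;   // first tuple id in this chunk
    uint64_t num_tuples;    // how many tuples it covers
    int      numa_node;     // socket whose local memory holds this chunk
};

// A task pipelines several operator instances (e.g., scan -> filter -> hash-build)
// over a single morsel, so intermediate tuples never leave the worker.
struct Task {
    Morsel                              morsel;
    std::function<void(const Morsel&)>  pipeline;  // fused operators to run
};

// The scheduler's job (discussed below) is to hand these tasks to worker threads,
// ideally ones running on the same socket as task.morsel.numa_node.
using TaskQueue = std::queue<Task>;
```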
So today's agenda: we're going to focus on three problems. First we'll talk at a high level about how to design a multi-threaded or multi-worker system. Then we'll talk about the big issue of having the system be aware of where its data is actually stored in memory on the machine, and the performance penalty or gain we get from having threads operate on data that's local to them — that was a big aspect of the morsel paper. Then we'll finish up by talking about two different scheduling approaches: the morsel-driven approach from HyPer, and a more sophisticated scheme from a research paper about SAP HANA. I'm picking the HANA paper because it illustrates the trade-off between work stealing and not work stealing, and between having a dedicated pool just for query workers versus a pool for the entire system that can do anything. So it gives us two contrasting approaches for how to assign tasks to workers. Okay, we covered this next part in the introduction class, but I want to go over it again because it's relevant for today's discussion and for next class, and it will give you a better understanding of why we choose one particular process model — the multi-threaded approach — over the others. The first thing we need to understand is the database system's process model. A process model is essentially how we organize the system's components to run in parallel with multiple workers. A worker is the thing in our system that knows how to take tasks, execute them, and produce results. We don't want to just call them threads, because we'll see an example on the next slide where the worker is actually a process, not a thread. There are different ways to do this, and the choice is a trade-off between software engineering complexity, resiliency to crashes, isolation between workers, and getting the best performance. There's a great paper — really a short book — from Joe Hellerstein, Mike Stonebraker, and James Hamilton, who now runs much of the infrastructure at Amazon and AWS, where they lay out the architecture of a database system. I really like it; you should check it out. It covers some of the things we talked about in the intro class for a disk-oriented system, but from the perspective of actually building the internals. It covers a lot of things we've already covered, but it's another perspective on the problem. All right, so there are three approaches to process models: a process per database worker, a process pool of workers, and a thread per database system worker. The spoiler is that the last one is what we use and what every modern system uses, but for historical reasons there are systems that use the others. So the process-per-worker approach is that every worker is its own separate operating system process. It has its own PID — well, it does have a parent PID, because it's forked — but it has its own address space.
The only way it can communicate with other processes — other workers — is through shared memory or some kind of inter-process communication, like a pipe. The basic idea is that your application sends a request to some kind of dispatcher — in Postgres this is called the postmaster — and the dispatcher hands your request off to a worker. The worker then communicates directly with the client: I'm your worker, I'm in charge of whatever you need, you can send queries to me. The worker knows how to take any query request, run it in the database system, and send back the result. If we need to run a query in parallel, that worker has to write to shared memory or do a slow message exchange with other workers, because they're not in the same address space. You see this in older systems for historical reasons — and when I say older, I mean the 1980s and 1990s. Someone suggested it's because things crashed more often. That's part of it, but — bingo, exactly — the bigger reason is that pthreads, POSIX threads, did not exist back then. There's a bunch of stuff we have now that we take for granted. In the 1980s and 1990s there was no standard threading package, so all the different UNIX variants had their own. If you wrote a multi-threaded database system to run on BSD, there was no guarantee it would run the same way on Solaris, because Solaris had its own threading package. Pthreads changed all that. So these older systems — Oracle, Postgres, DB2 — were written at a time when you didn't have a portable threading package, and since most operating systems supported multiple processes and shared memory, that's what they used instead of threads. A variant of process-per-worker is the process pool. The basic idea is that you still go through a dispatcher, but instead of one dedicated worker being assigned to your client connection, the request can be handed to any worker. And now if you want to do parallel queries, that worker can talk to other workers that may be idle and hand off portions of the query plan for them to execute and return results. This is actually what Postgres does now in newer versions — since 9.6, a few years ago. Postgres used to only allow a single worker process to execute your query; now it can hand the query off and run it in parallel across different workers. Again, you get the same isolation as before, because each process has its own address space, and to communicate you have to go through shared memory or IPC. So how do we do scheduling in this approach, or the previous one, if the workers are all processes? Exactly — the OS does it. We can't really do anything, because every worker is its own first-class process. We can give hints to the operating system about who gets priority, like setting the nice value, but the OS decides how to schedule these things however it wants. As I said, Postgres added parallel query fairly recently.
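To make the process-based models concrete, here's a minimal sketch (not Postgres's actual code) of a dispatcher forking a worker process and getting its result back through POSIX shared memory, since each forked worker otherwise has a private address space:

```cpp
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>
#include <cstring>
#include <cstdio>

int main() {
    // Shared memory region that a worker process can write results into.
    char* shared = static_cast<char*>(mmap(nullptr, 4096,
        PROT_READ | PROT_WRITE, MAP_SHARED | MAP_ANONYMOUS, -1, 0));

    pid_t pid = fork();        // dispatcher spawns a worker process
    if (pid == 0) {
        // Worker: "execute" the query and write the result into shared memory.
        std::strcpy(shared, "query result from worker");
        _exit(0);
    }
    // Dispatcher: wait for the worker, then read its result via shared memory.
    waitpid(pid, nullptr, 0);
    std::printf("dispatcher saw: %s\n", shared);
    munmap(shared, 4096);
    return 0;
}
```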
DB2, you'll see, does all three approaches. I've since learned that DB2 is actually four different code bases: there's one that runs on Linux and Windows, another that runs on z/OS, their mainframe, and a couple of others. They have to run in all these different environments, so they need to support all these different models. I actually don't know which one you get in the Linux version; it might be the process-per-worker model. All right, the last one is the most common in recent years, and it's how anybody building a new system today would do it: a multi-threaded environment. Single address space, a bunch of worker threads. Whether you assign one worker thread per connection, or have a pool of threads reading from a task queue, is a question we'll get to later. The basic idea is that any thread can communicate with any other and anybody can write anywhere in the address space, so you need latches to protect things. This is what everyone uses in modern times, because pthreads is common enough. When we first started building Peloton, our old system, it was based on Postgres, and Postgres was process-per-worker. We wanted to make it multi-threaded for performance reasons and we ended up porting the entire thing. If you Google "multi-threaded Postgres" you should see my PhD student's post on the Postgres mailing list saying, hey, we did this, here it is. We also converted it to C++, which was another undertaking that I regret, but that's okay. It was actually kind of crazy: Postgres is one code base with #defines for Windows versus Unix, and it turned out that, for whatever reason, it was easier to take the Windows code path and make that multi-threaded with pthreads on Linux than to take the multi-process Unix version. But again, we abandoned that and threw all the code away. As I said, the multi-threaded approach is the way to go now. It's easier to engineer because you don't have to worry about the shared memory machinery, and there's less overhead on context switches because threads in the same address space are more lightweight for the OS to schedule than processes. But we have to be careful, because if one thread crashes, the whole system comes down. So some aspects are easier to engineer and other aspects are harder. The other thing I'll point out is that we'll talk about query parallelism in the next couple of lectures, but just because you're using a multi-threaded process model — threads for your workers — does not automatically mean you get intra-query parallelism. What I mean is that even though your database system is multi-threaded, it may still only execute each query with a single thread. MySQL does this: MySQL can only do single-threaded query execution. Modern systems try to avoid this; especially if you're doing OLAP, you have to have parallel queries.
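Here's a minimal sketch of the thread-per-worker model (my own illustration, not any particular system's code): a pool of worker threads pulling tasks from a shared queue, with a latch protecting the queue because everything lives in one address space:

```cpp
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

class WorkerPool {
public:
    explicit WorkerPool(unsigned n) {
        for (unsigned i = 0; i < n; i++) workers_.emplace_back([this] { Run(); });
    }
    void Submit(std::function<void()> task) {
        { std::lock_guard<std::mutex> g(latch_); queue_.push(std::move(task)); }
        cv_.notify_one();
    }
    ~WorkerPool() {
        { std::lock_guard<std::mutex> g(latch_); done_ = true; }
        cv_.notify_all();
        for (auto& w : workers_) w.join();
    }
private:
    void Run() {
        for (;;) {
            std::function<void()> task;
            {   // Latch protects the shared queue: all threads share one address space.
                std::unique_lock<std::mutex> g(latch_);
                cv_.wait(g, [this] { return done_ || !queue_.empty(); });
                if (queue_.empty()) return;   // only reachable once done_ is set
                task = std::move(queue_.front());
                queue_.pop();
            }
            task();   // execute one task (e.g., an operator pipeline over a morsel)
        }
    }
    std::mutex latch_;
    std::condition_variable cv_;
    std::queue<std::function<void()>> queue_;
    std::vector<std::thread> workers_;
    bool done_ = false;
};
```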
As far as I'm aware, no new system built in the last ten years is anything other than multi-threaded, with the exception of systems based on Postgres. There are a lot of systems that take Postgres, fork the code — as we sort of did — rip out bits and pieces, and add new stuff specific to the new system. In that case, unless they do what we did and make the effort of making it multi-threaded, they're all going to be process-per-worker. It's become less of an issue since Postgres added parallel queries; in the old days it was problematic. All right, so regardless of what process model we use, we still have to be in charge of deciding how to assign tasks to workers, because we want to make sure we assign tasks to workers that will operate on data that is physically close to them. That means we need to be aware of what the actual hardware looks like, in terms of the DIMMs and the sockets that we have. To understand this, we need to understand what the memory layout can look like, and the distinction is between uniform and non-uniform memory access. Uniform memory access (UMA) is the old way of building multi-socket or multi-CPU systems. The basic idea is that you have a bunch of CPUs, each CPU has its own local cache — SRAM, L1, L2, L3 — and to read data from the memory DIMMs you go over a shared system bus. You say, I want to read memory that's in this DIMM over here; the request goes up onto the bus, the bus figures out where the thing you want lives, goes and gets it, and brings it back down. In this environment, the cost of getting data from any DIMM is roughly the same. Remember, the hardware — especially x86 — goes out of its way to make your job as a programmer easier by hiding the complexity of multiple DIMMs and multiple sockets. It keeps everything cache-coherent and lets a single process run on multiple sockets at the same time, with every socket able to access any address in memory. So all of that is hidden from you. In a UMA environment, if a thread reads some memory over here, it costs the same through the bus as memory over there, and the bus handles all the cache invalidation for you as well — if I write something here and another CPU has it cached, the bus takes care of it. The main thing to understand is that the bus flattens everything out: everything costs the same, even if some memory is physically closer. This is the old way people built parallel hardware in a single machine. It might still exist in embedded devices, but modern x86 Xeons don't use it. They instead use what's called non-uniform memory access, or NUMA, and you'll hear me refer to this multiple times, in this lecture and for the rest of the semester. The way this looks is that, say we have four sockets, every socket has its own L1, L2, and L3 cache — in the multi-core Xeons, every core has its own L1 and L2, and the L3 is shared across the socket.
Each socket also has its own set of DIMMs that are physically close to it, so a CPU can read and write its local memory really fast. We still have the same guarantee from the last slide: it's one cache-coherent address space, even though our process may be running on multiple sockets — the hardware provides all that for us. But now, if we need to access memory that lives on another socket, we have to go through the interconnect: the request goes through the other socket to reach its memory region, the data comes back over the interconnect, and then we cache it. That memory is far away, so getting it is more expensive than getting memory from the local DIMMs. Normally we don't care about this — normally we don't even know about it. If I just malloc something, it shows up wherever; we'll talk about that in the next slide. But in an in-memory database we care about this a lot, because we need to decide where to run our tasks and our operators so that our threads are, as much as possible, operating on local data. As I said, this is how all modern multi-socket systems are built. Intel used to call the interconnect QPI, the QuickPath Interconnect; in 2017 they apparently decided that name wasn't good enough, so now it's UPI, the UltraPath Interconnect. On AMD it was HyperTransport in the old days and is now Infinity Fabric. I don't know what it's called on POWER or ARM, but the idea is basically the same. This idea of splitting up the database and storing it in different locations so that we maximize parallelism and locality is not specific to single-node systems or this architecture — it's a classic problem in distributed databases. How do I split my database up and store it on different nodes — in my case here, different sockets — so that when queries show up I can either route each query to the location that has all the data it needs, or split the query into sub-tasks or plan fragments and send those to the places that have the data? As I said, the hardware does all this work to make it easy for programmers not to worry about where data is stored, but because we're database people and we know better, we can exploit the syscalls and the information that the OS and the hardware expose to make sure we do things correctly. If you actually want to figure out where your data is located, there's a syscall called move_pages: you pass it a set of memory addresses and it comes back with the NUMA node — essentially the socket — that each page currently lives on. You can also use the same syscall to say, for a given address, here's the node I want it on, and it will move the page for you. So we could just load our database, let it land wherever it wants to land, and then go back and move things around as needed — the OS actually exposes that for us. But that would be pretty crappy for a one-terabyte database: load it all, then go back, figure out how to divide it up, and move it around.
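Here's a small sketch of both uses of move_pages on Linux (link with -lnuma; the single-page buffer and the target node 1 are just illustrative assumptions): first querying which NUMA node a page lives on, then asking the kernel to migrate it.

```cpp
#include <numaif.h>     // move_pages()
#include <sys/mman.h>
#include <cstdio>

int main() {
    const size_t kPageSize = 4096;
    void* buf = mmap(nullptr, kPageSize, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    *static_cast<char*>(buf) = 1;   // touch it so it is backed by a physical page

    // Query: with nodes == nullptr, move_pages just reports where each page lives.
    void* pages[1] = {buf};
    int status[1];
    move_pages(0 /* this process */, 1, pages, nullptr, status, 0);
    std::printf("page currently on NUMA node %d\n", status[0]);

    // Move: ask the kernel to migrate the page to node 1 (assuming that node exists).
    int target[1] = {1};
    if (move_pages(0, 1, pages, target, status, MPOL_MF_MOVE) == 0)
        std::printf("page now on NUMA node %d\n", status[0]);

    munmap(buf, kPageSize);
    return 0;
}
```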
What we really want is to do this as we're creating the data. So my question to you is: say I call malloc because I want to allocate a chunk of memory where I'm going to store my data — the data table. Assuming our allocator doesn't already have a chunk of memory available and has to go get new memory from the OS, what actually happens? Someone says sbrk. And yes, he's right — the allocator calls sbrk, which extends the data segment. But what does that actually do? The correct answer is: essentially nothing, yet. You go to the OS and say, I want more memory, and it makes the syscall that extends the data segment of your process. The key thing to understand is that this is just virtual memory. As I said before, virtual memory is not backed by physical memory until you actually try to use it. Even though we extended the data segment of our process, it's all just bookkeeping in the OS. It's only when we actually try to access that memory that there's a page fault and the OS says, all right, you really are going to use this memory, you weren't just lying to me, let me back the page you're touching with physical memory. So now my question is: where is that memory going to be physically stored — which NUMA node? We have two choices. The first is to let the OS do whatever it wants, and it can do something like interleaving, a round-robin approach: the last chunk of memory I allocated was on this socket, so the next chunk goes on the next socket, and around it goes. A better approach — what we probably really want — is called first touch: the page gets backed by physical memory on whatever socket the thread that first accesses that region of memory is running on. Wherever that thread runs, that's where we want our virtual memory backed by physical memory from the local DIMMs. There's a command called numactl — and there's a syscall for this as well — that lets you give the OS hints and tell it which policy you want (I believe first touch is the default on Linux). Again, the OS is trying to be a jack-of-all-trades; it's trying to support any application you throw at it — your web browser, your Bitcoin miner, your database system. But we can give it hints and bend it to our will so it does what gives us the best performance. This solves the problem I mentioned before: I want to load a one-terabyte database and have it go to the right DIMMs as it's being loaded, rather than going back afterwards, scanning everything, and calling move_pages to shuffle it around. I also think — and I don't know for certain whether this is true — that on some systems, if the OS sees one thread repeatedly writing to a memory location in a remote NUMA region, on another socket, it will migrate the page for you. But for OLAP queries the data is almost never written; it just gets read a lot.
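To make first touch concrete, here's a rough sketch (Linux-specific; the assumption that CPU 8 belongs to socket 1 is just an example of knowing your machine's topology) where the loader thread pins itself to the socket that should own a table chunk before touching its pages:

```cpp
#include <pthread.h>
#include <sched.h>
#include <cstdlib>
#include <cstring>
#include <thread>

// Pin the calling thread to one CPU so that first-touch allocation lands on
// that CPU's socket (assumes we know which CPU ids belong to which socket).
static void PinToCpu(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

int main() {
    const size_t kChunkBytes = 64 * 1024 * 1024;   // one table chunk
    char* chunk = static_cast<char*>(std::malloc(kChunkBytes));

    std::thread loader([&] {
        PinToCpu(8);                      // assumption: CPU 8 lives on socket 1
        // First touch: writing the pages here backs them with physical memory
        // from socket 1's local DIMMs under the default local-allocation policy.
        std::memset(chunk, 0, kChunkBytes);
    });
    loader.join();

    std::free(chunk);
    return 0;
}
```

Equivalently, you can launch the whole process under numactl (for example, numactl --cpunodebind=1 --membind=1) to constrain both where its threads run and where its memory gets allocated.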
We may not want to rely on that automatic migration, though, and we might as well do the placement ahead of time since we know exactly what's going to happen. So let's look at the performance difference you can get from being smart about where you schedule your tasks or threads relative to where the data is located, for both OLTP and OLAP workloads. In OLTP workloads, transactions touch a small amount of data and maybe update a small amount of data. Here we have four different placement choices. This is from a paper by the Shore-MT people at EPFL — Anastasia Ailamaki and her students. They configure the OS to either spread the data evenly across four sockets, group all the data onto a single socket, split it across two, or let the OS do whatever it wants. They run just the Payment transaction from TPC-C, which is a fairly write-heavy workload for this benchmark. No surprise: because transactions would otherwise touch data spread across all four chunks, putting everything on the same socket gives the best performance. The OS does sort of okay — actually, not much better than the worst case. By knowing exactly what the transactions are going to do, storing the data in the right location, and making sure threads get assigned to those locations when they run, we get the best performance. So for OLTP the impact is real — we're getting better performance — but the improvement isn't huge: we go from roughly 8,500 to maybe 10,000 or 11,000 transactions per second, a modest improvement. Whereas the rule-of-thumb difference between a local memory access and a remote memory access over the interconnect is something like 3x. So let's look at an OLAP workload. This is an experiment that some former students of mine did a few years ago — I think it was part of a project for 15-618. They built a little query engine and ran it on a beast of an eight-socket machine in the PDL. It's a bit older, but it was the only eight-socket machine we had around. For this experiment they have a database with a single table of 10 million tuples, and they do a sequential scan over the entire table, filtering out some tuples, to see how fast they can go. The size of the table stays the same, but along the x-axis we scale up the number of concurrent threads, and the little vertical line is where we run out of physical cores and start running on hyper-threads — the virtual threads. At the very beginning there isn't much difference in performance: since the size of the database is fixed and the number of threads is small, the probability that a thread's data is on its local NUMA node is still fairly high. But as the thread count goes up, that probability drops, you go over the interconnect more and more, you essentially saturate it, and that's why you stop getting any performance improvement as you add more cores.
The plateau at the top is because hyper-threading isn't real parallelism when you're CPU bound — it's not like we're waiting on I/O from disk, so it doesn't help us here. The red line is when we control where things are stored and make sure threads only read data that's local to their socket, and the other line is random placement. So again, this matters a lot. This is almost a best-case scenario because it's a really simple workload; in a real database system there are other things running that will make the gap less dramatic, but it shows that this matters, and we're going to make sure we store our data correctly and assign and schedule our tasks correctly. Okay — sorry, question? No? Okay. We're not going to talk much about partitioning here; we'll touch on it when we talk about parallel hash joins. For now, assume we have a way to intelligently divide our tables up and store the pieces on different sockets. The easiest thing to do is round-robin: as each chunk shows up, it goes to the first socket, the next one goes to the second socket, and so on. That divides the data evenly across all sockets, but we give up high-level semantics about what the data looks like that we may need later when we start running queries. Alternatively, if we partition based on the actual attributes, then we can be sure that when we do a join, the inner table and the outer table are already partitioned on the join key on the same socket, so we don't need to communicate with other sockets to do the join. That's the best-case scenario. This is why, even on a single-node system, we still want to do partitioning — which is very common in distributed systems — to divide the data across our sockets. Placement is then the step after you've generated your partitions: deciding where to put them. Typically you create more partitions than you have sockets, and then you need some way to decide how to assign them. There are different trade-offs in how to do this. Round-robin is probably the easiest. You could also try to balance load: if you know most queries will filter out most of the data on one socket for whatever reason, so that socket will be computationally less loaded, maybe you assign more work to it than to the others. All right, so what do we have at this point? We have a process model, we have a worker allocation model — I actually skipped that — we have task assignment, and we have data placement. We know how we're going to take tasks and assign them to NUMA regions or sockets because we think that gets the best performance. Now we need to figure out how to take the set of tasks from a logical query plan, split them into sub-tasks, and actually run them.
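Before moving on, a tiny sketch of the two placement ideas just mentioned (the helper names are hypothetical): round-robin spreads chunks evenly with no semantic meaning, while hashing on the join key co-locates matching tuples of both tables on the same socket.

```cpp
#include <cstdint>
#include <functional>

constexpr int kNumSockets = 4;

// Round-robin: even spread, but no semantic relationship to the data.
int RoundRobinSocket(uint64_t chunk_id) {
    return static_cast<int>(chunk_id % kNumSockets);
}

// Attribute-based: tuples of R and S with the same join key land on the same
// socket, so a later R-join-S can run without crossing the interconnect.
int JoinKeySocket(uint64_t join_key) {
    return static_cast<int>(std::hash<uint64_t>{}(join_key) % kNumSockets);
}
```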
We'll discuss this more later in the semester, but a logical query plan is basically what you get from the parser's abstract syntax tree, and the optimizer then generates a physical plan; we're now figuring out how to execute the operator instances in that physical plan. For OLTP queries this is easy because there's no parallelism to exploit — it's go grab this single tuple, and we can't parallelize that. For this lecture, and for most of the rest of the semester, we're focused on OLAP queries, because that's where we do large sequential scans that we can run in parallel on multiple cores and multiple sockets. So again, we need to take the worker threads we have and the tasks we need to execute for our queries and match them up. The easiest thing to do is called static scheduling: before you even start executing the query, you assign each task to a thread, and nothing changes while the query is running — no matter whether the data distribution turns out to look different from what you expected, or other queries show up at the same time, you say this is the plan and I'm sticking with it. It's the easiest kind of scheduling to do. The problem is that we may have assumed the data was uniformly distributed with respect to our filters. We think we have a one-terabyte database split across four sockets, every socket will filter out the same amount of data and do the same amount of work, but when you actually start running it you may find that's not the case at all — one socket gets oversaturated, inundated with more work than the others, and those other sockets sit idle. That's what dynamic scheduling solves for us: it lets us decide on the fly how to assign tasks to worker threads based on who is available to do the work. The paper I had you read was the one from the HyPer guys on morsel-driven scheduling. A morsel is a HyPer term, not a standard database term. I think they didn't want to use "partition" because that word is used all the time, and they didn't want to use "block" because that's also used all the time, so they call these things morsels — slightly bigger than what we normally think of as a block, smaller than what we normally think of as a partition. What they do is have one worker per CPU core, pool-based task assignment — meaning the worker threads look in a queue, pull out a task, execute it, and when they finish they go back and pull another one — and round-robin data placement across the sockets to spread things out over the different NUMA regions.
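As a sketch of what round-robin morsel placement might look like with libnuma (link with -lnuma; numa_alloc_onnode and numa_num_configured_nodes are real libnuma calls, while the surrounding structure is my own assumption):

```cpp
#include <numa.h>       // numa_alloc_onnode, numa_num_configured_nodes
#include <cstdint>
#include <vector>

struct MorselStorage {
    void*  data;
    size_t bytes;
    int    numa_node;   // remembered so the scheduler can match tasks to it later
};

// Allocate each morsel's backing memory on the sockets in round-robin order.
std::vector<MorselStorage> AllocateMorsels(size_t num_morsels, size_t morsel_bytes) {
    int nodes = numa_num_configured_nodes();
    std::vector<MorselStorage> morsels;
    for (size_t i = 0; i < num_morsels; i++) {
        int node = static_cast<int>(i % nodes);
        void* buf = numa_alloc_onnode(morsel_bytes, node);
        morsels.push_back({buf, morsel_bytes, node});
    }
    return morsels;
}
```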
We won't talk about this now — we'll cover it when we talk about parallel hash joins and parallel sort-merge joins — but they also have what we'd call NUMA-aware operator implementations. As the query plan executes, the operator itself can recognize, I'm reading data that's local to my NUMA region, so I should do things one way, or I'm reading data from another NUMA region, so I should do things another way; and likewise, when I write data out, I make sure I write it to my local NUMA region and not somewhere else. You have to be careful about all of this, because across a lot of data and a lot of tasks, getting it wrong causes problems. They don't have a separate dispatcher thread: as I said, they do pool-based scheduling with a single global task queue, and everyone grabs things out of it as needed, so you don't need a dispatcher to assign work to the different threads. What happens is that when a thread needs work to do, it looks in the task queue and prefers tasks that operate on data local to it — in the same NUMA region. If no such tasks are available, it does work stealing: it looks at tasks whose data is not in its local region, plucks one out, and runs it. The idea is that you prefer work that's local to you because that gives the best performance, but if no local tasks are available, you take other people's work rather than sit idle. So let's look at the example. The first thing we need to do is divide our table up into morsels. Say we have a really simple four-column table. A morsel is just all the attributes for the tuples within a single segment, stored together as one unit, and each morsel is assigned to one CPU socket. I think in the paper a morsel is about 100,000 tuples. For reference, in the old Peloton code we did a thousand tuples per block; in the new system we do one megabyte of data per block, so that 20 bits of an address cover the offset within a block — we can talk about that offline. In H-Store we did 10 megabytes per block. There's no magic or deep reasoning behind those numbers; we just picked them. In their world, they argue 100,000 tuples is the right size because it gives you the right amount of parallelism without the morsels being so large that they cause bottlenecks — things are spread out and divided enough that you can run them in parallel. So now we actually want to run these tasks. Again, we have our query plan here; each of these nodes is an operator, and we divide them up into operator instances. For this one we want to build a hash table on B — say we're doing a Grace-style, partitioned hash join. Each of these could be a task that runs on some morsel of data: we divide B into three morsels and A into three morsels. Each socket — or each core — has a directory of the morsels that live in its local memory. There's an internal catalog that says, if you want the morsel with data in these ranges, here's where it's stored. Actually no — they're doing round-robin partitioning, so there are no higher-level semantics about what's inside these morsels.
You just know there's a morsel here. Each socket also has a local buffer — a local memory region where it stores the intermediate results it generates as it executes tasks. So I want to start running: say I pick these first three tasks, where we want to build a hash table on A. Say socket one has morsel one, socket two has morsel two, and so forth. The threads look at the global task queue and say, these are the tasks operating on data that's local to me, let me grab those and execute them. They pull the data up through their local memory, operate on it, and store the results they produce back in their local buffer. Now let's say the first two threads finish very quickly — say we're doing a filter and they end up filtering out 99.9% of their data, so they don't spend much time building the hash table — but the third one, for whatever reason, has a filter that isn't very selective, so it runs slower. The first two threads go back to the queue and say, I have these other two tasks I want to do: the build-side partitioning for B, morsels one and two. For these they don't need the intermediate results in their buffers — the output of the old tasks is still sitting there. Now say the first thread completes again and goes back to the task queue, and at this point it recognizes that the slow thread is still running and the remaining task belongs to it. So it steals that task and starts running it, and the idea is that by the time the slow thread finishes its current work, the stolen task will also be finished. We finish all the tasks — all the operator instances — we need within this stage before we move on to the next stage of the query plan. In the Hadoop world these slow tasks are called stragglers, and this is basically speculative execution: you assume the slow guy isn't going to finish in time, so someone else runs the remaining work. Even though the stealing thread has to fetch memory that's remote, by the time the slow thread finishes, the stolen task finishes as well. That's basically what's happening here at a high level. Note that the stealing thread writes the output of that task to its own local buffer, even though it read the data from the remote socket. And again, there's no centralized coordinator figuring this all out. We know what the tasks are for our query plan, and we know which task's output is the input for which other task, so we can keep track of how to stage these things. The work stealing deals with stragglers because we only have a single thread per core, and that core can only be processing one sub-task at a time; to avoid stragglers we have to let other threads steal tasks from the global work queue. In the paper they talk about using a lock-free hash table to maintain these global work queues. We'll talk about lock-free hash tables after spring break — there's really no magic to them, you just do the compare-and-swap stuff we've talked about before. So this is the approach of doing work stealing with a single global queue for the entire system.
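Here's a minimal sketch of the scheduling policy just described — not HyPer's actual implementation (they use a lock-free structure, while this sketch just uses a latch): each worker prefers tasks whose morsel lives on its own socket and falls back to stealing remote ones.

```cpp
#include <deque>
#include <mutex>
#include <optional>

struct MorselTask {
    int numa_node;                   // socket whose DIMMs hold this morsel
    void (*run)(const MorselTask&);  // fused operator pipeline to execute
};

class GlobalTaskQueue {
public:
    void Push(MorselTask t) {
        std::lock_guard<std::mutex> g(latch_);
        tasks_.push_back(t);
    }
    // Prefer a task local to `my_node`; otherwise steal any remaining task.
    std::optional<MorselTask> PullFor(int my_node) {
        std::lock_guard<std::mutex> g(latch_);
        for (auto it = tasks_.begin(); it != tasks_.end(); ++it) {
            if (it->numa_node == my_node) {
                MorselTask t = *it;
                tasks_.erase(it);
                return t;
            }
        }
        if (tasks_.empty()) return std::nullopt;
        MorselTask t = tasks_.front();   // work stealing: remote morsel
        tasks_.pop_front();
        return t;
    }
private:
    std::mutex latch_;
    std::deque<MorselTask> tasks_;
};
```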
Now I want to talk about another approach that doesn't do work stealing in the same way the HyPer guys propose, and that has multiple worker queues — one per socket, per NUMA region. This is a research paper published in VLDB, written by a PhD student who had an arrangement where he basically got paid by SAP to work on HANA while getting his PhD through a university. So instead of getting paid like crap the way a PhD student does in the United States, they get paid real money and they get to do research, but the company owns everything — which may or may not be a good deal. What they do is have different types of queues for each socket: there's a notion of a hard queue and a soft queue. Threads from other sockets cannot steal work from the hard queue, but they can steal from the soft queue. And instead of having one pool of threads that can only do query task execution, they have these thread groups and queues for all the different parts of the entire system. What I mean is: in the HyPer approach, the threads we just discussed only run queries, and there would be a separate thread pool for networking, another for background garbage collection and other maintenance tasks. In the HANA paper, they have thread pools that can be shifted around as needed across all the different parts of the system. So if I notice right now that there aren't many networking requests, maybe I scale down the number of networking threads and increase the number of background worker threads doing garbage collection and other things. That's why I like this paper: it's a really interesting idea about what the threading model should look like for the entire system, rather than just focusing on query execution. Because you now have this notion of moving things around, you need some way to coordinate all of it. They have what's called a watchdog thread that goes around and checks the queues for the different components in the system, and it can change the status of threads based on how it thinks the system should be rebalanced to get the best performance. So the basic idea: for every thread group on a socket you have a soft and a hard priority queue. You're allowed to steal tasks from other thread groups' soft queues, but you're not allowed to steal from their hard queues. Within a thread group there are four pools of threads in different states. The first state is working: the thread is executing a task it pulled from the queue. Then there are inactive threads, which are blocked inside the kernel — for example waiting on a latch — so the OS isn't scheduling them for execution. Then there are free threads: there's no active task for them to work on right now, but they spin and check the queue every so often to see whether there's something new to execute. And the last state is a parked thread, which is like free, but it's not allowed to wake up and check the queue on its own.
The watchdog thread has to come along and poke a parked thread every so often. So this sounds very complicated, right? What does it sound like? Exactly — an operating system scheduler. This is a good example of where we can potentially get better performance and manage the system better because we control exactly how all these threads are scheduled. Of course, it also means we're now building a whole scheduler in our database system on top of the OS's scheduler, and we need to make sure we understand what the OS wants to do versus what we're trying to do. The way they make this NUMA-aware is that threads can move around in the system, but they get pinned to a CPU depending on whether the task is CPU bound or memory bound. If the task I'm trying to execute is memory bound, I pin it to a core on the socket where its memory is located; if it's CPU bound, maybe I let it jump around to different sockets and cores, because then the OS can run it on whatever core happens to be idle. The main takeaway — the more interesting aspect of this paper, which came after the morsel paper — is their observation that once you get to machines with a really large number of sockets and CPU cores, the work-stealing approach the HyPer guys describe actually becomes detrimental, because the overhead of moving data around at that scale within a single machine becomes problematic. I think in the HyPer paper they looked at machines with at most two or four sockets; the HANA guys are talking about machines with something like 64 sockets, or even more. As I said before, the HANA guys mentioned a customer running on a machine that maxed out the 2^48-byte address space — even that wasn't big enough — and that thing had a ton of sockets. The other thing, as I said before, is that they use these thread groups for everything, not just query task execution: networking, garbage collection, and so on — that part I really like. This might be overkill for what we're trying to build in our own system, but if you want complete control over everything going on in your system, you want something like this. And I suspect the major vendors — DB2, SQL Server, and Oracle — do something similar. All right, let's look at a really high-level picture of what this looks like. Say this is a single thread group — there's one thread group per socket — and the threads are split into the different pools: working, inactive, free, and parked. Up above we have the task queue: these are all the tasks in the system right now. A bunch of them get assigned to the soft queue — things we want to execute on this NUMA region, but we're okay with somebody else coming along and stealing them. And then say we have some background tasks like garbage collection, or for whatever reason some sub-tasks of this query plan, that go on our hard queue, which prevents anybody else from stealing them.
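A rough sketch of the hard/soft queue idea — the names and structure are my own, not the paper's: each socket's thread group exposes one queue that remote threads may steal from and one they may not.

```cpp
#include <deque>
#include <mutex>
#include <optional>

struct SchedTask { void (*run)(); };

// One of these exists per socket (per NUMA region).
class ThreadGroup {
public:
    void PushSoft(SchedTask t) { std::lock_guard<std::mutex> g(latch_); soft_.push_back(t); }
    void PushHard(SchedTask t) { std::lock_guard<std::mutex> g(latch_); hard_.push_back(t); }

    // Local threads may take from either queue; hard tasks first.
    std::optional<SchedTask> PullLocal() {
        std::lock_guard<std::mutex> g(latch_);
        if (!hard_.empty()) { SchedTask t = hard_.front(); hard_.pop_front(); return t; }
        if (!soft_.empty()) { SchedTask t = soft_.front(); soft_.pop_front(); return t; }
        return std::nullopt;
    }
    // Threads from other sockets may only steal from the soft queue.
    std::optional<SchedTask> Steal() {
        std::lock_guard<std::mutex> g(latch_);
        if (soft_.empty()) return std::nullopt;
        SchedTask t = soft_.front();
        soft_.pop_front();
        return t;
    }
private:
    std::mutex latch_;
    std::deque<SchedTask> hard_;   // NUMA-local only; never stolen
    std::deque<SchedTask> soft_;   // may be stolen by other thread groups
};
```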
In the paper they claim that if you're running on a machine with a large number of sockets, you want to put everything in the hard queue and completely disable work stealing. So the working threads are actively pulling from some work queue, the inactive threads are stalled in the OS, the free threads spin and check every so often whether there's something for them, and the parked threads have to be poked by the watchdog thread and told, hey, go check the queue. Now, what happens if a free thread checks the queue and sees there's something for it to execute? It's allowed to move itself into the working state. Then when it completes the task it pulled out and comes back, if it sees there's nothing else for it to do, it goes back to being free. All right, any questions about this? So again, the idea of dynamic scheduling inside the database system is essentially to have the threads that run tasks operate on data that's local to them, in their NUMA region, on the same socket, and then to decide whether we allow threads to steal work from other threads — work whose memory is remote to where the stealing thread is running. All right, to finish up: what do we do if we end up saturating our system because clients are sending requests faster than we can process them? This is a problem because we become overloaded — requests arrive at a rate faster than we can process them — so we need to handle it in our scheduling system. Again, the OS isn't going to help us. If we're CPU bound, there's really nothing it can do. If we're memory bound — meaning we're about to run out of memory — the OS has this thing called the out-of-memory killer, which picks a process (usually the database system, since it's using the most memory), kills it, and frees all its memory. That's the worst case for us, because we're assuming the database system is the only thing running on this machine and it's the most important thing — even if it weren't the only thing on the machine, it would still be the most important thing — and the OOM killer comes along, shoots us in the head, and takes all our memory, which is bad. And if you're CPU bound, it ends up being essentially the same as memory bound, because requests keep arriving, the queues get longer and longer, and eventually you run out of memory anyway. The easiest way to solve this in the database system is to throw up our hands and kill ourselves — which is obviously bad, because now we have downtime, and it doesn't really solve the problem. The general problem is called flow control. We're not going to get into queuing theory or the more theoretical aspects of handling this; we're just going to talk about the engineering approaches we can use in our system. The two ways to handle it are admission control and throttling. Admission control is: you recognize that you have back pressure, meaning you can't process requests as fast as they're arriving.
So you prevent the clients from sending new requests by immediately rejecting or aborting them: they send a request and you immediately come back with a failure message, and hopefully the client is smart enough to recognize it should back off for a period of time. At least in Postgres, and in a lot of database wire protocols, there's no notion of back pressure — there's no hint you can send back that says, I'm rejecting this because I'm overloaded. Some newer systems do support this, but traditional ones don't. The other approach is throttling: you execute queries just as fast as you did before, but when you get the result, you hold on to it and don't send the response back right away. You're assuming the clients are using a synchronous model, where they send a request and wait for the response before sending the next one. By waiting a little bit before you send the response, you slow down how quickly they can come back and fire off another one. Both are simple approaches, and you kind of need both of them in your system. The second approach, throttling, doesn't work if you have an asynchronous submission scheme, where clients are allowed to submit a request and then go off and do other work while waiting for the response through a callback — some newer systems support this. If clients are doing that, throttling doesn't really help you, because they'll fire off the next request without waiting. So we want both of these approaches in our database system. It's not entirely clear whether this belongs on the networking side — it's really both a networking protocol issue and a scheduling issue — but I'm bringing it up here because now we understand all the pieces we care about. All right, to finish up: the database system is a beautiful creature, and we want it to be uninhibited by the operating system — we want to let it do whatever it wants to do, which means we want to manage everything ourselves. But to do this correctly on a modern machine, we need to understand what the hardware actually looks like, and the main thing we care about for an in-memory database is the NUMA layout: the sockets, and where memory is actually being stored. We can assume the CPUs all have the same performance — a computational task should run the same on one socket as another — but if it has to read memory, where it runs makes a big difference. So the way we do this is to track where memory is being stored and assign tasks to the socket that holds the data they're going to process. And we know what data they're going to process, because we're the ones who created the tasks when we divided up the query plan. As much as possible, we don't want the OS to rain on our parade, so we reduce how much we rely on it, and where necessary we give it hints to make sure it does the right thing for us. All right, any questions about scheduling?
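Before we switch topics, here's a small sketch of the admission-control idea from a moment ago (illustrative only — a real system would signal the overload through its wire protocol): a server that rejects new work once the number of pending queries crosses a threshold.

```cpp
#include <atomic>
#include <cstdio>

class AdmissionController {
public:
    explicit AdmissionController(int max_pending) : max_pending_(max_pending) {}

    // Called when a new query arrives; returns false if we are overloaded.
    bool TryAdmit() {
        int cur = pending_.load();
        while (cur < max_pending_) {
            if (pending_.compare_exchange_weak(cur, cur + 1)) return true;
        }
        return false;   // back pressure: caller sends an "overloaded" error to the client
    }
    void Finish() { pending_.fetch_sub(1); }   // called when a query completes

private:
    const int max_pending_;
    std::atomic<int> pending_{0};
};

int main() {
    AdmissionController ac(/*max_pending=*/2);
    for (int i = 0; i < 4; i++) {
        if (ac.TryAdmit()) std::printf("request %d admitted\n", i);
        else               std::printf("request %d rejected (overloaded)\n", i);
    }
    return 0;
}
```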
All right, so in the last five minutes I want to talk about the extra credit assignment. I'm going to post this on Piazza today. This is not something you have to worry about right now, because obviously the midterm is on Monday, but you should be thinking about it over spring break. We're doing the same thing we did last semester in the intro class, where I'll give you 10% extra credit on your final grade for the semester if you write an encyclopedia-style article about your favorite database management system. It's sort of like Wikipedia, but instead of free-form, semi-structured text, the way it works is that we have a set of features of the database system where you select which options it supports, and then you write a description of how it actually implements those options. If you read Wikipedia, you'll sometimes see people use different terms for two-phase locking or MVCC or OCC or indexes. Sometimes they call it a B+tree, sometimes a B-tree, where there actually is a distinction, but on Wikipedia people use the terms interchangeably. So the idea is that we have a standardized taxonomy I've defined for all these different features; you select those options, then write about how the system implements them and provide citations.

The website is dbdb.io, the Database of Databases. This screenshot is old: it says 560 up there, but we're up to about 583 now. It has every database system I'm aware of, academic, open source, and commercial, including some historical ones. You can go along the side here, select different options, and filter down to the ones you want. The articles look like this: there are the different features, what the data model is, how it does checkpoints, and so on, and you select the options for how the system does things and then write about them. All the articles are hosted on this website; like Wikipedia, there are revisions, your changes are tracked, and the article is assigned to you. So what I will do is post a link on Piazza: I'll add a new sheet to the Google spreadsheet where you can sign up for the system you want to cover. It's first come, first served, and since this is the second time we've done this, a bunch of systems have already been filled out, so make sure you pick a system that hasn't already been covered. You should have no problem finding a database system you like; as I said, I'm aware of 583 different database systems. If you're interested in graph databases, we have those. If you're interested in distributed databases, we have those. Think of any aspect of a system you care about: there's a database system that does it. If you like blockchain, there's a bunch of blockchain databases; I learned about a new one on Sunday, or I guess it was yesterday. So there's a ton of systems, and you should have no problem finding one that hasn't been covered and piques your interest. What I will say, though, is that you should be careful, because some of the systems on there were startups that were around for two years and then went under or got bought. So email me and ask questions, because I'm the one who populated the database and I know roughly where these systems came from. I can tell you whether your pick is going to be a good idea or a lot of work.
Last semester somebody picked this thing called Cornerstone, which came from a video game company in the 1980s that decided to get into the database business. They had a database product for about three years and then it went under, so she was looking for documentation from 1986 to 1989. She actually reached out to the original authors and tried to find people who could talk about it, but that was a nightmare. The other thing I'll say is that some systems are obviously super well known, like Oracle. If you pick Oracle, the expectation is that your article will be very complete and very thorough. The 1980s one was an extreme example, but if you pick one that's a bit more obscure, you may have a harder time finding things, and in that case you should describe in your write-up what you actually looked for. Also, if the system is still active and you can't find the information you want, go contact them: email the developers and ask, hey, how do you actually do this? And they'll tell you. Last semester we had someone contact TileDB, and the co-founder of the company was helping her figure out how they actually do things. The issue was that he was describing how the system works to her over email, but she couldn't cite the email. So she went to the company's message boards, asked the same question there, and he copied the email answer into the forum so she could cite it. So it's things like that; I understand that not everything is going to be easy to find. Most of the documentation is in English too, I would say; that's the other thing that surprised me. There are obviously systems from other countries whose documentation is only in another language: the Japanese ones are hard to find, there's one Russian one I couldn't read, and the same goes for most of the Chinese ones, but some of you will have no problem with those. So again, have fun with this; it's not meant to be super hardcore. Ask the people who did it last semester: was it fun? He says okay, okay, that's fine.

All right, I have to say this, and we can talk offline about why: please do not plagiarize. What does that mean? Do not copy and paste shit from Wikipedia. Do not copy and paste from the system's documentation and post it as your description. I don't care that you cited it; that's not how this works. You can't just copy things and say, oh, I cited it, I'm done. Everything has to be written in your own words. I don't care if you think your English is not as good as theirs, or that you can't write as well as them; it has to be in your own words, okay? And this applies both when you submit it for your final grade and when, as I did last semester, you submit it halfway through for review so I can give you feedback and say, hey, fix this, fix that, think about this. That has to be all in your own words too. And I will know when it's not in your own words, because there will be words you start using that I don't even use in my day-to-day vocabulary, and I'll think, there's no way. So some things will be really obvious. Just don't do it, and we'll all be fine, okay? Next class, midterm. It'll be fine. Any questions? Thank you.