So real quickly about Project 2: if you haven't signed up for a group yet, please go do that. The other thing we're going to change is that the number of people submitting Project 1 at the very last minute was higher than I wanted, and that's partly my fault for not warning you ahead of time not to wait until the last minute. So to avoid the same issue for Project 2, we'll have a checkpoint halfway through where you just have to do inserts and reads, so you have to do splits, and that'll be 25 or 30% of the grade. Then we'll have the regular submission deadline, which stays the same, and that will have to pass the full test suite. Again, it's a forcing function for you to start looking at this sooner rather than later, so you're not trying to write a thread-safe in-memory index one day before it's due, because that's going to be a bad idea, okay?

All right, so today we're going to talk about scheduling. Last class was about how to take a query request from the client, get it into our system, run it somehow (we didn't say how, we're getting there), and then take the result of the query and shove it back over the wire to the client. Today we're starting the next chapter: how we're actually going to execute those queries, and we're going to focus on scheduling. We're still not actually executing the scans or reading our indexes yet, but we're getting closer.

So we need to define some terms we're going to use going forward to describe what query execution looks like. A query plan is comprised of operators, and you can think of these as roughly relational algebra operators; we'll talk about how we do query planning in a few more lectures. The basic idea is that we get a SQL query like this, and we can represent it as a directed tree structure comprised of operators. At the lowest level we have access methods that scan the table or scan the index and feed tuples up into the next operator, and the results percolate to the top.

Now, a specific invocation of an operator in our query plan is defined as an operator instance. So for a given operator in our query plan, like "scan table A," we can have multiple instances of that operator, which lets us execute the scan in parallel. And the way we're going to group operator instances together for execution is into units we'll call tasks. The idea, like in the pipelined model, is that we want to string together as many operator instances as we can within a single task or pipeline so that we don't have a context switch going from one to the next. So for a scan on A, I can do the filter, and then depending on what join algorithm I'm doing, say a hash join, the pipeline can also build the hash table for me. I'm just passing intermediate results up from one operator instance to the next.

So again, at a high level, our job today is to figure out how we're going to schedule these tasks: where they're going to execute and what they're actually going to do. For every query plan that shows up, we have to decide where, when, and how to execute it. And one of those decisions is: how many tasks should we actually use for our query plan?
If we have some number of cores, we could assign one task per core for our query plan, or we may want to oversubscribe and have more tasks than cores so that we have more flexible scheduling. Then there's the question of how many cores we actually want to use: we can have X tasks and Y threads or cores, and we decide how many cores to use. Then we have to decide what core a task should actually execute on. And when that task completes and produces an output, we need to decide where that output is going to go. We've been vague about this so far. We say, oh, we've executed this query, we get back a result. But as we start executing the tasks that the query is comprised of, we have to decide where each task's output goes. And as we talk about different memory architectures, we need to be aware of where that memory is physically located, where we're reading from and writing to, so that we get the best performance.

The reason we're going to do all of this is that the database system knows exactly what the query is, knows exactly what the tasks are, knows what threads it has available, and knows where the memory or the data is actually located. So we're in the best position to decide how to execute this query plan efficiently. The OS doesn't know any of this.

Question? How do we define a pipeline? So the question is, how do we define what a pipeline is? We'll come back to that next class. But the basic way to think about it is: how far can I take a single tuple, or batch of tuples, up the query plan before I get to a point where I can't go any farther? So in this case here, scan A, filter, and say you're doing a hash join: I can build the hash table on this side of the join, but I can't actually produce any result for the projection because I need the probe side. So this is called a pipeline breaker.

Yes? Is it actually a pipeline breaker? Because if you have both sides of the pipeline feeding into it, you can start generating some matches and sending them up to a higher level. So the question is, is this truly a pipeline breaker? If the pipeline breaker is here, and I have to materialize the output from this side and shove it into my hash table, then I can't go higher in the tree until I have all the data from that side. So there's a pipeline here for this side, but the pipeline for B can go all the way up. This will make more sense next class. The main thing I want you to understand is that a task is what we're going to schedule, a task can be comprised of multiple operators, and we're roughly going to base tasks on pipelines. And again, the main takeaway from all of this is that we're not going to rely on the OS to do any of this for us. The database system always knows better, so it's going to make the better decision.
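To make the task idea a little more concrete, here's a minimal sketch; the operator interface, types, and names below are made up for illustration and aren't from any real system. The point is that the filter and the hash-join build are fused into one pipeline over a single chunk of tuples, so intermediate results never leave the task.

```cpp
#include <cstdint>
#include <functional>
#include <unordered_map>
#include <vector>

struct Tuple { int64_t key; int64_t val; };

// One task = one pipeline (filter -> hash-join build) applied to one chunk.
struct Task {
  const std::vector<Tuple>* chunk;                        // tuples this instance scans
  std::function<bool(const Tuple&)> filter;               // filter operator instance
  std::unordered_multimap<int64_t, int64_t>* hash_table;  // join hash table being built

  void Run() const {
    for (const Tuple& t : *chunk) {
      if (filter(t)) {                         // predicate from the WHERE clause
        hash_table->emplace(t.key, t.val);     // hash-join build side
      }
    }
  }
};
```

Two instances of this task over two different chunks of table A could then be handed to two different workers so the scan runs in parallel.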
And then we'll talk about the different techniques for doing dynamic scheduling in a database system. The paper you read on HyPer is an example of dynamic scheduling, and we'll contrast it with something like HANA, which has different types of worker pools and different stealing versus no-stealing policies. So that's our focus for today, for this part here. But to make decisions about how we design our database system's scheduler, we need to be aware of how we built the system underneath it, okay?

All right, so let's first talk about how we're going to assign workers to computational units in our database system. The process model defines how the system supports concurrent requests from a multi-user application. Meaning our application could send multiple queries at the same time, multiple transaction invocations at the same time, and we need a way to decide how we're going to interleave them at the low-level hardware level. We're not talking about the logical level, like which transaction is allowed to read which piece of data. This is: how do we take a task and actually assign it to some worker, and what is that worker?

I'm using the term worker. I'm occasionally going to slip up and say thread, and for the systems we're talking about, it is going to be a thread. But worker is a high-level term for a component in the system that can execute tasks, and that could be either a process or a thread. I'll partially take back the claim that the database system doesn't need to know which one it is: certainly, if it's a process, I need to know how to communicate with other processes, because they're not in my address space. But basically a worker is just a way for us to execute tasks and return results to the application.

There's a great book written over a decade ago by Mike Stonebraker, Joe Hellerstein at Berkeley, and the guy that runs much of AWS, James Hamilton, called Architecture of a Database System. It's in the context of a disk-oriented system, but it lays out, with a nice clean abstraction, how you actually design the system at this low execution level.

The three approaches we're going to talk about are process per worker, process pool, and thread per worker. We covered this in the intro class, but I want to go over it again. The spoiler is that for all the systems we're going to talk about today and for the rest of the semester, it's going to be the last one, because that's the most common one in modern systems.

Process per worker is where every single worker in our system is a separate OS process. That means that when a request shows up, it goes through some kind of dispatcher, which hands the connection off to a worker that will then read the socket, take any request it gets from the client, and execute it on the database system. All the scheduling here is managed by the operating system, because we're just calling fork to start a new process, and we have no direct control over when it runs.
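Here's a hedged sketch of what process per worker looks like at the systems level, assuming a toy dispatcher; the port number and the ServeConnection stub are made up. The key point is that the dispatcher just forks, and from then on the OS decides when each worker process runs.

```cpp
#include <netinet/in.h>
#include <sys/socket.h>
#include <sys/wait.h>
#include <unistd.h>

// Stub worker loop: read queries from the client socket, execute, send results.
static void ServeConnection(int client_fd) { /* ... */ close(client_fd); }

int main() {
  int listen_fd = socket(AF_INET, SOCK_STREAM, 0);
  sockaddr_in addr = {};
  addr.sin_family = AF_INET;
  addr.sin_addr.s_addr = INADDR_ANY;
  addr.sin_port = htons(5432);                    // hypothetical port
  bind(listen_fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr));
  listen(listen_fd, 64);

  while (true) {
    int client_fd = accept(listen_fd, nullptr, nullptr);
    if (fork() == 0) {                            // child process = the worker
      close(listen_fd);                           // worker only talks to its client
      ServeConnection(client_fd);
      _exit(0);
    }
    close(client_fd);                             // parent keeps dispatching
    while (waitpid(-1, nullptr, WNOHANG) > 0) {}  // reap any finished workers
  }
}
```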
The tricky thing with this approach is that my dispatcher may not know exactly how much work these workers are doing. I have to rely on the OS to do the throttling and flow control, because these processes can just do whatever they want, or I need to write extra code to have them communicate back to the dispatcher so it has a centralized view of what's happening. In systems like Postgres, as far as I know, they don't do this. Other systems that use this approach are DB2 and Oracle. DB2 is a weird one, because when people say, oh, I'm running DB2, there are actually four versions of DB2 that are completely separate code bases: one for z/OS, one for some other mainframe system, the Linux/Unix/Windows one, and a fourth one I forget. They're all completely separate. So, as we'll see in the next two slides, DB2 supports all of these approaches because it has to run in all these different environments, but it may not be the exact same code base every time, okay?

One advantage you get from this is that if there's a mistake in the software and a worker crashes, you don't take down the whole system. That worker dies, and you can fork a new one and bring it back up. So this makes the system slightly more resilient than a threading-based model, where if one thread segfaults, the whole process dies.

The next approach is just an extension of the single-process-per-worker approach: a process pool. Rather than the dispatcher forking off a new process for every single request that shows up, it has a bunch of workers available that it can hand the request off to, and one of them runs the query for us. In some cases, too, if I want to support intra-query parallelism, I can take a single query and run it across multiple processes: one worker can identify that there are other idle workers available in the same pool and start handing off work to them.

The bad thing about this approach is that it's going to ruin our locality unless we have extra logic in the dispatcher to be mindful of which worker executed our query last time. Say these workers are all running on separate sockets. For one connection, the first query shows up and runs on this worker; the next query for the same connection shows up but runs on another worker on another socket. All the cache locality I had from bringing data into memory or into the local caches for the first worker is now gone, because the query is running on a completely different socket. So this can often be bad for in-memory systems, because the overhead of keeping track of which worker I ran on last, and either pausing until that worker frees up if it's running another query, or holding it until the next query shows up (which is just the same thing as the previous slide), all that orchestration can slow things down. Postgres added this in 2015, so they can have intra-query parallelism, and as I said, DB2 tries to support everything.

The most common approach is thread per worker: a multi-threaded application. Question. In the previous one, if they're running on different machines then it's fine, but what if they're running on the same machine? Assume for this semester it's all a single machine.
Okay, then if they're on the same machine, then, like you're saying, they might be on different sockets. Give me a few more slides and it'll make more sense. If they're on the same socket, you have local caches: L1 and L2 are not shared, but L3 is shared. So if you run on the same socket, the penalty of a cache miss is the same, but the likelihood that the data I need is already there is higher. Say I have a query that reads some data, and the next query shows up and wants to read that same data; you see this all the time, read-after-write or read-modify-write. If I'm running on the same socket, the data I just read could be sitting in the L3 cache and the access will be fast. If I'm on another socket, which doesn't share that L3 cache, I pay another cache miss.

Why is the same thing not true for the previous approach? In the previous one, this worker is dedicated to this client's connection, so any query that shows up always goes to the same worker, because it basically takes over listening on that connection. The way Postgres works, the dispatcher is called the postmaster. The connection shows up, the postmaster says, great, I'll hand you off to a worker, now write to this socket, this port number here. The client comes back, this worker is listening on that port, and the client now has direct access to send queries to it. So all the queries go there? Yes, you're bypassing the dispatcher. In the pool approach, I'm always going through the dispatcher; think of it as a centralized scheduler that's making the decision about how to hand things off to different processes.

Thread per worker is a single process, and inside that process we have multiple threads. You may or may not have a dispatcher thread. You can still use the same approach where one thread listens on a socket that all incoming requests go to and then hands them off to another socket on another thread for processing, or you can have a dedicated thread per connection, or you can have a general-purpose networking layer that hands off work to whoever has idle cycles. As I said, every database system written in the last ten years follows this approach.

So you may be thinking, well, this is clearly better: you don't pay the penalty for process context switches, and everything's in the same address space, so any thread can read any memory location of another thread. Yes, you have to do the concurrency control stuff, the latching, to make sure you don't clobber each other when you access critical sections. But this is clearly going to be lower overhead than the process approach. So why did nobody build databases this way in the 1980s and 1990s? Because POSIX threads, or pthreads, were not standardized. All these different operating systems, Sun Solaris, HP-UX, and this was before Linux even came around (Linux showed up in the early '90s), there was no POSIX standard saying here's the threading API. Everyone had their own API, and the semantics of how you spawned threads and joined on them were slightly different. Not dramatically different, but the API was certainly different. So if I wanted to support a bunch of different operating systems for my database, I would have to have a wrapper layer around the lowest common denominator of what the thread API was, to make it portable.
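For contrast, here's a minimal thread-per-worker sketch using C++11's std::thread, which sits on top of pthreads on Linux; the catalog structure and the fake file descriptors are made up for illustration. Everything lives in one address space, so shared state needs a latch.

```cpp
#include <mutex>
#include <thread>
#include <vector>

struct SharedCatalog {
  std::mutex latch;            // protects the critical section below
  int num_active_queries = 0;
};

// Stub worker loop: pull queries off this connection and execute them.
static void WorkerLoop(int client_fd, SharedCatalog* catalog) {
  {
    std::lock_guard<std::mutex> guard(catalog->latch);
    catalog->num_active_queries++;   // any thread can touch shared memory directly
  }
  // ... read queries from client_fd, execute, write results ...
}

int main() {
  SharedCatalog catalog;
  std::vector<std::thread> workers;
  for (int fd : {3, 4, 5}) {         // pretend these are accepted client sockets
    workers.emplace_back(WorkerLoop, fd, &catalog);
  }
  for (auto& t : workers) t.join();
  return 0;
}
```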
Now with pthreads, this is not an issue. One thing I'll also mention: one of the things we did when we first started building Peloton was take the PostgreSQL source code and fork it, like everyone else does. We then converted it to be multi-threaded. If you Google "PostgreSQL multi-threaded," you'll see a mailing list post from former students here saying, hey look, we did this. I forget exactly why we did it. But it turns out the way we actually converted it wasn't by starting from the Linux code path. The PostgreSQL source is a single code base, but it has all these #defines that say if Linux do this, if Win32 do that, and if you take the Windows code path, you can actually convert that to pthreads much more easily than you can convert the Linux code. We also converted it to C++11, again for historical reasons I don't remember, but in our new system we threw all of that away; we don't inherit any PostgreSQL code anymore.

So again, the multi-threaded approach is better. You have lower context switch overhead, and you don't have to manage shared memory like in the multi-process case, where you're either sending IPC messages to each other or setting up shared memory segments so everyone can read and write the same locations. The important thing to point out is that just because we're using a multi-threaded process model doesn't necessarily mean we get intra-query parallelism in our database system. MySQL is a multi-threaded database system, but it's one thread per query request; it can't divide a query's work across multiple workers, and in their OLTP environment that's fine. And as I said, I'm not aware of any system built in the last ten years, unless it's forked from Postgres, which uses the process pool model, that isn't using multiple threads. Okay? And that's what we're going to use in our system.

All right, so now let's get to his question about whether we're running on the same core, the same socket, or a different socket. Regardless of how we do our worker allocation or task assignment policy, we want to make sure that all the workers operate on local data. In a distributed system this is a no-brainer: if I have two machines, one on the west coast and one on the east coast, and my query shows up and only touches data on the west coast, I send the query to the west coast and process it there, so that it operates on local data. Well, a multi-socket, multi-core system, even though it's one box, is still sort of like a distributed database system, so the same concepts apply. We want to make sure our tasks operate on data that's local to where the worker thread is actually running. This means our scheduler needs to be aware of the memory layout of the underlying hardware.

At a high level there are two approaches: uniform and non-uniform memory access. The acronym for the latter is NUMA; you'll hear me say NUMA region, NUMA this, NUMA that, and that's this one. It's the most common arrangement in multi-socket systems today. Back in the day we had what was called uniform memory access, sometimes called symmetric multi-processing, SMP; it's basically the same thing.
The idea is that the tasks, the workers, are running down here on the cores, and they have some local CPU caches, L1, L2, L3, but all memory is accessed through this system bus. For any task running on any socket, the cost of accessing a chunk of memory is the same no matter where it's running. If I'm down here and I want to access memory in this DIMM, the cost of retrieving it is the same as if I were running over here, because there's the overhead of going over the system bus either way. We still need cache invalidation in this world: the hardware is aware that I read a region of memory into my CPU cache, and if somebody else reads the same thing and writes to it, the system bus has to handle invalidating our cache entry over here. All of that is handled for you. As I said, this is how Intel and AMD were designing their multi-socket CPU systems up until maybe 2005, 2006. In this world there's not really any intelligence we need to embed in our scheduler from the database system's perspective, because we don't know and we don't care: it looks like one giant address space and the access cost is the same everywhere.

What modern systems look like in the NUMA model is that you now have memory that's local to each socket. You still have your local CPU caches, and then you have DIMMs that are physically closer to your socket with a direct path to read and write to them. On Intel chips, the L1 and L2 caches are specific to each core, the cores on a socket share L3, and obviously they share the local DIMMs. Now if my task running here needs to access memory that's not in my local DIMM, and I have to go over to this other CPU, I go over this interconnect to send a message in the hardware saying, hey, go read this thing for me; that CPU knows how to go out to its DIMM and bring the data back to me. The performance difference for pure read/write speed is about 50% slower. So if I have to read data over there, it's roughly 50% slower than reading data locally.

Yes? Is it faster for the CPU locally, or is there still a system bus between here and here? No, the bus here is basically saying, all right, I want to access a memory address, and it translates that into some DIMM slot. So that means locally it is faster? Yes, locally is faster. If I have to read over there, it's 50% slower than reading locally here. Think of the local path as the fast path.

So again, that means that in our database system, when we start loading our data, we want to be aware of where we're actually going to put it. Then when queries show up and want to execute, we need to be aware of what data their tasks want to touch and make sure those tasks run there. The OS doesn't know anything about this: the OS sees a bunch of threads that want to do some kind of work, but it can't tell your Bitcoin-mining thread from your database system thread. So we have to tell it, hey, we want to make sure we run things locally here. This interconnect is a high-speed connection between the different sockets; I wouldn't really call it a bus, but it's multiplexed, you can go in both directions and talk to anybody.
Intel originally called this the QuickPath Interconnect, QPI. Then that wasn't good enough, so as of 2017 they call it the UltraPath Interconnect. AMD has their own version called Infinity Fabric, POWER has their own thing, and ARM I think has their own thing as well, but at a high level they're all doing the same job.

Okay, so now, if I call malloc, the question is where is my memory going to show up? The way to think about this is that we're going to take the tables we have in memory and partition them, break them up, into chunks of data or blocks, and then assign each of those blocks to a specific CPU core, or more coarsely, a CPU socket or NUMA region. If we're aware of where we're putting our data, then when a query shows up and we decide how to break it up into tasks and what pieces of data those tasks will operate on, we can make sure we schedule our operators to execute where that data is local. This is an old problem in distributed databases called data placement. Partitioning is deciding how I break the data up and along what boundaries; data placement is deciding where I put the partitions I've generated.

In Linux you can control this through the move_pages syscall or the numactl command-line tool. Basically, if you invoke move_pages with just a memory address, it comes back and tells you which NUMA region that page is on, but if you invoke it with a memory address, a size, and a NUMA region, it will move that data to that location, and I think it's a blocking syscall. So if I have a one-terabyte database and I load it into memory and it ends up randomly scattered across different sockets, I can go back, call move_pages, and start putting things where I want them.
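Here's a minimal sketch of the move_pages interface just mentioned, using libnuma (compile with -lnuma); note the actual syscall takes arrays of page pointers and target nodes rather than an address plus a size. Passing a null node array just reports where each page currently lives; passing target nodes migrates the pages.

```cpp
#include <numa.h>
#include <numaif.h>
#include <cstdio>
#include <cstring>
#include <unistd.h>

int main() {
  if (numa_available() < 0) { std::fprintf(stderr, "no NUMA support\n"); return 1; }

  const size_t page_size = sysconf(_SC_PAGESIZE);
  void* buf = numa_alloc_local(page_size);   // one page, allocated on the local node
  std::memset(buf, 0, page_size);            // touch it so it is physically backed

  void* pages[1] = {buf};
  int status[1];
  // nodes == nullptr: just ask the kernel which NUMA node backs this page.
  move_pages(0, 1, pages, nullptr, status, 0);
  std::printf("page currently on node %d\n", status[0]);

  // Migrate it to node 0 (assuming node 0 exists on this machine).
  int target[1] = {0};
  move_pages(0, 1, pages, target, status, MPOL_MF_MOVE);
  std::printf("after move: node %d\n", status[0]);

  numa_free(buf, page_size);
  return 0;
}
```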
So let's talk about what happens when you call malloc. Say our database system calls malloc because we're going to load in a one-terabyte database. The question I have for you is: what happens in that call? Assume my allocator has already handed out all the pages it pre-allocated. What does it actually do? Yes? He says it calls sbrk. Yes, sort of. sbrk extends the process's data segment, but that just makes it larger; beyond that, it's not actually doing anything except updating the internal bookkeeping for where that boundary is. All the virtual memory you just allocated with sbrk is not actually backed by physical memory, because nobody has accessed it yet. It's only when you touch the data that there's a page fault, and the OS says, oh, it looks like you're actually going to use this memory, let me back it with physical memory.

So now, after the sbrk, when someone touches this memory and we take the page fault, where is that memory actually going to be placed? Take a guess: on the node of the thread that allocated it, or of the thread that touched it? Touched it, why? That's what the paper said, yes, okay, good, fantastic. Yes, there is a policy that tells the OS: whichever thread touches the memory, that's the node where it gets allocated. So I call malloc, it calls sbrk and extends my process's virtual memory, but only when a thread touches a page is there the page fault, and that's when the page gets placed in a physical location.

Yes? The question is, can the OS promise virtual memory when it doesn't have enough physical memory behind it? What do you mean by promise? Say you ask for 512 megabytes; can it promise you that? Think bigger: I ask for one terabyte of memory. What happens? If the machine doesn't have a terabyte, can the OS still give it to me? It can, yes. And then when I actually try to use it, I get a fault. Isn't that bad? Shouldn't you get a fault? It's bad, but is it wrong? Arguably not; we have swap space for that. You can actually see this now: if you run some of the tests with ASan for our database system, it'll report that the virtual memory size of the process is 12 terabytes. We obviously don't have 12 terabytes on any machine. The OS is betting that most people allocate more memory than they actually need, so it just lets you have it, and if you do end up needing all of it, then yes, you start swapping pages out to the swap space on disk. That's virtual memory: it makes it look like you have more memory than you actually have.

But in our database world, for the things we're talking about here, we're talking about the data inside tables, the tuples themselves; we are going to use that space. And think of it this way: I'm not going to start the database in the morning and have it pre-allocate a gigabyte of memory. It's only as I start inserting data that I recognize, oh, I'm running out of room in the memory I've pre-allocated; let me grab another 100 megabytes, break it up into blocks, and start filling data in. The question we're getting at is: when I do that, where is that memory actually going to be placed? By default, you get what's shown here, where the OS does round-robin, writing pages out to one socket at a time and going around over and over again. What he pointed out from the paper is that you can instead tell the OS to use the first-touch policy: I allocate the memory, and when a task actually starts inserting into that block, there's the page fault, the OS updates the page table so the page is backed by physical memory, and that physical memory is wherever my thread is actually running. And again, you can fix up the location after the fact; you can call that move_pages syscall and say, all right, I've loaded my table, now let me start moving things around. But that's obviously wasteful: if I'm loading a one-terabyte database, I don't want to load it all in, have it scattered around randomly, and then go back, do a sequential scan, and move everything to the right location. I want to use this policy ahead of time and be aware of where I'm scheduling my tasks, so that I end up with an even distribution of the data across all threads, across all sockets.
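Here's a hedged sketch of using first-touch ahead of time during loading: reserve one big region, which is virtual memory only, then have a loader thread pinned to each NUMA node fault in just its own slice, so those pages get backed by that node's local DIMMs. It uses libnuma for the pinning (compile with -lnuma), and the sizes are arbitrary.

```cpp
#include <numa.h>
#include <sys/mman.h>
#include <cstring>
#include <thread>
#include <vector>

int main() {
  const size_t kSlice = 64UL * 1024 * 1024;                       // 64 MB per node
  const int num_nodes = (numa_available() < 0) ? 1 : numa_max_node() + 1;

  // mmap reserves virtual address space; nothing is physically backed yet.
  char* table = static_cast<char*>(mmap(nullptr, kSlice * num_nodes,
      PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0));

  std::vector<std::thread> loaders;
  for (int node = 0; node < num_nodes; node++) {
    loaders.emplace_back([=] {
      if (num_nodes > 1) numa_run_on_node(node);       // pin this loader to its socket
      std::memset(table + node * kSlice, 0, kSlice);   // first touch -> local pages
    });
  }
  for (auto& t : loaders) t.join();

  munmap(table, kSlice * num_nodes);
  return 0;
}
```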
So let's see the impact of this when we have fine-grained control over where we put data. I'm going to show two slides: the first is OLTP and the second is OLAP. This first one is from a paper a few years ago from Anastasia Ailamaki's group at EPFL. They're running an in-memory database on a four-socket machine, six cores per socket, and they're just running the TPC-C Payment transaction. What they do is place the warehouses, the blocks of data for the database, in different configurations across the sockets. Spread is where every socket gets an equal portion, one quarter, of the total table. Group is where you shove everything onto a single socket. Mix is a 50/50 split. And OS is just letting the OS do whatever it wants; I put question marks there because we don't know what policy the OS decides to use, it gets whatever it gets.

What you can see is that if our threads can access data while all running on the same socket, you get about a 30% improvement over what the OS would do. The OS is doing something a little smarter than spread, but not much smarter than mix or group. So this is just showing that if you let the OS be in charge of deciding where data gets placed and where your threads get scheduled, you're going to have a bad time, and you can do about 30% better by doing it yourself. And keep in mind that here all the threads are running on one socket. You may be thinking, all right, wouldn't it be better to spread across multiple sockets so that the threads have a larger portion of the CPU caches to themselves? Well, no, because going over the interconnect to transfer data from one thread to the next is going to be problematic; the Payment transaction updates data in different partitions, so you may have to go across sockets anyway.

The next slide is a microbenchmark experiment that some former students of mine ran a few years ago. It's a simple execution engine they wrote, I think for 15-618, and all it does is a sequential scan over 10 million tuples. They ran it on a beast machine we had in the PDL with eight sockets; at the time it was a 2008 machine, so a bit old, but it was the only eight-socket machine we had access to, and every socket has ten cores plus hyperthreading. Along the x-axis, we add more threads to do the sequential scan in parallel; the size of the table stays the same, we're just adding more threads to execute the scan in parallel.

What you see is that at the lower core counts, or thread counts, the difference between the two is small. The random partitioning is where you just let the OS decide where to put the data; the local partitioning is where you assign threads to scan the data that's local to their NUMA region. At low thread counts, because the table is spread across multiple sockets, the probability that a thread has to access data on a remote socket and go over the interconnect is high, so it doesn't really matter whether you divide the work up intelligently or not. But as we increase the number of threads, the amount of cross-socket traffic for the scan threads drops significantly with local partitioning, and that's why you get better performance.
And then this inflection point here is where hyperthreading kicks in. Up to 80 threads, it's all real hardware cores, but beyond that the hyperthreading gives you virtual cores, and at that point we're bound by memory bandwidth, so throwing more cores and more threads at the problem doesn't help. It's not like one thread stalls waiting for disk I/O and another thread can run instead; everybody's just waiting to get stuff out of the memory controller, and that's why performance plateaus. In this case here, I forget the exact numbers, but it's over a 2x performance difference between being intelligent about how we place our data and run our operators versus letting the OS and the hardware manage it for us.

As I said briefly, there's this notion of partitioning versus data placement. We're not going to talk about partitioning here, but think of a partitioning scheme as the policy we use to decide how we divide our data up into chunks, and the placement policy decides where we put those partitions. The placement policy is agnostic to what's actually in the partitions. You saw this in the morsel paper: all the threads care about is that they're operating on data that's local to them; they just pull work out of a queue and process it. In a partitioning scheme, you can have some higher-level logical meaning or semantics to how you divide up the data, which you can then exploit for query execution, and we'll cover this in future lectures. For example, if I know my workload is always going to do hash joins on a given attribute, then I can hash-partition my data on that join attribute so that when I run the join algorithm, all the data every operator needs for the join is local to it, and the query optimizer is aware of this partitioning scheme. All of that sits above what we're doing with data placement and scheduling here. Okay?

So at this point, what do we have? We have a process model; we have a task assignment model, basically how we decide where tasks should run, which is wherever the memory is local to them; and we've determined what mechanism we're going to use to pin chunks of data to memory locations. Now we need to talk about how we actually create the tasks we want to execute for a logical query plan, and once we have them, how we're actually going to schedule them.

For OLTP queries, this is super simple, because most of the time there aren't really any opportunities for parallelism within a single OLTP query. I can have multiple OLTP queries running at the same time across different transactions, and certainly I want them to run on separate threads and on the sockets that hold the data they want to access. But within a single OLTP query, I can't really divide it into sub-chunks. "Go get Andy's account record" is a single index probe to get my one record; I can't parallelize that. So what we care about here is the OLAP queries. And we can still apply the same techniques we use to schedule OLAP queries to scheduling multiple OLTP queries at the same time, running on behalf of different transactions.
The easiest way to do scheduling is static scheduling, where the plan is decided before the query even starts executing. I know the number of threads I have, I know the number of cores I have; I can decide, say, one task for every core, shove them off to the hardware, and let them execute. This is the easiest thing to do because I don't worry about monitoring the behavior of the tasks while they're running; I just say, this is my plan, I'm sticking with it, and I don't care about alternatives. If you're building a database system for the first time, this is probably what you end up building, because it's the easiest thing to do. Again, this is not the same thing as the placement policy: we have a placement policy we use to assign tasks to threads based on data location; this is about how we divide up our tasks and how we actually schedule them at runtime.

The reason this is problematic, especially for OLAP queries, is that it doesn't take into account the runtime behavior of the individual tasks on the different cores. Say I have that one-terabyte database and I'm doing a scan with a predicate, and for whatever reason, on one socket the predicate evaluates to true for many more tuples, while on every other socket the predicate is very selective and most of the data gets thrown away. That means the task on that one socket is going to take much longer than the others to finish, because more tuples are getting put into its output buffer and it's doing more memory copying. With static scheduling I can't dynamically adjust and say, all right, I see this guy is going slow and there's much more data it still has to process that we're waiting on; let me bring in other threads to help it out and speed things up. Static scheduling can't do that, but dynamic scheduling can, and this is the problem the HyPer guys are trying to solve with their approach.

Morsel-driven scheduling is the idea that we process our tasks in parallel over these horizontal partitions they call morsels. Morsel is a HyPer term; it's not a standard term in database systems. I think they picked it because they didn't want to use the word block, because that's used in a bunch of places, and they didn't want to use the word partition, because that's also already used in a bunch of places. A morsel is larger than a block but smaller than a partition; that's one way to think about it. And as far as I know, no other system actually uses this term; it's specific to HyPer.

They're going to have one worker per core, so if you have a multi-core socket, you have multiple workers on that socket. They do pool-based task assignment, meaning the workers check some centralized data structure and ask, what work is available to me? And when they load the data in, they do round-robin data placement: say I'm doing a bulk insert, then some portion of that bulk insert, some of the morsels, goes to this socket, and some portion goes to this other socket. What's interesting about their approach is that the implementation of the actual database system is entirely NUMA-aware.
Meaning they have implementations of operators that recognize whether they're accessing data in a local NUMA region or a remote NUMA region, and they have different strategies for where they put the output data or what algorithm they actually want to use. And they use this when they make decisions about how to pull things out of the task queue. As I said, they're using a pool model, so there's no separate dispatcher thread, no single thread in charge of having a global view of what's going on in the entire system. The workers are doing cooperative scheduling, because everyone's looking at the same queue. What happens is that when a worker pulls tasks out of this queue, it prefers tasks that operate on data local to it, but it can also recognize that there are no tasks available that operate on its local data, and go try to steal tasks from another worker thread, tasks that operate on data that's remote to my worker thread, because that other worker thread is going slow for some reason, so we pick up the slack and take some of its work.

One other part of their approach, and I don't actually have a slide for this, is that they only execute one query at a time. A query shows up, they break it up into a bunch of tasks, and you don't move on to the next query until all the tasks for that one query are finished. So either the query is completely done and there are no more tasks for it and I move on to the next one, or a worker thread can go steal tasks from a straggler worker that can't keep up.

Question. So wait, what is actually dividing up a query into tasks and putting them into the queue? I would call that a dispatcher; it's sort of the front-end layer. The query shows up, you parse the SQL, run it through the query optimizer, and at that point you divide it into tasks and insert them into the queue. Whether that's a dedicated thread separate from the networking thread, or the networking thread itself, doesn't matter at this point; I actually don't know what they do. In our current system, the same thread that takes the SQL query off the network socket, parses it, binds it, and optimizes it then divides it up, but you could hand that off to another queue. Yes? How does work stealing work? Give me two slides, we'll get there.

All right, so the first thing we do is divide our data into morsels. Say this is our table with four columns. Morsels are just horizontal partitions: within some stride, this is the boundary of a morsel. In the paper they use 100,000 tuples per morsel, and they say they picked that number because it provides the right amount of parallelism versus the overhead of putting stuff in a queue. So for every task I pull out of the queue, I operate on at most 100,000 tuples. In Peloton, our block sizes were 1,000 tuples. In H-Store, the system I helped build before I came to CMU, we used 10 megabytes. As far as I remember from the discussions around those systems, we just picked those numbers out of thin air; it's not like there was any magic to them.
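To make the morsel idea concrete, here's a hedged sketch of chopping a table into fixed-size morsels with round-robin placement across sockets; the struct and queue are illustrative rather than HyPer's actual code, but the 100,000-tuple default below matches the number from the paper.

```cpp
#include <algorithm>
#include <cstddef>
#include <deque>

struct Morsel {
  std::size_t begin;   // first tuple index in the table
  std::size_t end;     // one past the last tuple index
  int numa_node;       // which socket's memory holds these rows
};

std::deque<Morsel> MakeMorsels(std::size_t num_tuples, int num_nodes,
                               std::size_t morsel_size = 100000) {
  std::deque<Morsel> morsels;
  for (std::size_t start = 0, i = 0; start < num_tuples; start += morsel_size, i++) {
    const std::size_t end = std::min(start + morsel_size, num_tuples);
    // Round-robin data placement across the sockets, as in a bulk load.
    morsels.push_back({start, end, static_cast<int>(i % num_nodes)});
  }
  return morsels;
}
```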
I think in the case of HyPer, they said they actually did some internal micro-benchmarking to determine that 100,000 was reasonable. In our new system, as I said a few lectures ago, our block sizes are one megabyte, because that allows us to use 20-bit offsets; we use the alignas specifier in C++11 to align our blocks to one-megabyte chunks in memory. All right, and again, each of these morsels gets assigned to a different CPU socket.

So now, getting to his question, how are we actually going to do work stealing? Say we take our query plan and chop it up into different tasks that we put in this global queue. Down here, every socket has its local memory, a buffer it can store results in, and all the morsels it knows are local to it. The buffers work like this: as you process a task, you want to write output to your local buffer. It's the same idea as the delta storage we talked about in the HyPer MVCC approach, where as I generate new versions I put the old versions in thread-local memory, which means I don't need coordination over a global shared memory space that multiple threads could be writing into at the same time. I know no other thread can be writing into my buffer, so I don't take a latch on it.

So now all the threads go look in the queue and pull down the first batch of tasks they want to execute. The task queue maintains information about the dependencies between tasks, so it knows that I can't actually execute the probe until I've built the hash table and done the partitioning on the other side. It knows these are the tasks I can execute first. So as each worker runs a task, it pulls data out of its local memory, which is fast, and as it executes the operators and produces output, it writes that output to its local buffer.

Yes? Is the global task queue on some particular socket, or is it distributed? The question is, is the global task queue on some specific socket, or is it distributed? I think they let the OS manage that one, because if everyone has to read and write to it, there's not really any magic you can do; I hadn't thought about it much. And couldn't workers look for tasks over their local memory first? Yes, his point is that this global task queue is a shared memory space; they use a lock-free, latch-free hash table to avoid a heavy synchronization primitive on it. But you could partition the memory of this global task queue so that workers look in their local portion first, find the tasks specific to them, and only then look at the others. HANA does this; as far as I know, HyPer doesn't. One other option is that you only push tasks that are local to a worker into its local task queue, so anything in your local task queue is always local to you. Correct; HANA does this, we'll see it in a few more slides. Yes, for whatever reason HyPer didn't do this.
The point I'm making is that the morsel size is 100,000 tuples; that's where the bottleneck is if I'm going over the interconnect to a remote socket. The bottleneck is reading the morsel. Fetching one task from the queue is maybe a kilobyte in size, it's not going to be that big, so fetching one entry from a remote memory region is not a big deal. You want to optimize where the big overhead is, where we're spending all our time, and that's reading the morsels, which is what they optimize.

All right, so say the first two workers finish their tasks, but the third one is running slower. Maybe its filter is unselective, so it's generating a lot of output, writing into its buffers, doing more copying, and running slower. The other two workers can go grab the next two tasks and execute them. Same thing again, but say this first one finishes its task more quickly. It can recognize that, oh, B3 also needs to be executed before we can run the next task in our pipeline, in our query plan, so let me go ahead and steal it, bring it over, and operate on it. That means I'm copying data (not moving it) from the memory of the morsels on that socket into my CPU cache over here, so that goes over the interconnect, but as I produce output, I write it into my local buffer, and there's some metadata maintained in the task queue saying, for this task, the intermediate result the next operator needs is located here. So there's some internal bookkeeping that tracks which location each task's output got written to.

Yes? The paper said something like all workers on the same pipeline job are guaranteed to finish within the time it takes to process a single morsel. Does that mean all of these finish at the exact same time? I think what they mean is that because everything's in memory, the time to process each morsel is close enough; it's not like one of these is going to be waiting on disk and therefore be minutes slower. And the idea is okay because they only execute one query at a time: rather than have this worker sit idle waiting for B2 to finish, then A3, before it can execute B3 and move on to the next task, it can just go ahead and steal it.

Yes. His question is, for this case where I copied the morsel into the CPU cache of the stealing socket, do I then permanently move that morsel over here? No, right? Because the next query might show up, and now maybe this socket is the slow one, and this guy might be stealing work back from me. Typically you don't move the memory around once it's already been placed. You might want to repartition it, maybe split it on a different attribute, but that's rare in these systems as well.
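Here's a hedged sketch of the pull-based scheduling with work stealing we just walked through: each worker prefers tasks whose morsel lives on its own NUMA node and only steals a remote task when nothing local is left. A plain mutex is used for clarity; HyPer uses a lock-free structure for the global queue.

```cpp
#include <deque>
#include <mutex>
#include <optional>
#include <vector>

struct Task { int numa_node; /* pipeline + morsel range would go here */ };

class TaskPool {
 public:
  explicit TaskPool(int num_nodes) : queues_(num_nodes) {}

  void Add(const Task& t) {
    std::lock_guard<std::mutex> guard(latch_);
    queues_[t.numa_node].push_back(t);
  }

  // Workers call this: local tasks first, then steal from any other node.
  std::optional<Task> Next(int my_node) {
    std::lock_guard<std::mutex> guard(latch_);
    if (!queues_[my_node].empty()) return Pop(my_node);
    for (int node = 0; node < static_cast<int>(queues_.size()); node++) {
      if (node != my_node && !queues_[node].empty()) return Pop(node);  // steal
    }
    return std::nullopt;   // this query's tasks are drained
  }

 private:
  Task Pop(int node) {
    Task t = queues_[node].front();
    queues_[node].pop_front();
    return t;
  }

  std::mutex latch_;
  std::vector<std::deque<Task>> queues_;
};
```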
So again, because they only have one worker per core, HyPer has to use work stealing, because otherwise they'd have threads sitting around waiting for stragglers. And as we already said, they use a lock-free hash table to maintain the global work queue, so the overhead of grabbing things from the work queue is not significant relative to the actual cost of processing a morsel.

Now I want to talk about another approach, from HANA, which gets at what he was saying about how you might want to organize the queues. Let me be very clear: this is from a research paper that a PhD student wrote while employed at SAP. In Germany they have a model for PhDs where a student can work full time at a company; they still do research, they still write papers, and they get paid by the company like a regular salary rather than a PhD stipend, but they get the degree from an institution. As far as I know, what I'm describing here never actually made it into the real HANA system, although the student was physically working on the real HANA source code when they implemented it. So this is a different approach to scheduling than what HANA actually does in production.

The question is, does that mean that whatever they described here didn't work out, and that's why they didn't put it into the real HANA system? No; it can be the case that, yes, this is a good idea, we should do this, and then there's the engineering overhead of taking the research code and putting it into the real system. That's often why this stuff doesn't materialize. But it's not like it's wasted work: if SAP now decides to build a new system, or when they end up rewriting a lot of HANA, they now know something about the alternative ways to organize the scheduling mechanisms, so they could do this. I actually don't know whether the 2019 rewrite of HANA that came out last year does this; I suspect not, but I just don't know.

All right, so what they do is use pool-based queues, but with multiple workers per socket. It's not like HyPer, where you have one worker per core; every core can have multiple workers, and those workers are going to be in different states. Each group also has a soft and a hard priority queue. This is what he was suggesting: couldn't I have a queue that I know is local to me and check that first, and if nothing's there, go check another queue, the global one? That's this. The soft queue, sorry, take that back, the hard queue is local to my group, which is some number of workers running on the same socket. Hard means that nobody can steal from it; soft means someone else is allowed to steal from it. I still want to execute all of these tasks; hard just says this task has to run in my group, on my socket, nobody can take it, while the soft ones are eligible to be stolen by other groups. And then they have a separate watchdog thread that has a global view of what's going on in every group on every socket.
And then the watchdog can decide whether some group is oversubscribed and has more tasks than it can actually process, and it can take threads away from one group and allocate additional threads, or allow additional threads to run, in another group. So you get both types of balancing: you have the work stealing like in HyPer, but you also have the ability to crank up the number of threads executing tasks in a group. You're using both approaches.

As I said, we have a soft queue and a hard queue: threads are allowed to steal tasks from other groups' soft queues, but not from their hard queues. And within a group, we have four different pools that keep track of what state each thread is in. A working thread is one that's actually running a task it pulled from some queue; it doesn't matter which queue it came from, we know it's processing something. An inactive thread is one that's blocked in the kernel on some latch and will get woken up at some later point. A free thread sleeps a little, wakes up, checks its soft and hard queues to see whether there's anything for it to execute; if so, it executes it and becomes a working thread, and if not, it goes back to sleep. And then we have parked threads, which just sleep and don't get woken up until the watchdog thread comes along and says, hey, we need you, wake up and actually start doing stuff.

What this lets us do is scale up the number of threads in a group, using that parked thread pool. If I recognize that all my group's threads are working or inactive, there are no free threads because they're all busy, and I don't have enough resources to process the work queued for my group, I can release those parked threads, unpark them, and have them start processing as well. And then maybe on another group, I park their threads instead. This is a lot of work, right? This sounds like an operating system, because it sort of is. We'll see with SQL Server in a few more slides that they actually did build an operating system that runs inside the database system; they even call it that.

With this, again, they adjust the threads, and where they're pinned and actually running, based on whether the tasks are CPU-bound or memory-bound. If, for my group, I have a bunch of tasks and I can't process them fast enough, I can unpark threads; if I don't have enough work, I can park them. And I may have to do the opposite on another socket, because I don't want to saturate the memory bandwidth of the whole system when those tasks are pulling data over the interconnect.

What was interesting about this is that they found that on the really large multi-socket machines HANA can run on, the HyPer approach of allowing work stealing was actually a bad idea. In the HyPer paper they were looking at, I think, two- to four-socket machines; the HANA guys are looking at 64-socket or 128-socket machines. In that environment, what you want to do is not have any soft queues at all.
So in that environment, what you want to do is not have any soft queues at all: you put everything in the hard queue, and you can then scale up the number of threads you're running in every group, on every socket, but you don't allow threads to take work from another socket and pull data over the interconnect. Everybody always processes things locally. Question: what's the difference between a task in the soft queue and a task in the hard queue -- like, how do you decide which queue a task goes into? So the question is how I know what to put in the hard queue versus the soft queue. A hard-queue task might be -- well, we'll see on the next slide. In HANA, for this example, the entire system architecture runs on these worker pools, which means things like network requests are also handled by the worker pool. So if I have a task that says "read the next packet from this network socket," I want to run it on the CPU socket where that network connection is being handled, and I don't want somebody else to take it from me. Yeah, so this is the next slide: their thread groups are running everything. In the HyPer case, the pools were only processing queries; here, everything in the system goes into these pools. So now you can do things like: if I'm oversubscribed on the number of queries I'm executing, then rather than accepting new network requests, I'll take some networking threads from their pool and spin up or unpark threads in my execution pool. And that gives you a natural form of flow control, of not taking on more queries than you can actually execute. So here's the same query we had before, and say we put these tasks here into the soft queue and, for whatever reason, these other tasks into the hard queue -- for my purposes it doesn't actually matter, I'm just showing this for illustration. So the working threads are pulling from the queues, my inactive threads are waiting on something in the OS, the free threads are spinning on the soft queue and the hard queue looking for work to do, and the parked threads stay blocked until the watchdog thread releases them. So again, say I recognize that I can't keep up with demand: a free thread pulls something out of the hard queue, it gets migrated over, and now it is considered a working thread. And the watchdog thread above all this knows how many threads are actually running in every group and can make scheduling and resource allocation decisions up above. Question: what are parked threads, again? Parked threads are like threads held in reserve -- instead of calling spawn, I just have them sitting there. I actually don't know the exact difference offhand, but I think an inactive thread is one that is blocked in the OS waiting to get something like a latch or a mutex, and when it comes back it becomes a working thread, whereas a parked thread is blocked in the OS on a condition variable, where some other thread has to come along and say, now you're allowed to wake up. And the idea is that I can dynamically have these threads get released and start executing without having to call spawn, because that's a context switch -- or rather a syscall and a context switch. Okay.
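Here is a small sketch of what that parking mechanism could look like using a condition variable: the worker blocks, the watchdog wakes it, and no new thread ever has to be spawned. The names are illustrative, not HANA's.

```cpp
#include <condition_variable>
#include <mutex>

struct ParkingLot {
    std::mutex m;
    std::condition_variable cv;
    bool should_run = false;

    void park() {                       // called by the worker thread itself
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [this] { return should_run; });
        should_run = false;             // consumed the wakeup; now go look for a task
    }

    void unpark() {                     // called by the watchdog thread
        {
            std::lock_guard<std::mutex> lk(m);
            should_run = true;
        }
        cv.notify_one();                // wake the parked worker without spawning anything
    }
};
```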
So the last scheduling approach I want to talk about is the one I think is the most fascinating of all of these. It's not from an in-memory system, but it's so different from everything else we've covered so far that I think it's amazing. In the 2005 release of SQL Server, Microsoft bundled inside the database system what is basically a user-mode operating system called SQL OS. The idea is that this is a layer that sits between the real operating system -- the Windows kernel -- and the database engine itself, and it manages and provisions hardware resources. The original motivation was that, as NUMA was becoming more prevalent and the hardware kept changing, rather than having all the different operator implementations inside the execution engine be aware of NUMA and memory layouts and things like that, they would abstract all of that away behind the SQL OS layer and have it make the decisions about where to actually run tasks and how to run things in parallel. And the thing is amazing because it's more than just scheduling threads: they do a bunch of other stuff like I/O scheduling, which ties into the buffer pool, inside SQL OS, and they do lock scheduling and lock management inside it as well. The idea is that rather than having ad hoc implementations of specialized logic to deal with things like "this thread is waiting on this lock, therefore my join algorithm should run this way," all of that is taken care of in the SQL OS layer, and you just implement your basic operators up above and this thing handles it for you. Again, if you took the intro class, we talked about disk-oriented database systems and the buffer pool manager. The buffer pool manager is basically replicating the same logic the OS does for virtual memory: I can act as if I have more memory than is actually available to me because I know how to swap pages out to disk. So Microsoft basically went all in and said, let's do everything the OS is doing and put it inside our database system. It's pretty amazing. For this lecture, what we're going to focus on is their thread scheduling: they do non-preemptive thread scheduling by modifying the database system code itself to yield back to the scheduler. Before we get to that, I'll also say that this SQL OS layer is part of the reason why Microsoft was able to port SQL Server over to Linux. There was an announcement in 2017, and in this TechCrunch article they talk about how, when they had to take SQL OS and have it talk to the Linux kernel instead of Win32, that wasn't a major change for them. I mean, it was not insignificant, but it's not like they needed to change all the other parts of the database system, because all of this low-level management of OS and hardware resources is done by this one layer. You just change the SQL OS part to make it talk to Linux and don't touch anything else. So the same code base for all the join algorithms, all the execution flow control, all the concurrency control, all the indexes -- all of that sits above SQL OS, and they didn't have to modify any of it.
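To illustrate why that kind of layering makes an OS port tractable, here is a made-up sketch of what such an abstraction boundary could look like; these are not SQL OS's real interfaces, just an example of the idea that the engine only ever calls the layer, and only the layer's implementation changes per platform.

```cpp
#include <cstddef>

// Hypothetical platform layer: the execution engine, joins, indexes, and
// concurrency control all call this interface and never touch the OS directly.
class PlatformLayer {
public:
    virtual ~PlatformLayer() = default;
    virtual void* allocate(std::size_t bytes) = 0;                  // memory management
    virtual void  schedule(void (*task)(void*), void* arg) = 0;     // task scheduling
    virtual void  read_block(int file, std::size_t offset,
                             void* buf, std::size_t len) = 0;       // I/O
};

// One concrete implementation per platform; everything above never changes.
// class WindowsLayer : public PlatformLayer { /* Win32 / NT kernel calls */ };
// class LinuxLayer   : public PlatformLayer { /* POSIX / Linux syscalls  */ };
```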
Going back to my example of DB2 from before: DB2 did not have that abstraction layer, and that's why they have four distinct code bases. I think they attempted to unify them into a single code base, but it never happened and it's never going to happen. Microsoft was able to do it through SQL OS, so I think it's really fascinating. Okay, so what does this non-preemptive thread scheduling look like in their world? In SQL OS, the quantum for every thread is going to be four milliseconds. I don't even know what the default quantum size is in Linux. The question is: what's a quantum? A quantum is this: you have a thread, the OS schedules it for execution, it runs for a certain amount of time called the quantum, and at some point when the quantum is up there's an interrupt, the OS takes your thread off the core, and it context switches to another thread. "Time slice" would be another term for it. So for those of you who've taken an OS class, what is the quantum size in Linux? Nobody knows, because the answer is vague -- it's actually dynamic. It can vary depending on the clock speed and on what else is running: if the OS knows there are no other threads demanding the CPU, it can make your quantum slightly larger to avoid an unnecessary context switch to some other task. If you switch Linux into real-time mode, then the quantum is fixed at, I think, 100 milliseconds. The idea there, again, is that context switches are not cheap -- you're saving all the registers, you're polluting the CPU caches, you're changing the program counter, and none of that is cheap -- so in the real-time case they let your quantum run longer. So in SQL OS the quantum is going to be four milliseconds, but they have no way to enforce this, because it's all done in user mode. Meaning: we hand a task off to a thread and it runs; because we're running in the same process and we're not the OS, there's no way for us to say "thread, you're done, give us back control of the core." You can't do that, because that's what the OS does, and it can only do it through hardware. So what they did was modify the source code of SQL Server itself to introduce yield calls that cause the thread to return control back to the scheduler. So let's say we have a simple query here: SELECT * FROM A WHERE A.val = <some parameter>. A really simple implementation of this would be a for loop where we get a bunch of tuples, evaluate the predicate, and if a tuple matches the predicate, we emit it. So what Microsoft basically had to do was go back and modify the source code to keep track of the amount of time spent in these different parts and recognize that if the last time I yielded to the scheduler was more than four milliseconds ago, then I go ahead and yield now. That hands control back to the scheduler, and the scheduler can decide whether to run your task again or hand the thread off to another task. And this is amazing -- but this slide is pseudocode; they're obviously not calling get-current-timestamp over and over again for every single tuple, that would be stupid.
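As a rough sketch (not SQL Server's actual code), the loop being described might look something like this, with the clock consulted only every few thousand tuples so we are not paying for a timestamp on every iteration; Tuple, emit(), and yield_to_scheduler() are invented placeholders standing in for the real engine.

```cpp
#include <chrono>
#include <cstddef>
#include <thread>
#include <vector>

struct Tuple { int val; };

static void emit(const Tuple&) { /* hand the matching tuple to the parent operator */ }

static void yield_to_scheduler() {
    // Stand-in: a real user-mode scheduler would switch to another ready task
    // here instead of asking the OS to reschedule us.
    std::this_thread::yield();
}

void scan_with_cooperative_yield(const std::vector<Tuple>& table, int param) {
    using Clock = std::chrono::steady_clock;
    constexpr auto kQuantum = std::chrono::milliseconds(4);
    constexpr std::size_t kCheckEvery = 4096;   // amortize the cost of reading the clock

    auto last_yield = Clock::now();
    std::size_t since_check = 0;

    for (const Tuple& t : table) {              // SELECT * FROM A WHERE A.val = ?
        if (t.val == param) {
            emit(t);
        }
        if (++since_check >= kCheckEvery) {
            since_check = 0;
            if (Clock::now() - last_yield >= kQuantum) {
                yield_to_scheduler();           // our ~4 ms quantum is up; let the
                last_yield = Clock::now();      // scheduler run some other task
            }
        }
    }
}
```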
I think the way they actually did it was that they knew, in the code itself, roughly what it costs to retrieve a tuple -- whether it's on disk or in memory and what the time for that is -- and they could use some built-in calculations to decide very cheaply whether their four milliseconds were up. Question: how does this control which thread the OS is going to schedule next? That also matters for the database system. So the question is, how do I control which thread the OS schedules next? You can't -- not until your thread is the one that's running. Is this an OS yield or a SQL OS yield? This yield here is not an operating-system yield, it's the SQL OS yield. So no, they can't control which OS thread runs, but they can control which SQL Server task runs next. And there are a bunch of advantages to this. One is, you're right, I can't prevent the OS from swapping my thread out. So the best practice when setting up any database system is to run it on a machine by itself, so you're not doing Bitcoin mining or encoding a video or watching YouTube at the same time you're trying to run transactions. Then the contention for the CPU will be low, and with the fair scheduling approach in Linux, if no other thread is demanding the CPU, my thread gets scheduled right back. So yes, I could be in the middle of this and the OS might swap me out -- correct. This quantum is a user-mode concept. Another interesting advantage you get from this is that I can now do fair scheduling and provisioning between different tenants running on the same database. Say I have two customers running on the same SQL Server instance. One option is to just divide them up to run on different sockets, but I could also use these quanta to enforce that they both get equal time. Or, if I have a budget for how many quanta I'm going to give one tenant versus another, the quantum is still four milliseconds, but maybe I give more quanta to the tenant who's paying me more money. So this lets you do all sorts of things that you can't easily do with the HyPer approach or the HANA approach, because they don't know what the tasks are actually trying to do. As far as I know, nobody else builds this kind of SQL OS layer. It's something I want to do in our system because I think it'll benefit the self-driving stuff. The only other system I know of that does something similar is FaunaDB. They're not using the same kind of quantum -- again, SQL OS's quantum isn't strictly fixed, but they try to keep it at roughly four milliseconds -- instead, FaunaDB only yields back to the in-database scheduler whenever you do I/O. So in SQL OS I could be doing everything in memory and I'm still going to yield after my quantum is roughly up; in FaunaDB you basically say, I'm about to read something from disk or over the network, let me go back to the scheduler and say I'm going to pause for a bit, schedule somebody else if you can -- and if the scheduler says you're fine, go, then you get control back, otherwise it schedules something else.
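Going back to the multi-tenant quantum idea for a moment, here is a toy sketch of what that accounting could look like: each quantum is still roughly four milliseconds, but the scheduler hands out different numbers of quanta per tenant per round. Nothing here is from SQL Server; the types and policy are invented for illustration, and a real scheduler would also re-enqueue tasks that did not finish within their quantum.

```cpp
#include <cstdint>
#include <deque>
#include <string>
#include <unordered_map>

struct Tenant {
    std::deque<int> runnable_tasks;     // task ids waiting to run
    uint64_t quanta_per_round = 1;      // a higher-paying tenant gets a bigger share
    uint64_t quanta_used = 0;
};

// Pick the next task to run: choose a tenant that still has budget left in
// this round; once everyone with work has spent their budget, start a new round.
int next_task(std::unordered_map<std::string, Tenant>& tenants) {
    for (;;) {
        bool work_waiting = false;
        for (auto& entry : tenants) {
            Tenant& t = entry.second;
            if (t.runnable_tasks.empty()) continue;
            if (t.quanta_used < t.quanta_per_round) {
                t.quanta_used++;                    // charge one ~4 ms quantum
                int task_id = t.runnable_tasks.front();
                t.runnable_tasks.pop_front();
                return task_id;
            }
            work_waiting = true;  // has tasks but already spent its budget this round
        }
        if (!work_waiting) return -1;               // nothing runnable anywhere
        for (auto& entry : tenants) entry.second.quanta_used = 0;  // start a new round
    }
}
```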
Question: is this a good Project 3? Yes, we can talk about that. Right now we don't really have anything like this -- I don't have slides for it. For query execution we use Intel Threading Building Blocks, which I don't like because it's basically a black box: you divide the query plan up into a DAG, hand it off to the TBB library and its scheduler, it does its thing, and eventually you get back a response, but you can't feed something else in at the same time. At least I don't think you can. Okay. All right, so the last thing I want to talk about is how to do flow control. I briefly touched on this with the hard- and soft-queue approach -- if I recognize that I can't process queries fast enough, I can take away networking threads and unpark query-processing threads -- but basically, if queries are showing up faster than we can handle them, we're going to become overloaded. Once again, the OS is not going to help us: if we're CPU bound, it just does nothing for us. If we're memory bound -- and this gets at the earlier question -- say I allocate one terabyte of memory and I don't have one terabyte of memory or one terabyte of swap space, what happens? The OS has this special thing called the out-of-memory killer, the OOM killer, and it basically says: I'm running out of memory, I'm running out of swap space, let me pick a process and kill it. Your process gets a kill signal and it just dies. Question: if you did something like calloc, would that make a difference? Like calloc, where you actually write to the space initially so everything gets initialized? So what would that do? If I tried to calloc and initialize one terabyte of memory, it's going to start eating through the swap space, and if I don't have swap space, I think the OOM killer's policy is to pick whichever process is using the most memory, because from the operating system's point of view, Linux's job is not to die. If, in order to survive, it has to kill your database system, it's going to do it. And from the operating system's perspective, this is the right thing to do: instead of the whole machine grinding to a halt and never getting anything done, it kills the database system, and that way you can still watch YouTube. From the database system's perspective, we could just let ourselves crash, but we can do something a little bit smarter, and there are basically two concepts here: admission control and throttling. The idea is: how do we recognize where the bottleneck is when we can't process things fast enough, and how do we expose that to the outside world so that new queries stop showing up and we don't get overwhelmed? Admission control means that when a query shows up and we don't have enough resources to execute it right now -- because we're CPU bound or we're running out of memory -- we just reject the request. In the PostgreSQL wire protocol I don't think you can send back a specific "rejected" message, but you can send back a query-failed error or close the connection.
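A hedged sketch of that admission-control idea: before executing a new query, check whether there is headroom, and if not, reject it so the client sees an error instead of the whole system falling over. The thresholds, counters, and names are invented, and a real implementation would make the check-and-reserve step atomic rather than tolerating the small race here.

```cpp
#include <atomic>
#include <cstddef>

struct AdmissionController {
    std::atomic<std::size_t> active_queries{0};
    std::atomic<std::size_t> reserved_memory_bytes{0};
    std::size_t max_active_queries;
    std::size_t max_memory_bytes;

    // Returns true if the query may run; the caller must call release() when done.
    bool try_admit(std::size_t estimated_memory_bytes) {
        if (active_queries.load() >= max_active_queries) return false;
        if (reserved_memory_bytes.load() + estimated_memory_bytes > max_memory_bytes) return false;
        active_queries++;
        reserved_memory_bytes += estimated_memory_bytes;
        return true;
    }

    void release(std::size_t estimated_memory_bytes) {
        active_queries--;
        reserved_memory_bytes -= estimated_memory_bytes;
    }
};
```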
Throttling means you introduce some artificial delay for queries as they show up: I sleep a little before I start processing a query, or maybe sleep a little while I'm running it as well, and that way, because queries are not all trying to do work at the same time, things get smoothed out for the system overall and I can actually make forward progress. I'm not saying one is better than the other -- different systems do different things. Admission control is probably the most common, but it means you need to be aware of what's going on in the worker pools where you actually execute the queries in order to make the decision. For throttling you don't need any centralized control: I just need to recognize that I'm going slower, or that my queue is growing faster than I can drain it, and then I introduce these delays, but I don't have to make a global decision -- every thread can make its own decision about adding the sleep. So throttling is the easier one to do, and the high-end systems do admission control because you can layer other things on top of it, like priorities: if someone connects with a certain username, let their queries always run even when the system is oversaturated, whereas a lower-paying customer's queries can run slower. And certainly, if I'm the admin, a lot of these systems always keep one port open for the admin to connect and start killing queries to free up resources. That's very common. All right, so we covered a lot today. The main takeaway, throughout the entire semester and especially today, is: what is a database system? It's a beautiful piece of software that we don't want the OS to ruin for us. So we're going to try to do as much as we possibly can ourselves. SQL OS is one extreme, but at the very least we need to be aware of the layout of memory and where our threads are going to be running so that we minimize traffic over the interconnect. And the amount of metadata we need to maintain about where a morsel or a block of data is located is pretty low -- it's not like we need a per-address lookup that tells us where every byte lives. We could also do a syscall on the fly and ask, for this block at this address, what NUMA region is it in, and then assign a task based on that, but typically you don't want to do that because it's a syscall, so we want to track everything ourselves. And the OS, again, is a frenemy: we need it to live, but we want to talk to it as little as possible and we don't want it doing anything for us, okay? Any questions? All right, so next class we're going into more detail about what these tasks actually look like and how we're actually going to execute them. And then we'll post something on Piazza about the updated checkpoint for Project 2. And if you don't have a group, I will email you today and force you to make friends and pick a group, okay? All right guys, see ya.