All right, so today we're talking about query scheduling: how do we take the tasks that we've generated for our query plan and start distributing them amongst the workers to execute them and produce the results? Just a reminder from the last class, we were talking about processing models. We said that the vectorized model is the best approach for running analytical workloads, because it has lower overhead than calling next, next, next for every single tuple. We get a batch of tuples, but not all the tuples like you would see in the materialized model. And then the push-based approach (it should be push, not pull) for query execution in the vectorized model, like we see in HyPer, is going to be a better approach. I don't actually know whether there's research backing this up, but this is sort of the general convention right now. So I showed this before, last class as well. This is the model of how we're going to break things up and what we're going to execute. Again, a query plan is going to be this directed acyclic graph of operators. Then we'll have different operator instances, which are the invocations of one of these operators on some piece of work. In the morsels paper you guys read, each invocation of a task on a morsel is an executing operator instance. A task can be one or more of these operator instances put together in a pipeline. And then the task set is the collection of executable tasks for a logical pipeline. So you can think of it like this: you take the query plan, you define the pipelines based on the boundaries at the pipeline breakers, and now you have a bunch of these tasks that are going to do that work on distinct chunks of data, or morsels, or blocks, whatever you want to call them. And this is what we need to schedule and execute today, okay? So for each query plan, the database system has to decide basically where, when, and how it wants to execute it. There's a bunch of different design decisions we have to make: how fine-grained should we break up the tasks for our query plan, how many CPU cores should we use for executing them, and which CPU cores should we be using. And again, in the case of the morsel paper, as we'll see, that was an in-memory system, so it's assuming the data is entirely in memory. For this lecture, we're going to ignore reading things from disk and assume the data is already in memory, because something else has already figured out what to execute and dispatched it. But we still need to care about where the data is actually residing in memory when we go assign our tasks. And then after a task does the computation it was meant to do, it produces some output. Where do we put that output? Because that's going to depend on who's actually going to end up reading it. So this is a great example, and I don't know how much it's emphasized in the morsel paper, but in the follow-up paper on Umbra, from the same team, they make a big deal that these are all decisions we want to make at runtime inside our database system.
And this is not something we can let the operating system figure out for us. The operating system is not our friend, it's always going to get in the way, and so we try to avoid it as much as possible. That means we want to do all the scheduling and data placement and all these things manually inside our database system, because if we let the OS do it, we're going to have a bad time. Think about it: if I'm executing a query plan, I know I have these tasks, I have these task sets, these tasks are producing results, and I know who's going to read each result. So I know where to go place it and where the next task will read it from. The operating system simply can't know these things. Your question is, is this because the operating system is too general? Yes. Database systems are the most important application class in all of computing from the beginning of time. I'm highly biased, but from the Linux kernel's perspective, we're just one of a hundred different things that could run. We're the most important, again, I'm biased, but they try to be general purpose. And there are certain design decisions they'll make in their scheduler that are going to be suboptimal for a database system. This has been well known since the 80s. There's a paper from Mike Stonebraker from 1981 that tells you how the OS is basically your enemy, and this continues today. All right, so for our scheduling algorithms and scheduling methods, we have four goals. The most obvious one is that we want to maximize throughput: maximize the number of completed queries that we can push through the system in a certain amount of time for our given resources. That's sort of a no-brainer. We also need to be mindful of fairness, meaning that we want to ensure that no query is starved for resources. Like if there's a really long-running query that eats up all the cores, that's going to block a bunch of the shorter-running queries that may be behind it, and the system is going to look unresponsive. Related to this, basically the same thing: if we ensure a fair allocation of resources so that nobody gets starved, the system is going to seem more responsive because our tail latency is going to be much lower. The way to think about this is, if I have a really long-running query and I could use some of its resources for other short-running queries instead, the short-running queries are definitely going to notice that they got their result back right away. But if the long-running query takes another 10 seconds to run when it's already run for five minutes, no one's going to notice, okay? So we won't see this so much in morsels, but we'll see it in the HANA paper and then the Umbra paper, where they can still allow the faster, short-running queries to get results more quickly. And then lastly, sort of obvious as well: since we're not letting the operating system do any of this and we have to implement it all in our database system ourselves, we don't want a very expensive scheduling mechanism where it takes a long time and threads block because they're updating global data structures to figure out what to run next.
We want our worker threads to be spending more time executing queries rather than trying to figure out what queries or what tasks to execute, all right. So for today's class, we're going to first talk about worker allocation. We're not going to spend much time on the process model, other than to say it's multi-threaded. We can make that assumption going forward, although Postgres is not. Then we'll talk about data placement policies: how to decide where we want to store chunks of data in memory and why that matters for modern CPU architectures. And then we'll finish up talking about four different scheduler implementations: the morsels paper from HyPer you guys read, the follow-up to that in Umbra, HANA, and then a little bit of SQLOS from SQL Server at the end, okay. As I said, for this discussion today we're going to ignore disk. We're going to assume that when a task gets assigned to start running, the data it wants to read is in memory. Obviously in a cloud system or a disk-based system that's not always the case, but just assume that something else has prepared things in memory for us ahead of time. Okay. So in the undergrad class we spent time talking about what a process model is. This is just a reminder of what that is and how it relates to what we'll talk about with scheduling. The process model defines what the underlying system architecture looks like in terms of supporting multiple queries over multiple resources, basically multiple CPUs. You could have a process per worker or a thread per worker. And I'm trying to use the term worker as much as possible during this lecture, but I've already slipped up and said worker thread. Just think of the worker as some resource that can execute a task, execute machine code for us, whether that's a process or a thread. So for everything we'll talk about today, it actually doesn't matter whether it's a process-based worker or a thread-based worker, but for the systems we'll talk about this semester, for the most part, just assume that they're threads. Nobody's building a modern OLAP database system today using the process model; everyone is using threads. Unless you fork Postgres, which a lot of systems do. But think of Redshift: when they forked Postgres, or the ParAccel code it's based on, they got rid of the process model and it's all multi-threaded now. But any system that's based on Postgres without making major changes is going to be using processes. For whatever reason, in 2016 I had my first PhD student, and we took Postgres, forked it like everyone else, and switched it to be multi-threaded instead of using the process model. I forget why we did that. But it turns out the way we did it was to leverage the Windows-specific code, the Win32 code. You can use that to convert it to be multi-threaded instead of using the Unix stuff that's inside of it. There's a bunch of #ifdefs in the source code that say here's the Windows process model stuff, here's the Linux process model stuff. If you start with the Windows one, you can convert it to pthreads more easily. We also converted it to C++11. Again, I don't remember why we did that. It was the early days, things were wild. Anyway, so again, the worker is going to be the thing that actually executes stuff for us.
So what today is about is figuring out: we have a bunch of these workers, what should they be doing, and how should they find out what they should be doing? If we want to allocate these workers, you have to say how many workers we actually want to have. The two basic approaches are: one worker per core, or multiple workers per core. In the morsels paper you guys read, they were using the first one. A lot of systems use it just because it's simple to reason about. In some systems, even if you turn on hyperthreading, you still want to have only one worker per physical core. Hyperthreading can help if you're disk-bound in some cases, but that's not us for this lecture. So in the first case, for each CPU core we're going to have a thread, or worker, pinned to that core exclusively. Meaning the OS scheduler will not let that worker run anywhere else, and within our database system no other worker will run on that core. The OS can still run kernel threads or other random stuff there, and unless you try to turn everything off, which I don't think you can for kernel threads, you can't prevent that. But within reason, we know that one worker is only going to run on one core at this location. Multiple workers per core allows us to have a pool of workers, either per core or per socket. The idea here is that the operating system has a bunch of worker threads it could run, so in the case that one of them gets blocked, say on a mutex down in the OS, it can schedule another worker thread or process to execute. HANA is going to do this, and we'll see how they support it, because they're going to have a bunch of pools and they're actually going to keep track of who's allowed to run at any given time. So again, it's basically re-implementing what the OS would do for you, but entirely inside the database system. But for the morsel paper, the first one is what we're going to assume. And this syscall here is basically how you tell Linux: I want my thread to run on this CPU core and nowhere else. All right, so the next question is, how do workers find out what tasks they should execute? Just like before with the processing model stuff, there's a push-based and a pull-based approach. In the push-based approach, there's a centralized component, a dispatcher or scheduler, that knows what workers it has and what they're doing at any given time, and when it knows that a worker is free, it pushes a task to that worker and says, please go run this. Then when it's done, the scheduler gets the results back and says, okay, here's the next thing you should run. In the pull-based approach, there's some global queue or global data structure that holds the available tasks, and it's up to each of the workers to go look at it and say, what should I run next? And you can partition it: maybe per socket you have a local task queue and then there's also a global task queue, and you check your local queue first and, if it's empty, go check another queue. We'll see this in the HANA case. Or it could be a single queue, like in the case of the morsel paper. All right, nothing here should be that exotic.
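(To make this concrete, here is a minimal sketch of a pinned, pull-based worker. It assumes Linux/glibc for the pthread affinity call; the Task type and the mutex-protected queue are just placeholders, and real systems use more sophisticated, often lock-free, structures.)

```cpp
// Sketch only: one worker pinned to one core, pulling tasks from a shared queue.
#include <pthread.h>
#include <sched.h>
#include <atomic>
#include <deque>
#include <mutex>
#include <optional>
#include <thread>

struct Task { /* one or more operator instances over a chunk of data */ };

class TaskQueue {                       // a real system would use a lock-free queue
 public:
  void push(Task t) { std::lock_guard<std::mutex> g(m_); q_.push_back(std::move(t)); }
  std::optional<Task> pop() {
    std::lock_guard<std::mutex> g(m_);
    if (q_.empty()) return std::nullopt;
    Task t = std::move(q_.front());
    q_.pop_front();
    return t;
  }
 private:
  std::mutex m_;
  std::deque<Task> q_;
};

void pin_to_core(int core_id) {         // tell Linux: this thread runs here and nowhere else
  cpu_set_t mask;
  CPU_ZERO(&mask);
  CPU_SET(core_id, &mask);
  pthread_setaffinity_np(pthread_self(), sizeof(mask), &mask);
}

void worker_loop(int core_id, TaskQueue& queue, std::atomic<bool>& shutdown) {
  pin_to_core(core_id);                 // one worker per core
  while (!shutdown.load()) {
    if (auto task = queue.pop()) {
      // ... execute the pipeline over this task's data ...
    } else {
      std::this_thread::yield();        // nothing to run right now
    }
  }
}
```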
All right, so regardless of how we assign workers to cores and how they find out what tasks they should execute, we also need to be aware, for the data they're going to process when they execute a task, of where that data is located relative to where the worker is located. We're assuming that we can assign a worker either to a single core exclusively or to a group of cores on a single socket. The data is also going to be somewhere, and we want to make sure that, ideally, workers are always accessing data that's local to their location, to their CPU socket. All right, so I don't know how much other classes cover NUMA versus uniform memory access. So you're going to get a quick crash course on what CPU architectures look like with different kinds of memory layouts, and then we'll see why we want our database system to make sure we minimize the traffic across CPUs. So back in the day, the old way of doing multi-socket CPUs was an approach called uniform memory access, sometimes called symmetric multiprocessing (SMP) architectures. Think like the 90s, early 2000s. In this case, what you would have is a bunch of these different CPU sockets. They would each have their local L1, L2, L3 caches, and then there was this system bus that acted as a gateway to a bunch of memory. Again, it's all on a single motherboard; we're not going over the network here. The idea is that if I'm running a worker on this socket over here and I want to get data from that memory, I just issue a load or store to an address, it goes through my system bus, and the bus is responsible for routing things to the right location. So in this environment, there is no notion of local memory for the sockets, meaning the cost of reading memory from this DIMM versus that DIMM is the same. So there's nothing really to optimize in the system because everything costs the same. Of course, I'm glossing over a lot of details. If I do a bunch of reads and different sockets read the same thing, they all have it in their caches. So then if someone does a store on a cache line or updates that piece of memory, the system bus has to do the cache invalidation to make sure everything's synchronized. x86 is very aggressive about making sure everything always has a consistent view. We can ignore all that for this. The main thing is, if I access memory in any DIMM up there, the cost is always going to be the same. Modern systems use a different approach called NUMA, non-uniform memory access. The idea here is that every socket still has its own local L1, L2, L3 caches, but it also has local DIMMs, local memory. So the cost for this socket to read some chunk of memory that's local to it is way faster than reading from a remote NUMA region. Like 2x, pretty significant. We're talking at the nanosecond scale, but still, 2x is quite a lot. And the way this works is that if a worker running on this socket needs to read memory from over there, it has to go over this interconnect on the motherboard, send a message over there saying go get the memory I need, and bring it back.
And again, the motherboard is still responsible for figuring out who has what cache lines for what pieces of memory, so if there's an update, it has to do a cache invalidation across all the sockets that may have a copy in their caches. It's very complicated, very interesting. But the only thing we care about, because we're not doing transactions and we're not worried about updates, we only care about the data we want to read, is that if the memory is here, that's going to be way faster than if the memory is over there. This interconnect has a bunch of different names: Intel used to call it QPI, the QuickPath Interconnect, then they upgraded it to the UltraPath Interconnect in 2017, and AMD calls theirs the Infinity Fabric. The bandwidth is quite high, hundreds of gigabytes per second, but again, the latency difference between local memory and remote memory is quite a lot. Yes? The question is, do I have to send a request to the other socket's cores to get a piece of memory, or can I go get the memory directly? I think you just go to the memory, like DMA, direct memory access. You don't have to go through the core, you don't have to utilize the core, but you still have to go over this interconnect. Okay. All right, so now let's go back to databases. This is going to make our system a little more tricky, because before, with the uniform approach, if I call malloc I don't really care where the memory actually is because the cost is always the same. But in this world, if I call malloc and it comes back with a memory address, where is it actually being stored? Because now it can matter: if my task starts ripping through and scanning some data that's in memory, and I'm going over the interconnect every single time, it's not going to be on a per-byte basis, it's going to fetch things in cache-line-sized chunks, but it's still going to be 2x slower than reading it locally. All right, so we want to be able to control exactly, in our database system, where this memory is being allocated, and we know where our workers are running. So now we can ensure that if we assign a task to a worker, it's a task whose data is going to be in that worker's local memory, if we can schedule accordingly. So the database system is basically going to partition memory for a database and assign each partition of memory to a CPU. And because we can control exactly where this memory is located, and we can control what things we want to schedule on different cores, as I said, we can make sure the data we want to read is going to be local, ideally. In some cases it won't be. This is an old problem in databases, especially in distributed databases; it's called data placement. Think of it like this: the partitioning problem is how do I take my table, pick some attribute, and slice it up into horizontal chunks or shards and put them on different machines. The data placement problem says which chunk goes onto which machine. Again, we're not in a distributed system, we're running on the same box, but it's essentially the same problem.
So we can rely on syscalls like move_pages, or the command-line program numactl. These allow us to specify the policies for how the database system's process decides where memory is going to be located, because the OS will actually expose this control over the hardware to you. All right, so a quick question: if I call malloc in our database system, what happens? Assume the allocator doesn't already have a bunch of memory cached in user space, because it may maintain its own cache as well. Assume we have to go to the OS and get memory. What happens? You said sbrk, yes. Well, that's later. Initially, nothing happens. The allocator extends the process's data segment with sbrk, but this is going to be virtual memory that's not actually backed by physical memory. It's just a promise. The OS is like, oh yeah, here's a memory address. It's virtual memory, and only when you actually do something with it is there a page fault, and then it gets backed with physical memory. So then the question is, if I now try to access this memory I just got back, where is it actually going to be located? The student's answer is that it depends on which thread touches it first: so you're basically saying whatever the first thread that touches it, the memory goes local to that thread. Yeah, so that's one of the policies, yes. By default, you get interleaving, where it just allocates it across memory, across all CPUs, and hopefully it works out. And what he was referring to is a mode you can run your process in, called first touch. So one thread in some NUMA region allocates memory; it's virtual memory, not backed by physical memory yet, but when another thread touches it, wherever that thread is running, whatever NUMA region it's pinned to, that's where the memory actually gets allocated. Once things are actually allocated, you can move stuff. Going back here, with this move_pages syscall, if you give it a memory address it'll pass back, hey, here's the NUMA region this chunk of memory is in, or you can say, here's a memory address and here's the length of the chunk, and here's the NUMA region I want you to put it in. So you can have very fine-grained control over all these things. So here's an old experiment that a student ran a long time ago on a machine here in the PDL that had eight sockets; we wanted to test this out with a really simple execution engine, similar to what you guys are building in project one, where it just allocated a bunch of memory and then did a sequential scan through it. This thing has eight sockets, 10 cores per socket, plus hyperthreading. The black line here is where you just let the OS figure out where to put stuff, using the random policy, and the red line is where we ensure that the thread that's going to read a chunk of memory is reading memory that's local to it. Along the x-axis you're scaling up the number of threads. And you see here, in tuples read per second, it's almost double, it is double, the performance by only reading local memory.
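(To make the placement control concrete, here's a minimal sketch using libnuma and the move_pages syscall; the function names are real Linux/libnuma APIs, but the surrounding structure is just illustrative. Link with -lnuma.)

```cpp
// Sketch: allocate on a chosen NUMA node, query where a page lives, migrate it.
#include <numa.h>
#include <numaif.h>
#include <cstddef>

void* alloc_on_node(std::size_t bytes, int node) {
  // Physical backing comes from `node` once the pages are touched.
  return numa_alloc_onnode(bytes, node);
}

int node_of(void* addr) {
  void* pages[1] = {addr};
  int status[1] = {-1};
  // nodes == nullptr means: don't move anything, just report placement.
  move_pages(0 /* this process */, 1, pages, nullptr, status, 0);
  return status[0];
}

void migrate_to_node(void* addr, int node) {
  void* pages[1] = {addr};
  int nodes[1] = {node};
  int status[1] = {-1};
  move_pages(0, 1, pages, nodes, status, MPOL_MF_MOVE);
}
```

And from the command line, something like `numactl --interleave=all ./dbms` or `numactl --cpunodebind=0 --membind=0 ./dbms` gives you the coarse-grained version of the same control for the whole process.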
Yes? Why does it increase in the beginning and then stabilize? So the question is, why does it increase at first and then plateau? At the beginning there are fewer threads, but you still have to read the same amount of memory, so each thread is reading more of the table, and the probability that a thread is reading data local to it matters less. A student points out there are only 80 physical cores in total; everything past that line is hyperthreading, which is what this demarcation here marks, and we'll get to that. His question was why, in the beginning, the two lines are basically the same, and I'm saying that with so few threads, the amount of remote traffic per thread washes out the difference, whereas with random placement you just happen to sometimes read data that's local to you. How are they distributed? With the random policy they're randomly distributed. With the local partitioning, I break the table up into chunks, into partitions, and then I make sure that each thread that scans it only reads data that's in the NUMA region local to it. And yes, with one thread it doesn't matter much: my thread is pinned to this core, the data I need is in another NUMA region, I have to bring it over the interconnect to scan it, and that cost is the same either way. All right. So the other thing he brought up is this demarcation line here, which is when hyperthreading kicks in, because this is 80 cores. After that, you're getting into the logical threads from hyperthreading. And in this case, because we're bottlenecked on memory bandwidth in both configurations, that's why it plateaus. This is what I was saying before: oftentimes for OLAP systems you actually want to turn off hyperthreading, because it's not going to help you, it's just going to get in the way and give you a false sense of parallelism, of additional resources that you don't actually have. The threads are all waiting for cache line fills because they're trying to read something that's not in their cache. It has to go out and get it from memory, and whether it's going over the interconnect or it's local, you still have to stall while you fill in your cache line. Then it can run. So it doesn't matter that you have an additional hyperthread on a core, because they're both waiting for the cache line fill. And that's why it plateaus. This experiment is probably five or six years old. I don't have access to another eight-socket machine, so I don't know how much better it would have gotten. I still suspect you would see the same plateauing effect, but maybe, because the interconnect got faster with UltraPath over QPI, the difference between the two would go down. Okay. So I've already said this: the partitioning scheme versus the placement scheme.
So we won't spend too much time on this. The partitioning scheme we already talked about: how to decide to break the table up into horizontal chunks. The placement scheme just says where we should put them. The default would be round robin; another approach is to interleave it across cores. And the database system will be aware that this chunk of memory is located in this NUMA region. Again, in the morsels paper, the memory is the table itself because it's an in-memory database system. But if you're reading a bunch of Parquet files or ORC files or whatever, coming off the network over S3 or whatever distributed file system, that has to go into memory somewhere as you start reading it. And you want to put it in memory where you know tasks will be able to run locally on it. All right, so far we have the task assignment model and we have the data placement policy. Again, it will be NUMA-aware. So now the question is, how do we take a bunch of tasks from a logical or physical query plan, divide them up into task sets, and assign them to workers to actually run? In the OLTP world this is easy, because there's usually only one pipeline. It's going to be, do an index scan to get Andy's account and maybe do a projection or something on it. So it's one pipeline, it's one task. You just assign that to a worker and be done with it. For OLAP queries, it's going to be more complicated because we have to worry about dependencies between these tasks or these pipelines, and there are going to be a lot more instances of tasks within a task set for a pipeline. So there are way more things we actually need to run. The naive approach is what you'd call static scheduling. This is where the database system just figures out, at the time it generates the query plan, how many threads or how many tasks it should have for the query, and then it dispatches them and it's done. It doesn't change based on resources, it doesn't change based on what other queries are running. It's the easiest thing to do because I just say, all right, I have the same number of tasks as the number of cores, assume that the tasks within my task set for a pipeline all run for the same amount of time, and I shove those out and I'm done. And I can use a really simple first-come, first-served policy, where the priority in which queries execute depends on what time they showed up at the system. That's the most naive thing to do, right? The challenge, of course, is that this won't work in an operating environment where there are a bunch of other queries running or showing up, where now I could have contention on those resources; I don't want to starve anyone, and I have to keep those goals in mind that I talked about at the beginning. Right? So dynamic scheduling is what we're going to do instead, where we decide on the fly how to assign tasks to workers based on what resources are available and also based on where the data they want to process is located. So this is what the morsel-driven scheduling approach, the paper you guys read, is about. The term morsel is a HyPer term, something they invented. I don't know of any other system, except for maybe DuckDB, which uses this approach, that refers to data this way. But basically they didn't want to use the term block or partition.
They had to come up with something else to say, here's a piece of data. So they call it a morsel. It's slightly bigger than a block but smaller than a partition. I don't remember exactly what they say in the paper, but it's something like 100,000 tuples; it's based on the number of tuples. A morsel is defined as a fixed number of tuples, so you have evenly sized morsels for a table. You're going to have one worker per core, pinned to the core so that only that worker runs on it. You have one morsel per task. They do a pull-based assignment where the worker goes and looks in a queue and tries to figure out the next thing it should run. And they just do round-robin data placement for the tables. Again, this is an in-memory database; they're not bringing in Parquet files from disk. Umbra, the successor to HyPer, actually supports disk, but their newer scheduling paper ignores disk entirely, and as I said, we'll do the same thing. So all the operators are going to have parallel, NUMA-aware implementations, and it's going to be a push-based approach. They also do query compilation, which we'll see in two classes, which is how they define what these tasks are. But we can ignore that for now. Yes. The workers have to go look in the queues and figure out what to do. I'll have to double-check the paper, but I thought they go check queues and they can do work stealing. The dispatcher? I don't think there's a dispatcher thread in morsels. No, there's no dispatcher thread; the "dispatcher" is more of a logical concept. It's not that there's separate dispatcher code somewhere; every worker executes the same function call to figure out the next thing it should do. But they are aware of which queues they're reading from or pulling from. There's a global data structure. In this paper they say it's a lock-free hash table; they don't really say how it's implemented, and they don't talk about it being a bottleneck. But in the follow-up paper, which we'll briefly cover after this, they avoid it and specifically come up with a design that doesn't have that global hash table, because it is a bottleneck at larger scales. Okay, yes. So, no dispatcher thread, because they're using a pull model. All the workers look in a global queue to figure out the next thing they need to do, and the threads have logic that makes them choose, or prefer to take, tasks that operate on data local to their NUMA region.
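(Putting that together with the work stealing I'll describe next, here's a loose sketch of the preference-plus-stealing logic. The per-NUMA-node task lists and the PipelineTask type are made up for illustration; HyPer's actual dispatcher state is a lock-free structure, and this sketch leaves out the synchronization entirely.)

```cpp
// Sketch: a worker asks for its next task, preferring morsels in its own
// NUMA region and stealing from other regions only if nothing local is left.
#include <optional>
#include <vector>

struct Morsel { int numa_node; /* ~100K tuples of some table chunk */ };
struct PipelineTask { Morsel morsel; /* the pipeline to run over it */ };

struct DispatcherState {
  // Index = NUMA node. A real implementation needs synchronization
  // (or a lock-free structure) around these lists; omitted here.
  std::vector<std::vector<PipelineTask>> per_node_tasks;

  std::optional<PipelineTask> next_task(int my_node) {
    // 1. Prefer a task whose morsel lives in my NUMA region.
    if (!per_node_tasks[my_node].empty()) {
      PipelineTask t = per_node_tasks[my_node].back();
      per_node_tasks[my_node].pop_back();
      return t;
    }
    // 2. Otherwise steal from another region rather than sit idle.
    for (auto& tasks : per_node_tasks) {
      if (!tasks.empty()) {
        PipelineTask t = tasks.back();
        tasks.pop_back();
        return t;
      }
    }
    return std::nullopt;  // nothing runnable (e.g., waiting on a pipeline breaker)
  }
};
```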
But in the case that nothing is available for its local NUMA region, a worker can go ahead and steal tasks that would normally be run by a worker in another NUMA region, right? So it's a cooperative schedule where everyone works together in unison to make sure that things actually get executed and get done. Again, as I said, the paper glosses over the synchronization cost of the global hash table, but they fix that in the Umbra paper after this. All right, so here's our table and we want to do this simple join. The morsels are just going to be some artificial boundaries of 100,000 tuples, or whatever number you want, to say here are the different chunks of data. In their case, it's 100,000 tuples. When we started building our own system here, Peloton, we did 1,000 tuples. In H-Store, an earlier system I built, we did 10 megabytes. These are arbitrary numbers we just picked. I think they claimed they did some benchmarking and profiling to decide on 100,000, but again, Umbra gets rid of that. When we were building NoisePage, which was also an in-memory system, we did one-megabyte morsels because we wanted everything aligned at 20-bit offsets for addresses; there's a trick you can do in C++ to reduce the size of a pointer to a memory chunk if everything is aligned along that size. All right, so then we have these different morsels and we're going to assign them to these different sockets. So now we want to execute a query. The first thing to point out is that here are all the morsels that these sockets, that these workers, are responsible for, that they know about. They also each have a chunk of memory that's local to them, in their own NUMA region, where they store intermediate results: the output of a task has to go somewhere, so they want to write it back locally. And the idea here is that because I'm operating on data in my local NUMA region, I don't have to go over the interconnect when I read it. And when I produce output, I write it back to my NUMA region, because the next task I execute will then read the data I just outputted. If it's in my same NUMA region, then I'm going to be fast, because I take my output and read it back in for the next task, and it's all local to my NUMA region. All right, there's also internal metadata we're maintaining that I'm not showing here. We know the dependencies for these pipelines, these tasks. So in this case, I know I have to build the hash table on the build side of this join: I have to scan A and then build the hash table. Once that's done, then I can do the probe with B. So again, there's metadata that keeps track of the fact that I can't run anything for the second task set until the first one is done, right? So in this case, say the task set for the first pipeline, building the hash table, is evenly divided across the different workers based on the morsel boundaries. Each of these threads, each of these tasks, runs in parallel at the same time. All right, so they read their local data to build the hash table and then write it back out. I assume in this case we're just doing a partitioned hash table.
For illustration purposes, it doesn't matter. But now say the first two workers are done and the third one isn't done yet. These guys have to stall, assuming there are no other queries to run, because we know we can't execute anything in the next task set for this pipeline until the hash table is fully built; otherwise we could get false negatives on the probe. All right, so when that's complete, then we start executing the next one. Again, these all run in parallel. Say this one guy finishes; it then goes up to the global task queue and can pick another task to run. In this case, if there's no task that prefers or wants to run on its NUMA region, say the next task it pulls out wants to run over here because that worker is still busy, and because there's nothing else to do, it goes ahead and poaches it, steals it, and runs it. And maybe this task has to read data that's over here, and that's okay, right? And then in this case, it just writes its output to its own local buffer. Yes? The question is, could there be a case where, because it's reading from memory that's in a different NUMA region, this is actually slower? So her question is: in my example here, this guy was idle, so he says, okay, I'm just going to steal something and start running this next task. Could it be that we were actually better off letting this guy stay idle, because it has to go over the interconnect to get this data, and it's better to just let the other one finish and then go get the next task it wants to execute on local data? So we'll see this when we talk about HANA: they claim you don't want to do this at large socket counts, that it actually makes things worse, exactly as you said. I think for the morsel paper they run up to four sockets. The HANA papers run up to something like 64 sockets; these are massive machines. And in that environment, the cost of going over the interconnect is so expensive, you just don't do any work stealing. Okay, so again, the reason they do work stealing is that there's only one worker per core and one morsel per task, so if all the threads are just sitting waiting for the stragglers before they can execute the next set of tasks, they argue it's better to go ahead and try to execute something, even though it goes over the interconnect, because that's better than being idle, right? And as I said, they maintain this lock-free hash table to keep track of what they should execute. So yes, your statement is that they're basically implementing out-of-order execution like a CPU. Yeah, I said the word speculative and I shouldn't have; it's not speculative like in a CPU. Right, so your statement is, because we have this dependency graph, we know which pipeline depends on the output of another pipeline, so we know which tasks we're able to run right now. So if we're idle, go take something that we can safely run. Yes, without the speculative part, it is the same idea.
In fact, I think they could use speculation in a limited way when they know something is coming: they can at least do speculative memory loading or speculative disk loading for certain things. Yeah, we're ignoring disk for this, but yes, you could say, these are the things coming up, let me go ahead and prefetch from disk. That's a whole other beast. In terms of speculative execution, I think, whether this is true or not, you don't want to speculatively execute any tasks in an OLAP query, because, for example, if the hash table here was not fully built yet and I start probing it before it's fully built, then I can end up with wrong results. Right? So I think it's harder to do speculative execution in an OLAP system. In an OLTP system, if you're doing stored procedures where you have transactions with multiple queries, as long as the input of one query doesn't depend on the output of another query, you can do speculative execution, even though the program might be written in a serial, procedural manner. There is work on that. It's sort of like optimistic concurrency control: I assume I'm not going to have any problems, let me go execute these things, and then I check at the end whether that was okay. You can't really do that in OLAP, because it's driven by the data. Okay, so in this paper you guys read, there are two big issues that come up that they don't discuss, but that show up in the follow-up paper. The first is that the morsels are fixed size and one task corresponds to one morsel. So they assume the execution cost per tuple is going to be the same across different morsels, or even within the same morsel, or across different query plans. Meaning, if I want to scan a table and I have a WHERE clause with a predicate, and two queries are operating on the same morsels, one predicate might be WHERE A = 1, while another predicate might be a regular expression evaluation on string data. That regular expression is going to be way more expensive, so the cost of looking at every single tuple is going to be way higher than in the other one. But they can't account for that, because everything is based on these morsels, and that's the rigid definition of the unit of work within a task. You can't, how do I say this, you can't change scheduling decisions based on how much time you're spending, which is really the thing you care about if you want fairness and responsiveness. You want to be able to say, this thing is going to take a long time, so I'm going to let a shorter-running query run for a bit. I can't do that if the morsel size is fixed, because I have to finish the morsel before I go to the next thing, right? And related to that, all query tasks are executed with the same priority, so if I have long-running queries, the shorter queries are going to get blocked behind them. So I don't have all the slides I'd want to show for the Umbra scheduling approach, but I'll give a little flavor of how it looks. In this world, it's still going to be morsels, but now a task is not restricted to be one morsel; it can be one or more.
Anyway, so now the notion of a unit of work is going to be based on time, up to about one millisecond. The idea is that if I'm given a morsel to operate on and I complete it within that time budget, I can go back and get the next morsel to execute without having to get rescheduled as a new task. This means you're operating on a deterministic, time-based schedule rather than a data-size-based schedule, right? The other thing they have is priority decay, meaning when a query shows up, it's given the highest priority, because we assume it's going to be short. But if it keeps running, if it takes longer and longer, then its priority decays exponentially, and that means it's less likely to get scheduled. It will still make progress and process data, but now if short queries show up, they'll have the higher priority and they'll run to completion very quickly. So again, this ensures that the system seems very responsive and short-running queries finish quickly, because otherwise people complain, and the long-running queries still get done, they just maybe take a little longer than they would have otherwise. And overall, the system stays fully utilized. So this paper came out in 2021. The reason I didn't assign it for you guys to read is that it talks about morsels and assumes you already know what morsels are, so you had to read the morsel paper first to understand it. The guy that wrote this paper gave a talk with us a year or two ago during the pandemic. He's since graduated; this was his master's thesis. And now he's in charge of the query execution team at Firebolt, which is an OLAP system that's a fork of ClickHouse. All right, the other thing they do is get rid of that global task queue. Instead they rely on thread-local storage to keep track of the state of the system. It's still a pull-based approach, but each thread has its own local metadata about what the overall state of the system and the queue is, and it uses that to figure out the next thing it should execute. The way it works is that you have a global task set slot array at the top, and this is fixed length. I'm showing four slots, but I think they go up to 128 in the paper. The idea is that every position in that slot array tells you, here's a task set for a given query that has tasks available for us to execute. And these are just pointers to this other task set data structure on the side that has the dependency graph: here are the task sets, here are the morsels they can operate on, and here are the other task sets that depend on the output of the current task sets. Then within every worker, I have these bitmaps that keep track of what's changed since the last time I checked, where to go find out what's new, and what I'm currently able to execute. So you have this active slots bitmap, where a set bit means that at the same position in the global array there is a task set I can go look up to find the next task to execute. Right?
Then there's this change mask, another bitmap, where everybody has their own version, and I set a bit to indicate that something has changed in that task set slot; whatever metadata you have cached about what's up there, you have to invalidate it and check again. And then the return mask says, if I complete a task, if I'm the last thread to complete all the morsels for a given task set and I've changed what's in that slot, then go check to see what has changed. And all they're doing, and I'll show an example on the next slide, is that if I have to update one of these bitmaps, I do a compare-and-swap on the bitmaps in every single thread. They claim that's cheaper than having a lot of contention on a single global queue. All right, so let's see an example where a thread completes a task set, has to go get the next task set, put it in the global slot array, and then notify the other threads about what's changed. Say this thread here completes running query one, task set one. It knows where it got it from, so it goes up to the global slot array, which is just a bunch of pointers that point to the data structure in this task set array or hash table. If it figures out, okay, I processed all the morsels that were available for this, then I need to put the next task set back in the slot. And again, you just do a compare-and-swap on the array to ensure that nobody else tries to put something in the same slot at the same time I did, and you back off if there's a collision. But now I need to notify all the other workers that we put a new task set in here, to avoid any caching issues. So they just do a compare-and-swap on the return mask in all the different threads' local storage. Again, compare-and-swap is cheap to do; you do have to go across NUMA regions, but it's unavoidable. And now when this thread asks, what's the next action, when it's done with its own task and has to figure out the next thing to do, it can check whether there's a change here. If yes, then it knows to throw away its cached copy, follow the pointer, and find what actually changed, right? Same thing for the change mask: if a new query shows up, you flip the bit in the change mask on all of them, so they know that since the last time they checked there's now something new in the slot array, and they update their active slots. So it's just an alternative to the global data structure for keeping track of what's in the queue, what's available, and who's running what. What I'm not showing here is that there's also a notion of priority. Every time a thread completes something, completes a morsel and updates its task set, it keeps track of its own local priority for the different tasks that are available. And that way it's less likely to choose something with a lower priority over something with a higher priority.
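(A loose sketch of the bitmap-and-notification idea, not Umbra's actual data structures: a 64-bit mask stands in for the slot array, and std::atomic read-modify-write operations stand in for their compare-and-swap loops; the contents of the slots themselves are omitted.)

```cpp
// Sketch: per-worker thread-local views of a global task-set slot array,
// updated with atomic bit operations instead of a contended global queue.
#include <atomic>
#include <cstdint>

struct WorkerState {
  std::atomic<uint64_t> change_mask{0};  // slots that changed since this worker last looked
  std::atomic<uint64_t> return_mask{0};  // slots whose finished task set was replaced
  uint64_t active_slots{0};              // cached view of which slots hold runnable task sets
};

// After installing a new task set into `slot` of the global array, notify every
// worker with one atomic OR per worker (a CAS loop would accomplish the same).
void notify_all(WorkerState* workers, int num_workers, int slot) {
  const uint64_t bit = uint64_t{1} << slot;
  for (int i = 0; i < num_workers; i++) {
    workers[i].change_mask.fetch_or(bit, std::memory_order_release);
  }
}

// Before picking its next task, a worker folds pending changes into its cache.
void refresh(WorkerState& me) {
  uint64_t changed = me.change_mask.exchange(0, std::memory_order_acquire);
  if (changed != 0) {
    me.active_slots |= changed;  // re-read those positions in the global slot array
  }
}
```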
So when I have to figure out the next thing I'm going to execute, it's basically flipping a weighted coin to figure out which slot I should look at for the next task to execute. And over time the priorities decay. Yeah, that's a high-level overview; I'm just trying to show it's an alternative, and they fixed a bunch of the issues that were in the first morsels paper. So now let's talk about a way more complicated scheme than the morsels and the Umbra stuff: HANA. HANA is an in-memory system out of SAP. SAP is one of the biggest and oldest software companies in the world. They make customer relationship management software, ERP software; basically think of internal back-end stuff for major corporations. They make a lot of money. HANA was a system they started building as a way to combat Oracle. What I'm saying here is not a secret, this is public: SAP's main software for the longest time only ran on Oracle or Sybase. So they bought Sybase and then started building this new thing called HANA to replace Oracle, so that Oracle didn't get a cut every single time they sold their software. Then Oracle bought PeopleSoft to fight SAP, whatever, that's above my pay grade. Anyway, it's an in-memory system. It's not used by startups; it's mostly used by big companies, big corporations. But it has a lot of sophisticated, modern in-memory and columnar methods and implementations in it, because it was written from scratch in the 2010s. And this paper here was actually a student project at SAP HANA, so I don't know whether it made it into the big rewrite they did a few years ago. It was done by a PhD student to redo the entire scheduling system in HANA a few years ago, to switch it over to something more sophisticated. So think of this as an alternative approach to the morsels. It's going to be a pull-based scheduler with multiple worker threads, and each socket is going to have a pool of workers. These pools of workers are going to be in different modes, and the database is going to figure out what mode each worker should be in and what it's waiting for. And it can increase or decrease the number of active threads or active workers based on the demand on the system. The entire database system runs off these worker pools, and that includes all the background threads for things like garbage collection, networking, compaction, anything else the database has to do. In the morsels paper, they had dedicated threads doing garbage collection and dedicated threads doing networking, and the morsel scheduling stuff was only for query execution. In this approach, everything runs off the worker pool. Each thread group within a worker pool is going to have a soft and a hard priority queue. We'll see this in a second, but basically the priority queue determines whether a thread running in a different NUMA region is allowed to steal tasks from that queue. Hard priority means it's not, because I want that work to run exactly on this NUMA region. Soft priority means you can steal it. So you would put something like garbage collection, which accesses a lot of memory and is a heavyweight operation, in the hard queue so it stays pinned to a NUMA region, same with some networking operations. But query tasks, those are okay, we allow those to be stolen.
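(Here's a minimal sketch of the soft/hard queue idea per thread group; the Task type and the mutex-based queues are stand-ins, and HANA's real implementation has much more going on, including the pool state machine described next.)

```cpp
// Sketch: one thread group per socket with a soft queue (stealable by remote
// workers) and a hard queue (NUMA-bound work that only local workers may take).
#include <deque>
#include <mutex>
#include <optional>

struct Task { bool numa_bound; /* e.g., garbage collection vs. ordinary query work */ };

enum class WorkerPool { Working, Inactive, Free, Parked };  // the four pools per group

class ThreadGroup {
 public:
  void submit(Task t) {
    std::lock_guard<std::mutex> g(m_);
    (t.numa_bound ? hard_ : soft_).push_back(std::move(t));
  }
  // Called by a worker pinned to this group's socket: hard queue first.
  std::optional<Task> pop_local() {
    std::lock_guard<std::mutex> g(m_);
    if (!hard_.empty()) { Task t = hard_.front(); hard_.pop_front(); return t; }
    if (!soft_.empty()) { Task t = soft_.front(); soft_.pop_front(); return t; }
    return std::nullopt;
  }
  // Called by a worker from another socket: only the soft queue is fair game.
  std::optional<Task> steal() {
    std::lock_guard<std::mutex> g(m_);
    if (!soft_.empty()) { Task t = soft_.front(); soft_.pop_front(); return t; }
    return std::nullopt;   // hard queue is off-limits to remote workers
  }
 private:
  std::mutex m_;
  std::deque<Task> soft_, hard_;
};
```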
And then there is going to be, I wouldn't call it a dispatcher, but a separate watchdog thread that hovers above everything else, wakes up every so often, looks at the status of these thread pools and thread groups, and can decide whether one is over-utilized or under-utilized and switch the balance of resources on the fly as needed. A lot of this I've already said. So again, every thread group has the soft and hard priority queues, and that determines whether threads running in different NUMA regions can steal from them. And within each group, we're going to have four different pools. The working pool is the threads that are actively executing some task. The inactive pool is the threads that are blocked on some futex inside the kernel, where we know they can't execute anything; they can't do anything until something wakes them up, because they're blocked on something inside the database, like a lock on a table. The free pool is basically threads in a busy loop: they sleep for a little bit, wake up, see whether there's something to do in the two queues, and if not, go back to sleep; they're spinning in user land. And the parked pool is like the free pool in that it's threads looking for work to do, but rather than busy looping and waking up over and over again, they're blocked on a futex, so they get descheduled in the kernel and just sleep down there, and then they can be woken up later by the watchdog thread. The reason you want something like this is that it's way cheaper than having to spawn a new thread: you just flip a condition variable and say, hey, wake up, now there's stuff to do. That's cheaper than spawning, where the new thread then has to maybe copy some memory and jump around and figure out what to do next. So think of this as an alternative to having to spawn threads. All right, so now, depending on whether the database system is CPU-bound or memory-bound, we can adjust what threads are running, and where, based on what's needed, right? If we find that our database system is entirely CPU-bound rather than memory-bound, then we're okay with paying the penalty of copying things across NUMA regions, because we need the computational resources anyway, right? But as I said, I've already sort of spoiled this. In the paper, they say that in their experiments, when you look at really large machines, like 64 sockets, and they have some customers running at that scale, work stealing is just not as beneficial as making sure everyone only operates on their own NUMA region. Because that cross-region traffic over the interconnect between sockets is super expensive, and you're better off just spinning, waiting for things to free up. So SAP would have these mini sessions before SIGMOD, where they had a bunch of database researchers come and see presentations, talk to their researchers about what they were building in HANA, and they would have HANA customers come in and talk about how they were using HANA. And I don't remember which company it was, but they were running on some box, this is like 2018, 2019, where they were running out of address space, because it's an in-memory database. And in x86, you get 48-bit addresses.
Even though you get a 64-bit pointer, Intel only uses 48 bits, and their database was running out of space in a 48-bit address space. It was that large. And they had a ton of sockets. These are pretty massive machines, not anything you can get on Amazon.

Right. So these thread groups allow our tasks to run alongside the other parts of the system. As I said, the background maintenance things: Postgres has the log collector, the autovacuum, these kinds of operational things. For OLAP systems it's less of an issue, because we assume it's read-only; we're not doing compaction like in the Snowflake paper, we're just reading stuff and ripping through it, and we don't have all these additional networking threads or dispatchers to schedule.

So let's look at an example of how this works. Same query as before. We have a bunch of tasks, we know the dependencies between the pipelines, and for this example we also have two garbage-collection tasks on table B. For the first tasks we want to execute, we just put all of them in the soft queue, and then the threads can wake up and take things from there. For the garbage-collection work, we want it to run only on local memory within our NUMA region, so we put those tasks in the hard queue, and this prevents anybody from stealing them. Now our working threads can wake up, pull things out of the soft queue, and start running. The other threads here, the inactive ones, are waiting for some latch or mutex that's blocking them from executing anything, so they can't do anything. That's less of an issue for OLAP systems because we're not doing transactions; actually, a latch is a bad example because you would spin on that. But if I want to get a lock on a table, I can't run until something wakes me up and says I can, so I'm stuck. Free is the pool doing user-space busy loops, looking for stuff to do, and parked is where we're down in the kernel sleeping. So a free thread wakes up, looks in these queues, and say this first one finds something in the hard queue; it moves into the working pool and executes. (See the sketch below for the order in which a worker checks the queues.)

Question or no? So what does this look like? An operating system, right? Because again, we don't want to let the operating system figure this out, and I'm not picking on Linux, but the OS is not going to know the dependencies, that this thing is waiting on some kind of logical construct, a logical lock. It's not going to know that, based on the queries showing up in my queue, I might want to scale down the number of working threads within one socket and increase it on another. Yes? Your statement is that Linux has the ability to observe and control what threads are doing, so what's the advantage of writing this all ourselves versus letting Linux do it? Because Linux doesn't know what the queries are, doesn't know the tasks ahead of time. It doesn't know that this task is going to execute, read this data from this region, and write it in that location. We can know that because it's SQL and it's declarative; we know exactly what's coming. So we're in a better position to make the right decisions. Now, would I go this far and have all these different pools exactly as they do? Maybe not. But they argue you would. Yes?
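As a rough illustration of the walkthrough above, here is what the pull order might look like for a worker that just woke up from the free pool: local hard queue first, then local soft queue, then steal from other sockets' soft queues only. This is a hypothetical, deliberately single-threaded continuation of the earlier sketch (no latching shown), with made-up names, not HANA's real interface.

```cpp
#include <deque>
#include <functional>
#include <optional>
#include <vector>

// Hypothetical, simplified sketch of a worker's pull order.
using Task = std::function<void()>;
struct Group { std::deque<Task> hard_queue, soft_queue; };  // one per socket

std::optional<Task> next_task(std::vector<Group>& groups, std::size_t my_socket) {
    auto take = [](std::deque<Task>& q) -> std::optional<Task> {
        if (q.empty()) return std::nullopt;
        Task t = std::move(q.front());
        q.pop_front();
        return t;
    };
    // 1. Local hard queue: NUMA-pinned work like GC on this socket's memory.
    if (auto t = take(groups[my_socket].hard_queue)) return t;
    // 2. Local soft queue: query tasks that prefer this socket.
    if (auto t = take(groups[my_socket].soft_queue)) return t;
    // 3. Steal from other sockets' soft queues only (never their hard queues).
    for (std::size_t s = 0; s < groups.size(); s++) {
        if (s == my_socket) continue;
        if (auto t = take(groups[s].soft_queue)) return t;
    }
    return std::nullopt;  // nothing to do: go back to the free or parked pool
}
```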
(Student comment, mostly garbled on the recording; the gist is that you shouldn't need all of this, and a thread manager sitting on top of ordinary OS threads that hands out work should be enough.) So you're saying you have a thread manager that hands work to the threads, versus putting them to sleep yourself and managing all of that. Okay. That's more of a statement than a question, right? And yeah, I actually think this might be overkill. I don't know whether they actually implemented it in the newer version of HANA or not; as I said, this was somebody's PhD dissertation. I've been showing it as an alternative to Hyper's morsel stuff. Maybe think of it this way: there's Postgres, where we just let the OS do whatever it wants. Then there's this HANA thing, with really fine-grained control over who goes to sleep when, when to wake up, and which threads to shut down or spin up on the different sockets. Those are the two extremes. The morsels approach, I would say, is somewhere in the middle.

(Student question: is this data structure shared? Shared within a socket?) Yeah, like I said, there are eight threads per socket, and one thread group is roughly one socket. (Student question: where do the thread groups live, are they per NUMA node?) Correct, this is local to the socket. Think of one thread group as equal to one socket. Every thread group has these two queues, and another thread group can go poke in there and look at them. And that's why, again, they were saying the work stealing didn't work out: the cost of crossing the interconnect and the cache invalidation on these shared data structures becomes a problem.

OK. So in the last five minutes, I want to talk about SQLOS. So what's that? What's wrong? This thing's awesome. This is one of the best parts of SQL Server; it's one of my favorite parts. Yeah? It's also what? I disagree, OK. All right, so SQLOS is a NUMA-aware, user-space abstraction layer above the operating system that runs inside SQL Server and manages all the access to the hardware resources: CPU, disk, and memory. You're writing an OS? Of course, yeah. We don't want to use the operating system. Right? The reason they built this was that every single time new hardware came out, and I think NUMA in particular was the catalyst, hardware with maybe slightly different properties than the previous generation, Microsoft found themselves having to go back and change the physical operator implementations in the database system over and over again. And they said, OK, is there a way to abstract away the operating system so that we don't have to do that every single time? So if you want to start running on GPUs or FPGAs, you don't have to go rewrite everything all over again. The idea is to have a single interface that supports parallel operations, is aware of the location of data, and abstracts the low-level details of where things are actually located away from the implementation of the operator itself.
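To give a flavor of what an abstraction layer like this looks like, here is a hedged C++ sketch: a small interface that the execution engine codes against, with the platform-specific details hidden behind it. The interface, names, and functions here are entirely made up for illustration; SQLOS's real surface is much larger (schedulers, memory nodes, I/O, synchronization objects).

```cpp
#include <cstddef>
#include <functional>

// Hypothetical, much-simplified stand-in for an SQLOS-style layer.
class HostOS {
public:
    virtual ~HostOS() = default;
    // NUMA-aware memory: operators ask for memory near a given node instead
    // of calling the platform allocator directly.
    virtual void* alloc_on_node(std::size_t bytes, int numa_node) = 0;
    virtual void  free_mem(void* p) = 0;
    // Cooperative scheduling: a task yields here instead of being preempted.
    virtual void  yield_current_task() = 0;
    // Task submission with a placement hint, so the layer picks the core.
    virtual void  submit(std::function<void()> task, int preferred_node) = 0;
};

// Operators are written against HostOS only. Porting to a new platform
// (Windows -> Linux, or new hardware) means writing a new HostOS subclass,
// not touching the operators themselves.
void hash_build_example(HostOS& os, int node, std::size_t nbytes) {
    void* table = os.alloc_on_node(nbytes, node);
    // ... build the hash table in 'table' ...
    os.yield_current_task();   // be a good citizen at a pipeline boundary
    os.free_mem(table);
}
```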
The other cool thing they did with this is that they switched to non-preemptive thread scheduling, meaning they're basically using coroutines before languages supported these things directly. They use cooperative scheduling, so that if a thread doesn't have any work to do, or it's blocked on something, it yields back to the SQLOS scheduler, which then decides what's the next thing to run.

So what's really awesome about this: SQLOS came out in 2006. In 2017, Microsoft announced that SQL Server now runs on Linux; I'll say why that matters. (Student: isn't that sad?) Why would this be sad? No, no, hold on. They didn't rip out SQLOS. SQLOS is what made it possible to port SQL Server to Linux fairly easily without rewriting everything. All the Win32 stuff that was specific to Windows is hidden below SQLOS; the upper parts of the system don't care what the system call to read data is. SQLOS hid all of that, so all they had to do was replace the Windows-specific code inside SQLOS with Linux-specific code, and they had their Linux port. This is fascinating, this is amazing, this is a triumph, right? They did the port in 2017, but SQLOS was written in 2006. They added SQLOS to solve one problem, and it ended up solving a completely different, major problem years later. As I said before, I think in the Q&A session at the intro class, someone asked why I think SQL Server is amazing. I think it's this plus the Freud stuff, although Sam has some problems with that, but we'll cover a bunch of the things SQL Server does that I think are really interesting.

So the way they do coroutines and scheduling is with these four-millisecond quanta. But because we're doing non-preemptive thread scheduling, the SQLOS layer can't enforce that; it can't come in and say, stop running, I'm taking back your core and giving it to another thread. That's what the operating system does, but we can't put that interrupt in. So the way this has to work is that we have to modify the database system itself, the actual query execution engine, to introduce these barriers or checkpoints where we decide: have I run long enough? If yes, yield back to the scheduler so it can run something else.

Say I have a really simple query here: SELECT * FROM R WHERE R.val equals something. An approximate query plan would just iterate over R, apply my predicate, and emit the matches. Nothing fancy. With this non-preemptive scheduling, you basically keep track of the timestamp of when you started and the last time you checked, and if the amount of time you've run inside this for loop has gone past your allowed quantum, like four milliseconds, you yield back to the scheduler so it can run something else. Yes? (Student question: is this just for illustration?) Yeah, this is for illustration purposes; you wouldn't actually do it this way, I hope, because calling the clock on every iteration is not cheap. So again, we're doing four-millisecond quanta, and the idea is that, in my example it's based on time, but say I also try to get a lock on something and I can't. At that point I just yield back to the SQLOS layer and say, oh, I'm waiting for this lock.
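Here is a minimal sketch of what such a checkpoint could look like inside a scan loop, in C++. It is only for illustration, as noted above: `yield_to_scheduler()` is a hypothetical hook standing in for the real user-space scheduler's yield (the fallback to `std::this_thread::yield()` is just so the sketch runs), and the clock check is amortized over batches of tuples rather than done on every iteration.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

using Clock = std::chrono::steady_clock;
constexpr std::chrono::milliseconds kQuantum{4};   // SQLOS-style 4 ms quantum

// Stand-in for handing control back to a user-space scheduler; in a real
// system this would switch to the scheduler's context, not call the OS.
inline void yield_to_scheduler() { std::this_thread::yield(); }

// Roughly: SELECT * FROM R WHERE R.val = target, emitting matching positions.
template <typename EmitFn>
void scan_filter(const std::vector<int64_t>& R, int64_t target, EmitFn emit) {
    auto last_yield = Clock::now();
    for (std::size_t i = 0; i < R.size(); i++) {
        if (R[i] == target) emit(i);
        // Cooperative checkpoint, amortized over batches of 4096 tuples so we
        // are not reading the clock on every single tuple.
        if ((i & 0xFFF) == 0 && Clock::now() - last_yield >= kQuantum) {
            yield_to_scheduler();        // let another task run on this core
            last_yield = Clock::now();
        }
    }
}
```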
Don't schedule me again until I get it. Again, that's a high-level construct, because we don't want to use mutexes for our locks; it's a high-level logical lock or protection mechanism that the operating system doesn't know about. So instead of the OS waking us up, us discovering we still don't have the thing we need, and going back to sleep, we know the SQLOS scheduler simply won't schedule us.

(Student question: why not use interrupts, and then have extra code to handle them?) The problem is that the interrupt shows up wherever you happen to be in the code. If I'm holding a latch on something, I get interrupted and swapped out, but I still hold the latch. I'd say handling that is way harder than this. Say I'm scanning a B+tree and I get the interrupt: now I have to figure out which latches I hold in the B+tree, unlatch them, and then pop back out. I think that's way harder than this approach. (Student: aren't you taking the latch anyway?) Nope. Latches are a compare-and-swap; a lock is a high-level concept, a latch is a low-level data-structure thing.

So I only know four systems that do something like this. SQL Server with SQLOS is probably the most famous one. We're over time, sorry, let me finish up. ScyllaDB gave a talk a few years ago; they have their own scheduler in this framework called Seastar, which is probably the most sophisticated open-source implementation of this I've seen. FoundationDB does a poor man's version: basically, any time you read something from disk, you yield back; that's the only place they do this. It's reminiscent of Windows 95 non-preemptive scheduling from the 90s. And then CoroBase is an academic system out of Simon Fraser, designed entirely around coroutines. Okay.

So in the interest of time, let me jump to the end. Flow control is pretty obvious: a bunch of queries show up, you get overwhelmed, how do you prevent that? The easy thing to do is just crash, but that's obviously bad. The two approaches are admission control, where when requests show up and you don't want to take any more, you abort them, you deny the request; and throttling, where you compute the result for a user but delay the response, because you don't want them immediately coming back and sending another query. Typically you want some combination of these two things, but throttling only helps when you know someone's running in a tight loop. So typically you just do the first one: you deny new queries as they show up. The Umbra paper says you put things in a queue on the side, but again, those network connections aren't cheap; they maintain state, and depending on how you do transactions, you may have to check every single connection and look at its read/write sets. So having a lot of open network connections can be a bad idea, which Postgres suffers from, but typically the first option, admission control, is the way to go.

All right, so just to finish up. Today we ignored I/O scheduling, but as I said, when you need to do I/O, you go ahead and dispatch it asynchronously; we can talk about io_uring and asynchronous I/O later on. The main thing I want you to take away from this discussion is that, again, the database system is super important, and we don't want the operating system to do anything for us, as much as possible.
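As a tiny illustration of the admission-control side of flow control, here is a hedged C++ sketch of a gate that rejects new queries once too many are already running. The class name and the fixed limit are made up; real systems base the decision on memory pressure, queue depth, or per-tenant quotas rather than a single counter.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical admission-control gate; the limit and policy are illustrative only.
class AdmissionGate {
    std::atomic<int32_t> active_{0};
    const int32_t max_active_;
public:
    explicit AdmissionGate(int32_t max_active) : max_active_(max_active) {}

    // Returns false to reject (deny the request) instead of queueing or crashing.
    bool try_admit() {
        int32_t cur = active_.load();
        while (cur < max_active_) {
            if (active_.compare_exchange_weak(cur, cur + 1)) return true;
        }
        return false;   // caller sends an error back to the client
    }

    void release() { active_.fetch_sub(1); }   // call when the query finishes
};
```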
So that means we want to do all our memory allocations, all our task scheduling, everything ourselves. Don't let the operating system touch it, because it's going to cause problems and make our lives terrible. Next class we'll talk about vectorized execution. There's another paper from the Germans; you can skip the GIS stuff at the end, just focus on how they're actually doing the scans and the hash table stuff, that's more important. OK? That's my all-time favorite jam. What is it? Yes. It's the SD cricket IDES. I make a mess unless I can do it like a Gio. Ice cube with the G to the E to the T. Now here comes Duke. I play the game where there's no roots. Homies on the cusp of your mafukas, I drink brew. Put the bus a cap on the eyes, bro. Bushwick on the go with a blow to the eyes. Here I come. Willie D. That's me. 12-pack case on the phone. 6-pack 40 act against the real promise. I drink brew. But yo, I drink it by the 12 hours. They say bill makes you fat. But saying eyes is straight, so it really don't matter.