Carnegie Mellon University's Advanced Database Systems course is filmed in front of a live studio audience. It's two o'clock now, but we've covered the Bristol scale, we've covered diarrhea, we've covered, what else? Behavioral interviews. Okay, let's talk about databases.

All right, so today we're going to talk about how we take the query plans we've been given and actually start running them on our system. Recall that the last couple of lectures focused on how to build an optimized execution engine so that we can run sequential scan queries as fast as possible. The two major camps are the vectorization people using SIMD, and then the query compilation stuff we talked about last class. In the case of query compilation, there were two high-level approaches. There's transpilation, or source-to-source compilation, where your system emits C++ code that then gets compiled. The alternative, from the HyPer paper you guys read, is compilation by generating a low-level IR for the actual instructions you want to execute for that query plan, and then using something like AsmJit or LLVM to compile it. And the main takeaway, at least from the seminal papers on vectorization and compilation that have come out over the last decade, is that most of the systems we're going to read about near the end of the semester choose vectorization with SIMD, oftentimes a combination of auto-vectorization and intrinsics, over compilation, just because the engineering overhead to build and maintain a JIT-compiled database engine is super high. We'll read this in the Photon paper from Databricks. They explicitly call out that you're better off having a bunch of people optimize the SIMD stuff, because you can reach parity with the compilation implementation, whereas if you go down the JIT path there's a much smaller pool of people who can actually work on it.

So today we're talking about scheduling: how do you take a query plan and divide it up amongst the different workers in our system? Just to make sure we're using the right terminology: we're going to say that a query plan is a DAG of relational operators. An operator instance is an invocation of that operator on some portion of the data. So if we're trying to read a table that's broken up into row groups, an operator instance would be responsible for scanning a single row group and processing it. A task is some computational piece of work that contains multiple operator instances, typically in the same pipeline, that we hand out to our workers to execute. And a task set, sometimes called a resource set in some of the papers, is the collection of tasks for a given query that we need to execute. The idea is that we know where the pipeline breakers are, because we're the database system, we're the one building the query plan. So we convert these pipelines into individual tasks that we can then farm out and execute. And so today's class is really about figuring out how we assign these tasks to workers in our system.
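To make the terminology concrete, here's a minimal C++ sketch of how these terms might map onto types. All of the names are hypothetical, not taken from any real system:

```cpp
// Hypothetical types for the terminology above: a pipeline of physical
// operators is broken into tasks, each containing one or more operator
// instances bound to a chunk (row group / morsel) of the table.
#include <cstdint>
#include <vector>

struct OperatorInstance {
    int operator_id;    // which physical operator in the pipeline
    uint64_t chunk_id;  // which row group / morsel it processes
};

struct Task {
    // Operator instances from the same pipeline, fused so one worker can
    // run them back-to-back over the same chunk without any handoff.
    std::vector<OperatorInstance> instances;
};

struct TaskSet {
    // All the tasks between two pipeline breakers for one query; this is
    // what the scheduler hands out ("resource set" in some papers).
    std::vector<Task> tasks;
};
```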
I'm using the term worker loosely and generically, but you can think of it as either a core, a thread, a process, or even a node. It doesn't matter. And then we're keeping track of where the data they need to access is coming from, and where any intermediate results they generate are going to go.

So this is basically what I said. The idea of the scheduler in our system is that we want to know, for a given query, how many tasks we should use, because we want to take advantage of all the parallel cores we have available to us. There's also the parallelism within SIMD, but that's usually below the level we're actually scheduling for. So: keeping track of how many tasks we want to use, how many CPU cores we want to farm them out on, and then, when a task generates output that needs to go to the next task, where we should actually store it. Because the task that's going to read it may be remote, it might be better to push the data to where the next task is going to need it. But you might not know what that task is going to be. We'll see this as we go along.

The paper you guys read ahead of class, from HyPer, is about single-node execution, and the three scheduler implementations we're going to look at are also all single-node systems. The reason I'm focusing on this rather than a distributed system is that the problem is basically the same. It doesn't matter whether it's a single node that's multi-threaded or multiple nodes that are each single-threaded; at a high level it's the same problem we're trying to solve: what tasks do we want to run where, and where should their output results go? And the main takeaway is that we always want to do this scheduling ourselves, especially on a single node, and not leave it to the operating system. In the HyPer paper, I think they contrast with Postgres: Postgres just lets the OS do all the scheduling because it forks full processes, and I don't think they even play games with process priorities and so forth. But with the exception of Postgres, every database system is going to want to do all the scheduling itself. So we can talk about how to do it on a single node, and then you'll see how that maps to a distributed environment. There'll be some things you can do in a distributed environment that we're not going to cover today, but we'll see them later on. Like in BigQuery's Dremel, they do a shuffle stage after every pipeline breaker, and that allows them to reorganize and recalibrate the workers as they go. But again, we'll cover that later.

So what are our goals for building a high-performance scheduler for a database system? Obviously we want to maximize throughput: we want to process as many queries as possible and keep the system always running, always consuming input and producing output. We also want to maintain some notion of fairness. What fairness means from one query to another is subjective, but at a high level, at the end of the day we need to make sure that no query gets starved for resources. So even though we may lower a query's priority, and we'll talk about what that means and why we want to do that as we go along.
But even if your long-running query gets a lower priority, at the end of the day we still want it to complete. The flip side is that we want the system to seem responsive, meaning reducing the tail latency, like the P99 latency of queries, if we can. This matters a lot for short queries. The idea is that we want short queries to complete as fast as possible, because that's something someone will notice. At short timescales of query execution, if your query goes from 100 milliseconds to 1,000 milliseconds, you'll notice. But if your query runs for 10 minutes and it takes 20 seconds longer, no one's going to notice. So the system appears more responsive if you can get the shorter queries out more quickly. In the case of the morsel stuff, they don't have a specific way to handle that; they're sort of treating every query the same. There was a QoS thing they suggested they might do in a subsequent paper. In the Umbra one? It was in the HyPer paper, the one that we read; there was something in the conclusion saying they would like to do it. I don't know whether they ever did it in HyPer, but they do it in Umbra. We'll cover that in a second. The morsel one sets the foundation for this idea of how to divide up the work, with a static one-to-one assignment of morsels to tasks, and we'll see in Umbra how they break that. The Umbra one is more sophisticated. And the last goal, of course, is that we want our scheduler to have low overhead. It doesn't help us to run some super complex computation to find the optimal schedule for all our tasks if computing that schedule takes 20 minutes while the query itself would finish in a few milliseconds. We want the system to spend most of its time computing queries, because at the end of the day that's what people really care about.

So I'm going to start with some quick background, a reminder of things we talked about in the intro class. We'll first talk about what a worker actually is: how are we defining the scope of a computation, and where is it actually located in our system? We'll briefly talk about what data placement means in the context of partitioning. That's basically: if I've already divided the data up in some way, where should I put it? The two are linked together, but we'll discuss that. That's one of the things the morsel paper spent a lot of time on, this distinction between local memory and remote memory in a NUMA architecture; they were trying to schedule things so that you were always processing data that was local to you. The same idea applies in a distributed system: ideally we want our workers on a node processing data that's local to it rather than having to go over the network to some far-away storage. Then we'll talk about three implementations of schedulers: the morsel-driven one from HyPer, the follow-up in Umbra, and then an alternative from the SAP HANA guys. And we'll finish off by putting it all in the context of a distributed architecture. Okay? And so what will be interesting about the scheduler implementations...
We'll see that Umbra and HyPer have sort of dedicated worker pools that are just crunching through all the tasks as fast as they can, whereas the HANA one tries to be a bit more sophisticated and has this notion that some worker threads can be asleep and some worker threads can be parked. We'll also see this trade-off between work stealing and not work stealing, which is another dynamic we have to consider as well.

All right, so this is just a reminder from the intro class that there's this notion of a process model in any database system, and it specifies what a worker is actually going to be. The early database systems from the 1980s and early 1990s were process-based systems, where every worker was a separate OS process, because back then they didn't have pthreads like we have now; threads weren't really portable. If you wanted to support one Unix versus another Unix, the POSIX API specified how to fork to spawn processes, but the platform maybe didn't have threading. Every modern system today is going to be multi-threaded, so we'll assume the system we're conceptually building is multi-threaded. The notable one that is not multi-threaded is Postgres, because Postgres uses a process per worker. A worker is this generic term meaning a computational resource that can be assigned a task to execute for some query, or for something internal to the database system; it takes some data in, crunches on it in our operator instances, and produces output. And as I said, for our purposes going forward, you can assume every system is multi-threaded unless it's a fork of Postgres.

For some reason, in the early days when I first started at CMU, we took Postgres and decided to make it multi-threaded instead of multi-process. I forget why we did that. But the interesting thing is, if you ever look at the Postgres code, there's a bunch of #defines for the different OSes they support, like Linux, Windows, HP-UX, and BSD. We ended up using the Win32 code, and that was at least a starting point for us to convert everything to pthreads. We also converted it to C++11, and I don't know why we did that one either. Yeah, we did it, we shouldn't have done it, but whatever. Yes? The question is: you said Postgres is single-threaded, but I thought you also said it has processes. So Postgres is a process per worker; a worker is an entire OS process. You can do some parallel execution of queries, but that's going to cross multiple processes, and they use shared memory to communicate. But no, you wouldn't build a system like that today. Yes? The question is, if all modern systems are multi-threaded, why was it a bad idea for us to try to do that in Postgres? Looking back on it now, I fail to see what the research contribution would have been. We had this execution engine written in C++ that was faster than Postgres, which is not always that hard to do, and rather than sliding it in as an extension using the extension hooks that Postgres supports, which Timescale does and Citus does, Cfowl is another one, there's a bunch of these systems that use extension hooks to get OLAP engines inside of Postgres, we decided to fork everything.
And then we kept the top half of Postgres and rewrote the bottom half, and eventually we decided to just scrap the whole top and rewrite everything, because everything was sort of slow for what we wanted to do. But again, if I had to do it all over again, I would have just used extensions. Yes? The question is, does the process model also determine the NUMA topology stuff, or is it just whether you have a thread per worker versus a process per worker? It's just the latter: what is a worker, is it a thread, is it a process, is it a process pool? The NUMA stuff is separate.

Okay. So the other thing we can account for is how we want to assign workers to CPU cores. The two basic approaches are: you could have a single dedicated worker be the only thing that can run on a given CPU core. This prevents contention on that core, where two worker threads try to run at the same time on the same core and thrash each other's caches and so forth. Yes? The question is, is one worker working on one task, one entire task set, or one partition of the tasks? We're not there yet, but it's going to be one task. One task, yes. Think of it like this: to scan a table that has 10 chunks or morsels, you'd have one worker for each of those 10 morsels.

The other approach is to have multiple workers per CPU core. The idea here is that when one worker thread gets stalled, because the thing it needs is out on disk and has to be fetched into memory, or even on a low-level cache miss, you can have other threads run in the meantime. For maximum performance, though, the dedicated approach is probably the right way to go. And you actually also want to turn off hyperthreading and just run bare-metal hardware threads, because we're going to be CPU-bound in most of the computations we do in this system, and we don't want any contention on the actual hardware itself. There are other advantages to the multiple-workers approach for things like transaction processing, where you don't want any stalls, but for both OLAP and OLTP the dedicated approach can be the better way to go. The HANA guys are going to claim their approach is better because they're going to do fine-grained control of which threads are actually awake and running at any given time. But again, we'll cover that later. And they claim that's better for a machine with a lot of cores. I'm sorry, a lot of sockets.

Yes? You're saying the core should never be stalled? Yes. But why wouldn't you want to keep the core busy with another thread? So, if you're compute-bound and you're careful about what you're putting in your CPU caches and you're prefetching things ahead of time, a core should never be stalled. Again, we saw this with the branchless stuff.
Like, if you designed the system in such a way that you avoid branch mispredictions and having to flush the CPU's pipeline, then you should just be crunching through data as fast as possible, and the core should never be stalled or have branch misses. Yes? Per CPU core? Per core. But within that, you turn off hyperthreading so it's one hardware thread. A socket can have four cores, and each core can have two threads because of hyperthreading. So you turn off hyperthreading, and now one core equals one hardware thread. So avoiding contention is more valuable? Yes. You're running basically on bare metal; you own the whole system.

The question is: wouldn't you still be able to run something small, like a cron job, on the spare hardware thread? What is that cron job? What is it for? Like, you have a server running, and the database is on the server, and you don't want the database dominating the CPU. Sure, you can have that: the database can own all the real threads that correspond to real cores, and maybe the background tasks can run on an underutilized core. What are these background tasks, like garbage collection and stats collection? Oh, random stuff? Why would you run them on your database server? Turn all of that off. You don't want any of that.

The core of the question is: is there an advantage to turning hyperthreading off? Yeah, you don't have two threads contending for the same hardware resources. I do have a graph. The comment is: my impression was that with hyperthreading on, the other logical thread spawned from the same core only uses the unutilized part of the core. No. What is the unutilized part of the core? You have a pipeline of instructions, and when one gets stalled, the idea is that you swap in the registers for the other logical thread, the hyperthread, and it picks up where you left off. Again, if you're CPU-bound, there aren't going to be stalls like that.

So, just to summarize: we don't want contention because when we have two threads sharing the same hardware with hyperthreading, there will be more cache misses? Yes. His statement is: if there's more contention, more things running on a single core, say with two hyperthreads, then they're both trying to do work, they're both bringing things into the caches, and that's going to pollute them, whereas we're better off just letting one thread run to completion. For your desktop, sure, leave hyperthreading on, because you're browsing the web, listening to music, watching videos; there's a ton of stalls like that, and who cares? But your database server isn't running random Bitcoin mining, and there are no cron jobs. That is, if you really care about your database system. If it's a blog, and you're running MySQL or Postgres just to service the blog, who cares? Sure. But if it's a high-end database system, you're not letting anybody just SSH into it and run random stuff.
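To make the one-worker-per-core idea concrete, here's a minimal sketch of pinning worker threads to cores on Linux using pthread_setaffinity_np, which is a real but non-portable call. This is my illustration, not code from any of the systems discussed, and it assumes hyperthreading is disabled so logical CPU IDs map one-to-one to physical cores:

```cpp
// Sketch: one worker thread pinned to each core (Linux-specific).
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <pthread.h>
#include <sched.h>
#include <thread>
#include <vector>

void pin_to_core(std::thread& t, int core_id) {
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(core_id, &cpuset);
    // Restrict the thread to exactly one core so two workers never
    // contend for the same caches and execution units.
    pthread_setaffinity_np(t.native_handle(), sizeof(cpu_set_t), &cpuset);
}

int main() {
    unsigned n = std::thread::hardware_concurrency();
    std::vector<std::thread> workers;
    for (unsigned core = 0; core < n; core++) {
        workers.emplace_back([] { /* worker loop: pull a task, run it, repeat */ });
        pin_to_core(workers.back(), static_cast<int>(core));
    }
    for (auto& w : workers) w.join();
}
```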
So this is an older experiment that students did here at CMU with a toy in-memory engine; it's comparing letting the database system decide where it places data in a NUMA architecture versus letting the OS decide. Does everyone know what NUMA is? Who doesn't know what NUMA is? Perfect. Excellent. So the idea is: do you let the database system figure out that this piece of data should go at this core, this location, this NUMA region, or do you let the OS figure it out for you? And what you see here is that, before you get to the hyperthreading line, you get better, more scalable performance when the database system controls exactly where the data is located, because any data that's local to a worker doesn't have to cross the interconnect, which in some cases is 2x the latency. That's where the benefit comes from.

Now, to your earlier point, why not leave hyperthreading on? This is where hyperthreading kicks in, and now you see completely flat lines for either one, because it's a CPU-bound computation. In the cases where a thread has to go to memory to fill its caches, well, if I'm running another hyperthread, it's going to do the exact same thing, so I'm not getting any benefit. I'm throwing more threads at it, but performance has plateaued. Yes? The question is, at some point you're fetching data from memory, right? All of these are fetching from memory. But fetching from memory is so expensive that if I'm stalled fetching from memory and you start running, what are you going to do? Fetch from memory. You do the exact same thing I do. So now you're blocked on the bandwidth of getting things from the DIMMs to the CPU, plus you're polluting my cache. Furthermore, and you'd have to ask the architecture people about this, I don't know how well the hardware prefetching is going to do, because I would have been better off having one thread on one chunk of data versus two threads starting in different spots and both trying to prefetch. The prefetcher could probably handle that, but it's just making things more complicated, whereas if you keep the system simpler, we get better utilization of what we have. Okay, so: hyperthreading, nice in general, but not for databases.

All right, so the next thing we have to consider is how we actually get our tasks to our workers. There are basically two approaches: push versus pull. In the push approach, there's some kind of centralized dispatcher or scheduler component that has a global view of the workers under its purview or administrative control; it knows what tasks they're doing. As new tasks arrive, it pushes those task requests to the different workers so they always have something to do. And when a worker notifies it that a batch is finished, it's immediately given the next thing to do.
The pull-based approach, which is going to be the better approach that everyone uses, is that there's some scheduler component maintaining a queue of all the possible tasks that could be executed at any given time, with additional metadata about, say, what data they're trying to access and where it's located. Now, when workers need something to do, they come to this queue and grab the next thing. This is just easier because there's less coordination, less maintaining of state about where each worker is in its computation. It just says: here's a bunch of stuff to do, it's a la carte, and people can come pull things off the buffet tray when they're ready for them.

Yes? The question is, which one has lower overhead? Overhead for what? His point is that if you're a worker that has to go pull from the queue, you're eating up CPU cycles doing that. So his statement has two parts. First, in the pull-based approach, every worker thread has to figure out on its own what it should be doing next, and won't that incur a penalty when it could be running queries? Sure. And the second aspect is that this queue, which we'll see in the HyPer paper, is a global data structure that you now have to protect with latches or locks, and everybody could potentially be contended on it. Even so, everyone's still going to choose pull, because you can build the scheduler as a separate service without exact control over, or exact knowledge of, what every single worker is doing. Because a worker may die, and then you've got to figure out: did I tell it to do a bunch of stuff ahead of time that it now can't do? Whereas with pull you just say, here's everything that needs to get done, and each worker thread figures out on its own what's the best thing for it to do. They're all working together to solve the problem globally.

Also, isn't the pull-based approach less intelligent? He says it's less intelligent, but relative to what? To the push-based one, because you can have priorities there. You can have priorities in the pull-based one too. Just because HyPer doesn't do it doesn't mean you can't; we'll see it in a second. You mentioned workers dying in the push-based approach; how is that not also an issue in the pull-based approach? Yeah, he's right: if workers die, you still have to deal with it in the pull-based one too. You basically need a heartbeat to figure out that they didn't come back. Another way to think about the pull-based approach is that the logic to figure out what task a worker should execute next is distributed across the workers themselves, whereas with a single centralized service it's one beefy box that has to figure it all out. That's probably the main distinction. Yes? The question is, in the pull-based approach, does the scheduler need to keep track of how long each task is going to take? HyPer doesn't do that. We'll see it in Umbra.
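Here's a minimal sketch of the pull-based approach, assuming a simple shared queue protected by a latch; the types are hypothetical, and this is not HyPer's actual lock-free structure, just the shape of the idea:

```cpp
// Pull-based task assignment: idle workers take the next task themselves.
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>

class TaskQueue {
    std::mutex latch_;             // the global contention point discussed above
    std::condition_variable cv_;
    std::queue<std::function<void()>> tasks_;
public:
    void push(std::function<void()> task) {
        {
            std::lock_guard<std::mutex> g(latch_);
            tasks_.push(std::move(task));
        }
        cv_.notify_one();
    }
    std::function<void()> pull() {
        std::unique_lock<std::mutex> g(latch_);
        cv_.wait(g, [&] { return !tasks_.empty(); });  // block when idle
        auto t = std::move(tasks_.front());
        tasks_.pop();
        return t;
    }
};

// Each worker loops forever: pull the next task, run it.
void worker_loop(TaskQueue& q) {
    for (;;) q.pull()();
}
```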
The question is, can you use the same cost model as the query optimizer? The challenge is that in some systems you can't map the cost estimate to wall-clock time. If you've ever looked at Postgres, the cost is just some number: a combination of whether it's a sequential scan or a random scan, how much data it's going to read, and so forth. Maybe people have tried, but you can't just take that number and say, oh, that's 10 seconds or something like that. Now, the high-end enterprise systems do try to give you cost estimates: here's the relative cost of the query, which is an internal value you can use to compare different query plans within the same system, and they can also spit out, I think this will take this amount of time. So you could do that. But, as we'll see in a few weeks, cost models are always wildly off. So in the Umbra scheduling paper, which I keep alluding to, they actually watch how long a task takes and use that to get a rough estimate for scheduling purposes.

Okay. So regardless of how we allocate tasks to workers in our system, as we already said, it's important to make sure the data a worker is going to process is, ideally, local to that worker. In the case of the HyPer paper, it's an in-memory database, so local means in the same NUMA region. In a distributed system, especially a shared-disk architecture, the cost of getting data from S3 is basically the same for every worker node, assuming you're in the same data center in the same region. But once we start caching things, which we'll see later in the semester, every compute node can have its own local copy of files read from S3, and now when I assign tasks, I want to make sure a task goes to the node that has a local copy of its data.

Yes? The comment is that this is another reason the pull-based approach seems dumber: the worker just says, give me the next task, and the data needed for that task isn't necessarily on its local storage. Yes. Yes, David? The pull-based one seems dumber, because if the worker is trying to maximize the locality of the data it needs to access, it can't do that in a pull-based approach. Why not? Because it's just popping off the top of the queue. No, you don't have to pop from the top. HyPer does it one way; we'll see other ways in other systems. You can have a priority queue; you don't always have to pop from the top. And if you're doing work stealing, you may recognize that the thing at the head of the queue is not local to you, but you may want to go ahead and run it anyway, because you're available. HyPer does that. I don't think Umbra does, and HANA doesn't. So it's a global queue in that everyone can see it and manipulate it, but it doesn't guarantee FIFO ordering of its elements. It's a priority queue. Sure. In HyPer it's actually a hash table.

Okay. I've already said this: some database systems will have locally attached storage as a cache. Think of spinning up an EC2 node; you can get ones with locally attached NVMe drives that are really, really fast, and you could use that as a local cache.
It's ephemeral, so if the node crashes you don't need to retain anything in there, but while it's available you can use it instead of having to go to S3. In some systems, and Snowflake is probably the most aggressive about this because they don't want to pay Amazon's S3 costs, you can also use other nodes as a nearby cache. So if you know another node is responsible for the data and you're running the task, instead of going to S3 you could go directly to that node and get it. But they actually don't do that, because they don't want to interfere with the other node; it might already be going slow. If you're stealing work that was meant for that node, it's probably because it's slow, so why start making requests to it and make it even slower? That's the logic there. And the NUMA, non-uniform memory access stuff we've already talked about: local versus remote memory.

Okay, partitioning and placement. In the intro class we talked about partitioning, this idea that you take a dataset, pick some set of columns or keys, and divide the data up across different files based on the values of those keys. That lets you spread the data out evenly, ideally, across resources, so that when a query arrives that can run in parallel, each worker has the same amount of work and they're all processing things uniformly. So there's some policy you use to split things up: hash partitioning, round-robin, range partitioning, and so forth. And there's some target objective for why you want to partition things a certain way. For example, I might want to reduce network traffic when I do a join, so maybe I partition my tables on the join keys so joins can always be computed locally rather than having to do a shuffle or a broadcast join. In our world, in a shared-disk lakehouse environment, we're typically going to do round-robin on files, because we're not the ones generating those files. Someone loaded a bunch of Parquet or ORC files into S3, and we're not going to have time to fix them up, repartition them, and rebalance them according to some target objective. Snowflake will do this with what they call micro-partitions, but I think only for their internal data format, meaning if you run insert queries to put data into Snowflake, they can rebalance and repartition it later on. But if it's a bunch of files in S3, you can't rewrite them.

Yes? The question is, if you're doing round-robin partitioning at the file level, what is the database system actually doing? It literally is: file one goes to you, file two goes to him, file three goes to her. That's it. And are the files assumed to be uniform size? Typically, yes. You can obviously imagine degenerate cases where I have a bunch of one-gigabyte files and one one-terabyte file, and that would screw things up. I don't know whether they'd break that one up further.
I mean, the way to handle that case would be to assign the same file to multiple nodes, and then within the file you say row groups zero through five go to this one, six through ten go to that one. You could subdivide further, but I don't think they do that. The comment is that partitioning on row groups makes more sense, since at least those are similar in size. Yes; Parquet sizes row groups based on the number of tuples or on the data size, but they end up roughly the same. Why split based on the file, then? Is there any performance advantage beyond simplicity? It's easy, and you maintain less metadata. If I have five files, I need five entries in my catalog to say where those files are. If I start subdividing within a file, I have to maintain more metadata, which some systems can do. If you're doing range partitioning, you have to keep track of where the ranges are. If it's hash partitioning, you don't need any of that; it's even cheaper: you just say, here's the column, here's the hash key I want to hash on to decide where it goes. Now, if you're doing consistent hashing, which Snowflake does at the file level, you've got to maintain the data structure for consistent hashing, but we'll cover that later.

Partitioning tells you how to split things up; the placement policy determines where those partitions actually go. The simplest thing is: I've got five machines, each gets one file, round robin. You could try to be clever and break things up in more sophisticated ways, but we can ignore that; this is the easiest way to do it. So now in our catalog we record that we have these files, with these sizes and whatever other information we gleaned when they were imported into our system or when we were notified they existed in S3, and we keep track of which worker is responsible for each file. Any query that shows up with a task that wants to process that file, which we'd determine in the optimizer using the catalog, should ideally have that task run on the worker responsible for the file. Whether or not that worker actually has a local cache doesn't matter here; we're just saying that rather than everybody reading everything, we add structure and say, this worker reads this file.
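A minimal sketch of what that round-robin placement plus catalog lookup might look like; all the names here are hypothetical:

```cpp
// Round-robin file placement: file i goes to worker (i mod N). The catalog
// then lets the scheduler route a task to the worker responsible for a file.
#include <cstddef>
#include <string>
#include <unordered_map>

class PlacementPolicy {
    size_t num_workers_;
    size_t next_ = 0;
    std::unordered_map<std::string, size_t> catalog_;  // file -> worker id
public:
    explicit PlacementPolicy(size_t num_workers) : num_workers_(num_workers) {}

    // Called when a new file is imported or discovered in object storage.
    void place(const std::string& file) {
        catalog_[file] = next_;
        next_ = (next_ + 1) % num_workers_;  // one catalog entry per file, nothing more
    }

    // Called by the scheduler to find the preferred worker for a task.
    size_t worker_for(const std::string& file) const {
        return catalog_.at(file);
    }
};
```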
All right, so far we have a task assignment model: whether workers are threads or processes, and whether we do push or pull from the scheduler. We have a data placement policy: for our purposes, in a shared-disk architecture in a modern lakehouse or data lake system, it's at the file level with round-robin distribution. So now we have to figure out how to take a query plan and convert it into something we can execute. I've sort of alluded to this as we've gone along: we know where the pipeline breakers are, and those are going to be the boundaries for our tasks, for the most part. But then how do we take those tasks and run them? If it's an OLTP query, this is super easy, because those queries typically have only one pipeline: do an index scan, maybe a projection, and that's it. There aren't many operators, and there are no dependencies between pipelines because there's only one. That's easy; we just hand it out and let it run as fast as possible.

But for OLAP queries it's more complicated, because there are dependencies between pipelines: we can't have one pipeline start running until the pipelines it depends on have completed and produced whatever intermediate results it needs. So you can't always parallelize them. Yes? The question is, why does the figure on the slide say logical query plan instead of physical query plan? Yeah, it should say physical. The logical query plan just says, scan this, scan this, I want to read this table; it doesn't tell you how to do it. It doesn't tell you how to do a join; it says, join A and B. The physical query plan has the exact algorithms you want to use. So when we convert a query plan into a bunch of tasks, we're doing that on the physical operators, not the logical ones. Thank you, I'll fix that.

All right. The easiest type of scheduling is called static scheduling. This is where the scheduler figures out at the very beginning: I have this query plan, I have these workers, I have this data, and it does a static assignment of tasks to individual workers. The simplest way to think about it is one task per core or per worker, and they all run at the same time. You can still assign tasks to workers based on the data that's local to them, but there's no dynamism, no adapting to the behavior or performance of the workers as they actually process the data.

The problem with this is that some tasks are going to be slower, because the data they're processing just takes longer to run through whatever operators they have. Think of a complicated WHERE clause where one predicate is really fast to compute but can be very selective on some of the data, and the remaining predicates in that WHERE clause are slow to compute. For nine out of the ten files, all the data gets filtered out by the fast predicate, so those tasks run really fast. But the one unlucky worker that has all the data that does satisfy the predicate then has to run the more expensive predicates, and it's going to be way slower than the others. Now all the other workers have to wait until that task finishes before anyone can move on to the next pipeline of the query plan. There's no dynamism in any of these decisions; everything's figured out ahead of time.

What morsel-driven scheduling is designed to solve is exactly that problem: how do we dynamically adjust, on the fly, how we assign tasks to workers, so that when there are unexpected variations in task runtimes, other workers can fill in rather than everyone waiting for the slow straggler? The morsel term comes from the HyPer guys; they were just looking for another word for a chunk of data. They didn't want to use partition, they didn't want to use shard, they couldn't use block; a morsel is meant to be something bigger. But at a high level it's still the same thing.
You can basically think of it as the same thing as a row group, though I think this paper came out before the row-group stuff. So in their architecture, they're going to turn off hyperthreading, and there's an assignment of one morsel per task: a task is responsible for processing one morsel of data. They do pull-based task assignment: there's a global queue they all pull from to figure out what to do next. And they do really simple round-robin data placement. They keep track of which NUMA region each morsel lives in, they annotate each task to say this task executes on this morsel, and then each worker can decide whether it wants to run a task that processes data that's not local to it. All the operators are parallel and NUMA-aware, which we can basically ignore; think of the exchange operators as keeping track of which inputs they're waiting on before they can coalesce things and move on to the next pipeline.

This morsel paper came out in 2014, and it's fairly influential. It's actually what DuckDB does as well; they're very upfront that they basically took this paper and re-implemented it in DuckDB. We had Mark give a guest lecture for us in the spring semester last year, and they said they're doing morsel-driven parallelism. So this is not just HyPer: DuckDB is widely used and it's based on this as well. I meant to look it up, but I think there are a couple of other systems out there using a similar approach.

In HyPer, there's no separate dispatcher thread; every worker is responsible for figuring out the next task it should execute. You can think of it as cooperative scheduling: everyone is working together for the common cause of getting the best performance out of the system, but the logic for picking the next best thing to run is distributed across the workers. In the ideal scenario, a worker looks in the task queue and chooses a task that processes a morsel local to it. If no local tasks are available for the current query, it takes the very next task even if that data is not local, because that helps mitigate the issue of stragglers holding everybody back (see the sketch below).

Now, in the paper they ignore a key problem with their approach, which I've already mentioned: the synchronization cost of this global structure. They're a bit hand-wavy, saying it's not a big deal, but when we get to the Umbra paper next, they basically throw it away and switch to a more distributed, scalable approach. And in the HANA paper, which I'll cover in a few minutes, they explicitly call out that the HyPer approach of having this global task queue at its core is going to be a problem. But HyPer's machine was a four-socket machine, right? The HANA guys are talking about 128 sockets or more, much larger machines. They even told me about this: before SIGMOD, they would invite some database faculty to come see presentations from people working on HANA, and they had one of their customers come in who was running HANA on some beefy box where they were running out of address space on x86-64. You have 64-bit pointers, but Intel only uses 48 bits for addressing, and they were running out of that 48-bit address space. That was a few years ago; I'm sure people are easily exceeding that now.
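Here's a minimal sketch of the local-first, steal-otherwise dispatch rule described above, assuming a simple latched list rather than HyPer's actual lock-free structure; the types and names are hypothetical:

```cpp
// Morsel-driven dispatch: prefer NUMA-local work, otherwise steal.
#include <deque>
#include <mutex>
#include <optional>

struct MorselTask {
    int numa_region;  // where the morsel's memory lives
    // ... plus the pipeline's operator instances to run over that morsel
};

class GlobalTaskList {
    std::mutex latch_;
    std::deque<MorselTask> tasks_;
public:
    void add(MorselTask t) {
        std::lock_guard<std::mutex> g(latch_);
        tasks_.push_back(std::move(t));
    }
    std::optional<MorselTask> pull(int my_region) {
        std::lock_guard<std::mutex> g(latch_);
        // First pass: look for a morsel in my own NUMA region.
        for (auto it = tasks_.begin(); it != tasks_.end(); ++it) {
            if (it->numa_region == my_region) {
                MorselTask t = std::move(*it);
                tasks_.erase(it);
                return t;
            }
        }
        // No local work: steal the next task anyway, since an idle core is
        // worse than paying for interconnect traffic.
        if (!tasks_.empty()) {
            MorselTask t = std::move(tasks_.front());
            tasks_.pop_front();
            return t;
        }
        return std::nullopt;
    }
};
```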
Okay. So again, the high-level idea is that we have some data table, we arbitrarily slice it up horizontally into morsels, and each morsel is assigned to one socket, one NUMA region. So what does that look like? What we already talked about with PAX and row groups: basically the same thing, just a different name. In this paper they claim 100,000 tuples per morsel was the right size, giving them the right amount of parallelism across all the cores. If you set it too small, you're constantly going back to the task queue and that becomes the bottleneck. If you set it too big, you get the straggler problem again: one worker is the only one that can process some giant morsel and everyone else stalls waiting for it. When we were building our system Peloton here, we did 1,000 tuples per morsel. And then in the follow-up system, NoisePage, we were doing 10-megabyte morsels, because you could play some tricks to use 20-bit offsets instead of full pointers, but we can ignore that.

All right, so now we have our query plan, and we convert it into a bunch of tasks in our task queue. The things in the task queue are, again, the computations of the operators you want to run within the pipeline, tagged with the morsels they operate on. Each core has a memory region that corresponds to its morsels; this is just the table space for the tables. Then there's a local buffer each worker uses to write out intermediate results. Again, this is an in-memory database, so everything is in memory, all in local memory, and each worker has its single core. To get started, each of these workers goes to the queue and pulls out tasks that process data local to it. Then when it runs, it's crunching on data that's local to it, so it doesn't have to cross the CPU interconnect and everything runs really fast. And they always write their output back into their local buffer, again to avoid traffic over the interconnect or writes to remote memory.

All right, so now say the task on CPU 3, for whatever reason, is just running slower. The other workers essentially have to stall, because there are no tasks available for this query to execute: the next stage has to wait for the output of the first pipeline. Now, if there were another query in the queue, you could start processing that, but then you get this contention of: okay, I only have so much space in my buffers. Do I really want to start processing another query that will interfere with the data I want to keep in my buffers, and then have to start swapping things in and out? These are individual cores.
So this is all one core? Yeah, exactly. One core has its L1 and L2, all the cores on the same socket share the L3, and then there's local memory. I'm drawing the CPU symbol here; I could draw a thread or whatever, but think of it as one hardware core. Yep.

All right, so when this straggler finishes, it frees up all the other workers. They can go back to the queue and pull out more tasks. And because the last task they executed for this query wrote data to their local buffers, we want the affinity of making sure that the next task, the one that processes the data we just generated in the previous pipeline, runs on the same core, again avoiding interconnect traffic. So now say this one finishes first, and there actually is a task it could execute. HyPer says that in this case, when you do work stealing, it's okay to go across the NUMA region to get the data you need, because that's better than having idle resources. But obviously it costs something to make that transition; is there a heuristic? And you can't ever predict when the task on CPU core 3 is going to finish, so what happens if core 3 finishes right after someone steals its work? So his question is: is there a heuristic to figure out when it's actually okay to steal? Because it may be the case that immediately after I steal, this thing finishes, and it could have processed the data locally, which would have been better than me stealing it. Again, on a single node, when morsels are around 100,000 tuples, you're down to milliseconds here, and the additional bookkeeping you'd need to figure that out carries a high penalty. It would be a lot of overhead to maintain. It's better to approximate, and they claim it's always better to steal. In the case of HANA, they do even more bookkeeping and say it's better to not do any stealing at all, because the cost is too high.

I guess I'm a little confused about how stealing handles partial work, like presumably stopping in the middle of a morsel. So the question is, how do you handle partial work with work stealing? Every task is one morsel. If this worker is processing this morsel here, no one else can take the same task, because it's no longer in the queue. When someone steals the next task, it's not processing the same data; it's completely disjoint and separate. So there's no concern about synchronizing partial results. Yeah, the morsels are disjoint.

Yes? In HyPer, these pipelines are compiled, right? Do they have compiled pipelines in DuckDB? No; DuckDB uses a push-based model, but they're doing the vectorized approach with pre-compiled primitives. Again, this scheduling is completely independent of the query processing model; it doesn't matter whether it's compiled or vectorized. In the back, yes. The question is, do you rebalance the intermediate results? So if this worker steals a bunch of stuff and keeps writing to its local buffer, is it going to run out of space? I don't know what they do; they don't handle that. At some point, I suspect you would. Yes.
Yeah, I don't know how they handle it, but you can imagine detecting, oh, I'm running out of space, I can't process anything else. Either you don't process anything else until the query finishes, or you have an internal task that moves things around, but the bookkeeping for that would be expensive. That's one of the challenges of an in-memory database: you can run out of memory. I think they just assume you don't. Yes, the question is: at some point you have to aggregate the intermediate results? Yes, but that's the exchange operator we talked about before. And that part you can't really parallelize: pulling the data from all the children below the exchange operator and coalescing it to produce the final output.

Okay. So one of the key problems with HyPer is that because there's only one worker per core and one morsel per task, they have to do work stealing. It's not exactly static scheduling, because workers pull tasks as they go along, but they can't rebalance the amount of work each task is doing. So in the cases where you're blocked waiting on the last task you need for a pipeline, everyone has to stall until that thing finishes. The other challenge, as I already said, is that they don't really explain how they built the lock-free hash table; that part of the paper is a bit hand-wavy. As we know, that's always going to be a contention point. Lock-free doesn't mean it's magically scalable; it just means you're never going to stall waiting for someone holding a lock. You may still have to spin until you can acquire something, though, and now you're burning cycles.

The other two problems in HyPer: first, they treat the execution cost of every tuple in a morsel as the same, and as I said before, you can easily come up with examples where that's not the case, depending on the query and its predicates. Second, as they mention in their conclusion, it would be nice to have quality of service or priorities, but they simply can't do that. It's almost a free-for-all: here's whatever's in my task queue, and the workers pull things as fast as possible and just run them. But that means a long-running query could take up all the workers while it's processing, and when short queries show up, I have no way to interleave them and make sure they get processed. As we said, that's bad, because people notice when the short queries run slow.

So the follow-up to the HyPer paper you guys read, the extension to morsels, is this paper from 2021 on the new system that came after HyPer, called Umbra. For background: HyPer was built by Thomas Neumann and his team at TU Munich. They formed a little mini startup based on it, then got acquired by Tableau, and HyPer was used as the internal in-memory engine inside Tableau, running any time you used the app. Then Tableau got bought by Salesforce and so forth. Thomas lost control of HyPer, because Tableau now owned it, so he went back and started building a new system called Umbra, and he couldn't use any of the source code he had from HyPer.
Everything is written from scratch, because he's a freak and he can. So this is the new scheduler they built in Umbra, meant to overcome the deficiencies of the HyPer morsel scheduler. The key changes are that tasks are not created statically up front, they're created dynamically at runtime, and there's no longer a one-to-one relationship between a task and a morsel, meaning one task can potentially process multiple morsels if it still has time available to compute things. Another way to think about it is that they're slicing up the computational resources by time, around this notion of a quantum: within your quantum you keep processing as many tuples as you can, and when you run out of time you have to give the CPU back. Though I think even when you give it back, you're still tied to the morsel you were processing, so nobody else can take it. The other thing they do, to make sure short queries aren't blocked behind longer-running ones, is automatic exponential priority decay: the longer a query has been running in the system, the lower its priority gets over time. It will still get scheduled eventually, but it won't get as many resources in any given time slice as a shorter-running query that just arrived.

At a high level, this is a variation of stride scheduling, which I think came out of the '80s or '90s. Did they teach that here? In OS? No? Think of it as an early way to schedule tasks and processes in an operating system, where you keep track of how long things have been running and how much work they do each time they run. But in the original formulation there's a global priority list, global information you have to maintain, and you assume the workload is fixed. Obviously in a database system, queries are coming and going all the time, so we can't make that assumption. They have ways to fix that.

Yes? The question is, how do you know how long a task is going to take before you execute it? You don't. They just monitor it and keep track over time. So for the morsel sizes, are you adding more data to a morsel, or just giving tasks more morsels? Think of the morsel as a logical concept: it's just the divider line where one morsel ends. If you recognize that each task is finishing each morsel really, really fast, then there's more bookkeeping from going back to the queue for the next task, so you increase the morsel boundary until the time it takes to process one morsel is about one millisecond. Okay, so you're not changing a given morsel after it's created; you're making future morsels bigger. Yes, but make sure we're clear about what we mean by creation: it's just a logical boundary. It's not like I'm copying data and making it bigger.
The boundary just tells you where to cut things off. You don't go back to a morsel that's already running and say, hey, here are more tuples you didn't have before. You only change the boundary for things that haven't been processed yet. For the remaining parts of the table? Yes. So it starts off with some small size and keeps growing until one of them takes longer than the quantum? Yes, exactly. You start with a small size, say the morsel is 100,000 tuples. If you complete that in less than one millisecond, then the morsel size for the next thing you process will be a little bit bigger, and you keep making it exponentially bigger until your task takes about one millisecond. What happens if for some reason you start with really big morsel sizes and a task takes, say, five milliseconds? Could you shrink it? Yes. Why not? I just didn't know if that was part of the system design, because it seemed like it was only growing. Yeah, but it's trivial to do. Why is the goal one millisecond per task? To balance the load, absolutely, yes. It allows you to be more dynamic and avoid the straggler problem; we'll see part of this in a second. Is it still doing work stealing? It is, but it just sort of happens because of the way they maintain the queue. There's still a global queue, but they're clever about how they maintain it. When a worker asks, what's the next thing I need to go do, it considers the location of the data plus the priority of the query. In the morsel approach, a morsel had to be processed by the core it was assigned to, and the work-stealing part was that I'm allowed to run tasks for data that isn't local to me. Here they're doing the same thing, but they're also including the priority information about the query itself. Is the priority assigned when the query starts? Yes, everyone starts at the same priority, and the longer you run, the more it decays. Then how do you guarantee that the long-running query doesn't get starved? Because the stride scheduling will handle that.
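Before we get to the pass mechanism, let me make the adaptive morsel sizing from a minute ago concrete with a little sketch. This is just an illustration, not Umbra's actual code: the class name, the constants, and the doubling-and-halving policy are all made up for the example, and a real implementation would need to make NextMorsel thread-safe, for example with an atomic fetch-and-add on the scan position.

```cpp
#include <algorithm>
#include <chrono>
#include <cstdint>

// Illustrative sketch of adaptive morsel sizing: grow the logical morsel
// boundary until processing one morsel takes about one quantum.
class MorselPicker {
  static constexpr int64_t kInitialSize = 1'000;      // starting tuples per morsel
  static constexpr int64_t kMaxSize     = 1'000'000;  // don't grow unbounded
  static constexpr auto    kQuantum     = std::chrono::milliseconds(1);

  int64_t morsel_size_ = kInitialSize;  // a logical boundary, no data is copied
  int64_t next_tuple_  = 0;             // scan position within the table
  int64_t table_size_;

 public:
  explicit MorselPicker(int64_t table_size) : table_size_(table_size) {}

  // Hand out the next [begin, end) tuple range; false when the scan is done.
  bool NextMorsel(int64_t& begin, int64_t& end) {
    if (next_tuple_ >= table_size_) return false;
    begin = next_tuple_;
    end = std::min(next_tuple_ + morsel_size_, table_size_);
    next_tuple_ = end;
    return true;
  }

  // Called by the worker after it finishes a morsel: grow the *future*
  // boundary while we finish under the quantum, shrink it if we ran over.
  void Feedback(std::chrono::nanoseconds elapsed) {
    if (elapsed < kQuantum) {
      morsel_size_ = std::min(morsel_size_ * 2, kMaxSize);
    } else {
      morsel_size_ = std::max(morsel_size_ / 2, kInitialSize);
    }
  }
};
```

The key point is that Feedback only changes where future boundaries get drawn; a morsel that has already been handed out is never touched.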
There's this notion of a pass: a global counter, a watermark, that keeps ticking forward, and if my query's pass value is below that watermark, I'm allowed to run again. Any new query that shows up gets assigned a pass at that global watermark, not at zero, so it can't perpetually jump ahead; eventually the long-running query's pass is below the watermark and it gets to run, so it won't be starved. Yes? The question is, in the morsel case the priority was based on what data I need access to and where I'm going to write the output, so if the priority here is global per query, do I lose the notion of locality? You don't; there would also be a local priority. We'll cover that in a second. Okay. So let's first describe how they avoid contending on the global task queue. There still is a global task queue, but the state that tells me whether I need to go look at it, to figure out what actually changed, is maintained in thread-local storage at every worker. Again, we're assuming we're running on a single node. So there's a global task set array, but it's just an array of pointers that tell you where to go find the information about the task sets for the queries that are running. Then within each worker there are these bitmasks: an active mask that keeps track of which slots in the slot array up above are live, and a change mask and a return mask that tell me whether something has changed up above and whether I should go check it. Workers running on different threads are allowed to write into this memory for the other workers, but they do it with atomic compare-and-swap operations, basically flipping bits with an XOR in a single instruction. Yes, there's cache-line invalidation across the interconnect, but you're not copying a bunch of data; it's one atomic operation. In my example here I'm showing four slots that can be active at any time. I think in the paper they talk about 128 slots, and that's just to bound how much work can actually be running at a given time, whereas in classical stride scheduling the number of slots is unbounded. Again, the global task set slots are just pointers to where the metadata about these running queries lives in memory. So let's look at an example of when a task set finishes and a new query arrives. Say worker one is running Q1, task set one, and worker two is running Q2, task set one. When worker one's task finishes, we go back up to the task set slot array, follow the pointer, and ask: is there something else I should be doing for this task set, for this query? Let's say in this case we've processed all the morsels, so we know this task set is done. Now the worker thread is responsible for taking the query's next task set and putting it in the queue. But then we want to notify all the workers that something has changed in the task set queue up here, so they go find out what it is. We want them to pull that information, not have it pushed to them, because if you start pushing things around, you have to maintain latches to make sure you're not overwriting information inappropriately.
So all we need to do is update the return mask: one atomic compare-and-swap per worker to set the bit for that slot. The next time a worker comes around and says, okay, I need another task to compute on, it knows it needs to go check the task queue to find out the new information somebody posted. It's like a message board, or a new email notification: I'm not telling you what changed, but you know where to go look for that information. Now let's say query three shows up. It gets put into the global task set slots, in this position here, and some other thread, a scheduler or coordinator thread, is responsible for flipping a bit in the change mask of every worker to say, hey, by the way, a new query showed up in this slot. Yes? The question is: I thought there was no dispatcher thread. It's not a dispatcher thread in the sense of handing out work, saying you run this, you run that. Something up above has to take a query when it shows up and put it in the queue; call that a coordinator or a scheduler thread. But that thread is not responsible for assigning tasks; the workers are all pulling work themselves. You just need something to flip the bit and say, by the way, we added something new, make sure you go check it out. In the back, yes: would that dispatcher be responsible for flipping the bits in the return mask? No. As far as I know, the thread that produced the result is responsible for flipping the bits on everyone. Why not let the dispatcher handle that? Because now you've got to go tell the dispatcher to go do it. It's just cheaper to do it yourself. The next question is, why are we doing pull rather than push? The concern is that when a worker tries to pull from the global task set slots, it wastes cycles going to look and asking, hey, what do I need to do? And if there's already a thread that can flip the bits, it might as well keep track of what every worker is doing and tell them what to run. So the statement is: if something is already responsible for putting things in the global task queue, why not have that thing tell everybody what to do? But then you have to maintain all that state somewhere. They're arguing it's better to distribute it across the workers in thread-local storage, and to use simple compare-and-swaps to notify them of the changes that are occurring, rather than the heavyweight approach of farming out complex messages that workers have to process themselves. It's cooperative scheduling: rather than having one thread be responsible for everything, which could potentially be more efficient with a small number of cores, this approach is definitely more scalable with a larger number of cores.
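Here's a minimal sketch of what that per-worker mask state might look like. This is illustrative rather than Umbra's actual code: the names are invented, it assumes at most 64 slots so each mask fits in one machine word, and it uses fetch_or and exchange, single atomic read-modify-write instructions in the same spirit as the compare-and-swap bit flips just described.

```cpp
#include <atomic>
#include <cstdint>

// Illustrative per-worker notification state; one instance per worker thread,
// kept in thread-local storage but writable by other threads via atomics.
struct WorkerState {
  std::atomic<uint64_t> change_mask{0};  // "a new task set appeared in slot i"
  std::atomic<uint64_t> return_mask{0};  // "slot i changed; go re-check it"
  uint64_t active_mask = 0;              // worker-private view of live slots
};

// Any thread (a finishing worker, or the coordinator admitting a new query)
// can notify worker w about slot i with one atomic operation: no latches,
// no message payload, just one cache line touched per worker.
inline void NotifyChange(WorkerState& w, unsigned slot) {
  w.change_mask.fetch_or(uint64_t{1} << slot, std::memory_order_release);
}

inline void NotifyReturn(WorkerState& w, unsigned slot) {
  w.return_mask.fetch_or(uint64_t{1} << slot, std::memory_order_release);
}

// The worker polls its own masks before picking its next task: atomically
// grab-and-clear all pending notifications, then consult the global task
// set slot array for just those slots to learn what actually changed.
inline uint64_t ConsumeNotifications(WorkerState& w) {
  uint64_t pending = w.change_mask.exchange(0, std::memory_order_acquire);
  pending |= w.return_mask.exchange(0, std::memory_order_acquire);
  return pending;  // bit i set means: go look at global slot i
}
```

This is the pull behavior from above in code form: the notifier only flips bits, and each worker goes to the global slot array on its own schedule, for exactly the slots it was told about.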
Okay. So then when this worker finishes the task set it was running, it goes back up, looks at the queue, consults its change mask and return mask to figure out what needs updating, updates its active mask to say, hey, there's something in slot one I could go take, and then decides to run Q3's task set one. And notice that worker two doesn't know about Q3 yet. It just knows that a bit got flipped for the first slot, and eventually, when it goes back and looks at the task set queue, it'll learn what that is. So these workers can run independently of each other. We're not having to coordinate across all of them, which is arguably always better. All right. We only have a few minutes, so I'll skip the details of the priority decay, but basically think of it like this: there's this notion of a global pass, essentially the number of times we've passed through and executed things, and then there are local priorities based on how much work has been done for each query, and the combination of those determines what you want to run next. The highlight, again, is that as a query runs longer, its priority decays and goes down.
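To make that concrete, here's a toy sketch of stride scheduling with priority decay. This is an illustration of the general technique, not Umbra's actual algorithm: the decay factor, the stride arithmetic, and all the names are made up for the example.

```cpp
#include <limits>
#include <map>

// Toy stride scheduler: lower pass value means scheduled sooner.
struct QueryState {
  double pass;    // how much virtual time this query has consumed
  double stride;  // pass increment per quantum; grows as priority decays
};

class StrideScheduler {
  std::map<int, QueryState> queries_;  // query id -> state
  double global_pass_ = 0.0;           // the "watermark" that keeps ticking

 public:
  // New arrivals start at the current watermark, not at zero, so they can't
  // perpetually jump ahead of queries that have already been waiting.
  void Admit(int qid, double base_stride = 1.0) {
    queries_[qid] = QueryState{global_pass_, base_stride};
  }

  // Pick the query with the minimum pass value, or -1 if none registered.
  int Next() const {
    int best = -1;
    double best_pass = std::numeric_limits<double>::max();
    for (const auto& [qid, q] : queries_)
      if (q.pass < best_pass) { best_pass = q.pass; best = qid; }
    return best;
  }

  // Charge the query for the quantum it used, decay its priority a bit
  // (a bigger stride means it runs less often), advance the watermark.
  void Charge(int qid, double decay = 1.1) {
    QueryState& q = queries_.at(qid);
    q.pass += q.stride;
    q.stride *= decay;  // the longer a query runs, the lower its priority
    global_pass_ += 1.0 / queries_.size();
  }

  void Finish(int qid) { queries_.erase(qid); }
};
```

Usage is the obvious loop: Admit a query when it arrives, run a quantum for whatever Next returns, Charge it, and Finish it when it completes. Because arrivals start at the watermark, an old query whose pass has fallen below the watermark is guaranteed to get picked eventually, which is the no-starvation argument from before.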
All right, let me quickly go through HANA. This is the other end of the complexity spectrum: Postgres is the easiest, just let the OS do it, and HANA is the most elaborate. This approach, which I think came out of a PhD dissertation by somebody who worked at SAP, so I don't think it ever actually made it into the real HANA system, especially after they rewrote things in the late 2010s. The idea is that HANA does even more of the scheduling on its own for individual threads and doesn't let the OS do any of it. It supports both work stealing and thread pool scaling, meaning that within a single socket, a single NUMA region, I can add more threads dynamically and not be limited to one thread per CPU core; I can keep adding threads if I think things are going to get stalled, for a variety of reasons. Then they have two kinds of work queues: a soft queue and a hard queue. The hard queue holds tasks that you don't want anybody to steal, that have to run on that socket, that NUMA region. Think of something like garbage collection for data that lives in that NUMA region, where you don't want to go over the interconnect, or a networking task that has to run on a given socket. The soft queue holds the things that workers on other sockets are allowed to steal, similar to the Hyper approach for scans. So every socket has these soft and hard priority queues, and then there are four different pools of worker threads. You have working threads that are actively running something; inactive threads that are blocked in the kernel waiting on a condition variable or a latch; free threads that wake up periodically to see whether there's anything to do; and parked threads that have actually been descheduled and handed back to the OS kernel, like a sleep or a yield, sitting down there until you need them and spin them back up. The argument is that it's cheaper to put some threads to sleep down in the OS scheduler, so that when I need them I can pick them up and start running with them, compared to spinning up a whole new thread or process. Let me go through this quickly. The basic idea works the same as before: we have a bunch of tasks we want to run for a query, but now I'm also including maintenance tasks like garbage collection for multi-version concurrency control, because HANA is an MVCC system. The query tasks need to run right now, but they can go in the soft queue, because technically any NUMA socket can run them. The GC stuff, though, we put in the hard queue. The working threads execute the active tasks; the inactive ones are blocked on something in the kernel, so we can't execute on them, but we expect them to wake up fairly soon; the free ones are spinning, polling for work all the time; and the parked ones are the heavyweight paused threads down in the kernel. The HANA guys claim that in their experimentation with this approach, on the large-socket machines it was better to turn off all the work stealing, basically to always put everything in the hard queue, and that beat moving things across different NUMA regions. I like this design: I like having all the scheduling managed by the system itself instead of the OS. And the inactive pool makes sense because HANA was trying to support both OLTP and OLAP; in a pure OLAP system, threads blocking on latches and locks is less of an issue, but here it made sense to have these different variations of threading.
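Here's a rough sketch of the two-queue idea. It's a single-threaded illustration of the policy only: all the names are invented, a real version would need latches or lock-free deques around the queues, and whether HANA steals from the back of the deque exactly like this I'm not claiming; the point is just hard means pinned, soft means stealable.

```cpp
#include <cstddef>
#include <deque>
#include <functional>
#include <vector>

// Illustrative HANA-style scheduler state: one instance per socket / NUMA region.
using Task = std::function<void()>;

struct SocketScheduler {
  std::deque<Task> soft_queue;  // stealable by workers on other sockets
  std::deque<Task> hard_queue;  // pinned: must run on this socket
                                // (e.g., GC for this region's memory)

  // A local worker prefers its own pinned work, then its own soft work.
  bool PopLocal(Task& out) {
    for (auto* q : {&hard_queue, &soft_queue}) {
      if (!q->empty()) { out = std::move(q->front()); q->pop_front(); return true; }
    }
    return false;
  }

  // A remote worker may only steal from the soft queue, and from the back,
  // to keep thieves away from the end the local consumer is working on.
  bool Steal(Task& out) {
    if (soft_queue.empty()) return false;
    out = std::move(soft_queue.back());
    soft_queue.pop_back();
    return true;
  }
};

// A free thread polls: first its own socket, then everyone else's soft
// queues. If it repeatedly finds nothing, it would park itself (hand the
// thread back to the OS) until another worker wakes it.
inline bool FindWork(std::vector<SocketScheduler>& sockets, size_t my_socket,
                     Task& out) {
  if (sockets[my_socket].PopLocal(out)) return true;
  for (size_t s = 0; s < sockets.size(); ++s)
    if (s != my_socket && sockets[s].Steal(out)) return true;
  return false;
}
```

Turning work stealing off, which the HANA experiments favored on big NUMA boxes, amounts to routing everything into hard_queue so the Steal path never fires.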
All right, I'm going to run through this last part really fast. I want to show you one last thing, in SQL Server, because to me it's relevant for how you design these systems: SQLOS. Let me give the background. It's not a full operating system. SQLOS is an abstraction layer in SQL Server, built around 2006, that hides the low-level details of the hardware and the operating system from the upper parts of the database system. The reason they built it is that Microsoft observed that every time new hardware came out, they had to rewrite their implementation to account for it. Machines got many more cores, they had to rewrite a bunch of stuff; NUMA regions showed up, they had to rewrite a bunch of stuff. This abstraction layer lets them hide those low-level details, and the scheduling and the movement of data can all be managed by this SQLOS thing. So again, it's not a full operating system like the Linux kernel or Windows; it's just an abstraction layer. But the cool thing it does is non-preemptive thread scheduling inside the database system. What's another word for non-preemptive thread scheduling? Coroutines. Same idea, right? You're managing the multi-threading within the database system itself. But it's not preemptive: preemptive means the OS can take your hardware thread away and give it to someone else. Here, all these threads are running within the database system itself, and we can't send an interrupt to ourselves. So that means in our own code we have to put explicit instructions to yield back to the scheduler, to say, hey, go check whether something else should be running instead of me. And they did this back in 2006, before C++ and Go and other programming languages had built-in support for coroutines. There's a great article on how they built it. I used to say that SQLOS is what allowed Microsoft to get SQL Server running on Linux, because, again, it abstracted the OS layer. The guy who built SQLOS then called me and said that's not true; actually, their effort to get SQL Server running in Docker turned out to be what enabled the Linux support. But there's a good article that talks about it. Real quick, let me show you what it looks like. Say you have a SQL query like this, the quantum is set to 4 milliseconds, and you want to do a sequential scan of this data. The approximate query plan looks like this: scan, evaluate the predicate, emit. What they do is keep track of the time while the different parts of the operator run, and check whether the elapsed time since they started running is greater than the 4 millisecond quantum. If yes, they yield back. Now, this is pseudocode; you would not literally do this, because going and checking the system clock on every tuple is expensive. Don't do that; there are cheaper ways to read timestamps, and you amortize the check. But the basic idea is: if I know I've been running longer than my quantum, I yield. You can do this for other things in your data system too. If you go to acquire a lock and the lock's not available, instead of your thread spinning and waiting to acquire it, you yield back. And because the database system controls everything, we can yield back with context. If you yield to the OS, what can you say? Yield, that's it. If I'm yielding back to my own scheduler, I can say: by the way, I need this lock, so don't schedule me until that lock is available, and when it is, the scheduler puts me back. Complete control of everything if we do this ourselves.
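Here's roughly what that quantum check might look like. This is a sketch of the idea, not SQL Server's code: yield_to_scheduler is a stand-in I made up for whatever suspension mechanism the engine provides (a fiber switch, a coroutine suspend), and the clock read is amortized over batches of tuples as just mentioned.

```cpp
#include <chrono>
#include <cstdint>

// Stand-in for the engine's cooperative suspension point. In a real system
// this would switch to the scheduler (fiber/coroutine), possibly carrying
// context like "don't resume me until this lock is available".
inline void yield_to_scheduler() { /* suspend here in a real engine */ }

constexpr auto    kQuantum    = std::chrono::milliseconds(4);
constexpr int64_t kCheckEvery = 1024;  // amortize the clock read over tuples

// A scan operator that cooperatively yields when its quantum expires.
template <typename Predicate, typename Emit>
void ScanWithYield(const int64_t* col, int64_t n, Predicate pred, Emit emit) {
  auto start = std::chrono::steady_clock::now();
  for (int64_t i = 0; i < n; i++) {
    if (pred(col[i])) emit(col[i]);
    // Reading the clock on every tuple would dominate the scan itself,
    // so only check once every kCheckEvery tuples.
    if ((i & (kCheckEvery - 1)) == 0 &&
        std::chrono::steady_clock::now() - start >= kQuantum) {
      yield_to_scheduler();                      // let someone else run
      start = std::chrono::steady_clock::now();  // a new quantum begins
    }
  }
}
```

The same pattern works at any blocking point: before a disk read, or when a lock acquisition fails, yield instead of spinning.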
SQLOS is probably the first system that did this, but there are other systems that do it now. ScyllaDB gets this through the Seastar framework. FaunaDB does a poor man's version of it: basically, every time before it reads something from disk, it just yields, that's it. And CoroBase is an experimental system out of Simon Fraser University that is explicitly built around C++ coroutines. There's also a talk from the ScyllaDB guys from a few years ago about Seastar. Okay, we're well over time. Distributed scheduling: basically all the same problems, but now we're over the network. Right? You laugh, but that's really what it is. We'll cover workload scaling and dynamic scaling later; we'll see that Snowflake can do both, while some systems can do one but not the other. All right, main takeaway: database systems are beautiful. Don't let the operating system boss you around. Don't let the operating system try to do anything; we want to do everything ourselves. I think the SQLOS approach is probably the right way to do it. It's overkill for what we're doing in this class, but, you know, it's beautiful. Okay. What's that? Strong will too, yes. Next class we'll do nothing but hash joins. Then on Monday next week we'll do multi-way joins, plus a quick overview of the things we've talked about. And Wednesday next week will be the project status updates. Okay? Badoom, baby.