So for today's class, we are now working further up the stack in the database system, and we're going to talk about how we do query execution and scheduling. There's a bunch of background information we have to get through before we can understand why the HyPer guys did what they did, or why they chose that particular type of architecture. So I want to start off talking about different process models for our database system, and then we'll talk about query parallelization and the different types of query parallelization you can have. We'll also talk about data placement, non-uniform memory access, and other things like that. Then we'll finally get to scheduling, and we'll finish up with a quick overview of an outline of rules you should follow when you're in the streets and when you're doing databases. We'll be following a 1997 list of edicts called the Ten Crack Commandments. We'll go through that and see how it relates to databases in your life. When you leave CMU, these are some of the life skills you need to survive.

Okay, so for pretty much this entire semester, we're talking about a certain kind of database called a multi-user database system. This differs from an embedded database system, something like SQLite or RocksDB or LevelDB, because in those systems you assume that there's only one user accessing the database at a time. That user may want to run multiple transactions, but you're really only interacting with one outside entity. So for a multi-user architecture, to make sure we use the same terminology in our discussion for this lecture, I want to go through, at a high level, what I mean when I say the multi-user database application stack.

The term I'm going to use is that we have a client, and the client interacts with the database server. The client is not actually an end user on their iPhone or something like that; this is your application server running PHP, Node.js, Tomcat, or whatever it is. It's making requests, either in SQL or PL/SQL, over JDBC or ODBC or some other wire protocol, to invoke queries on the actual database system itself. And we differentiate the client from the end users because the end users are one hop removed: they're the ones making invocations to the application server, the client, which then makes requests to the database server. Typically these end-user applications will be written against something like a REST or SOAP API, some kind of XML/JSON thing, and you could be sending the requests over HTTP. So another way to think about it is that this is the front-end application, the thing that makes a high-level request about what it wants the application to do, and then we have the back-end application that takes this request and converts it into, or executes, the appropriate SQL statements against our database server to do whatever operation we want to do. So when I say client-server as we go along, I really mean this setup; I don't mean this part over here. Some database systems like CouchDB will let you make JSON or REST requests directly to the database system, but usually this is how people implement it: there's always an application layer that talks directly to the database server, and you usually don't let the end users do that.
Okay, so to go further, I want to set up the terminology we'll use when we talk about query execution and query scheduling in a database system. We're going to say a query is comprised of a query plan: you take a SQL statement and you generate a relational algebra query plan. That query plan is comprised of operators. These could be relational operators like scan, join, insert, update, delete, things like that, but also physical operators that allow us to have control flow in our query plan, and we'll see what that looks like in a second. The key thing to understand is that when we have a query plan with a bunch of operators, at the logical level there will be one scan operator for the one table we want to scan, but at run time the physical plan could have multiple instances of that operator, each performing the same functionality as the logical operator but executing in parallel on different segments of the data. So we'll say a single operator can have multiple operator instances when we actually run the query. And then we're going to organize a sequence of operator instances into what are called tasks. A task could be a series of operations that we want to do to compute some portion of the query. All our scheduling, for the most part, is going to be at the task level; that's the encapsulation we care about. The idea is that if there are operator instances that feed into each other one after another, we can combine them together into a single task and get a nicer pipeline with better cache locality, because we know that the output of one operator instance can be used immediately by the next operator instance.

Okay. So now, if we want to support a multi-user application, with multiple transactions and multiple queries running at the same time, we need to talk about what the database system's process model is going to be. The process model defines the high-level architecture of how the system allows for concurrent operations. And within this process model, we have the notion of a worker. I'll use the term worker thread throughout the lecture, but it doesn't necessarily have to be a thread; it could be a process itself. The idea is that this worker is the high-level construct responsible for taking the tasks that the query planner generates in our physical plan, actually executing them, and producing some results. Those results can either be fed into another task or fed back to the client itself on the application server.

So I'm going to go through the three different process models we can have in our database system. As far as I know, every database system out there implements one of these three approaches. The first model is called process per worker. The idea is that every single worker executes as its own separate operating system process; not a thread, but an actual full-fledged process. So what happens is that when the application sends a request, it first goes to a dispatcher, and the dispatcher says, all right, let me fork off a process that's going to handle this client connection. It finds a worker, assigns the connection to that process, and now this worker communicates directly with the application on the client side.
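As a rough illustration of the process-per-worker idea, here is a minimal sketch of a dispatcher that forks one OS process per client connection. The socket setup is omitted, and handle_connection is a made-up stand-in for whatever the worker would actually do; this is just meant to show the fork-per-connection shape, not how DB2 or Postgres actually structure their code.

```cpp
#include <sys/socket.h>
#include <sys/wait.h>
#include <unistd.h>

// Placeholder: the worker process would parse requests, run queries against
// shared-memory structures, and write results back on this connection.
void handle_connection(int client_fd) {
    // ... worker logic would go here ...
    close(client_fd);
}

void dispatcher_loop(int listen_fd) {
    for (;;) {
        int client_fd = accept(listen_fd, nullptr, nullptr);
        if (client_fd < 0) continue;

        pid_t pid = fork();              // one full OS process per connection
        if (pid == 0) {
            // Child: this is the worker. It talks to the client directly from
            // now on; the dispatcher is no longer involved for this connection.
            close(listen_fd);
            handle_connection(client_fd);
            _exit(0);
        }
        // Parent: the dispatcher just hands off and goes back to accept().
        close(client_fd);
        waitpid(-1, nullptr, WNOHANG);   // reap any finished workers, non-blocking
    }
}
```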
So when additional requests come in, they don't go through the dispatcher; they go directly to this process. And the worker knows how to interact with the database, whether it's on disk or in memory; it doesn't matter. The way we do scheduling in this process model is that we completely rely on the operating system: we fork off a bunch of processes and then let the OS scheduler do whatever it wants with them. We can set priorities with nice or ionice for our workers, but we don't have a lot of fine-grained control over how we set priorities in our system; we just let the OS do whatever it wants to do. To make this work with multiple processes, shared data structures like the lock table or the buffer pool have to live in shared memory, and we have to maintain locks and latches and do all the protections we would normally do in a single-process environment, because we want all the processes to be able to share that data rather than keep complete redundant copies of everything.

This is the approach that was most used in the systems that came out in the 1980s and 1990s, because back then threading support was not what it is now. Now everyone has Linux and there's pthreads and things are great; even though there are different distributions of Linux, everything is GNU, everything is POSIX, so your code is pretty portable. Back in the 1980s and 1990s there were all these different variants of Unix: there was Solaris, there was HP-UX, Minix or whatever, all these other Unixes and BSDs that didn't have the same threading support or the same threading packages. You could reliably assume that whatever operating system you were running on could support fork and a multi-process architecture, but if you relied on threading, then chances are the code you wrote for one operating system wouldn't work on another, so you limited your portability. So, again, this is why things like DB2, Postgres, and Oracle, all these older systems, use this architecture here.

Another approach is to again use multiple processes, but instead of forking off one process per client connection, you have a pool of worker processes. When a request comes in, the dispatcher just picks some free process in the pool and lets it handle the one task it wants to execute; that task could be the entire query plan or just a portion of it. The process communicates with the database and then sends the result back to the client. When the next request comes in, even though it may be for the same connection, the dispatcher may choose a different process. So again, you're still relying on shared memory to make this all work, and you're still relying on the operating system to do all your scheduling, but you don't have to fork things off every single time; you have this pool of processes you can reuse as needed. The only database system I know of that does something like this is IBM DB2, and as we'll see in a second, DB2 actually supports all three process models.
That's because the system has to be designed to work not only on Linux and Solaris and your standard Unix variants, but also on crazy mainframe stuff whose architecture contains a lot of legacy components from the 1960s and 1970s. So they basically let you, as the DBA, specify which one of these models you want to use based on what kind of hardware you're running on.

The last approach, which is the more modern variant of the three, is to do a thread per worker. So now we don't have multiple processes; we have a single process that has a pool of worker threads, and when a request comes in, we just pick a free thread from the pool and let it execute. We can still have a thread per connection, or we could have multiple threads run different tasks of the same query; it's really left up to us to decide how we want to spread things out. The disadvantage is that it requires more work on behalf of the database system developer. People like us who work inside the database system basically have to reimplement a lot of the scheduling machinery that the operating system would normally do for us in the process model. But that's better, because now we have more fine-grained control over exactly how we want to organize and schedule tasks across our threads. And for this you may or may not have a dispatcher thread. A dispatcher thread would be a coordinator: when a request comes in, it assigns it to a free thread. In the HyPer paper you read, they don't actually use a separate dispatcher thread; they let all the workers do cooperative scheduling. But in other systems there could be a thread that organizes things.

So of course DB2 does this, because DB2 does everything. The newer versions of Oracle, I think as of Oracle 12, which came out, what, two years ago, do this kind of threading now. SQL Server does this. MySQL does this. And pretty much, as far as I know, every database system that's come out in the last six, seven, eight years does multi-threading. Nobody does the process model anymore, because it's just a pain in the ass to deal with shared memory.

Okay, so as I said, the multi-threading architecture has a distinct advantage over the other two. One key aspect is that a context switch from one thread to the next is much more lightweight than a context switch between processes. When you switch between threads in the same address space, in the same process, you don't have to switch security contexts or swap out the address space; you still have to swap out program counters and registers, but you don't have to swap out as much as you would when switching between processes. We also don't have to deal with shared memory anymore, which is kind of nice. And in general this gives us more flexibility in our architecture. In my opinion, this is the better way to go. I'll note, though, that just because we're doing one thread per worker, it doesn't mean we can do the intra-query parallelism that we'll talk about in a second. In the case of MySQL, for example, MySQL is a multi-threaded database system, but when you execute a single query, one thread executes that entire query. You can't split it off into multiple threads and combine the results the way we can in the HyPer paper.
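Here is a minimal sketch of the thread-per-worker idea: a single process owns a fixed pool of worker threads that pull work items off a shared queue. This is generic C++ with std::thread, not code from any of the systems mentioned; the point is just that everything lives in one address space, so handing a task to a worker is a queue push rather than a fork.

```cpp
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Sketch only: shutdown and error handling are omitted for brevity.
class WorkerPool {
public:
    explicit WorkerPool(unsigned n) {
        for (unsigned i = 0; i < n; i++) {
            workers_.emplace_back([this] { WorkerLoop(); });
        }
    }
    // A connection handler (or dispatcher) just enqueues a task; there is no
    // process creation and no shared-memory segment to set up.
    void Submit(std::function<void()> task) {
        { std::lock_guard<std::mutex> g(mu_); queue_.push(std::move(task)); }
        cv_.notify_one();
    }
private:
    void WorkerLoop() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lk(mu_);
                cv_.wait(lk, [this] { return !queue_.empty(); });
                task = std::move(queue_.front());
                queue_.pop();
            }
            task();   // run the task: a whole query plan or one piece of it
        }
    }
    std::vector<std::thread> workers_;
    std::queue<std::function<void()>> queue_;
    std::mutex mu_;
    std::condition_variable cv_;
};
```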
And in the case of Postgres, they could do multi-threading within a single process, even though they're using the process model, but they don't. Supposedly in Postgres 9.5 or 9.6 they're going to have parallel queries, but it's still going to be a single process forked off for the connection, with the parallel work happening under that; they're not going to switch to a threaded process model. As I said, I don't know of any newer database system using the process model; you'd be crazy to build something that way now, when pthreads is so widely available. The only newer systems I know of that aren't doing threading are ones based on Postgres. So CitusDB, the Vitesse DB guys: they're predicated on using Postgres as their architecture, but they didn't rewrite it to be multi-threaded the way we did in Peloton.

Okay. So given that we now have a process model, we want to think about how we actually schedule these tasks to execute queries. For each query plan that comes in, usually what happens is that you could assign the connection to a thread, and that thread does the parsing, the planning, and the optimization for the query; or you could have a central thread that does it. It doesn't really matter. But now we have a query plan, and we have to decide where we want to execute it, when we want to execute it, and how we're going to execute it. So there are four general questions we have to deal with. We have to figure out how many tasks we should split our query plan into, and how many cores we can use to execute those tasks. It doesn't have to be a one-to-one matching: if you have ten cores, it doesn't mean you have to use ten tasks; you could use fewer, you could use more. Then we have to decide where we should execute those tasks. This will make a big difference when we talk about the NUMA architecture, because the latency of accessing data from one CPU socket versus another is not always the same. So if you know data is at one location, you want to run your tasks on the cores that are closer to it. This is the same thing we see in a distributed database system: if you have your data spread across a bunch of nodes, you want to execute the query where the data is actually located; you don't want to pull the data to the query. And the last thing we have to deal with is where we actually store the output.

All of this is the kind of stuff you completely gloss over in the intro course. You say, hey, here's a query plan, here's your processor, just plow through it. But now that we actually implement these things, we have to care about this, and it's not easy. In general, the main precept you should keep in mind going forward, for the rest of your life, is that we never want to let the operating system do this. The database system always knows better and can always make better decisions. It's more work for us to implement this logic in the database system itself, but that's okay, because we get paid a lot of money to do it, right? It's the same thing as when I said never use mmap, never let the operating system manage your memory: you never want to let the OS manage your scheduling either. All right, so there are two types of parallelism we want to achieve in order to get better performance in our system.
So the first is the obvious one that we've talked about before when we covered concurrency control: what's called inter-query parallelism. The idea here is that we want to improve the overall performance of the database system by allowing multiple queries to execute simultaneously. We have multiple application threads invoking transaction requests, and these query requests arrive at the same time; we want our database system to execute them at the same time, rather than having to run them one after another. This is why we use the concurrency control stuff we talked about before: we allow these queries to execute simultaneously without messing each other up, while still maintaining the illusion that each one has complete control of the system, that it's running in isolation. So I would say there's not much more we can cover today in scheduling for inter-query parallelism, because we've already handled that before; all the same ideas we'll talk about here fit right alongside it. I will note that, in my opinion, there isn't a significant software engineering difference in the amount of work required to implement a concurrency control scheme in the different process models. Postgres is doing MVCC and uses shared memory to make that work; I don't think there's much more or less work you'd have to do if you switched to a multi-threaded architecture. So I think there's no difference in terms of inter-query parallelism, but I think there is a difference for intra-query parallelism, which is the second type of parallelism we want to have.

For this, the idea is that we have a single query from one transaction, ignoring the other queries running at the same time; we're just dealing with this one query. And we want to improve performance by allowing the tasks that correspond to that query's plan to run in parallel. Instead of saying, as in MySQL, that a single query plan has one thread executing it from beginning to end, now, since we have all these multi-core systems and all these extra threads we can use, we want to break our query plan into tasks and allow those tasks to be computed at the same time.

There are two approaches to doing intra-query parallelism. In the same way that we split inter-query from intra-query parallelism, there's another hierarchy for the kinds of parallelism we can have within the operators themselves: intra-operator parallelism and inter-operator parallelism. With intra-operator parallelism, the idea is that we decompose an operator into multiple operator instances that perform the same function, replicated multiple times, with each instance operating on a different segment of the data. So if you have a scan operator that wants to scan an entire table, we can split that scan into pieces and have the different instances operate on just those pieces, and that can all be blasted out in parallel. The other type of parallelism is within the query plan itself: we want the different operators at different levels of the plan to execute at the same time.
So they're doing different functions, but they're both operating at the same time, and we can feed data from one operator to the next; the second operator computes whatever it needs to compute on the output of the previous one, while the previous one keeps running. We'll show examples of both of these on the next slides.

All right, so this is intra-operator parallelism. Here's our query plan: we're doing a join between A and B on some foreign key, and then we have two predicates doing some kind of comparison to filter both of the tables. Here's a logical query plan for this: we do a scan on A, feed it into a filter operator, then we do the join, and then we have our projection. Say we start with the scan on A, and say A has, you know, a million tuples that we can divide up into three chunks of data. In our physical plan we'll have separate operator instances that each operate on a portion of the total data for A. So we break it up into three chunks, or three partitions, and we have an operator instance for each of those chunks. We can assign them to cores and they can execute in parallel. Now we can expand this even further: the scan on A immediately sends its output into a filter operator, so in our task we want to include not only the scan on a partition of A's data, but also the filter operator applied to it. So this, essentially, is our task that we can then schedule out onto these different cores. And because we know we're going to do a hash join, we can also go ahead and build the hash table that we're going to need for the probing later on.

The way we combine these results is through what's called an exchange operator. There was a query processing method from the late 1980s and early 1990s, for a system called Volcano, that basically proposed a way to do parallel execution of query plans through these exchange operators. What happens here is that this task is doing a scan on A1 and this one is doing a scan on A2, so they're each building their own little local view of the table, and we need some way to combine those views to produce a single answer to push up into the other parts of the query plan. The exchange operator is a way to coalesce the results from these separate tasks and combine them into a single piece of data that can then be used by other parts of the query plan without having to know that the work was broken up into three separate tasks. So as we go up, the upper portions of the query plan don't know that we executed this in parallel, and they don't care. Normally, if you execute this using the iterator model, when you get to the exchange operator as you go down the query plan, you would call fork or spawn to fire off these three separate tasks in separate processes or threads, and the exchange knows it has to wait until all the output from those three processes or threads is finished before notifying the operator above it that it has the data it was looking for. And now we'll do the same thing on B. Say B is a smaller table; it doesn't have as much data as A, so we'll just split it into two chunks, and we have two tasks that do the scan, the filter, and build our hash table.
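Here is a rough sketch of that pattern in C++: the scan+filter+build work for one chunk of A is one task, the tasks run on separate threads, and an exchange step at the end merges the per-task hash tables into one. The table layout and the merge-by-copy exchange are simplifications I'm assuming for illustration, not how HyPer or Volcano actually implement it.

```cpp
#include <string>
#include <thread>
#include <unordered_map>
#include <vector>

struct Tuple { int key; std::string payload; };

// One task: scan a chunk of A, apply the filter, and build a partial hash table.
std::unordered_map<int, Tuple> scan_filter_build(const std::vector<Tuple>& chunk) {
    std::unordered_map<int, Tuple> partial;
    for (const auto& t : chunk) {
        if (t.key % 2 == 0) {          // stand-in for the real predicate on A
            partial.emplace(t.key, t);
        }
    }
    return partial;
}

// The "exchange": run one task per chunk in parallel, then coalesce the
// partial hash tables so the join above sees a single build-side table.
std::unordered_map<int, Tuple>
parallel_build(const std::vector<std::vector<Tuple>>& chunks) {
    std::vector<std::unordered_map<int, Tuple>> partials(chunks.size());
    std::vector<std::thread> workers;
    for (size_t i = 0; i < chunks.size(); i++) {
        workers.emplace_back([&, i] { partials[i] = scan_filter_build(chunks[i]); });
    }
    for (auto& w : workers) w.join();   // exchange blocks until every task is done

    std::unordered_map<int, Tuple> combined;
    for (auto& p : partials) combined.insert(p.begin(), p.end());
    return combined;
}
```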
And again, we have the exchange operator that tells us we need to combine these results. Now we get to the join operator, and the join knows it's going to take the input from the exchange operators, so we can farm out the actual probing, the join operation itself, into separate tasks as well. So now we have five tasks on five CPUs that we can fire off in parallel. Then we have to wait until the exchange operators finish, because at that point we have our two hash tables and we can compute the join. Once that's done, we can schedule four different tasks that each do the join on some portion of the hash tables we built, and then there's an exchange that coalesces them and produces the final answer. This is called horizontal parallelism, because within a single level of the query plan tree we can have all the operator instances running in parallel. This is very common; most database systems that support intra-query parallelism do this kind of exchange thing. And you see this a lot in distributed databases too, except that instead of running on separate cores, these tasks are running on separate machines: you ship the data over the network to some location, coalesce it, and then feed it to the next operator.

Okay, so in inter-operator parallelism, the basic idea is that you can think of each of the operators in our query plan as running in a separate thread, as a separate task, pulling on the operator below it to get the next piece of data. Again, this is the Volcano model; you can think of it as calling get-next, get-next on the operator below. But instead of that being a blocking call where this operator blocks while waiting, you can tell the operator below that you want more data, then spin in a separate loop up here and wait until it pushes something up to you. Let's look at an example. Say we have this join: it's basically going to do a for loop on the outer table, a for loop on the inner table, compute the join, and then call emit to say it has new output and push it up to the operator above. Now the filter operator above it is spinning on some input queue, waiting for tuples to come from the operator below. You can think of this as a blocking queue where you poll it and say, wake me up when you actually have something. The join computes a match between two tuples, produces some output, and emits it to the next operator; that operator gets woken up in its thread, does whatever it needs to do, and pushes the result further up the tree. So the join can be spinning nonstop trying to compute matches, and the filter can do whatever it wants; once it has something to process, it wakes up and does it. These two loops are running as separate tasks in separate threads. The tricky part is that we now potentially need to assign this operator to a core and that operator to a core, and chances are one of them is going to be waiting a long time for the other to actually produce something. So this is kind of difficult to do, and they don't do this in HyPer, because they want a single worker per core, and with this approach a core could essentially sit idle just waiting for output.
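A minimal sketch of that producer/consumer arrangement, assuming a simple blocking queue between two operator threads. The queue class and operator bodies are made up for illustration; a real system would use something far more careful than a mutex-guarded std::queue.

```cpp
#include <condition_variable>
#include <mutex>
#include <optional>
#include <queue>
#include <thread>

// A tiny blocking queue: the consumer sleeps until the producer emits something.
template <typename T>
class BlockingQueue {
public:
    void emit(T item) {
        { std::lock_guard<std::mutex> g(mu_); q_.push(std::move(item)); }
        cv_.notify_one();
    }
    void close() {
        { std::lock_guard<std::mutex> g(mu_); closed_ = true; }
        cv_.notify_all();
    }
    std::optional<T> next() {               // returns empty once the producer is done
        std::unique_lock<std::mutex> lk(mu_);
        cv_.wait(lk, [this] { return !q_.empty() || closed_; });
        if (q_.empty()) return std::nullopt;
        T item = std::move(q_.front()); q_.pop();
        return item;
    }
private:
    std::queue<T> q_; std::mutex mu_;
    std::condition_variable cv_; bool closed_ = false;
};

int main() {
    BlockingQueue<int> pipe;
    // Join operator: keeps producing join results and emitting them upward.
    std::thread join_op([&] {
        for (int i = 0; i < 100; i++) pipe.emit(i);   // stand-in for matched tuples
        pipe.close();
    });
    // Filter operator: wakes up whenever the join has pushed something new.
    std::thread filter_op([&] {
        while (auto t = pipe.next()) {
            if (*t % 3 == 0) { /* pass the tuple further up the plan */ }
        }
    });
    join_op.join(); filter_op.join();
}
```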
So I don't know how many database systems actually do this approach. Now, I'm still being very hand-wavy about everything we've talked about so far. There are more questions we have to deal with: what's the right number of workers to use for a query plan, how do you want to split things up into tasks, do you want to use the same number of tasks as you have hardware resources? Does it depend on the size of the data? Does it depend on what the actual operator is doing? There are all these questions we have to deal with before we even decide how to generate a physical query plan. I'm not going to give you one right answer that says this is exactly how to do it; I can only say here are the different design choices you have to consider when you actually implement something like this. You have to look at your application, what kind of database applications you're trying to support, and what the other parts of your architecture look like, and make your decisions accordingly.

So there are two more things we have to consider in our scheduler. The first is how we're going to assign workers to our hardware resources, workers to CPU cores. One approach is to use a single worker thread per core. This is basically the same thing we saw in H-Store, where there was a single-threaded execution engine per CPU core, assigned to a partition. The way they made this work efficiently is that they set the affinity of the thread running on that core so that it would only run on that one core. What I mean by that is, say you fork off a bunch of threads in your process: you can tell the operating system which cores those threads are allowed to run on and which cores they're not allowed to run on. You do this through a call named sched_setaffinity, where you basically give the operating system a bit mask that says, I want the thread making this call to run only on this set of cores, or just this one core. And you do this for all the other threads in your database system, telling each one to run on its core and not on the others. So the operating system only schedules the thread to run there, and the idea is that every time it wakes up it runs on the same core, and therefore it gets better cache locality, because the data it accessed the last time it ran may still be in that core's caches, since nobody else was running on it.

Another approach is to have multiple workers per core: basically you assign a group of workers to a single core, or a socket, or a subset of a socket. The idea is that by having multiple threads able to run on our cores, we get better resource utilization, because we don't have to worry about one thread blocking and stalling all execution on that core; another thread can come along and start doing useful work. This is the same idea we saw in disk-based systems, where a worker stalls when it goes out to disk. In our case we may stall for other reasons, like the network, or just a cache miss or something like that, and the OS scheduler will be able to put something else in its place and keep the hardware busy. So there are a bunch of different approaches.
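Here is a small example of pinning a worker thread to one core on Linux, using the CPU_SET macros with pthread_setaffinity_np, the thread-level counterpart to the sched_setaffinity call mentioned above. The worker body is just a placeholder; the pinning pattern is the part that matters.

```cpp
// Linux/glibc specific; compile with g++, which defines _GNU_SOURCE by default.
#include <pthread.h>
#include <sched.h>
#include <thread>
#include <vector>

// Placeholder for the worker's real loop (pull a task, execute it, repeat).
void worker_loop(int core_id) { (void)core_id; }

int main() {
    unsigned num_cores = std::thread::hardware_concurrency();
    std::vector<std::thread> workers;
    for (unsigned core = 0; core < num_cores; core++) {
        workers.emplace_back([core] {
            // Build a mask with exactly one bit set: this thread may only run
            // on `core`, so its cached data stays warm across wake-ups.
            cpu_set_t mask;
            CPU_ZERO(&mask);
            CPU_SET(core, &mask);
            pthread_setaffinity_np(pthread_self(), sizeof(mask), &mask);
            worker_loop(static_cast<int>(core));
        });
    }
    for (auto& w : workers) w.join();
}
```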
In this case you still use set affinity, because if you have a pool of workers per core or per socket, you want them assigned to a subset of the cores; you don't want them running anywhere they want. The other architectural decision we have to make is how we assign tasks, and there are two approaches to this. At a high level they're somewhat similar, but there are different things to think about for each of them. The first approach is a push model, where there's some central thread that looks at the workers that are around, looks at the tasks that are available, and assigns those tasks to those workers. When a worker gets a task, it wakes up, starts running it, and when it's done it notifies the dispatcher, hey, I'm done, and is told the next thing it should do. In the pull-based model, the workers are always looking in some kind of queue for the next thing to do; a worker grabs a task out, executes it on its own, and when it finishes it goes back and gets the next task. It doesn't actually tell anybody it's done; it just goes and does it. The HyPer scheme, and the HANA scheme we'll talk about at the end, use a pull-based model. I think a lot of the older systems use a push-based model, because they do static scheduling.
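To make the push/pull distinction concrete, here is a small sketch of the two shapes side by side, using made-up Task and queue types. In the push model the dispatcher decides which worker runs what; in the pull model the workers decide for themselves by grabbing from a shared queue.

```cpp
#include <deque>
#include <mutex>
#include <vector>

struct Task { /* operator instances to run */ void run() {} };

// Push model: a central dispatcher owns the view of all work and hands each
// task to a specific worker's private queue (sketch; not thread-safe as written).
struct PushDispatcher {
    std::vector<std::deque<Task>> per_worker_queues;
    void assign(Task t, size_t worker_id) {
        per_worker_queues[worker_id].push_back(std::move(t));   // dispatcher chooses
    }
};

// Pull model: workers share one queue and help themselves; nobody is told what
// to do next, and finishing a task is not reported to anyone.
struct SharedQueue {
    std::deque<Task> tasks;
    std::mutex mu;
    bool try_pop(Task& out) {
        std::lock_guard<std::mutex> g(mu);
        if (tasks.empty()) return false;
        out = std::move(tasks.front());
        tasks.pop_front();
        return true;
    }
};

void pull_worker_loop(SharedQueue& q) {
    Task t;
    for (;;) {
        if (q.try_pop(t)) t.run();   // grab the next task and execute it
        // else: spin, sleep, or try to steal from somewhere else
    }
}
```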
Okay, so going forward, regardless of what worker allocation or task assignment policy we use, it's going to be very important to make sure that we only operate on local data. I talked about this earlier: when a worker gets a task, and that task is going to touch some data, we want that data to be close to the worker. That could mean memory that's local to it, or memory that's close by; in a distributed system it would obviously mean you want the data to be on the machine you're running on, but it's the same idea.

So this means that when it comes time to actually schedule and execute a task on behalf of a query, we want to be aware of what the underlying hardware layout is for our memory. There are a couple of memory models, but I'll talk about two basic ones. The first is called uniform memory access, or UMA, and this is the approach that was used in symmetric multi-processing systems up until about ten years ago, when we switched to the NUMA model. The basic idea is that, say you have four CPU sockets; each of these sockets has its local caches, L1, L2, L3, and then there is a set of memory DIMMs up above, connected together over a bus. When a process running on this socket here wants to access memory location X, it sends the request up to the bus, and the request gets routed over to the location that has it. The reason we call this uniform memory access is that the cost, the latency, of accessing item X at this DIMM is the same for all our sockets, for all the processes running on those sockets: you always go through the bus. Obviously there's a cache coherence protocol, with a controller here that keeps everything in sync, but we can ignore all that. The point is that, because of this bus, reading or writing anything costs the same for all our processes.

Then, around ten years ago, they actually started implementing what's called non-uniform memory access. The idea had been around for a while, and there were some boutique shops selling systems with this kind of layout, but now this is essentially what Intel pushes: when you buy high-end, multi-socket Xeon machines, you get this architecture. Now we see that the bus is completely gone, and instead we have an interconnect between all the different processors. We also see that we don't have a shared memory bank, a shared set of DIMMs, for all our memory; instead, each socket has a local memory bank, local memory DIMMs, that it can use. So what happens is, say there's a process running on this socket here, and it wants to access memory that's here: that's really fast, because it's close by, essentially direct access. But if this process wants to access memory over there that's managed by the other socket's DIMMs, it has to send a message over the interconnect, go through that other processor, and then get to the thing it's actually looking for. Now we see why it's called non-uniform memory access: the cost of accessing memory here is really fast, and the cost of accessing memory over there is slower. Even though, from a process's standpoint, it looks like you have one giant chunk of contiguous memory, in reality it's split up across all this different hardware. So if you're not careful about how you organize your database in memory, and where you decide to assign tasks to these different cores, you may end up having queries running here that always have to access data over there. Now, we're talking nanoseconds of access time here, but the difference is quite significant, and as we'll see in some experiments, it makes a big difference if you're always able to access things locally.
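If you want to see the non-uniformity for yourself on a multi-socket Linux box, a quick-and-dirty sketch like the following, using libnuma with made-up sizes and ignoring warm-up and page-fault noise, will usually show a visible gap between touching socket-local and remote memory.

```cpp
#include <numa.h>     // numa_alloc_onnode, numa_run_on_node; link with -lnuma
#include <chrono>
#include <cstdio>

// Touch every cache line of `buf` and return the elapsed nanoseconds.
long long touch_all(volatile char* buf, size_t size) {
    auto start = std::chrono::steady_clock::now();
    for (size_t i = 0; i < size; i += 64) buf[i]++;
    auto end = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count();
}

int main() {
    if (numa_available() < 0) return 1;     // no NUMA support on this machine
    const size_t kSize = 256u << 20;        // 256 MB per buffer (arbitrary)
    numa_run_on_node(0);                    // keep this thread on socket 0

    char* local  = static_cast<char*>(numa_alloc_onnode(kSize, 0));  // same socket
    char* remote = static_cast<char*>(numa_alloc_onnode(kSize, 1));  // other socket

    printf("local : %lld ns\n", touch_all(local, kSize));
    printf("remote: %lld ns\n", touch_all(remote, kSize));

    numa_free(local, kSize);
    numa_free(remote, kSize);
}
```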
So this idea of organizing the database in storage and in memory is not new. Yes, question? Right, so the comment is that this doesn't necessarily only apply to NUMA. Even in the UMA architecture, if there's data that this socket has in its cache, and all the tasks that touch that data keep landing here, but then for whatever reason a task lands over there that modifies data this socket has cached, I have to pay a cache coherence penalty, because any time I update that location it has to go invalidate the cache over here. So there's extra traffic to make all this work. It's not only the NUMA architecture where we have to be careful; we have to be careful in this architecture as well, although in my opinion the effects are more pronounced under NUMA. To avoid that cache invalidation, you want to keep data that's always used together located at a single CPU socket, so that any task that uses that data always gets executed there. It's the same idea for both of them.

Okay, so this idea of organizing the database in storage is not new, and it's not specific to in-memory databases. This is an idea called data placement: we split the database up, organize it in a particular way, keep track of where the data is actually being placed, and then use this information to make decisions about how we schedule our tasks. In the same way that a distributed database keeps track of which node has which shard, we can do the same thing in memory: we can say the database has been split up, this CPU socket and its local memory have that portion of the data, this other CPU socket and its local memory have that portion, and we can make decisions about how we organize our scheduling operations accordingly.

There's a system call in Linux called move_pages that essentially lets you take some addresses in memory and move them to one of these nodes, one of these CPU sockets. If you call move_pages and give it an address range without telling it to move anything, it comes back and tells you which socket each memory location is currently located at; then you can invoke it again and say, all right, now move it to this other CPU socket. So you can have fine-grained control over where you actually move data around in your database system.

But this assumes we're doing it after the data has been created. What happens when we call malloc in our database system to allocate a new chunk of memory? Where does the data actually go? Assume the allocator, which runs in user space, doesn't have a free page or free block of memory it can hand out to the calling process. So what happens in malloc here? Right: when you call malloc, essentially nothing happens. It does go into the kernel, but it doesn't actually allocate real physical memory. It makes a kernel call telling the operating system to extend the process's data segment, but this is all just virtual memory; it's not backed by anything physical. It's only when your process does a load or store on the memory you've allocated that you take a page fault and physical memory actually gets allocated. So now the question is: if I call malloc and allocate a chunk of space, which socket's memory does it end up in? Say I malloc it but don't touch it, and then another thread in my same process, running on another socket, touches it. Does it land on the socket that called malloc, or the socket that touched it? By default, the socket that touched it, though you can also do interleaving. The operating system enforces a policy that determines where the memory shows up, and the default policy is called first touch: whichever thread touches the page first, wherever it's running, that's where the physical memory gets allocated. There's also a command called numactl that lets you specify the policies the operating system will use; one policy is to just interleave the data across all my sockets, so that I avoid the worst-case scenario.
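Here is a small example of using move_pages(2) from libnuma's <numaif.h>, first to ask which NUMA node a buffer's pages currently live on, and then to migrate them to a specific node. The buffer and the target node are made up for illustration; a real system would be driving this from its catalog of placement decisions.

```cpp
#include <numaif.h>     // move_pages, MPOL_MF_MOVE; link with -lnuma
#include <unistd.h>
#include <cstdio>
#include <cstring>
#include <vector>

int main() {
    const size_t kPageSize = static_cast<size_t>(sysconf(_SC_PAGESIZE));
    const size_t kNumPages = 4;
    // A buffer we pretend holds part of a table. The memset forces first-touch
    // allocation, so the pages land near whatever core this thread is on.
    std::vector<char> buf(kNumPages * kPageSize);
    memset(buf.data(), 0, buf.size());

    std::vector<void*> pages(kNumPages);
    for (size_t i = 0; i < kNumPages; i++) pages[i] = buf.data() + i * kPageSize;

    // Query mode: passing nullptr for the nodes array just reports, in
    // `status`, which NUMA node each page currently resides on.
    std::vector<int> status(kNumPages);
    move_pages(0, kNumPages, pages.data(), nullptr, status.data(), 0);
    for (size_t i = 0; i < kNumPages; i++)
        printf("page %zu is on node %d\n", i, status[i]);

    // Move mode: ask the kernel to migrate every page to node 1 (hypothetical
    // target; in practice, the socket whose workers will scan this data).
    std::vector<int> target(kNumPages, 1);
    move_pages(0, kNumPages, pages.data(), target.data(), status.data(), MPOL_MF_MOVE);
    return 0;
}
```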
I bring this up because we want to be careful about how we set things up and load the database and the other buffers we need to process queries. We want our tasks operating on data that's local, and writing to data structures that are local as well, not going over the interconnect, updating remote memory, and causing cache invalidations on other sockets. Data placement is a big deal in databases.

So in this graph I'm showing a micro-benchmark that was done by people at EPFL with an in-memory version of Shore-MT. It's a simple workload where we use four worker threads that just execute the TPC-C Payment transaction. What I'm showing at the bottom, for these four groups, are different worker thread assignments to the different cores relative to where the data is. All the data is stored on the first CPU socket, the one in the top corner, and for each group the red marker tells us where the worker threads have been pinned by setting their affinity. In the case of spread, all four worker threads are running on different sockets; under group, they're all running on the same socket where the data is; under mix, we have two on the first socket; and in the last one we let the OS do whatever it wants, which is always a bad idea. What you see is that the group policy, which makes sure the worker threads execute on the socket that has the data close to it, gets you the best throughput. So it's just motivating why you never want the OS to do anything, and why we want to be careful about which sockets we assign work to and where we store the data. This is why we keep track of the placement policy, where the actual pages are stored in memory, in our catalog, and then we can use that to assign tasks to CPU sockets at run time and make sure they're only accessing local data. In this example the performance difference doesn't seem very pronounced: in the worst case we're doing about 7,800 transactions a second, and up here we're doing about 11,000. For this workload the amount of data accessed per transaction is quite small, maybe 10 or 12 queries, each of them only accessing one tuple, so it's not a lot of data being moved around.
But you can imagine that if you're doing an OLAP query that's scanning large table segments, large chunks of data, then the performance difference becomes much more severe. So again: never let the OS do your job.

The other thing I want to clarify is that the partitioning policy of a database is different from the placement policy. The partitioning scheme defines how we physically divide up our database and our tables. We can do this using range partitioning or hash partitioning; we can do a round-robin approach where we have, say, four partitions and we go one by one assigning tuples to them; or we can do partial or full replication, where, say, we replicate the same table across all the sockets. This is different from the placement policy, because the placement policy tells the database system where to actually put the partitions after you've partitioned. So if I hash-partition my tables on the primary key, now I've got a bunch of buckets, and the placement policy tells me where I put those buckets. You do one and then the other, and you need to keep track of both of them.

Okay, so where are we at? So far we have a process model, which tells us whether we're multi-threaded or multi-process. We have our worker allocation model, which tells us how many worker threads we want for the hardware resources we have. Then we have our task assignment model, which tells us whether we want to do push or pull. Then we have our data placement policy, which tells us how we organize the layout of the chunks, the partitions, of the database in memory. And of course we always have a solid appreciation for the CMU database group.

So now we can talk about how we create a set of tasks from a logical query plan; this is now the physical plan, the tasks we hand out to CPU cores and run in parallel. Doing this for OLTP queries is really easy, because we probably don't want intra-query parallelism there. Why? What's a standard OLTP query? An insert is one. Yeah: SELECT * FROM table WHERE primary key = some value. You're doing an index probe to go find one thing. You can't really parallelize the index probe, you can't parallelize the insert; you just do it. So there's not much we can do in terms of dividing up a query plan into multiple tasks for those. It's the OLAP queries that we really care about, right?
The OLTP stuff is where concurrency control gives us inter-query parallelism; the stuff we're talking about today is what gives us intra-query parallelism for the OLAP queries. The standard approach that a lot of systems use is what's called static scheduling: the database system decides, at the time it generates the query plan, how it's going to divide things up into tasks. The easiest way to do this is to say, I have 20 cores, so I'll have 20 tasks, shove them out, and let the scheduler execute them in some order. It's called static because we do this before the plan runs, and we don't adjust anything based on how the query plan actually executes or what else is running in the database at the same time. In practice this works pretty well, because it's easy to implement and there's not much extra metadata you need to keep track of while the query runs. This is what you see in a lot of systems; MySQL and Postgres do stuff like this.

But then we get to what the HyPer guys are proposing, which is dynamic scheduling of tasks: the database system can adjust how many tasks, and how much hardware, it assigns to a query plan while the plan is actually running, and it can move that up and down based on the performance of the query plan and on what else is running in the database at the same time. In their approach, they have one worker thread per core, they use pull-based task assignment, and they use round-robin data placement for these morsels. The morsels are essentially just partitions; for whatever reason they didn't want to call them partitions or chunks or blocks, so they call them morsels, but it's the same thing. We're not going to talk much about that part today; we'll talk more about it on Wednesday. Their approach lets them run all the standard operators you'd want, like hash join, sort-merge join, sorting, and things like that, in parallel very easily, while being aware of the layout and placement of data in a NUMA architecture.

At a high level, in HyPer you're not going to have a dispatcher thread, because the workers all do what I'll call cooperative, decentralized scheduling. There are queues of tasks that the query planner generates, and the threads go look in the queue and say, all right, this task wants to access data that I know is local to me, so I want to execute that task. The idea is that once you're done, you put the result in some location that's close to you, and then you go back to the queue and find the next thing to do. They use a lock-free hash table to make sure you don't have threads tripping over each other, because obviously you don't want two threads trying to execute the same task at the same time.

All right, so let's look at how this plays out at a high level. Here's our data table, and we first want to split it up into morsels; I think they choose around 100,000 tuples per morsel. This looks a lot like the row groups we talked about in SQL Server: row groups, partitions, shards, it's all the same thing. Then each of these morsels gets assigned to one of the CPU cores or CPU sockets, and when they load it in, they store it in the memory that's close to that core.
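A rough sketch of that setup: chop a table into fixed-size morsels and place each one in memory local to a socket, round-robin across sockets. I'm using libnuma's numa_alloc_onnode here as the placement mechanism, with made-up sizes; it's meant to show the idea of morsel-granularity placement, not HyPer's actual storage layer.

```cpp
#include <numa.h>        // numa_alloc_onnode, numa_max_node; link with -lnuma
#include <algorithm>
#include <cstring>
#include <vector>

struct Morsel {
    char* data;          // ~100k tuples' worth of storage, NUMA-local
    size_t num_tuples;
    int socket;          // which NUMA node this morsel lives on
};

// Split a table of `total_tuples` into morsels and place them round-robin
// across the sockets, so each socket's workers have local chunks to scan.
std::vector<Morsel> load_table(size_t total_tuples, size_t tuple_size,
                               size_t tuples_per_morsel = 100000) {
    int num_nodes = numa_max_node() + 1;
    std::vector<Morsel> morsels;
    for (size_t off = 0; off < total_tuples; off += tuples_per_morsel) {
        Morsel m;
        m.num_tuples = std::min(tuples_per_morsel, total_tuples - off);
        m.socket = static_cast<int>((off / tuples_per_morsel) % num_nodes);
        // Allocate the morsel's backing memory directly on its target node.
        m.data = static_cast<char*>(
            numa_alloc_onnode(m.num_tuples * tuple_size, m.socket));
        memset(m.data, 0, m.num_tuples * tuple_size);   // stand-in for loading tuples
        morsels.push_back(m);
    }
    return morsels;
}
```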
So now we want to execute things. Say we have this query plan here; we generate all the tasks for it, and they call these things pipelines. The basic idea is that a pipeline is the set of operator instances that can be run one after another using data directly from the operator below, without having to get data from anybody else. So in this case, the filter operator instance can use data directly from the operator instance that does the scan on A; it doesn't have to combine anything with anybody else, so you would combine them into a single task, called a pipeline.

Now, say in our example we have three cores. The first thing we see is that each core's local data contains all the morsels that have been assigned to it, but each core also has a buffer for local output data. This is where we store the output of the operators: instead of having one global hash table that anybody can access, each thread has these thread-local stores it can use, knowing that nobody else will try to write to them at the same time, and therefore you don't have to use any latching or locking to protect them. It's the same thing we saw in H-Store, where every execution engine had exclusive access to its data and didn't have to use any locking or latching.

So now we start executing this query plan. We have our queues, and each task in the queue is annotated in some way that says: this task wants to execute on the data in this morsel. Each of these threads looks in the queue; they all want to do the first scan on A, and each of them picks the task whose data it has locally in its morsel storage. So when a thread does the scan, it pulls data out of its own morsel storage, runs on it, and when it generates the hash table, it stores it in its local data storage. Now let's say these first two CPUs, for whatever reason, finish the scan on their morsels faster than the third. They go back to the task queue and find the next things to do, and again they say: all right, I have some tasks on B here; I know I have the morsels for B1, so I grab that; I have the morsels for B2, so I grab that. Meanwhile this other core is still computing the scan on A3. That could be because we miscalculated the amount of work, or because the OS stalled our thread for whatever reason; it doesn't matter, we don't care, for whatever reason this one is running behind. So now say the first core finishes up early and notices that core 3 is still behind. It goes to the queue and says, well, there's this task to scan B3, and I know that data is stored over on that other socket, but I'll go ahead and steal it, because I can't start executing the hash join part yet; we haven't finished the scan operators in the query plan. In this case we have to pull the data from that core's morsel storage over the CPU interconnect, bring it into our memory here, and then compute the scan operator and the filter and build the hash table. But the key is that when this core writes out the hash table as it builds it, it writes it to its local data output, and there's some catalog, some additional metadata tracked in our system, that says: if you need the hash table for this morsel, I have it over here.
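Here is a sketch of that cooperative, prefer-local-then-steal worker loop, with made-up types; the real HyPer dispatcher uses lock-free structures and pipeline-job bookkeeping that I'm glossing over with a single mutex-protected queue.

```cpp
#include <deque>
#include <mutex>
#include <optional>

struct MorselTask {
    int home_socket;           // where the morsel's data physically lives
    void run(int my_socket) {
        // Scan/filter/build on the morsel; output goes to this worker's
        // socket-local storage. If home_socket != my_socket, the input gets
        // pulled over the interconnect (the price of stealing).
        (void)my_socket;
    }
};

struct TaskQueue {
    std::deque<MorselTask> tasks;
    std::mutex mu;

    // Prefer a task whose data is local to `socket`; otherwise steal any task
    // rather than sit idle while a straggler holds up the pipeline.
    std::optional<MorselTask> next_for(int socket) {
        std::lock_guard<std::mutex> g(mu);
        for (auto it = tasks.begin(); it != tasks.end(); ++it) {
            if (it->home_socket == socket) {
                MorselTask t = *it;
                tasks.erase(it);
                return t;
            }
        }
        if (tasks.empty()) return std::nullopt;
        MorselTask t = tasks.front();
        tasks.pop_front();
        return t;                          // steal a remote morsel's task
    }
};

void worker_loop(TaskQueue& q, int my_socket) {
    while (auto task = q.next_for(my_socket)) {
        task->run(my_socket);
    }
}
```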
And when all of this finishes up, the join can proceed just as it normally would, combining the different hash buckets built by all these workers and producing the final output. The key thing to understand about why they have to do the work stealing, the task stealing, is that because they only have a single worker per core, if each worker thread could only operate on the morsels it had locally, these stragglers would slow everything down. Yes, the core that finished early could look in the queue and try to start executing tasks for the next query, but ideally you want the lowest latency possible for this query, so we're willing to pay the penalty of cache invalidations and moving data over the interconnect by stealing work from somebody else. We'll see in a few slides that when you scale up to really large multi-socket systems, like 32 sockets, they actually found that the work stealing doesn't play out that well, but on the smaller systems they were using, this actually makes sense. They also use a lock-free hash table to maintain the global queues; we're not going to talk about that much here, we'll talk more about it on Wednesday when we cover hash joins and spend time at the beginning on different types of hash tables.

So another, alternative way to do dynamic scheduling comes from a recent paper in VLDB 2015 from the SAP HANA guys on how they do scheduling in a prototype, a one-off version of HANA. The actual commercial version you get when you give SAP money doesn't do what we're talking about here, but this could be something they add in later on. This NUMA-aware scheduling is going to do pull-based scheduling just like HyPer, but the difference is that they're going to have multiple worker threads that operate in groups on the same hardware cores, the same CPU cores. Each CPU socket can have multiple groups, so it doesn't necessarily mean one group per socket. And within each group there are going to be two different queues that correspond to whether tasks can be stolen or not. A soft priority queue says: here's something I want to execute in this group, but if someone comes along and steals it, I'm okay with that. A hard priority queue says: no other thread group can steal this task; it can only run at this socket. Also, different from what HyPer does, there's a separate watchdog thread running in the background that checks all these groups to see whether they're fully utilized, and if it notices that some threads are idle, it can go steal tasks from the soft queue of another thread group. So there's much more management to do here than in HyPer, because in HyPer every worker is just always looking for the next job, regardless of where the data is, if there's nothing it can run on its local data. Here it's a bit more nuanced, and there's a bit more sophistication required for the watchdog thread. I'm not going to talk too much about what they do on the inside, but I'll just say that the way they have multiple threads working together in a single group is that they categorize the threads based on what they're actually doing.
I'm not going to go too deep into what they're doing on the inside, but I'll say a bit about how they have multiple threads working together in a single group: they categorize the threads based on what they're actually doing. So you could have a set of threads that are actively executing a task, and you mark them as working. Then you could have threads that are blocked, waiting for some latch to free up before they're allowed to do something. And then you can have free and parked threads, and these are sort of like a bullpen for the thread group: they allow the watchdog thread to say, well, I see that you're not fully utilized or you have idle cores, so pull something out of the free queue or out of the parked queue and let those threads go steal work, and you get better utilization by doing work when cores would otherwise be idle. I'm being pretty hand-wavy about all this, but the takeaway of the paper is that it's better to have groups of threads and be able to move around which CPUs they're actually running on, rather than having a single thread pinned per core. They found that when you scale up to a large number of sockets, the work-stealing approach that HyPer uses doesn't actually play out, and I think the reason is just that at that level there's so much data movement that you saturate the interconnect with all these cache invalidations. The other thing I thought was kind of cool is that in the HyPer case those worker threads can only execute queries, they can't do anything else, so you have to assign additional threads to do logging and additional threads to do networking. In the HANA model, because there are multiple threads per group, you could have some of the threads in the group do networking and some do other background processing, and overall the utilization becomes more uniform, rather than having some CPU cores pegged at 100% all the time processing queries while the other threads you allocate for background stuff are only used some of the time.

Okay, so to finish up quickly: a database system is a beautiful piece of software. It's doing everything: networking, logging, recovery, transactions, query processing. And the key thing is that in order to treat your database system right, you have to make sure you implement the scheduling policies to be fully aware of what the underlying hardware looks like. Because if you don't realize you're running on a NUMA system, or even on a NUMA system if you don't keep track of where your data actually is and you just let things access whatever they want, then you're going to suffer a performance penalty. In the case of the HyPer paper and the HANA paper, the difference is quite significant when you run OLAP queries. And the main thing, again, is never let the OS do the stuff the database system can do itself. The database system always knows exactly what's going on and always knows more about what it's trying to do than the operating system ever could. To the operating system, your database process looks like anything else: yes, it may have a different access pattern, but the OS doesn't distinguish the database process from a YouTube client or a BitTorrent client. The database system actually knows what it wants to do, how it's going to execute queries, and therefore it can always make better decisions about how it organizes everything. This is because we have a declarative query plan: we use SQL, we know exactly what all the operators are going to do, and we know where our data is, so we can always make better decisions.
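As a concrete illustration of that point, here's a small example (assuming Linux with libnuma installed, compiled with -lnuma) of a worker taking matters into its own hands: it pins its execution to the NUMA node it owns and allocates its morsel buffer from that node's local memory, instead of leaving thread placement and memory placement to the OS's default policies.

```cpp
// Sketch: explicit NUMA placement instead of relying on the OS scheduler
// and first-touch allocation. Requires Linux + libnuma (link with -lnuma).
#include <numa.h>      // numa_available, numa_run_on_node, numa_alloc_onnode, numa_free
#include <cstddef>
#include <cstdio>

int main() {
    if (numa_available() < 0) {
        std::fprintf(stderr, "no NUMA support on this machine\n");
        return 1;
    }
    const int node = 0;                          // in a real DBMS: the node this worker owns
    if (numa_run_on_node(node) != 0) {           // pin this thread's execution to that node
        std::perror("numa_run_on_node");
        return 1;
    }
    const std::size_t bytes = 64 * 1024 * 1024;  // e.g. space for this worker's morsels
    void* buf = numa_alloc_onnode(bytes, node);  // memory physically resident on that node
    if (buf == nullptr) {
        std::fprintf(stderr, "allocation failed\n");
        return 1;
    }
    // ... scan morsels / build hash-table fragments here: accesses stay node-local ...
    numa_free(buf, bytes);
    return 0;
}
```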
Any questions about scheduling and execution? Fantastic. So there's one more street lesson. In 1997 there was a noted laureate, Christopher Wallace, who came up with a list of rules about what you should do, or how you should conduct yourself, when you're hustling on the street. In particular he was talking about how to deal crack, because in the early 1990s crack was really big, and he was in Brooklyn dealing crack on the streets there. He spent a little time in prison, I think when he was like 17, but he came out and basically wrote down and codified a bunch of rules for how you should do all this. So even though it's 20 years later, and even though we may not be hustling the same way Christopher did, a lot of the rules he talked about are still applicable to us in the context of database systems. So I'm going to go through them one by one. The first one is that you never let anybody know how much money you're holding, because otherwise people get kind of weird, and we want to avoid that entirely. In the context of databases, you don't want people to know how much money your database system is making or what your salary is at the company; you want to keep all that information private. The next thing he says is never let people know your next move. So if you're going to move your trap from one street to the next, you don't advertise that ahead of time. In the same way, in the context of databases, if you're at one company and you're thinking about moving to another, or you're thinking about putting out a new feature before your competitors, you don't let people know until it's actually done. Never trust anybody: in the case of hustling on the street with your trap, you don't want people to know where your stash house is, where you're storing your money, or where you're staying. In the case of databases it's a little less vicious, so it's okay to tell people where you're going, but in general you should be very mindful about what people's ulterior motives are and keep everyone at arm's length. Never sample your own supply: in the context of dealing, obviously don't dabble in your own product. In the case of databases I actually disagree with this one, because when you build a database you always want to eat your own dog food. If you have a database system you're building, you want to use it for as many internal products as you can, so maybe put a line through this one; you don't want to follow it. Never sling where you live basically means that if you're dealing, don't deal out of your house; it doesn't matter how much they want from you, don't give them anything, it's just asking for trouble. Never lend anybody credit: if you think someone who is using is going to pay you back later, that's probably not going to happen. Same thing in a database system: if you give people free stuff and you expect them to pay you back later with gifts or other things like that, it's probably not going to happen, so you want to avoid that. Never mix your family with your business affairs: that means try not to have your family involved in your trap, or in your database company or your database product. That's related to not trusting anybody, even your family, because at some point you may have to make a hard decision and your family members may end up getting hurt; it's just better to avoid that entirely. Another thing is that you don't want to keep a large amount of product on yourself,
and this includes your car, your person, and your house, because in general this is just good advice: don't keep anything on yourself. Yes? So the question from Dana is: he says that instead of us holding weight, the cats who squeeze your guns can hold jumps too; what did he mean by this, and what is a jump? Okay, good question. A jump is a jumbo crack rock, usually like $20, maybe a little more now depending on the street. And the idea is that if you're dealing crack, there are all these laws that say how much product you need to have on you before you go from simple possession to distribution. So if you have a trap set up and you have a lot of product, you personally don't want to be holding all of it, because if you get caught by the police they might bust you on a federal charge for distribution. So all your minions hanging around the trap, watching things, taking money, things like that, you let them hold the product, because if the police come they'll take the fall, and since they're underage they probably won't be charged in the same way that you would. So a jump just means a jumbo crack rock. Good question. It goes without saying: never talk to the police, it never ends well, so avoid that entirely. And again, in databases, the same thing: never talk to the police. And then finally, never take a consignment job if you don't have the clientele. So if someone wants you to sell their product and you don't think you can push it in a reasonable amount of time, you don't want to take it, because they're going to come back and say, where's my money? Same way in a database job: if someone says, I have this big customer, I need you to come use your database system or build some extra feature for it, and you don't think you can do this and complete it in a reasonable amount of time, it's probably best not to take the job, because people just end up getting upset. Okay, so again, these are all rules that I follow in my life, and I strongly encourage you to do the same, because they got me to where I am today.

Okay, so next class we're going to talk about parallel hash joins, and we'll spend some time at the beginning talking about different types of hash tables: lock-free hash tables, cuckoo hash tables, things like that. And then as a reminder for project number two: next Monday Joy will announce that we'll do a checkpoint. The basic idea is that it's a casual thing where you just meet with him, discuss your code, ask any questions you have about how to implement things so he can help out, and just make sure that everyone's making progress and moving along. Okay, in the back, yes? What happened to Chris? Christopher Wallace: actually his memorial is coming up, I think it's next month, in March. He was shot on March 9th, 1997, so it's almost a month from today. We could have a moment of silence for it in class, but since that's during spring break we can't do that. Any other questions? Alright, see you guys on Wednesday.