All right, so today's class, following the way the rest of the semester is going to work, we're going down the stack. The last three classes were all about query optimization: how do you take a SQL query, parse it, put it into a logical form, and then generate an optimized physical plan for it with the lowest cost. Now that we have our physical plan, we need to actually execute it. So our focus today is, first, to talk at a high level about the different process models you can have in a database system — for those of you who took the intro class, this should be a review. Then we'll talk about the different types of query parallelization we can use, and at the end we'll focus on data placement and scheduling, which is the key thing you read about in the HyPer paper on their morsel-driven system, okay?

So again, everyone should be at the same level in understanding our conceptual model of the database system. We're given a SQL query, we generate the logical operators, and the query optimizer produces physical operators for those logical ones. With these physical operators, we now want to actually execute them. And instead of executing just one thread or one task per physical operator, we can take a physical operator and run multiple invocations, or instances, of it at the same time in parallel. So we'll say an operator instance is an invocation of a physical operator that processes some segment of data. The high-level idea here is that modern systems have multiple cores, so rather than running all of the operators in your query plan one at a time on a single core, we can run them in parallel. Even though it's the same high-level physical operator we're executing, the instances run on multiple cores and process different parts of the database tables or the intermediate results of another operator.

Then a task in our system is a sequence of one or more operator instances that we execute as part of a single request. So again, we take our physical plan and convert it into operator instances that we run in multiple threads, but we may want to pipeline a bunch of them together and schedule that sequence of operators as one unit, and we'll call that a task. I'll say that this is not a universal definition of a task; it's what I'm going to use for this lecture. Different textbooks and papers might refer to these things as fragments or pieces of the query; there's no one standard definition that everyone uses.

So now we want to talk about how we're going to build our database system to use all the additional cores available to us. Again, this course is all about single-node, shared-everything or shared-memory systems, so we're not worrying about how to spread the system across multiple nodes. But at a high level, a distributed database essentially works the same way: at every single node, you have to have one of these process models. The process model defines how the system is built so that it can handle concurrent requests from multiple users at the same time. It doesn't necessarily have to be multiple users.
It could be a single request that we split up into different tasks and run in parallel, but the high-level idea is the same. So we'll define a logical concept in our database system called a worker, which is responsible for executing tasks on behalf of the client. It takes those tasks, the tasks produce results, and then you need to send those results back in a coalesced form to the client that requested them. So again, at a high level: a SQL query comes in, we break it up into multiple tasks, we run those in parallel, and each of those tasks generates some subset of the output. But then we need to combine them and send the result back as a single collection of data to the client.

There are essentially three ways you can build a database system, three different process models. I'll also say that "process model" is a bit of an overloaded term in systems. For this lecture, think of it as the high-level architecture of the system; one way you can implement a process model is through processes at the operating-system level. So just be mindful of this as we go along. The first approach is to have a process per database worker, then we can have a process pool, and then we can have a thread per database worker. I'll go through examples of each of these.

In the process-per-worker model, which is the oldest approach and the most common one in systems built in the 70s, 80s, and much of the 90s, every single database system worker runs as a separate process — like what you get if you call fork or exec in your operating system: a standalone process. The high-level idea is that the application sends a query request, and it always goes through some front-end process called the dispatcher or the coordinator. That thing is responsible for handing off the request to a worker process, which does whatever it needs to do to execute it, and the worker also communicates directly with the application. So it can execute the request in the database system, get back results, and send them directly to the application without having to go back through the dispatcher.

There are a couple of things to say about this. One is that the database system is not going to do any scheduling on its own; all the scheduling is handled by the operating system. The operating system has its own scheduling protocol and decides how to time-slice resources, and the database system essentially punts all that decision-making to the OS. Now, you can do some things like pass flags, like nice or ionice, to say what priority a process is given, but at a fine-grained level we don't control anything. The other thing is that because these are running as a set of processes, they each have their own private address space. So if we want to have shared data structures across multiple processes, we have to use shared memory — otherwise every worker process would have its own copy of the database, because it's running in its own local address space, or you'd have to send IPC messages between processes to pass data around. So you use shared memory as the place to store global data structures.
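To make that concrete, here is a minimal sketch of the process-per-worker idea, assuming Linux/POSIX; the names (worker_main, "/db_shared") and the four-worker loop are made up for illustration, not how Postgres or DB2 actually structure it:

```cpp
// Minimal sketch of the process-per-worker model (Linux/POSIX assumed).
// worker_main and "/db_shared" are hypothetical names for illustration.
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>
#include <cstdio>

struct SharedState {      // global data structures live in shared memory,
  int buffer_pool_pages;  // since each process has its own address space
};

static void worker_main(SharedState* shared, int worker_id) {
  // ... execute queries on behalf of one client connection ...
  std::printf("worker %d sees %d pages\n", worker_id, shared->buffer_pool_pages);
}

int main() {
  // Dispatcher creates a shared-memory segment visible to all workers.
  int fd = shm_open("/db_shared", O_CREAT | O_RDWR, 0600);
  ftruncate(fd, sizeof(SharedState));
  auto* shared = static_cast<SharedState*>(
      mmap(nullptr, sizeof(SharedState), PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0));
  shared->buffer_pool_pages = 1000;

  for (int i = 0; i < 4; i++) {
    if (fork() == 0) {          // each worker is a separate OS process;
      worker_main(shared, i);   // the OS, not the DBMS, schedules them
      _exit(0);
    }
  }
  while (wait(nullptr) > 0) {}  // a crashed worker does not take down the dispatcher
  shm_unlink("/db_shared");
}
```

The point of the sketch is just that any shared structure has to live in an explicitly created shared-memory segment, and that a worker process crashing leaves the dispatcher alive.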
One benefit you get from this is resilience. You typically don't want this to happen in your database system — you want your system to be as reliable and well-engineered as possible — but bad things happen, and the nice advantage of this approach is that if one of these worker processes goes awry and crashes, you don't take down the entire system; only that one process fails. So this is the approach used in most of the database systems built in the 70s, 80s, and 90s: DB2, Oracle, and Postgres do this. In Postgres, the dispatcher is called the postmaster; if you're running Postgres on your laptop and you look at top to see the different processes, one of them will be labeled as the postmaster. The reason these systems used this approach is that when they were being developed in the 80s and 90s, pthreads weren't as widespread as they are now. Today every Linux installation has POSIX threads available, but back then there were all these different variants of Unix — HP-UX, Tru64, BSD, Solaris, and Linux was still a bit early — and all these systems had their own threading packages. So if you wrote your database system for one of them and used threads, it might not actually work on the others. For portability reasons, they would use this approach, because every single Unix had fork and shared memory, so you could be sure those features were available everywhere.

Another approach is to use a process pool. This is like the previous model, except that instead of having a dedicated worker handle your connection and the application interacting only with that worker, the application sends a request to the dispatcher, and the dispatcher can hand it off to any free process in the worker pool. The worker then executes things and sends results back through the dispatcher. This still relies on shared memory because, again, these are separate processes, and it still relies on the OS to schedule everything, because we can't control that. But the advantage is that you can now be a bit more flexible in how you break up queries into tasks, because one worker could decide: I want to execute this query in parallel across multiple cores, so maybe I'll hand off a task to another worker if it's free, and it can execute that, send me back the results, and I can combine things.

This is actually where Postgres is heading. In Postgres 9.5 they started adding support for parallel queries — running a single query on multiple cores — and essentially they did it with this approach. I think as of Postgres 10 they have a more aggressive strategy for this, so most queries can now run across multiple workers. And then, of course, IBM DB2 does this as well. As we'll see on the next slide, DB2 basically has to support everything, because not only do they need to support commodity hardware, like Amazon EC2 instances or regular Linux or Windows machines, but they also need to support z/OS and the really expensive mainframes that IBM sells. So DB2 has to run in every possible environment, and as the administrator you can set which mode you want DB2 to run in. I think for z/OS it has to run as process-per-worker, but on regular Linux it can run in the thread-per-worker approach.
And the last one is the most common one. As far as I know, every single database system built in the last 15 years uses the multi-threaded approach. This is because pthreads are now widely available, and you know that if you write pthreads code on one machine, it should work on any other machine your system compiles on. And actually, as of C++11, there's std::thread, so there's now a standard definition of what a thread actually is, whereas before it was just pthreads. So with these worker threads, we can send our request to some collection of worker threads running in the database system, and any one of them can process our query and send back results. For this model, the database system is going to do its own scheduling for a lot of things, because we control everything. So we have to be careful about where we run our threads and what those threads are actually going to do.

One thing I'll point out, though, is that just because you're using a multi-threaded process to execute queries doesn't mean you automatically get multi-threaded query parallelism. MySQL uses this approach, but at least as of MySQL 5.7 — I haven't checked MySQL 8 — each query runs in a single thread. If a query shows up, one thread executes it from beginning to end; they can't split it across multiple threads. Whereas Postgres is starting to add that, Oracle as of around 2014 added all of this, and SQL Server, as far as I know, is entirely multi-threaded.

So the main takeaways are that the multi-threaded approach has several advantages. In particular, there's less overhead for a context switch, because everything is always in the same address space, so you don't have to swap in things like security information going from one process to the next. You don't have to manage shared memory; we just assume everything is in the same address space and we're fine. But you do have to be a little more careful and make sure your code is robust, because if one thread does something bad, it can take down the whole system, whereas in the process-per-worker model you don't have this problem.

And like I said — I might even extend that 15-year claim further — nobody uses the process-per-worker model except for systems that are based on Postgres. Postgres is based on the second approach, the process pool, so any system derived from Postgres, things like Greenplum, Netezza, Vertica, inherits the architecture that Postgres uses. In fact, when we started building Peloton, we were also based on Postgres, so we had this problem: we were using a process per worker and we wanted to get away from that, because we wanted to be able to do multi-threaded queries. So in the first summer when we started working on the system, we basically went through and rewrote our Postgres-based system to use multiple threads instead of multiple processes and got rid of shared memory entirely. And — this is a conversation for another time — it turns out the way to do this in Postgres, which made it much easier than we thought it was going to be, is to take the Windows code for Postgres that makes it multi-process; with some minor changes, you can actually make it use POSIX threads, and the way it was architected made it much easier to make multi-threaded than we expected. That was actually one surprise.
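As a minimal sketch of the contrast, here is what the thread-per-worker idea looks like with C++11 std::thread; WorkerMain and the shared counter are placeholders, not any real system's code:

```cpp
// Sketch of the thread-per-worker model: all workers share one address space,
// so shared structures need no shm_open/mmap, just ordinary synchronization.
#include <atomic>
#include <thread>
#include <vector>

std::atomic<long> pages_read{0};        // shared state, visible to every worker thread

void WorkerMain(int worker_id) {
  // ... pull tasks and execute operator instances on behalf of clients ...
  pages_read.fetch_add(1);
}

int main() {
  std::vector<std::thread> workers;
  for (int i = 0; i < 8; i++) workers.emplace_back(WorkerMain, i);  // one std::thread per worker
  for (auto& w : workers) w.join();
}
```

Because everything lives in one address space, sharing is trivial, but a fault in any single worker now takes the whole process down.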
Okay, so now that we have our process model and understand how the system can be architected and how the workers are going to execute, we have to think about how we're actually going to schedule our queries and the tasks inside them. For every single query plan that shows up, we have to decide where, when, and how to execute it. The how is how many tasks we should use for it: how much should we divide up the query and try to run the different parts of it in parallel, and how many cores should we use to execute those tasks — it's not a one-to-one mapping, you can have more tasks than cores or more cores than tasks, depending on what you want to do. Then, given these tasks, we need to decide what CPU cores they should execute on, and where the output of each task — its intermediate result from the operator — should be stored so that the next task that needs to consume it can find it.

And the main takeaway here, as I've said throughout the semester, is that the database system is always going to be able to do a better job than the operating system, because it knows exactly what the query is actually trying to do. The operating system just sees a bunch of threads that want to get scheduled and run, but it doesn't know anything about what those threads are actually trying to do. Inside the database system, we know what query we're executing and what's in our query plan and our operators, so we can be more fine-grained, have better control, and end up with better results. Again, this is the main theme here and throughout the entire semester: we can always do a better job than the operating system.

So there are two types of parallelism we can talk about. The first one we're not going to spend much time on, because we already covered it when we talked about concurrency control: inter-query parallelism, being able to run multiple queries in parallel at the same time. With concurrency control, we said this is what those protocols provide for us, because they allow multiple transactions to run and update the database at the same time under the illusion that they're running by themselves, and we just have to figure out how to schedule their operations so that they don't interfere with each other and cause ACID violations. The one thing I'll say is that, as far as I know, how hard it is to implement a concurrency control scheme is not greatly affected by which process model we use. Whether we're using process-per-worker or the thread-per-worker approach, it doesn't make two-phase locking or timestamp ordering or MVCC any easier or harder. All the same challenges we had to deal with before — how to identify conflicts, how to make sure things end up in the right order — stay the same. So there's not much more to say about this, because we already covered it with concurrency control.

The thing we're really going to focus on in this class, and going forward for the rest of the semester, is how to do intra-query parallelism. The idea is that we're taking a single query and we want to execute its operators in parallel to improve performance, because we now have the resources to do this. This is a big difference between modern systems and the database systems of the 1970s and 1980s.
Back then you had a uniprocessor — a single socket with a single core — but now we have multiple sockets and multiple cores per socket, so figuring out the right way to schedule these things and divide queries up is really important. The first type of intra-query parallelism is called intra-operator parallelism, or horizontal parallelism. The idea here is that we take an operator and break it up into independent instances that each perform the same function, just on different parts of the data we're trying to consume. The other approach is inter-operator parallelism — intra-query, inter-operator parallelism — and this is also called vertical parallelism. The idea is that we can run operators from the same query plan at the same time in a pipelined fashion, so one operator doesn't need to wait for the operator below it to process all of its data before it can start processing. We'll go through examples of each of these.

In the case of the first one, intra-operator parallelism, the idea is that we run multiple instances of the same operator at the same time. Say we have a really simple query: we want to join A and B on A.id, and we have two small filters, one on A.value and one on B.value. If we want to generate the actual execution plan for this query, we start with A. A is a large table, so we can break it up into three segments — these can be called partitions or shards or chunks, or, in the case of the HyPer paper, morsels: disjoint subsets of the table. Say we divide A into three chunks, and then we assign a CPU core to be dedicated to executing an operator instance on each segment of the data. Again, it doesn't matter whether this is a process or a thread; at a high level the idea is the same.

The next thing we see is that after we scan A, we do the filter from our WHERE clause. We can immediately take the output of the scan and run it through our filter, so this gets treated as a single task that gets scheduled on a single core. Likewise we have the projection at the top, so we can add that above this. And because we know we're also going to do a join, we can have each of them build a hash table here. What I'm not showing is what the hash table actually looks like in terms of the data structure, or whether each task is building its own subset of the hash table versus one global hash table. That'll be our focus on Wednesday when we talk about hash joins; for now just assume they're building a global one, and how they synchronize we'll worry about later.

The thing is, we need to know when each of these tasks finishes, so that we can then start scheduling other tasks in the system. To do this, we add a sort of virtual operator called the exchange operator. This is like a barrier: execution cannot proceed further up the plan until all of the sub-tasks below the exchange have finished. The exchange operator came from the Volcano project — the same Volcano guy that did Cascades, which we talked about last week, and the iterator model for query processing.
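Before going further with the exchange operator, here is a hedged sketch of the build-side tasks we just walked through: three operator instances doing scan + filter + hash-table build on disjoint chunks of A, with the thread join standing in for the exchange barrier. The tuple layout, the chunking, and the single mutex-protected hash table are simplifications for illustration, not how a real engine does it:

```cpp
// Sketch of intra-operator (horizontal) parallelism on the build side of a join:
// each thread runs the same scan+filter+build logic on a disjoint chunk of A.
// Types and sizes are illustrative only.
#include <cstdint>
#include <mutex>
#include <thread>
#include <unordered_map>
#include <vector>

struct TupleA { int32_t id; int32_t value; };

int main() {
  std::vector<TupleA> A(300000);                   // pretend this is table A
  std::unordered_map<int32_t, TupleA> hash_table;  // global build-side hash table
  std::mutex ht_mutex;                             // coarse latch; real systems do better

  const int kWorkers = 3;
  const size_t chunk = A.size() / kWorkers;
  std::vector<std::thread> tasks;
  for (int w = 0; w < kWorkers; w++) {
    tasks.emplace_back([&, w] {
      // One operator instance: scan its chunk, apply the WHERE filter, build.
      for (size_t i = w * chunk; i < (w + 1) * chunk; i++) {
        if (A[i].value > 100) {                    // filter on A.value
          std::lock_guard<std::mutex> g(ht_mutex);
          hash_table.emplace(A[i].id, A[i]);
        }
      }
    });
  }
  for (auto& t : tasks) t.join();  // "exchange" barrier: probe side cannot start before this
  // ... now schedule the scan/filter/probe tasks on B ...
}
```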
So the exchange operator is a junction point where you make sure that all the tasks feeding into it have completed before you do anything else. You can also think of it as a pipeline breaker, like when we talked about query compilation.

All right, so in this case, once we have all the results at the exchange operator, we can feed them up into the build side of the join, because now we have the hash table we're going to use to probe. Once this is complete, we can start firing off operator instances to do the scan on B. Again, assume we can break B up into three chunks or three morsels; each of those does the filter plus the projection, and then the probe into the hash table that the other tasks built. For this side we don't need an exchange operator, because these probe tasks are blocked until the build side finishes. We know that once that exchange operator is done the hash table is complete, so there won't be any false negatives: everything we expect from A will be in the hash table. So now we can just probe into it and find our results. And again, we can do this in parallel as well; the probe tasks don't need to synchronize at all, because each of them is operating on an independent segment of B, and you don't need to know what's in the other segments in order to do your probe. But we do need an exchange operator above this, because the join operator is spitting out tuples and we want to combine them into a single result and produce the output for the application. So this is, to be clear, what we mean when we say operator instance: we're taking what was a scan on A and running it in parallel across three cores.

Okay, so with inter-operator parallelism, the idea is that we allow one operator to run in one thread while other operators below it are also running at the same time. They're all emitting tuples and pushing them up the query plan — or pulling, depending on how you look at it — and they don't need to coordinate other than knowing they have to block until the operator below them produces the tuples they need. So take the join here: the join could be running on one core, doing a really simple nested-loop join, just taking all the data it gets from the operators below it, doing the join comparison, and emitting tuples upward. Above this, we have the projection, which is spinning on another core, taking all the output that comes from the operator below it, doing the projection, and emitting it upward. And if there's anything above it in the query plan, that can run as a separate thread as well. So again, the idea is that each operator essentially blocks until the one below it emits something.

As far as I know, you typically don't see this in traditional relational database systems. You see it in stream-processing systems, like the streaming system from last week, or things like Storm or Heron, because those are running on separate nodes and they're pushing data out continuously all the time. As far as I know, no relational database system that supports SQL takes this approach.
Everyone does the other approach, intra-operator parallelism. All right, so now the issue is going to be: how do we come up with the right number of workers to use for a query plan? This really depends on the environment the query is running in — the number of cores or hardware threads available to us, the size of the data we expect to process, and what the operators are actually doing. For some queries, like an OLTP query, it doesn't make sense to do any of this parallelism stuff, because the query is just doing an index probe to find one tuple, passing it up the query plan, and returning it to the application. It's not doing complex joins, aggregations, or window functions. So this really is focused on OLAP queries, and it depends on what our environment looks like.

The first question we have to deal with is how we're going to allocate or assign workers to CPU cores. The first approach is to have one worker per core: you have a fixed number of cores in your system, you dedicate one thread to each core, and no other thread can run on that core at the same time. This is what we used in H-Store and VoltDB, and this is what the morsel approach does as well. The way you can actually do this is with a syscall called sched_setaffinity, where you tell the operating system, for my thread, here are the CPU IDs I'm allowed to run on. So for each worker thread you say: all right, you run on core one, the next guy runs on core two, and so on — you're basically providing a mask that says which cores each thread is allowed to run on. That way you prevent anybody else in your process from getting scheduled on that core, and the worker has dedicated access to it. (There's a small sketch of this call after this paragraph.)

The other approach is to have multiple workers per core, and this is what the HANA guys do, which we'll talk about later on. The idea here is that you have a pool of workers assigned to either a single core or a single socket — usually it's on a per-socket basis — and any of those workers is allowed to run on that core or socket at any given time. The advantage is that if one of your workers gets blocked on network I/O or disk I/O or something, then rather than that core sitting idle because only one thread is pinned to it, the OS can schedule another thread to do some other useful task. The HANA guys are going to claim that this is the better approach when you have a large number of sockets with a large number of cores and you want to support mixed workloads. In the case of H-Store and VoltDB, we used the first approach because we were always in memory and our worker threads never stalled on I/O, either network or disk. There were other threads in the system handling networking, logging, and things like that, but when it came time for a worker to actually execute a transaction, it was never blocked by anything — it was basically running at bare-metal speed. Okay?

The other thing I'll say about this is that this affinity control only applies to your own process, the database system process.
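Here is roughly what that pinning looks like with sched_setaffinity on Linux; the PinToCore helper is a hypothetical name and error handling is elided:

```cpp
// Pin the calling thread to one CPU core (Linux-specific).
// Passing pid 0 means "the calling thread"; the mask lists the cores it may run on.
#define _GNU_SOURCE
#include <sched.h>

void PinToCore(int core_id) {
  cpu_set_t mask;
  CPU_ZERO(&mask);
  CPU_SET(core_id, &mask);               // this worker may only run on core_id
  sched_setaffinity(0, sizeof(mask), &mask);
}

// Usage sketch for a one-worker-per-core design:
//   for (int i = 0; i < num_workers; i++)
//     workers.emplace_back([i] { PinToCore(i); RunWorker(i); });  // RunWorker is hypothetical
```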
So, to finish that point: if there's something else running on the system — say someone's doing Bitcoin mining — that's outside the control of the database system, and it can use any cores it wants. Unless you have administrative privileges and go block every other process from using your cores, this affinity mask will prevent other threads within your process from running on a core, but it won't prevent other processes from running on that core. But again, if you have a database system and you actually want the best performance, you're typically running it on a dedicated machine without anything else on it. You're not running your Bitcoin mining operation on the same machine you run your database system on — unless you got hit by that Postgres bug from a few weeks ago. Yes?

(Student: has that problem actually been fixed, or is the solution just not to run other things on the same machine?) Yeah, the problem is not fixed. Again, this call sets where your thread, inside your database system process, is allowed to run. The operating system, because it's trying to be a time-shared system, will still let any other process start running on your machine and start using the cores you're trying to dedicate here. And the answer is: you just don't let people run non-database stuff on your database machine. The Postgres bug I was mentioning was from a few weeks ago — there was malware where, if you were running Postgres, an attacker could send you an image and get Postgres to install a Bitcoin miner inside it. But that's an anomaly.

Okay, so again, the problem we're trying to solve here is: we have a bunch of cores, we have a bunch of workers, how do we assign the workers to cores? The first option is a single worker per core — and again, it doesn't matter whether it's a process or a thread. The second is multiple workers per core, where you essentially have a pool of workers, processes or threads, running on a single core or a single socket, and the system can decide that when one of them gets blocked, somebody else can run and make forward progress.

The last issue we have to deal with is how we actually assign tasks to these workers, and this is pretty straightforward: it's basically a choice between push or pull. In the push approach, you have a centralized coordinator or dispatcher that has a global view of all the tasks that have to run and what the workers are doing. Tasks come in, and the dispatcher says: all right, this worker gets this task, that worker gets that task. It can be aware of what data those tasks have to operate on, and it knows which cores the workers are assigned to, so it tries to do a good job of assigning tasks to workers that will operate on local data. I'll describe what I mean by local data in a second.
And then when a worker finishes, there's a callback mechanism, or some way to notify the dispatcher, to say: hey, I'm done, give me something else to do. The alternative is the pull-based approach, where you have one or more task queues, and the workers are always pulling from them, asking, what's the next thing I can do? They essentially block on the queue until something shows up.

Now, regardless of how we decide where the workers run and how they get the tasks they're going to execute, as I said on the last slide, we want to make sure our workers are always going to operate on local data. That means the scheduling mechanism — however our workers are getting tasks — needs to be aware of the actual underlying layout of memory and how our database is stored across that memory, so that we always read data that's local to us, because that's much faster. This is the distinction between what's called uniform memory access and non-uniform memory access.

The uniform memory access approach was very common in multi-socket architectures up until about 10 or 15 years ago. The basic idea is that you had your CPU sockets, each socket had its own local caches — L1, L2, L3 — and then there was this thing called the memory bus, which would take a request for some memory location from any socket and retrieve the data from any bank of DRAM DIMMs. So if this socket wanted to read data from that DIMM over there, it would just issue the load for that memory address, and the bus would route the request there, get the data, and bring it back. In this environment, the cost of accessing data from any CPU to any memory location or bank was always the same, because you always had to go through the bus. I'm physically showing you in the picture that this socket is closer to this DIMM, but in practice you don't know, because the bus hides that from you. And you still have to do cache invalidation in this environment: if two sockets read the same memory address into their CPU caches and then the first socket writes to it, you have to send an invalidation message over the bus and back down to the other socket — the system uses the bus to keep everything cache coherent. So again, in this environment the cost of accessing data from any socket to any memory location was always the same, so we didn't care how our database was laid out across memory, because any task running on any socket would have the same latency to read anything.

But around 15 years ago, they switched to what's called the non-uniform memory access model. Anytime you buy a modern Xeon CPU in a multi-socket system, you're going to get something that looks like this. Now every socket still has its local caches, but each socket also has a local memory bank of DRAM, and the cost of accessing that local DRAM is much lower, because it's local to you — it's right there.
And anytime you need to read data that's stored in another socket's DRAM or memory bank, you go over what's called the interconnect: you send a message over to that socket, the socket says, oh, you need this piece of memory, let me go get it for you, and transfers it over. We're still talking really low latency here — this is really fast — but the relative difference in performance between accessing your local memory and going over to another socket's memory can be significant. Yes?

(Student, about the previous slide: even under uniform memory access, would it be faster if a socket could get fresh data from the other socket's cache instead of going up to memory?) So your question is: if this socket has to read a memory address that's not in its own cache, can it get it from the other socket's cache instead of going up to memory? If it's in your own cache, you just read it; if not, you have to go get it somewhere. In theory, yes, reading it from another cache would be faster, because cache is always much faster than DRAM — SRAM is always faster than DRAM. I actually don't know whether they do this, because you'd need the cache coherence protocol to support it; the common ones are snoopy or directory-based protocols, where somebody has to record that the thing you want is here but also over there, and I don't know exactly how, or whether, they handle that case. Also keep in mind this picture is super abstract — there are row buffers inside the DRAM, there's caching inside the DIMMs as well, so a lot of this is hidden from you anyway. (Student: so it wouldn't make much difference — it wouldn't be, say, three times faster?) The way to think about locality here is: as long as you're reading from your own cache, you're fine; once you're outside your cache, you've got to go over the bus, and then that's a whole other mess. Yes? (Student: so the reason the bus matters is that everything gets measured in bus round trips?) Yeah — the reason the bus is bad is that once you're on it, it doesn't matter where you're going; there's always this round-trip cost you have to incur, and that's the long pole in the tent. Going through the bus is always bad, and it has limited bandwidth.

In the NUMA case, you have point-to-point connections between the different sockets, so there's not one common bus that everyone has to get on and use. This socket can talk to that socket at the same time that another pair of sockets is talking, and there's no interference between the two of them. At Intel, starting in 2008, this was called the QuickPath Interconnect; as of last year they call it the UltraPath Interconnect. In the case of AMD, I don't know when they started doing this, but they called it HyperTransport.
And then, as of 2017, with the new Ryzen chips, they call it the Infinity Fabric. I tried looking up what IBM Power does — they have some codename for their interconnect — but I couldn't figure out exactly how they do it. This is basically how everyone builds multi-socket architectures now. And from a database perspective, the key thing that matters is that reading data from our local DRAM, our local memory, is good, and having to go over the interconnect is bad. Yes?

(Student: is this the architecture used in mainframes, or also in laptops?) The question is whether this is the architecture used in mainframes. Don't use the word mainframe — think servers. It would also apply to a laptop if your laptop had multiple sockets, which it probably doesn't; most desktops don't have multiple sockets either. If you buy a Xeon server, it's going to have this setup. Every multi-socket system based on Intel, AMD, or IBM is going to have this kind of interconnect. As far as I know, nobody does the shared bus anymore; they might in an embedded environment, but even those aren't going to be multi-socket.

Okay. So the newer architecture makes our life as systems developers more tricky. Before, with uniform memory access, our system could just call malloc and not care where the operating system actually put the data — we got back our memory, and it didn't matter where it was physically stored, because the bus was always going to be the bottleneck anyway. But now, with NUMA, we have to actually care. So the question we're going to deal with is: how do we control this in our database system? And the operating system provides features or methods we can invoke to control where it's actually going to put memory.

The idea is that we take our database and partition it into disjoint subsets — in the NoSQL world they call them shards, in the HyPer paper they call them morsels, same idea — and we assign those disjoint subsets to different CPUs so that they're stored in each CPU's local memory. Then, internally, we track the locations of those partitions so that when it comes time to schedule tasks, we can assign the operator instances to workers that will operate on data local to them. That way you don't have to go over the interconnect and get data from another socket; you execute your portion of the query on the data that's local to you. Question in the back?

(Student: what's the ratio between going to your local memory and going over the interconnect?) I don't know exact numbers, but we'll show experiments of how this affects performance; it depends on what the query is actually trying to do. (Another student: I think it's about three times.) He says it might be three times; we can look that up later. That's a good question — I should know that. I just know it's bad and you should avoid it.
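To make that placement-and-tracking idea a bit more concrete, here is a hedged sketch using libnuma: allocate each partition's memory on a chosen socket and record where it went, so the scheduler can keep tasks on local data. The Partition struct and the round-robin policy are illustrative, not HyPer's actual code:

```cpp
// Sketch of NUMA-aware data placement using libnuma (link with -lnuma).
// The partition/placement structures are made up for illustration.
#include <numa.h>
#include <cstddef>
#include <vector>

struct Partition { void* data; int numa_node; size_t bytes; };

std::vector<Partition> PlaceRoundRobin(size_t num_partitions, size_t bytes_each) {
  std::vector<Partition> placement;
  // Real code should first check numa_available() >= 0.
  int nodes = numa_num_configured_nodes();            // number of sockets / NUMA regions
  for (size_t i = 0; i < num_partitions; i++) {
    int node = static_cast<int>(i) % nodes;            // round-robin partitions across sockets
    void* p = numa_alloc_onnode(bytes_each, node);     // memory physically backed on that node
    placement.push_back({p, node, bytes_each});
  }
  return placement;  // the scheduler consults this map to keep tasks on local data
}
```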
So this idea of keeping track of where data is stored and making sure we query on local data is an old, old problem in databases, especially distributed databases. It's called data placement. The way to think about it: data partitioning is deciding, at the logical level, on what attribute you want to divide the tables up; data placement is how you then take those partitions and assign them to physical locations so that you maximize the locality of the data the queries are processing.

The way we can handle this on Linux is through the libc call move_pages. If you invoke move_pages with just a memory address, it comes back with the number of the NUMA region that the page for that address is located on. But if you give it a memory address plus a NUMA node, it will actually go ahead and move that data for you. Now, that lets us control placement after the data has already been created, but the question is whether there's a way to control where data is placed at the moment we're creating it.

So, what actually happens when we call malloc in our system? Assume we're running jemalloc, tcmalloc, or the libc malloc — it doesn't matter which. Assume the database system calls malloc because we want to allocate memory for a new block of data, and the allocator doesn't have a free chunk for us, so it has to get memory from the OS. What actually happens? (Student: the OS zeroes out a frame and hands it back to the allocator, to make sure you don't see data that somebody else wrote.) Actually, no, not yet. (Another student: first you check the malloc library; if it has the required amount of space, it hands it out immediately.) Right — but assume it doesn't have free space, so it gets more memory from the OS. What kind of memory is that? Virtual memory. Yes — so the answer is actually that nothing happens yet. Exactly as he says: the allocator goes to the operating system and says I need more memory, so the operating system extends the logical, or virtual, address space for our process. But that virtual memory is not backed by any physical memory at this point, so nothing has been zeroed out yet; the OS just says, yeah, go ahead and do whatever you want with it. It's only when the process actually tries to access that memory that we hit a page fault; then the OS has to find a free physical frame, zero it out, and map it into the process.

So again, my question is: where is this memory actually going to be stored? And it depends. After the page fault, the operating system has to decide which NUMA region, which socket, it's going to store this data on. The two approaches are: let the operating system do interleaving, where it round-robins allocations from one socket to the next so everything is spread out evenly; or use a first-touch policy — something you can control with numactl — where wherever the thread that first accesses the data is running, whatever socket it's on, that's where the memory gets allocated.
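Going back to move_pages for a second, here is a rough sketch of both uses of the call on Linux (error handling elided, and the helper name is made up): with a null nodes array it only reports which NUMA node each page is on, and with a node array it migrates the pages.

```cpp
// Query and migrate the NUMA placement of a page with move_pages (link with -lnuma).
#include <numaif.h>
#include <cstdio>

void ShowAndMoveToNode(void* addr, int target_node) {
  void* pages[1] = {addr};   // addr should be page-aligned (start of the page)
  int status[1];

  // With nodes == nullptr, move_pages only reports where each page currently is.
  move_pages(0 /* this process */, 1, pages, nullptr, status, 0);
  std::printf("page is on NUMA node %d\n", status[0]);

  // With a node array, it actually migrates the page to that node.
  int nodes[1] = {target_node};
  move_pages(0, 1, pages, nodes, status, MPOL_MF_MOVE);
}
```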
So again, when I call malloc, the data is not actually allocated yet; I just get back virtual memory. But when someone actually tries to touch it, with the first-touch approach, wherever the thread that touches it and triggers the page fault is running, that's where the operating system will place that piece of memory. Yes?

(Student: how does virtual memory work in these setups — does every processor have its own virtual memory, or can physical memory on one socket fill up and spill over to another?) The question is how virtual memory actually works in a multi-socket environment. I don't know the full answer. I think every socket still has its own TLB, but the virtual address space is a single address space across all the DIMMs, so the page table is the same; it's just that some of it might map to one DIMM versus another. All right — so again, this is something you can set when your process is created, with numactl, and otherwise you get the OS default.

Now, to sort of answer his earlier question — maybe not with exact numbers, but with examples of how this affects a database system — I want to show two experiments run on a multi-socket machine, one with an OLTP workload and one with an OLAP workload, so you can see the performance difference between being careful about the placement of data in NUMA regions and just letting the OS do whatever it wants.

This first experiment is from researchers at EPFL: they took Shore-MT and ran the Payment transaction from TPC-C under four thread-placement configurations. The idea is that the data is spread across the different NUMA regions, and the red dots correspond to where the worker threads actually run. Under "spread," you have one thread per socket. Under "group," you have all four threads running on one socket. Under "mix," you have two threads on one socket and two on another, and the last one, "OS," just lets the operating system do everything. In this case, "group" gets the best performance, because when they load the data, everything lands on that one socket, so all the threads operate on local memory. "Spread" does the worst, because the data is spread out across multiple DIMMs, and when those threads access data that's remote rather than local to them, they have to send messages over the interconnect to get the data they need. "Mix" is in between — half the data is at the first socket, half at the other. And in the OS case, the operating system tries to do a good job, saying, oh, I think this core is accessing this data, so maybe I'll migrate threads; it does a little better than "spread" but not as well as "group." Now, the Payment transaction doesn't touch much data compared to an OLAP query — it does update the database, though — but this shows that even in an OLTP environment you get a performance difference, and the grouped placement turns out to be the best.
But you see a bigger performance difference with OLAP. This is an experiment that a former master's student, who took this class and worked on our research team, did about two years ago. They built a fairly simplistic query processing engine that just stored everything as arrays, and they ran it on an eight-socket machine in the PDL. The red line shows the case where each thread only processes data that's local to it, and the line at the bottom is the case where every thread processes some random chunk of data that could be stored on any socket. The vertical line represents where we run out of hardware threads and move into hyperthreading — it's eight sockets with 10 cores per socket, so there are 80 real hardware cores, and everything beyond that is hyperthreading, which is why both lines flatten out. From this you can see a pretty significant difference: I forget the exact numbers, but the top line is roughly 250,000 and the flat line at the bottom is just above 100,000, so we're getting about a 2x performance difference when we're careful about where our threads read their data versus just letting random things happen. He said earlier it might be roughly 3x, and this is less than that; as I said, it depends on a bunch of other things going on in the system while we're running. And the reason the lines are close up to a point and then pull apart is that we use the same database size for every experiment and just have more threads process it. With a small number of threads, the probability that a thread ends up reading data that happens to be local to it is much higher, but as you divide the work across more threads, that probability drops, and that's why it flattens out.

So again, the two things we have to worry about are partitioning and placement. Partitioning is how we divide up the tables based on some logical attribute and assign tuples to chunks or morsels; the standard approaches are range partitioning, hash partitioning, or replication. What we're really worried about here is placement: how we decide where to tell the database system to actually store these partitions after we divide them up, and then how we identify where the data is stored and assign tasks to run on the cores that have that data local to them.

So at this point in the class, what do we have so far? We have the process model, we have how we're going to allocate workers, we have how we're going to assign tasks to execute on this data, and we have how we're going to split up the data and store it, or place it, on different sockets. But we still have to deal with the problem of how we take a logical query plan, figure out how many tasks we actually want, and execute them. As I said at the beginning, this is relatively easy for an OLTP workload, because the queries are so simple — they go retrieve one tuple, so there isn't really any opportunity to parallelize them; in that environment it's one query equals one task. For OLAP queries, this is where things get more challenging, and I'll also say that this is essentially what the focus of this class will be for the remaining weeks: how do we actually execute OLAP queries?
Hash joins, sort-merge joins, vectorized execution — this is much harder to do because we have to process a lot of data. Yes? (Student: for OLAP, would it make sense for the task to be a whole transaction rather than a single query, since there may be relationships between the queries?) So his question is: for OLAP workloads, would it make sense to make the task be a transaction rather than a single query? It depends. A lot of times in OLAP workloads the output of one query is used as the input for the next query, so you can't schedule the next one before you execute the first. Furthermore, if you're not running with stored procedures, you may not know all the queries you're going to execute. You can submit batches of queries through JDBC, but you still have to execute them in serial order unless you can figure out that there are no dependencies between them. And as far as I know, everyone executes simple queries as a single task unit, because there's so much overhead in trying to break up and parallelize something that simple. You typically can't parallelize an index probe for a single tuple — one thread is going to do it, and it doesn't make sense to use multiple threads.

All right, so the easiest way to do this is what's called static scheduling: ahead of time, when we generate the plan, we fix the number of threads we're going to use to execute the query. It can be something really simple — I have 10 threads or 10 workers, so I'll make 10 tasks and shove them out. And then as things change while you're executing the query, you don't reassess whether you want to divide it up differently or have more or less parallelism; you just go with whatever you selected at the beginning. This is what most people do, and it's what we currently do in our own system: I know I have 10 cores, so let me just make 10 tasks.

This brings us to the paper you read about morsels. The idea here is a dynamic scheduling approach: we generate more tasks than we have worker threads, and we spread them across the workers we have. Then, as they execute, if one worker goes really fast and is able to process tasks more quickly than the others, we don't have to reassess or reassign anything — the fast workers just take work that would otherwise have gone to somebody else. In HyPer, they use one worker per core, they use pool-based task assignment, and they assign morsels to sockets in a round-robin fashion. And in the actual implementation of HyPer's physical operators, like the join operator, the operators have support for parallel, NUMA-aware execution, because they have to know that this task needs to operate on this data because it's local to it, so make sure it runs over there. At a high level, what's going on is that there's no separate dispatcher thread, because they're using a pool-based model; all of the threads pull from a centralized queue. The paper talks about implementing this as a lock-free hash table, which we'll cover next class.
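Here is a hedged sketch of what that pull-based, NUMA-aware dispatch might look like: one queue of morsels per socket, where a pinned worker prefers its own socket's queue and falls back to stealing a remote morsel. The real system uses a lock-free structure rather than a mutex, and the Morsel/Dispatcher types here are simplified stand-ins, not HyPer's actual code:

```cpp
// Simplified morsel dispatcher: one queue per NUMA socket; workers prefer their
// own socket's queue and steal from others otherwise. Illustrative only.
#include <deque>
#include <mutex>
#include <optional>
#include <vector>

struct Morsel { int socket; size_t begin; size_t end; };  // ~100K tuples per morsel in HyPer

class Dispatcher {
 public:
  explicit Dispatcher(int num_sockets) : queues_(num_sockets) {}

  void Add(const Morsel& m) {
    std::lock_guard<std::mutex> g(mutex_);
    queues_[m.socket].push_back(m);
  }

  // A worker pinned to `my_socket` asks for its next unit of work.
  std::optional<Morsel> Next(int my_socket) {
    std::lock_guard<std::mutex> g(mutex_);
    if (!queues_[my_socket].empty()) {               // prefer local morsels
      Morsel m = queues_[my_socket].front();
      queues_[my_socket].pop_front();
      return m;
    }
    for (auto& q : queues_) {                        // otherwise steal a remote one
      if (!q.empty()) { Morsel m = q.front(); q.pop_front(); return m; }
    }
    return std::nullopt;                             // nothing left for this pipeline
  }

 private:
  std::mutex mutex_;
  std::vector<std::deque<Morsel>> queues_;
};
```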
So what happens is that the threads are essentially doing what's called cooperative scheduling: the query plan is broken up into a bunch of tasks that all go into a single queue, and each worker thread pulls from that queue to figure out the next thing to do, but it gives preference, higher priority, to tasks that operate on data local to it. If no such tasks are available, it just takes any task. This essentially lets them do the kind of straggler handling that Hadoop does: if you notice one thread is running slower than everyone else — maybe that core is having problems — another worker can pick up its work and maybe finish it more quickly than it could. This is also called work stealing.

So let's look at this at a high level. Say we have a single data table and we need to split it up into morsels; this is basically a horizontal partitioning of the table. In the paper, they say that about 100,000 tuples per morsel gave the right amount of parallelism without making morsels so large that one thread might choke on a single morsel and take a long time to process it. In Peloton, we use a thousand tuples per block; we picked that number at random and haven't done any validation to say whether it's a good choice. In H-Store, we used 10 megabytes of tuples per block; again, we picked that number at random — it seemed good enough, and we never validated it.

All right, so these morsels all get assigned to different sockets in the system. Now we're going to execute our query: we take our query plan and generate a bunch of tasks for it — the scans on A, the scans on B, and then the join itself. At each socket, we have our local memory, and inside it we store the morsels we got when we split up the table. But we also have these local buffers that the worker threads use to stage, or write out, the output of the tasks they're executing. These buffers are local to each socket, but they're not private, meaning threads are allowed to read the buffers on other sockets. The idea is that you always write out the results of the task you're processing to memory that's local to you, and you coalesce them later depending on what the query plan is actually trying to do.

So say we start scheduling these things. You take the first set of tasks, and each worker, running on its core, tries to pick data that's local to it. This one is processing morsel A1, which is local to CPU 1; this one is processing morsel A2, which is local to CPU 2; and so forth. They're reading data that's local to them, so we don't need to coordinate across the different sockets, and as they produce output, they write it to their local buffer. Now say the first two cores finish up, but for whatever reason A3 is still running. Then we could start working on the other side of the join and start processing B1 and B2.
And again, they're pulling data that's local to them, and we still haven't had to do any cross-socket communication. So now at this point, let's say A1 finishes. It can go on and execute B3. Let's say it finishes that really quickly, but in this case, because B3 is stored over on socket three, we have to move data over the interconnect in order to actually process it here. But the output of this operator will again go into our local buffer. So we're not writing the output of this into the other socket's buffer; we're always writing it locally. Internally, the system does all the bookkeeping to keep track of: this task executed at this socket, therefore the data that this task generated can be found at this socket. It's not the case that the output always has to go back to where the input data came from; it's always stored at the socket that actually processed it. So is this sort of clear? They take tasks, they prefer ones that are local to them, but they're allowed to grab tasks whose data is stored somewhere else. In the back. Who distributes the morsels to the different CPUs? Or is it decided ahead of time? So the question is, who distributes the morsels to the different CPUs? When you load the data, they divide it up into chunks of 100,000 tuples, and each morsel is assigned to a socket at the moment you create it. All right. Yes. In a round-robin way. In a round-robin way, yes. And the buffer is essentially a chunk of local memory? The question is whether the buffer is a chunk of local memory. The buffer is the chunk of local memory that this socket can write into. It will be the only writer. At this point, it will also be the only reader. Later there could be other readers; if you have an exchange operator, you may need to read from the different buffers and put things together. But for this purpose here, there's no other thread that's going to write into it or read from it, so I don't need to do any coordination. So this stealing of morsels from a different socket and writing the results into local buffers, is that only really a good thing to do in OLAP? Because I'm imagining if you were doing an update, you'd eventually still want to update the same data. Yeah, so to be clear, this only makes sense for OLAP, right? And again, this is an ephemeral result coming out of the operator. If you were doing an update, you'd have to write it back down to where the morsels are, because that's the actual table at the bottom. Yes. Who exactly does the bookkeeping? The question is, who exactly does the bookkeeping? There's a global data structure, the execution context of the query, that knows: all right, I executed this task at this core, so here's the memory location to go get the result if you need it. Something like a master state? No, no, it's just internal state of the system. There's no dispatcher thread managing this, right? They're doing it cooperatively. Okay. All right, so just to reiterate what's going on: since there's only one worker per core, they have to use work stealing, because otherwise, if we were back here and this worker is not allowed to steal work, then this core is just sitting idle, because only socket three can process any task that operates on a morsel that's local to it, and without work stealing, this thing would stay idle.
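Here is a rough sketch of the carving-up and round-robin placement we just went over: the loaded table gets chopped into fixed-size morsels (around 100,000 tuples each in the paper) and each morsel is assigned to a socket at load time. The Morsel struct and the function name are hypothetical, not Hyper's real interfaces.

```cpp
// Rough sketch of carving a loaded table into fixed-size morsels and assigning
// each one to a NUMA socket round-robin at load time.
#include <algorithm>
#include <cstddef>
#include <vector>

struct Morsel {
  std::size_t begin;  // first tuple offset in the table
  std::size_t end;    // one past the last tuple offset
  int socket;         // NUMA socket this morsel is placed on
};

std::vector<Morsel> PartitionIntoMorsels(std::size_t num_tuples,
                                         std::size_t morsel_size,
                                         int num_sockets) {
  std::vector<Morsel> morsels;
  int next_socket = 0;
  for (std::size_t off = 0; off < num_tuples; off += morsel_size) {
    morsels.push_back({off, std::min(off + morsel_size, num_tuples), next_socket});
    next_socket = (next_socket + 1) % num_sockets;  // round-robin placement
  }
  return morsels;
}
```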
So the way they resolve that is to allow you to jump the queue: if you don't see anything that you can process locally, then you go and try to get something that is remote, and that's better than doing nothing. And as I said, they maintain a lock-free hash table as the work queue that everybody pulls from, but we'll discuss hash tables in Wednesday's class. So the other approach I'd like to discuss is what SAP HANA does. I should clarify here: this is a paper from VLDB 2015. It was written by a student in Europe who worked on SAP HANA as part of an internship, or I'm not sure exactly what the setup was. But as far as I know, this is not actually what HANA does. This is sort of a prototype or a fork of the HANA system where they explored this different architecture. So what HANA is going to do is pool-based scheduling, but instead of having a single worker per core, they have pools of workers per socket. And there are actually going to be multiple pools, or groups as they call them, per CPU. Each of these groups has two queues: a soft queue and a hard queue. The soft queue basically says that any other thread running in the system is allowed to steal tasks from any other group's soft queue. But in the case of the hard queue, nobody is allowed to steal from it; only the workers that are assigned to that pool are allowed to execute those tasks. So the hard queue would be for something like a garbage collection thread or a networking thread where you don't want anybody to preempt you; you put that in the hard queue, but any other task can go in a soft queue. What's really interesting about this paper is that all of the threads in the system, in this version of HANA, are organized into these worker pools or thread groups, and they're allowed to steal work from anybody else. This includes things like networking and logging; any background task is managed in this environment. In the case of Hyper, they actually do what we currently do, where you have your dedicated worker threads that only execute queries, but then you have a scattering of other threads doing background or networking tasks, and they never commingle. A networking thread can never execute part of a query, and a worker thread can never execute something like logging or garbage collection, right? So in this HANA system, the way they try to manage everything is that they have a watchdog thread that runs in the background and essentially checks how much work there is, how long the queues are, for the different worker threads, and then it can assign tasks dynamically and also scale up and scale down the number of threads that each group is allowed to have. And the way they do this is that they have these soft and hard priority queues, and then for the different threads in each group, they have flags that say what the thread is allowed to do. All right? A thread's status can be working; inactive, meaning it's blocked on something; free, meaning it's allowed to sleep for a little bit and then wake up to see whether there's a new task to do; or parked, which basically means stopping it from ever actually executing a new task.
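To make the group structure concrete, here is a minimal sketch of one thread group with a soft queue that other groups may steal from, a hard queue that only its own workers drain, and the four thread states just described. The class and enum names are made up, and plain mutex-guarded deques stand in for whatever the prototype really uses.

```cpp
// Minimal sketch of a HANA-style thread group with a soft (stealable) queue,
// a hard (non-stealable) queue, and the per-thread state flags.
#include <deque>
#include <functional>
#include <mutex>
#include <optional>

using Task = std::function<void()>;

enum class ThreadState {
  Working,   // currently executing a task
  Inactive,  // blocked on something
  Free,      // may sleep briefly, then wake up and look for new tasks
  Parked     // kept alive but not allowed to pick up new tasks
};

class ThreadGroup {
  std::deque<Task> soft_;  // stealable by workers from any other group
  std::deque<Task> hard_;  // only this group's own workers may run these
  std::mutex latch_;

 public:
  void SubmitSoft(Task t) { std::lock_guard<std::mutex> g(latch_); soft_.push_back(std::move(t)); }
  void SubmitHard(Task t) { std::lock_guard<std::mutex> g(latch_); hard_.push_back(std::move(t)); }

  // Called by this group's own workers: hard queue first, then soft queue.
  std::optional<Task> PullLocal() {
    std::lock_guard<std::mutex> g(latch_);
    if (!hard_.empty()) { Task t = std::move(hard_.front()); hard_.pop_front(); return t; }
    if (!soft_.empty()) { Task t = std::move(soft_.front()); soft_.pop_front(); return t; }
    return std::nullopt;
  }

  // Called by workers from other groups: only the soft queue is stealable.
  std::optional<Task> Steal() {
    std::lock_guard<std::mutex> g(latch_);
    if (soft_.empty()) return std::nullopt;
    Task t = std::move(soft_.back());
    soft_.pop_back();
    return t;
  }
};
```

The watchdog described above would then just watch each group's queue lengths and move threads between the Working and Parked states as load changes.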
So they're not actually killing threads; they're just saying you shouldn't be scheduled, don't do anything, or you're allowed to try to execute and do work. Now, this sounds overly complicated, and it sounds a lot like an operating system, right? It sounds like an operating system scheduler. And the reason they want to do this, the claim the HANA guys make according to this paper, is that when you start scaling up to a large number of sockets (I think in the paper they're talking about eight CPU sockets, whereas the Hyper guys are maybe only looking at two), letting the operating system manage these things itself becomes problematic, and the database system is going to do a much better job at it. Further, they're also going to say that if you end up stealing tasks in the same way the Hyper guys do, that becomes a big performance bottleneck as well, and you're better off not allowing threads to steal from the soft queues and just making everything go into a hard queue. So with their NUMA-aware scheduler, they're going to adjust the number of threads and where they run by pinning them, based on whether they recognize that a task is going to be CPU bound or memory bound. Again, some tasks don't actually have to read any data, like if you're crunching some computation in a UDF, so maybe you want to let that run on any CPU. But if you're reading data from tables that are placed in different NUMA regions, then you're memory bound, and you pin your threads to sockets based on that. And again, they found that the work-stealing approach from Hyper was not beneficial when you scale up to a large number of sockets. I was at a SAP HANA seminar, I think it was two years ago, and they talked about one of their customers that was maxing out the total amount of memory addressable by Intel CPUs. Even though Intel has 64-bit addresses, it's actually two to the 48 that's addressable, and they had some customer maxing out that amount. And they had a crazy number of sockets in the machine. So in that environment, the work stealing was not beneficial, not good for you. At a high level, what I like about this is that, again, everything is managed under this one threading infrastructure, so the threads in the system can not only execute queries, they can execute other things, and the system can tune that up and down as needed. Yes? When and how would we get something into the hard queue? Say it again? When and how would we get something into the hard queue? What do you mean? Like, before any SQL comes in? So the system boots up, it hasn't done anything yet, and it establishes these hard and soft queues, right? You create these thread groups, and every thread group has these two queues. And then when you execute queries, the system can decide whether to put the tasks for that query in either a soft queue or a hard queue. Or if there are other background threads you want to run, like garbage collection, those things are just tasks too, and they can be assigned to a hard queue or a soft queue.
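Coming back to the NUMA-aware pinning for a moment, here is a rough sketch of what pinning the calling worker thread to the cores of one socket can look like on Linux, using pthread_setaffinity_np. The core-numbering scheme (socket index times cores per socket) is an assumption for illustration only; real code would query the topology through libnuma or /sys rather than guessing.

```cpp
// Rough sketch: pin the calling worker thread to every core of one socket so
// memory-bound tasks stay NUMA-local. Core numbering is a made-up assumption.
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <pthread.h>
#include <sched.h>

bool PinToSocket(int socket, int cores_per_socket) {
  cpu_set_t set;
  CPU_ZERO(&set);
  for (int c = 0; c < cores_per_socket; c++) {
    CPU_SET(socket * cores_per_socket + c, &set);
  }
  // Returns true if the affinity mask was applied to the calling thread.
  return pthread_setaffinity_np(pthread_self(), sizeof(set), &set) == 0;
}
```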
Why are there two different queues? The question is, why would there be two different queues? Some tasks you may want to allow other threads to steal from one group to another; those go in the soft queue. The hard queue is for things that can't be stolen. But as they said, when you have a large number of sockets, it's better to put everything in the hard queue. Say it again? Why doesn't it scale? The question is, why doesn't it scale? I'd have to go read the paper again. I think it was the coordination and synchronization across different sockets or different workers. And I think in a really large memory environment, the cost of not processing data local to you becomes worse and worse. So you're better off just letting each socket, which has enough cores, get through its own data fast enough. Nobody is going to be bottlenecked on some large task, right? You can spread things out into enough smaller chunks, and there are enough cores to parallelize within a single socket. I think that was the gist of what they argued. Yes? How is the priority decided? Is it already calculated before execution? The priority? Oh, for these hard queues or soft queues? Yeah, again, internally the system knows: all right, these tasks were just generated to execute this query, they go into the soft queue; this task I'm generating now is a garbage collection pass, it goes in the hard queue. But who decides that? Us, the people building the database system, yeah. Okay, so what are my takeaways from this? Everything I've described today sounds like an operating system, right? We talked about tasks having priorities, we talked about controlling where the data is stored, we talked about assigning tasks to threads. That's essentially what the operating system does for us with its scheduler. But because the database system knows exactly what it is you're trying to do, it's going to be in the best position to make decisions about how to execute these tasks, whether for the query or for other parts of the system. And the key thing we have to handle correctly in order to get this better performance is that we have to be aware of what our hardware actually looks like, make sure we put our data in the right place, and then have our tasks execute on data that's local to them, right? Yes? Yeah, actually on that note, you said in the 80s people actually built machines dedicated to databases, but it wasn't worth it because of Moore's Law. Now that Moore's Law is ending, why hasn't that happened again? Okay, so his statement is that I claimed that in the 1980s, and I don't think I claimed this in this class, but I've said it before, and it is true: in the 1980s there was this movement called database machines. For people who don't know, think of a database machine as a special appliance, like a one-unit rack server, that had hardware dedicated to executing nothing but a database system. And it was thought that this is how people were going to build database systems going forward. They all had specialized hardware, but it never panned out, because by the time it took you to spec out, design, and build a database machine, Intel or whoever else was putting out CPUs at the time had caught up; Moore's Law advanced enough that all the advantages you got from specialized hardware could be beaten by commodity hardware that improved over time.
So database machines: there were a couple of companies, and they all went under. And so now his claim is, because Moore's Law is essentially over, and the only way we're getting better performance now is through scaling out with multiple cores, should we take a second look at dedicated, specialized hardware for database systems? Or even just a dedicated OS? A dedicated OS, that's a whole other conversation. In terms of dedicated hardware, let's start with that. There actually are some companies now starting to build specialized ASICs and FPGAs. Netezza was around in the early 2000s; they had an FPGA to do predicate pushdown for databases. Oracle now uses similar things in Exadata. But I don't think there will ever be a case in my lifetime where we go back to the way database machines were, building super-specialized hardware for an entire database system. I don't think we'll go back to that. The trend now will be using what are called hardware accelerators to speed up some aspect of the database system. So for those of you who aren't graduating, I'll announce this later in the semester, but in fall 2018 we're having a database seminar all on hardware-accelerated databases. So GPU databases, FPGA databases; that's an example of having specialized hardware to do one thing, not a whole system built explicitly for databases. Then your next question is, could we have an operating system that is explicitly tuned for a database system, or designed to do nothing but run a database system? Yeah, basically remove all the layers. Yes. So you would get a performance improvement, but the software engineering cost to do that would not be worth it. One good example is the file system layer, which can get in the way of things. In the 80s, of course, people tried writing database systems that operate directly on the bare metal of the disk, essentially building their own file system themselves. I think Oracle and IBM will still ship you something that does that, but the software engineering overhead of that, versus just using ext4, is totally not worth it. I think Stonebraker said something like a 15% gain is what they measured back in the 80s, and all the overhead of having to manage that yourself wasn't worth that 15%. So I don't think specialized hardware for databases or specialized operating systems for databases are worth it. Now, could you have a microkernel, some modular thing where you plug in just the pieces you need for the database? That's another story. All right, any other questions? So again, this is an example where the operating system is there, it's a frenemy. We'll deal with it, but we can be smart about scheduling our tasks and placing our data in a way that's much smarter than what the operating system could ever do, and get better performance. All right, so next class we're doing hash joins. There will be two lectures on joins: the first one will be parallel hash joins, then we'll do parallel sort-merge joins. The paper I'm having you guys read, I actually really like; it's a thorough evaluation of a bunch of different design decisions you have in hash joins. So we'll cover hash tables, hash functions, and scheduling these things. And then on Monday next week we'll see how to do parallel sort-merge. This is a longstanding debate in databases, although you'll see who the winner is when you read the paper: what's faster, hash joins or sort-merge joins?
There have been many papers over the years that go back and forth, saying one is better than the other. But there is a clear winner now, okay? Any questions?