So, today we're picking up where we left off before spring break. We're continuing down the stack of the system. Last class we talked about query optimization and query planning, so now we're at the point where the optimizer has spit out the plan it thinks has the lowest cost of all the different options for the given query, and now we're looking at how we actually execute it and schedule its tasks in the system.

For today's agenda, there's a bunch of background material we have to cover. We'll start off talking about the different process models, or architecture models, you can have for your database system in a multi-core, multi-threaded environment. Then we'll talk about what types of query parallelization we want to do, and then we'll talk about data placement, which is another important problem we have to deal with: regardless of whether you're an in-memory or a distributed system, you have to know where the data is actually being placed or stored so that you can route queries and tasks to it. Then we'll finish up talking about the scheduling approach you read about in the morsels paper from HyPer. In order to understand the design decisions they made, we need to understand these other topics first.

Just to make sure we're all grounded in the same terminology, I first want to talk about what I mean by the different parts of a database application stack. The language I like to use is always from the perspective of the database system, the database server. In that world, the database server is where the data actually is, and the client is the thing that connects to it and executes queries. This is slightly different from how you might think of an application programming environment, where the client might be the end user: someone at a terminal, someone with an iPhone app, communicating with a server in the cloud. Between the client and the database server we'll use SQL or PL/SQL, or, if it's a NoSQL system, whatever its proprietary API is. And then between the end user's application and the application server there will be some kind of HTTP protocol using REST or SOAP to communicate what it is the application actually wants to do.

So the way I always think of these things: when we talk about how the database system is going to work, this is always the server, where the state is stored. Next to it is usually the application server: you're running Apache or Nginx with Ruby on Rails, Django, Tomcat, whatever, and this is usually stateless. There may also be some state in the front-end application. So when I talk about communication between a client and a server, I don't mean a client as someone with their cell phone; I mean the database server and the application server. Now, some database systems actually let the front-end application communicate directly with the database server. Couchbase lets you do this, right?
You can send REST requests to the database server and do reads and writes directly on it. Typically people don't want you to do that; you usually want to go through an application server instead, but you can do this in some systems.

All right, so if we say we want to have a multi-user database, it essentially means we want to have multiple front-end applications, or multiple invocations coming from the application server, making requests to the database server. And these requests, as we talked about last time, are essentially SQL queries, which get compiled down to a query plan composed initially of logical operators that we then transform into physical operators. The physical operators tell the database system what it actually should do to execute that query: what index to do a lookup on, what join algorithm to use to join two tables. We saw all of that when we talked about query optimization.

Now we're going to add some additional information to how we describe a query plan and talk about operator instances. A physical operator might say, do a sequential scan on this table. But when we start talking about parallelization in a multi-threaded environment, we can talk about operator instances: multiple invocations of a physical operator running in parallel on multiple CPU cores, on multiple threads. The way to think about this is, say your table is broken up into five chunks, five shards or partitions or whatever you want to call them. You'll have a single operator that says, do a sequential scan on this table, and then the operator instances say: thread one takes that operator and looks at the first thousand tuples, the next one looks at the next thousand tuples, and so on. We refer to those as operator instances, and it's a one-to-many relationship: for one operator you can have many instances of that operator.

We'll then bundle this all together into what we'll call a task. A task is a sequence of one or more operators that we can dispatch to one of the threads in our system to execute some portion of the query. It can then send its output either to another task or have it coalesced into a final result that gets sent back to the application. So it's a hierarchy: the atomic unit is an operator, an invocation of that operator is an instance, and if you combine multiple operator instances together in a pipeline, you bundle them into a task.

So the first thing we need to talk about is the high-level architecture of the database system that's going to let us support these multi-user applications, where we could have multiple simultaneous requests coming in to execute queries. The process model is essentially the high-level description of how this is all going to work.
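Before we get into the process models, just to make that operator / instance / task hierarchy concrete, here is a minimal sketch of how you might represent it in code; the names and structures are my own illustration, not from any particular system.

```cpp
// A minimal sketch of the operator / operator-instance / task hierarchy.
// These names and structures are my own illustration, not from any real system.
#include <cstddef>
#include <functional>
#include <vector>

// A physical operator from the plan, e.g. "sequential scan on table A".
struct PhysicalOperator {
    std::function<void(std::size_t begin, std::size_t end)> run;  // work over a tuple range
};

// One invocation of that operator over one chunk of the table.
struct OperatorInstance {
    const PhysicalOperator* op;
    std::size_t begin;   // first tuple this instance is responsible for
    std::size_t end;     // one past the last tuple
};

// A task bundles a pipeline of operator instances (e.g. scan -> filter -> hash build)
// that one worker executes back to back.
struct Task {
    std::vector<OperatorInstance> pipeline;
    void execute() const {
        for (const auto& inst : pipeline) inst.op->run(inst.begin, inst.end);
    }
};

int main() {
    PhysicalOperator scan{[](std::size_t begin, std::size_t end) {
        // scan tuples [begin, end) of the table ...
    }};
    // One operator, three instances, bundled into three tasks a scheduler could
    // hand to three different worker threads.
    Task t1{{{&scan, 0, 1000}}};
    Task t2{{{&scan, 1000, 2000}}};
    Task t3{{{&scan, 2000, 3000}}};
    t1.execute(); t2.execute(); t3.execute();
}
```

The point is just the one-to-many relationship: one operator can have many instances, and a task strings a pipeline of instances together for a single worker to run.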
Now, I recognize that "process" is an overloaded term, because we'll use it on the next slide to talk about the process-oriented process model. What I mean by the process model is how the system is implemented in terms of the threads and processes it uses; how it executes on the OS. Within this process model, the system is going to have one or more workers: the components of the system responsible for taking the tasks the query optimizer generates, figuring out when to schedule them, and executing them to produce results. These results can then go to another worker executing another task that needs the first task's output as its input, or they can be sent back to the client, or dispatched to another node running on another system; it doesn't really matter. A worker takes a task off a queue, or however work is being assigned to it, executes it, produces some result, puts it somewhere, and then goes and figures out what to do next.

There are essentially three process models, or architectures, we can have in our database system. I'll go through each of them one by one and talk about some of the systems out there that are designed that way. One thing I'm being careful about in this discussion is to say "worker" and not "worker thread" or "worker process," because as we'll see it could be either. At a high level it doesn't really matter, but when we talk about the actual process model implementations, I'll say what they actually are.

The first process model is to have a single process per worker. Essentially, every worker in the system is its own independent operating system process: like when you call fork or exec in libc, you get a new process. It has its own PID, its own address space, its own security controls; it's completely independent of all the other processes. What happens is, when the front-end server sends a request to the system to open up a new connection because it wants to start sending queries, it first goes through a dispatcher process. The dispatcher sees the new request come in; if you have to check a username, password, or whatever credentials you have, it does that here. Then it forks off a new process, a new worker, and hands the connection off to that worker, and this process is now responsible for communicating with the front-end application to take all its requests and do whatever it needs to do. The worker knows how to talk to the database, whether that's on another machine, on local disk, or in memory. It's responsible for executing all the queries and all the tasks in those query plans, generating results, and shipping them back to the front-end application. Once you do this handoff from the dispatcher to the worker, none of the requests ever goes through the dispatcher again; the connection handle always goes to that worker.
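Here is a rough sketch of what that dispatcher-plus-forked-workers structure looks like in code. This is my own simplified illustration (the port, struct, and worker body are made up), not how Postgres or any other system is actually written; it also sets up an anonymous shared-memory mapping for global state, which is the issue I'll talk about next.

```cpp
// A rough sketch of the process-per-worker model: a dispatcher process accepts
// connections, forks one worker process per connection, and hands the socket off.
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/mman.h>
#include <sys/socket.h>
#include <sys/wait.h>
#include <unistd.h>

struct SharedState {        // stand-in for the lock table / buffer pool metadata
    int active_workers;
};

static void worker_loop(int client_fd, SharedState* shared) {
    // ... authenticate, then read queries from client_fd, execute them, and send
    // results back; global state is reached through the shared mapping ...
    close(client_fd);
}

int main() {
    // Anonymous shared mapping created before forking: visible to every worker.
    auto* shared = static_cast<SharedState*>(
        mmap(nullptr, sizeof(SharedState), PROT_READ | PROT_WRITE,
             MAP_SHARED | MAP_ANONYMOUS, -1, 0));

    int listen_fd = socket(AF_INET, SOCK_STREAM, 0);
    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = INADDR_ANY;
    addr.sin_port = htons(54321);                 // made-up port
    bind(listen_fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr));
    listen(listen_fd, 16);

    while (true) {                                // the dispatcher ("postmaster") loop
        int client_fd = accept(listen_fd, nullptr, nullptr);
        if (fork() == 0) {                        // child = worker, own address space
            close(listen_fd);                     // the worker only talks to its client
            worker_loop(client_fd, shared);       // the shared mapping survives the fork
            _exit(0);
        }
        close(client_fd);                         // dispatcher never touches it again
        while (waitpid(-1, nullptr, WNOHANG) > 0) {}  // reap finished workers
    }
}
```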
The way to make this work, because you obviously want to fork multiple worker processes to handle multiple connections, is that you have to use shared memory for your global data structures, like your lock table and all the additional information you keep track of about in-flight transactions. The reason is that if we don't use shared memory, every single process has its own local data, which means every process has its own buffer manager, which means that any time we fetch something from disk (if we're a disk-oriented architecture), we end up with multiple copies of the same pages in memory. So you have to use shared memory to make this work; otherwise you'd waste a lot of memory storing things multiple times.

The other key aspect is that the database system doesn't do any scheduling of these worker processes whatsoever. You rely completely on the operating system to do it for you. You can tweak things like the nice or ionice settings to influence how much CPU or I/O resources a process is allowed to use, but in general you can't tell the operating system to schedule one thing ahead of something else; you let the OS do all of that for you. The advantage of this approach is that if there's ever a problem and one of these worker processes crashes, you don't take down the entire system, because the crash is isolated to that single process; they're independent.

This is the approach used in most of the database systems that came out in the 1980s and 1990s. IBM DB2 and Oracle are probably the most famous ones that do this, and Postgres is another famous one that does it as well. If you ever run Postgres on your laptop and look at htop or top at the processes running, you'll see Postgres does a nice thing where it rewrites the process name to have a clean description of what each process actually is; you'll see the collector running, you'll see the postmaster running. In Postgres, they call this dispatcher the postmaster.

The reason all these early database systems ended up using the process-per-worker model is that back in the '80s and '90s, threading packages weren't as nice as they are now. In the 1990s you had all these different variants of Unix: the BSDs, Linux (which was still pretty early), Solaris, HP-UX, Tru64, and they all had their own threading packages. That meant if you used a threading package on one system, it probably wasn't portable to another system, but every single Unix system supported fork and shared memory, so you could reliably use this approach and it would be portable enough across different machines.

The next approach is similar: use a process pool. Instead of having one worker process be responsible for a single connection, you keep the dispatcher in the loop, so that when a request comes in, you pick any worker that's available in your pool and hand the request off to that process, and it executes the request and returns the result.
But then you don't give up the connection handle to the worker; the dispatcher still owns it, so when the next request comes in it may not go to the same worker, it may go to another one. For this approach you still have to use shared memory for your global data structures, because these are again separate processes, and you still have to let the OS do all your scheduling for you, because you can't control what's actually running. This usually isn't a great approach in modern systems, though, because it's bad for CPU cache locality: there's no guarantee which worker handles which request. You could try to pin a worker to a particular transaction, but all of these workers may be touching and writing the same data, and then you have a lot of cache invalidation messages passed around when one worker reads something and another writes to it.

This is what IBM DB2 does; actually, I should be clear, IBM DB2 uses all three approaches, and the reason is that they have to support not only running on x86 commodity hardware but also on the z/OS mainframes they sell for a lot of money, which are based on architectures from the 1960s and 1970s. DB2 actually lets you, as the DBA, decide which of these process models you want to use. This is also roughly how Postgres does it in newer versions. You don't always go through the dispatcher; they still hand the connection off to one of the workers, but in newer versions Postgres supports intra-query parallelism. If you have a sequential scan that needs to touch large chunks of data, the worker can enlist other workers in the pool: hey, if you're not doing anything, go execute this task for this query and give me back the results. This came out in Postgres 9.5 and 9.6, and in Postgres 10, which I think is coming later this year, they support this sort of intra-query parallelism with worker pools for many more kinds of queries.

All right, and then the last model is to use a thread per worker. In this case the database system spawns multiple worker threads, and it is in charge of doing the scheduling on its own. When a query request comes in, there may or may not be a coordinator, but the system looks at the request and says: you want to execute this query, let me do the planning for you, let me break it up into tasks, and then it decides how to divvy up those tasks among different worker threads that can execute in parallel. Pretty much every database system built in the last ten years uses this model, because now we have POSIX threads. If you write your system using the threading package for Linux, even though there are different distributions of Linux, pthreads is standard everywhere, so you won't have any problems with portability. The downside of this approach is that if any thread does something really bad, like a segfault, you potentially take down the whole process and kill the whole thing, whereas with process per worker, if you kill one process, the rest can keep on running.
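For comparison, here is a minimal sketch of the thread-per-worker model, again my own illustration rather than any real system's code: one process, a pool of worker threads, and the global structures are just ordinary objects shared through the common address space.

```cpp
// A minimal sketch of the thread-per-worker model: one process, a fixed pool of
// worker threads, no shared-memory segments needed for the global structures.
#include <condition_variable>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <vector>

struct Database {
    // buffer pool, lock table, catalog, ... shared directly by every worker thread
};

int main() {
    Database db;
    std::queue<std::string> requests;            // incoming SQL, filled by the network layer
    std::mutex m;
    std::condition_variable cv;

    unsigned n = std::thread::hardware_concurrency();
    std::vector<std::thread> workers;
    for (unsigned i = 0; i < n; i++) {
        workers.emplace_back([&] {
            while (true) {
                std::unique_lock<std::mutex> lock(m);
                cv.wait(lock, [&] { return !requests.empty(); });
                std::string sql = std::move(requests.front());
                requests.pop();
                lock.unlock();
                // plan `sql`, execute it against `db`, send the results back ...
            }
        });
    }

    {   // the network layer would enqueue requests like this:
        std::lock_guard<std::mutex> g(m);
        requests.push("SELECT * FROM foo WHERE id = 123");
    }
    cv.notify_one();

    for (auto& w : workers) w.join();            // never returns in this sketch
}
```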
So again, IBM DB2, because it has to support everything, does this too. This is the approach used in SQL Server and MySQL, and Oracle used the process-per-worker model up until version 12, which came out around 2014, and has since switched over to the thread-per-worker model. This is also what we use in Peloton.

My opinion is that the multi-threaded architecture is the ideal way to go, because you're going to have less overhead doing context switches between threads than between processes, and, the key thing, you don't have to manage shared memory. We saw this when we talked about the Scuba system: they wanted to do fast restarts with shared memory, and they talked about maybe using a special version of jemalloc that could allocate shared memory, and they decided that was going to be really slow, because with shared memory you have to allocate things physically the moment you ask for them, whereas with regular virtual memory you don't have to do it right away.

The thing I'll point out, though, is that even if you use a thread per worker, as we'll see in a second, it doesn't automatically mean you get intra-query parallelism. MySQL uses the multi-threaded model, but it can't support multi-threaded queries. When a request comes in, one thread gets it and runs it from beginning to end and produces the result, even though there may be multiple other threads running at the same time doing other things. So you still have to do a bunch of extra work in the multi-threaded architecture to support intra-query parallelism. There's one team working on this now in the project, and you saw this in the morsels paper as well.

As far as I know, no database system written in the last ten years actually uses the process-per-worker architecture; everyone uses the multi-threaded one. There are a bunch of database systems derived from Postgres, and in that case they inherit the process-per-worker model, but if you're writing a brand-new system from scratch, as far as I know, everyone uses pthreads or multi-threading.

All right, so now, regardless of which process model we're using, we want to figure out how we should divide up the tasks for our query plan. There are a bunch of questions we have to answer when we want to do query parallelism. We need to figure out how many tasks we should use: how much should we divide the query up into multiple sub-tasks?
We want to know how many CPU cores we should use to execute it. Then, once we figure out what those tasks are, how do we assign them to those cores? And when they produce output, where should that output go: local memory, or a shared global space? These are all important decisions that can make a big difference in the performance of the system, and this is why it's my opinion that you always want to do scheduling and execution of queries yourself. You never want to let the OS do this, because it's not going to know exactly what's going on in your system. In the same way that I say with mmap you never want to let the OS manage your memory, it's my opinion that you never want to let the operating system do scheduling for you, because you'll always have finer-grained control and get better performance. That requires us as database developers, the people who work on the insides of the database system, to write more code and more complicated code, but in general I think it's a good trade-off: you get better performance.

So let's talk about the different types of parallelism we can have. The first type is called inter-query parallelism, and we've already covered this before when we talked about concurrency control. The basic idea is that we have multiple queries being invoked by the front-end or client application, and we want them to execute in parallel to maximize resource utilization. Of course, we then need to make sure they don't interfere with each other if they're running transactions and modifying the database, so we use concurrency control to give them the illusion that each has exclusive access to the database even though it doesn't. We're not going to talk about inter-query parallelism in this class because we've already covered it. The only thing I'll say is that it's my opinion that regardless of which process model you use, the process per worker, the worker process pool, or the multi-threaded model, there's no major difference in how hard it is to implement the different concurrency control schemes. We talked about timestamp ordering, MVCC, OCC, two-phase locking; I think all of those take roughly the same amount of work regardless of the process model, because you can use shared memory to pass messages between workers. So we don't need to discuss that any more, since we already covered it.

What we're really looking at now is how you do intra-query parallelism, because that's the more interesting thing for what we're going to talk about here. The idea of intra-query parallelism is that we take a single query, ignoring other simultaneous queries, and break up the execution of its tasks into multiple parallel operations that can run at the same time and produce better throughput or lower latency. There are two types of intra-query parallelism. The first is sort of horizontal scaling, where you get intra-operator parallelism: the idea is that you take an operator and make multiple
instances of it. They all perform the same function and compute the same result, but each one owns a different segment of the data, and we fire them off in parallel at the same time on different threads. The other type of parallelism is inter-operator parallelism, where the execution of an operator in a query plan runs at the same time as the operators below it in the plan. We'll see an example in a second; the idea is that one thread can spin through and generate results as its output and feed them up to another thread that processes them at the same time, while the first one goes back and generates more results. So you can have multiple workers running at different parts of the plan in parallel. I'll go through both of these.

Say we have a simple query: a join of A and B on some foreign key reference, with two predicates, A.value < 99 and B.value > 100, and this is the query plan the optimizer generates. The first thing we can do is take the scan on table A, which has, say, three hundred thousand tuples, and break it up into three chunks, three segments, and have three separate cores scan that data at exactly the same time. So we'll have three operator instances of the scan on A: A1, A2, and A3, and we'll assign each of them to a single core so they can run in parallel with each other.

The next thing we see is that the filter on A can be computed directly from the output of the scan on A, so we want to combine these together into a single task, so that as we scan A we apply our filter and drop any tuples we don't need. The unit of a task in this plan is the scan on A followed by the filter. And since we know the output of this filter is going to be used for the join with B, we can also build the hash table we're going to probe for the join, and we can build it locally: the task running on this core doesn't need to know about the output of A2 and A3 in order to build its local piece of the hash table, so it can do that in parallel too. So the task for this thread is just this part here.

But now we need to combine these different hash tables together into the single hash table we'll use for the join. To do that, we need a way to know when all of these individual tasks are done and have produced the output we need, so that we can proceed further up the tree. You can't do the join until you have the hash table for both sides, because if you start trying to join A with B before the hash tables are finished, you may get a false negative: the tuple that should have matched just hasn't been stored in the hash table yet. The way you do this in a traditional parallel database system is with exchange operators, which are essentially break points that say: I can't proceed up my query plan until this exchange operator gets all the data that it needs. This approach comes from the Volcano model, the same Volcano system we talked about before with query optimization and the Volcano iterator model.
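Here is a minimal sketch of what those scan-filter-build tasks could look like for table A; it's my own illustration, not code from any real system, and the thread join at the end stands in for the exchange operator's "wait until every instance is done" behavior.

```cpp
// A minimal sketch of intra-operator parallelism for the "scan A -> filter ->
// build hash table" tasks in the plan above. Each worker thread is one operator
// instance over one chunk of table A.
#include <cstdint>
#include <thread>
#include <unordered_map>
#include <vector>

struct TupleA { int32_t id; int32_t value; };

int main() {
    std::vector<TupleA> table_a(300000);                // pretend this is table A
    const int kWorkers = 3;
    std::vector<std::unordered_map<int32_t, TupleA>> partial(kWorkers);

    std::vector<std::thread> threads;
    std::size_t chunk = table_a.size() / kWorkers;
    for (int w = 0; w < kWorkers; w++) {
        std::size_t begin = w * chunk;
        std::size_t end = (w == kWorkers - 1) ? table_a.size() : begin + chunk;
        threads.emplace_back([&, w, begin, end] {
            for (std::size_t i = begin; i < end; i++) {             // scan A_w
                if (table_a[i].value < 99) {                         // filter
                    partial[w].emplace(table_a[i].id, table_a[i]);   // build hash table
                }
            }
        });
    }
    for (auto& t : threads) t.join();   // "exchange": wait for all three instances

    std::unordered_map<int32_t, TupleA> ht_a;           // merged table used to probe B
    for (auto& p : partial) ht_a.insert(p.begin(), p.end());
}
```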
The Volcano folks proposed this exchange operator as a way to define a point in the query plan where you have to wait to get all the output from your subtasks. So these three tasks run in parallel producing their hash tables, and when they finish, their data gets fed into the exchange operator, which knows it has to wait for all three of them to finish before it can proceed and move the data up to the next part of the query plan. We do the same thing for B: B can be divided into two subtasks, each does its filter and builds its hash table, and then feeds into its exchange operator.

Now the join depends on having the hash table from A and the hash table from B, so those feed into the join operator, and we can parallelize the join operation itself by taking the output of the hash tables coming out of the exchanges, subdividing it further, and doing the hash join in parallel across multiple cores or threads. So the first five tasks execute in parallel at the same time, and all the other ones have to wait until their output is produced before we can go ahead and do the join. We'll see more about parallel hash joins on Tuesday next week, but this is the basic idea: these are called pipeline breakers, and you can't proceed with the upper-level tasks until the lower-level tasks are finished.

All right, so now let's look at how we do this with inter-operator parallelism. For this we'll just look at the join and the final projection. Say we have a task that runs on one core to do the join, and basically what it does is iterate over the outer table and the inner table, do the join, and call emit to push each result up to the next operator in the tree, which is the projection. The projection makes blocking calls on an incoming queue that's being filled by the join, and any time it gets a tuple in its queue, it does whatever the projection needs to do to produce its output and sends it up. So with inter-operator parallelism you have the join running on one thread and the projection running on another thread, which just suspends and keeps going until it gets the end-of-stream message saying there are no more tuples. As the join executes, the projection can simultaneously process tuples and send them up to the next operator, or produce output for the client. You can do the same thing for the lower levels of the tree; I'm just showing the upper part for simplicity.

As far as I know, this approach is not used in most database systems. Most database systems do intra-operator parallelism, where you have exchange operators or other pipeline breakers that feed data up, and you assign tasks once the data needed from the lower parts of the plan is available to the upper parts. You usually see this approach in streaming database systems. The idea there is that these operators are always running, you have some feed from the outside world, like a stock ticker or Twitter or whatever, and you feed the data as it comes in into all of these operators in your pipeline, and they produce answers: the operators at the different stages of your pipeline can run in parallel and produce answers in a streaming fashion.
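To make that join-to-projection handoff concrete, here is a minimal sketch of the pipeline, again my own illustration: the join worker emits tuples into a queue, and the projection worker blocks on that queue until it sees the end-of-stream marker.

```cpp
// A minimal sketch of inter-operator (pipeline) parallelism: the join runs on one
// thread and emits tuples into a queue; the projection runs on another thread,
// processing tuples as they arrive.
#include <condition_variable>
#include <mutex>
#include <optional>
#include <queue>
#include <thread>

struct Joined { int a_id; int a_value; int b_value; };

struct TupleQueue {
    std::queue<std::optional<Joined>> q;   // std::nullopt means "no more tuples"
    std::mutex m;
    std::condition_variable cv;
    void push(std::optional<Joined> t) {
        { std::lock_guard<std::mutex> g(m); q.push(std::move(t)); }
        cv.notify_one();
    }
    std::optional<Joined> pop() {
        std::unique_lock<std::mutex> lock(m);
        cv.wait(lock, [&] { return !q.empty(); });
        auto t = std::move(q.front());
        q.pop();
        return t;
    }
};

int main() {
    TupleQueue queue;

    std::thread join_worker([&] {          // producer: the join operator
        // for each match found while probing the hash tables ...
        queue.push(Joined{1, 42, 150});    //   emit() the joined tuple upward
        queue.push(std::nullopt);          // signal end of stream
    });

    std::thread projection_worker([&] {    // consumer: the projection operator
        while (auto t = queue.pop()) {
            // keep only the columns the query asked for, send them to the client ...
        }
    });

    join_worker.join();
    projection_worker.join();
}
```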
Again, you'll see this approach in streaming systems; you won't really see it in any of the systems we're talking about in this course.

So the tricky thing with parallel queries and scheduling is that it's really hard to figure out the right number of worker threads for your query plan, because it depends on a lot of different things. It depends on the hardware, how many cores you have; it depends on how big the data is that you're reading and how much intermediate data you're going to generate; and it also depends on what the operators in your query plan are actually doing. If it's a really expensive computation (usually the hash join is the most expensive thing), you may want to parallelize that across as many cores as possible, because that's going to be the best thing for you, but for some really simple things, like a projection, it might be a waste of CPU cores to do it in parallel. Coming up with the right numbers is actually really difficult, and because we're not relying on the operating system to figure this scheduling out for us, we have to do it ourselves.

So there are a couple of things we need to talk about. First, how we allocate workers to cores; and again, a worker could be encapsulated in a process or in a thread, at this point it doesn't matter for us. One way to do this is to assign a single worker to a single core in your machine, and that worker is the only thread that's allowed to execute on that core. We saw this in the H-Store model we talked about earlier in the semester, where they had single-threaded execution engines that were each responsible for the data at a single partition. What you can do is pin the worker thread to that core and prevent the operating system from letting any other thread run on that same core, using the sched_setaffinity system call; taskset is another way to do it from the command line. The basic idea is that you tell the operating system that this thread (and again, it could be a process or a thread, it doesn't matter) is the only one that can run on this core, and that way you guarantee there's no interference in your caches and no interference from anyone else's scheduling trying to use that core.

Another approach is to have multiple workers per core, and what you typically do is assign them per socket: say a single socket in your system has four cores; you'll spawn four or more workers that are all allowed to run on any core on that socket. The idea here is that if you have some workers that end up blocked for whatever reason, maybe they're waiting on network I/O or disk I/O, or they're waiting for a lock, then rather than blocking and wasting that core entirely while waiting for that resource, you let the OS schedule another worker to use that hardware at the same time. I don't think either one of these is strictly better than the other. The morsels paper you read used the one-worker-per-core approach.
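On Linux you would typically do that pinning with pthread_setaffinity_np, the programmatic equivalent of sched_setaffinity and taskset; here is a minimal sketch, with the worker body left as a stub.

```cpp
// A minimal sketch of the one-worker-per-core approach: spawn one worker thread
// per core and pin it there so the OS never migrates it.
#ifndef _GNU_SOURCE
#define _GNU_SOURCE               // needed for pthread_setaffinity_np on glibc
#endif
#include <pthread.h>
#include <sched.h>
#include <thread>
#include <vector>

static void worker_main(int core_id) {
    // pull tasks and execute them; ideally everything this worker touches lives
    // on the socket that owns `core_id` ...
}

int main() {
    unsigned n_cores = std::thread::hardware_concurrency();
    std::vector<std::thread> workers;
    for (unsigned core = 0; core < n_cores; core++) {
        workers.emplace_back(worker_main, static_cast<int>(core));
        cpu_set_t cpus;
        CPU_ZERO(&cpus);
        CPU_SET(core, &cpus);     // this worker may only run on `core`
        pthread_setaffinity_np(workers.back().native_handle(),
                               sizeof(cpu_set_t), &cpus);
    }
    for (auto& w : workers) w.join();
}
```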
We'll see that SAP HANA, in the paper I'll show at the end of class, uses the multiple-workers approach; the HANA folks claim it's better when you have a machine with a large number of sockets, whereas the HyPer paper, I think, only looked at a two-socket machine, so maybe they didn't run into this issue. The one-worker-per-core approach is also what we do in H-Store and VoltDB, because we never get blocked on anything when we execute transactions: there's no disk, there's no network I/O, there's no waiting for the client to send us more data. So that's the best approach if you want bare-metal speed, but the other one is better when you have a noisier environment, with different types of queries and operations that may have to go out to the network or block for other reasons.

Now, given that we have workers, how does the system take the tasks generated by the query optimizer and give them to the workers to execute? It's pretty simple: there are basically two approaches, a push model and a pull model. In the push model, you have some kind of centralized coordinator or dispatcher that knows all the tasks it has to execute and what workers it has, and it assigns those tasks to workers based on whether they're available. When a worker completes a task and produces its result, it sends a callback message to the dispatcher saying, I'm done, give me something else to do. In the pull approach, there's no centralized coordinator: you have a queue of tasks, and the workers go look in the queue and take things they would be good at executing (and by good I mean the data is usually local to them), execute them, produce results, and go back and try to get more. In this approach there's no high-level component that understands all the tasks that need to execute for this query and makes sure everything runs in a timely fashion; instead you embed that logic in the workers themselves.

Regardless of what allocation scheme we're using, whether we assign multiple workers to a socket or have a single worker per core, and regardless of whether we do push or pull task assignment, it's really important that the workers operate on data that's local to them. This is obvious in a distributed system: you don't want to have to read data that's on another machine, whether it's in another data center or in another rack; you want to read data that's local to you. We have essentially the same problem in a single-node, in-memory system, even though the scale of the distances is much smaller, and it actually makes a big difference. So I want to go over the different memory models you can have in a shared-memory, shared-everything architecture, and then we'll see how this motivates why the HyPer folks made certain design decisions in their morsel approach to overcome the problems you can have in
a NUMA architecture: non-uniform memory access. The uniform memory access model is the approach all the chip manufacturers used up until about ten or fifteen years ago, when you had a multi-socket, symmetric multi-processing system. The basic idea is that each CPU socket has its own local caches, L1, L2, L3, and above that there's a bus that allows the CPU cores to address memory located in multiple DIMMs. The reason it's called uniform memory access is that regardless of where the data you need to access actually resides, it always costs you roughly the same, because you always go through this bus. Even if this socket touches data on the far side, the bus overhead makes it about the same as touching data nearby. You still have to do cache invalidation: if data is stored in your cache and another socket writes to it, an invalidation message has to be sent, and that doesn't go away, but in general reading from any location in memory costs about the same.

That's how things were until about ten or fifteen years ago; now, when you buy the new Xeon chips, you get what's called non-uniform memory access, or NUMA. The idea is that every CPU socket still has its local SRAM caches, L1, L2, L3, but it also has DIMMs of DRAM close by that it controls and owns. So now, if my thread is running on this processor and I want to access data in my local DIMM, that's going to be really fast, but if I want to access data that's in a DIMM over on another socket, I have to send a message over the interconnect (Intel calls theirs QPI, the QuickPath Interconnect; I forget what AMD and IBM call theirs) to that CPU, go get the data I need, and send it back. That's obviously more expensive, because before I was just going somewhere really close, and now I'm actually going over copper to get the data I need, and that's going to be slower. So this is non-uniform access: in the old model, the cost of reading an address was the same regardless of which DIMM you were hitting; here, sometimes it's fast and sometimes it's slow. These are all relative terms, because it's obviously much faster than reading over the network, but as we'll see in a second, it can still be slow.
Now, this might sound difficult to deal with, because think about what happens when you write code: you call malloc, and the operating system comes back and says, here's your starting address, do whatever you want with it. So you may wonder, how am I going to know where my memory is actually being stored? Well, there are a couple of things the operating system provides that let us control where things actually get allocated and stored. This is an old problem in database systems called data placement. Traditionally this was a problem you had to consider in a distributed system, because you don't want to allocate memory or store tuples on one machine and then have another machine access them; you always want things to be local to you. But basically the same problem now shows up on a shared-memory system, just at a smaller scale.

The idea of the data placement problem is that we have a way to split the database up into partitions or shards or chunks, and we want to figure out where to assign them in physical memory, to which DIMMs, so that most of the queries and transactions that need that data will be accessing local data. We can do this by controlling and tracking what data we have and where it's actually stored, through a variety of approaches, but one easy way to do it is with the move_pages system call in Linux. The way move_pages works is that if you invoke it with just an address, it comes back and tells you the NUMA node, the socket, where the data at that address is actually stored. But you can also invoke it with an address plus the identifier of a target node, and that tells the operating system to move that data to that particular socket. Now we can completely control where our data actually is, so when we do our query planning we can say: this task needs to touch this table, I know it needs to touch this segment of that table, and I know where that segment is stored. So with move_pages you can give an address and tell it where to move data around, and that gives us really fine-grained control over where data is actually being stored.

But now let's say I don't call move_pages. Say it's the very beginning and I allocate memory; I call malloc. What happens when I call malloc, assuming the allocator doesn't have a free page or chunk of memory that it can hand out right away? What does it actually do? Right, the allocator, say jemalloc, maintains its own little memory pool, and at some point it doesn't have the memory and has to go get it from somewhere, and it gets it from the operating system. So now when it goes to the operating system and says, give me memory, what does the operating system actually do? It extends the heap? Be more specific.
It changes the address range? So the answer given is that it changes the address range with sbrk. Yes, but also no. When the allocator goes to the operating system and says, give me more memory, the operating system logically extends the address space, but physically it doesn't allocate anything. It's only when you actually go and try to touch that data that you hit a page fault and the memory actually gets allocated. So now the question is, where does that data actually get stored when you call malloc? That was my question: what does malloc do? It does almost nothing. As was said, it extends the process's address space, but the new virtual memory pages being assigned to your process's allocator are not immediately backed by physical memory. Only when you actually try to read or write them is there a page fault, and then the OS says, all right, you really wanted this, let me actually go get you a physical page.

So the question is where memory actually gets allocated, and there are basically two modes you can set for your process to tell it what to do on a NUMA system. Again, with NUMA we have multiple sockets, each with its own DIMMs, so we need to know where that memory is going to be stored; we can control it with move_pages, but that's done after the fact. The question is where things go when we first allocate them. The first approach is interleaving, where we just do a round-robin allocation: memory gets allocated across the different DIMMs in round-robin fashion, and we assume that's good enough for our process. I think that's bad, because the operating system is doing whatever it wants, and that may be bad for a database system. The other mode you can use is called first touch. The idea is that a thread running on one CPU allocates the memory but doesn't actually do anything with it, so it's still not physically backed; then whatever thread on whatever CPU actually touches that address takes the page fault, the page actually gets allocated, and it ends up on the DIMM that's local to that CPU. I think the default in Linux is interleaving, but using numactl you can tell it to switch modes. You can think of it this way: when the database system is bulk loading data, you'd want to use first touch, because otherwise you'd be spraying everything all over the different DIMMs, and that would be bad.
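Here is a minimal sketch of what first touch looks like in practice, assuming a Linux machine with at least two sockets and libnuma installed (link with -lnuma); the sizes and node numbers are arbitrary.

```cpp
// A minimal sketch of first-touch allocation and checking placement afterwards.
#include <numa.h>      // numa_available(), numa_run_on_node()
#include <numaif.h>    // move_pages()
#include <cstdio>
#include <cstdlib>
#include <cstring>

int main() {
    if (numa_available() < 0) return 1;       // kernel without NUMA support

    const size_t kSize = 64UL * 1024 * 1024;
    char* buf = static_cast<char*>(malloc(kSize));  // nothing physically backed yet

    numa_run_on_node(1);          // run this thread on a CPU belonging to node 1
    memset(buf, 0, kSize);        // first touch: pages are faulted in on node 1

    void* page = buf;             // ask the kernel where the first page ended up
    int status = -1;
    move_pages(0 /* this process */, 1, &page, nullptr, &status, 0);
    printf("first page is on node %d\n", status);

    // To migrate it after the fact, pass a target node instead of nullptr:
    //   int target = 0;
    //   move_pages(0, 1, &page, &target, &status, MPOL_MF_MOVE);
    free(buf);
}
```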
To show you how bad this can actually be and how it affects the database system, there are two studies I want to show you. The first is a paper from a few years ago that measured the effect of having the threads that run transactions execute on different cores in different patterns, varying how far away they are from where the memory is actually stored, and seeing how that affects performance. For this they run a modestly sized database for the TPC-C benchmark and only execute the Payment transaction. This was done by Anastasia Ailamaki's group at EPFL on the Shore-MT system; Natassa was actually here at CMU before she went to Switzerland.

There are three basic configurations, plus letting the OS decide. Each of these little rectangles represents one socket; there are four sockets, each with six cores, and in every setting all the data is stored on the first socket. For "spread," the threads run spread out, one on each socket. For "group," all the threads run on the socket where the data is being stored. For "mixed," half of the threads are on the socket where the data is and half are on sockets where it isn't. And for "OS," you just let the OS do whatever it wants to do. What you see is what you would expect: when you group all the threads to run on the socket whose DIMMs actually store the data, you get the best performance, and spread ends up being the worst, because it's the worst-case scenario: any time those workers need to access data, they have to go over the interconnect to get it from the first socket. In the OS case, the operating system may try to recognize that a thread running on this socket needs to touch data on that socket and migrate it, but you don't know; sometimes it does and sometimes it doesn't. So again, by having control not only over where the data is stored but also over where we schedule and execute the tasks, we can get the best performance.

Here we're not seeing a huge performance difference, because this is OLTP, so there's a lot of other stuff going on in the system: we're running a concurrency control protocol, logging, taking locks, and so on. So the difference between the best-case and worst-case scenarios isn't that much: one is running maybe eleven thousand transactions a second and the other around seventy-eight hundred. You see a bigger difference with OLAP queries, because there you're reading a ton of data, so you really want the workers executing a task to operate on data that's local to them.

The second experiment was run last semester by a couple of students from the master's program, on the machine in the PDL that you have access to for the class project. It has eight sockets with ten cores per socket, and with hyper-threading that's a total of 160 hardware threads. For this experiment they load a database with ten million tuples, and they have different threads each do a
sequential scan over different segments of the data in parallel, and you want to see what the speedup is when the same amount of work is divided up across more and more threads. The vertical line here represents where you go from physical threads to hyper-threading: there are eighty real hardware threads and a hundred and sixty with hyper-threading, and that's why things fall off there. Now here you see quite a significant performance difference, about 1.5x, between scanning sequentially with the threads accessing data that's local to them versus data that's remote, all running on the same machine. In the remote case, each thread reads from a random partition of data that may or may not be on its local socket, and you see that the performance dies off at about forty threads, because at that point you're saturating the interconnect by sending data between the different sockets, whereas in the local case there's almost no traffic on the interconnect because everybody is operating on data that's local to it. So again, by controlling where the data is stored and which tasks execute on it, we see a big difference in performance in the OLAP case.

All right, so just as an overview of how to think about this: the partitioning scheme is done at a somewhat higher level, deciding how we're going to split the database up and assign pieces to different sockets based on some policy. The simple thing to do is round-robin: when a tuple comes in, you assign it to the first socket, the next one to the second socket, and so forth. You can also do range partitioning or hash partitioning, where the basic idea is you pick some attribute or column of the tuple, maybe hash it and mod by the number of sockets you have, and that tells you what partition the tuple is stored in and where to assign it. This is useful because, for some queries, if the query does a lookup on the attribute you partitioned on, you know exactly where to go; if I'm doing a join on an attribute that I partitioned my tables on, I can get great parallelism, because the different tasks can run on different sockets and never need to communicate with each other at all, since all the data they need to do the join is local to them. Partial replication basically makes multiple copies of tuples and stores them at different sockets.

So that's partitioning: deciding how you split things up. Once you do that, the placement scheme tells you how you assign the partitions to the different sockets in the system. You can use the round-robin approach again, where one chunk of data or partition goes to the first socket and the next partition goes to the second socket, or you can try to be clever and interleave the partitions across the cores, with subsets of a partition stored on different cores. In general I don't know if anybody actually does the interleaved one; round-robin is the most common, because it's the easiest to implement.

Yes, a question: it looks like before the interconnect gets saturated, I mean before about thirty cores,
there isn't much of a difference between them, so does that mean the latency impact isn't as significant as the interconnect getting saturated? So the statement is: up until about thirty cores, right around here, there's really no major difference in performance, and beyond that you see it flatten out because performance gets bad, and you're attributing that to the latency of accessing memory on a different socket not having that much impact. I think the reason it's close at that point is that with fewer threads reading the same total amount of data, a given thread is more likely to be reading data that's local to it, whereas once you get to the higher thread counts it becomes more random. And we didn't really go into fine-grained detail, like measuring cache misses and things like that; we chalked it up to that as well.

Right, so where are we so far? We have our process model, we have a worker allocation model, we have task assignment, and we have data placement. So we have all the building blocks we need: we can store data across multiple sockets and multiple cores, and we know how to take tasks and assign them to workers. Now the question is how we create the tasks from a logical query plan and execute them using these different policies.

Now, for OLTP, and I alluded to this before, intra-query parallelism doesn't actually make a difference. It should be obvious why. What's the typical OLTP query? A single-record lookup. How do you parallelize that? You don't, right? It's going to be a lookup on an index; you can't parallelize the index probe; you get the single tuple and produce the output. Now, if you get a batch of queries, you can maybe run them in parallel, but they're going to be operating on the same data, which you want located together in the same partitions, so you may want them to run at least on the same socket if not the same core. And there are cases in OLTP transactions where you're not going to get a batch of queries at all, because the output of one query is being used as the input to the next query, so you have to go back to the client, whether that's a stored procedure or the program itself on the application server, to get the next queries, and you can't parallelize that. So it's really for OLAP queries that you see the big difference; that's where this matters a lot.

The first way to decide how to generate tasks is what's called static scheduling. The basic idea is that you have a query, you break it up into tasks, and you generate one task for every thread that you have: I have ten cores, so I break the query up into ten tasks and send them out that way. This is done before you even start executing the query, so it's static: it doesn't change once you actually start running.
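Here is a minimal sketch of static scheduling for a simple scan, my own illustration: the tasks are carved out once, before execution starts, one per worker.

```cpp
// A minimal sketch of static scheduling: one task per worker, decided up front,
// and the assignment never changes while the query runs.
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

struct ScanTask { std::size_t begin; std::size_t end; };

int main() {
    const std::size_t table_size = 10'000'000;
    const unsigned n_workers = std::max(1u, std::thread::hardware_concurrency());

    std::vector<ScanTask> tasks;                 // one task per worker, fixed up front
    std::size_t chunk = table_size / n_workers;
    for (unsigned w = 0; w < n_workers; w++) {
        std::size_t begin = w * chunk;
        std::size_t end = (w + 1 == n_workers) ? table_size : begin + chunk;
        tasks.push_back({begin, end});
    }

    std::vector<std::thread> workers;
    for (unsigned w = 0; w < n_workers; w++) {
        workers.emplace_back([task = tasks[w]] {
            for (std::size_t i = task.begin; i < task.end; i++) {
                // scan tuple i and run the rest of this query's pipeline on it ...
            }
        });
    }
    for (auto& t : workers) t.join();
}
```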
So now we're going to talk about, in the morsel paper and the HANA paper, more sophisticated approaches that do more elastic scheduling. In the case of the morsel paper, they're going to have one worker per core, they'll be using a multi-threaded process model, and they'll do round-robin data placement of their partitions. For whatever reason they call them morsels, but it's the same idea: some disjoint subset of the database, or of a table — you could call it a shard, it's the same thing. We'll talk more about this next week, but they're also going to have these NUMA-aware implementations of the different relational algebra operators that can run in parallel and are aware of the layout of data in a NUMA architecture, and they can be sophisticated about how they actually store data and pass data from one operator to the next to avoid traffic on the interconnect. We'll focus more on that next week; here we're mostly talking about how you schedule these things.

In HyPer they're not going to have a dispatcher thread; they're going to use a pull model where all the worker threads pull from a sort of centralized queue and try to find the next thing they need to execute. And what they'll be able to do is recognize that if there's a task that is running slow on one socket or one worker and it's holding back the rest of the query plan, then they'll allow work stealing, or straggler duplication where you run the same task on another thread at the same time as it's still running on the slower thread — that way, if the copy finishes before the straggler, you make progress more quickly. So a worker will go look in the single task queue, find the tasks it has identified as using morsels that are local to it — data stored on the DIMMs controlled by its socket — execute one, and then go back to the queue and try to find the next task to execute. Again, they have a single task queue for the entire system.

So let's walk through an example. It's the same query that we saw before, the join on A and B. The first thing they're going to do is split the table up into morsels, and they talk about how using a hundred thousand tuples per morsel provided a good trade-off, a good balance between maximizing the amount of parallelism and not having chunks so large that they prevent the system from being elastic.
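To make the pull-based dispatcher idea concrete, here's a simplified sketch of workers pulling morsels from a single shared queue, preferring morsels whose home socket matches their own and falling back to taking a remote one. The real HyPer dispatcher uses lock-free structures and per-pipeline state; this only shows the placement-aware pull policy with a mutex-protected queue and made-up types:

```cpp
#include <cstdio>
#include <deque>
#include <mutex>
#include <optional>
#include <thread>

// A morsel is a small chunk of a table, tagged with the socket that owns it.
struct Morsel {
  int home_socket;
  size_t begin, end;  // tuple range
};

class Dispatcher {
 public:
  void Add(Morsel m) {
    std::lock_guard<std::mutex> lock(mu_);
    queue_.push_back(m);
  }

  // Prefer a morsel local to `socket`; otherwise steal any remote one.
  std::optional<Morsel> Next(int socket) {
    std::lock_guard<std::mutex> lock(mu_);
    for (auto it = queue_.begin(); it != queue_.end(); ++it) {
      if (it->home_socket == socket) {
        Morsel m = *it;
        queue_.erase(it);
        return m;
      }
    }
    if (!queue_.empty()) {  // no local work left: steal to avoid going idle
      Morsel m = queue_.front();
      queue_.pop_front();
      return m;
    }
    return std::nullopt;
  }

 private:
  std::mutex mu_;
  std::deque<Morsel> queue_;
};

int main() {
  Dispatcher dispatcher;
  for (int i = 0; i < 8; i++) {
    dispatcher.Add({i % 2, i * 100000ul, (i + 1) * 100000ul});
  }
  auto worker = [&](int socket) {
    while (auto m = dispatcher.Next(socket)) {
      std::printf("socket %d runs morsel [%zu,%zu) (home %d)\n",
                  socket, m->begin, m->end, m->home_socket);
    }
  };
  std::thread t0(worker, 0), t1(worker, 1);
  t0.join();
  t1.join();
  return 0;
}
```

The point of checking for a local morsel first is exactly the behavior described above: stay NUMA-local when you can, and only take remote work when the alternative is sitting idle.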
So this looks a lot like the row groups we saw before when we talked about SQL Server's columnstore indexes — it's basically the same idea. Now, when they store the data in the system, they'll assign these morsels to a single socket, or to a worker thread running at a core, so they get stored in its local NUMA memory. So now we have our query plan, from which they generate a bunch of tasks that all get thrown into a queue. Each of these workers running at a core has its morsels stored in its local memory, but it also has a sort of buffer space to store local data, because as the workers execute these tasks and produce output, rather than putting the output into a shared global space, they put it into a local data space. Then when they transition to the next task, they have a way of trying to maximize the number of tasks that are assigned to use data that's local to that worker, and only send data around between different cores when it's necessary.

So these three worker cores start up, and they all go into the task queue and identify that the first thing they need to do is execute the scan on A. There's some extra metadata stored inside these tasks that says, I can't execute the join until I finish the scans — it's sort of equivalent to the exchange operator we saw before. Each of these scan-on-A tasks will execute on data that's local to it: the scan on morsel A1 accesses the data that's here, A2 here, and A3 down there. They feed that in, and as they produce the output from the scan they build the hash table, stored in this local buffer here. Now say the first two workers finish, so those two tasks are marked as done, and they can jump down and start executing the scan on B — same thing: they're accessing a morsel that's local to that socket. Now say, for whatever reason, the worker running this task finishes first, while the B2 and A3 tasks are still running on these other workers. It can go into the queue and recognize that we can't start doing the join or anything else until we finish the scans; A3 is already being run by somebody else, so I can't take that one, so I'll go ahead and start processing this remaining scan task on B. I then have to pull data from that other NUMA region over to me in order to process it, which is going to be slower than accessing data that's directly local to me. But the idea is that rather than waiting for that other worker to finish before starting the next task, I can go ahead and pitch in so that we all finish at the same time and can move on to the next part of the query plan. They implement this global task queue as a lock-free hash table; we'll talk more about lock-free data structures next class.
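Here's a rough sketch of the dependency bookkeeping described above: probe-side work is not released until every build-side scan task has finished, which plays the role of the exchange-operator barrier. The atomic-counter gating is my own simplification for illustration, not the paper's exact mechanism:

```cpp
#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

// Barrier between pipeline stages: the join probe cannot start until every
// build-side scan task has completed and published its hash-table fragment.
struct PipelineStage {
  std::atomic<int> remaining_tasks;
  explicit PipelineStage(int n) : remaining_tasks(n) {}
  // Returns true for the worker that finishes the last task of the stage.
  bool FinishTask() { return remaining_tasks.fetch_sub(1) == 1; }
};

int main() {
  const int kScanTasks = 3;  // e.g. morsels A1, A2, A3
  PipelineStage build_stage(kScanTasks);

  std::vector<std::thread> workers;
  for (int i = 0; i < kScanTasks; i++) {
    workers.emplace_back([&, i] {
      std::printf("worker %d: scan morsel A%d, insert into hash table\n", i, i + 1);
      if (build_stage.FinishTask()) {
        // Only the last finisher unlocks the next stage; this is where the
        // probe tasks would be made visible in the shared task queue.
        std::printf("worker %d: build done, probe tasks now runnable\n", i);
      }
    });
  }
  for (auto &w : workers) w.join();
  return 0;
}
```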
But again, the main idea is that it's a single queue that everybody can look into. Workers try to take tasks they know are going to access data that's local to them, and if there are no other options, they can go and steal tasks from other workers, because otherwise they would be sitting around idle. Remember, we have a single worker assigned per core, so if that worker doesn't have anything to do, that core is idle, and you always want to avoid that.

So this is different from what's being done in SAP HANA. This is a paper that came out in 2015, and it was done as sort of a one-off prototype of SAP HANA. As far as I know, when you give SAP money and buy HANA, you don't actually get what's being described here; this was done as a master's or PhD student project in Germany, a proof of concept, and the real system, as far as I know, doesn't actually do this. In HANA they're also going to use a pull-based scheduling approach, but rather than having a single worker assigned to a core, they're going to use a pool model where they have multiple workers assigned to a socket, all sharing the same cores. Another difference between HANA and HyPer is that, whereas HyPer had a single task queue for the entire system, HANA is going to have queues per socket — they call it a group, but it's basically the same thing: a CPU socket is a group and there's a bunch of threads inside of it. And you're going to have two queues per group: a soft queue and a hard queue. The soft queue means that a thread from another group can end up with tasks from that queue, but the hard queue means that no one is allowed to steal from it — those tasks have to run in this group. A soft-queue task would be something like the example I showed before, where we go steal a task from the straggler core; a hard-queue task might be some kind of network scheduling thing or garbage collection — something that you definitely want to run at that particular socket.

Another aspect of the HANA system is that they have a sort of global thread pool for the entire system, meaning they don't make a distinction between background threads, utility threads, and worker threads. In the HyPer case you have networking threads that handle connections as they come in, and you have background threads doing garbage collection and logging. In HANA it's a uniform or universal thread pool where any thread can do anything, which I think is kind of interesting. Now, with HANA, because you have threads that are trying to be smart about taking work and coordinating what's going on, they have a separate — not a dispatcher, but a watchdog thread that acts as a global coordinator. It checks whether all the different groups are being fully utilized, and if it notices that one thread or one group is getting too much work, it's in charge of moving tasks around between the queues of the different groups. So work can move between groups, but the watchdog thread is the one that takes things from the soft queues and assigns them to other groups; the worker threads can't go pick things out of another group's soft queue themselves — the watchdog thread is in charge of moving things around.
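Here's a small sketch of the per-group queues just described, assuming a hypothetical Task type: each group keeps a hard queue that only its own workers may drain and a soft queue whose tasks a coordinator is allowed to move elsewhere. The structure and policy are illustrative only, not HANA's actual code:

```cpp
#include <cstdio>
#include <deque>
#include <functional>
#include <mutex>
#include <optional>

using Task = std::function<void()>;

// One scheduling group per socket: a hard queue (never moved off this socket)
// and a soft queue (a coordinator may migrate tasks from here to another group).
struct Group {
  std::mutex mu;
  std::deque<Task> hard_queue;  // must run in this group (e.g. GC, networking)
  std::deque<Task> soft_queue;  // preferred here, but movable

  std::optional<Task> NextForLocalWorker() {
    std::lock_guard<std::mutex> lock(mu);
    if (!hard_queue.empty()) {
      Task t = hard_queue.front();
      hard_queue.pop_front();
      return t;
    }
    if (!soft_queue.empty()) {
      Task t = soft_queue.front();
      soft_queue.pop_front();
      return t;
    }
    return std::nullopt;
  }
};

int main() {
  Group group;
  group.hard_queue.push_back([] { std::printf("socket-pinned task (e.g. GC)\n"); });
  group.soft_queue.push_back([] { std::printf("movable query task\n"); });
  while (auto task = group.NextForLocalWorker()) (*task)();
  return 0;
}
```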
I don't want to get too much into the details of how this all works because we're running out of time, but they basically have a way not only to move tasks around, but also to scale up and down the number of worker threads that are in a group per socket. So if you recognize that one group doesn't have enough resources — enough threads — to do its work, you can have another group, maybe running on the same socket or a nearby socket, scale down the number of threads that it has, and then you can spin up new ones in the other group. So again, it's a much more sophisticated architecture than what HyPer was doing. The benefit of doing this versus the HyPer approach is that the HANA guys claim that when you actually go to a large number of sockets — remember, the morsel paper was on a two-socket or four-socket machine — the morsel approach does not scale, and having threads steal work from each other actually becomes a big bottleneck and slows you down. So what they found is that when you have a large number of sockets, you actually turn the work stealing off and you get better performance. Right — the finding was that work stealing was not as beneficial for systems with a large number of sockets.

I actually think the HANA one is really interesting, and it's something that in the long term might be fun to try out in Peloton, but it seems like a lot of work: you're basically writing your own OS scheduler inside the database management system. It's not clear exactly how all the commercial systems do this, but my understanding is that the ones that are multi-threaded basically do similar things — they have different kinds of worker pools or thread pools, and they can scale them up and down based on what the system is actually trying to do. Again, the thing I like about this is that in the HyPer case you statically allocate the workers that you have: you assign some threads to be workers, some threads to do networking, some threads to do garbage collection, logging, and other things. In the HANA case, anybody can do anything, and you dynamically scale them up and down based on what's going on.
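Continuing the sketch above, a watchdog loop might look roughly like this: it periodically compares soft-queue depths across two groups and migrates work from the busier one to the idler one, and it is the only thread allowed to do so. Again, this is a guess at the general policy for illustration, not HANA's implementation:

```cpp
#include <array>
#include <atomic>
#include <chrono>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>

using Task = std::function<void()>;

// Minimal stand-in for the per-socket group from the previous sketch.
struct Group {
  std::mutex mu;
  std::deque<Task> soft_queue;  // the only queue the watchdog may touch
};

// Hypothetical watchdog loop: it alone migrates tasks between groups, and
// only out of soft queues; worker threads never steal from each other.
void WatchdogRebalance(std::array<Group, 2> &groups, std::atomic<bool> &running) {
  while (running.load()) {
    std::this_thread::sleep_for(std::chrono::milliseconds(10));
    // Lock both groups at once (std::scoped_lock avoids deadlock).
    std::scoped_lock lock(groups[0].mu, groups[1].mu);
    auto &a = groups[0].soft_queue;
    auto &b = groups[1].soft_queue;
    auto &busy = (a.size() >= b.size()) ? a : b;
    auto &idle = (a.size() >= b.size()) ? b : a;
    // Move tasks until the soft queues are roughly balanced.
    while (busy.size() > idle.size() + 1) {
      idle.push_back(std::move(busy.back()));
      busy.pop_back();
    }
  }
}
```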
All right, so one of my parting thoughts about scheduling and query execution: if you squint at everything I'm describing to you here, it sounds like an operating system, right? It has priorities, it's controlling where data is actually being stored and what data the tasks operate on. And it's really important that we do all of this correctly and efficiently, because if we have a bad scheduling algorithm, or a bad way of assigning data to different places in the system, then we're going to get bad performance. Even though we're running in a shared-memory or shared-everything system, it makes a big difference, as we saw before, whether the tasks for your query operate on data that's local to them. And as always, you don't want the application to have to do any of this for you — we're the database developers, we can do a better job, and we try to take care of all of this.

Any questions about query execution, query scheduling, and multi-threading? So, currently in Peloton, the way it works is that we actually only have a single thread pool. When a request comes in, there's a network thread that grabs it, runs it through the parser, planner, and optimizer, then immediately goes off and does the execution in the engine, produces the result, and sends it back to the client. Even though we're a multi-threaded system, this is actually a remnant of Postgres — we originally started with Postgres, which has that same model: a single process where a single thread is responsible for taking in the connection, parsing, planning, optimizing, executing, and sending the result back. So one of the things we're working on this semester — one of the teams is working on this now for the multi-threading stuff — is disconnecting the networking threads from the worker threads so that we can have intra-query parallelism, so that we can take a single query and divide it across multiple threads running in parallel. That's where we're going this semester, and whether we do the HANA approach or the HyPer approach or something else, we don't know yet.

All right, so next Tuesday — actually all next week — we'll be doing join algorithms, and we're going to go into a lot more detail than what we covered in the intro class. We're actually going to look at how you run these joins really efficiently, really fast, and we'll spend some time talking about how to do these join algorithms in parallel. In Monday's class we'll be doing hash joins, and in Thursday's class next week we'll be doing sort-merge joins. So on Monday I'll talk about different hash tables, different hash functions, and then different hash joins, again all being done in parallel. And for some of you, I know I'm meeting today or tomorrow to talk about your projects. I have all the presentations that you sent me; if you haven't sent yours yet, please do that, and I will look at them either tomorrow or Saturday and send you feedback if I have additional pointers. All right guys, thank you so much — see you next week.