Alright, so I have a question for Shubham. How did you get into Wu-Tang? To be honest, I was not that much into Wu-Tang before I met Andy. I used to listen to a bit of old-school stuff because my brother listened to it, but Andy is the one who got me into it. Got it. Okay. So he's corrupted you into that. Awesome — add that to the long list of things he's done.

Alright, welcome back everyone from the fall break. I know a lot of you have seen your exam scores on Gradescope, so let's jump into the announcements. The usual protocol applies: if you have questions, come talk to me and Andy during our office hours. Some of you already told me you felt the exam was a little tough, and I gave a few tips to people who came to my office hours on how to take such exams. Three things. First, you don't have to answer the questions in the order in which they appear on the paper. If at a glance you see that storage management is a topic you're really good at, go to that question first. Second, each question has a bit of setup describing the background you need before you can answer the detailed parts. Don't rush through it; read that setup carefully to understand what the question is about, and as you assimilate the information, feel free to use your pen or pencil to mark things — for example, "they said A is a primary key; that may be important." Third, pace yourself if you get stuck. If you think question number four is really difficult, skip it for now and go to the easier stuff; you can come back. Part of what we're seeing is that there's a "how do I take an exam" skill that people are still working through. There are multiple components to this, but hopefully that gives you some ideas. Come talk to me and Andy.

Alright, so today we're going to talk about query execution, picking up on what we discussed in the last class. If you remember, last class we talked about how queries get converted into an internal representation made of operators that execute individual portions of the query, with data flowing between those operators. As we discussed, you can think of each pair of operators as a producer-consumer pipeline. For example, here you have a selection and a projection, and you can think of that as a select-project pipeline. Picking up on query execution, as you can see over here, each pair of operators has a producer followed by a consumer: the join consumes input from the selection, but for its parent, the projection, the join becomes a producer and the projection is the consumer. Operators like that have a dual role — they consume from the operators below them and produce for the operators above them in this tree representation.
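To make that producer-consumer idea concrete, here is a minimal sketch of a pull-based, Volcano-style iterator interface of the kind being described. This is illustrative only — all names are hypothetical, and no particular system's API is implied:

```cpp
#include <iostream>
#include <memory>
#include <optional>
#include <vector>

using Tuple = std::vector<int>;

// Every operator exposes the same pull-based interface: the parent
// (consumer) calls Next() on its child (producer) to request one tuple.
struct Operator {
    virtual ~Operator() = default;
    virtual std::optional<Tuple> Next() = 0;  // nullopt => input exhausted
};

// Leaf producer: scans an in-memory table one tuple at a time.
struct Scan : Operator {
    std::vector<Tuple> table;
    size_t pos = 0;
    explicit Scan(std::vector<Tuple> t) : table(std::move(t)) {}
    std::optional<Tuple> Next() override {
        if (pos == table.size()) return std::nullopt;
        return table[pos++];
    }
};

// Selection: consumer of its child, producer for its parent.
struct Select : Operator {
    std::unique_ptr<Operator> child;
    bool (*pred)(const Tuple&);
    Select(std::unique_ptr<Operator> c, bool (*p)(const Tuple&))
        : child(std::move(c)), pred(p) {}
    std::optional<Tuple> Next() override {
        while (auto t = child->Next())        // consume from below
            if (pred(*t)) return t;           // produce upward
        return std::nullopt;
    }
};

// Projection: keeps only the requested column positions.
struct Project : Operator {
    std::unique_ptr<Operator> child;
    std::vector<size_t> cols;
    Project(std::unique_ptr<Operator> c, std::vector<size_t> k)
        : child(std::move(c)), cols(std::move(k)) {}
    std::optional<Tuple> Next() override {
        auto t = child->Next();
        if (!t) return std::nullopt;
        Tuple out;
        for (size_t c : cols) out.push_back((*t)[c]);
        return out;
    }
};

int main() {
    // SELECT col0 FROM t WHERE col1 > 10, as a select-project pipeline.
    auto plan = std::make_unique<Project>(
        std::make_unique<Select>(
            std::make_unique<Scan>(std::vector<Tuple>{{1, 5}, {2, 42}, {3, 99}}),
            [](const Tuple& t) { return t[1] > 10; }),
        std::vector<size_t>{0});
    while (auto t = plan->Next()) std::cout << t->at(0) << "\n";  // prints 2, 3
}
```

Each call to Next() on the root pulls one tuple up through the pipeline, which is exactly the dual producer/consumer role just described.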
Okay, so we're going to go into the different ways in which you can deal with query execution. But before we do, the first question is: what else do we need to worry about? Well, here's the thing — any computing platform today is going to have a lot of parallelism available at the hardware level. Here's a chart. It's a little complicated, but the main part to focus on is the orange line, which shows, from 1970 to this decade, how many transistors we've been adding to processors. The y-axis is an exponential scale, so you can see that exponential growth continuing — that's basically Moore's law at work, and Dennard scaling, for those of you who know it. If you haven't seen it, all you need to know is that for the longest time transistor scaling kept happening — that's the orange line. The thing that has really changed is the other two lines. Single-thread performance — how much work each individual thread can do — is the blue line, and it has started to flatten out, especially over the last decade. So we need a model for hardware in which each single core has essentially become static in terms of what you can get out of it. What those extra transistor budgets allow the processor designers to do is the black line here: up until the early part of the century, around 2005 or so, processors were largely single-core, and every generation doubled the clock frequency. That stopped because of power considerations. What you see now instead is a lot more cores — we're getting to tens of cores and beyond. Today you can't buy a processor that doesn't have multiple cores; even your phone has multiple processing cores. So now we have to ask: how do we run all of this query processing machinery on hardware that by default is a parallel machine? Every processor is now a parallel data machine, and we need to harness the full power of that hardware.

Things are also a bit more complicated in that, within each processor and each computing environment, you have all kinds of different hardware parameters. I'm not going into all the details, but you should be aware of this chart, which is a reincarnation of various charts started by Jeff Dean in 2017. He was going around Google, looking at new engineers, and realizing they didn't really know how processors work or what the trade-offs are in terms of data access — and if you don't know the trade-offs, you won't design the software correctly. So let's look at a couple of elements of the chart. The L1 cache is the small cache that sits right next to the processor: the processor has registers into which things get pulled for compute, below that is the L1 cache, and accessing it takes about one nanosecond — roughly a cycle or so. As you get further down to RAM access — the DRAM, which is where the buffer pool lives — that's going to be two orders of magnitude slower. If you have to fetch data from the buffer pool, it gets pulled into the caches — the L2 cache and the L1 cache — before it gets processed, and the difference between those levels is that large. So you need to maintain locality across these caches in your query processing algorithms.
A lot of those algorithmic details we cover in the advanced database class, if you happen to take that in the spring semester. For today, what I need you to know is that memory is two orders of magnitude away from the processor. Everything you've done so far works off the buffer pool, and that's great — but look at what happens if you have to read data from an SSD. When the buffer pool evicts a page, it goes down to storage. That could be an SSD, if you have high-performance, high-cost storage, or it could be a disk, which sits at the next level down. Between RAM access, which is 100 nanoseconds, and reading from an SSD, which is 100 microseconds, there's a three-orders-of-magnitude difference. So it is three orders of magnitude slower to access something from an SSD than from memory, and four orders of magnitude slower if you have to go to a spinning disk. Now you can start to see why the buffer pool is super important, and why we spend so much time obsessing about buffer pool efficiency.

Going further down: if your data processing hardware is not just a single processor with multiple cores but multiple machines — as we'll discuss today when we talk about parallel and distributed systems — you may have to communicate across those machines. Sending data over the network node-to-node is about 10 microseconds, one order of magnitude faster than going to the SSD. And there are newer technologies like CXL, which allow one node to access the DRAM of another node; that's going to be much faster than reading even from a local disk, but still very likely slower than reading directly from your local memory. So memory hierarchies are getting more complex. Communicating to another machine is what's needed when we talk about parallel and distributed database systems — and distributed systems communicate across servers that are much farther apart, sometimes geographically spread. That's what it costs to go to a machine like that: 100 milliseconds. See the big difference between that and RAM access — one, two, three, four, five, six orders of magnitude as we go across that range. So now you can start to see why we need to understand this overall picture to build high-performance database systems: on a single machine we need to exploit all the cores, and in parallel database machines we need to be aware of all these different costs. Okay, questions?
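Pulling the numbers quoted above into one view (order-of-magnitude figures only; exact values vary by hardware generation and by which version of the Dean chart you look at):

```
operation                        ~latency     relative to DRAM
L1 cache reference                  1 ns      ~100x faster
DRAM access (buffer pool hit)     100 ns      baseline
same-rack network round trip       10 us      ~100x slower
SSD random read                   100 us      ~1,000x slower
spinning-disk read               1-10 ms      ~10,000x slower
cross-datacenter round trip       100 ms      ~1,000,000x slower
```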
Alright. So that's why we care about parallel database machines: we need to exploit all of this hardware and do it well. And if you exploit the hardware really well, it also reduces the cost of ownership, because you may need fewer machines to serve the same workload. There are multiple benefits to being efficient with the hardware you have — fewer machines, a smaller physical footprint, and of course that comes with a real environmental benefit. When we talk about improving performance, we distinguish between latency improvements — how fast can we make a single query go — and throughput improvements — how fast can we make a batch of queries go. We're seeking mechanisms that give us both.

You remember last class we talked about the scheduler, which broke everything into smaller units and allowed you to break a query into even smaller chunks. Today we'll talk about some of the things commercial systems do, and I'd encourage you, after this lecture, to go back to the scheduler idea from the end of last class — the one from the Quickstep project — because that's perhaps a more modern way to build a scheduler. There aren't many systems that do that: Quickstep built it, and HyPer is another system that uses that type of scheduling. The systems we'll talk about today are more traditional, so go compare and contrast after class.

We use these terms "parallel" and "distributed" data platforms. They're different terms, and we'll distinguish them on the next slide, but the commonality is this: instead of having the database system work on a single server, a single node, you have a collection of nodes, and the database management system has to provide the illusion of a single-node system to the end user, to the end application. The application still just sends queries; the system now has to figure out how to break up each query and harness the collective resources spread across all of those machines. From the user's perspective it should feel like: I'm just sending a query and getting my answers faster — lower latency. Or: I sent a whole batch of queries and they all come back a lot faster — the rate at which the system retires queries, producing answers, often measured as queries per second or queries per hour, goes up. That's a throughput measure. And if you take the advanced database class, we talk about certain proportionality properties — the notions of linear scale-up and linear speed-up, which say things like: if I throw twice as much hardware at the problem, I'd expect my performance measures to get twice as good. How to get that linear behavior we defer to the spring semester advanced database class; today we just ask how to get some of this machinery off the ground.
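For reference, the standard definitions behind those two terms, as usually given in the parallel-databases literature (the lecture defers the details to the advanced class): speedup holds the problem size fixed and grows the hardware; scaleup grows both together.

```latex
\mathrm{speedup}(N) = \frac{T(\text{1 node, problem size } s)}{T(N \text{ nodes, problem size } s)},
\qquad \text{linear speedup} \iff \mathrm{speedup}(N) = N
\\[6pt]
\mathrm{scaleup}(N) = \frac{T(\text{1 node, problem size } s)}{T(N \text{ nodes, problem size } N \cdot s)},
\qquad \text{linear scaleup} \iff \mathrm{scaleup}(N) = 1
```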
So, you'll often hear the terms parallel databases and distributed databases. Those names come from technical jargon that evolved in the 80s and 90s when these were separate things. Today the distinction is a little arbitrary: when people say they have a scalable cloud data platform — like Snowflake or Databricks, or many of the Azure, AWS, and Google data platforms — those use a combination of both sets of techniques. So what's the distinction? The commonality, as we said, is that both want to work on a collection of machines and give the user the illusion of a single machine that's much faster than any individual machine. In a parallel system, those machines are typically closely connected — they might be in the same data center, maybe on the same rack — so going back to the previous slide, in a parallel database system you're communicating across different servers or nodes over a fast local network. It's going to be in that 10-microsecond range, and with CXL you're accessing remote memory at close to the RAM layer, so it can be even faster.

So the key distinction between a parallel and a distributed system is this: in a parallel system you have multiple machines, but they're close together, on a very fast network, and you assume the network is reliable — if I send a message or fetch something from another node, that's going to happen; I don't have to worry about whether my message got lost. You assume the hardware more or less takes care of that. In a distributed system, the servers might be geographically distributed, so you're operating in that much more expensive communication regime. And of course, in modern cloud scenarios — if you run Snowflake or Databricks or any cloud database like that — you essentially get a combination of both: some of your nodes may be local, some may be geo-replicated, and you might have any mix of that. Query processing then gets even more sophisticated, because the connectivity picture between the nodes is not uniform; it's a combination, with asymmetric network communication. So everything we talk about today gets even more complicated — again, that's an advanced topic covered in the spring graduate database class.

Okay, so that's the distinction between parallel and distributed database systems, and the whole trend is merging, but the key part is: in a distributed system you assume the communication costs are higher — so you design your algorithms differently — and messages can get lost, so you build some sort of reliable communication layer. If I send a message to another node saying, hey, go process this selection on the data fragment you have, I can't just assume they got it. You have to have layers in your database platform that provide that reliable communication and check for it. Things also get more complicated for transactions, which are coming in five-ish lectures: you have to commit updates, and in a parallel or distributed system the updates might have happened to data sitting on one node here and another node there, possibly geographically spread. You have to commit the transaction so that all those changes get committed and the state of the database is correct, while still preserving the illusion of a single large database served by a collection of machines. So don't worry too much about the distinction, but be aware that when someone says parallel versus distributed, they're probably using the terms in that sense.

Yeah, so a good example: a parallel database system is something you'd use if I've got, say, hundreds of nodes sitting in a single cluster and I really care about high performance. My data fits into the nodes sitting in a rack — a rack takes about 40-ish computers these days. Say each node in that rack has a terabyte of memory and tens of terabytes of storage; I can get close to a quarter or half a petabyte.
In fact, there are models where people ship appliances — Oracle has an appliance called Exadata, which is essentially a big rack. They roll it into a data center. They just announced a partnership with Azure: the Oracle parallel database system running in a rack on their specialized hardware, plugged into the Azure network and serviced by Azure for everything else. A distributed system is what you'd need for the case this wouldn't survive. Let's say I've got my banking transaction data — the master source for what's in each and every account. Changes are happening to it all the time: people are swiping cards, money is being debited, new deposits are coming in. I can't keep that on a single node, in a single rack, in a single server sitting in Pittsburgh — what if that data center catches fire? (And nobody necessarily builds data centers in California; it's too expensive at times. They're built in the middle of the desert, in places like Arizona, or in Iceland where power is cheap — that's usually what dictates where data centers go these days.) You can still have a single rack fail, which is not a very low-probability event. An entire data center can fail, in the sense that its network gets disconnected — lower probability, but not zero. And some of these things are not six-sigma events; they happen more often than you'd think — a single node failure is very common. So I'll need copies of my data for those bank accounts. Maybe it's not a massive database — it may be a terabyte — but I want to keep three copies of it: one copy in Pittsburgh, one in Arizona perhaps, one in Minnesota, and that way if one fails I can recover. When 9/11 happened, a lot of data was in those towers, but nearly all of it was geo-replicated, and all of that information could be recovered from the backup sites. No one noticed anything different with their bank accounts — that's because replication was working, geographically spread. That system obviously has to be distributed, because the copies have to be physically far apart for you to survive a single site completely failing due to environmental factors. And of course, if you have a copy of the data and queries running, you probably want to use the copy too. Sometimes you have copies and all this distribution even for analytic work. So the whole world is moving to parallel-distributed hybrid systems; that's essentially what happens all the time. Okay, great question. Other questions?

Okay, so we're going to go into the process model, then execution-level parallelism, then I/O parallelism. "Process model" is the term we use for how the system deals with that simple tree representation we discussed last class — the producer-consumer stuff — while using the multiple levels of parallelism available in the hardware. We'll use the term worker: you can think of a worker as the thing that executes a unit of work, some piece of the processing. The process model then says how that worker is allocated. Is each worker mapped to an operating system process — every time I need a new worker, do I pop up a new process in my code? Or is the worker a thread?
Do I just start a new Pthread, or do I keep a pool of threads and find an available one and give it work? Or is the worker embedded — I'm running my code, say in a Python notebook, with some processing happening, and the database system runs in that same context? We call that an embedded database. So we'll talk about these three different models.

Let's start with process per worker. This is the simplest model, and most of the early databases used it. A lot of the work on parallel and distributed database systems started in the 80s, and much of it predates threads being popular — threads didn't quite exist in their current form, and Pthreads as a package didn't exist. So at that time, the way you would start a query and start the processing — going back to slide number two here — if you remember, in the first query execution class we talked about how you can start this operator tree bottom-up or top-down. In the process model, I start a process there; it makes a call to another process, maybe through some RPC mechanism; and that calls another process, in a whole process-tree setup. That's the process model: a single process for each worker. In this case I'm assuming a single worker per operator; later we'll talk about intra-query, intra-operator parallelism, where we might have multiple workers per operator. The basic idea is that a query arrives from an application that connects to the database system. There's a dispatcher — I'm calling it a dispatcher rather than a scheduler because it's much simpler — which calls a worker process: start one up if one needs to be started, or, if there's a pool of processes waiting around for work, pull from that pool. Then the work gets done for each operator — a selection, a projection, a join, an aggregation. Really simple. All the traditional database systems that have been around a long time — DB2, Oracle, Postgres — started before Pthreads was popular, so they have this process-per-worker model.

What do modern data platforms do? You can see the longer list of systems over here, including the traditional ones; under the thread model we could fill in pretty much every modern system out there today. The idea is that instead of a process, which is a much heavier-weight abstraction for doing work, you use threads. Threads are much cheaper: spinning up and tearing down a process is way more expensive than starting or killing a thread. So the operating system gives me the process as a heavyweight container, and within that process I use Pthreads — which is now standard — and spin up as many threads as I want. I can control exactly how many: on a 40-core processor I might spin up 40 threads, and if I want to use hyper-threading, which lets multiple threads share a single core, I might spin up 80 threads on that 40-core machine. So the OS process is just a container, and the parallelism I'm orchestrating lives inside the process, via threading.
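A minimal sketch of that thread-per-worker sizing idea, assuming the engine simply matches its pool to the hardware (all names illustrative; a real engine would hand these workers tasks from a queue):

```cpp
#include <iostream>
#include <thread>
#include <vector>

// The OS process is just a container; the engine decides its own degree
// of parallelism. A common choice: one worker per hardware thread (so
// ~80 workers on a 40-core machine with 2-way hyper-threading, which is
// roughly what hardware_concurrency() reports).
int main() {
    unsigned n = std::thread::hardware_concurrency();
    if (n == 0) n = 1;  // the call may return 0 if the count is unknown

    std::vector<std::thread> workers;
    for (unsigned i = 0; i < n; ++i) {
        workers.emplace_back([i] {
            // Each worker would pull operator tasks from a work queue
            // here; starting/stopping a thread is far cheaper than
            // fork()ing and tearing down a whole process per task.
            std::cout << "worker " << i << " ready\n";
        });
    }
    for (auto& w : workers) w.join();
}
```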
So, pretty straightforward — and it's simpler and lighter-weight than the process model. One downside is that if a thread crashes, it can take the entire process out. If I've got 40 threads running and one thread hits a SIGKILL or a segfault, the whole process dies — unless you do special stuff to catch it, which you can, but you have to write that code. Okay, questions so far on these two models? As the slide says, basically every database created since Pthreads became popular — so for about 20 years now — uses the thread model, because it has so many advantages over the process model. The idea is similar: an application comes in, but the unit of work is now at the thread level, not the process level.

Now, some systems, like Microsoft SQL Server, which has been around a while, had to go through the pain of moving from the process model toward the thread model, and these systems — Oracle, SQL Server, DB2 — now have a balance between the thread model and the process model. I want to zoom in on SQL Server and what they did at the dispatcher level. Remember, last class we talked about scheduling and unit-of-work-based scheduling; this is simpler than that, but it's what Microsoft actually did. For each query plan, the database system has to decide how to execute it: how many tasks, and how many workers get involved. You make that decision based on how many CPU cores you have, and you might also decide which core should execute which task. There are sometimes advantages to picking the core yourself rather than letting the thread manager or the operating system decide where work gets allocated, and the big reason is that the database system often knows a lot more about the context than the operating system. For example, suppose a selection operator just produced its results and was running on core four, and now I need to schedule a subsequent aggregate operation, and core four is free. I want to schedule that next worker on core four, because the data likely still sits in that core's caches — the L2 cache, for example. As you saw from the chart, the L2 cache is an order of magnitude cheaper to access than RAM. Most modern systems also have an L3 cache, and sometimes something that looks like a gigantic L4 — some of the newer Intel and AMD parts have on the order of a gig of SRAM-like cache sitting on the processor, which is an amazing amount. So locality is important, and you want to take it into account.

Yep — question. Yeah, L1 and L2 are typically private to each core, and the bigger ones — the L3, or what looks like an L4 — are often shared caches. But also, a server you buy today will typically have multiple processor sockets — maybe four — so you at least want work to go to the right processor, even if you're reasoning about a lower-level cache.
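That core-pinning idea can be made concrete with a Linux-specific sketch (pthread_setaffinity_np is a real but non-portable call; the core number here is just the one from the example):

```cpp
// Pin a worker thread to a chosen core so a consumer (e.g., an aggregate)
// runs where its producer's output is likely still warm in that core's
// private L1/L2 caches. Linux-only; compile with -pthread.
#include <pthread.h>
#include <sched.h>
#include <thread>

void pin_to_core(std::thread& t, int core_id) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core_id, &set);
    // The DBMS, not the OS scheduler, picks the placement; returns 0 on success.
    pthread_setaffinity_np(t.native_handle(), sizeof(cpu_set_t), &set);
}

int main() {
    std::thread worker([] { /* run the aggregate operator here */ });
    pin_to_core(worker, 4);  // the core the selection just finished on
    worker.join();
}
```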
Question: the database management system is so specialized — why do you still have to go through the operating system? Why not just get rid of it? Yeah, so the question is: why don't we just build a database system without the operating system? And that's been a perennial debate in the community — Mike Stonebraker wrote a paper about 40 years ago on how the operating system just gets in the way of database systems. But the practical reality is that you have a server, and it has to do a lot of low-level device management. People write device drivers and test them at the operating system level; you want your network card to work, you want your storage devices to work. And that server runs lots of applications — the database may be just one of them. So you still need a layer that does what an operating system does. Now, parts of what the operating system does get in the way. The file cache gets in the way of database systems, because we like to run our own buffer pool. So what database systems often do is use a mechanism like opening files in direct mode: the data sits in the file system from the operating system's perspective, but you tell the OS, don't cache any of this. So the OS gets in the way of caching — the buffer pool does a much better job than the file cache. It gets in the way of locking: operating systems lock at the file level in most Linux systems, but we want finer-grained locking, as we'll see when we talk about transactions. Some scheduling mechanisms also get in the way, though operating systems have gotten much better at letting the dispatcher give hints about affinity. But there are certainly portions of the operating system you need. Most database systems take the parts where the operating system gets in the way and build those themselves — the buffer pool is a great example. It's an ongoing debate as to where that boundary should be.

Yeah — question: why don't they separate those pieces out of the operating system, especially since it's apparently faster? Yeah, I'd rephrase your question as: why are operating systems so monolithic? Why can't they be much simpler? This place has a tremendous history researching exactly that, with Mach and microkernels and such — awesome ideas, and more of that absolutely needs to happen in the operating systems community. We have what we have right now; some of those ideas exist in piecemeal form, but not in that ultimate version of an operating system so modular and lightweight that you can pick just the fast pieces you need. It's still an ongoing research topic, and I encourage you to take the operating systems class if you're interested — they spend a fair amount of time on that. Great question. Other questions? Cool. Alright — yep, you had a question. Yeah, we'll get to that; we have a ton to cover, so hold it, and if I don't answer it in 15 seconds, stop me again.
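An aside before we move on: that "open the file in direct mode and tell the OS not to cache" mechanism is, on Linux, typically the O_DIRECT flag. A hedged sketch (the file name and page size are hypothetical; O_DIRECT is Linux-specific and requires aligned buffers):

```cpp
// Linux sketch of "don't cache this, I have my own buffer pool":
// O_DIRECT bypasses the OS file cache entirely.
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

int main() {
    const size_t kPage = 4096;  // hypothetical DBMS page size
    int fd = open("table.db", O_RDONLY | O_DIRECT);  // hypothetical file
    if (fd < 0) return 1;

    void* page = nullptr;
    // O_DIRECT requires sector-aligned buffers, offsets, and lengths.
    if (posix_memalign(&page, kPage, kPage) != 0) return 1;

    ssize_t n = pread(fd, page, kPage, /*offset=*/0);  // straight from disk
    // ... hand the page to the buffer pool, which does its own caching ...
    free(page);
    close(fd);
    return n == (ssize_t)kPage ? 0 : 1;
}
```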
So I'm going to go through the SQLOS part that Microsoft did very quickly, just to give you an idea, because I want to save time for the core topic. What they did is essentially this type of idea — and remember, Microsoft until fairly recently was only selling Windows Server; everything worked on Windows, Windows NT, Windows Server, and so on. In the mid-90s they decided they needed to build a database system, because that was one of the most important applications for enterprises. So they started building SQL Server, and they reached a point where they had to make changes to it, and they added the SQLOS layer — mirroring some of the discussion we just had about what the boundary, roles, and responsibilities should be between the operating system and the database. They had the benefit of having gone through that debate a bunch of times; since they started in the mid-90s, they had learned lessons from before. SQLOS wasn't in the first version of SQL Server — that happened much later — and it's essentially an abstraction layer between the database engine and the operating system, implementing operating-system-like services internally. The services it manages include things like I/O: I'm going to issue an I/O request, but instead of calling the Windows driver or the Windows layer directly, I go through the abstraction and let it make the call. One big advantage of that showed up when they had to move to Linux — because in the cloud, that's what the servers run; most of them run Linux. If Microsoft hadn't, years earlier, built that abstraction layer to address exactly the types of problems we just discussed, they would have had a very hard time making that massive piece of code work on Linux. Because they had it, moving over to Linux was relatively easy — there was still a ton of work, but it was relatively easy. The key takeaway: it is an extremely good idea to have, in your database layer, a well-defined interface to the operating system, so you can make that boundary work the way you want — especially since the split between what the operating system does and what the database does keeps getting reimagined.

Question: I'm not sure I understand what SQLOS is — is it a formal kernel bypass, or more of a generic interface? A generic interface. The idea is: I won't call the I/O scheduler directly at the operating system level; I'll call an internal scheduler that has its own behavior. For example, it might put all my requests into an I/O queue, and before actually calling the operating system, check whether there are eight pages contiguous in the physical space and merge them into one I/O. You can implement all those kinds of algorithms in that layer, instead of hoping the operating system does the right thing for you. Prefetching is another example; scheduling and affinity scheduling are others. Anyway, the details matter less than the point: yes, that layer matters, and it should be something you consider.

I'll go really quickly through one more thing. There's one interesting design decision they made — I'm not sure it makes a ton of sense now, but they didn't have the benefit of hindsight. They had the process model, and they needed to figure out how to make it work in a more agile fashion. Remember the discussion from last lecture — the work-order-based scheduling, where no single worker was ever occupied with executing an entire selection operator, a single operation, the whole time?
Workers were just operating on one block at a time and then yielding and asking: tell me what to do next. They wouldn't say, I've got the entire file, I'll do the whole selection, I'm starting now, check back with me in an hour. They weren't blocking the whole system, or blocking themselves from doing something else, because each unit of work was very short. That's what we discussed in the last class. Here, Microsoft did something different, because they didn't have that luxury: they actually went into the code. For example, if you're doing a simple selection, the code is basically a for-loop iterating over each record, applying the predicate, and emitting the record if it passes. They went to each of the operators and added instrumentation that says: every once in a while, check how much time you've spent in that inner loop, and yield explicitly so control can be given up. Obviously, when they did this, they didn't have the advanced scheduling mechanism we talked about last class, but it was a way to keep a single worker from blocking, and it gave them a more agile way of building the system. I'd suggest that what we covered last class is the better way to do this now, but at least they got it to work in a way that made sense for them.
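Here is roughly the shape of that instrumented selection loop — a sketch only; the names and the quantum value are illustrative, and real SQLOS yields to its own user-level scheduler rather than to the OS:

```cpp
#include <chrono>
#include <cstddef>
#include <thread>
#include <vector>

using Clock = std::chrono::steady_clock;

struct Tuple { int key; };
bool eval_predicate(const Tuple& t) { return t.key > 10; }
void emit(const Tuple&) { /* hand the tuple to the parent operator */ }

// A selection loop instrumented the way the lecture describes: every so
// often, check how long we've held the core and yield explicitly, so one
// long scan can't block every other task on this scheduler.
void select_scan(const std::vector<Tuple>& table) {
    const auto kQuantum = std::chrono::milliseconds(4);  // illustrative value
    auto start = Clock::now();
    std::size_t i = 0;
    for (const Tuple& t : table) {
        if (eval_predicate(t)) emit(t);
        // Reading the clock is itself costly, so only check every 1024 rows.
        if ((++i & 1023) == 0 && Clock::now() - start > kQuantum) {
            std::this_thread::yield();   // give up the core voluntarily
            start = Clock::now();
        }
    }
}

int main() { select_scan(std::vector<Tuple>(100000, Tuple{42})); }
```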
Okay. So we've talked about the process model and the thread model. The last model is the embedded database model. Here the database is literally written as a library, and you just link that library into your code: the database engine runs inside your application. You make calls to the database saying, hey, create a table, here's the schema; insert records; do a selection; and so on. These are called embedded databases in database lingo, and they effectively run in the same address space as the application — in a thread the application is running. There are lots of examples. The most famous is SQLite, though there was a system called Berkeley DB that started 5 or 10 years before SQLite, toward the end of the last century. SQLite is used in a huge number of applications — you probably have dozens of copies of SQLite compiled into the apps on your phone, since most apps that need a database layer just link in the SQLite library. It's the most widely deployed database engine: there are an estimated 10 billion copies of SQLite running across the planet, because everyone carries a phone and many phones run multiple copies. It's pretty interesting — SQLite offers a distribution where you get one single C file containing the entire database system. Very cool. You compile that into your code, and that's your database system. We could spend a whole lecture on SQLite and how it interacts with the OS — we've actually worked on SQLite, and some of the optimization code my research group developed ships in SQLite. So all of you are running our code. Okay, question. [The question is whether you could dynamically link one shared copy of SQLite instead of compiling it in.] I don't think that's impossible — some applications might even be doing it; I don't know why you wouldn't be able to. There's a notion of a global locking mechanism in there, but that should still work with dynamic linking. Well — they maintain a buffer pool, so I'd have to go look at whether the buffer pool stays per-application or becomes a shared buffer pool, which you may not want. I think people compile it in because a lot of the assumptions probably break if you start sharing the buffer pool across applications. Okay, cool.

Alright, a summary of the process models. If you're in the embedded space, you're really looking for a lightweight engine — that makes sense there. The process-per-worker model, where each worker is its own process, is essentially something from the past; most modern database systems use threads, which is the better model. What's not on this slide: if you look at cloud databases, they often have a different notion of the unit of a node. They effectively allocate compute containers — say four or eight cores — virtualized on top of the actual hardware, and you impose your model on top of that. The bottom line is that in the cloud there's a finer granularity to where resources come from: containers might be slices of two or four cores from an underlying physical machine with 64 or 100 cores, sliced up virtually, and within that container you place your process model. So there's one more level of abstraction you often see when deploying in the cloud — just for you to know; you still have to decide, within that container, which model you're going to use.

[A student asks whether using the thread model implies that a system supports parallelism within a query.] Yes — we'll talk about intra-query parallelism next, because just using threads doesn't mean a system has intra-operator or inter-operator parallelism, which is exactly what we're going to cover. How common is that? Just wait for it. Okay — other questions on the process model? Okay. So far we've said: I need to do some work, say a selection operation. Where does that work run — do I spin up a process, do I use a thread, or is it embedded? Now set the embedded case aside; for the thread and process models, I still have another decision to make. I have this selection operation — should just one worker do it, or should I have 10 workers spin up and share it? Say the file has a million pages; maybe I spin up 10 workers and each works on a tenth of the data. That would be intra-operator parallelism. And if I allocate 10 workers to that selection, I still have to decide whether each worker is a thread or a process — which we've already discussed. So the next question is: do I take each operator and throw many workers at it? That's intra-operator. Actually — sorry, set intra-operator aside for a moment; we'll talk about intra-query versus inter-query first, and break intra-query down into intra-operator in a second. Intra-query parallelism: for a given query, do I use multiple workers to do its work? Inter-query parallelism: if I've got multiple queries sent to my system, do I use multiple workers across them?
The simplest model would be one worker, be it a process or a thread: if 10 queries come into the system, I take the first query, run it on that worker until it's done, then take the second. So when I say I've got 10 workers, it can go two ways: they could each be chewing on a single query at a time, or they could be teaming up on a collection of queries together — and of course there are things in between, as we'll see.

So let's start with inter-query parallelism. We have multiple queries presented to the system, and we need to figure out how to run them fast. If the queries are read-only, this is an embarrassingly parallel task: I've got 10 queries presented to the system and 10 workers, so each worker takes one of the queries and starts working. That can go really well. If some of those queries share the same data pages — say all the queries start by reading the same table — then one of them brings the pages into the buffer pool and everyone else benefits. The buffer pool runs its LRU-2, LRU-K-like policy, and that's why you built it that way: so it's efficient under access patterns like this. That all works as long as everything is read-only. But if multiple queries run simultaneously and one is reading all the bank accounts to compute the bank's total deposits while another is adding interest to certain accounts, they interfere with each other: one is reading and one is writing, potentially the same data. There you have to start worrying about how to do that correctly — and we'll cover that topic when we talk about transactions. So the buffer pool handles the stuff we've already discussed, and we'll get into more details in lecture 15.

Now, intra-query parallelism is: I have one query and I want to make it go faster by using more than one worker. How do I do that? Remember, the operator tree is cast as a producer-consumer relationship between each pair of operators. We can now take each operator and give it more than one worker — that's intra-operator parallelism, which we'll look at next — and the other thing we can do is run multiple operators in parallel. Before we go into the details, take a simple example: hash join, in the specific flavor of Grace hash join. If you remember, in Grace hash join we take the two tables being joined, R and S, and partition R using a hash function h1 to create partitions R0, R1, R2, up to some number of partitions. We apply the same hash function h1 to the S side and get the corresponding partitions. In the second phase, we join the partition pairs: build a hash table on partition zero of R and probe it with partition zero of S, and so on. We get the whole join with that divide-and-conquer setup. Remember that? It was a few lectures ago. Now, with an operator like that, there's parallelism you can exploit: after the partitioning step, the first partition pair could be joined by worker one, the second by worker two, and so on, in parallel. We'll take that and break it down into the different levels of parallelism.
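A compact sketch of that idea — partition both inputs with a shared hash function, then let one worker join each partition pair independently. This is illustrative only (a real Grace hash join writes partitions to disk and sizes them to memory; all names here are hypothetical):

```cpp
#include <iostream>
#include <thread>
#include <unordered_map>
#include <vector>

struct Row { int key; int payload; };
constexpr size_t kParts = 4;

// Divide: the same hash function h1 on both inputs, so matching keys
// land in the same partition pair.
std::vector<std::vector<Row>> partition(const std::vector<Row>& in) {
    std::vector<std::vector<Row>> parts(kParts);
    for (const Row& r : in) parts[std::hash<int>{}(r.key) % kParts].push_back(r);
    return parts;
}

int main() {
    std::vector<Row> R = {{1, 10}, {2, 20}, {3, 30}};
    std::vector<Row> S = {{2, 200}, {3, 300}, {4, 400}};
    auto Rp = partition(R), Sp = partition(S);

    // Conquer: partition pairs are independent, so worker i builds a hash
    // table on R_i and probes it with S_i with no coordination needed.
    std::vector<std::vector<std::pair<Row, Row>>> out(kParts);
    std::vector<std::thread> workers;
    for (size_t i = 0; i < kParts; ++i) {
        workers.emplace_back([&, i] {
            std::unordered_multimap<int, Row> ht;        // build on R_i
            for (const Row& r : Rp[i]) ht.insert({r.key, r});
            for (const Row& s : Sp[i]) {                 // probe with S_i
                auto [lo, hi] = ht.equal_range(s.key);
                for (auto it = lo; it != hi; ++it) out[i].push_back({it->second, s});
            }
        });
    }
    for (auto& w : workers) w.join();
    for (auto& part : out)
        for (auto& [r, s] : part)                        // keys 2 and 3 match
            std::cout << r.key << ": " << r.payload << "," << s.payload << "\n";
}
```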
Now we're focused on intra-query parallelism: a single query, with more than one worker available — how do we make that query go faster? You can think of it as primarily wanting to reduce the latency of this one query. We'll talk about intra-operator parallelism, which uses multiple workers to, for example, do that hash join better — think of it as the horizontal way to speed things up. The other one is inter-operator parallelism, which looks at the operators in the tree and runs multiple operators simultaneously — that's orthogonal, the vertical way. The textbook also talks about something called bushy parallelism, which is a hybrid of both. Really, the first two are what's important, and if you use the scheduler approach we discussed last class, it naturally does both. We'll talk about bushy very briefly just to stay consistent with the textbook, but what you really need to know is intra- and inter-operator parallelism.

Alright, let's start with intra-operator parallelism. The way to do this, as you saw in the hash join case, is to break the input into multiple pieces: the divide step then lets us conquer the individual partition pairs. That conquer piece is where we could exploit the parallelism — we could throw multiple workers at independent pieces of work that proceed without interfering with each other. So you're trying to get to a structure where the divide does some of the work, then each piece can be done independently at its own pace, and after all of that work is done, you get the final result. Okay. Now, how do you introduce that systematically into a query operator tree that might have all kinds of complicated operations going on? There's a beautiful paper from Goetz Graefe that introduces the exchange operator — Postgres, by the way, calls it a gather operation — and I'll show it to you with an example.

Take the simplest case: I want to apply a selection to a table. The table might have a large number of data pages — say just five over here — and I've got three units of hardware parallelism to exploit, so I can run three workers at the same time. What you could do is spin up workers one, two, and three; tell worker one to go work on page one, independently of telling worker two to go work on page two — essentially no interference between the work done by each of these workers. We introduce the exchange operator at the top — Postgres calls it gather, which probably makes more sense as a term — and it takes the work done by each of these workers. Each worker operates on an independent page, applies the selection, and sends its results across. And then in your iterator model, if you're driving this tree top-down, the exchange calls the selection, which calls down the chain in the worker being told to work on page one, and data for page one starts flowing up. As it flows up, the exchange/gather operation just collects the results and sends them upstream to the next operator. Right now this exchange is very simple; we'll build it up into something a little more complex, but first let's walk through this example.
Worker one starts on page one and goes and does its work; the others start in parallel. At this point in time, if you look at the machine, these three workers are working on three independent pages, getting work done. As data gets produced, it's all sent to that exchange operator, which is combining and collecting it, producing its own output pages. So the exchange is consuming from these three workers, and it will become a producer in a little bit. These workers can finish and then pick up the other pages: in this example, workers one and two got done — maybe they had fewer records on their pages to process — and worker three is still working, but that's okay. One and two are done, so they can be allocated pages four and five. The selection divides naturally by page, so we didn't have to do an explicit divide step; the conquer part is what's running in parallel here. At the second instant in time, workers one and two are working on pages four and five; worker three eventually finishes up, and the exchange operator gets all of its results. The exchange has finished consuming, and now it has to send its output upstream.

That simple version of exchange is a pure gather operation: it just gathers everything and sends it as a single output. Now, there's a different flavor of exchange, which is a distribute. What we saw in the previous slide was the simple gather version; the distribute version says: I'm going to take one input and distribute it across multiple outputs. You can imagine a distribute operator that reads table R, applies that hash function h1, and produces the partitions of R; a different distribute operator does the same for S. That's how you'd build the partitioning step of the Grace hash join. So it's the opposite of gather — the funnel is inverted. A gather takes multiple inputs and produces one output; the distribute does the opposite, using some function — a hash function, round-robin, or a range function. Again, those we'll detail when we get into the advanced graduate class. Then there's a repartition component, which is a combination of the two: rather than gathering a whole bunch of inputs with one operator and having a separate distribute operator to pass data across to operators, I can blend both functions into one; it's more efficient. A repartition might take input from three workers and produce output for two different workers — applying a two-way hash function, for example. Repartition is the general form: it can have M inputs and N outputs, and with it you can control the shape of the tree and how parallel each part needs to be. So those are the flavors of the exchange operator you'll see. In the original paper, exchange largely referred to that last form — and you can say that if you have that operator, the other two cases are special cases of it, which is a perfectly legitimate way to look at it. Questions?
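A minimal sketch of the gather flavor with the page-claiming behavior from that walkthrough — workers atomically claim the next unprocessed page, apply the selection, and the gather side collects whatever each worker emits (all names and the predicate are illustrative):

```cpp
#include <atomic>
#include <iostream>
#include <mutex>
#include <thread>
#include <vector>

using Page = std::vector<int>;

int main() {
    // Five "pages", three workers, as in the walkthrough above.
    std::vector<Page> pages = {{1, 2}, {3, 40}, {5, 60}, {70, 8}, {90, 100}};
    std::atomic<size_t> next_page{0};  // whoever finishes early claims more

    std::vector<int> gathered;   // the exchange/gather side
    std::mutex gather_mu;

    auto worker = [&] {
        for (;;) {
            size_t p = next_page.fetch_add(1);   // claim a page
            if (p >= pages.size()) return;       // no pages left
            Page out;
            for (int v : pages[p])
                if (v > 10) out.push_back(v);    // the selection predicate
            std::lock_guard<std::mutex> g(gather_mu);
            gathered.insert(gathered.end(), out.begin(), out.end());
        }
    };

    std::vector<std::thread> pool;
    for (int i = 0; i < 3; ++i) pool.emplace_back(worker);
    for (auto& t : pool) t.join();

    // The gather operator now acts as a producer for the next operator up.
    for (int v : gathered) std::cout << v << " ";
    std::cout << "\n";
}
```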
Question: where in the database do all of these parallelism decisions live — intra- and inter-operator — and does the dispatcher know about all of this? Yeah, great question. So who decides the degree of parallelism, my M versus N? It's a complex question; I'll give you a quick answer, which will be incomplete. The optimizer often makes some of these first decisions, and the trend, more and more, is toward smarter schedulers — like we talked about last class — that decide based on what hardware parallelism is available. And the degree of hardware parallelism is often changing on the fly. Okay, great question. Yep?

Question: roughly speaking, does the gather effectively function as a synchronization barrier — waiting until all three of these finish? It does — but wait till I talk about inter-operator parallelism, because in some cases it can start sending things out early. The simplest view of the exchange operator is as a barrier that stops all of the flow below it. But as we get to inter-operator parallelism, in some cases you don't need that: you can pass things along if it's safe to do so. Question: similarly, could the repartition receive one input and decide that's enough for one of its outputs — synchronize on one thing and emit one output? So the question is whether repartition can take some input, send it to one output, and finish that before the others. The answer is that's often hard, because you don't know where all of the input is. If the repartition was applying a hash function and you knew everything below it was already hashed that way, then yes, you could do something like that — but the need for it is probably low. It's better to handle this in a smart scheduler, like I was telling you about last class — which almost no one builds, except a few systems — but that would be the better way to do it. Question: I guess if you wanted that, you could just add what would be the output to the exchange endpoints? Yeah, that's right. Effectively all these questions are about whether you can shape the fan-out and fan-in of these exchange operators in general. Absolutely — it has huge implications for performance, and how to do it right is a burning question, especially in new environments. In a cloud environment, a query may arrive, you may have 10 nodes to work with, and while the query is running — maybe it's going to take a day — five of those resources might be taken away, or 100 more added for a little bit of time, and you have to work in that dynamic environment. Lots of open research issues; we cover a bunch of those in the advanced database class. Yep — workers on the same level? Workers on the same level of that operator tree will be operating on the same operation, yes, if that's the question.

So let's go to the probe side and into the hash join — let me just go to that. Here is a selection followed by a hash join. Let's say we have three units of parallelism for the build side, and they start producing output. In this case, the projection has been pushed down below the join, which is a query optimization you can do — we'll discuss that. So here my pipeline runs through the selection, then the projection.
And now I'm taking all of that and sticking it into the build side of the hash table, because I'm partitioning it. So I'll have something on the build side; I exchange that and feed it into the join. I can start the probe side in a similar way. In this case there are three inputs into the exchange on each side, but it could have been two here and four there — the shape of the tree depends on the degree of parallelism you want to allocate across the different operators, and all of those choices are legitimate. At the end of the day, however, the join needs its input to arrive with a certain partitioning strategy; if it doesn't, it has to repartition internally in the operator code. Again, lots of interesting stuff; much of it is covered in the advanced database class. Let me keep moving.

The biggest benefit from parallelism is going to come from what we just talked about: intra-operator parallelism — throwing more workers at a given operation and using this exchange operator to make all of that tree logic work for us. With those two mechanisms, you'll get the most out of the hardware you have. Now, there's a secondary mechanism, which in streaming environments becomes the primary mechanism. In a streaming environment, data is continuously flowing in — for example, you're getting ticker updates for a thousand stock prices every millisecond or ten milliseconds, and you're trying to compute a sliding-window aggregate over them. The work you need to do per input is not a lot, but you have lots and lots of pieces of information, and you don't want to wait for an entire batch, look at all the ticker prices, and only then produce the output. You're effectively in a continuous environment — a streaming environment — where output is produced as input comes in, and the unit of work is small. Systems like Storm, Flink, and Kafka — and I worked on both Storm and Heron — get their main parallelism from a combination of intra- and inter-operator parallelism.

So what does inter-operator parallelism look like? It's also called pipeline parallelism. If you know UNIX pipes: you can connect two processes with a pipe; process one sends stuff along as it goes, and process two starts working on whatever is sent to it, before process one has finished everything. Both processes operate on a stream of data, and you're not waiting for everything to be done by the first process before sending it. Same idea here. Given this operator tree, orthogonal to all the intra-operator parallelism, what I can also do is run, say, a thread (if I'm in the thread model) that goes and does this join — and the first time it produces a result, before it has finished the whole join, it just sends that result up, and the operator above projects the output and sends it along. So I don't have to wait for all of this to be done to see my first result. You can see how useful this would be in a streaming environment, and it helps in a regular system too: the worker running the projection is not sitting idle; it can start its work while the join below is still running.
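A minimal sketch of that pipeline idea: two operator threads connected by a small queue, like a UNIX pipe, so the consumer starts before the producer finishes (illustrative names; real engines use bounded, batched queues):

```cpp
#include <condition_variable>
#include <iostream>
#include <mutex>
#include <optional>
#include <queue>
#include <thread>

// A tiny pipe between two operators: the upstream (e.g., a join) pushes
// each result as soon as it is produced; the downstream (e.g., a
// projection) consumes immediately instead of waiting for the whole join.
template <typename T>
class Pipe {
    std::queue<T> q;
    std::mutex mu;
    std::condition_variable cv;
    bool closed = false;
public:
    void push(T v) {
        { std::lock_guard<std::mutex> g(mu); q.push(std::move(v)); }
        cv.notify_one();
    }
    void close() {
        { std::lock_guard<std::mutex> g(mu); closed = true; }
        cv.notify_all();
    }
    std::optional<T> pop() {   // nullopt once the producer is done
        std::unique_lock<std::mutex> lk(mu);
        cv.wait(lk, [&] { return !q.empty() || closed; });
        if (q.empty()) return std::nullopt;
        T v = std::move(q.front());
        q.pop();
        return v;
    }
};

int main() {
    Pipe<int> pipe;
    std::thread join_op([&] {        // producer stage
        for (int i = 0; i < 5; ++i) pipe.push(i * i);
        pipe.close();
    });
    std::thread project_op([&] {     // consumer stage, running concurrently
        while (auto v = pipe.pop()) std::cout << *v << "\n";
    });
    join_op.join();
    project_op.join();
}
```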
Both of these are orthogonal mechanisms, and you can use both of them at the same time, but they are very distinct mechanisms. I'm just going to glance over this: the textbook talks about bushy parallelism, which is a hybrid of both. There's really no magic to it, and it's even unclear whether you should call it that; I think the textbook authors are the only ones who use the term. I just don't want you to get confused if you read the textbook. The main idea is that if I have a tree of this type, where joins feed into other joins — that's called a bushy tree — then those two joins up there feed into an exchange operator, so the exchange is getting input from two different join operations. Effectively this says I could use intra-operator parallelism at some level, like in the first join; I could also be pipelining across each of these individual branches of the tree; and I can be working on multiple branches at the same time. So it's just a combination. I don't want to spend too much time on it, but basically you could do that. And if you use the scheduler we talked about in the previous class, this bushy kind of parallelism just comes for free; you don't have to over-engineer it, there's no rocket science here. Just hold that question till the end of the class — I think I know what you're going to ask, but I don't want to spend too much time on the bushy tree; I just wanted to dot the i's and cross the t's on it.

So what have we talked about so far? We want to use all the hardware parallelism that we have, and to do that we have all these complex worker models. Now we're going to go from that level to a slightly lower level. So far, essentially everything we've talked about says: I've got 40 cores, how do I use them? I've got 100 nodes sitting in my rack, how do I use them? But when an IO needs to be sent to the IO subsystem, more often than not that IO subsystem is a complicated system in its own right; it might have multiple disks in it. So we're going to start talking about IO parallelism. We are now one level down, making requests to the IO subsystem, and it has its own, different form of parallelism. The IO subsystem's parallelism is a little different from the compute-level parallelism, but they both have to work together. Does that make sense? Because I'm probably not running on a single disk; even a single server has a bunch of disks, and there's parallelism there too.

To make use of IO parallelism, you need some notion of how the data is laid out across the different disks that you have. There are many ways of doing it. I could have four disks and four databases, and each database gets a disk; obviously that's pretty coarse. Or I might say I've got four disks and every table sits on its own disk, with some disks holding multiple tables — say I'm worried about just a single database. Or I could have multiple disks per database. Or I could go to the extreme and say everything can be partitioned whichever way: all I need to decide, for each table, given four disks, is whether it goes on four disks, three disks, two disks, or one disk, and I can control that.
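As a rough illustration of these layout choices, here is a small sketch contrasting the coarsest policy (one disk per database) with per-table control over how many disks a table spans; the policy names, disk count, and tables are hypothetical.

```python
# Two hypothetical placement policies, from coarsest to finest control.
NUM_DISKS = 4

def place_one_disk_per_database(databases):
    """Coarsest option: database i lives entirely on disk i."""
    return {db: [i % NUM_DISKS] for i, db in enumerate(databases)}

def place_per_table(tables, spread):
    """Finer option: table t is spread across `spread[t]` disks,
    assigned round-robin starting from a rotating offset."""
    layout, start = {}, 0
    for t in tables:
        k = min(spread.get(t, 1), NUM_DISKS)
        layout[t] = [(start + j) % NUM_DISKS for j in range(k)]
        start = (start + k) % NUM_DISKS
    return layout

print(place_one_disk_per_database(["sales_db", "hr_db"]))
print(place_per_table(["orders", "customers", "lineitem"],
                      spread={"orders": 2, "lineitem": 4}))
```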
And then I need mechanisms to deal with that level of parallelism. If all the data were sitting on a single disk and I had four worker threads running on it through intra-operator parallelism, they would all hit that one disk, and it would become my bottleneck. That's why IO parallelism matters: if you don't pay attention to it, everything you're trying to do in the compute layer is going to get bottlenecked here. Does that make sense, how these two are connected? Okay.

All right, so what can we do? First, there's a whole class that covers things like the fact that data on disk doesn't last forever. I'm going to present a high-level overview, and anybody who wants to go deeper can take classes on the kind of research Rashmi does here on erasure coding. It all comes down to this basic issue: I store data on a disk, whether it's an SSD or a spinning disk, and bad things happen to it. There are trillions of bits sitting on these devices, and some of them are going to get corrupted. I may be running some sort of checksum at the page level, but bits on a disk rot — it's called bit rot — and just like actual fungus and rot, it tends to spread. So I have stored data on disk, but I need to get it back, and it had better be the data I wrote: I don't want to lose data, and I don't want to get wrong information back. And in a server environment I almost never have a single disk, but a collection of disks that collectively behave like an IO subsystem, so I need to work with that.

There are three competing dimensions. First, high performance: if I've got four disks and I make a call to that subsystem, can I get the collective power of those four disks on a single scan? Second, durability: I know some bits are going to rot and other bad things are going to happen, so I want to replicate at that layer so that if something goes bad I can get a copy, and I need to keep the copies consistent within the IO subsystem. Third, capacity: I want to use as much of the collective storage capacity of the disks as I can while still getting the other properties — and obviously these goals compete with each other. As I said, this is a whole area by itself, so I can't do justice to it; I just want you to appreciate that when you talk to an IO subsystem, even on a local server, this parallelism exists at the IO level and we need to be aware of it. Of course, if you're talking to a parallel or distributed system, the IO subsystem is spread across machines, and in a cloud environment the whole storage layer is often separate: there's a compute cloud and a storage cloud. So it gets even more complicated, but let's keep it simple so you can appreciate the basics, and then you're set up to do more advanced work in that area.

So imagine I have a file of six pages. To the database system, this is a logical view. I start a scan on this file and say: get me page one, page two, page three. The request comes to the buffer pool, and the buffer pool knows only about these page IDs. But at the file system level — or in the database system, if we decide to take that responsibility over from the operating system — we have to figure out how to lay these pages out. Many database systems do take this over from the operating system and lay things out to get certain properties at the database layer. Again, those are things we'll cover in the advanced class; I just want to give you a flavor of it.
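One way to picture that layout decision is as a thin mapping layer: the buffer pool asks only for logical page IDs, and the layer underneath maps each ID to a physical disk and offset. The round-robin policy, disk count, and page size below are assumptions for illustration, not any particular system's layout.

```python
# Hypothetical logical-page-ID -> physical-location mapping under the buffer pool.
NUM_DISKS = 3
PAGE_SIZE = 4096

def physical_location(page_id: int) -> tuple:
    """Map a logical page ID to (disk number, byte offset on that disk)."""
    disk = page_id % NUM_DISKS            # which disk holds this page
    slot = page_id // NUM_DISKS           # how many earlier pages live on that disk
    return disk, slot * PAGE_SIZE

# A scan over a six-page file touches all three disks:
for pid in range(6):
    disk, offset = physical_location(pid)
    print(f"page {pid} -> disk {disk}, offset {offset}")
```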
So imagine I've got three physical disks. I could take those pages and spread them across the disks in the following way: one, two, three, striping in a round-robin fashion — the first page on the first disk, the second on the second, and so on. Now if I start a scan and have worker one work on page one, worker two on page two, worker three on page three, guess what: I get to leverage the entire parallelism of the disk subsystem. The subsystem can serve three pages at a time because each disk can serve up a page, so I get that ideal parallelism. If I don't pay attention to this and all the pages sit on one disk, then no matter how I assign workers to pages, that one disk becomes the bottleneck. This striping scheme comes from an old subsystem called RAID, which has largely been superseded by erasure codes; this particular scheme was called RAID 0, and the technical term for it is simply striping. Okay, so now I get high performance and high capacity, but what do I not get? Durability. If page one rots, I've lost that information.

The other extreme is to say: I really care about durability, so I'm going to mirror everything. I'm paranoid about bit rot, or some part of the storage device getting corrupted, so I make three copies of page one, three copies of page two, and so on — every page has three copies. I can still get parallelism: worker one, you go get your page from here; worker two, you go work over there; worker three, you use this one. So I still get performance, and I get durability, but I don't get my capacity: it's reduced to a third, because I'm making three copies. As I said, these are very simple techniques — mirroring and striping are two of the earliest RAID levels. This all used to be done in hardware; there used to be hardware RAID controllers and device drivers that did it, and that's still deployed in the wild, but the modern way is to do all of this in software. Connecting to the question that was asked earlier: software takes that over, and you get a nicer division of labor between hardware and software. It's much better to write that logic in software and adapt, and in software I don't need a device driver tied to three or four controllers; that software could be controlling a set of disks spread across the planet in a distributed system. You can build all kinds of fancy stuff at that level. These are called erasure codes: they use parity bits, and then there's a question of how many parity bits I need per copy of the data to balance that trifecta of competing goals. Again, there's a whole class on that, and some cutting-edge research happening here at CMU on how to do it well. And I notice this camera has been off the whole time. Yep, go for it.

Question: do file systems like Btrfs and ZFS use erasure codes? Some of them use some version of that; I'd have to look at the details. But erasure codes are used all across cloud file systems — that's how modern cloud file systems are run. Okay, other questions? Okay, I'm going to make sure I get you out of here on time, but as I said, lots of really open and interesting questions here.
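To see why parity-based erasure coding sits between plain striping and full mirroring on the capacity/durability trade-off, here is a toy sketch: a single XOR parity block, in the spirit of RAID 4/5, can rebuild any one lost block at far less capacity cost than keeping three full copies. The block contents and capacities below are made up.

```python
# Toy comparison of mirroring vs. a single XOR parity block.
def mirrored_capacity(raw_capacity, copies=3):
    """With k full copies, usable capacity drops to 1/k of the raw capacity."""
    return raw_capacity / copies

def xor_parity(blocks):
    """Compute one parity block over equal-length data blocks."""
    parity = bytearray(len(blocks[0]))
    for b in blocks:
        for i, byte in enumerate(b):
            parity[i] ^= byte
    return bytes(parity)

def recover_lost_block(surviving_blocks, parity):
    """Rebuild the single missing block by XOR-ing the parity with the survivors."""
    return xor_parity(surviving_blocks + [parity])

data = [b"page0data", b"page1data", b"page2data"]
p = xor_parity(data)
print(recover_lost_block([data[0], data[2]], p))   # reconstructs b"page1data"
print(mirrored_capacity(12_000, copies=3))          # only 4000 units usable
```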
And most of the time, when this is done at the file system — or by one of these entire software storage subsystems — it is transparent to the database system. But as you can appreciate, if the database system knows what's going on underneath, it can probably do a better job with that allocation. So this is one of those things that is still evolving: how transparent should this layer be to the database system? Many storage vendors today build it to be pretty transparent, but that battle is still being fought as people try to figure out where the right control point is. Okay.

All right, database partitioning. This relates to what we just talked about: how should we control the layout of the data? It ties into how many copies I make for durability, and how I balance all of these competing goals. Databases benefit from knowing where the data is, because that lets them exploit locality. So many database systems that work in this parallel, distributed environment will let you specify where the data goes, either through hints or through explicit mechanisms. In an embedded system you'll be allocating it in the local file system, and a Postgres install will ask you which directory in the local file system to store everything in, because it's a single node — that's very simple, but there are more complex mechanisms. Database systems also use the buffer pool manager to map pages, in some sense insulating themselves from that pain: even if the IO layer hasn't given me control over where the pages are, or told me their location, at least pages that are re-accessed over and over again stay in my buffer pool. First, I get a lot more efficiency, because it's far faster to access data in the buffer pool; second, it mitigates some of the challenges that come from not knowing precisely where things are. When we talk about recovery and transactions, you'll see things like the recovery log — the place where we track what changes we've made to records and pages in a transaction. That log needs to hit the storage device before the transaction can be declared done, and there are complications there too: if the recovery log lives in the file system and gets cached by the file system but isn't actually written to disk, we haven't really committed the transaction. So there's an interplay with the file system that we'll cover when we get to the transactions component of this class.

Partitioning is a rich topic. You can take a table, and many systems will let you say: here's the table, hash-partition all of its records based on a certain set of keys. The system may support that explicitly, or partitioning may be left to the application. And when you have a parallel or distributed database system, there's additional richness in the SQL schema that lets you optionally specify some of these partitioning constraints. They obviously have a big impact on the end-to-end performance of the system, but those will be covered in the advanced database class — so again, this is a plug for those of you who are really interested to take that next semester.
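As a quick illustration of hash partitioning, here is a minimal sketch that routes records to partitions by hashing a key; the table, key choice, and partition count are hypothetical.

```python
# Hash partitioning: each record goes to the partition chosen by hashing its key.
NUM_PARTITIONS = 4

def partition_of(key) -> int:
    """Route a record to a partition based on the hash of its key."""
    return hash(key) % NUM_PARTITIONS

def hash_partition(records, key_fn):
    parts = [[] for _ in range(NUM_PARTITIONS)]
    for r in records:
        parts[partition_of(key_fn(r))].append(r)
    return parts

# Toy table of (customer_id, amount) rows, partitioned on customer_id.
orders = [(101, 25.0), (202, 13.5), (101, 7.0), (303, 99.9)]
for i, p in enumerate(hash_partition(orders, key_fn=lambda r: r[0])):
    print(f"partition {i}: {p}")
```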
Okay, so we covered a lot today. All of these techniques address the fact that parallelism is now everywhere: you cannot buy a single-core machine anymore, and in the cloud there's a huge amount of parallelism available — from a single node or server, to a cluster of servers sitting in a rack, to racks sitting in a data center, to data centers spread across the globe. How do we exploit all of it? We need mechanisms for scheduling; we talked about that in the previous class. We need mechanisms to change the operator tree to introduce the exchange operator. We need intra-operator parallelism, which is the big workhorse for getting a lot more performance out of the hardware. And of course, we need to understand how partitioning works so that we can cooperate better with the underlying storage layer. With that, here's what we'll talk about next class: we'll move on to query optimization and walk through the different steps that make a query optimizer work.