So today's class, I'm going to finish up the distributed OLTP stuff that we missed last time. I'll go over again some of the things that people had questions about, and then we'll jump into the last lecture, on distributed OLAP systems. So again, as a reminder, Project 4 is due a week from now. Next Monday in class we'll have Barry Morris from NuoDB come give a talk about their system, and then on Wednesday, December 6th, I'll do the systems potpourri, picking different systems that you guys want to learn about, and I'll give 15 minutes to each of them. We'll also have the final review at the beginning of that class, and I'll be handing out paper copies of the practice final exam. So again, if you haven't voted yet, please go to that URL and vote.

All right, let me get my screen pointer options set up. So last class, people had some questions about the centralized coordinator versus the middleware versus the decentralized approach. So again, we're talking about a centralized coordinator for our distributed database system to do transactions. And the partitions, again, could be set up in either a shared disk architecture or a shared nothing architecture. I'll explain what I mean by logical versus physical partitioning later in the lecture, but basically think of these as the machines over here that control these individual partitions. And the coordinator approach is akin to the TP monitor that I mentioned before: it's a separate piece of software that could be external to the database system, or could be internal to it. But it's this separate thing that the application goes to in order to do anything on the actual database system, right? So in the TP monitor case, I said this is from the 70s and 80s, when people had these enterprise applications with disparate databases, and they didn't have a way to federate them to do transactions across them. So a TP monitor would be the standalone coordinator that could say, all right, you want to do a transaction on this machine or this database, you can do it there; plus you can do another transaction over here, and I'll make sure that they both get committed atomically. So the application server goes to the coordinator and says, I want to lock these partitions. And the coordinator maintains the state information about what transaction holds locks on what particular data items. And I'm showing partitions, but this could be at a finer granularity, like the way we talked about with two-phase locking. So once you get the locks, you get an acknowledgement, and now you're allowed to access the partitions and do whatever it is that you need to do. And again, I said the coordinator doesn't know exactly what's going on inside the partitions, how you actually modify the data, because it doesn't see those queries, right? It just knows that you want to lock something to do something, and it allows you to go ahead and do that. But when you want to commit, you go to the coordinator, and it's up to the coordinator to then coordinate with all the partitions that you modified or accessed and say, this transaction wants to commit, is it allowed to commit? And again, we're doing it this way because each partition doesn't know what you did at the other partitions. It only knows what it did locally.
So it can only make a decision about whether your transaction is allowed to commit based on what it knows, what it sees. And then once everybody commits, you get back an acknowledgement that your transaction is done.

I wanted to contrast this with the middleware approach, where the middleware is essentially a proxy that all your transaction and query requests go through. And the middleware maintains its own lock table, or whatever other information it needs, to figure out who's modifying what data and what partitions. So for this, the application server only sees the middleware machine and sends all its requests there. And then it's the middleware's job to figure out where you actually need to go to do the operation you want to do. And when you want to commit, it's responsible for talking to the partitions and saying, all right, are we allowed to commit this transaction? So then contrast this now with the decentralized approach, where essentially the partitions are going to organize themselves and figure out whether the transaction is allowed to commit or not. And the way you do this is you typically have one partition be the coordinating partition, or the home partition, that communicates with the other partitions, and they decide whether it's safe to commit your transaction.

So now, what I think was tripping people up was when we started talking about two-phase commit. In all these cases, this "safe to commit" part that I'm showing here, this is the two-phase commit that I'll go over again real quickly. So whether you have a centralized coordinator, or a middleware, or a decentralized system, you're still going to use two-phase commit to talk to the other partitions and get them to agree that it's safe to commit your transaction. And when they go ahead and say that's okay, then you actually go ahead and do the commit. So now let's jump to where we left off with two-phase commit. Again, two-phase commit is the atomic commit protocol we're going to use to allow a transaction that has touched multiple nodes to commit atomically across all of them. And I said that two-phase commit is the most common approach used in distributed databases. There is a three-phase commit, but no one uses that. There are also these consensus protocols from the distributed systems world, like Paxos, Raft, and Zab, and you can use these to do the atomic commit, but in practice everyone does two-phase commit. Again, the basic idea of two-phase commit is really simple. The application server says, I want to commit. Whatever node you go to to say "I want to commit" is considered the coordinator, and everybody else is a participant. In the first phase you say, all right, this transaction wants to commit, is that okay? They say yes, and then when everyone agrees to commit, you go back and say, all right, go ahead and commit. And then you get back the acknowledgments that they committed. Now at this point, once you get back the acknowledgments after the second phase, the transaction is considered to be fully committed, and therefore you can return the acknowledgement to the application server to say your transaction has committed.
And if one participant decides that they don't want to commit the transaction, and they send an abort message in the prepare phase, then you immediately go back and tell the application server that the transaction aborted, and then you just tell everyone to abort. Now, what could happen in two-phase commit is that in the first round, everyone says okay here, right? But then in the second round, the second phase, one of these guys could come back and say, I failed, right? I'm not committing. In that case the transaction is considered aborted, and you have to then clean things up. But in practice that doesn't happen, right? And usually the optimization everyone does is, once you get the first phase back and everyone sends their acknowledgement, you don't wait to figure out what happens in the subsequent phase. You just tell the application server that you committed. Technically, this is not correct; it opens up a small window for possible failures. You can sort of get around this in some ways by logging every step you do in two-phase commit, but in practice this is what everyone does, because it's one less round trip you have to do to commit the transaction. So is it clear what two-phase commit is? Again, we have to do this whether we're in a shared disk or a shared nothing architecture, right? Because we have to coordinate between the different nodes.

His question is, does this work when it's decentralized? Yes. We're sending a message back to the application server. [Student asks whether, in theory, any node could act as the coordinator in each round.] Yeah, so his statement is, in theory, in two-phase commit, any node that was involved in the transaction, actually any node in the cluster, could be designated the coordinator, right? And the protocol technically would still work correctly. In practice, though, say it's a decentralized system: you go to one server and say "begin my transaction," and that ends up being the coordinator, right? And again, you typically want the coordinating node to be the one where you did most of the work. That way, most of your queries go to it, it has the most information about the state of your transaction, and it can do things like say, I know I'm not going to commit at all because you did something you shouldn't have been allowed to do at this node, so I'm not going to bother doing two-phase commit, I'm just going to tell everyone that we aborted.

So his statement is, in this case here, this essentially looks like the same thing as the centralized model. Yes, right? You always need a coordinator. You need something to say, all right, I'm in charge of deciding whether we're allowed to commit. The difference, though, is this: this server here executed the transaction and went to node one to try to commit, by telling node one, I want to commit. So node one is the coordinator for this. But another transaction at the exact same time could be going to node two and trying to commit there, and then there's another round of two-phase commit going on where that guy's the coordinator and this guy's just a participant. So any node in the cluster could be a coordinator in a decentralized model, whereas in the centralized model, it's always that single software component that's doing this.
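To make the message flow concrete, here's a minimal sketch of the coordinator's side of the protocol in Python. This is a toy under simplifying assumptions, with no timeouts or crash recovery, and the `Participant` interface is a hypothetical name for illustration, not any real system's API.

```python
# A minimal sketch of the coordinator's side of two-phase commit.
# Toy code: no timeouts, no crash recovery, and the Participant
# interface is a hypothetical name for illustration.

class Participant:
    def prepare(self, txn_id: int) -> bool:
        """Phase 1: vote True (safe to commit) or False (must abort)."""
        raise NotImplementedError

    def commit(self, txn_id: int) -> None:
        """Phase 2: make the transaction's changes durable."""
        raise NotImplementedError

    def abort(self, txn_id: int) -> None:
        """Phase 2: roll the transaction's changes back."""
        raise NotImplementedError


def two_phase_commit(log: list, participants: list, txn_id: int) -> bool:
    # Phase 1 (prepare): every participant has to vote yes.
    if not all(p.prepare(txn_id) for p in participants):
        log.append(("ABORT", txn_id))
        for p in participants:
            p.abort(txn_id)
        return False          # tell the application the txn aborted

    # Logging the decision is what protects the small failure window.
    log.append(("COMMIT", txn_id))

    # Phase 2 (commit). The common optimization from the lecture is to
    # ack the application here, without waiting for these round trips.
    for p in participants:
        p.commit(txn_id)
    return True               # tell the application the txn committed
```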
Okay, right, so again, the thing we've got to deal with is, if the coordinator crashes, then the participants have to figure out what to do. And again, typically what happens in a distributed environment is you don't get an immediate notification that a node went down. Typically there's a heartbeat: are you still alive? Are you still alive? And at some point you don't get the heartbeat from the other machine, and then you time out and have to decide what to do with the transaction. But essentially what has to happen under two-phase commit is you have to have all the nodes still up to commit the transaction. Because if one of them goes down, the whole thing has to abort. And if one of the participants goes down during this, the coordinator just assumes that it responded with an abort, and it goes ahead and aborts that transaction. Right, and typically if the coordinator crashes, then one of the participants would time out, and you could send a message back to the client application server to say your transaction was aborted. But typically the connection between the client and the server just times out, and then the application has to figure out whether that transaction committed or not. Okay, so again, the key thing here is that two-phase commit is considered a blocking protocol, as opposed to Paxos, because once there is a failure, you have to wait around to figure out what to actually do next.

So I don't know how much I want to go into Paxos here. But I'll just say that two-phase commit is actually considered a degenerate case of Paxos. So at a high level, they're essentially the same. The difference is that in two-phase commit, if the coordinator fails, everybody blocks, and you have to have all the participants vote that they agree to commit the transaction. Whereas in Paxos, you just need a majority of participants to say that it's okay to commit or abort something, right? Maybe we'll talk about Paxos next week when we talk about Spanner, cuz that's what everyone always votes for for the last class. But again, the basic idea to understand is that you can use Paxos for things like leader election, and you can also use it as a replacement for two-phase commit. But in practice, everyone does two-phase commit to commit transactions, and then you use Paxos or Raft to do leader election, okay? So again, there's this great paper from Jim Gray and Leslie Lamport, from I think the mid-2000s, where they basically prove, or they show, that two-phase commit is a special case of what Paxos does.

Okay, so now we're gonna talk about replication. So this is really important in a distributed system, cuz obviously now if a machine goes down and we have our database partitioned, say in a shared nothing architecture, if one machine goes down out of a thousand, we don't wanna have to stop the entire system, right? Technically, we should stop if that happens, because now the database is essentially offline, right? Cuz a portion of it is not available, and we could get false negatives or false positives when we execute queries on that data that we can't see. So the way we get around this is by replicating the data at a partition. So there are essentially two types of configurations you can do for this. You can have master slave, also sometimes called leader follower, and you can have multi-master replication, sometimes called multi-home. It's not considered kosher to use the term master slave anymore, and so I'll slip up cuz I'm old and I'll say it.
But people are trying to say leader follower, or sometimes I say master replica. It's the same idea. And then you have two ways to propagate the changes: you can do what's called physical logging or logical logging. And this is gonna look a lot like the recovery stuff we talked about last week, cuz it is essentially the same thing. So essentially what's gonna happen is we're gonna use ARIES to do our logging as we execute transactions on a machine. And then we're essentially gonna transmit the write-ahead log, the changes that we're making, over the wire to our replicas. And they're essentially gonna be in this sort of special recovery mode where they just keep replaying all the write-ahead log entries you send them and applying them to the database, right? It's the same protocol we talked about with ARIES. And so with physical logging, you're essentially sending over the actual bytes that you modified, and the replica will replay them. Logical logging, again, is just sending over the queries, and they get re-executed on the other side.

So let's talk about the different system configurations. The first approach is called master slave, or master replica. And again, the basic idea here is that you have some master node or master partition, and this is where all your writes are gonna go. You can also have transactions read data from this master. But the key thing is that all the writes have to go to it. And any time you modify data at the master node or master partition, it sends the log updates over the wire to its replicas. And because the replicas are technically gonna be slightly behind the master in this case, you're only allowed to do reads against the replicas, right? All updates have to go to the master, and if you're willing to read possibly slightly stale data, you can read from the replicas. So this is good if you have a workload that's not that insert heavy. You can always go to this master node; it can handle everything for the master partition. And then if you have a lot of reads, and those reads can be slightly stale, you can spread them out across the replicas.

In multi-master, what happens is every node is considered to have a complete, exact copy of the data set at all times, at least for this partition. So what happens is you have to transmit the updates back and forth between the different nodes to make sure that they're always in sync. So in this case here, the reads and writes can go to either one. And then you have to use two-phase commit, or whatever commit protocol you're using, to make sure that these things are always in sync, right? The advantage of this is that if you crash in the multi-master case, you immediately know that one of the nodes crashed and the other node has an exact copy without missing any data, because they're always maintained perfectly in sync. And obviously that's much slower. Whereas in the master slave environment, depending on how you propagate your changes to the replicas, you could have a transaction commit on the master, and then it crashes before that change gets propagated to the replicas. In the back, yes. [Student asks: could you have a setup that combines master slave replication with sharding?] Yes, he brings up a very good point.
So in this case here, I'm being really simple and I'm only showing a single partition, right? So his comment was, let's say I have two partitions: this node here is the master for partition one, and this node here is a replica for partition one, but it's also the master for partition two, and this one is the replica for partition two. Yes, you can do that. I'm trying to think of a system off hand that does this. DB2, the parallel version of DB2, might do that. I don't know of any of the open source guys that do this. Yeah, I think that only shows up in the commercial systems. The tricky thing, though, is that if you have transactions that span partitions, then this becomes problematic. Because now where do you go? Do you go to the master for P1 or the master for P2? That's what I'm saying: I don't think any of the open source systems do this, but I could be wrong.

So again, the thing I want to stress here is that in a distributed environment, this could be really slow. So think about this, right? In the master replica setup, I'm going to do my write on the master and then propagate the changes to the replicas. And depending on how much I care about the safety of my data, I may have to wait until this thing flushes over here before I can send the acknowledgement back. And depending on where the replica is located, that might be a long network round trip, right? Originally, and I don't know if this is still true anymore, the way Facebook did replication was in this model here. The master copy of the data would always be somewhere on the West Coast in the US, but then they would have copies of the data in different data centers around the world. So any time you would read your timeline, or whatever it's called, on your Facebook page, you would read from the replica that's close to you. But any time you did an update to your feed, that write would always have to go to the master copy of the database at the data center on the West Coast. And the way they would trick you to make it look fast is that they would maintain a cookie in your browser, so that when you did your write to update your feed, the change may not have made it to the master node yet, because it's far away, but if you tried to refresh the page right away, they would have a local cache to say, here's the thing you just posted. Even though it's actually not really in the database yet. I think since then they've actually moved to the multi-master approach, and they have a way to sort of keep things in sync underneath the covers.

All right, so regardless of what setup you have, I mentioned this briefly, but you can also change whether the master has to wait for the replica to fully commit the data that was sent over before it sends the acknowledgement back to the application. And this is where you're sort of sacrificing the durability or the consistency of your data, because you're saying, all right, maybe I'll commit at the master and I'll tell the application that I've committed your data, but the data hasn't been propagated to the replica yet. So if I crash before that happens, then my data gets lost, right? And for some applications, this is okay. For other things, you don't want to do this.
And so you can tune what the propagation level is for these different replication protocols. So the most protective one is synchronous replication. For this we'll assume we're doing a master slave setup. Here, the master is going to send the updates to the replica, and then it's going to wait for the replica to apply that change locally to its copy of the data and send back an acknowledgement. And then, and only then, does it actually send out the acknowledgement to the client that your transaction is committed. So it looks like this: say we have a simple two-server setup with a master and a slave, right? The application says, hey, I want to commit my transaction. So the master goes to the replica: hey, here's the changes we made, can you go ahead and flush this? Now at this point, it has to stall and wait until that server takes the message you sent over, fully applies it to the database, writes it out to the log, and flushes it. And once that's done, the replica will send back the acknowledgement: I got the data, I flushed it, and we're done. And at that point, the master knows that the data has been propagated to its replicas, and it can go ahead and send back the acknowledgement to the client.

With asynchronous replication, what happens is the application server says, hey, I want to commit. Maybe you send a message out to the replica saying, hey, can you go ahead and flush this? But you don't wait to see whether it actually flushes it. You immediately go back to the application and say, yes, your data is fully committed, your write is done. And at some point, eventually, the replica will get the packet, write out the change, flush it, and it's done, and now it's fully copied. So the way to think about this: these are like knobs you can tune in the database server to say how you want to do replication. And it's the same way you can tune knobs to specify how strongly durable you want the write-ahead log to be. If you want things to always be fully ACID, then you make sure that any time your transaction commits, you do the fsync to flush the changes from the write-ahead log buffer out onto stable storage. But if you don't care that much, or you're allowed to sacrifice and have a small window of vulnerability, then you can be asynchronous, where you don't wait for the fsync and you merely return the acknowledgement to the application that your transaction is committed. Same thing here, right? You're not going to wait for the flush on the replica and the acknowledgement that it's actually durable. You merely go back to the application and say, I got the data that you wanted.

Semi-synchronous is a term that comes from MySQL, and it's sort of halfway in between asynchronous and synchronous. So what happens is the application tells the master, I want to commit. Then you send the message to the replica: hey, flush this. And you do have to stall for that, but the replica will immediately come back and say, I got your message, I will flush it eventually. So the replica sends the acknowledgement back to the master right away, and then the master can tell the application server that it committed your transaction. And then at some later point, it'll end up actually getting flushed to disk, right?
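To put the three propagation levels side by side, here's a minimal sketch. The replica connection object and its method names (`send`, `wait_for_receipt`, `wait_for_flush`) are hypothetical, made up for illustration; the point is only where the master stalls before acking the client.

```python
# A minimal sketch of the three propagation levels, assuming a
# hypothetical replica connection object; the method names here
# (send, wait_for_receipt, wait_for_flush) are made up for illustration.

def propagate(replica, log_record, mode: str) -> None:
    replica.send(log_record)           # ship the write-ahead log record
    if mode == "synchronous":
        replica.wait_for_flush()       # replica applied AND flushed it
    elif mode == "semi-synchronous":   # the MySQL-style middle ground
        replica.wait_for_receipt()     # replica has it, not yet flushed
    elif mode == "asynchronous":
        pass                           # don't wait at all
    # Only after this returns does the master ack the client's commit,
    # so stronger modes trade a round trip for a smaller loss window.
```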
So again, we're sort of shrinking the window of vulnerability for losing data: we know that our packet made it to the other machine, but it hasn't been applied yet, and that's considered good enough. So again, in MySQL this is called semi-synchronous. I don't know how many other systems actually do this. So is this clear? Again, the master replica or master slave setup is probably the most common one. Multi-master is a bit more tricky, and it's less common.

Okay, so I'm going to skip the discussion about the CAP theorem. We can cover that next week. If you understand two-phase commit, then that's sort of the main protocol you're going to use; Paxos allows you to do other things. So just to finish up distributed transaction processing: as I said, I barely scratched the surface on all the things that actually matter, all the things you have to worry about. We could probably teach an entire semester on distributed databases and still not cover everything. But again, the main thing I want you to get out of this is that it's all the same techniques we've covered throughout the semester. But now we have to worry about other things in addition to the regular transaction processing, durability, and so on, because now we're in a distributed environment and we could have failures and other problems. So in practice, it's really hard to get strong ACID transactions in a distributed environment. This is why most of the NoSQL systems don't actually do this, right? And the reason is that it's hard to get right, and it's also hard to get to run efficiently. So they forego transactions and other strong consistency guarantees in exchange for better performance and availability.

So if you want to learn more about distributed transaction processing, and actually see how real systems, both open source and commercial, stack up to all the lofty goals of ACID transactions, you should go check out Kyle Kingsbury's blog on what he calls the Jepsen project. Kyle's a funny guy. He basically wrote a torture device for distributed databases. So he has this nice nifty environment where he can simulate failures and network partitions and all these problems, and you can test to see how distributed databases react to these sort of hostile environments. And what's really fascinating about it is that oftentimes companies make claims about how durable or strongly consistent their systems are. Kyle comes along and destroys them, and they have to go back and change their marketing to say they're actually not as strongly consistent as they thought they were. I will say, though, these blog articles he writes are extremely detailed and methodical. So I cannot recommend them enough, but you should dedicate time to actually understand what he's talking about; I think it's really good. So now Kyle basically has a consulting company where database companies hire him to come on site and torture their systems and see what happens, see how they fail. And it's actually really interesting, because then they figure out, we have these problems, and then they can fix them to overcome his issues. Okay, so any questions about distributed transaction processing?

All right, so now we'll get to what we want to talk about today, and that's distributed analytical processing. Okay, let's skip all this, cuz we covered it.
All right, and again, the reason why I broke up these two lectures is that for transaction processing, it's concurrency control, it's replication, it's two-phase commit; all of those things we have to care about because we want to make sure that our transactions are ACID. In an OLAP environment, we're mostly gonna be read only. We mostly wanna read large segments of the data and do complex aggregations and joins on them. So we have a bunch of different other problems we're gonna have to deal with. We still have to worry about fault tolerance and things like that, as we did in a distributed transaction processing workload. But now we don't need to care so much that our queries or updates are durable, cuz we're not trying to do a lot of them all at once.

All right, so for today, we're first gonna talk about partitioning. Again, partitioning is relevant in the OLTP environment too, but I want to motivate why it actually matters a lot more in an analytical environment. And then we're gonna finish up talking about distributed join algorithms, cuz that's where you spend most of your time in a distributed OLAP system. And we wanna see what cases we have to deal with and why this can be problematic. So as I said before, the idea of database partitioning is that we're gonna split the database up across multiple resources. This can be done across multiple nodes, or across multiple CPU sockets. We're basically gonna split the database up so that we can have different computational units or storage units crunch on different segments of the data at the same time, so we can parallelize our operations. And as I said, in the NoSQL world they call database partitioning sharding, but the idea is basically the same.

So what's gonna happen is that in an OLAP environment, a single query comes in, and we go through the normal planning and optimization phase that we talked about before. But then we break up that query into different fragments that get distributed across the different resources or partitions. And those partitions then compute their portion of the query based on the fragment they're given. And then we have to combine them later on, put it back together as a single result for the application. So if you open up your terminal and type a select statement into a distributed database system, then underneath the covers, it'll go out and actually distribute your query across multiple machines, then put it all back together into a single result, and send that back to your terminal.

So the most naive partitioning scheme is to just take every single table you have in your database and assign it to a different partition, right? This obviously assumes that you're gonna have enough storage space or memory to handle each single table. But it's really the easiest thing to do, because you don't have to worry about figuring out what key to use to distribute your data. So in a really simple example like this, I have two tables. The first table goes to one partition, the next table goes to the other partition, and then I'm done. So now, if I have a query that has to touch two different tables, I do have to move data around or copy data around as needed. But if we're doing updates, all my updates for a single table go to one location.
So the ideal case is a single select statement that only touches one table, because you don't have to do any shuffling or any data movement at all. So this is sometimes done in systems where you don't really read from a table, you mostly just insert into it, like a log table. I know MongoDB can do this, where you can say, all right, one table should be on a single shard or single partition by itself. Then all your inserts go there and you never worry about reading it again. But if you have a complex query that has to join these two tables, then you're going to have to move a lot of data around. So this is typically done for OLTP, not so much for OLAP.

What people normally do instead is called horizontal partitioning. And again, when people talk about sharding, they essentially mean this. And this is where we're going to pick some set of attributes from the table that we'll use to decide how to assign a particular record to a partition on a node. And as I said before, the partitioning can be done either at the physical level, meaning the actual physical location of the data itself, or at a logical level, where the nodes that execute your queries on a particular segment of the table will be assigned that logical partition. So they'll go out to the shared disk and just get the data they know they need for the partition they're assigned to. I'll show an example of that in the next few slides.

So basically, for horizontal partitioning, what's going to happen is we have some table; this one has four attributes. And we're going to pick one attribute to be the partitioning key. It doesn't have to be one, it can be multiple, but for this one we'll pick one. And then, depending on whether we're using hash partitioning or range partitioning, for each tuple we're going to look at the value of that column and decide where to physically store it, what partition to assign that tuple to, right? So in this case I'm doing hash partitioning. I'm going to take every tuple, hash it, and then mod by the number of partitions I have, in this case four. And that tells me what partition each record is assigned to, and then they get mapped to those partitions. If you were doing range partitioning, you could say A to B goes to one partition, C to D goes to another partition, and E to F goes to another partition, right? It's basically the same idea. What your workload looks like will tell you what attribute you actually want to partition on. So some attributes are obviously better than others, right? So say you have an auto-increment key, and your workload is always inserting new records, so the key is always going up. If you do range partitioning on that, what'll happen is all your inserts will go to one partition for a brief period. Then that partition will fill up, and all the inserts will go to another partition for the next period, right? Because it's broken up on those ranges, and your auto-increment key is always increasing in monotonic order. So that would be bad, because now you're blasting everything at a single partition.
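To make the two schemes concrete, here's a minimal sketch. The partition count, the ranges, and the use of Python's built-in `hash()` are all illustrative choices; a real system would use a stable hash function.

```python
# A minimal sketch of hash vs. range partitioning on a single key.
# A real system would use a stable hash function; Python's built-in
# hash() is only used here to keep the sketch short.

NUM_PARTITIONS = 4

def hash_partition(key) -> int:
    # hash(key) mod n spreads keys evenly, so a monotonically
    # increasing key is fine here, but range scans hit every partition.
    return hash(key) % NUM_PARTITIONS

# Contiguous key ranges, [lo, hi) for each partition.
RANGES = [(0, 250), (250, 500), (500, 750), (750, 1000)]

def range_partition(key: int) -> int:
    # Great for range scans, but an ever-increasing key sends every
    # insert to the current "hot" partition, as described above.
    for i, (lo, hi) in enumerate(RANGES):
        if lo <= key < hi:
            return i
    raise ValueError("key outside all ranges")
```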
But if my workload does not update the data that often, and I have queries that need to touch all this data, then maybe that's good, because that'll evenly distribute all the different keys across different servers, and I can get better parallelism in my queries. So typically what happens is the DBA will look at your workload, look at the query traces, look at your database, and then make a manual decision about how to partition things. There are tools that figure this out automatically. MongoDB has this auto-sharding feature where, if your shard gets too big, they basically split it in half and move it across different machines.

Yes, that's a great question. So the question is, in this case here, I'm basically doing static partitioning, right? Because I'm fixing the number of nodes, or the number of partitions, that I have when I mod by n, right? So if I now add a new partition, I basically have to reshuffle everything, because it's now mod 5. So yes, in this particular example, you would have to reshuffle things. So systems like Apache Cassandra, for example, do what's called consistent hashing. And think of it as the hash table logically being a ring. And then when you add or drop a node, you just need to move data between adjacent nodes on the ring; there's a small sketch of this right before we get to joins. I can cover this next week, because if you guys want me to talk about Cassandra, I can cover what they do. So they didn't invent consistent hashing. It came from MIT in the late 90s. If you know BitTorrent, with its distributed hash table and magnet links, it's running the same kind of protocol: consistent hashing. That's a good question.

Okay, so now the distinction between logical partitioning and physical partitioning. Again, in a shared disk system, logical partitioning basically means that the execution nodes will be assigned some portion of the database to operate on when executing queries. The shared disk is gonna have all the data. So let's say we have four tuples here, IDs one through four. And then we'll logically assign these different execution nodes to be responsible for some portion of this data. So if I have a query that says I wanna get ID one, it's always gonna go to this node here, because that logical partition has been assigned to it. Same thing: if I need to get ID three, I go to the bottom one here. So the key thing here is, again, the physical location of the data is always in our black-box shared disk, right? And underneath the covers, the shared disk system, whether it's EBS or a file server, could be doing its own partitioning and parallelism down there. But we don't know that in our database system. When it comes time to execute queries, we just know that one node will be responsible for that portion of the data.

Now contrast this with physical partitioning, as you would have in a shared nothing environment. Now the actual physical copies of the data are assigned to each node. Because again, in a shared nothing environment, the node has the disk, and it's gonna maintain state there. So again, just like before, any time I have a query that says I wanna get ID one, I know I go get it from there, and I get ID three down here. So again, the partitioning could determine where we're actually physically storing the data, if we control where the data is stored, as in a shared nothing environment. Or it could just be how we assign which node should read which data at runtime when we execute queries.
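Before we move on to joins, here's a minimal sketch of the consistent hashing idea from the Cassandra question above. It's in the spirit of what Cassandra does, but the node names and the choice of MD5 are illustrative only.

```python
import bisect
import hashlib

# A minimal consistent-hashing ring, in the spirit of what Cassandra
# does. Node names and the choice of MD5 are illustrative only.

def _hash(s: str) -> int:
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes):
        # Place each node at a point on the ring, sorted by hash.
        self.points = sorted((_hash(n), n) for n in nodes)
        self.keys = [h for h, _ in self.points]

    def lookup(self, key: str) -> str:
        # A key belongs to the first node clockwise from its hash.
        i = bisect.bisect(self.keys, _hash(key))
        return self.points[i % len(self.points)][1]

ring = Ring(["node1", "node2", "node3"])
print(ring.lookup("tuple-42"))
# Adding "node4" only moves the keys that fall between node4's new
# position and its ring neighbor; everything else stays where it is,
# unlike hash(key) mod n, where changing n reshuffles nearly all keys.
```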
All right, so as I said in the beginning, joins are the most expensive operation you can have in a distributed OLAP system. And to no surprise, the partitioning scheme, how you decide to divide your database up, whether physically or logically, will greatly affect the performance of your join algorithms, right? So the most naive approach to distributed joins is to just get all the tables you wanna join onto a single node and perform the join there, right? And this obviously won't work if your tables are really large, and it sort of defeats the purpose of having a distributed database, because you get no parallelism: you always have to shove everything to one node and then do the join. So we need a better way to handle this.

So there are gonna be four different scenarios we could possibly have in a distributed join. For this example, say we have two tables, T1 and T2. The basic idea here is that we're gonna try to distribute the join across all our nodes, all the partitions. And we need to figure out what data we have and what data we need on each node in order to compute our portion of the join. And just like we had in two-phase commit, where we had a coordinating node to commit the transaction, we'll have a coordinating node that's responsible for coalescing the results from all the different partitions that are running our join in parallel, and combining them together into a single result to send back to the application. And this is typically whatever node first got the query request. It doesn't have to be; it could be, though. So the key thing I wanna point out here is that the join algorithm at each partition, what we'll call the local join, uses all the same join algorithms that we talked about before: your hash joins, your nested loop joins, and your sort-merge joins. There's no magic join that we're gonna do because we're in a distributed environment that's gonna make this go much, much faster. It really is just trying to figure out what data we have and what data we need, and then moving things around as needed.

So the first scenario is where you have the two tables, T1 and T2, T2 is smaller than T1, and T2 is just replicated at every single partition. So now, to compute this join, say we have a query where we wanna join T1.id with T2.id. So we partition our T1 table based on the id column, and then T2 is replicated everywhere. So that means each partition can compute the join on the data they have locally, without coordinating with anybody else at all. And then one partition sends its output of the query to the other partition, which combines them together and produces the final answer, right? We can do this because we know that when we join T1 with T2, every partition has a complete copy of T2, so there's no tuple that we could be missing, right? And this is equivalent to putting T1 and T2 on a single node and producing the final answer.

The next scenario, another best case, is where T1 and T2 are both partitioned on the join key, the join attribute. So again, same thing here: we don't have to get any additional data at each partition from another node. They have everything that they need. So we just do our sort-merge or hash join or nested loop join, it's almost always a hash join, to compute our local join output. And then the other guy sends the data over the network to the other one, who combines it together and produces the final result, all right?
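Both of these easy scenarios boil down to the same pattern, which the minimal sketch below shows: each partition runs an ordinary local hash join on what it already has, and the coordinator just concatenates the partial results. Tuples are modeled here as dicts with an "id" join attribute, which is an illustrative choice.

```python
# A minimal sketch of the two easy scenarios: every partition already
# holds all the T1 and T2 tuples it needs (T2 replicated everywhere,
# or both tables partitioned on the join key), so each one runs an
# ordinary local hash join and the coordinator concatenates results.

def local_hash_join(t1_tuples, t2_tuples):
    # The same single-node hash join from earlier in the semester:
    # build a hash table on T2, probe it with T1.
    table = {}
    for t in t2_tuples:
        table.setdefault(t["id"], []).append(t)
    return [(a, b) for a in t1_tuples for b in table.get(a["id"], [])]

def distributed_join(partitions):
    # partitions: a list of (local T1 slice, local T2 slice) pairs.
    result = []
    for t1_slice, t2_slice in partitions:
        result.extend(local_hash_join(t1_slice, t2_slice))
    return result   # the coordinating node's coalesced answer
```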
So these are the easy ones, all right? Things now get tricky when things are not partitioned so conveniently. The next scenario is where T1 is still partitioned on id, and we want to join on id, but now T2 is partitioned on some other key that we're not joining on. In this case here, it's the value attribute. So in this case, we figure out which of T1 and T2 is smaller, and whichever one is smaller, we're going to copy that smaller table, move that data around to the different partitions, so that we can then compute the join. So let's say here that T1 is larger than T2. We'll end up making another copy of T2, a portion of T2, at each partition, where it's essentially partitioned on the thing we want to join on. So what happens is all the different partitions have to broadcast out the tuples they have for T2, and they send each one to the right partition based on T1's partitioning key. So again, say I'm using range partitioning for T1, and my range is 1 to 100 for this one. So over here, for every T2 tuple, I would have to scan all of them, look at their id value, look at the partitioning scheme for T1, and figure out what partition to send each one to. So after I do my copy, my broadcast, I end up in the same case I showed you in the second scenario, where now on a single partition I have both the data for T1 and T2 with the same ranges of the id attribute that we need to join. So once I have this copy of T2 partitioned on id next to T1 partitioned on id, I can compute the join at each partition, then send the results over and coalesce them.

So now we're getting progressively more expensive here, because we're moving data around. And if it's a couple of megs, who cares, but if it's billions of tuples, this starts to get expensive. Furthermore, I'm also only showing two partitions here. What if I have a thousand now, right? Now I have 1,000 nodes sending messages to another 1,000 nodes to copy data around. Yes, so your question is about these guys? So we start with this, right? We recognize that T2 is not partitioned on the thing we want to join on, right? ID, that should be ID there, that's a mistake on the slide. It's not partitioned on id, so we make a new copy, a second copy of the table, where it is partitioned on id. We have to send all our tuples around, right? And again, think of this as a temporary table, right? It just gets stored in the buffer pool like everything else. If it gets swapped to disk, that's what we have to do, because we don't have unlimited memory, right? But it's temporary; it's only used for this one query. Then I do my join locally at each node, each partition. And now each partition has a slice of the join result. So on P1, it has the join result for T1 joined with the T2 copy partitioned on id for these ranges, and then the same thing on the other side.
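Here's a minimal sketch of the shuffle step in this scenario: every node re-partitions its local T2 slice on the join key and "sends" each tuple to the node that owns that key. The sketch uses hash placement rather than the ranges from the slide, and the in-memory inboxes standing in for network sends are an illustrative simplification.

```python
# A minimal sketch of the shuffle step: T2 is partitioned on the
# wrong attribute, so every node re-partitions its local T2 slice on
# the join key and "sends" each tuple to the node that owns that key,
# here using hash placement rather than the ranges from the slide.

def owner_of(join_key, num_partitions: int) -> int:
    # Must be the same placement rule that T1 already uses.
    return hash(join_key) % num_partitions

def reshuffle_t2(local_t2_slices):
    n = len(local_t2_slices)
    inboxes = [[] for _ in range(n)]    # stand-in for network sends
    for slice_ in local_t2_slices:
        for t in slice_:
            inboxes[owner_of(t["id"], n)].append(t)
    # Each inbox is now a temporary copy of T2 partitioned on id, so
    # every node can run the local hash join from the earlier sketch.
    return inboxes
```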
Where would this be stored? This is just the output of the query; again, you could stream it. Yes, so the question is, where are we gonna store the output of the join at each partition? So, yeah, the naive thing is that you could just materialize the result, store it in your buffer pool, and then once you have all the results for your join, the join's done, and you transmit it to the coordinating node that's gonna combine them together, right? Could it be another node? Yes, it could be another node involved in the query, or it could be another node that's designed to do aggregations like this, right? And what I thought you were actually asking is, do we have to materialize this, do we have to wait until we get the final result before we send it? And the answer is no. You could just have a thread say, every single time I get a tuple, I'll just stream it over the network and have this guy start coalescing it as needed, right? It doesn't matter. Again, the expensive part is always the broadcasting. Yes? Absolutely, so his question is, do you ever run into the problem where the coordinating node does not have enough memory to store the entire result? Absolutely, yes. But think of it, we would have this problem in the single-node case as well, right? If the output result's too big and you run out of memory, then the thread that's sending back the result to the application is essentially doing a sequential scan on it, sending back the tuples, and things get swapped in as needed, right? So, I don't talk too much about MapReduce, but in a MapReduce environment, like something like Hadoop, they don't actually always coalesce the result into a single answer. You could have each of these guys write out to HDFS directly, right? And then the client has to put everything back together. But typically, again, if you open up the terminal and send your select statement, you expect to get one piece of output back.

All right, so in this case here, again, the bottom table was not partitioned on our join attribute, and the top one was. What's obviously even worse than this? Where both tables are not partitioned on the join key, right? This is the worst-case scenario. So let's say now T1 is partitioned on the name attribute, and T2 is partitioned on value just like before. So we're essentially gonna do that same broadcast we had before, but now we're making a complete new copy of both of the tables. So this is sometimes called reshuffling. And again, you're just essentially repartitioning the data so that you can then do your join locally, right? So, same as before, we send all the tuples with ids in these ranges to the different partitions, same thing for value. Then, when it's done, we can do our join and then coalesce the results.

So her statement is, is this just not like MapReduce? Yes. There's a paper, if you Google, I think, "A Comparison of Approaches to Large-Scale Data Analysis" or something like that. If you read that paper, it basically lays out how MapReduce is essentially the same thing as the joins and aggregations from distributed databases from the 1980s. People don't like that paper; at least when it came out, they did not like it. I helped write that paper. But yes, it's a very astute observation. MapReduce essentially is just aggregations and reshuffling in the same way. And nobody really writes raw MapReduce queries anymore. Everyone uses the specialized systems like Impala, Stinger, and Presto that can do things on top of HDFS, and they can do more efficient query plans than what a naive MapReduce program can do.
But one thing, I mean, everyone should know what MapReduce is, right? Hadoop, it's a big data technology; everyone should be aware of it. So one important distinction between what MapReduce does and what I'm showing here, and it's sort of related to his question, is that this output here of the join is not actually durable. Meaning, if we crash while we execute this query, that intermediate result is lost. But in MapReduce, every single time they do the map and then the reduce, they're writing out to HDFS, multiple copies of the data. And they do this because if one machine goes down while you're processing your query or your program, then another machine can pick up where you left off, right? It's slow, but it makes it so that you can be more fault tolerant. And they do that because they're targeting environments with thousands of nodes, whereas in most cases the distributed databases are fast enough that you typically don't run with 1,000 machines, right? You run with fewer machines, and therefore failures are less common. I think in some systems, okay, I can't think of names off hand, in some systems now they do actually materialize and store on disk some of the intermediate results, so that things are durable. But in practice, nobody does this, right? If you crash one minute into this join, the whole query just crashes. It's not a big deal; you just execute it again. In the old days, when your queries could take hours to run, when it wasn't a column store, that'd suck. But now the systems have gotten fast enough that it's okay.

So that's pretty much all I'm going to cover for OLAP systems. Building an efficient OLAP system is obviously hard. The network is always the main bottleneck; there's no way to get around this. But all the things we talked about before, about trying to reduce the amount of work we do in a single-node system, definitely apply here. So you wanna make sure you pick the right join ordering for your query, so that you filter things out as early as possible and you're not moving all this data around, right? Or, there might be a join ordering that's less efficient for your database because your tables are partitioned one way, while for the same query on another partitioning scheme, another join ordering might be more efficient. So there's just more information we have to take into consideration when we do query planning and scheduling and execution than we'd have in a single-node system.

Yes? [Student asks whether OLAP queries are usually run with a lower isolation level.] So the question is, in an OLAP environment, are the queries typically run with a lower isolation level? Well, if all your queries are read only, isolation levels don't mean anything, right? But otherwise, typically, yes. Typically what happens is snapshot isolation is good enough, right? I just wanna read a consistent snapshot. And it's often the case that changes are pushed into an OLAP system in batches. Like, at the end of every day, take all the market data I've collected and all the trades I've made, and shove it into my OLAP system. So now, if you're running your query during the day, you're gonna see yesterday's data and not today's data. That's not great, but for some people, that's okay. In HTAP environments, that's when things get more tricky, when you're actually doing fast transactions and analytics at the same time. And again, snapshot isolation is usually good enough.
But another way to think about this, too, is that OLAP queries don't really care that much about absolute correctness, right? Like, say I have my database and I'm keeping track of everything everyone's bought on Amazon. Do I really care whether it's 1,000 or 1,001 of, I don't know what the hot toy is this year, some particular pillow? Do I care about getting the exact count of that? No, right? And so for these kinds of queries you usually don't care. And there's a whole other area called approximate database systems that can use sampling and other techniques to estimate what the answer should be. And for some workloads, for some environments, that's good enough. I don't need to know exactly the number of items bought within a one-month window, as long as I'm close enough, as long as I'm not off by orders of magnitude. But typically, yes, you run your queries at a lower isolation level.

And then the other key thing that I want to stress in an OLAP environment is that you typically want to push the query to the data as much as possible, right? Like in my first example, where I said, all right, I'll try to pull all the data I need to do my join onto a single node and then compute the join once I have everything: that obviously is really inefficient, because you lose parallelism and you're sending a lot of data around. But it'd be much, much easier to send a small query fragment, which will be kilobytes, maybe at most megabytes, over to the different nodes and have them execute in parallel, right? So it's way, way more efficient to send the queries out rather than send the data out. So typically, in OLAP systems, you try to minimize the amount of data you have to push or pull around the system. You want to do the processing on the node where the data already is whenever you can.

OK, so this covers everything that would be included on the final, and also the core database topics you need for the semester. Again, there's a bunch of other advanced topics that we'll cover in the spring semester, and then whole other areas of databases where people are doing interesting things. But everything I talked about today, and in this entire semester, is what you need to build a basic database system. And everything you do on top of that is just sort of extra goodness. So on Monday next week, again, we'll have Barry Morris come talk about NuoDB, and then we'll have the final review and the systems potpourri next Wednesday. And obviously, everything covered next week will not be on the final at all. OK? Any questions? Yes. Your question is, can all queries be distributed across all nodes? Yes. Ideally, though, if you don't have to, you don't want to. Yes. So I think your question is, when a query shows up, should you always break it up into different fragments that get sent to all the nodes? Is it always viable to do this? Setting aside things like security, which we don't care about here, is there any possible query that, for correctness reasons, has to be run on a single node? No. That's sort of the beauty of the relational algebra: you can break these things up, distribute them, and sort of put them back together. All right? OK. All right, we're done, guys. Awesome. See you on Monday. And I will send out additional information about what's expected for the extra credit, probably tomorrow.