 So today's class is the first part we're going to do on distributed transactional database systems, or distributed transactional database systems. I modified the schedule this morning because it was originally going to be distributed LTP, then distributed OLAP, then Cloud Systems. I've decided that I'm going to extend what we're going to talk about today now across two lectures because there's just so much we want to cover. Then when we do talk with the OLAP stuff next week, if we come back from the break, I'll talk about the Cloud stuff that I would have mentioned in a separate lecture. So I think this is the right pacing because last year we rushed too much. So this is the final docket for everything that's due in this class. Project 3 is due today. Homework 5 will be going out today or tomorrow, and that'll be due two weeks from today. Project 4 will be going out today or tomorrow, and that'll be due on the 10th. The extra credit assignment is due on the 12th. I extended that by one more day. Then our beloved final exam is Sunday morning, not in this room, some other room at 8.30 AM. I had no say on this. I'll bring coffee and cigarettes for people if you want to. I know they give out candy, but that's at 8.30 AM, I don't want any candy. So this is Nick, we're almost done for this semester. The other things that are coming up to be mindful of is that, we're having the lead engineer from VoltDB come to give a guest lecture in our class on Monday, December 3rd. So I'll send a reminder about that everyone should attend. Then on Wednesday, December 5th, so after the VoltDB guest lecture, I'm going to do the final exam review. But I'm also going to do what I do every year is what I call the systems potpourri. So we spend a lot of time mostly talking about Postgres and MySQL and databases at a high level throughout this entire semester. But on this last class, what you can go do is if you go to that URL, it'll be a Google form with a bunch of different database systems from the dvdv.io website, and you guys can vote for whatever system you want to learn about. Then I'll pick whatever top three systems get the most votes, then I'll come and do like a crash course, like spending 10-15 minutes for each system, just to tell you what are the things you need to know about it, what makes it different, what makes it special, or what makes it suck, okay? So again, the idea is if there's some system that you use at an internship or you think you want to go give a job at a company that's building it, and you want to learn more about it, use this opportunity to have me come teach it. Don't use it to cheat on your extra credit, if you're writing the article on system X and you let's have any vote discuss system X so I can figure out how to write my article for extra credit. Don't do that. So vote once, but you can vote for as many systems as you want. Then the other thing I'm also doing, and I'll post this on Piazza as well, is that the extra credit is due on Wednesday, December 12th, but one week before on Wednesday, December 5th, if you want to submit it early and get feedback from me or one of the TAs about what you're doing right and not right, what you should correct, what you'd fill out more, then I can give you feedback and that way you can ensure that you can get full credit on the assignment, okay? Because in previous years, you may not know what I'm expecting because it's different from all the other projects and homeworks you've done in the class and it's a subjective thing because it's based on how well you find information and how well you can write about what the systems are doing. So I'll provide you with feedback to say, yes, you're doing this correctly or you're doing this wrong, okay? Then for the upcoming database talks, we have so next week, if we come back to break on November 29th, we have the founder of Swarm64 coming out from, I think it's either London or Norway, I can't keep track and these guys are really interesting because this is actually now a specialized hardware accelerator that they've developed specifically to do database operations on hardware. So instead of using like a general-purpose GPU or using the CPU, they'd actually have a custom chip that they use for their system. So think of this as like the Google's TensorFlow processing unit, TPU, but specifically for databases instead. So that's very interesting and then also in the same day when the VoltiV guy gives a guest lecture, he's going to give a more researchy talk the same day at 4.30 PM with a database research group meeting, okay? So any questions about what was expected from you from now to the end of the semester? Okay. So let's jump right into the material. So as I said, this part of the semester now we're going to start talking about distributed databases. We already talked about parallel database of the four and we said the distinction between the two of them was that a parallel database was one where you're going to assume that the resources you're communicating with like the other CPUs, the other nodes in your cluster are connected to you with a fast communication fabric and that that communication fabric is going to be reliable and that you don't have to worry about messages getting lost, nodes going down, right? You can write your algorithms, you can build the system architecture assuming that whoever you're handing off work to or reading writing data from is going to be there throughout the entire execution of the query, right? Again, the simplest parallel machine to think about is a single-wrapped unit with two CPU sockets, right? I can write my hash-joint algorithm talking to the two CPU sockets without worrying about one socket catching on fire and disappearing. Because if that happens then my sock is going down too. So now we're going to talk about distributed databases. For this one, we can't assume now that the other nodes that are in our database system are reliable and that the communication between them is going to be fast. Fast would be like on the same motherboard, right? Now you might be in the same rack but you're on a different machine and now you got to go over ethernet cable. So in this environment, we need to build our system out in such a way that we can be more fault tolerant. Meaning if one node goes down, we don't take down the whole cluster. Now, it may kill a query that we're running right now, depending on how fault tolerant we are, but we're not going to grind to halt if one machine goes down. So to build now a distributed database system, we're going to use all the things we talked about through the entire semester now as building blocks to build out a multi-node architecture. Now, maybe thinking like why am I waiting till the end of the semester before we start talking about distributed databases. I want to start talking about distributed databases now. The problem is you can't talk about distributed databases until you understand what the hell is going on in a single node. So all the same issues that we have before, how to do query planning, how to do optimization, how to do logging recovery, how to do concurrency control, we're still going to have to do that in our distributed system. Just because we have more machines doesn't make these things magically become super easy to do. It's actually the opposite, become harder to do. Because again, we can't assume that our machines are always going to be reliable. So the way I've set up the next three lectures is that I'm splitting it the discussion up in between transaction processing systems and analytical processing systems, OOTP versus OLAP. Now, we have some stuff we talk about in this class, that'll be relevant or applicable to both types classes of architectures or systems. But when we start talking about distributed concurrency control, that matters more for transaction processing because analytical queries are read only. So there's a reminder of what the difference between OOTP and OLAP and what the work on looks like and how we're going to design our system to account for these differences. In an OOTP environment, we're going to be executing short-lived transactions that are going to be read-write, we can read data and modify data from the database. But the amount of data we're going to read per transaction, or modify per transaction is going to be small. Again, using Amazon as an example, when I go to the Amazon storefront on the website, I log into my account, I add things to my cart, I make purchases for my credit card. I'm operating on just my records. I'm not doing large scans across all the customers in the state of Pennsylvania. Again, these operations are very repetitive as well. Contrast this with an OLAP system exactly as I said. Now, we're going to run long-running read-only queries that want to extrapolate new information from the data you've collected from the OOTP side. So I want to look at all the customers in the state of Pennsylvania, what items they buy mostly when it's a rainy day on a Monday in November. So I want to scan a bunch of customers, do group buys and aggregations, and extract this information out. So we're going to be complex joins across multiple tables, and the queries that we'll be executing may not be similar from one after another. It might be somebody opening up a terminal on writing a raw query or an ad hoc query, or it might be somebody opening a tool like MicroStrategy or Tableau and playing around with a visualization. These queries can be different from one after another. So based on this economy now, now we start talking about how would you build a system to handle one versus the other. The main difference as we see as we go along, once we start getting into the OOTP-specific aspects of a distributed database, is that because we have to do read and write transactions, then we have to spend more time talking about how we do fault tolerance with concurrency control when we touch data across multiple machines. When we talk about analytics next week, it's really about doing joins across multiple machines, because that's what we're going to spend all our time. So today we're going to start talking about the at a high level of the different type of system architectures you can have for distributed database, and then we'll talk about the design issues, what you have to consider when you actually build one of these systems. Again, these first two categories, actually the third one as well for partitioning schemes, these are germane across both OOTP and OLAP systems. But then we'll finish up talking about how to do distributed concurrency control. Again, applying the techniques we spent, two or three weeks talking about, but now putting them in a distributed operating environment. So the first thing we need to discuss is what the system architecture is going to look like. The way to think about this is that the system architecture dictates how the shared resources will be exposed and used to other processing units or CPUs in our distributed system. The terminology is going to be slightly vague here, what's a node versus what's a CPU, but it'll make more sense as I go through the examples one by one. But the way to think about this is that the only thing that can actually do computation, obviously is a CPU, and the new computation on things in memory. But where do we get that data that we have in memory? Well, we're going to get it from disk. So the system architecture is going to tell you how you actually access memory and disk. Whether you're accessing disk that's local to you, remote to somebody else, memory that's local to you, or local to somebody else, right? This is what the system architecture is going to define, and it's going to now tell us also to how we're actually going to coordinate the operations between different CPUs, and how they're allowed to go and get and retrieve data. So everything we've talked about so far in this class, is in what is called a shared everything architecture. The way the world looks like in this environment is that, I have my CPU, I have my local memory, and I have my local disk. These are the only things I know about. I only know how to get data that's from that disk, and I only can put it into memory that's local to me. The sort of thing of this is this stack as a single node. So the three different types of distributed architectures you can have are shared memory, shared disk, and shared nothing. So shared memory looks like this, where I have now my different CPUs, my different processing units, and they're going to access a single global memory address space, and they're going to communicate with that address space over some communication fabric, networking fabric. I'm not defining what this is. It could be TCP IP, it could be InfiniBand, it doesn't matter. All the way to think about this is that, again, that this memory is somehow not local, could potentially not be local to my CPU, but I don't know this, because I just see a single address space. Then the disk down below, again, is just some single thing that I access. I don't know where it is. A shared disk architecture is where now each processing unit, each CPU has its own local memory, but the primary search location, the database is now down on some shared disk. So the way to think about this is that, in this case here, say I'm building the lock manager, the lock table as I'm running transactions. The lock table is where? It's in memory. So in this environment here, there's a single address space of memory across all these different processing units, so everyone can read and write into the same memory location for my lock table. But that's all hidden, how I actually do a write into the lock table is hidden by this messaging fabric. But I know there's some of the processing units, I don't know where they are, but they could also be writing to the same lock table as me. In this environment here, if I want to read and write something in memory, I know that my memory is not visible from these other processing units. So now I need to send messages to them to say, hey, by the way, here's some state I'm maintaining in memory that's local to me. But now if I want to read and write the data on the database, that's all now going to a single location to a shared disk. If this process unit writes data here, then this guy, when he goes to read that data, he'll go down and see that change. But anything I have in memory is going to be local to me. The last type is called shared nothing, and this is where each processing unit has a local disk and has local memory. The only way to communicate with the other processing units in my distributed cluster is going over the network. So if I want to read data that's just like before, if I'm maintaining a lock table in memory for this processing unit here, nobody else can see that. So in order for them to know about it, I have to send a message over the network to say, oh, by the way, here's what's in my lock table, here's what I'm doing. But then likewise, if I write data that now goes to my local disk, these other guys can't see this unless they go over the network and say, hey, do you have record one, two, three, what's the current value of it? So I'll go through these two. We're going to spend those for our time on shared disk. Remember, I'll talk a little bit, shared disk and shared nothing. I'll talk a little bit about a shared memory. But as far as they know, no data system actually implements this. Because this is something you would have at the operating system level. So again, shared memory is when you have the CPUs that could be running on a different machine or the same machine, it doesn't matter, but they're communicating with each other through some fast interconnect over a global address space in memory. Now because it's a global address space, every CPU has a global view of the current state of the database or everything that's in memory. So the way to think about this is, you could have your process running SQLite, that could be on one CPU and you have another process running on SQLite running on the CPU, and it knows how to communicate them as if they're running in the same process. So as I said, nobody actually does this because this sort of messaging fabric or distributed shared memory, this is something usually that's provided by the operating system level, the kernel level. You have to make modifications to your database to be aware that, oh, I have these other CPU instances that are running other processes that are part of my distributed cluster. It's not like you just take MySQL and plop it down and tell it to use this global address space here, and it's just going to work if you pop up MySQL at different locations. Because each MySQL instance is going to think that it's the master of the universe, that it has control of everything, but now there's some other process running to the same address space, and that's going to cause problems. So this is just not magically taking your single node database and using shared memory and everything works. You still have to modify the data system to be aware that there's other instances of the software running at other CPUs, but the way to communicate is hidden through this messaging fabric in the network. As I said, as far as I'm aware, no commercial or open-source data system actually does something like this. What's more common is shared disk. Again, the idea here is that each CPU is going to have a single logical disk that they will read and write data from. Again, think of this as I have my database process. Just look at four, I have to be aware that there's other database processes running on different nodes, but they're all communicating to the same disk. They're all reading ready data from the same disk. Each CPU has its own private memory, and that means in order to coordinate any state we're maintaining in memory, we have to go over the network and pass messages back and forth. So this is very common now in Cloud systems. So you can think of this shared disk, it's just your favorite distributed file system, HDFS or EBS on Amazon or S3 on Amazon. So there's a bunch of systems now that are aware that I'm reading ready data from HDFS, aware the limitations and properties of HDFS, and they can design their architecture accordingly. So the most probably famous one that does something like this is Oracle Exadata and RAC go pretty far back. There's a bunch of systems based on HDFS, Splice Machine, Stinger, HBase, Snowflake and Aurora built on top of S3 or EBS. Again, there's a single location of the disk that you multiple nodes can read and write to. So let me give it an example. So the way to think about this is that again, these nodes are stateless in that the consider the database is the state. If we shut the system down, we want to preserve that state, these guys are stateless. They just have local memory that they're just using as the buffer pool, the same way that we use a buffer pool in a single node. Then we have our shared disk, it's some storage device we don't know, we don't really care that all these different nodes can read and write to. So we won't talk about this now, just yeah, we'll talk about this in a second, but assume that the application server knows how to communicate with both of these nodes to go get the right data that it needs. So let's say once I say give me the record that has for ID 101, somehow the application server knows it goes to this node here, and then the same thing we did before on a single node system, we had a page directory where we do a lookup and say ID 101, well that's going to be in page ABC. So let me go out to my shared disk storage, go get page ABC, bring into my buffer pool memory, and then do whatever computation that I want to do for the query. If I want to get ID 200, same thing that goes to this node here, this guy knows that it's page XYZ and goes out to the shared disk and get it, yes. Do I need a load balancer or a manager in front of those nodes? Perfect. So he says, do I need a load balancer or a manager, a coordinator if you will, in front of these two nodes? Potentially, yes, not always, we'll talk about that in a second. Right? Because his question is essentially saying, this guy somehow the application is ever new to good, for 101 go here, this guy wants 200, so he goes there. How do they know that? We'll talk about that in a second. But this is the idea I want to stress here is that this memory, there's no database storage at these nodes. Everything's always out on this shared disk. So I can have page XYZ in my buffer pool cache, so the next time somebody asks get ID equal to 100, I know that I don't need to go out to just get it, I already have it. But I can kill this thing, and I'm not corrupting the database because that's stored out here. So the big advantage as I was saying now, the big advantage of a shared disk architecture is that these guys again are stateless. So if I now want to scale up my compute layer, I want to add more nodes so I can execute more queries in parallel, then I just bring up a new machine. Internally, I somehow update everything so that if I want to get now ID 101, I go here instead of here, and it does the same thing, it goes out the shared disk and brings it into the buffer pool. But I didn't have to move really any data. All I need to do is update the routing table and say, all right, ID 101 goes in the middle now. So when everything's read only, this is fine. Where things get tricky now, if you want to support updates in this environment, because now if I'm not careful and if I'm allowing any node to go read any data that it wants, if I say update 101 here and I want to write it out to the disk, then I potentially need to send messages now to these other guys and say, hey, if you have 101 in memory, here's the new version of it. The way to avoid this is to do strict partitioning so that no other node could ever look at 101. You know only the top guy can do this. But in general, if you want to allow anybody to read anything, you have to maintain this state in memory to send past messages to say, hey, I have the new version. Because otherwise what's going to happen is that if the next guy tries to read 101 and maybe goes down here, if he doesn't know that this guy updated and I get it in between the updated happens here before I write it out the disk, then I may end up going out the disk and reading the older version which may not be what I want. The other thing I want to say too, I sort of animation out a little out of order. But the advantage of this approach as well is like, say this is using Amazon EBS or using HDFS. Underneath the covers, if I want to now scale up my storage capacity, then I can just add new disk at the storage layer. I don't change anything in the front end because they don't know, they don't care that underneath the covers, now I have more disk, I have greater capacity. Because these guys are stateless. They're stateless in the terms as I'm executing queries there's some state. But I can kill them and then go away and everything still persisted out here on the shared disk. Yes. You mentioned the string partition, what are you, a certain range of partitioning, pretty complex at one time, but you only have one node for that partition. So he says, what would happen if that there's a partition that is hot? Meaning like a lot of queries are trying to read and write to it. But now I'm maxing out the capacity of what I can do on a single node. How do I handle that? So typically what happens is you try to isolate the hot items, and then you basically scale up not out. So this would be scaling out. So I have two nodes and now I go to three. I'm scaling out horizontally. Scaling up would be this node is getting all the traffic. I can't split that data item up and put across multiple machines because that doesn't make any sense. Because now I may be trying to update the same thing on two different machines. I ended up going to be worse because now I need to coordinate across two machines. So you basically say, all right, well, this node is usually like a Twitter, like there's a whole separate machine for Justin Bieber, whole separate machine for Donald Trump for better or worse. Because everyone's trying to read his data, so you can add more capacity for just his partition. So you can say, all right, everything's trying to go to this node here. I can't split that data item up because it doesn't make any sense. So I'll just add more capacity to this single node here. That's a very common approach. Do not want to stall when there's update. So the alternative to share disk is shared nothing. This is what most people think of when you think of a distributed database. You think of this model. Again, you have on a single node, you have your local disk, you have your local memory, and you have your local CPU. That CPU can access that memory and disk directly on its local box without sending any messages over the network. But if it needs to now read data or read state or communicated state with other nodes in the cluster, you have to go over the messaging fabric. You have to go over the network. So the benefit of this is that in theory, it's easy to increase capacity now because I just add new machines and now I'll start reshuffling the data. Where things get really tricky though, is that you now need to ensure consistency across these different nodes. Not only do I need to coordinate on the state of memory for each node, now I need to coordinate on the state of the disk. In the shared disk model, everyone was going to one location. In this approach, everything is just disaggregated. So yes, capacity would either be, I want to add more memory, I want to add more disk, I want to add more CPUs. Because again, it's like, well, we'll talk in a second. Shared disk actually sometimes is actually easier because I can scale up the compute layer and the storage layer independently. I can add more compute nodes without modifying the disk looks like. Yes, I would say it's easy to increase capacity, but not easier than shared disk. I think actually shared disk is easier. So again, this is what people most people think of when you think of a distributed database. There's a ton of systems that are based on the shared nothing architecture. When my advisors, Mike Stonebaker wrote a paper in 1980, whatever, espousing the benefits and the superiority of shared nothing, of a shared nothing architecture, of a shared disk. I actually think the trend is actually in the opposite direction now. I think shared disk is going to be the prevalent architecture just because of the cloud environment. You can have EBS. EBS can scale out infinitely in theory from your perspective of your database. So let's look at this example here. So again, all in a single node, we're going to have a CPU, we have memory and have disk, and it's going to have some portion of the database. Whether it's all of it or a fraction of it, we can talk about it in a second. But in this example here, let's just do simple range partitioning. So for our key space on ID, we'll say the node at the top has one to 150, the guy at the bottom has 151 to 300. So again, ignoring how the application knows to go to this machine, if it wants to get record 200, it just goes to this machine and then goes and gets all the data that it needs there and then sends back the response. But let's say now if I want to get 10 and 200 and I send the request to the guy at the top, and now on need the covers, the system will know that all right, I don't have the top node knows it doesn't have 200. It doesn't have a shared disk or it only has a portion of the data. So it'll have to send a message now over the network and say, hey, I know you have 200, go get, please go get it for me. If I want to increase capacity, now I actually need to reshuffle data. So let's say that I want to add a third node in here and now I want to make sure that each node has an equal portion of the entire database. Then now the top guy and the bottom guy need to reshuffle and send some fraction of the data that they have into this middle node here. So again, in the shared disk environment, I can scale up the compute node independently of the disk because they were sort of separate layers. In this environment, if I want to add new CPU, then I have to move data around. When I add more disk, I have to move data around. So is this, yes? So he says, what's the difference between this and data is partitioning? This is data is partitioning. This is data is partitioning in a shared nothing architecture. We will talk about data is partitioning in the next couple of slides. Okay, so distributed databases are not new. They're old, they go back to the 1970s. So all the sort of famous data, these people that built the early first node systems, they've all built some variant of them for in a distributed environment, distributed operating environment. So Mike Stonebreaker at Berkeley, he was the guy that built Ingress and Postgres and a bunch of other stuff. In 1979, they have a tech report at a Berkeley that describes a system called Muffin, which is a distributed version of Ingress. I asked him one day what Muffin actually stood for. He stood for Muffin and Ingress. So that one was a prototype that never made out of a lab. It wasn't clear whether they actually built that. SCD-1, the system for distributed databases one, was built by Phil Bernstein. He was the guy that did a lot of early work on country control and distributed country control. He built this when he was at the computing, construction, and consortia or whatever. It was some old computer company that got bought out and doesn't exist anymore. This is the early work he did building the first true distributed database in the late 1970s. A lot of the similar work that he did on distributed country control protocols came out of this project. That system doesn't exist anymore. That was just a prototype. System R-Star was the distributed version of System R built by IBM in the early 1980s. And this was led by Mohan, who was the guy who did the airy stuff we talked about last class. Gamma was a distributed database machine out of University of Wisconsin built by David DeWitt. And the non-stop sequel, this non-stop sequel is the only commercial one out of all these. Jim Gray was the guy that was at System R, did the early work on two-phase locking and a bunch of other stuff in System R at IBM. He left IBM, went to go work at a company called Tandem that was building these super resilient and fault-tolerant distributed machines called non-stop. And then they ended up building a sequel engine to run on the non-stop environment called non-stop sequel. So this actually still exists. HP bought them at some point in the 90s. And there's a lot of people, a lot of older enterprises are actually still running non-stop today. And it's crazy the kind of shit they would do in here to make this thing super fault-tolerant. Like think of like space shuttle fault-tolerance. Like you have like three CPUs running at the same time all doing the same thing, stuff like that. All right, so. So now we understand what sort of, at a high level what the system architecture is gonna look like. And again, for this, the next two, three lectures we're gonna focus on shared disk and shared nothing. And I'll just extinguish as we go along when we're talking about one versus the other. So now we need to decide what, how we're actually gonna, you know, build out the full distributed database. And the most questions we have to answer. We sort of, these sort of cropped up as it's been going along because I'm sort of being very vague or hand wavy about certain aspects of how the application is communicating with the database system to find certain things. But now we need to talk about how this is actually gonna work. So the first question is most obviously is how is the application gonna find data, right? How does it know that I need to go to this machine here to go find record one, two, three? Then when we actually execute queries on this data, how should we move data around or the queries around to process them, right? Does it make sense to push the query to the data wherever it's located? So if I'm in a shared nothing architecture and my query wants to get record one, two, three, but record one, two, three is on a different node. Should I go send a query fragment to execute the query on that data and get back the result? Or should I move the data to the node where the query is running, then process it? And then how is the data system actually gonna ensure correctness across multiple machines, cultural nodes, if we're updating the database at the same time as we're doing other operations? So we'll talk more about this next class, but remember I skipped over consistency when we talked about asset and I said, doesn't really matter on a single node because the database is always consistent because you can see exactly what you wrote. Now in a distributed environment this becomes an issue because if I write data on a single machine, if I write data for a record on one machine, but then I try to read that record on another machine, am I gonna see my own write or are other transactions gonna see my write? If so, when? So that's what consistency actually matters. So we'll cover that as we go along. So another high level architecture issue we're gonna have to deal with is what should the nodes look like in our cluster and our distributed database? So one approach is sort of have a homogeneous cluster where every single node is sort of a first class entity in the system that can do all the operations or all the tasks that my database system can do. So the way to think about this is say I have five nodes, every single node can execute queries, move data around, log, do configuration, partitioning, can do all the things that my database can do. So the advantage of this approach is that if any node goes down then I can bring up a new one and have that take over all the responsibilities of the old one. This most makes sense in typically in a shared nothing architecture, but you could still have this in a shared disk architecture. The other approach is that in the heterogeneous cluster where you have different nodes and we assign separate individual tasks and in order to, and I have to communicate between those different nodes in order to do whatever it is that I need to do to execute the query or move data around or log things. The advantage of this is that now you can have specialized nodes that can handle certain tasks and you can scale those things out independently from each other. This is sort of like in the shared disk model I have my compute layer and I have my storage layer and scale those things out separate from each other. But now of course this means that I have more moving part to my system. I have to make sure all these different services that I need from a shared database are working and I need to make sure that they can all communicate with each other correctly. So no, I'm not gonna say that one is better than another. VoltDB uses the top one, a bunch of other systems use the bottom one. It just depends on how you wanna architect your system. So let me give an example of what a heterogeneous architecture looks like. So this is actually how MongoDB works. So MongoDB has three different classes of services or nodes in your cluster. So you're gonna have a router node, a config server node and then the shard nodes. So the way it works is the application server always sends every query request to the routers. In this example here we wanna get the record with ID 101. So the routers are stateless. They don't know anything about where to go find this data. All that's stored down now in the config server. So we send a request from the router to the config server and say, hey, look inside your partitioning table and tell me what partition is gonna have the record for ID 101. And then the config server sends back this information to the router and now the router knows where it wants to go out to the shard to get the data that it wants. So each of these servers, the config server, the router server and the shards, they can all be stored running on separate nodes. And you can scale these things out separately from each other. If you wanna be more economical or thrifty or if you're not taxing the node for these different services, there's no reason you couldn't put these on the same box, right? But if you wanted to scale these independently, you could do that. So what's one potential downside of this? So now, again, for a single request, I have to do an extra hop. I gotta go from my router server to my config server and then get a response back and then figure out where I need to go next. And obviously you can cache some of this partitioning table up in the router server, right? It's extra communication that you may not wanna have to do. But for a lot of things, it's probably not an issue. Okay, so at a high level, what we're trying to do in our distributed database. Again, regardless of whether it's shared disk or shared nothing, regardless of whether we're using homogeneous or heterogeneous cluster, is that we wanna have what's called data transparency. Meaning the application can submit SQL queries or submit queries to our distributed database. And it doesn't know, it doesn't need to care of where that data is actually physically located or how it's being partitioned or whether it's replicated at one node versus another. We don't wanna know anything about this. Again, it's the same thing we talked about when we talked about parallel databases. I wanna take my same SQL query that I can run on a single box that is running with one thread. And I wanna take that same SQL query and run it on a single box with multiple threads or run it on multiple machines across the distributed cluster. I should not have to modify anything in my SQL query to be aware of the actual physical location of the data. And that's one of the benefits you get of a declarative language like SQL, okay? All right, so now we wanna talk about how we're actually gonna split the data up. So we already talked about database partitioning before. We talked about horizontal partitioning versus vertical partitioning. But now we wanna talk about how this is actually gonna be implemented, how this actually looks like. So again, we're gonna split our database up when we partition it into these, across different resources. And this can be assigning now partitions to either disks, to nodes, to single CPUs, to chunks of memory. It doesn't matter at a high level, it all works the same. So in the relational database world, this is called partitioning. This is the term I use. If you're coming from a no SQL world, they use the term shard. It's the same thing. It's a disjoint subset of the database. So what's gonna happen is we're gonna execute a query and the database system is gonna take a query that was originally written at a high level language like SQL, break it into a physical plan and now we're gonna divide that physical plan up to fragments that are gonna target different partitions or shards. And depending on whether we move data to the fragment that's gonna execute it or we send the fragment to the data, it doesn't matter. We just know that we wanna coalesce all the results that are generated, the intermediate results that are generated from these different query plan fragments, put it back to a single result and then send that back to the client. Again, the client doesn't know that it executed on one machine versus a thousand machines. All that's hidden from you underneath the covers. So the easiest way to do partitioning is what I'll call Na'e partitioning scheme. Where you don't look at anything of how all the queries are actually accessed to the data. You just take every single table in your database and do a round-robin approach and you sign them one by one to each node. And this assumes that each node in your cluster can handle a complete, the entire table. Right, so really simple example. I have two tables. And all I'm gonna do is to say table one goes to this partition and table two goes to this partition. That's it. And this is super easy to do now in terms of figuring out where queries need to go because if all I have to do is look at my query and say, oh, it's accessing table one, I know exactly what partition has table one and I route the query there. What makes this hard? What's an obvious problem with this? Yes. She says, if you need data replication, we're not there yet. We don't care. Yes. It's not scalable. It's just not scalable from two points of view. One is, again, as example, he said, if I have a hotspot and all my queries are going to table two, I'm gonna stress this partition. I'm not gonna be able to take advantage of a distributed cluster. I'm still running on a single box, so that's no better. It's also not scalable too because now if I start doing joins, right, now I gotta move data back and forth between if I wanna join table one and table two and it's a complete join, I need to move either all table one down here or all table two there or be smart about moving to some of the data, right? So it's not scalable. We're not getting any of the benefit of a distributed cluster. So some data systems actually will do this for a subset of the tables in your database. So MongoDB can do this. MongoDB can tell you, you can tell MongoDB, hey, table two is put this on a single chart or partition on a node by itself. And you do this for tables that you're never gonna read. So let's say you have a table, we're gonna just log events, you just insert events that occur in your application. You can just have all your rights go to this one partition, right? And you don't need to coordinate across any other nodes. All the queries just go there. Horizontal partitioning is what people most think about when you think about distributed database. And again, especially in a shared nothing environment. Again, this is where we're gonna pick some number of attributes or columns in our table and we're gonna use their values to decide how we assign them to different partitions. And the idea, the goal would be to pick a partitioning key that is used often in queries such that most queries are gonna end up being touching on a single box. In the case of OLTP, that's usually the case and all that queries, that's not always the case. And we're gonna do it in such a way and we divide it up in such a way that we have an equal amount of data or queries or total load accessing or going at each partition. So we're evenly balanced across the entire cluster. Of course, there's anomalies, like everyone wants to go read Justin Bieber's Twitter feed. Although he's a bit old, so maybe not so much, but Donald Trump, sure. Everyone wants to go, for better or worse, everyone wants to go look at Donald Trump Twitter feed because they really like him or they think it's a dumpster fire. And so all that traffic is gonna be going to a single node. So we wanna cheat that separately. But in general, for everyone else, everyone else is not like Donald Trump and we wanna divide them evenly across all the different partitions in our cluster. So we can partition our database either physically in a shared nothing environment or logically in a shared disk environment. So let's first look how we see how you do horizontal partitioning in general. So we have a single table and we're gonna pick the second column here as our partitioning key. So in this example here, we're gonna do hash partitioning. So that means that we're gonna look at every single tuple and we're gonna take whatever the value they have for that attribute that we've identified as the partitioning key and we're gonna run it through a hash function and modify the number of nodes that we have and that's gonna tell us what partition we go to. What partition this key should be assigned to. This tuple should be assigned to. And then when we load our database, again we just run this hash function and that tells us where they go. So now if any query comes along and it does something like where partitioning key equals some value, I can just run the same hash function and modify the number of nodes that I have and that tells me exactly what node has the partition that I want. Yeah, what node has the key that I want. In my earlier example, I did what is called range partitioning where I said, well ID from one to 100 goes to this node, ID 101 to 200 goes to this node. For different applications, one partitioning scheme is better than another. Right, if you're trying to do a partition on say an auto increment key, then using range partitioning is probably a bad idea because every insert's always gonna be 101, 102, 103, 104, and that means they're always gonna go to a single partition. But if I use hash partitioning on the auto increment key, then that's gonna evenly distribute the load, those inserts across all those different partitions that I have. And again, if I wanna do quality predicates, hash partitioning works really great because I can jump to exactly what I need, if I wanna do a range predicate, then it's a broadcast query because the hash function, I lose all notion of locality, I don't know where the data's actually is. Yes, her state, but the question is, is it true that partitioning key will never change? You could change it, in general it doesn't change. Think about this, right, so think of it in the context of OLTP. I build my application and I'm gonna use the, it's keeping track of the university's enrollment, I'm gonna use the student ID as the partitioning key. The student ID value for an individual student could change, but I still wanna partition on the student ID. I could come along and say, all right, I can't use student ID, actually when I was an undergrad, they used your social security number as your student ID. Right? And at some point they realized, oh that's a bad idea, so then everyone got new student IDs. So you can imagine that environment, I changed my partitioning key from the social security number to the synthetic student ID, but that's like a one-time thing where I dump all the data out, load it back in, repartition. It's not like you're gonna do this on the fly all the time. Yes? So what's gonna happen? Good, so he says, what's gonna happen if you add a new partition? Right, because the issue is gonna be, I'm mod five four, if I now have five, then we talked about this before with hash tables, now it's mod five, that's gonna reshuffle everything. So in, with range partitioning, this is not an issue because you just split up the ranges. With hash partitioning, you'd have to use something like consistent hashing, which we can talk about when, if you wanna talk about Cassandra end of the semester, that's what they use. And think of it as a ring where when I add a new node or just remove a node, I only have to reshuffle the data that's adjacent to me in the ring. All right, so this is an example of sort of physical partitioning, or it could be, because you could think of this, these partitions being stored at a single node. In the context of a shared disk system, as I said, each node doesn't actually have any of the data. They only have caches in memory and they're buffer pool. Out on the shared disk, I have all my data. So logical partitioning is gonna be where I'm gonna assign the nodes to be responsible for certain keys. Again, it doesn't matter whether I'm using hash partitioning or range partitioning or whatever I want, the idea is basically the same. So now what'll happen is, the application server says when I wanna get ID one, it knows it has to go to this top node here and this top node is responsible for going, getting it from the desk. If I wanna go get ID three, again, I go to the bottom. The advantage of this approach is that now any updates will only go to a single node here, right? If I wanna update ID three, I know that this guy is the only one could have a copy of it in memory and it's responsible for writing it out to here on disk. So if this guy wants to go get ID three, it has to communicate through the channel here to go directly to the node to get it. It can't go out the disk to get it. Again, physical partitioning is in a shared nothing architecture and this is where the data is actually is physically split up and each node is responsible for it. If I want ID one to go to the top, I want ID three to go to the bottom, right? And then if I need to go get any data, this guy wants ID three, it has to go over the network to go get it. But this is sort of what I think about this. For ID three and four, this is the primary storage location of those pieces of data is at this node here. And unless I do some repartitioning or shuffle things around. So the goal in our partitioning scheme, in the context of transaction processing workloads, is to try to make every query be, and every transaction only have to touch a single node. Because if all my queries go to a single node, I know that nobody else, I don't have to worry about coordinating whether transactions allow to commit with any other node in the cluster. Because all the information I need to know about that transaction is at my single node. So typically you want to pick a partitioning scheme, or partitioning key, or keys, plural, in a OLTP distributed database, such that you maximize the number of single node transactions. Where things get hard now, is if you have a transaction that touches multiple partitions. This is typically called a distributed transaction, or a multi-partition transaction. The reason why this is hard is now if I'm modifying the state of the database across multiple machines, I now need to coordinate with them to say, is this transaction allowed to commit? And this is where things get hard. So again, if we're gonna allow transactions to support multi-operational distributed transactions, they could touch data across multiple partitions, and they wanna be able to update the database at the same time across those different partitions, I need to coordinate them. And so we're gonna use all the same current control methods that we talked about before. All those same protocols and algorithms still apply here, but now we can have different ways actually, however we're actually gonna maintain the state to figure out whether transactions are allowed to commit or not. So the two approaches that have a centralized model, sort of thinking as a global traffic cop that has complete view of all transactions that are running in my distributed database, or I could have a decentralized model, where there is no centralized coordinator, and all the nodes have to coordinate amongst themselves. So the oldest approach to the way to do this in the very first distributed databases is it was the centralized model. And they had a, and they used a sort of piece of software called a TP monitor. Now it's called, the abbreviation TP means transaction processing monitor. I think in the original versions of the 1970s, I think they were called telecommunication processing models because they meant to do with like, think of like phone networks and things like that. The idea here is that we have a centralized coordinator that is gonna be responsible for coordinating transactions across different federated systems. So I could have, say I have sort of, I have one sort of application or one system running over here on another machine, I have another system running on this other machine, both of them do transactions, but they only do transactions that are local to them. But now if I wanna have a transaction that span those two systems, I need something above them to coordinate them. And that's what a TP monitor was designed to do. These things are so used today. There's a Oracle, has one called Tuxedo. There's another one from IBM called Transarc. Actually was developed here at CMU in Pittsburgh. These were used again to federate systems that did do transactions, but only at the local state. So now a bunch of databases will actually support the same functionality that a TP monitor will give you internally, and you don't need them, but if you're dealing with a legacy architecture that doesn't have this, you'd wanna use something like a TP monitor. So let's look at how a centralized coordinator would work. So we have our partitions with our data over here. Again, we don't care how it's partitioned. We don't care where it's physical or logical or range or hash, it doesn't matter. And then we have an application server that wants to update data on these partitions. So let's say this transaction, we wanna execute a transaction that updates data on these three partitions. So in order to begin our transaction, we have to go to the coordinator and say, hey, we want to update these three partitions. We wanna acquire the lock for them. So now say we're doing the same two-phase locking that we did before, right? On a single node, but now we're doing the stripping environment. So the coordinator's gonna have its own lock table to say, here's the locks that are being held by actual transactions on these partitions. So once this transaction is allowed to acquire these lock requests, it gets back to the acknowledgement, say you have the locks, go ahead and do whatever you wanna do. And then the application server is responsible for sending out the query requests to the different partitions to do whatever modifications or read whatever data they wanted to read. When they're done with this and they wanna commit, they go to the coordinator and say, I'm done, please commit my transaction. And now the coordinator is responsible for going out to these different partitions and say, this transaction came along, it locked you, it did something, but I don't know what it did. Is it safe to commit? Again, the coordinator only knows that you hold the locks of these partitions. It doesn't know what you actually did at each individual node, so it doesn't know whether you're gonna violate an integrity constraint or a referential constraint that would not allow you to actually commit. So now once all these nodes with these partitions say it's safe to commit this transaction, then we send back the acknowledgement to the application server and say yes, your transaction actually fully committed. If any one of these partitions say no, we can't commit, then the whole transaction has to fail. And I'll talk about in a second how we actually ensure that that actually happens in a second. So, as I said, TP monitors were first examples of this. This architecture is actually exists today. You can get something from Apache called OMID, which actually developed out of Yahoo Labs. Think of this as a centralized transaction coordinator for HBase. Transarc was actually a Pittsburgh company here that spun out a CMU that built their own TP monitor or transaction coordinator that did, again, sort of at a high level of the same kind of thing. If you ever wonder why there was an IBM office at the corner of Murray and Squirrel Hill because they bought Transarc in the 1990s. Another example of a centralized coordinator, which is probably more common, is to have a middleware system that sits in between the application server and the partitions. So, now instead of going to this separate coordinator and then sending lock requests and then sending the queries directly to the partitions, I'm gonna send all my query requests to this middleware. And the middleware has the internal state about where the data is actually being stored. Just like in the case of MongoDB, think of this as the config server and the router merged together into a single layer. And it has its own internal lock table. It is what data is available. So, again, the application server doesn't know where the data is actually being stored. It says, just execute my queries. So, then it can, once it acquires the locks, it goes ahead and executes whatever queries that it wants on the data, sends back the acknowledgement, then we go to commit requests and then the middleware is responsible to go out to these guys and say, hey, are we allowed, is it safe for us to commit? So, this is the most common approach people use when they want to shard my SQL. Facebook is probably the most famous one of this. They have a middleware layer that intercepts all the query requests and then they figure out what my SQL partition or what my SQL node should write your query to. Okay. So, again, these are all centralized approaches. A decentralized approach doesn't have any middleware. It doesn't have any centralized coordinator. There is no global view of what's going on inside of our system. So, it's up for the application server to send a request to begin a transaction and it'll pick some node in our cluster that it's gonna end up being the base partition of the home node for this transaction. And this is gonna be the node that's gonna be responsible for determining whether this transaction is allowed to commit or not. So, I could have another transaction request that could go to another node and that could do all the same steps and these guys can run independently of each other. So, I begin my transaction and then once I get my acknowledgement, I start sending the query requests to my different partitions. Again, I'm ignoring the fact of how the application server knows it can go ahead and do this. We could have some kind of lookup server like in MongoDB or we can push down state about the partitioning table to the application server. It doesn't matter. Then we go do our commit request and again now this guy was anointed as the home partition for this transaction. It's responsible for communicating with everyone else and deciding whether this transaction is allowed to commit. And what's gonna happen is at the same time where we're trying to figure out whether we're allowed to commit for this transaction, these other nodes could be also trying to commit other transactions that may conflict with my transaction and then we have to decide who's gonna abort, who has to roll back or who's allowed to finish. And it's all the same stuff that we talked about before, whether it's time stamp ordering or two-phase locking, all that still applies here. It's just now it's in a decentralized environment. So, distributed congratulatory is much harder here because not only do you have to worry about the delay of sending messages between different nodes, we also don't have a centralized view of every single transaction in the system, but there's other assumptions we made on a single node actually cannot be made here. So, before we talked about time stamp ordering, we would say, oh, we could just use the system clock to decide whether one transaction's older than another. Now I can't do that because now I can't guarantee with high precision that the clock on every single node is gonna be in sync. Right, okay, I could have one node could always be 10 milliseconds ahead and every transaction that's gonna start in that node is gonna be in the future and it's gonna kill all of my other transactions. We don't want that to happen. But I can't use a logical counter like I showed in Postgres because that's global state that now I need to keep in sync on every single node. That means for every single transaction I need to start, I have to send a broadcast message to every single node and say, all right, add one to our counter. We don't wanna do that because that's gonna be slow too. So, we have to adapt the same protocol as we did for single node concurrency tool to make them work in a distributed environment. Time stamp order we get by by using a combination of system clocks with a logical counter and also using the host named to prefix that to my transaction IDs to make sure that I always at least have order of transactions going in one direction. Right, two days locking is a bit harder because I need a global state of who holds what locks. So that we have to coordinate over the network and the home partition is responsible for figuring out whether you're allowed to commit or not based on whether you have a conflict. So, we'll talk about replication next class. Network communication is much slower in this environment. So, maybe the same protocols we used before may not always be the best for us. We also have nodes going down while we're doing two-phase locking or other time stamp ordering protocols and now we have to account for that in our algorithms and that's harder. So, I'm gonna punt on replication in this class. We'll pick this up on Monday after the break but I just wanna spend time a little bit talking about what two-phase locking might actually look like in this environment. So, let's say now I have two nodes and this is a shared nothing environment so they're communicating over the network and we're doing hash partitioning where each node has exactly one record. This has A, this has B but now I also have two application servers at the same time trying to update the database at the same time but starting at two different nodes. So, the first guy here wants to do an update on A but it's gonna use this node as its home partition and this guy's gonna update B by using this node as his home partition. At this point we don't need to coordinate because node one is a response for A so that's fine, node two is a response for B so that's fine. So, I can acquire the locks for these two records at these two partitions without having to ever communicate with the other guy. But now the tricky thing is that if I come along and the first guy wants to update B and the second guy wants to update A well now I need to go send messages over to from this home partition to this other partition to try to acquire a lock on A and this guy wants to get, sorry, the lock on B this guy wants to go over to the network and get a lock on A over there but now I obviously have a deadlock. The tricky thing now is that who's gonna figure this out because this guy over here he only knows about his lock table. He knows that, all right, I can control a B I know I have a lock on B for this, my transaction over here. It doesn't know that A is being locked over here because it just sent the message, hey, go lock A because I'm gonna update it. So, obviously in our wastewater graph we have a cycle but somebody needs to figure out when we have a cycle. So this is why if you have a centralized coordinator this is easy to do because I know that this guy has a lock on A, this guy has a lock on B and they have a deadlock with each other and I can make a decision in my centralized view that I need to go ahead and kill a transaction to break the deadlock. That's the benefit you get from a centralized coordinator but if using a decentralized system you have to manually have, or you have to not manually, you have to have some way of recognizing that, all right, something's going on here, let me go communicate with all the different nodes and figure out what's in their lock state or what's in their lock table and then I can determine whether I have a deadlock. So, a centralized coordinator could end up being a bottleneck because everyone's going to one location to go acquire lock requests and commit transactions but I have a global view of the database that makes the algorithm easier and a decentralized model, I'm able to scale potentially more easily because I don't have a single bottleneck that everyone's trying to go to but now I have the problem of I have to do extra work to figure out what the state of the world is. All right, so any questions about any of this so far? All right, we have 10 minutes left. I'm going to stop here because this is now jumping into atomic commit protocols. So, what have we gotten so far? So far we know how to, we know the different architectures you have for a distributed database, right? Whether it's shared disk, shared memory, or shared nothing. We know how to partition the data in the different environments, right? Whether it's logical partitioning in a shared disk or a physical partitioning in shared nothing and then we just covered now how do you actually want to coordinate transactions in a distributed environment? Whether you have a centralized coordinator that everyone goes to and tells the coordinator what they're going to do before they do it or as they're doing it or have a decentralized model where the nodes themselves have to figure out what the right course of actions are. So, hopefully that you're starting to get the idea that this is not going to walk through the park because now we're distributed. We haven't even gotten to the issue of replication. We haven't even got the issue of making sure that when we commit transaction that everyone agrees that we can go and commit and we can commit atomically across the entire cluster. So, we'll discuss more of this after the break in class on Monday. Again, these additional issues we have to have if you want to do distributed transaction processing. And hopefully I'm also impressing upon you that this is really hard to get this thing to run correctly. And if you want to get an idea of just how hard it is actually to get correct, there's this great website from a, it's a software engineer out of San Francisco called Kyle Kingsbury and he has something called the Jepsen Project. And he basically has made a torture device for distributed databases that he runs on open source and commercial databases and tries to run weird failures and corner cases on these systems to see where they break, see where they violate correctness guarantees that they claim that they have, right? And it's awesome because he does these write ups that are extremely thorough, like it's not something you just read in your underwear real quickly while you're eating breakfast. You get that dedicate time to understand what the hell he's actually saying when he talks about these things because there's really complicated ideas. And so what's awesome about it is he basically was just doing this for fun on, you know, and he got a lot of traction, a lot of people interested in this. And there's one example, I don't say who, there's one database company that we're making claims about how correct or how strongly consistent their database was. He comes along with his torture tool, shows that it violates all sorts of these consistency guarantees and they actually had to go change the website and the marketing literature to now, you know, not say they're as strongly consistent as what they used to be because they proved that they weren't. So now he actually has a consultant company and you pay him to come to your company and run his torture device on your database and you get certified by them, right? And so he has a whole, you go to his website, he has a list of all the different systems and what virgins he's tried his thing on and who fails and who doesn't fail. So VoltDB, I think, failed. My impression was it was stronger than the other ones. I think it was the worst. At least it's all over to me. The newer virgins have gotten much better. They actually now support transactions. So if you wanna learn more about this, go read his blog article, it's awesome. He's a weird dude, it's awesome, okay? All right, next class, replication, cat theorem and then real-world examples of other distributed devices, okay? All right, guys, have a good holiday and I'll see you Monday next week. Ha ha ha ha, that's my favorite all-time job. Ha ha ha. Yes, it's the S.T. cricket, I-D-E-S. I make a mess unless I can do it like a G.O. Ice cube with the G to the E to the T.O. Here comes Duke, I play the game where there's no rules. Homies on the cusp so y'all gonna focus on drink brook. With the bus a cap on the ice bro. Bushwick on the go with a blow to the ice. Here I come, Willie D, that's me. Rolling with fifth one, South Park and South Central G. And St. I's when I party by the 12-pack case of a fart. Six pack for the act gets the real price. I drink proof but y'all are drinking by the 12 hours. They say bill makes you fat. But St. I's is straight so it really don't matter.