Alright, DJ, appreciate it. Thank you. You doing OK? Good. Alright, a lot to discuss, so let's jump right into it. For what's due: the last project and the last homework are out. Project 4 will be due on December 11th. We're having the Q&A session tonight on Zoom, as announced on Slack, because next week is the holiday and I'd rather get this out of the way so you can get started sooner rather than later; we'll post it on Piazza afterwards. Homework 5 is out as of last night and will be due on December 4th. Any questions about Project 4, or Project 3, which was due last night? No one has gotten started yet? Okay. Alright, so this part of the lecture opens up a new chapter on distributed databases. As I said at the end of last class, we spent the entire semester so far talking about how to build a single-node database, and I didn't want to bring distributed execution in because that opens up a whole other can of worms that complicates things. Now that we understand how a single-node database works, we can extend the architecture to support a multi-node environment. That's what this class and the next two classes will be about. Hopefully the main takeaway is that not only are single-node databases hard, they're even harder when you make them distributed. So oftentimes you're better off trying to make your single-node database scale vertically, meaning getting better and faster hardware, as far as you can before going distributed. Replication complicates things further, and we'll cover that next class. If you recall, before the midterm we talked about parallel query execution, and that was about how, on a single node, you take a single query, break it up into smaller fragments, and run them in parallel on multiple threads, multiple CPUs. We made a distinction between a parallel database and a distributed database, and my definition was that in a parallel database the execution nodes are physically close to each other, maybe just different CPUs on the same motherboard, and we didn't worry about the communication cost or the communication being unreliable. But in a distributed system we can't make that assumption anymore. Now the nodes could be physically far away from each other: if I send a request from one node to another, it might be in the same rack, the same availability zone, the same data center, or a data center on the other side of the planet. We have to take that into consideration in how we design our algorithms and execute queries. It may also be the case that the messages we send between nodes never show up, so we have to be aware of that and design for that problem. So for this point in the course, we're going to use the single-node databases we've covered so far as the building blocks to support larger and larger systems. And everything becomes more complicated: query optimization was already hard and gets slightly harder, concurrency control is definitely harder, and logging and recovery are harder as well. Okay.
Today we'll begin by talking about the different types of system architectures you could have for a distributed database system at a high level. Then we'll talk about the things you have to consider when you actually build a distributed database. Then we'll talk about how we split things up, partitioning the data across multiple nodes and multiple resources. And at the end I'll give a teaser of what the challenges will be in distributed concurrency control, which is what we'll focus on next class: how do we take all that two-phase locking stuff you're implementing in Project 4, or OCC, and make it run across multiple nodes? The spoiler is that it works, but it's slow and it's hard. So the first thing to figure out is what the system architecture actually looks like, meaning how we design the database system in terms of storage, memory, and compute. Where are these resources physically located? That's going to determine what kinds of protocols or algorithms we can use when we run queries and transactions. It also has an interesting side effect of determining how we can scale our system elastically, meaning add more resources, add more nodes. If we're running on a single box and want to add more CPUs, we literally have to turn the machine off and either migrate the database to another machine that has more CPUs or physically put more CPUs into the motherboard. Disks are usually swappable; RAM usually isn't. But in a distributed environment, especially in the cloud, in theory we can add more nodes and scale horizontally, and we want to be able to do this without having to take the database system down entirely. So the system architecture determines what our capabilities are and how we would want to do this. There are basically four types of database system architectures. We can ignore embedded databases, as I said before: those are a library, like SQLite, that you link into your application, and they run in the same address space, the same process, as your application. You can argue those are basically shared everything, but for now we can ignore them. Everything we've talked about so far in this course is what's known as a shared-everything system, meaning the disk and the memory and the CPU are all local to each other. If I want to have, say, two threads or two queries running at the same time in my shared-everything system, they can just pass messages to each other through memory. If one transaction writes something to the local disk, the other transaction can just read it. The next type of architecture, if you want to start being distributed, is called shared memory. The idea here is that the CPUs are not co-located, but there's some messaging fabric that allows memory to be shared amongst all of them, and the disk likewise is shared amongst all of them as well. Think of this as two boxes: any time a thread running on one machine wants to communicate with another machine, it can write to a memory address, and there's some hardware mechanism that sends that message over the wire to the other node. RDMA is sort of something like this.
What's more common for databases are the last two. Shared disk is when you have separate nodes that have their own local CPUs and their own local memory, but the disk is shared amongst all of them. Again, there's some messaging fabric, something over the network, that takes requests from each of the different nodes and sends them down to the disk. So if one CPU writes something to the disk, another CPU will be able to see it when it goes and does a read. And then when you think of distributed systems, what most people think of is what's known as a shared-nothing system, where every single node has its own local disk, local memory, and local CPU, and the only way to communicate between queries running on different nodes is to send a network message over TCP/IP. Think of it like I buy a whole machine that has disk, memory, and CPU, I have multiple of them, and each one is its own little kingdom, but they can communicate by going over the network. I should have drawn borders, but each CPU with its own memory and disk is its own node. Yes? The question is, do we assume the network is reliable in the second and third case? No. Okay, so let's go over them one by one. In shared memory, the CPUs are running on separate boxes, but they have access to a common address space, assumed to be over a fast interconnect. It's not guaranteed to be reliable, because the other node you're reading and writing memory from could just go down, and you have to deal with that. But the idea is that even though the CPUs may be running on separate boxes, they believe they have a global memory address space they can all read and write to. So when they boot up, they have to be told, here are the other instances, here's what the giant memory pool is, and then hardware keeps track of how to pass messages around. As far as I know, there is no distributed database system that actually implements this. This architecture appears mostly in the HPC world, high-performance computing, because they're trying to do things like simulations of nuclear explosions on terabytes of data and they need a really large address space for their programs. The one that sort of looks like this is Oracle RAC. The way that works is you have multiple nodes, and the buffer pool has been modified so that when you read and write pages, it knows how to send that request to the memory locations of the buffer pools on other nodes. It's not exactly shared memory, because if I allocate memory locally for a query, say to fill up a hash table when I'm doing a join, that's local memory and nobody else can see it; only whatever is backed by the buffer pool is shared across multiple nodes. What is a lot more common is shared disk, especially in a modern cloud setting. Again, the idea is that we have different nodes, each with their own CPU and local memory, but any time they need to read and write to disk, they go to some central location. If you're running on-prem, it could be a NAS, a network-attached storage device.
If you're running in the cloud, think of something like S3. If one CPU needs to get the state of another CPU, like what query or what transaction it's running, it has to send messages over the network, over TCP/IP, between them. But if one query updates some record and that gets written out to disk, the other CPUs will be able to see it, because they're always reading and writing from the same disk. This is super common now, especially in the cloud; pretty much any modern OLAP system you can think of today is using this. Think of it again as S3: there's a bunch of buckets, and I have a bunch of compute nodes that run queries, pulling data out of the S3 buckets into their local memory and then processing it to produce the query results. Yes? The question is, in the S3 case, is each page its own blob, or is the entire database one giant blob? At this level you wouldn't necessarily call them pages; they might be something higher level, like a segment, comprised of multiple pages. The idea is that you go fetch, say, a 10-megabyte chunk of data comprised of multiple pages and then run your query on that. The lines get kind of blurry, because think of each compute node as its own EC2 instance: it could have locally attached storage, like a fast NVMe drive, that can be used as a cache. But the state of the database, the final resting place of the database, has to be down on the shared disk storage. As I said, pretty much every cloud system that exists today is, for the most part, using something like this. So here's how it would actually work. You have your application server, say we have two compute nodes, and then we have our storage, and the storage layer is its own independent thing where the different nodes can read and write from it as needed. My query shows up and says I want to get some object, ID 101, some record. So the node goes down to storage and gets page ABC; it might be a single object, or part of a larger object that I have to pull the page out of. That gets shipped back over to the node, which answers the query. This other query shows up and says I want ID 102, so it goes to the other node; how we find which node should handle the data we want, we'll talk about in a second, but the idea is that it goes back to the shared storage device and fetches what it needs. All right, so now the advantage of a shared-disk architecture over a shared-nothing architecture is that it makes it really easy to scale the compute layer and the storage layer independently of each other. Because the final state of the database, the state we want to persist forever across crashes and restarts, is always out on the storage layer, in theory I could kill off these front-end compute nodes and I don't lose my database. When people talk about serverless architectures, that's sort of what they mean by this.
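To make that concrete, here is a minimal sketch of a shared-disk compute node pulling a segment out of an object store into a local cache before evaluating a predicate. The `ObjectStore` and `ComputeNode` classes, the segment key `db/orders/seg_000`, and the record layout are all made up for illustration; a real system would go through its buffer pool and a driver for S3 or an equivalent service.

```python
import json

class ObjectStore:
    """Stand-in for a shared storage service (e.g., an S3 bucket); just a dict of blobs."""
    def __init__(self):
        self.blobs = {}

    def put(self, key, data: bytes):
        self.blobs[key] = data

    def get(self, key) -> bytes:
        return self.blobs[key]          # in reality: a network round trip (~100ms+ on S3)

class ComputeNode:
    """A stateless compute node: caches segments locally, but the object store owns the data."""
    def __init__(self, store: ObjectStore):
        self.store = store
        self.cache = {}                 # local NVMe/RAM cache; safe to throw away

    def read_segment(self, seg_key):
        if seg_key not in self.cache:   # cache miss -> pull the whole segment over the network
            self.cache[seg_key] = json.loads(self.store.get(seg_key))
        return self.cache[seg_key]

    def select_eq(self, seg_key, column, value):
        # Filtering happens here on the compute node; the dumb store can't push predicates down.
        return [row for row in self.read_segment(seg_key) if row[column] == value]

# Usage: load a segment of two records, then throw the node away without losing data.
store = ObjectStore()
store.put("db/orders/seg_000", json.dumps(
    [{"id": 101, "total": 50}, {"id": 102, "total": 75}]).encode())

node = ComputeNode(store)
print(node.select_eq("db/orders/seg_000", "id", 101))   # [{'id': 101, 'total': 50}]
del node                                                # compute is disposable; the store keeps the database
```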
So since this compute node isn't storing anything on its disk that is the database we have to persist forever, we can kill the node, no big deal, and shut everything off; Amazon keeps maintaining my S3 storage. And then if I want to run another query, I can spin a node back up and pull pages from the storage layer easily. Or if I want to add more nodes without taking down the other ones, I just pop a new node up and now queries can go to it, run the same kind of requests, and get results back. Likewise, in addition to scaling the compute nodes, I can also scale the storage layer. Now say I want to do an update. Somehow I know that this node at the top is responsible for updating record 101; how it knows to do that, we'll get there in a second. In this case it goes out to disk and updates page ABC on S3, and that's fine. But let's say I also know that these other nodes have a cached copy of record 101. To make sure they have the latest update, I can send them a request over TCP/IP saying, hey, I know you know something about record 101, I just updated it, here's the new value. Or I tell them to invalidate it, so they'll fetch it again from the storage device. Yes? His statement is, and he's correct, that if I always make it so record 101 always goes to this one node, then I don't have to do this extra step of telling everybody to invalidate. Correct. Some systems do that; other systems sometimes cache a read-only copy elsewhere, because otherwise you would have to do distributed transactions, which makes things harder. We'll get there in a second. But you're right: if you don't have another copy somewhere else, you don't have this problem. The other thing I was going to say is that in addition to being able to scale the compute nodes independently of the storage layer, I can scale up the storage layer as well. I can add more disks. Amazon does this under the covers for you; it's not something you tell S3 to do, but it can make multiple redundant copies, you can increase your provisioned IOPS, get better bandwidth. You can make the storage layer faster without changing anything at the compute layer. I'm using S3 as an example; Azure has their own thing, GCP has their own version of S3, I forget what it's called, but the basic idea is the same: it's an object store, I can do puts and gets and deletes. All right. So shared nothing, as I was saying, is where all the nodes are independent and have their own local disk and local memory. And historically, up until probably the last ten years, this was the preferred, conventional-wisdom way to build a distributed database system. There were actually a bunch of shared-disk systems before the cloud, going back to the 1980s, but those were always seen as problematic, and a bunch of those products didn't go anywhere. So everyone said, okay, if you're going to build a distributed database, you build a shared-nothing system. But now, because of the cloud, Amazon or whoever the cloud provider is takes care of the storage layer, so it's just easier, better, to build a shared-disk system, again because you can scale compute and storage independently. All right.
The basic idea with a shared-nothing system, again, is that every node has its own locally attached disk and memory, and if I need to communicate between different nodes, I just send messages over TCP/IP. Yes? The statement is, if you're using something like S3, isn't Amazon managing the disk underneath the covers for you, replicating and all that? Yes, and they're going to do a better job than you could. But don't we know better about what kind of data we're storing, so we could manage it better ourselves? I see where you're going. So the statement is: I made a big deal about how the OS doesn't know anything about your database and therefore you shouldn't let the OS manage things for your database, but now I'm saying, hey, if you use Amazon, you can use S3 and let them manage the storage for your database. You're right, and there are optimizations we're not going to be able to do if we use S3, like predicate pushdown: if we need a page, we have to go fetch the entire blob, we can't do any filtering where the data resides. We'll cover that topic in a second. The difference, I think, is that at that scale Amazon is going to do it much better and much cheaper than you could yourself. And S3 is pretty good, but it's slow: it's like 100 to 200 milliseconds per fetch, which is a long time, so you wouldn't want to use it for transactional workloads. You could maintain that local cache on the node's local disk, as I mentioned, but if you want to be able to shut things down you've got to flush it out to EBS or S3. Some systems make more use of S3 because it's very cheap and affordable; other systems will use EBS. Yes? Amazon Aurora is a very interesting one. We won't cover it this semester, but Aurora basically took MySQL and Postgres, ripped out the bottom half, the storage layer, and rewrote them to be shared-disk systems. But instead of just storing things on EBS like every other database vendor would normally do, since Amazon controls the whole stack, they actually put a little bit of database magic in a layer above EBS that knows about replication and transactions and durability. So they can send messages at the storage layer between different nodes in a way that you and I can't, because we don't control EBS. So Aurora is a special case: it's a shared-disk system, but to his point, the disk layer isn't dumb, it isn't just get and set. It does a bit more because it understands what transactions are and what the database is actually trying to do, and that's more efficient for OLTP. Yes. And there are other complications with shared disk as well. There are examples where people start off with a shared-nothing system and then say, oh, I want to run it in the cloud.
So they want to run it in, say, a Kubernetes cluster, but S3 is already replicating, and the database system itself is also replicating. In Cassandra, it used to be that if I do one update, Cassandra replicates it three times in the system, but then S3 would also replicate each write three times per node, so one write would get written nine times or some crazy number. Again, they're treating the disk as a dumb object store that does get and set and doesn't know anything about databases. They've since solved that problem by not replicating as much. But to his point, yes, we're sacrificing quite a bit by using shared disk, but the scalability and the cost difference is quite significant, so it's worth it. Okay, I think I've covered everything on shared nothing, but here's a bunch of different systems that all do this; pretty much every distributed database written before, say, 2010 did something like this, and now in the cloud everyone wants to do shared disk. So the shared-nothing example would look like this. Now we don't have a shared disk; every node has its own disk, and I'm explicitly showing what data each one is actually storing. We'll talk about partitioning in a second, but think of it as a way to take a table and split it up into disjoint subsets. So node one at the top has all the tuples with IDs from 1 to 150, and the node at the bottom has 151 to 300. A query shows up that wants to get ID equals 200; again, we'll cover in a second how we actually find where to go, but the node knows that the data it needs for this query is located locally, so it doesn't need to coordinate with the other node, it doesn't need to go out to a shared disk. Everything it needs to run this query is local. Likewise, this other query can go to the top node and say, I want IDs 100 and 200. Again, there's some metadata, a catalog, keeping track of which node has what data. For ID 100, it can process that locally, but for ID equals 200 it can either go down to the other node and say, hey, I know you have the record for ID equals 200, send it up to me, or it can say, hey, go run this part of the query for me and send me back the result. Because again, we want the distribution of the data in the distributed database to be transparent to the application. Ideally we don't want the application to have to know that 100 is here and 200 is down there; it sends one SQL request to one node, and the nodes figure out how to get the results they need and send back a single response. And any time, say, a node goes down or we have to rebalance, we don't want to have to go change the application; everything should just work. It doesn't always work out that way, but ideally that's what we want. All right. So now the challenge with shared nothing, as opposed to shared disk, is when a new node shows up or I take a node away, because when a new node initially boots up there's nothing inside of it, so we need to shuffle around the data from the original two nodes and put something in it. Let's say we just split it up evenly: the first node has 1 to 100, the second has 101 to 200, and the last has 201 to 300.
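As a rough illustration of the kind of catalog lookup those nodes are doing, here is a minimal sketch of range-partition routing: a sorted list of range boundaries maps an ID to the node that owns it. The boundaries and node names are hypothetical, matching the rebalanced layout just described; a real system keeps this metadata in its catalog and has to update it transactionally when it rebalances.

```python
import bisect

# Upper bound (inclusive) of each range, kept sorted, and the node that owns it.
# Assumed layout after adding the third node: 1-100, 101-200, 201-300.
RANGE_UPPER_BOUNDS = [100, 200, 300]
OWNERS = ["node1", "node2", "node3"]

def route(record_id: int) -> str:
    """Find the first range whose upper bound covers record_id."""
    idx = bisect.bisect_left(RANGE_UPPER_BOUNDS, record_id)
    if idx == len(OWNERS):
        raise KeyError(f"id {record_id} is outside every partition")
    return OWNERS[idx]

print(route(100))   # node1
print(route(150))   # node2 -- after rebalancing; before it, node1 owned 1-150
print(route(200))   # node2
print(route(250))   # node3
```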
So this rebalancing is hard to do, because we want to do it safely and transparently, and we don't want a small window where a query shows up looking for ID equals 150 and, because we're in the middle of initializing the new node with its data, the query goes to the first node and it says, I don't have what you need, because it doesn't know about the new node yet. So you essentially need to do this migration in the context of a transaction. In shared disk you don't have to do that, or at least you don't have to physically move data; you just update logical information to say the data you're looking for belongs to this compute node or that compute node. In a shared-nothing system it's physically moving data, which can be expensive. This is sometimes called auto-scaling or elasticity, this idea of repartitioning or reconfiguring your data. It was actually one of the major selling points of MongoDB back in the day. I think they still support it, but they wouldn't do it in the context of transactions, because the early versions of MongoDB didn't have transactions. So what would happen is you would have these partitioned nodes, you would add a new node and try to rebalance, and there could be a small window where if your query showed up, it would give you a false negative and say you didn't have data that you actually had, because it was moving things around. So again, this is not easy to do. All right, so distributed databases are not new; they go back to the 1970s. Probably the first ones were this thing called MUFFIN, from Stonebraker, the guy who invented Ingres and Postgres; it was basically multiple instances of Ingres with a processing layer on top. SDD-1, from Phil Bernstein, is oftentimes credited as the first one. There's a bunch of papers that describe the early architectures of distributed databases, but he gave a talk once where he said they never actually built it; it was a bunch of scripts and prototypes to get money from the government, but they laid down a lot of the key theory. At IBM, they built a distributed version of System R, the first relational database they built in the 70s, called R*. Gamma was a very influential parallel/distributed database out of the University of Wisconsin. And of all of these, the only one that's actually still running today is NonStop SQL. Jim Gray, the guy who invented two-phase locking and worked on System R at IBM, went to Tandem and helped build NonStop SQL. NonStop SQL is still widely used; if you've ever used an ATM, you've probably used it. Tandem got bought by Compaq, and Compaq got bought by HP, so you can still get NonStop SQL. A lot of banks still use it, but if you're a brand-new startup you're not using it. Tandem used to sell these appliances called NonStop machines, and the idea was super-redundant hardware, think NASA level: you have three CPUs running the same computation at the same time, and then you check whether they produce the same result.
So the idea was that it would never stop, and if you're a bank, you want that. Yeah, it's still around. They still make a lot of money off of it, but it's pretty much in maintenance mode at this point, though I'm sure the contracts are in the millions. I mean, IMS, the first database that IBM built in the 1960s, is still around and still widely used. I remember reading somewhere that it's still one of the top moneymakers for IBM because of the maintenance fees; it's mission-critical software that people don't want to change. Okay, all right. So now we know at a high level whether we do shared disk or shared nothing. There are a bunch of questions that came up while I was describing the different architectures that we have to figure out. The first one, obviously, is how does the application actually find data? I have a query, where does it go? And if the data I need is not at the node I sent the query to, what happens? I sort of said there are two cases: the query arrives at one node, the data it needs is at another node. Do I send the query down to the data and get back the result, or do I send the data up and run the query locally? It's called push versus pull: you either push the query to the data or pull the data to the query. If it's shared disk, it's always pulling the data from the shared-disk layer up to the compute node. If it's shared nothing, you can choose one versus the other. And there's a trade-off, of course: if the query, the SQL string, is 10 kilobytes, and the data I want to scan with that 10-kilobyte SQL string is one terabyte, it's obviously better to send the query to the data, but if I don't have any computational resources where the data is, then I can't do that. So this is mostly an issue in a shared-nothing system, but even in shared disk, if you have a locally attached disk as a cache, you still get to make that choice. Then the question is how we ensure correctness if we want transactions: if we update data items on multiple nodes, how do we make sure everyone is in sync and agrees on the correct version of the data? And then, how do we divide the data across resources? We're going to cover all of these except the correctness one; we'll cover that in more detail next class, because we'll get to two-phase commit and Paxos and Raft and all that good stuff. All right, so the next decision we have to make is what our nodes actually look like, and this is the notion of whether the nodes are homogeneous or heterogeneous. In a homogeneous cluster, every single node in the distributed database system is a first-class entity, meaning it can do all the tasks that any other node can do: it can take any query, it can take any transaction request, it can move data around. This arrangement is ideal because it makes it easier to do provisioning and failover: if a node goes down, you can take another node and have it fill in. Whereas if I'm designating some nodes to do one special operation and other nodes to do another, and I run out of nodes in one category, I have to manually go and say, okay, let me take resources out of another one.
This is less of an issue now with the cloud, because it's easy to spin up new instances; if you're running on-prem in a non-virtualized environment, it's harder to do. In a heterogeneous cluster, nodes are assigned to specific tasks. You may have a single physical node running multiple daemons or processes at the same time that handle the different responsibilities, which avoids a network message between them, but at a really large scale you want them to be separate, because each task can be very compute-heavy. The homogeneous case is pretty obvious, so to show what a heterogeneous one looks like, here's an example of MongoDB's architecture. This is the version of MongoDB you run on-prem; they have their own cloud version that hides all of this from you, but if you download MongoDB, run it on your local machine, and use the standard setup, you get something like this. You have an application server, and it sends query requests to a special node called a router. The router examines the query request, say, I want to get record ID 101, and then goes to this thing they call the config server, which maintains its own internal state about where all the data is located in the database, meaning which partition, or shard, has the data you need. It sends that information back to the router, which then routes the query to the appropriate node that has the data, gets back the result, and sends it back to the client. So there's this transparency between the application and the database itself: the application doesn't need to know how many partitions there are, where each partition is located, how much disk they have, and so forth. All of that is hidden from you, because the router is responsible for sending requests to the right location. So in this cluster, different nodes have special tasks. And this is just repeating what I've already said: we want to achieve what is called data transparency, where the application should not know anything about our distributed database: it should not know how many nodes there are or where the data is actually located, and any query that would run on a single-node instance of our database should also run, and produce the same result, on the distributed database, regardless of how I scale things and change things. Now, in actuality, when you're building an application that uses a distributed database, you kind of do need to be aware of where your data is actually located, so you don't do stupid things. Like, if I have to update two records and those two records are in different parts of the world, and I do it in a transaction, then I'm going to have to do multiple round trips across the ocean to coordinate between these two data items, and that can be very expensive. Or if I run a query that wants to scan a petabyte of data, I don't want to pull that petabyte from Europe or Asia back to the US to process it; ideally I send my request to the router that knows how to run queries in that part of the world. So what I'm trying to say is
we don't want to have special syntax or modifications to our SQL queries to say, hey, run this query on this table located at this location. We want the database to handle all that for you, but again, you need to be aware of where the data is actually located to make sure you don't run things that are more expensive than they should be. So now we need to talk about partitioning, because this is something we alluded to when we talked about parallel execution, and I've been showing examples where these databases are split up and queries go to one location versus another. So now we need to talk about how we actually do this splitting. The idea is that we want to split our data across the multiple resources that we have; depending on whether it's shared disk, shared nothing, or shared everything, that could be splitting across different hard drives, different physical nodes, different CPUs, different memory regions. I'm going to use the word partitioning, because traditionally in academia and in the relational databases they're referred to as partitions; in the NoSQL world, or the Hacker News world, they're going to refer to these as shards or sharding. The basic idea is the same: we split the database up into disjoint pieces. And just as we did with parallel execution, a query shows up, we generate the query plan, and we break it up into fragments that we send to the different partitions to run in parallel; it's the same idea in a distributed database, and then we combine the results from the different fragments into a single answer. The database system can partition a database either physically, if it's a shared-nothing system, or logically, if it's shared disk; it has to do with where the final resting place, the primary storage location, of the database actually is. So let's go through a bunch of examples of partitioning schemes you can have, and then we'll discuss the trade-offs of trying to support them. The easiest way to do partitioning is what's called naive table partitioning, where you take an entire table and assign it to a single node: if I have two tables, one goes to one node and the other table goes to the other node. This obviously assumes that you have enough storage space to store the entire table on each node, and it assumes that most of your queries are only going to access one table at a time; this would be terrible if there are joins. And of course, what happens if you add a new table, where does that go? This complicates a bunch of things, but it's the easiest way to split a database up. So it looks like this: I have two tables, and I literally take all the tuples from table one and put them on the node that is partition one, and all the tuples from table two go to partition two. The best-case scenario for this arrangement is that every query only accesses one table. Very few systems will actually let you do this; Mongo is the one I can think of, where you can say this table, or collection as they call it, will be dedicated to a single node. They would do this for some cases where they have applications that
want to update the actual database itself but also append log records into a separate table. So you'd put that log table on a separate node and just blast the inserts over to it, and it wouldn't interfere with updating the actual database, because you never go back and read the log unless you're doing an audit later on. Another type of partitioning is called vertical partitioning, and the idea here is basically that you want to simulate a column store: you take a table and split it across its attributes, so all the values for some subset of attributes for all the tuples go to one partition, and all the values for some other subset of attributes go to another partition. Say I have four tuples and they have four attributes, and the last attribute, attribute four, is a large text field. What I want to do is break it up so there's one partition that stores just the first three attributes and another partition that stores the last attribute. And just like a column store, I would of course have to know how to map or combine them back together to produce the original tuple. What's more common in distributed databases is called horizontal partitioning, and the idea here is that we split the tuples up into subsets where all the attributes, all the columns, for a given tuple are stored together on one partition; it doesn't literally have to be contiguous if it's a column store, but if I want to get all the values for a single tuple, they're going to be on one partition. The challenge here is that we want to pick a partitioning key and a partitioning scheme, the mechanism we use to divide things up, in a way that divides the database equally across the different nodes by some metric: storage size, how many queries get sent to them, how often they get updated, and so forth. Picking the right partitioning scheme is actually an NP-complete problem, and there's a bunch of research on how to do this automatically. Oftentimes for OLTP it's quite obvious: usually the schema is a tree structure, like you have a customer account, customer orders, customer order items, and you'd partition all those tables on the customer ID. Otherwise it's a bit harder, as we'll see next week, because I could be doing joins across arbitrary keys, and one partitioning scheme may be the best for one query but the worst for another set of queries, so figuring this out is non-trivial. I've already shown range partitioning: there's some partitioning key, some column, and say IDs zero to one hundred go to one partition and one hundred one to two hundred go to another. Oftentimes, though, hash partitioning is what's widely used. Predicate partitioning is not that common; think of it as a WHERE clause you evaluate on each tuple, like a routing table, to figure out where that tuple should go. It's usually hash partitioning or range partitioning, and between those two, hash partitioning is usually more common, especially in OLTP workloads or the NoSQL systems. All right.
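The hash partitioning walked through next can be sketched in a few lines. The key values, the partition count of four, and the use of SHA-256 as the hash function are assumptions for illustration; the lecture mentions xxHash as one real choice, and a real system would keep the partition map in its catalog.

```python
from hashlib import sha256

NUM_PARTITIONS = 4   # assumed cluster size for the example

def stable_hash(value) -> int:
    """Deterministic hash; Python's built-in hash() is randomized per process."""
    return int.from_bytes(sha256(str(value).encode()).digest()[:8], "big")

def hash_partition(key) -> int:
    # hash(key) mod N tells us which partition stores the tuple
    return stable_hash(key) % NUM_PARTITIONS

# Route tuples by the partitioning key (the second column in the lecture example);
# the values here are made up.
tuples = [(1, "alpha"), (2, "bravo"), (3, "charlie"), (4, "delta")]
for row in tuples:
    print(row, "-> partition", hash_partition(row[1]))

# An equality predicate on the partitioning key can be routed the same way:
print("WHERE col2 = 'alpha' -> partition", hash_partition("alpha"))
```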
So let's see how to do hash partitioning for this table here. The first thing we have to do is pick a partitioning key; say we pick the second column. We have four partitions, so we take the value of that column for each tuple, run our hash function (xxHash, whatever one we want), mod the number of nodes or partitions we have, and that determines where the data goes; maybe it gets split up across the partitions like this. Now for any query that comes along with an equality predicate in its WHERE clause on exactly our partitioning key column, we just take the value they're passing in, hash it, mod four as in this example, and that tells us which node has the data we want. And again, this is something the database system does; you wouldn't do this in your application, the database system figures it out for you. Is this clear? Okay, so now let's talk about the difference between logical partitioning and physical partitioning. In a shared-disk system, logical partitioning is basically a way to denote which node is responsible for processing the data, or the queries that touch the data, within some partition. Say out on our storage device we have four values, and there's some metadata we keep at each node that says, I'm responsible for IDs within this range, or for things that hash to me. Now when the application server says, I want to get ID 1 (there could be something in front of this that knows how to route it; we can ignore that for now), the request goes to this node, and this node knows it's responsible for that ID, so it can go to disk and get it. This other node wants to get ID 3; same thing, it's responsible, so it can go do that. Now if I want to get multiple IDs, as I showed before, we have a decision. We know this node isn't responsible for ID equals 2, and rather than just going out to disk and getting it ourselves (because then everyone could potentially have a copy, and that would be a problem: if I want to do an update, where do I go, and how do I tell everyone I've updated it?), this node can say, well, I'm not responsible for ID equals 2, so let me send the query request or the data request up to the responsible node, and that node can send me either the result of the query or the actual data that I want, I maintain an ephemeral copy, and then I send the result back. Physical partitioning, in shared nothing, is where the data is actually physically located at the nodes. Just like before, I'm splitting on ID, so when I want to get ID 1 I know to go to the top node, and when I want to get ID 3 I go to the bottom node, and that's fine. And if I do the same message passing as before, when the bottom node wants ID equals 2, it can either send the query request or the data request and keep a local copy at the bottom node. Okay, so let's go back to this example. Again, we were doing hash partitioning on the second column, where I take the value of that column for each tuple, hash it, mod four, and that tells me where the tuple will actually be stored. What's the problem with this approach? Yes, bingo: he said that as you add new nodes
or remove nodes, you have to move stuff around. If I add a new node here, I can't mod four anymore, because now I have five nodes, five partitions, so I have to go rehash everything, which requires shuffling everything around. With range partitioning you can split the ranges, so you can play a trick there, but if you're using hash partitioning this won't work; you have to re-mod everything again. The solution to this problem is what's called consistent hashing. Who here has heard of consistent hashing before? Less than half, less than 25 percent, okay. Consistent hashing is a really neat technique, and it's widely used both in distributed databases, mostly on the operational side (less so for OLAP), and in distributed systems in general, as a technique to avoid this reshuffling problem. It was developed at MIT in the early 2000s. The way to think about it is that the hash range for your partitioning key is represented as a circle from zero to one. The idea is that we hash the location where we want each partition to sit and map it onto the circle: say partition one is at this point on the ring, partition two is down here, partition three is over here. Now when I want to do a lookup, say get me ID equals 101 or whatever the key is, I hash it, it lands somewhere on the ring, and then I just move forward around the ring to find the closest partition from my location going in clockwise order, and that's where my data is stored. So I land here on the ring, scan clockwise, and P1 is what I want. Likewise, if I hash another key and land in this part of the ring, I slide along and P3 has the data I want. The way to think about this is that each partition covers the portion of the key space on the hash ring from the previous partition's position around to its own. So far this seems fine, but what makes it special is what happens when a new partition, say P4, shows up. The only thing I need to do now is move the data that falls in the part of the ring the new partition now covers: I'm splitting the neighboring partition's range, and only the data that now maps to P4 has to be moved to it; the data in all the other partitions doesn't move at all. Same thing if I add another partition, P5, that splits the range between P3 and P1: I only need to move the data from P1 that falls in that part of the ring, and the same as I keep adding partitions. I don't have to reshuffle the whole ring or the whole partitioning scheme; the change is localized to just that part of the ring.
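Here is a minimal sketch of that ring. The partition names and the use of SHA-256 are just placeholders. `lookup` walks clockwise to the first partition at or after the key's position, and `replicas` returns the next N partitions, which is the replication idea described next.

```python
import bisect
from hashlib import sha256

def ring_hash(value) -> float:
    """Map any value to a position on the [0, 1) ring."""
    digest = sha256(str(value).encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

class ConsistentHashRing:
    def __init__(self, partitions):
        # Sorted list of (position, partition-name) pairs.
        self.ring = sorted((ring_hash(p), p) for p in partitions)

    def _successor_index(self, key) -> int:
        pos = ring_hash(key)
        idx = bisect.bisect_left([h for h, _ in self.ring], pos)
        return idx % len(self.ring)          # wrap around past 1.0 back to the start

    def lookup(self, key) -> str:
        """The first partition clockwise from the key's position owns it."""
        return self.ring[self._successor_index(key)][1]

    def replicas(self, key, n) -> list:
        """The owner plus the next n-1 partitions clockwise (replication factor n)."""
        start = self._successor_index(key)
        return [self.ring[(start + i) % len(self.ring)][1] for i in range(n)]

    def add_partition(self, name):
        bisect.insort(self.ring, (ring_hash(name), name))

ring = ConsistentHashRing(["P1", "P2", "P3"])
print(ring.lookup(101), ring.replicas(101, 3))
ring.add_partition("P4")          # only keys that now hash between P4 and its
print(ring.lookup(101))           # predecessor move; everything else stays put
```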
Okay, so how do I find data now if I also want replication? To have better availability, which we'll cover more next class, I actually want to store the data across multiple nodes, multiple partitions. So you have this notion of a replication factor, meaning every time I insert a new record into the database I'm going to store it on, say, three different nodes, three different partitions. So any data I was storing on P1, I'm also going to store on the next two nodes in the ring, P6 and P2. Now when I want to do a lookup and say, go find me key 1, I hash it, it lands here on the ring, and I know I can go fetch it at P1, P6, or P2, and all of them should have the data I'm looking for. Of course this makes writes more complicated: how do I make sure that if I write something to P1, it gets to P6 and P2? We'll talk about that next class; you either wait for all of them to acknowledge it, or you wait for a majority to acknowledge, depending on your tolerance. As I said, there was a very influential paper in the early 2000s from MIT called Chord that invented this technique, but then most famously Amazon picked up the idea and implemented it in DynamoDB in the late 2000s. There's another system called memcached, an in-memory distributed hash table for caching things, which is basically a database system, that uses this. Cassandra was written by Facebook originally to power their mailbox, but the guy that founded Cloudera broke Cassandra at Facebook for that mail feature, Facebook said we don't want to use this, so they threw it away and made it open source; Facebook never used Cassandra even though they wrote it, and a bunch of other people picked it up. And Riak is, or was, a distributed key-value store; I think it's still around, the company backing it went under, but you can see in their logo they have the ring, and they're doing that three-way replication thing I showed before, because it's a key-value store that uses a consistent hashing scheme like this. Yes? The question is, do I mean Dynamo rather than DynamoDB? DynamoDB is the public, commercial version derived from the Dynamo design. His point is that they wrote a paper just this year about the DynamoDB design and it didn't talk about consistent hashing; okay, I'll double check that, but the original Dynamo paper, from 2007 I think, does this. We should cover that in the advanced databases class. Good. All right. So now, in this example, when I said I do a write, I want to write the data to three locations, this gets us into the world of how we make sure that when we do updates, those updates can apply to multiple nodes and everything is safe and durable. So now we have this notion of the scope of a transaction, meaning how much data, or how many nodes, we're actually going to have to touch. And again it doesn't matter whether it's shared disk or shared nothing: if the data is located, either logically or physically partitioned, at different nodes, I have to coordinate across them when I do an update so that it happens atomically. So if a transaction only touches data at a single partition, it's obviously a single-partition transaction, a single-node transaction, and this is the best-case scenario, because potentially I don't need to coordinate the commit process of that transaction with other nodes in the system. Since the data it touches is only at that single partition, I don't care what data other transactions running on other nodes modify, because they don't care about what's local to me. That's the best-case scenario, and you want to
ideally pick a partitioning key, if it's an OLTP system, that maximizes the number of single-partition, single-node transactions, and oftentimes you actually can do this. Think about the example of using the customer ID as the partitioning key: when you log into Amazon, you're only updating your own customer record, your own account, you're not updating other people's accounts. So if your data is partitioned such that all your information is on a single node, you get a single-node transaction, and that's the best-case scenario. A distributed transaction is when the transaction has to touch data at multiple partitions, and that gets very expensive; we'll focus on this next class. But the question now is who is actually going to coordinate these transactions. Whether it's two-phase locking or OCC doesn't matter; we still need somebody to say, okay, this transaction is committing, let's go ahead and do it. The two approaches you have are a centralized coordinator or decentralized coordination between the nodes themselves. The centralized one is probably, historically, more common. Things get blurry, because the way these systems work now is that it's decentralized coordination, but they run a leader election, like Paxos or Raft, to decide who is actually the coordinator, so it essentially becomes centralized; but in theory, if that leader crashes and goes down, you elect a new leader. So it's not a clean dichotomy, but I want to cover both ideas. All right, so for a centralized coordinator, the first version of this was what are typically called TP monitors. TP now stands for transaction processing; in the old days I think it stood for teleprocessing. They built these centralized transaction coordinators for distributed databases to manage things like people calling on the phone, keeping track of call records, or ATM machines and so forth. So they built these early TP monitors back in the 1970s, and the idea was that you had these single-node databases that could do transactions locally on the data they had, similar to what you're building in BusTub, but if I want to touch multiple database systems and run transactions across all of them at the same time, I need this centralized coordinator to sort of be a god above them and decide who can do what. They built these in the 1970s, and a lot of them are still around today: most famously, I think for American Airlines they built a system called Sabre that uses a TP monitor, and they still use it today to coordinate airline reservations. In the 1990s there was an attempt to standardize the protocol for committing transactions between these different nodes, called XA, from X/Open. Newer systems don't support this, but the enterprise systems
like Oracle and DB2 could do this. So the idea is you could take single-node Oracle databases, have multiple of them running at the same time, use a TP monitor, and they can coordinate with each other through that TP monitor over the XA protocol. So the idea looks like this. We have an application server, and before it's allowed to go touch data on any partition, say partitions one and three, it has to go to the coordinator and say, I'm going to need this data at these partitions, let me go ahead and lock them. The coordinator has a lock table, like the one you're building in Project 4, that keeps track of who holds the lock on what data. So the application makes the lock request, and once it's granted the lock and gets the acknowledgement, it can go out to the different nodes and do whatever it needs to do to run the queries, do the updates. But when it wants to commit, it goes back to the coordinator and says, hey, I want to commit. The coordinator knows how to talk to the different nodes and asks, this guy wants to commit, he touched you, is he allowed to do this? And if yes, it does the commit and gets the acknowledgement back. Now, what that wire protocol is, this part here, how we decide it's safe to commit, get everyone to agree, and then commit, that's Paxos, Raft, or two-phase commit, which we'll cover next class. All right, so for TP monitors, there's a bunch of older systems out there that do this. BEA had this thing called Tuxedo, which they bought from AT&T, who built one of the first of these in the 1980s; then Oracle bought BEA, and you can still get Tuxedo today. Transarc actually came out of CMU, out of the AFS project; it was a TP monitor for doing traditional transactions, and IBM bought them in the late 90s, early 2000s. If you're in ISR, or whatever the software engineering department is called now, there's a guy, Jeff Eppinger, who was a founder of Transarc. And then this one is hard to read, it's a terrible logo, but this is Omid, a TP monitor, a transaction coordinator, for HBase that came out of Yahoo Labs; it's an open-source Apache project, I think it's still available today. All right. So again, the idea here is that there's this coordinator on the side: the application has to know there's a coordinator, has to go to it and say, hey, can I do something, the coordinator says yes or no, and then the application sends requests directly to the partitions, the nodes. What is more common is to have middleware as the centralized coordinator, where the application doesn't know about the partitions, doesn't know about these nodes. It always sends requests to the middleware, which has its own lock tables and manages everything, and it sends the requests to the different nodes as needed. And then when you go to commit, you ask the coordinator, the middleware, can I commit, and like before, it goes and asks the different nodes, is this allowed? And again, the advantage of this approach is that the middleware handles the query routing and handles the transactions; the application doesn't need to know what's behind it.
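To show the shape of that flow, here is a minimal sketch of a centralized coordinator that grants coarse-grained partition locks and then polls the touched partitions at commit time. The class and method names are invented for illustration, and the `can_commit` vote is a stand-in for the real commit protocol (two-phase commit, Paxos, or Raft) covered next class.

```python
class Partition:
    """A single-node database that can vote on whether a transaction can commit locally."""
    def __init__(self, name):
        self.name = name
        self.pending = {}                     # txn_id -> list of buffered writes

    def execute(self, txn_id, write):
        self.pending.setdefault(txn_id, []).append(write)

    def can_commit(self, txn_id) -> bool:
        # Stand-in for local validation (integrity constraints, local 2PL state, etc.).
        return True

    def commit(self, txn_id):
        self.pending.pop(txn_id, None)        # pretend the writes are now durable

class Coordinator:
    """Centralized coordinator: hands out partition-level locks and drives commit."""
    def __init__(self, partitions):
        self.partitions = {p.name: p for p in partitions}
        self.locks = {}                       # partition name -> txn_id holding it

    def acquire(self, txn_id, names) -> bool:
        if any(self.locks.get(n, txn_id) != txn_id for n in names):
            return False                      # someone else holds a lock; caller must wait or abort
        for n in names:
            self.locks[n] = txn_id
        return True

    def commit(self, txn_id, names) -> bool:
        parts = [self.partitions[n] for n in names]
        ok = all(p.can_commit(txn_id) for p in parts)   # ask every touched partition
        for p in parts:
            p.commit(txn_id) if ok else p.pending.pop(txn_id, None)
        for n in names:
            self.locks.pop(n, None)           # release the locks either way
        return ok

# Usage: lock partitions P1 and P3, do the writes, then ask to commit.
p1, p2, p3 = Partition("P1"), Partition("P2"), Partition("P3")
coord = Coordinator([p1, p2, p3])
assert coord.acquire("txn-1", ["P1", "P3"])
p1.execute("txn-1", ("update", 101))
p3.execute("txn-1", ("update", 205))
print("committed:", coord.commit("txn-1", ["P1", "P3"]))
```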
PlanetScale has now commercialized this. I should have shown the example of MongoDB, it works sort of this way. This is what Google did in the early days with sharding MySQL, this is what Facebook does now with MySQL, eBay did this with Oracle; there are a bunch of people that do this. Okay, the approach that, again, blurs the line between the middleware or centralized coordinator and decentralized coordination: the idea is that the application server sends a begin request to start a transaction to some partition, some node in our cluster. That node then gets anointed as the leader, and it's responsible for determining whether this transaction is allowed to commit or not; if it's not allowed to commit, the leader tells everyone to roll back and makes sure everyone agrees on whether this thing commits. So the application could send query requests to the individual partitions, or it could go through the leader, it doesn't matter, but when it goes to commit, it asks the leader, can I do this, they all have to agree, and then the result gets sent back. Again, we'll discuss next class what that protocol actually is. Yes? The question is, can the leader be any node? In theory it could be anyone, yes. Yes, so it's the same as when you're using Redshift: you have one leader node, you send all your requests to that, and it's responsible for dispatching them to the different nodes; that's the middleware approach. How is this decentralized? Because another transaction could start at P3, and then P3 is the leader for that transaction, but now you've got to deal with, I'm the leader, you're the leader, you're doing one thing, I'm doing something else, and we've got to coordinate. Yes? The question is, do you run leader election per transaction? No, you would have leases, because otherwise if you're running Paxos leader election and then committing your transaction, it's kind of redundant, it's wasteful. Yes? The question is, does each partition do its own concurrency control, or does the leader do it for you? So I could have two transactions hit P3 without touching any other node, and you would have to run two-phase locking, or whatever you want, locally as well. It's just that you need to be aware that, okay, somebody else is also updating data at this node, and their leader is up above, so I need to make sure I'm aware of that. If you're using a centralized coordinator, since everything goes through at least the middleware, it depends: the middleware could hold coarse-grained locks, like locks on a portion of the partition, and then for individual tuple-level locks you still do that locally. So again, it's two-phase locking distributed across nodes, but then on the local node you do two-phase locking as well. And the reason why you have to ask whether it's safe to commit, let me go back to the centralized one to make this more clear: the application could have sent SQL requests to the middleware, which says, well, I know how to route it to the location that has the data you want, but I don't actually know what you did.
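And here's a similarly rough sketch of the decentralized version, where the node that receives the begin request acts as the leader for that transaction and polls the other participants at commit time. The Vote call is a hypothetical stand-in for whatever agreement protocol (two-phase commit, Paxos, Raft) is actually used, which we cover next class; each participant still runs its own local concurrency control.

```cpp
// Sketch of decentralized coordination: whichever node a transaction begins at
// acts as its leader (illustrative only, not a real protocol implementation).
#include <iostream>
#include <set>
#include <string>
#include <utility>

class Node {
 public:
  explicit Node(std::string name) : name_(std::move(name)) {}

  // Each participant runs its own local 2PL/OCC, so this vote reflects whether
  // it can safely commit the transaction (always "yes" in this toy example).
  bool Vote(int /*txn_id*/) const { return true; }

  // The node that received BEGIN acts as leader: it collects votes from every
  // participant (including itself) and only then tells them to commit.
  bool CommitAsLeader(int txn_id, const std::set<Node *> &participants) {
    for (Node *n : participants) {
      if (!n->Vote(txn_id)) return false;  // any "no" forces a rollback everywhere
    }
    std::cout << name_ << " commits txn " << txn_id << " across "
              << participants.size() << " node(s)\n";
    return true;
  }

 private:
  std::string name_;
};

int main() {
  Node p1{"P1"}, p3{"P3"};
  // Txn 7 began at P1, so P1 is its leader; txn 8 began at P3 concurrently.
  p1.CommitAsLeader(7, {&p1, &p3});
  p3.CommitAsLeader(8, {&p3});
}
```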
Right, the middleware looks at the WHERE clause, but the SET clause of an UPDATE might do something like set somebody's age to a negative value, and then there's an integrity constraint that says that's not allowed to commit. So you have to be able to say, okay, I know you did something here, did this transaction fail or not, and if it failed you have to tell everyone else that it failed, and the middleware handles that for you. All right, so the last thing I wanted to cover is distributed concurrency control, which I've been alluding to all along, and this is what we'll talk about next week. Ideally we want multiple transactions to be able to execute simultaneously across different nodes at the same time; it doesn't matter whether it's shared disk or shared nothing, we want to be able to handle this. And we want all the same ACID guarantees that we spent the last three weeks talking about for our single-node database to carry over to our distributed database. Of course, this is going to be a lot harder, as I said before, because now we've got to worry about the network, the cost of sending messages, messages not arriving at all; we have to deal with machines going down and crashing, and what happens then. If we're using clocks for the timestamps, if we're doing timestamp ordering or even multi-versioning, or using the timestamp to keep track of who started first, how do I actually keep the clocks in sync across different nodes running at the same time? That's super hard. And since I want high availability, I don't want it to be the case that if one node goes down my whole database goes down because now I'm missing data; I want to be able to do replication, where I write data to one node and it gets propagated to other nodes. Shared disk solves that problem in some ways, but if you're shared nothing you have to worry about this. So let's see why this is hard. Say we want to do distributed 2PL. We have two nodes separated by some network; node one has a single key A, node two has a single key B. At exactly the same time, two transactions start running, and they don't know about each other at this point. The first one wants to update A, the other one wants to update B. So we just run local two-phase locking, no big deal; they both get the exclusive lock on these two objects, and that's fine. But now the first one wants to update B and the other one wants to update A, and each system knows that this data is not local, it's on the other node, so it has to send a lock request over the network, back and forth, and we have the classic deadlock we had to deal with before. All right, we know how to do deadlock detection: it's just a waits-for graph to figure out who's waiting for whom and whether the lock is ever going to get released. But now the problem is I have to coordinate between these different nodes, because each of them has its own notion of who's waiting for what. Someone has to run deadlock detection, someone has to tell somebody who's going to die. But if, say, the message gets lost, or they both try to do the same thing, they both try to kill each other, then nothing ever gets done. It's the same ideas, everything's just harder, because now the network is unreliable.
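For that deadlock example, this is roughly what the merged waits-for graph check looks like once whoever runs detection has collected the edges from both nodes: T1 waits for T2's lock on B, T2 waits for T1's lock on A, and the cycle means somebody has to be killed. This is just a standalone DFS cycle test for illustration, not BusTub's actual detector.

```cpp
// Waits-for graph with a DFS cycle check (illustrative sketch).
#include <iostream>
#include <map>
#include <set>

using TxnId = int;

class WaitsForGraph {
 public:
  void AddEdge(TxnId waiter, TxnId holder) { edges_[waiter].insert(holder); }

  // Returns true if the graph contains a cycle, i.e., a deadlock.
  bool HasCycle() const {
    std::set<TxnId> visited, on_path;
    for (const auto &[txn, targets] : edges_) {
      (void)targets;
      if (Dfs(txn, visited, on_path)) return true;
    }
    return false;
  }

 private:
  bool Dfs(TxnId txn, std::set<TxnId> &visited, std::set<TxnId> &on_path) const {
    if (on_path.count(txn)) return true;  // back edge -> cycle
    if (visited.count(txn)) return false;
    visited.insert(txn);
    on_path.insert(txn);
    auto it = edges_.find(txn);
    if (it != edges_.end()) {
      for (TxnId next : it->second) {
        if (Dfs(next, visited, on_path)) return true;
      }
    }
    on_path.erase(txn);
    return false;
  }

  std::map<TxnId, std::set<TxnId>> edges_;
};

int main() {
  WaitsForGraph g;
  g.AddEdge(1, 2);  // T1 (holds A on node 1) waits for T2's lock on B
  g.AddEdge(2, 1);  // T2 (holds B on node 2) waits for T1's lock on A
  std::cout << (g.HasCycle() ? "deadlock detected" : "no deadlock") << "\n";
}
```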
The network is slow, and the state of the system is split across two different geographical regions. Now, the centralized coordinator solves this in some ways, but you still have the problems of communicating between different nodes and asking, hey, are we allowed to do this, yes or no? So just because you have a centralized coordinator doesn't make all the problems go away; likewise, shared disk doesn't make all the problems go away. Okay, so just to finish up before we talk about project four: I've barely scratched the surface on distributed database systems. Next week will be all about doing this, first for OLTP and then for OLAP. And as I often say in this class, it's hard to get this right; this is why you don't want random JavaScript programmers writing their own distributed database, which they often do. That centralized coordinator thing, the middleware, people actually build that a lot inside their application: in their application code they'll say, okay, I hash the thing I'm looking up and that tells me which database connection I go to. They're basically doing sharding internally in their application. In my opinion, that's a bad idea; you don't want to do that, you want to use a database system in most cases. Most people probably don't need a distributed database at all; 99% of the applications out there don't need one. Replication, yes, and we'll cover that next class, but most people don't need a horizontally scaled-out distributed database. So the question is, when should you be considering a distributed database? My answer would be: if you're doing a startup and you cannot scale Postgres vertically anymore, because you have a lot of data, a lot of operations, a lot of transactions, then at that point you have money, and you pay me or him or somebody, probably him, to come and tell you what to do. But most startups, when they start off, can get by with a well-tuned Postgres running on a single box, with replication for availability. The NoSQL stuff, a lot of that came out because people started building these things because it's what Google did, and Google was, and is, super successful, in the 2000s and certainly now. A lot of people said, oh, Google is making a lot of money, they have a bunch of scalability problems, I'm going to be the next Google, I have the same kind of problems, and they tried to make their own versions of Bigtable and the other things Google was building internally. But a lot of the design decisions Google made were right for Google and not right for a lot of other people, so people were unnecessarily building complex systems they probably didn't need, reinventing the wheel for a lot of things. Okay, all right, project four. Project four is out, and what you guys are doing is adding support for transactions by doing two-phase locking in BusTub. So you're going to do deadlock detection and hierarchical locking, just with tables and tuples, no page locks. You have to support multiple isolation levels; I don't think we require serializability, because that requires extra steps, so we're not going to do that. And you have to support transaction rollbacks and aborts, rolling back data because of aborts. You do not have to worry about doing write-ahead logging in your system, so we're essentially doing the A and the I, atomicity and
isolation, for your transactions; we're not doing durability, because we don't worry about that here. Okay, so there are three tasks in project four. The first one is the lock manager: you're going to maintain the internal lock table and the queues of what transactions are waiting for what locks, you need to keep track of what phase they're in for two-phase locking, and when a transaction is waiting for a lock you need a way to notify that transaction to wake up and unblock it, and then let it acquire the lock (there's a small sketch of that wake-up mechanism after the end of this section). You also need to support lock upgrades, which is missing from this slide. Second, you have to build a deadlock detector: it maintains its own waits-for graph and then has a deterministic protocol for identifying which transaction to kill. I'm underlining the word deterministic because you want it to be the case that, for every arrangement of who's waiting for whom, you kill the same transaction every time, because that's how the test cases check for things; if you're flipping a coin, then it'll unlock something the test doesn't expect. And third, you need to modify your execution engine from project three, updating the sequential scan, the index scan, and I think the insert executors, to now support transactions. You don't need to lock tuples in your hash or nested-loop join operators, because you've already scanned the data at that point, or you assume that the access methods below them have already acquired the locks to read the data. Again, the leaderboard is for project four, not three, but we have a new SQL benchmark that does a bunch more than what you saw in project three, and we use that benchmark to measure the performance of your system. So now we're looking more holistically at the entire thing: who can build the fastest one while doing transactions. It's a simulation of an NFT scam, so if you're into that, you'll enjoy it. As before, only submit the files we tell you to submit on Gradescope and only modify those, make sure you pull the latest changes, and as always post your questions on Piazza, come to the Q&A session tonight, come to office hours; you can also compare your solution against BusTub in the browser. Don't plagiarize, or we'll mess you up. Okay, all right, so next class we'll talk more about distributed OLTP systems, replication, the CAP theorem, and then more real-world examples. Got it?
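As a quick aside on the lock-manager task mentioned above, here's a tiny sketch of the wait-and-notify mechanism: a mutex-guarded request state plus a condition variable that waiters block on until the grant check says they can proceed. This is my own illustration of the idea, not BusTub's actual LockManager API.

```cpp
// Sketch of blocking a waiting transaction and waking it up on release.
#include <condition_variable>
#include <iostream>
#include <mutex>
#include <thread>

struct LockQueue {
  std::mutex latch;
  std::condition_variable cv;
  bool held = false;

  void Lock(int txn_id) {
    std::unique_lock<std::mutex> lk(latch);
    cv.wait(lk, [&] { return !held; });  // block until the lock is free
    held = true;
    std::cout << "txn " << txn_id << " acquired the lock\n";
  }

  void Unlock(int txn_id) {
    {
      std::lock_guard<std::mutex> lk(latch);
      held = false;
      std::cout << "txn " << txn_id << " released the lock\n";
    }
    cv.notify_all();  // wake up waiters so they can re-check and acquire
  }
};

int main() {
  LockQueue q;
  q.Lock(1);
  std::thread t([&] { q.Lock(2); q.Unlock(2); });  // txn 2 waits until txn 1 unlocks
  q.Unlock(1);
  t.join();
}
```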