Give it up for two-phase locking. I'm back. Yeah, guys, I've been gone for a while. How are you doing? I'm doing well, how are you? My first wife gave me COVID. She betrayed me, so that sucked. But while Jignesh was teaching, we spent a lot of time writing papers and grants and things like that. Nice. And your show has been crushing it while I've been gone. Yeah, yeah. And how's your girlfriend? She's doing well. Last weekend she was at the lab. She wanted to find out who's been giving flowers, and Abby met her, and Kyle also met her. She's still pissed. She's trying to get to the bottom of it. I mean, it's on you, you have to handle it. What do you mean it's on me? You gave me this gig, so I'm fine. Okay, there's that. All right guys, let's get started. Let me get the right slides. So again, administrative stuff. Homework five is out. It's going to be due on December 3rd; that's in two weeks. Project four is out. That'll be due on December 10th, and we'll be announcing the info session for that on Piazza, I think, this week. Then, in the last week of class, there are going to be two special lectures. We're going to have a speaker from SingleStore come on over Zoom and talk about the SingleStore database system, with a heavy emphasis on the LLM and vector database stuff they're building in SingleStore. Again, that'll be over Zoom, so we don't have to come here for that. So please, please attend. And then on the very last day of class, Wednesday, December 6th, we're going to do the final exam review, but we're also going to do something we'll call the systems speed run. I'll post a form on Piazza; go put in some databases that we don't really cover during the semester that you want to learn about. We'll try to cover maybe 10 of them in 120 minutes, so roughly one database in 10 minutes, like a crash course.
Like, here's what this system does, here's why it's interesting, here's why it matters, or here's why it's stupid. And we'll try to get through as many systems as we can. Okay? Every year it's always like MongoDB is number one and Spanner is number two, but if you guys want to do something more challenging, I'll send a link to dbdb.io; pick any random system that you want. Okay? And the final exam has been scheduled. I don't know where yet, but it's going to be on Tuesday, December 12th at 8:30 a.m. And again, if you need any accommodations, or if you have three exams scheduled in a 24-hour period, which counts according to university policy, email Jignesh and myself and we'll reschedule you for the makeup exam. Okay? Any questions about any of these things? All right. For the seminar series we're running this semester, we have two talks left. The talk today is actually super exciting: we're going to have somebody from Amazon talk about pgvector. pgvector is the most popular vector extension for Postgres. So you don't need to use Weaviate or Milvus or Pinecone; you can do some of those operations directly inside of Postgres. We'll talk about how that works. And then next week, the last talk will be from a new database startup out of San Francisco called Chroma. That's again one of these specialized vector databases. Okay? Again, all of that is over Zoom. All right, so at this point in the semester we've covered everything you need to know, at a high level, about how to build a single-node database system. Right? We covered the lowest level at the storage layer. We talked about how to bring things in and out of memory. We talked about how to actually run queries, do query planning, take a SQL query and turn it into an actual physical plan. And then we went back across the entire stack and added recovery. We talked about how to use write-ahead logging or other methods to be able to recover the database after a crash.
We talked about how to have multiple transactions run at the same time and update the database through concurrency control. Right? Again, we didn't go too deep into any one of these topics, but this is basically what you need to build a single-node database system. So at this point in the semester we're going to build upon what we've done so far and talk about distributed databases. Right? It's hard enough to build a single-node database by itself. It's even harder to build a distributed one. All right? And there are a bunch of design decisions we're going to have to make about how we want these different nodes to coordinate with each other, how they're going to talk to each other, and how they're going to work together to execute queries and transactions. You could have coordination come from the top, with messages sent down; you could go through the bottom; you could go through the middle. Right? There are all these different design choices you could make for how to build a distributed database system, but at its core it's still doing all the things that we've talked about this semester. Right? There's still going to be a disk. You've still got to read things into a buffer pool. But now, when you run a query, are you going to send messages to another node to run part of the query over there? Should you pull data from what they have, or is there a central location you can go get data from? We only have three classes to discuss distributed databases, so obviously this is going to be a crash course in this topic. But hopefully what you'll get out of this is a way to assess real-world systems, because you'll understand the vernacular, the vocabulary, of what these systems are talking about and how they're implemented. And that'll help you make different design decisions if you ever want to build one yourself or need to start using one in the real world. Okay?
So we showed this slide early in the semester when we discussed parallel query execution, where we made this distinction between parallel database systems and distributed database systems. Right? We said parallel databases are ones where the nodes are physically close to each other. Think of different CPU sockets, or different CPUs that are running separate instances of the database system and communicate over some high-speed interconnect, or nodes running in the same rack so it looks like one giant shared machine. Right? So the big thing in a parallel database system is that we're going to assume the communication cost is small, like nanoseconds or microseconds to send a message from one worker to another. And we're also going to assume the communication is reliable. Meaning, if I send a message to another thread, that thread's going to get it. Modulo any software errors, the message is not going to disappear. Right? But in the distributed database system world, we can't make those assumptions. We can't assume that the nodes are going to be close to each other. We can't assume that one worker talking to another worker is cheap. Right? Because ideally you're in the same rack, but maybe you're in the same data center in a different availability zone. Or the worst-case scenario is that you're in two completely different data centers and one database node is on the opposite side of the planet. Now you're restricted by speed-of-light issues, and there's no magic wand to make that go away. So in addition to the communication between nodes being expensive, communication could be unreliable. Right? TCP/IP will help with a bunch of things, like making sure packets end up in order, but there's no guarantee that if we send a message the other side is actually going to get it. Right?
Or messages might arrive in different orders. Right? Not because the network gets severed, but say the node itself starts doing something weird. Say it's a Java-based database system and the garbage collector kicks in, so there's a 30-second pause because it's cleaning up the heap. Now the node looks unavailable, and when it flips back on, messages might show up in different orders. For all of these reasons, you have to account for all of these things in our distributed database system. We can't assume the hardware is going to hide it for us the way we can for parallel databases. This is just repeating what I said before, but we can use all the building blocks of a single-node database system that we've constructed so far, and now we can layer on top of that and say: okay, here's how we want to distribute the execution across multiple physical nodes. Right? And we can do this for OLTP systems running transactional workloads, and for analytical workloads. So today's class is going to be a high-level look at what a distributed database looks like and the things you have to worry about. The next class after the break will be entirely about distributed transaction systems, and then we'll cover distributed OLAP or analytical systems. And 15-721 next semester will be entirely on analytical systems. So in a distributed environment, everything is just harder. Optimization and query planning were hard enough on a single node; they're even harder in a distributed system, because now you may not be sure what data is actually at another node, or how fast you can communicate between nodes. Concurrency control is probably the hardest part of all this. How do we make sure that the different nodes involved in a transaction all agree that the transaction can commit? And when we say, yes, let's go commit, that they all actually did that?
It's basically a distributed state machine, if you've taken distributed algorithms or distributed systems. And then logging and recovery: again, how do we make sure that if one of the nodes in our system crashes, ideally it doesn't take down our entire system? If our distributed database has 20 nodes and one of them goes down, we don't want the other 19 to be completely locked up until that thing comes back up, because it might never come back up. Again, we'll touch on all these things as we go along. We won't really talk about query optimization too much; we'll talk a little bit about it next week. All right, so for today's class, we're going to first talk about a high-level overview of what a distributed system architecture could look like in the context of databases. Then we'll talk about the design issues we have to worry about when we actually implement one of these things. Then we'll talk about how we want to partition our database so we can split it up across multiple nodes. Again, we talked a little bit about that when we covered parallel execution. And then we'll give a preview of why distributed concurrency control is hard, and that'll be a segue into what we discuss next week. Okay. So let's start from the basics. A distributed database architecture is basically going to define where the resources are that are available to the CPUs. Again, we're assuming a von Neumann architecture: there are things on disk, we bring them into memory, and then there's a CPU that uses that memory to execute instructions and process data. So the real discussion, when you design a distributed database system, is: where's the memory, where's the disk, and who's allowed to read and write to them at any given time?
And this is going to determine how the CPUs coordinate and talk to each other: what kind of messages they're going to send, when they send them, and who's going to be involved in deciding when certain things should happen. Again, if it's a transactional system, somebody's got to decide, okay, it's time to commit this transaction, can everybody agree to that? And someone has to be in charge of making that decision. So the architecture we've mostly been discussing so far, actually entirely discussing this semester, is what's known in the database literature as a shared-everything system: a single box, a single node, with local disk and local memory, and that database system can access all those things equally. All right? And there's no other node, right? This is basically what a single-node database system is; this is what BusTub is, or Postgres, or MySQL, ignoring replication for now. But when most people think of a distributed database system, they think of something like this, what's known as a shared-nothing system. This is a term invented in the 1980s by the guy behind Postgres, Mike Stonebraker. And again, this is what most people think about when they think of a distributed system, right? There are individual nodes, and those nodes have CPUs, and those CPUs can access memory and local disk, but any time one wants to communicate with any other node in the distributed database, it has to go through some network protocol; for our purposes, assume TCP/IP. And it sends messages to the different nodes to coordinate with them. What's more common now, especially in the cloud, is what's known as a shared-disk system. The idea here is that there are still individual nodes that have, you know, CPUs and local memory.
They may even have direct-attached SSDs they can use for caching, but the primary storage location of the database is going to be on some shared disk. Right? Think of a giant NAS appliance, some shared storage server, or if you're in the cloud, think of S3, or whatever the equivalent is for Azure or GCP. That's the resting place, the primary location, of the database. In the case of a shared-nothing system, the database could be replicated on every single node, or we could partition it, break it up into subsets, and have each node be responsible for one of them. Right? But in a shared-disk system, it looks like a single logical disk that every node can communicate with. I'll go through each of these in more detail with examples. And just to be complete: in the research literature, there's also a category of systems called shared-memory systems. The idea here is that the disk and memory are shared across some kind of fabric, and the CPU nodes, the database nodes, can only coordinate or communicate through some kind of network. Again, this looks a lot like shared everything; I'll show this on a slide in a second. As far as I know, nobody actually does shared memory, right? This architecture mostly exists in high-performance computing, where you want a single disaggregated memory pool that all the nodes can coordinate through. But again, as far as I know, no database system actually implements this. All right, so we're going to spend most of our time focusing on shared nothing and shared disk. As I said, the term shared nothing I think dates to 1986. The first shared-nothing database system that was actually commercially available was Teradata, which I think was 1982 or 1983. There were some academic prototypes that predate that, but those were never actually put into use; Teradata was sort of the first one.
Right, again, the idea here is that each node can't view the memory or disk of any other node in the cluster, and they can only communicate through some network. The conventional wisdom is that a shared-nothing system is going to give you better performance and better efficiency, because now, if I shard or partition my database across the different nodes, then I can distribute a query, break it up into smaller pieces, have each node basically run full blast on its local storage, its local copy of the database, and then somehow aggregate the results together, the same way we did for parallel execution using an exchange operator. Right? So this gets the best performance because you're not bottlenecked on any individual node; you're only accessing data that's local to you. Of course, as we know in the real world, you may have to talk to other nodes to get data that you're missing, and that becomes a problem. The challenge, though, and we'll see this in contrast with a shared-disk system, is that in a shared-nothing system it's harder to scale capacity, because the data stored on the local disk is tied to each node. So if I want to add a new node to my cluster, I've got to copy data from other nodes to spread it out. It's also going to be harder to ensure consistency, because since every node has its own local portion of the database and can't see the memory of other nodes, the nodes have to send messages to ask: hey, what are you doing for this transaction? Or hey, what are you doing for this query? There are a lot of databases that use this architecture, and as I said, this was the conventional wisdom for how you build a scalable distributed database system up until maybe the last 10 years or so. The cloud changes that, but if you were going to build a distributed database in the 80s, 90s, or early 2000s, you would use this approach. Yes? So each of these...
Potentially, yes. It depends on how you want to partition it; we'll get to that in a second. You've got to go through the network and bring it over, yeah. The question is, assuming, say, the database is partitioned... we'll talk about how to do partitioning in a second. You literally just take one column, hash its value, and then you distribute the tuples across the different nodes, right? You're making a face, but that's very common, and it's actually a good idea; we'll see that. So basically this node one has some range, and the next node has the next range. I'll actually show this on the next slide. If my node here needs the data that this node has, I can't just peek over at its disk. I've got to send a message and say, hey, send me the data you have. Right? So, is the query going to all the CPUs? If you need all the data on all the CPUs, yes. But we'll get to this in a second: there's going to be some intelligence that says, okay, I look at the query, and it's declarative, so I know what you're trying to do; find me the minimum number of nodes that have the data I need. Right? Next slide. All right, so say we have a really simple distributed database: a two-node cluster, shared nothing. On each node there's a local CPU, local memory, and a local disk. And what I'm showing here is that I've taken a single table and partitioned it based on the value of an ID column; it's the primary key. The first partition at the top is going to have all the tuples where ID is 1 to 150, and the one at the bottom has all the tuples where ID is 151 to 300. Right? This is an example of range partitioning. As I said, hash partitioning is another approach; we'll see that in a second. But now, depending on what my query is, depending on what data I need to access, I would use this information to figure out which of these nodes I need to go to.
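As a minimal sketch of the range-partitioning idea just described (the node names and catalog layout here are made up for illustration, not from the lecture), a router consults a catalog that maps ID ranges to nodes and picks the one node that owns the key:

```python
# Hypothetical range-partitioning catalog: each entry maps an inclusive
# ID range of the primary key to the node that owns those tuples.
CATALOG = [
    ((1, 150), "node1"),    # node1 owns tuples with ID 1..150
    ((151, 300), "node2"),  # node2 owns tuples with ID 151..300
]

def route(tuple_id):
    """Return the node responsible for the given primary key value."""
    for (lo, hi), node in CATALOG:
        if lo <= tuple_id <= hi:
            return node
    raise KeyError(f"no partition owns ID {tuple_id}")

# The query "get ID = 200" only needs to touch node2; node1 is never contacted.
```

The same routing function works for hash partitioning by replacing the range scan with something like `nodes[hash(tuple_id) % len(nodes)]`.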
And the way I'm going to figure out which node I need to go to is through some kind of catalog or metadata service. I'm showing this as a cloud because it could live on the actual nodes themselves, or it could be a third-party external service on the side that says, okay, where do I need to go to get this data? Right? Different systems do different things. So I'm just saying this thing is somehow going to tell our application where we should go. Now, say my query is: get ID equals 200. Based on the information I've gotten from the catalog, I know that this node at the bottom has all the tuples where ID is 151 to 300. ID equals 200 is in that range, so I know I want to come here to get the data that I need. Now, when I execute this query on this node down here, I don't need to communicate with the node at the top, because it doesn't have any of the data that I need. Everything is down here, right? Then at the same time, I have another query come along, and say this one wants to get ID equals 100 and ID equals 200. We'll talk about data transparency in a second, but ideally we don't want the application server to be aware of, and in charge of, deciding how to get the data needed for a particular query. Meaning: I just want to send my query request for these two IDs to one node, and then have that node, the one at the top, figure out how to get the data it needs to process this query. And so it could be the case, in this example here, that the top node sends a message to the bottom node: hey, I have a query up here that needs to get ID equals 200, ship that data up to me. Right, and then I can return the response to the server. Is this clear? Yes? If you're doing aggregation, for example, where you need data at both partitions, at both nodes...
You want the database server to do that for you, not the application server, yes. We'll see that in a second. You could have something in front of this, a middleware or coordinator, where the query shows up, and it knows: here's the data that I need, and if I then have to combine results, it can decide, okay, most of the data I need is on the node here, so maybe I just want to pull a small amount of data from the bottom to the top, because that's going to be cheaper. We'll see this push versus pull in a second. Another question? Yeah, so the question is: should the top node also talk to this metadata service? Yes, ideally. You want this all to basically be transactional, so if something gets added, or data changes, or the ranges change, I want to do this in a transactionally safe manner, because then I can guarantee that anybody who comes along and looks for an ID while I'm moving things around gets the right answer. We'll see how Mongo did it wrong in a second, yes. All right, so now let's say I've got a lot of activity on my database, and I need to add more servers because my latency is too high and I'm getting too many queries. I want to scale out. So I add a new node here, but when it boots up there's no data inside of it, so now I need to start getting data from the other nodes in order to fill in its disk and start being able to process queries. So in this case, for simplicity, say the top node and the bottom node each split their ranges in half: IDs 101 to 150 come from the top node, and IDs 151 to 200 come from the bottom node, which previously had 151 to 300, over to the new node. Right?
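The coordinator idea mentioned above can be sketched in a few lines. This is a hypothetical illustration (the node names and in-memory tables are invented): for an aggregation like SUM, the coordinator asks each node to compute a partial aggregate locally, so only the small partial results cross the network instead of raw tuples:

```python
# Invented per-node storage: id -> value. node1 holds IDs 1..150,
# node2 holds IDs 151..300, matching the range partitioning example.
NODE_DATA = {
    "node1": {100: 7, 120: 3},
    "node2": {200: 5, 260: 10},
}

def node_partial_sum(node, ids):
    # Runs "locally" on each node; only the single partial sum is shipped back.
    return sum(v for k, v in NODE_DATA[node].items() if k in ids)

def coordinator_sum(ids):
    # Scatter the request to every node, then gather and combine partials.
    partials = [node_partial_sum(node, ids) for node in NODE_DATA]
    return sum(partials)
```

A smarter coordinator would consult the catalog first and skip nodes whose ranges cannot contain any requested ID.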
And then, to the question over here: I update the catalog service so that it knows this new node exists and the ranges it has, so that any new query that comes along and asks where to find the data it's looking for will see a consistent view of the catalog. Now, as I said before, MongoDB did this wrong. MongoDB had this auto-sharding feature, and I think they still have it; in the early versions of MongoDB, one of the big selling points was that it could do auto-scaling, so if your partitions or nodes were getting too hot, it could split the ranges up for you automatically. But they didn't move data around in a transactionally safe manner. They would do this copying that I'm showing here, but they couldn't guarantee that the catalog would be synchronized when the change occurred. So there could be a small window where you could actually get a false negative: your query goes to the top node because you're looking for ID 150, but the data has already moved while the catalog wasn't updated yet, so the catalog points you to the top node and then that node says it's not there. The reason I'm highlighting this metadata stuff is that it's just another transaction. We want a consistent view of the database, and that means not just what the data looks like on each individual node, but also the metadata that tells you where that data is located; we want that to be transactional as well. All right. Shared disk, as I said: the cloud has really made this the most popular way to build a database system now. The shared-disk architecture was first designed in the 1980s, but a lot of the systems based on it didn't pan out; it was a huge pain to build and they were very unreliable. The cloud changes that now.
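The transactional-safety point about the catalog can be sketched like this. It is a toy illustration (the dict-based "storage" and lock are stand-ins, not any real system's API): copy the data to the new node first, then atomically publish the new catalog mapping, and only then garbage-collect the old copy, so a concurrent reader sees either the old mapping or the new one, never a window where the tuple is "nowhere":

```python
import threading

catalog_lock = threading.Lock()
catalog = {"node2": (151, 300)}                    # current range ownership
storage = {"node2": {200: "row-200"}, "node3": {}}  # per-node tuple storage

def rebalance():
    # Step 1: copy tuple 200 to the new node; the old node still serves it.
    storage["node3"][200] = storage["node2"][200]
    # Step 2: atomically swap in the new catalog entries together.
    with catalog_lock:
        catalog["node2"] = (201, 300)
        catalog["node3"] = (151, 200)
    # Step 3: only after the catalog is published is it safe to delete the old copy.
    del storage["node2"][200]
```

Doing step 2 before step 1 (or without atomicity across both entries) reproduces exactly the false-negative window described in the MongoDB story.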
So again, the shared disk isn't something that we as the database system developers have to build, although some systems choose to do that; we can just rely on the massive infrastructure of the cloud vendors and use that as our backing storage. You could use Amazon S3; that's basically infinite disk. You never have to worry about scaling or provisioning new storage; Amazon has infinite storage for you. Or basically infinite: if you ever get to the point where you start running out of space on S3, you're going to get calls from Amazon way before that happens, right? Your credit card is going to get declined. I could also use a distributed file system. HDFS is a bad example, but that's something you could use, and there are better ones now that you could use as the backing store. But in this architecture, it's not enough just to say, okay, let me take my single-node BusTub or single-node Postgres and stick it on a distributed file system. The database system itself needs to be aware that it's talking to a distributed file system, because there are some optimizations you can do, and there's obviously some logic needed to make sure that you don't have two nodes trying to write to the same file at the same time because they're running transactions but don't know about each other. Right? So it's not just enough to have a distributed file system; the compute nodes have to be aware that they are part of a larger system. The nice advantage of using this approach is that you can now scale the two parts of the database system independently. If I need more compute, I just add new compute nodes, and the compute nodes are stateless. It's not like I need to copy data between the different nodes as I did in the shared-nothing system; they'll just pull data from the shared disk.
Now, I still need to update my metadata to say who's responsible for what portion of the database, especially if it's a transactional system, but that's way easier than in a shared-nothing system. We can still use direct-attached storage. So even though I'm not showing it in the boxes here at the top (sorry, new clicker, you can't even see my pointer), even though the nodes themselves aren't shown with a disk, each can have a local SSD as well, and we can use that as a bigger, slower cache. Right now your buffer pool in BusTub basically writes out to the disk, and that disk is the final location of the database. But I could have another stage in my buffer pool where I write things out to the SSD and then manage that as if it were just DRAM, doing eviction and throwing things out of it as well. When people talk about data lakes, they mean this architecture. The data lake stuff is interesting, and we can cover it next week, because the idea is that instead of all the write paths for all updates to the database going through the database system, I can just write things out to S3. And then there's some metadata service, like the Hive Metastore, or what I think Databricks calls Unity Catalog; there's some catalog service that you can update and say, hey, by the way, I have a bunch of these CSV files or Parquet files on disk, and then it knows how to incorporate them or use them when you execute queries. But it's still going to be a shared-disk architecture. When people say they have a serverless database, they typically mean this as well, because again, the compute nodes are stateless. The final resting place of the database, which is what we care about, is here on shared disk, on S3 or whatever it is. So now, if I've spun up my database system but haven't executed any queries on it for an hour, well, I can just shut off the compute nodes. In a shared-nothing system, the database would go away, because the disk was attached to each node.
So if I turn the node off, that portion of the database is no longer available. But in a shared-disk system, I shut the compute node off and all my data is still in S3. And then an hour later, if I want to execute another query, I spin up another compute node and start pulling data from the shared disk storage. That's basically how a serverless database works. There are a lot of systems that implement this as well, both for transactional systems and for OLAP systems. And as I said, most of the newer database systems designed in probably the last five years are using this architecture. And a bunch of systems that were originally shared nothing have since come around and actually retrofitted or refactored their code to become shared-disk systems. All right, so here's that same... yes, question. The question is: did these systems rewrite their entire storage backend? Yeah, I mean, if you're relying on S3, you still need a disk manager, right? But that disk manager isn't writing to local storage; it's issuing reads and writes to S3. Now, I made a big deal at the beginning of the semester and said, oh, the OS is terrible, we don't trust the OS for anything. But now I'm saying we're going to trust Amazon, right? Yes. You can think of S3, from our perspective, as just another disk. It's bigger and slower. It's not directly attached, but from the database system's perspective, it's just another disk. Now, there are some things you can do with S3 that you can't do with a regular disk: you can do some predicate pushdown. In S3, for example, you can actually run select queries, and S3 can natively parse a CSV or JSON file and run part of your query down there. So it's a little more sophisticated than a dumb disk, but from our perspective in the database system, it's just a disk, right?
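The "S3 is just another disk" point amounts to swapping the disk manager's backend. Here is a toy sketch of that idea (the `ObjectStoreDisk` class is an invented stand-in, a dict pretending to be an object store, not a real S3 client): the rest of the database system reads and writes fixed-size pages through the same interface either way:

```python
PAGE_SIZE = 4096  # assumed page size for this sketch

class ObjectStoreDisk:
    """Pretend object store: one 'object' per page, keyed by page id.
    In a real system, write_page would be a PUT and read_page a GET."""

    def __init__(self):
        self.objects = {}

    def write_page(self, page_id, data):
        assert len(data) == PAGE_SIZE, "pages must be fixed-size"
        self.objects[page_id] = bytes(data)

    def read_page(self, page_id):
        return self.objects[page_id]

disk = ObjectStoreDisk()
disk.write_page(7, b"\x00" * PAGE_SIZE)
```

The buffer pool above this layer is unchanged; what changes is that each `read_page` now costs a network round trip (and money), which is exactly why local SSD caching matters.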
So we're still going to be responsible for deciding when things get written out, what gets brought into memory, and, if we have a multi-stage cache, where things go. How do we decide how to split all the data up? All of that we still have to manage ourselves; that doesn't go away. Yes. The question is: is there any benefit to doing this on your own versus using S3? Could you do better than S3? Absolutely, yes. And some systems do that, right? Do I want to run my own distributed disk or distributed file system? That's one aspect of what you could rewrite. Very few people do that, because again, Amazon is Amazon; they have hundreds of engineers working on this. And like I said, it's infinite disk; you don't have to provision any of it yourselves. You pay more for it, right? And the latency can be like 50 to 200 milliseconds sometimes; that's a lot. But they handle replication for you, which can be good or bad. Again, if the database system is aware that it's writing to S3 and that S3 is going to replicate itself, then maybe the database system itself doesn't have to do replication. So that's one aspect of this. Then: do you want to rely on Amazon's libraries to talk to S3? There's one database company, Yellowbrick, that gave a talk with us a few years ago, where they said they tried all the Amazon libraries, they're all crap, and they rewrote everything themselves. And they did kernel bypass to make reads and writes, puts and gets, to S3 even faster than what Amazon will give you. So there are various levels of optimization you can do before you get to, okay, let me run my own S3. But I think having the ability to do predicate pushdown in S3 is a big win as well; I think Microsoft supports that, I don't think Google does, or one of them supports it. Being able to do some predicate pushdown means I can be more selective about bringing back data that maybe I don't need.
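To make the predicate-pushdown benefit concrete, here is a pure-Python illustration (no real S3 calls; the block layout and predicate are invented). With a dumb disk, every block crosses the network and the compute node filters locally; with pushdown, the storage layer applies the predicate first, so only matching rows are transferred:

```python
# Invented storage layout: two blocks of two rows each.
BLOCKS = [
    [{"id": 1, "val": 9}, {"id": 2, "val": 1}],
    [{"id": 3, "val": 8}, {"id": 4, "val": 2}],
]

def pull_then_filter(pred):
    """Dumb disk: fetch every row, then filter on the compute node."""
    fetched = [row for block in BLOCKS for row in block]
    return [r for r in fetched if pred(r)], len(fetched)  # result, rows transferred

def pushdown(pred):
    """Smart storage: the predicate runs at the storage layer (like S3 Select)."""
    fetched = [r for block in BLOCKS for r in block if pred(r)]
    return fetched, len(fetched)
```

Both paths return the same answer; the difference is the number of rows that had to move, which is the cost that dominates in a shared-disk system.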
What are the downsides of this versus shared nothing? We'll see this in a second. Let me go to the next slide. The big challenge is gonna be that you almost always have to pull data from disk into the compute node. Again, it's called pushing the query to the data or pulling the data to the query. In a shared nothing system, you can make that decision and you almost always can push the query to the data. So even if you can't push the entire query, like if I can do some predicate pushdown, that's better than just blindly grabbing blocks and fetching them. That's the key difference, right? But not having to worry about how to do replication and all these other things that Amazon takes care of for you, that's a big enough win, and the speeds have gotten so much faster, that it's worth paying that penalty. All right, so let's go back to our example here. So now my application server runs a query, say it's get ID 101. The node can go to the catalog service and figure out where to find that data, like, again, what bucket on S3 has the data that it needs. And then when it accesses storage, just like before with a disk manager, you have to convert the record you're looking for to a page number or a block number or a bucket number or a segment or whatever you wanna call it, go out to the disk and get it, and then it copies it into the local memory of this node, which then computes the query and produces the answer, right? Same thing with this guy down here. Again, he can go get ID equals 102, it goes and fetches that page and then processes the query. So now if I wanna add new capacity, you wanna add a new compute node, again, these nodes are stateless, meaning the primary location of the database is not in the compute nodes, it's on shared disk. So even though these nodes may have cached copies of pages, which you would wanna do because you pay money every time you look things up on Amazon, it's a temporary lease, temporary ownership.
So I could spin up this new node, not have to copy any data between my other nodes potentially, and then the new query can show up and this thing goes and gets the data that it needs. And then if I wanna add more disk, what do I do? Well, actually, let's do an update first. So if I do an update here on ID 101, since this node here has a copy of it, I have to make the modification to the shared disk, but then I maybe have to update everyone else, say, oh, by the way, I know you have a cached copy of this tuple, here's the new version of it. Or, whatever version you have now has been invalidated, go back to the shared disk and get the new version. Yes. So the question is, how does the top node know that the bottom two nodes have a cached copy of this? We'll see this in a second. In shared disk, we're still doing that partitioning, but it's logical. We're saying that one node is gonna be responsible for handling the writes to a given partition. And then if another node also wants to get a copy of it, it has to tell somebody: hey, by the way, I know you're the owner for this record or this partition, give me a copy of what you have. And now this guy knows that the bottom guy has a copy of it and needs to get updated. Or you could tell the catalog server, hey, by the way, I have a copy of it. Or it could broadcast to everyone, like a gossip protocol, and say, hey, go get the latest version. And when we talk about isolation levels next week, it's the same idea. This is actually the C in ACID, which we sort of danced over before. Do you wanna have a consistent view of the entire database, yes or no? If you want a consistent view, then you gotta make sure that anyone who could possibly have a copy gets this update: here's the new version. Or if you're okay with stale reads, then I do my update, but I only eventually tell everyone else, hey, by the way, here's the new version of it.
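Here's one way to picture the invalidation scheme described above, as a toy Python sketch. All the class and function names are invented for illustration; a real shared-disk system would do this with versioned pages and a cache-coherence protocol, not Python dicts.

```python
class SharedDisk:
    """The primary resting place of the data (think S3)."""

    def __init__(self):
        self.data = {}


class ComputeNode:
    """A stateless compute node with a local cache of shared-disk data."""

    def __init__(self, disk):
        self.disk = disk
        self.cache = {}

    def read(self, key):
        if key not in self.cache:          # cache miss: pay the trip to shared disk
            self.cache[key] = self.disk.data[key]
        return self.cache[key]

    def invalidate(self, key):
        self.cache.pop(key, None)          # drop the stale cached copy, if any


def write(writer, others, key, value):
    # The writing node updates the primary copy on shared disk...
    writer.disk.data[key] = value
    writer.cache[key] = value
    # ...then tells everyone else their cached copy is now stale.
    for node in others:
        node.invalidate(key)


disk = SharedDisk()
disk.data["id:101"] = "v1"
a, b, c = (ComputeNode(disk) for _ in range(3))
b.read("id:101")
c.read("id:101")                      # b and c now hold cached copies of v1
write(a, [b, c], "id:101", "v2")      # a updates and invalidates the others
print(b.read("id:101"))               # cache miss, re-fetches "v2" from disk
```

The eventual-consistency alternative from the lecture would simply delay (or drop) the `invalidate` calls and accept that `b` and `c` serve stale `"v1"` reads for a while.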
And it may be the case that any query that goes here and reads the old version, that's okay. For some applications, that's okay, for others, it's not. Again, we'll cover that more next week. All right, if I wanna add new storage capacity, again, if it's S3, it's easy. You just keep handing them more money, right? And they'll gladly take it. And you just get more storage capacity. Or even if it was managed by yourself, right? I can just add new disks to my distributed file system. And because these guys are stateless, it doesn't matter. I don't have to do any coordination there. All right, so I'm not gonna say much about shared memory. Again, this is almost theoretical, not quite, because you could build one, and there's people who danced around designing prototypes in the 80s and 90s. But as far as I know, there's no real system that actually does this. And again, the idea here is that you have these stateless compute nodes, and there's some kind of shared disk thing, but the memory is also shared as well. So anytime I wanna send a message to another node, I just write to some memory address, and there's some hardware that magically makes sure that everyone gets the update. It looks a lot like shared everything. The distinction is just that there are separate physical nodes, and there's some interconnect like RDMA or InfiniBand so that they can talk to each other. All right, but again, nobody actually does this as far as I know. All right, so I sort of mentioned this before, but the idea of a distributed database is old. It goes back to the 1970s. So as far as I know, the first two prototypes of distributed databases were this thing called MUFFIN, which stands for "multiple-something Ingres," from Stonebraker, the guy that built Ingres and Postgres.
They have a sort of tech report paper out of Berkeley that describes here's how you could build a distributed version of Ingres. The more famous one is SDD-1 by Phil Bernstein. Phil Bernstein did a lot of the great initial work on how to do concurrency control in distributed databases. But he gave a talk once at a workshop I was running, where he admitted that SDD-1 actually wasn't a real system. It was a bunch of scripts that simulated a prototype so they could show the government, hey, we can actually do this, and they got money from it, but they actually never built the real system. IBM built a distributed version of System R, the first relational system that they were building, called R* (R-Star). Gamma was an early prototype of a distributed database at the University of Wisconsin from the 1980s. Actually, it was built by Jignesh's PhD advisor at Wisconsin. But the only one of these that is actually still around today is this thing called NonStop SQL from Tandem. That was one of the projects that Jim Gray worked on at Tandem. Jim Gray, again, won the Turing Award for databases. He invented two-phase locking and a bunch of other stuff we've discussed this semester. He left IBM to go to Tandem, but he had a non-compete clause where he wasn't allowed to work on databases for five years or something like that. And Tandem was building this super fault-tolerant hardware. Think of, like, NASA-level fault tolerance. Like, there's two CPUs running the same computation and they check to see whether they get the same result. This was obviously big in banks in the 80s, even today. But Tandem NonStop SQL is still around today. If you ever use an ATM machine, chances are your transaction is gonna go through NonStop, or IMS from IBM. But you're going through some old systems. Oh, and MUFFIN supposedly stood for something like "multiple fast Ingres."
And I asked Mike once, and he was like, yeah, that's what they had to put in the paper, but the real expansion of the name was a bit ruder. So they called it MUFFIN. Mike doesn't curse that much. No, I was surprised when he said that. Anyway. All right, so you guys answered a bunch of these questions and we've been sort of dancing around these things, and now it's time to talk about, okay, how are we actually gonna do these things? So we know what the high-level architecture looks like. So now we gotta talk about how we're actually gonna run queries and execute transactions. So one of the first questions is, okay, how does the application find the data? In my examples, I said there was this catalog service. That's one way to do it, and then the application decides where it actually wants to go itself. Or we'll see another approach where there's just a single coordinator, a single URL that everybody talks to, and that thing knows where all the data is. And likewise, where does it actually send the queries? Do I send them to that coordinator? Do I send them to individual nodes? Is the application even aware of those individual nodes? Ideally, no. And then we wanna execute queries, and say the data that we need is not on a single disk or a single node. What do we do, right? And again, the two approaches are to push the query, or some portion of the query, to the data where it resides, process it there, and get back a subset of the results. Or I pull all the data that I need from one node to another node and process the query there. How are we gonna make sure that if we execute transactions that update data at multiple locations, and then we say commit, that it actually commits, and everyone agrees at the same time on the changes that are getting made? And then how are we gonna decide how we wanna split the database across different resources?
Is it the partitioning stuff I was saying before? So as always in all parts of systems, especially in databases, we're gonna have to make trade-offs, all right? Because we're not gonna be able to guarantee that our database system is gonna be online all the time and can answer any possible query, especially if nodes start going down and messages start getting lost and we can't communicate between nodes. So we have to make a decision on what we should do: should we produce possibly incorrect results, or should we just stop everything until we can get back online? So next class we're gonna focus on this. Again, how do we ensure correctness? How do we make sure that we can coordinate transactions across multiple nodes? The TLDR is gonna be something like Paxos or two-phase commit or Raft, if you're familiar with those protocols. But then we have to handle replication and other things. All right, so the first decision we gotta make is what should the nodes actually do? And the two approaches are you have homogeneous nodes or heterogeneous nodes. What I've shown so far are more or less homogeneous nodes, where every node in our database cluster can do any task. I can send a query to any node and that node can then figure out, okay, where's the data that I need? Do I send the query to that location, or do I get the data that I need and put it back together, right? And what's nice about this approach is that it makes provisioning the resources you need for your database cluster easy. And if a node goes down and you spin up a new one, it just fits in with the rest of the nodes, and you don't have to worry about rebalancing who's doing what. In a heterogeneous architecture, you have nodes assigned to specific tasks. I've already alluded to this: you could have the catalog service as a separate node, you could have a coordinator node or middleware as a separate node, right?
And then you make decisions about what's gonna be stateless, what's gonna be stateful, and should I have multiple sort of virtual nodes assigned to a single physical node so that one box can do multiple things? But now if that box goes down, which it will in a real system, then how do I decide where to place the new task, or how to fail over, right? Again, different systems do different things. I would say in the cloud architecture, the heterogeneous approach is more common now, with a coordinator or middleware sitting in front of the rest of the compute nodes. This is what Snowflake will give you, Databricks and others. But some of the NoSQL systems, like Cassandra, for example, or these distributed key-value stores, they'll be homogeneous nodes. I don't want to give you the impression that if you're NoSQL, you're homogeneous, and if you're SQL or relational, you're heterogeneous. Everybody does something different. And I'm not saying one approach is better than another. All right, now we're ready to talk about data transparency. Again, the idea is that we don't want our application to be aware of where the data is actually located or where the physical nodes are. Now, in some ways, at a high level, you kind of need to be aware of where your data actually is. Like, if I'm going to run an expensive query, I don't want to do a two-petabyte join between a data set that's across the country and a data set in my local data center, because that's going to be super expensive. Now, if it's unavoidable, sure, right? But you don't want people to be too loosey-goosey with sending whatever query they want. You want some high-level understanding of where things actually are.
But the exact physical address of what data is at what partition, ideally we want all that to be hidden, so that the same SQL query that someone writes and runs today could run the next day, even though the physical nodes have been reorganized or data has moved around because I've rebalanced. So again, ideally I don't want to have specific physical-location-aware hints or keywords inside my SQL queries. I want all that abstracted away and let the database system decide the best way to handle it. All right, so now let's talk about how we want to split our database up. So again, it's just like in a parallel database that we talked about before, where we want to divide our database into disjoint subsets so that I can take advantage of all the additional hardware that's available to us, right? I don't want to pay for 100 machines in my database cluster but then only be able to use one or two of them, right? That'd be stupid and a waste of money. So I'm going to use the term partitioning. In the relational database world, or the academic world, that's the term we use. If you ever read documentation about the NoSQL systems or other open-source distributed databases, they might say the term sharding. The idea is basically the same. We're going to break the database into disjoint subsets and we're going to store those subsets in different locations. And then, just like in a parallel database, a query shows up, I may want to break the query plan up into query fragments and distribute those fragments to the different compute nodes, have them execute on the partitions that they have, and then there'll be some exchange operator, some way to coalesce results and produce a single answer back to the application. Because again, I don't want to rewrite my SQL query if I add new nodes or take away nodes.
The same SQL query that works on a distributed database with 10 nodes should also work with 100 nodes without having to make any changes. So the database system is going to partition the database physically if it's shared nothing, because again, we have to physically divide it up across different nodes. Or logically in a shared disk system, because those compute nodes are technically sort of stateless and they're pulling data from the shared disk layer, but I still want to know who's responsible for what portion or what partition of the shared disk. All right, so let's talk about different ways to do partitioning. So the most naive approach, or the simplest approach, is called simple table partitioning. This is not that common. I know Mongo does this, I don't know what other systems do. But the idea here is you basically just say, all right, this table, its entire contents, goes to this node, and this other table goes to this other node. And this works great if you don't do joins across those two tables and most of your operations on the tables are very fast and only touch a small amount of data, because then you can make sure that the load of the application is spread across the different nodes. Here's a really simple example. I have two tables, so I'm going to color code them. All the rows or tuples from table one go to the first partition. All the tuples from table two go to the second partition. And in my ideal scenario, for any query that looks at only one of those tables, no joins, this will be okay. Again, there could be hotspots and other issues, but for simplicity we can ignore that. So in the case of Mongo, the example they told me, and the reason why they supported this, is that they had some customers that wanted to do horizontal partitioning, which we'll see in a second, across most of the tables, but they had one table that was almost like an application log.
So anytime there was a change, they would insert a new record into that table. And you never actually read it, you just wanted to write to it. So to ensure that that write operation didn't interfere with the other partitions, it would just all go to a single node, and for them that worked. Now, vertical partitioning is like a way to do a poor man's column store. The idea here is that we're gonna split the table based on the attributes themselves, not their values, but rather entire columns, all the values of a given attribute. So let's say in this case here, we have a table that has four columns. The first three columns are 32-bit integers, so they're small, they're cheap. But then I have a fourth attribute that's a text field, and maybe these are like 10 megabytes or something like that. But most of my queries only wanna access the first three attributes. So instead of, again, assuming I'm a row store, polluting my buffer pool and bringing in a bunch of data for that attribute, I could just do vertical partitioning where I split it up, store it almost as a virtual table, if you will, and have that be a separate partition that's managed separately in my cluster. All right, so you could do this to separate the two portions of the table, but you still wanna do horizontal partitioning, which we'll see on the next slide, because maybe I wanna distribute the tuples across different partitions as well. All right, so horizontal partitioning, I think we covered this also for parallel databases earlier, but this is what most people think about when they think about how a distributed database divides things up. And again, if you say sharding, this is typically what people mean.
The idea here is that we're gonna choose some column in our table that is gonna distinguish the tuples enough to distribute them across our partitions so that we get even load across our compute nodes, right? So that there's no one hot-spot partition, ideally. It doesn't always work: if everyone's updating a single key, you can't distribute that, it's gonna be within a single partition, all right? But that's not always the case. So the ways to handle partitioning, there's actually four. There's round robin, which we can ignore. Hashing is pretty common: you basically pick some column, take the value of that column in every single tuple, hash it with some hash function, and then you decide what partition it's gonna go to. Range partitioning we've seen before, where you set some continuous range of values and say that all goes to one partition, and another range goes to another partition. Predicate partitioning is where you can basically put a WHERE clause expression to determine what partition something's gonna go to. You can almost manually assign and say, like, where name equals Andy and age equals something, go to partition one; where name equals Andy and age equals something else, go to another partition, right? That's not as common as hashing and range partitioning. Hash partitioning is probably the most common one, and most of the NoSQL systems are gonna do this. All right, let's go back to our example here. So the first thing we gotta do to partition is pick a partitioning key, and let's say for whatever reason this column is the one we wanna use, right? And say we're doing hash partitioning: we then take the value in every single tuple, hash it with some hash function, and then mod it by the number of partitions that we have, and that's just gonna determine where the different tuples go.
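A minimal sketch of the two common schemes. The choice of md5 as the hash function and these particular boundary values are arbitrary, just for illustration, not what any particular system uses.

```python
import bisect
import hashlib

NUM_PARTITIONS = 4


def hash_partition(key):
    # Stable hash of the partitioning key, modded by the partition count.
    # (Python's built-in hash() is randomized per process, so use md5.)
    h = int(hashlib.md5(str(key).encode()).hexdigest(), 16)
    return h % NUM_PARTITIONS


# Range partitioning: each boundary is the (exclusive) upper end of a range.
# Partition 0 holds keys < 250, partition 1 holds 250..499, and so on.
RANGE_BOUNDS = [250, 500, 750]


def range_partition(key):
    return bisect.bisect_right(RANGE_BOUNDS, key)


# An exact-match lookup on the partitioning key routes to exactly one
# partition under either scheme:
print(hash_partition(101), range_partition(101))

# But a range scan of keys 100..300 touches only partitions 0 and 1 under
# range partitioning, while hashing scatters those keys across all four.
touched = {range_partition(k) for k in range(100, 301)}
print(sorted(touched))   # [0, 1]
```

This is also a preview of the hash-partitioning pain point discussed below: `hash_partition` bakes `NUM_PARTITIONS` into every placement decision, so changing the partition count reshuffles almost everything.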
And so the ideal query for this scenario is a lookup with an exact value on the partitioning key, because now I can take whatever value was passed into the query, hash it using the same hash function, mod it by the number of partitions, and then I know exactly where the data the query needs is. So going back, this example was physical partitioning, because, well, it's all PowerPoint slides, but I'm saying that the actual data itself is going to some physical location. But again, in shared disk, we don't really have that, right? We have these stateless nodes. So the idea here is that we would logically assign different values, or different hash values, or in this case a range of tuples within our table or database, to the different nodes. So now when a query shows up, get ID equals one, my catalog service would tell me, okay, this node is responsible for that ID value, and it knows how to go get the data that it needs. Same thing for ID equals three. And then if I want to get multiple ones, again, I can potentially go to the one at the top and it can get the data for me. Yes. The question is, is the main benefit of logical partitioning cache locality, as opposed to physical partitioning? So, in a shared disk system, you have to do this, because the resting place of the data is not these nodes here, it's over here on the shared disk. So you gotta decide: okay, if a query shows up, get ID equals three, get ID equals two, what node should be responsible for going and getting that data? Because, yeah, to your point, if you just make it random, then anybody is reading any data, and then you're just fetching from shared disk over and over again, and it costs more and it's gonna be slower.
But by doing this logical partitioning, you're potentially pinning the data here on this node, so for any query that shows up, you're more likely to have it already in your cache and not pay the penalty of going to the disk again. Yes. All right, I'm gonna skip this, we've already discussed physical partitioning a bit, but it's basically the same idea: you keep track of where the data is actually physically located on the nodes. All right, so going back to my example here, what's the problem with this approach, if you're doing hash partitioning? He says range scans. Yes, that's one, right? If I have a range scan, you know, select star from table where the partitioning key is between this and this, if it's hash partitioning, you can't do that. What's another problem? How do I scale out with this? Right, if I add a new node, what do I need to do now? I gotta update my hashing to now mod by five, and that sucks because that's gonna move data around, it basically reshuffles the entire database. So that is one advantage of range partitioning, but then how to figure out the ranges, that's non-trivial as well, right? So there's actually a way to handle this, which is really clever. Who here has heard of consistent hashing before? Okay, about half, that's more than in previous years. So consistent hashing is a really neat technique. It was invented at MIT and famously used in the early 2000s in this project called Chord. And basically what it's gonna allow us to do is incremental addition and removal of nodes in our cluster, using hash partitioning, without having to rebalance everything. And a bunch of different database systems take advantage of this. So the basic idea is that you have this ring of locations of where a key might exist in your database, right? And let's say I have three partitions, P1, P2, P3.
So now if I'm gonna do a lookup, say find me key one, I would hash it and produce a value, say between zero and one, and I would end up at some location on my ring. And now all I need to do is have some kind of metadata, some lookup table, to say, okay, from this position on my ring, if I go clockwise, what's the next partition I'm gonna find? So I land in the middle here, and then I know that the partition that's gonna have the data I need is P1. Likewise, if I hash key two, I land in this part of the ring, then I slide around clockwise and I find P3, right? So the way to think about it is that these colors here correspond to the range of hash values that map to each partition: for P3 it's from here all the way back to P2, and so forth. So far that's nice, but that doesn't solve our problem of how we actually add new partitions, right? So what the ring provides for us, because it's circular, is that we can introduce a new partition somewhere in the ring, in this case here we add P4, and the only data we need to reshuffle is the data that is now managed by this new partition along the ring. So it only affects P3. All the data from P4's position back to P2 on the ring that used to be on P3, P3 has to send over here, and I don't need to move any other data at any other partition, right? Likewise, I can add P5 here and P6 here, and it just changes the range of values that correspond to a given node. So now what's interesting, and we'll see next week, is that you can actually use this ring for replication as well. Meaning, if I set a replication factor of three, meaning I wanna have three copies of any key or any tuple I write in my database, I want three copies on different partitions. So if I do a write into P1, I just follow along the ring and find the next two partitions, and I make sure I write the data there too.
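The ring mechanics above can be sketched in a few lines of Python. This is a toy, assuming md5 as the ring hash and one point per partition; real systems like Cassandra layer virtual nodes on top of this, which isn't shown here.

```python
import bisect
import hashlib

RING_SIZE = 2**32


def ring_pos(name):
    # Map any string (a key or a partition name) to a point on the ring.
    return int(hashlib.md5(name.encode()).hexdigest(), 16) % RING_SIZE


class Ring:
    def __init__(self, partitions):
        self.points = sorted((ring_pos(p), p) for p in partitions)

    def lookup(self, key, replicas=1):
        # Walk clockwise from the key's position; the first partition we hit
        # owns the key, and the next (replicas - 1) hold the extra copies.
        positions = [pos for pos, _ in self.points]
        i = bisect.bisect_right(positions, ring_pos(key))
        n = len(self.points)
        return [self.points[(i + j) % n][1] for j in range(replicas)]

    def add(self, partition):
        bisect.insort(self.points, (ring_pos(partition), partition))


ring = Ring(["P1", "P2", "P3"])
keys = [f"key{i}" for i in range(1000)]
before = {k: ring.lookup(k)[0] for k in keys}

ring.add("P4")                        # introduce a new partition on the ring
after = {k: ring.lookup(k)[0] for k in keys}
moved = [k for k in keys if before[k] != after[k]]

# Every key that moved landed on the new partition; nothing else shuffled.
print(all(after[k] == "P4" for k in moved), len(moved) / len(keys))

# With a replication factor of three, a write goes to the owning partition
# plus the next two partitions clockwise around the ring:
print(ring.lookup("key1", replicas=3))
```

Compare this to the mod-by-N scheme from a moment ago, where going from four to five partitions relocates most of the keys; here only the arc claimed by `P4` moves, and removing `P4` would hand exactly those keys back.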
So now if a query shows up, say I wanna find key one, I could actually get the data from either P1, P6, or P2, because they're the three closest ones clockwise in my ring. Now there's a bunch of games you can play, like, okay, if I do a write with a replication factor of three, should I wait for all three nodes to respond with an acknowledgement that they did the write, or is a majority good enough? And if I do a read, should I ask all three nodes in my ring, or if just one of them comes back, is that good enough for me? Again, we'll see this next week when we do transactions: we don't have to have full consistency or strong consistency, we may be okay with things eventually getting propagated across the different nodes. And again, this is what the NoSQL guys do. So there's a bunch of systems that use this. The original idea was developed at MIT in the 2000s in that distributed hash table called Chord, but it was famously used in databases at Amazon in this key-value store called Dynamo. There's a paper from 2007 that talks about how they were using this approach. In the follow-up paper in 2022, they said that when they took the research system Dynamo and made the commercial version, DynamoDB, they dropped consistent hashing. They use a hierarchical replication scheme that we'll see in the next class. But a bunch of these systems use this. Riak was a NoSQL system, they went under six, seven years ago. You can kind of see in the logo the dots of the ring and the replication stuff, because they were using consistent hashing. Snowflake doesn't do this for the catalog, they use FoundationDB, which is a fully transactional key-value store, for the catalog, but they use consistent hashing for caching in their shared disk architecture, right?
They're using consistent hashing to do logical partitioning of the metadata about where things are located. And Cassandra is probably the most widely used one that does this, as well as Couchbase. Okay, so we'll talk more about transactions next class, but the basic big challenge is gonna be this: a transaction shows up, we look at our metadata service, our catalog service, and try to figure out what data it's gonna need to access. And then we're gonna use that to figure out whether it's a single-node transaction, only touching one partition, which is the best-case scenario because we only have to check data within that single node. Or if it's a distributed transaction, meaning we're touching data on multiple nodes, in multiple locations, then we gotta run distributed concurrency control and a consensus protocol to make sure that everyone agrees that this transaction is allowed to commit, allowed to make the changes that it made. So we'll ignore replication for today, and next class we'll cover how we actually wanna handle that as well. Again, I showed replication in consistent hashing: I do a write to my database, I want some number of copies to make sure that my data is always available, even if there's a crash. All right, so if you wanna support multiple operations on different nodes, then we need some way to coordinate the execution of that transaction. And the two basic approaches are: you could have a centralized coordinator that acts as a global traffic cop with a complete view of what's going on at any time in our database system. Or it could be decentralized, and let the nodes organize and talk amongst themselves to figure out which transactions they're running and who's allowed to commit at what time. Most distributed databases are gonna use a hybrid approach, where it's gonna be decentralized, meaning there isn't gonna be one dedicated machine or node that's always the traffic cop.
But since it's slow to do fully decentralized concurrency control, they're gonna elect a leader that's gonna temporarily be the traffic cop, the coordinator, and decide whether a transaction is allowed to commit. But if that node goes down, then you do a new leader election and somebody else can take over. And again, the spoiler is gonna be that we use Raft or Paxos to do that election. So I'm gonna go through two different examples of these centralized approaches, and then that'll segue into how to coordinate transactions next class. So the first example, the earliest example of doing distributed transactions, was a centralized approach using what is called a TP monitor. Most of you have probably not heard of a TP monitor. I think originally TP stood for teleprocessing monitor, but nobody calls it that now. So now you just say it's a transaction processing monitor. But think of it as a separate server or separate daemon running somewhere that can coordinate transactions across different nodes. And this was built in the 1970s and 1980s because, as I said, in the early days there wasn't a single distributed database system that was aware of different nodes and things like that. People sort of cobbled things together. And so they would build this TP monitor as a separate system that could then coordinate transactions across different, disparate database systems that didn't know they were doing transactions in a distributed way. Each database just saw something that it thought was a client telling it whether to commit or run a query and so forth. So the most famous one of these TP monitors is a system called Sabre. This was built by American Airlines back in the 1970s for running transactions across different databases for doing airline reservations. And there's a bunch of airlines that all still use Sabre today.
If you ever wonder why it's really slow to book airline stuff, it's because it's running on shit from the '70s, right? In the 1990s, there was a movement to try to standardize the protocol for how TP monitors could talk to different systems. This is called X/Open XA. And again, most of the enterprise systems that none of us in this room can afford, like Oracle and Teradata and NonStop, are all gonna support this protocol. I think Postgres might actually support some subset of it too. But this is how they were gonna have a standard API to coordinate through these TP monitors. So let's see an example here. Again, I'm not saying whether it's a shared-disk or a shared-nothing system; it doesn't matter at this point. It just matters that, okay, we have these partitions, assume that they can't talk directly to each other, how do we actually coordinate transactions? So say I have a transaction that wants to touch data at these three partitions. Again, assuming we know how to go to our metadata service to figure out what data we wanna touch, the application server goes to the coordinator and says, hey, I wanna lock data at this partition. And the coordinator's gonna have its own local lock table, just like you have on a single-node system, that knows about all the different partitions that are in the distributed database. And assume for now that we lock at a coarse granularity, the entire partition. So it'll go ahead and, running two-phase locking just as before, acquire the locks on that data, and send back an acknowledgement to the application server. Now the application server can send whatever queries it wants to the different partitions to do whatever it wants to do. And then when it's done doing those updates or lookups, it goes to the coordinator and says, hey, I wanna commit this transaction. The coordinator then communicates with the different partitions and says, hey, is this thing safe to commit, yes or no?
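The coordinator flow just described can be sketched as a toy (this is an illustration, not any real TP monitor's API): a single global lock table with one coarse lock per partition, strict two-phase locking, and a commit step that polls the partitions before releasing the locks.

```python
from collections import defaultdict

class Coordinator:
    """Toy TP-monitor-style coordinator: one global lock table,
    coarse-grained locking (one lock per partition), strict 2PL."""
    def __init__(self):
        self.lock_table = {}          # partition -> txn holding its lock
        self.held = defaultdict(set)  # txn -> partitions it holds

    def acquire(self, txn, partition):
        owner = self.lock_table.get(partition)
        if owner is None:
            self.lock_table[partition] = txn
            self.held[txn].add(partition)
            return True
        # Re-entrant for the owner; anyone else must wait or abort.
        return owner == txn

    def commit(self, txn, partitions_ok):
        # Ask every partition the txn touched whether it is safe to
        # commit; release all locks afterwards (strict 2PL).
        decision = all(partitions_ok.get(p, False) for p in self.held[txn])
        for p in self.held.pop(txn, set()):
            del self.lock_table[p]
        return decision
```

Usage mirrors the lecture example: the application server calls `acquire` before running queries against a partition, runs its updates directly against the partitions, then calls `commit`, and the coordinator only says yes if every partition votes yes.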
Then if yes, we send back an acknowledgement, right? So as I said, there's a bunch of old systems that are still predicated on, or still use, this technology. BEA had this thing called Tuxedo from the 1980s that Oracle bought 10 or 15 years ago. That was a TP monitor. Transarc came out of the AFS project here at CMU; they spun it off as a startup and it was acquired by IBM. Jeff Eppinger in the software engineering department here was a founder of that company. This one's a bit hard to read, but it says Omid, O-M-I-D. They don't call it a TP monitor because that's a dated term, but it was a centralized transaction coordinator to run HBase transactions, developed by Yahoo Labs a few years ago, and it's still around today. All right, what is more common is to use this middleware approach, where there's some software, some service, running in between the application server and the database system itself. So the application doesn't communicate directly with the partitions or the nodes; everything has to go through the middleware. All right, so you send query requests. The middleware maintains its own lock table just like the TP monitor, and it's responsible for acquiring the locks and sending queries to the different partitions. And then when it's done, it gets the commit request, and it's responsible for going to the different partitions and saying, hey, am I allowed to commit, yes or no? All right. So, I mean, there's a lot of commercial systems that do this now, but this is how Facebook scaled out MySQL back in the day, right? Because MySQL couldn't do distributed transactions, they put a middleware thing in front of it. Google did the same thing using MySQL for ads. If you're familiar with Vitess, or the startup called PlanetScale that's built around it, that's how they did transactions on MySQL for YouTube. I believe YouTube still runs on something like this, right? This is very common. All right.
And then the last one is a decentralized approach, where there is no middleware and there is no global transaction coordinator. A query, or a request to start a transaction, shows up at some partition. How the application decided to go to that one versus another depends on what's in the metadata; that node becomes the leader node for this transaction. Then the application may send requests to different partitions, and at some point it's gonna go to the leader and say, hey, I want to commit. And the leader is responsible for coordinating with the other nodes and deciding whether this is allowed to commit or not, yes or no. Again, we'll go into more detail on this next week. All right. So now I want to expose you to this idea of federated databases. I don't think this is actually in the textbook. It's an old idea. I just want to show it to you as an example of how you can start doing some really interesting things with distributed databases, where it may not be the case that all the nodes in your database are running the same software from the same database system. I sort of mentioned this with the TP monitor stuff: these disparate systems were being cobbled together, and we were using the TP monitor to decide how to coordinate transactions on them. But the idea with federated databases is that it's almost like the middleware approach, where you put something in front of the database systems that makes it look like it's all a single type of database system, but underneath the covers it's rewriting queries for you. Right. And the reason why we want to do this is because, in a lot of organizations, when you guys go out in the real world, a lot of companies have a ton of different databases. Right.
Because some guy in some corner of the company that nobody's paying attention to built a little app internally using Mongo or whatever, because he thought it was cool because he saw it on Hacker News. And it was fine when it was just him using it, but then his buddy started using it, and other people started using it, and before you know it half the company is using this application, and now the company has to support Mongo. Never mind that they've been using Oracle or Postgres for years; now they have a bunch of new databases to support. So at large companies it's never homogeneous. People choose different databases all the time, and now you have these different data silos, and ideally you want to have a single view of all your data. Federated databases are one way to do this. Right. So the idea is it's going to be a distributed architecture using a middleware approach that exposes to the application a single logical view of the database. Even though underneath the covers maybe one system stores stuff as JSON and another stores stuff as relational tables or whatever, ideally you want your application to only ever write queries against one data model. As I said, this is an old idea that goes back to the 1990s. Nobody does this really well, and nobody does this as efficiently as you possibly could, because it's sort of the lowest common denominator. To support any one system well, you want to be able to push down as much of the computation as you can, the query itself, to that single system, but you may not be able to do that because of query semantics and other issues. So you've got to pull a bunch of data up to the centralized coordinator and then do your join there, or do whatever you need to do. Right, because the different database systems can't talk directly to each other; you always have to go through the coordinator. So let me go through an example here. Say I have four different databases.
My query goes to this middleware, and the middleware is responsible for dividing that query up into the corresponding queries for the different database systems, and then it gets the results back and puts it all together. Right, but again the key idea is that we present a single logical view of the database to the application, even though it's spread across different machines. So again, it's like a distributed database that's doing partitioning; it's just that we now have to do some extra work to make it look like it's all unified, even though it's not. If you ever look at the database literature, these are usually gonna be called connectors. Postgres foreign data wrappers can be used for this. There's that distributed system called Presto that came out of Facebook, and there's a fork of it called Trino, because people didn't like Facebook or something. But they have a bunch of connectors to different types of systems. And in some cases they can do complete query pushdown: they can take a query and push it down entirely to the database system that has the data you need. In other cases they have to copy a bunch of data back up and then do the processing there, but you still get the answer that you need. Okay. All right, last two slides, a quick preview of what's in store for next week. We've already said this multiple times: we may need to allow multiple transactions to execute simultaneously across different nodes in our system, and we need to make sure that when they go to commit, everyone agrees that this is allowed to happen, and that ideally it looks like the changes are happening atomically even though they're spread across different machines. It was hard enough to do this on a single box; doing it on multiple machines across different data centers, that's challenging.
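The pushdown-versus-pull-up trade-off just described can be sketched with a toy connector interface. This is a hypothetical illustration, not the Presto/Trino connector API: each backend either evaluates the filter itself (full pushdown) or ships all its rows back so the middleware has to filter at the coordinator.

```python
class Connector:
    """Hypothetical connector for one backend database."""
    def __init__(self, name, rows, can_pushdown):
        self.name = name
        self.rows = rows
        self.can_pushdown = can_pushdown

    def scan(self, predicate=None):
        if predicate and self.can_pushdown:
            # Backend evaluates the filter itself: full query pushdown.
            return [r for r in self.rows if predicate(r)]
        # Otherwise ship everything back to the middleware.
        return list(self.rows)

def federated_query(connectors, predicate):
    """Middleware presents one logical table: push the filter down where
    the backend supports it, filter the rest at the coordinator."""
    out = []
    for c in connectors:
        rows = c.scan(predicate)
        if not c.can_pushdown:
            # Lowest common denominator: do the work centrally.
            rows = [r for r in rows if predicate(r)]
        out.extend(rows)
    return out
```

Either way the application sees one answer from one logical database; the difference is only in how much data had to move to the coordinator.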
So replication, we'll talk about next week: if we have copies of the data in different locations, how do we make sure that they're all in sync? Network communication is also going to be expensive. Nodes could go down, and it could be permanent, like the machine catches on fire and it's never coming back, or it could be a stall, because the GC kicks in or the disk starts defragging or something stupid, and messages get delayed. Or something stupid happens, like someone trips over a wire and pulls it out, and then they plug it back in and the node comes back online and has to figure out what's going on, and now maybe it missed the last 30 seconds of transactions. What should we do? And then clock skew is gonna be a big issue when we start doing timestamp ordering, because how do we make sure that everyone agrees on the right timestamp when a transaction wants to commit? You can't always use a logical counter, because how do you make sure that everyone does the plus-one at the same time? And you maybe can't use a physical clock, because there's gonna be drift and skew on the actual hardware itself, so it's really hard to make sure that clocks are actually in sync. And the spoiler is gonna be that the way Google handles this with Spanner is that they put atomic clocks in the data center and use those to get the time and make sure everything's in sync, or they get the time from GPS satellites, and they use that when they run transactions in their database, which is amazing. Nobody else does that. So again, we'll cover Spanner next week. Spanner is probably, in my opinion, one of the most advanced transactional database systems. Google did a lot of amazing things in that. Took a while to get there, right? They did a bunch of NoSQL crap before, but when they actually built a fully transactional database system, they were well ahead of everyone else. It's really fascinating. All right, so let's see why this is hard. Here's how we would do two-phase locking.
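A quick aside before the two-phase locking example: the "logical counter" idea mentioned above, where nobody can do the plus-one at the same time, is exactly what Lamport clocks address. This is a textbook sketch, not anything from Spanner: each node keeps its own counter, every message carries a timestamp, and the receiver jumps forward to `max(local, received) + 1`, which gives you a causal partial order without synchronized physical clocks.

```python
class LamportClock:
    """Per-node logical clock: local events increment the counter,
    and receiving a message advances it past the sender's timestamp."""
    def __init__(self):
        self.time = 0

    def tick(self):
        # A local event happened on this node.
        self.time += 1
        return self.time

    def send(self):
        # Stamp an outgoing message with the current logical time.
        return self.tick()

    def recv(self, msg_time):
        # Merge: jump past both our clock and the sender's timestamp.
        self.time = max(self.time, msg_time) + 1
        return self.time
```

The catch, and why this alone isn't enough for distributed commit ordering, is that Lamport clocks only order events that causally communicate; two transactions on nodes that never exchange messages can still get incomparable timestamps.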
So say we have two different application servers, and our database is partitioned into two pieces here. Application server one wants to set A equal to two; the application server over there wants to set B equal to seven. That's fine, because each can take locks on that data; for this first transaction here, he doesn't need anything else, and vice versa going the other way. But the challenge is gonna be if, within those same transactions, the first guy now wants to update B and the other guy wants to update A. Now what's the problem? It's a deadlock, right? But now it's a deadlock over a wide-area network, potentially, and I have to figure out who's gonna kill what to break the deadlock. Well, I can have a waits-for graph as we did before, but where is this thing located? Is every node maintaining its own waits-for graph? Is there a centralized coordinator? And what if one of those nodes goes down? Say I decide, oh, this one's the younger one, I wanna kill this transaction to break the deadlock, but then that node goes down and doesn't get the message, and when it comes back up it still thinks it has the locks. What do we do? All right? There's gonna be no magic bullet to handle this. Oftentimes it's just like, okay, I've waited long enough, full steam ahead, let's go. Or maybe it's a case of, okay, well, I can't talk to the majority of nodes, so I'm assuming there's a split brain, meaning I can't see the other side, so I'm gonna stop running any queries until things get resolved. So this is what we'll discuss next week after the break. Again, the main takeaway from all this should be that this is all very, very hard to do, and that in most cases people do not need a distributed database. Replication is a separate issue; we'll handle that next class. But most people don't need to scale horizontally. Like 99% of databases are 10 gigs, 20 gigs, maybe a hundred gigs, but even then it's not gonna be that big.
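The waits-for graph check itself is the same cycle detection you'd run on a single node; here's a small sketch (an illustration, not any particular system's detector) assuming some node has already gathered the union of everyone's local waits-for edges, which is exactly the hard part in the distributed setting.

```python
def find_deadlock(waits_for):
    """waits_for: txn -> set of txns it is waiting on (the union of
    every node's local edges, collected at the node running detection).
    Returns a list of txns forming a cycle, or None if there is none."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {t: WHITE for t in waits_for}
    stack = []

    def dfs(t):
        color[t] = GRAY
        stack.append(t)
        for u in waits_for.get(t, ()):
            if color.get(u, WHITE) == GRAY:
                # Back edge to a txn on our path: that's the cycle.
                return stack[stack.index(u):]
            if color.get(u, WHITE) == WHITE:
                found = dfs(u)
                if found:
                    return found
        color[t] = BLACK
        stack.pop()
        return None

    for t in list(waits_for):
        if color[t] == WHITE:
            cycle = dfs(t)
            if cycle:
                return cycle
    return None
```

In the lecture's example, T1 waiting on B (held by T2) and T2 waiting on A (held by T1) gives the two-transaction cycle, and the victim-picking policy (e.g., kill the youngest) is a separate decision on top of this.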
And in the cases where you do need a distributed database, well, there's a lot of these cloud services like Snowflake or BigQuery or whatever that will handle all this for you, so you don't have to manage it yourself. The transaction stuff is still very hard for OLAP because there are other challenges, but the concurrency control stuff is less of an issue there because you're not making a bunch of updates all the time. All right, so next class: distributed OLAP systems, replication, the CAP theorem, and then we'll talk a little bit about some real-world implementations and systems. Okay, questions? Yes? So the question is: in a shared-disk system, given that the storage is far away from the compute nodes, do we still use a lot of the page stuff? For OLAP, absolutely yes. All the stuff we talked about doesn't go away. Yes, okay. All right, hit it.