Carnegie Mellon vaccination database talks are made possible by OtterTune. Learn how to automatically optimize your MySQL and Postgres configurations at ottertune.com. And by the Stephen Moy Foundation for Keeping It Real. Find out how best to keep it real at stevenmoyfoundation.org. All right, guys. Welcome. It's a new semester. Welcome to the vaccination database seminar series here at Carnegie Mellon. We have a whole semester of awesome database speakers lined up to tell you about the various systems that they're building. And one of the things we're trying to do this semester is also have a sprinkling of some of the top researchers in academia come talk about the systems that they've been working on as well. With that, we are very excited today to have Dan Abadi as the first speaker for the seminar series. Dan doesn't need an introduction — he's one of the leading researchers in databases in academia right now. He should have come to Carnegie Mellon, but he went to the University of Maryland. That's OK. He's won the Jim Gray Award, he's won a Sloan Fellowship, and a bunch of other great things. He is one of the premier experts on databases. So with that, Dan, the floor is yours. I'd like to say, again, we want to do what we did last semester: if you have a question for Dan, please unmute yourself, say who you are and where you're coming from, and then ask Dan your question. We want this to be interactive, so please engage as much as possible. All right, Dan, go for it, the floor is yours. Yeah, I just want to reiterate what Andy said — definitely interrupt. I'm very happy to interact with people as we go. And how long do we have, Andy? I forgot what the time constraints are. I'd give it an hour. OK, all right. So we'll see how far we can get. These slides can probably last a little bit longer than that, but we'll stop when Andy tells us to stop. Go for it, yes. 
OK, so I also want to take an opportunity before I start to publicly thank Andy Pavlo. That's the beginning. Now let's jump into SLOG. So SLOG is a distributed system that is designed to process transactions at scale across the world. If you have an application which is a truly worldwide application, where you have users all over the world who need to access and modify the data, what is the right way to build it? First of all, what are some problems you run into, and how do you solve those problems in building a distributed system that has certain high-level guarantees? Specifically, you want to guarantee things like strict serializability, where we have very high levels of isolation and consistency in the system. That's the goal with SLOG. So we'll talk about what the challenges are and how we solve, or at least make progress towards solving, those challenges. Before we get to that, let me actually progress this slide here. OK, good. So before we get to that, let's remind ourselves how databases work in the context of a single node, which is the easy case. We figured this out about 30 years ago, and if you want to learn how concurrency control works, you can look at any lectures on MVCC versus OCC versus locking. All these things have been known for decades now. But let's just review quickly how they work. Say we have a database running on a single node, a single machine, in a single location, and we're a retail company. This is a retail application, and we're storing data in a database. Our entire application, as shown on this slide, is two tables. Obviously there would be more tables than that in practice, but to fit on this slide it's two tables: a widget table, which holds the things we're trying to sell, and a customer table, which holds the people trying to buy those widgets. And we're going to assume that people buy widgets with store credit in our application here. 
So we have a widget table and a customer table, and now a transaction comes along: say customer two wants to buy — sorry, I have to get rid of this pop-up over here. So customer two comes along and wants to buy widget three with store credit. What would the transaction code look like? Something along the lines of what's shown on the right here, where we have to read the widget table to figure out: do we have enough of widget three left in stock? We read customer two to see: does customer two have enough store credit to be able to buy this particular widget? If both of those things are true, then we can proceed with the transaction, which involves subtracting one from the inventory, because we're selling this widget now, and subtracting the price of the widget from the store credit of the customer. This could initially be written in SQL and converted to this, or written directly in some kind of stored procedure language, but either way, we have some code which runs this transaction against our database system. So how would you do this? Let's say we're going to use locking-based concurrency control. The way this works is that you lock the relevant records. Widget three is relevant — you're going to have to read it and update it eventually. Customer two is relevant — we'll have to read and update it as well. So we'll lock those two records in our database system, and then we'll go ahead and run all this code, doing all the checks in these statements to see if we're actually able to run the transaction, and if so, making all the changes. We do all that while holding the locks. So you see that the store credit has now been updated down to 21, and the stock has been updated down to zero. Once we're done with those updates, we can release the locks. That is a pretty straightforward way to guarantee serializability — serializable isolation — in a database system. 
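As a rough illustration of the transaction just described — a minimal sketch, assuming in-memory tables, a widget price of 79, and a `buy_widget` helper that are all made up for this example, not SLOG's actual code:

```python
import threading

# Toy single-node database: two tables and per-record locks.
widgets = {3: {"stock": 1, "price": 79}}
customers = {2: {"credit": 100}}
locks = {("widget", 3): threading.Lock(), ("customer", 2): threading.Lock()}

def buy_widget(cust_id, widget_id):
    # Acquire locks on both relevant records before reading or writing.
    with locks[("widget", widget_id)], locks[("customer", cust_id)]:
        w, c = widgets[widget_id], customers[cust_id]
        if w["stock"] < 1 or c["credit"] < w["price"]:
            return False               # abort: out of stock or not enough credit
        w["stock"] -= 1                # sell one widget
        c["credit"] -= w["price"]      # charge the store credit
        return True                    # commit; locks released on exit

buy_widget(2, 3)  # customer 2 buys widget 3: stock -> 0, credit -> 21
```

A second call to `buy_widget(2, 3)` would fail the stock check and abort, since the only widget three is gone.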
So that's how it works. There are other concurrency control protocols to do this, but this is a pretty common way to process transactions in a single-node database system. So what's the challenge? When you go from a single node to multiple nodes, and then eventually to a geographically replicated database system, what are the challenges that show up? The first is that the process of scaling an application usually means we're going to partition our data across multiple machines. Rather than having all data on a single machine, we may take our customer table, which had eight customers, and divide it across four machines, say, in this example. Of course, in practice there will be many more than that, but again, to fit on one slide we'll make it four machines. So we'll send two customers to each machine, and say we had four widgets in our original database — we'll divide those across the four machines as well. That's one widget per machine. So now we've partitioned our data across four different machines, and for now, they're all in the same location. Eventually we'll make it distributed across the world, but for now, it's just one location. But still, there's a challenge here, because the basic problem is that once you're on different machines, failures happen independently of each other. One machine can fail while another one doesn't. So you have the potential for new types of violations of atomicity that you wouldn't really see in a single-node database system. Let's see this in more detail. Let's get rid of the partitions which are not relevant to our example here. We'll just leave two of them on the screen, the ones that have customer two and widget three, since our example was customer two buying widget three. OK, so here are two partitions on two different machines, and we have the same transaction we had before. 
But now we're going to run this code across the two machines, where each machine will run the part of the code that is relevant for that machine. So for example, the machine on the top right of the screen over here, which has customer two, eventually will do the update of the customer's store credit. And the machine on the lower left here, which has the information about the widget, will eventually change the stock from one to zero. But in order to run the transaction, there has to be some communication between the machines, both during and after the transaction. During the transaction, we have this line of code which says: if the store credit is less than the price — this is a check we need to do to make sure the customer has enough credit to be able to buy the widget. This if statement cannot be done by either machine individually. It requires information about store credit and information about price, and that information is stored separately on different machines. So there has to be some communication to be able to check this line of code, to see if the credit is less than the price, right? So the way we'll do this is we'll still use locking in our example. We'll lock the relevant customer and we'll lock the relevant widget, and then we'll send a message from one machine — the machine that has the widget information, the price information — to the other machine, which has the customer information, so that the customer machine has enough information to run this statement and check whether the customer has enough credit to be able to buy the widget. Assuming that check passes and doesn't fail, the transaction can continue and each machine will go ahead and do its relevant part of the transaction. 
So the top right machine will update the store credit down to 21, and the bottom left machine will update the stock down to zero, where it was one before — the last two lines of the code of this transaction. That happens independently. But then the problem is that there has to be some protocol that checks to make sure that each machine was able to do what it was supposed to do, right? For example, the bottom left machine doesn't yet know whether the top right machine found that the store credit of customer two was actually higher than the price. It doesn't know that yet. So there's some protocol where the machines check with each other to make sure that each machine was independently able to do its part of the transaction. That's typically done via some commit protocol, and the most common commit protocol is two-phase commit. That's two rounds of communication back and forth between the machines: a prepare or voting stage, and then a declaration of the final decision plus an acknowledgement stage, between all the machines involved in the transaction. So that's happening. But the key thing with these commit protocols is that you can't release your locks before you run the protocol, because it could be that the protocol decides you're going to have to abort the transaction, and if you abort the transaction, you want to make sure that nobody else will see your aborted data. So we end up having to keep the locks throughout the commit protocol, and what that's going to cause is that it prevents conflicting transactions from running during the commit protocol, which overall reduces the throughput of our system, right? 
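The two-phase commit flow described here can be sketched roughly like this — the `Participant` class and its method names are assumptions for illustration; real implementations add durable logging, timeouts, and recovery:

```python
class Participant:
    """One machine holding a partition of the transaction's data."""

    def __init__(self, can_commit):
        self.can_commit = can_commit   # did our local checks succeed?
        self.decision = None
        self.holding_locks = True      # locks stay held through the protocol

    def prepare(self):
        # Phase 1: vote yes only if our part of the transaction succeeded.
        return self.can_commit

    def decide(self, commit):
        # Phase 2: apply the coordinator's decision, then release locks.
        self.decision = "commit" if commit else "abort"
        self.holding_locks = False

def two_phase_commit(participants):
    votes = [p.prepare() for p in participants]   # round 1: voting
    commit = all(votes)                           # unanimous yes required
    for p in participants:
        p.decide(commit)                          # round 2: final decision
    return commit

customer_node = Participant(can_commit=True)  # credit check passed
widget_node = Participant(can_commit=True)    # stock check passed
two_phase_commit([customer_node, widget_node])
```

If any participant votes no in phase 1 (for example, because its machine failed or its local check failed), `all(votes)` is false and every machine aborts — and note that `holding_locks` stays true for every participant until the protocol finishes, which is exactly the throughput problem described above.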
So if we have a workload against our system where we're frequently running transactions which conflict with each other — which are accessing the same data, reading or writing the same data — that's a conflicting workload, and having all this communication back and forth between the machines, both during the transaction processing and afterwards to run a commit protocol (since each machine can fail independently), is going to add to the amount of time that we're holding locks, much more so than if we were running on a single machine. So one basic problem you run into in a distributed database system is that you end up sending messages, which take time, while holding locks, and that reduces our ability to run conflicting transactions in the workload. That problem exists even within one physical location — even within one data center it still takes some time to send messages back and forth, and that's time we're holding locks. Now, of course, if the machines are further apart, across different locations, it's even worse. If we're sending messages across continents, they'll take much longer, we'll hold the locks for even longer, and it'll be even worse. So that's one problem: the commit protocol. The other problem is replication. In general, if we're running a truly global application, we're going to replicate data across the world because we want our reads to be fast. We want our users, located anywhere in the world, to be able to read data as part of the application very quickly. They shouldn't have to send a message all the way across the world to find out the state of the application; rather, they should be able to see the state of the application from a local read. This is typically done via replication. 
We'll replicate data across the world to be near the users of our application. There are two types of replication: asynchronous replication and synchronous replication. Typically speaking, if you want a consistent view of the data — no matter where our users are, they see the same database — then the replication is done synchronously. We'll see later on that it can be done asynchronously as well, but it's much harder to do asynchronously, so it's typically synchronous to ensure consistency. So what that means is that in a traditional database system, the way we typically make replication work is that a transaction gets submitted to the system, to one of the locations. It's a write transaction; it changes the state of the database. So it gets submitted to one of the replicas, and that replica will go ahead and process that transaction locally. Here we'll have the same transaction we had before, where customer two was buying widget three, so we update the store credit down to 21 and the stock down to zero as part of that same transaction. It gets processed in one location first, and then after it's done processing — and typically even after we run the commit protocol, although there are optimizations you can make there — we replicate the data synchronously to all the other locations, or at least to a majority of the replicas that we have across the world, before we commit the transaction. So we'll take the changes made in one location, replicate them to the other locations, and they'll acknowledge back that they made those changes, or at least that they've received the set of changes. And that whole time, we're also holding locks. 
We cannot release the locks at the original location, where we first ran the transaction, until we receive the ack back, because we're running synchronous replication. If you were to release the locks early, before we get the ack back, then number one, it's possible that the other location never received the message, and number two, even if it did, you may now have consistency issues, because you're not holding locks while you're doing replication. So for all these reasons, we hold locks during synchronous replication, and that again extends the period of time during which we're preventing conflicting transactions from running. So basically we have two problems. We have the latency of the commit protocol, which causes a reduction in throughput because no conflicting transactions can run. And at the same time, once you have a truly global application, if you're doing synchronous replication for consistency, you're also holding locks during replication. So you have the latency of the replication protocol, plus the reduction of throughput as a result of holding locks during that synchronous replication. All of this causes major latency and throughput problems in a traditional database system, if you care about consistency and isolation in the database. So one option is to just forget consistency, forget isolation, and go with NoSQL and all those options out there that don't make these strong guarantees. But what if you do want the same guarantees we're all used to in a traditional system? How can we fix these throughput and latency problems, which are caused by the distributed nature of the application? The answer is basically going to be determinism. Determinism is going to allow us to solve a lot of these problems, and it's going to happen in stages. 
Some things will come very naturally, and some things will require a little bit more thought for determinism to help us. But we'll see that determinism is going to allow us to make the same guarantees as a traditional system, yet remove a lot of the problems we mentioned on the previous slides. So let's see how this works. Let's do replication first, because replication is much easier — it's much more natural to understand how determinism helps replication, and it's a little more straightforward to discuss. So let's do that first, and then we'll get to the commit protocol and the throughput reductions afterwards. So, replication first. In a deterministic system, the definition of determinism is that if two separate copies of the system see the same input, then they are able to process that same input independently of each other, and because they're deterministic, they're guaranteed to produce the same output — the same final state at the end. So even though they don't communicate with each other, as long as they see the same input, they will end up in the same final state. That's the deterministic guarantee. We'll see soon that that's not a guarantee given by traditional systems, but if you were to have this guarantee, replication is very straightforward. The way it works is you simply replicate the input. Every replica will see the same input transaction, and as long as they see the same input, then through determinism they'll end up in the same final state. They'll both see the same input, they'll both independently do what they would do if they were a single-location system — they'll acquire the appropriate locks, make the appropriate changes, and then release the locks — and they'll end up in the same final state because of the deterministic guarantee. 
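The replicate-the-input idea can be sketched as follows — the `Replica` class, table contents, and transaction encoding here are illustrative assumptions, not SLOG's design:

```python
class Replica:
    """One copy of the database, processing input deterministically."""

    def __init__(self):
        self.widgets = {3: {"stock": 1, "price": 79}}
        self.customers = {2: {"credit": 100}}

    def apply(self, txn):
        # Process one transaction deterministically: same input, same effect.
        cust, wid = txn
        w, c = self.widgets[wid], self.customers[cust]
        if w["stock"] >= 1 and c["credit"] >= w["price"]:
            w["stock"] -= 1
            c["credit"] -= w["price"]

# Replicate the *input*, not the changes: the same ordered log of
# transactions is sent to every replica.
input_log = [(2, 3)]  # customer 2 buys widget 3

us, europe = Replica(), Replica()
for txn in input_log:
    # No cross-replica communication during processing: each replica
    # applies the input on its own, yet both reach the same final state.
    us.apply(txn)
    europe.apply(txn)
```

Because `apply` is deterministic, the two replicas never need to compare results; agreeing on the input is enough.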
So if we have determinism, replication is very straightforward, and you saw there was no coordination during transaction processing. Whereas before we saw that we were holding locks during replication, here, in the deterministic example we saw on the slide, we did the replication before the transaction ran. We replicated the input before we even started processing the transaction. So once we started acquiring locks and running the transaction, there was no communication across the different locations of the world — the U.S. wasn't communicating with Europe, wasn't communicating with Asia, in the middle of transaction processing. That completely solves one of the major problems we saw before, that we were holding locks during synchronous replication. We're still doing synchronous replication here, but it's happening at the input level. It's still synchronous replication, but it happens before the transaction begins, and therefore outside of holding locks. And by the way, this is true also if you weren't using locking — if you were using OCC or MVCC or other types of concurrency control protocols, everything I'm saying about locking is true for those protocols as well. We have this notion of a contention footprint, and you want that replication to happen outside of the contention footprint. So you should hopefully take my word for it, but if not, we can talk more offline — everything I'm saying is true not only for locking but also for the other concurrency control protocols. So Dan, I don't want to derail you, but I think it's worth saying a little bit that you're assuming stored procedures here, but this still works if you don't have stored procedures. I don't know if you want to talk about that later. So yeah, these examples are all showing stored procedures. 
Determinism itself certainly works without stored procedures. There are some complications that arise — I have a slide on this later on, if we get to it. So yeah, let's come back to it, I think, at that point. Yeah, I don't want to steal your thunder, keep going. Okay, sounds good. So we'll just try to remember to come back to that in probably around 15 to 20 slides. Okay, so anyway, there's no coordination during transaction processing, and that's good. And by the way, there's also a side benefit as well — it's not the main benefit, but it's worth noting — which is that there's less stuff being replicated. When you're replicating, say, the MySQL binlog, or an ARIES log in other systems — when a database system does replication, it typically sends the log of changes. And that log of changes is usually pretty big, because every single change you make has to be recorded in the log, and often, if it's a physical log, it has a bunch of details about those changes. Whereas here, we're replicating the program rather than the changes that the program makes. Typically speaking, code is smaller than the changes that the code makes. So more often than not, not only do we get the benefit of no coordination during transaction processing across replicas across the world, but we also send less data across the world, which has its own benefits. Okay, so that's nice if we have determinism, but we should mention why this is not possible in a traditional system — or rather, why it doesn't happen for free. It's possible, of course, but it doesn't happen for free in a traditional system. 
The reason is the typical guarantee a database will give you: if you buy a database off the shelf, it says it guarantees serializability, which is pretty much the highest isolation guarantee you can get on the market today. So it guarantees serializability — what does that mean? It means it will process transactions concurrently, but that concurrent processing of transactions will end up in a state that is equivalent to some serial order, as if they had run in a serial order. That's nice. That's a very strong level of isolation, and it has all kinds of nice properties, which you'll learn about in Andy's database class. The thing, though, is that it doesn't make any guarantee about any particular serial order. It just guarantees equivalence to some serial order — it doesn't tell you which serial order. So if I'm running transactions in parallel, two databases running the same set of transactions in parallel may end up equivalent to different serial orders. For example, let's see an example on this slide. We have the same example we had before, of customer two buying widget three, which gets sent to both replicas. And we also have a separate transaction where customer six is trying to buy widget three as well. And there's only one widget left — there's only one left in stock — so they can't both succeed. So we have two transactions running at the same time, in parallel. The left side of the screen is one replica and the right side is another replica. Even if both replicas see the exact same two input transactions, and even potentially see them in the same order, still, because the transactions are concurrent with each other, a serializable system will only guarantee equivalence to some serial order. It doesn't say if it's T1 before T2, or T2 before T1. 
So in practice, what can happen is that on the first replica, the thread which is running T1 grabs the lock for widget three first, so T2 gets blocked behind it and T1 is successful. And when T2 eventually gets to go, it fails the stock check — if stock is less than one, abort — so it aborts. Basically, T1 succeeds and T2 fails. Whereas on the other replica, the thread running T2 may grab the lock on widget three first, so T2 succeeds and T1 gets blocked behind it until T2 releases the lock, and then T1 eventually fails that same stock check. So what happens is that you end up with the replicas diverging. Because of the way the threads were scheduled across the different replicas, you end up with different final states, because the transactions were concurrent with each other. So if you're running a traditional system, you don't get determinism — there are all kinds of non-deterministic events that happen in the system, such as thread scheduling and so on. We'll see on the next slide some examples of non-deterministic events that can potentially cause replicas to diverge if you don't enforce determinism through the system itself. Okay, so that's bad. In a deterministic system, the way it's going to work is that we still have to agree on the input. If we have a bunch of transactions being submitted to the system, concurrent with each other, those transactions may be submitted from different parts of the world — T1 may come from Boston, T2 from Silicon Valley, T3 from London, and so on. They can come from different parts of the world, but there has to be some way for all of our replicas to agree on the input. In practice, the way this is done in a deterministic system is that you have an input log. 
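The divergence scenario just described can be sketched like this — `run_serial` is a made-up helper that stands in for whichever serial order each replica's lock scheduling happens to produce:

```python
def run_serial(order, stock=1):
    """Run the two buy-widget transactions in a given serial order.

    With only one widget in stock, whichever buyer goes first succeeds;
    the other fails its stock check and aborts.
    """
    winners = []
    for cust in order:
        if stock >= 1:
            stock -= 1
            winners.append(cust)
    return winners

# Both replicas saw the same two transactions (customers 2 and 6 buying
# widget 3), but thread scheduling picked different serial orders:
replica_a = run_serial([2, 6])  # thread running T1 grabbed the lock first
replica_b = run_serial([6, 2])  # thread running T2 grabbed the lock first
```

Both outcomes are individually serializable, yet the replicas end up with different winners and therefore different final states — exactly the divergence problem.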
This is typically implemented through Paxos or Raft or some other consensus protocol, where all of our replicas agree on what the input is, and in particular, they agree that T1 is called T1 and T3 is called T3. So we give a name to all of the transactions, which comes with a natural order on them. That way we can globally declare what the input of the system is. I should say, for now, this is a temporary fix. We'll see soon that this log is going to become a latency bottleneck, and we're not going to want it, so we'll have to get rid of it. But at this point in the discussion, let's assume we have it, just to make the rest of the discussion a little bit easier, and we'll figure out how to get rid of this log afterwards. For now, we'll run Paxos across the whole world: all transactions enter this Paxos log, and all replicas see the same log. Once we do that, all the different replicas will see the same log, and they will process transactions with a guarantee even stronger than serializable isolation. Serializable isolation, we said before, guarantees equivalence to some serial order. But in a deterministic system, what we'll guarantee is serializable equivalence to only one possible order — the order that transactions appear in that Paxos log. That's a stronger isolation guarantee than the regular serializable guarantee, but if we do that, it's going to go a long way toward helping us enforce the deterministic guarantees. I saw someone about to speak. I don't think so. Okay, I saw a microphone go off, but I guess that was — okay, all right. So I guess we'll pause for any questions; if there are none, I'll move on. I know I'm speaking pretty fast — that's my natural way of doing things. It actually seems slower this time, I'm gonna be honest. 
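The effect of the shared input log can be sketched as follows — here `sequence` is a made-up stand-in for the Paxos/Raft log, which in reality is a full consensus protocol:

```python
def sequence(submitted):
    # Consensus assigns each transaction one global log position; every
    # replica orders its input by that position, regardless of arrival order.
    return sorted(submitted, key=lambda t: t["global_seq"])

def process(log, stock=1):
    # Deterministic processing in log order: with one widget in stock,
    # the transaction earlier in the log wins.
    winners = []
    for txn in log:
        if stock >= 1:
            stock -= 1
            winners.append(txn["cust"])
    return winners

t1 = {"cust": 2, "global_seq": 1}  # customer 2 buys widget 3
t2 = {"cust": 6, "global_seq": 2}  # customer 6 buys widget 3

# One replica hears about T2 first, another hears about T1 first...
log_a = sequence([t2, t1])
log_b = sequence([t1, t2])
# ...but both process the single agreed order, so they pick the same winner.
```

This is the "equivalent to only one possible serial order" guarantee: the log order decides the winner once, globally, instead of each replica's scheduler deciding locally.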
Oh, that's good, okay, good. Yeah, that's good. I don't know if it's good or bad, but yeah. Anyway, let's keep going, if there are no questions. So, let's move on to failure now. Just to summarize before we move on to failure: in a traditional system, you don't get determinism for free, because of non-deterministic events such as threads being scheduled differently on different replicas, messages being delivered in different orders, or nodes failing non-deterministically — events which can happen differently in different replicas. So if you only guarantee serializability, you end up with divergence, and you end up needing a stronger isolation guarantee to enforce determinism. That's one thing; we'll keep that in mind and come back to it soon. The other thing I wanted to mention is that we saw before that one other problem we deal with in traditional database systems is the commit protocol, and the fact that we're holding locks during the commit protocol. The commit protocol exists because nodes that communicate with each other can fail independently; it guarantees the durability and atomicity of the system as a whole once we divide the system across different partitions. So let's come back to that now and discuss how determinism also helps us get rid of commit protocols, which is also very important. Okay, let's see how this works. Before we do that, let's remind ourselves what happens during a failure in a traditional system. In a traditional system, we're running a transaction, and again, it's non-deterministic, so it only runs on one copy first. We do replication after we run the transaction's changes on that first replica. 
So the transaction comes in — customer two wants to buy widget three — it gets run on one replica first, we acquire the locks, and we make the changes. And now, oh no, a machine fails. Unlike what we saw before, where everything was working fine, now a machine fails in our system. In a traditional system, what happens is that we detect that failure during two-phase commit. During the commit protocol, we see the machine failed, and therefore we're going to make the whole transaction fail. We'll undo the changes that we've made so far — you see down here the store credit goes back to 100 — we undo the changes because the machine failed. A machine failure causes a transaction failure in a traditional system, and the commit protocol is designed to detect those machine failures, to ensure that all machines end up agreeing that this transaction should not succeed. That's how it works in a traditional system. However, in a deterministic system, everything has to be different, because a machine failing is fundamentally — or at least usually — a non-deterministic event. If a machine fails, it may fail in one replica but not in another replica. So if we're going to enforce that the replicas cannot communicate with each other during transaction processing, then there is no way for a replica in London to be aware of the fact that one of the machines in Boston failed. So a machine failure in Boston absolutely cannot cause a transaction failure, because there's no way for London to find out that the machine in Boston failed — and we don't want it to have to find out; that would take way too much time. So what happens in a deterministic system is that, again, we said before that the input is replicated. The same transaction goes to all replicas, and they're all running the same transaction at the same time. 
And, right, so everyone's running at the same time, and they're running independently from each other, independently running the same transaction. And now a replica fails. Oh, sorry, one of the machines of one of the replicas fails, right? But the same machine in the other replica may not fail. In fact, it probably doesn't fail; I would even say usually it does not fail. Usually a failure is an independent event that occurs only in one replica. So in that situation, what we said is that we cannot allow the failure of a machine in one replica to cause the whole transaction to fail. So instead, the failure of the machine does not cause the transaction to abort. What needs to happen is that if the transaction could have committed, if it wasn't for the failure, it must eventually commit. So this copy over here on the right-hand side of the screen, in London, say, will commit the transaction immediately. And the copy of our database on the left side of the screen, which is in Boston, that copy will have to pause. It can't continue yet because of the fact that a machine has failed, but it can't abort the transaction either. So what will have to happen is that this machine that failed will not cause the transaction to fail; rather, once this machine recovers, it will recover the state it was in at the time it crashed by replaying the input log. And remember, the input log is deterministic, right? So it's able to get back to where it was at the time it crashed by replaying the input log, and then play forward from there rather than play backward. So whereas before we'd undo changes that we made as a result of the machine failing, now instead we play forward. We get back to where we were at the time of the crash, and then we play forward from there as if the crash never happened, right?
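To make the replay-forward idea concrete, here is a minimal toy sketch of it in Python. Everything here is illustrative: the log format (a list of `(key, delta)` pairs) and the `recover` function are my own invented stand-ins, not code from Calvin or SLOG. The point it demonstrates is that because execution is deterministic, a crashed replica that replays the shared input log ends up in exactly the same state as a replica that never crashed, with no undo ever needed.

```python
# Toy sketch of replay-forward recovery in a deterministic system.
# The log and transaction representations are invented for illustration.

def apply_txn(state, txn):
    """Deterministically apply one transaction; txn is (key, delta)."""
    key, delta = txn
    state[key] = state.get(key, 0) + delta

def recover(input_log, crash_point):
    """Rebuild a replica's state after a crash by replaying the shared
    input log up to the crash point, then playing forward as if the
    crash never happened. No undo is needed: replaying the same inputs
    in the same order always reproduces the same state."""
    state = {}
    # Phase 1: replay up to where we were at the time of the crash.
    for txn in input_log[:crash_point]:
        apply_txn(state, txn)
    # Phase 2: play forward from there.
    for txn in input_log[crash_point:]:
        apply_txn(state, txn)
    return state

# A replica that crashed after entry 2 ends up identical to one that
# never crashed, because both process the same log in the same order.
log = [("stock", 100), ("stock", -1), ("credit", 50), ("stock", -2)]
crashed = recover(log, crash_point=2)
healthy = recover(log, crash_point=len(log))  # never "crashed"
assert crashed == healthy == {"stock": 97, "credit": 50}
```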
So a machine failure does not cause a transaction failure in a deterministic system. And that's huge. If you're able to do that, then you don't need two-phase commit anymore. So let's see why, right? So hold on a second, where are we right now? Right, so yeah, okay, sorry. So before we get to how you don't need two-phase commit, I got ahead of myself actually. Before we get to how you don't need two-phase commit in this system, let's see how two-phase commit works in a traditional system. So in the traditional system, the transaction comes in, we acquire the locks, like before, we make the changes. We run two-phase commit to make sure that no machine failed; if a machine fails, then we have to fail the transaction, so we run two-phase commit to ensure that no machine failed, and only once we figure out that no machine failed can the transaction commit. Whereas in a deterministic system, a machine failure does not cause a transaction failure. So the way the commit protocol works in a deterministic system is much more simple. The way it works is that each machine, as before, runs independently from the others. The replicas go ahead and make the changes they need to make. And now, with a single message, we just have one machine telling the other machine, in this example, that it's not going to abort. What could force the transaction to abort is if this machine reached code in the transaction itself that forces an abort, right? So for this machine down here on the bottom left of the screen, there is code that could cause it to abort, in particular if the item is out of stock, then it has to abort. That code is deterministic code, so we call it a deterministic abort. If the stock was zero at this point in time, there is no way for this transaction to succeed, because it's a deterministic abort, right?
So there has to be some communication that this machine is not going to abort. But that communication happens with a single message. Once it gets past this line of code, it doesn't have to wait until the end of the transaction; just once it gets past the second line of code in the transaction, and it figures out that it's not going to abort, it sends a message to the other machine that it's not going to abort. There's no way from here on out that it can possibly abort. And at that point, the other machine, the first machine, the top machine in the slide over here, once it figures out it's not going to abort either, can immediately commit the transaction, without any kind of commit protocol, right? So basically, rather than having these two rounds of communication across machines, we just need sort of half a round of communication, for each machine to tell the other machine, or at least to tell some leading machine, that there is no deterministic code that will force it to abort. And once it gets past that point, once it can be ensured that it will not abort deterministically, at that point we can immediately commit the transaction without any further communication. So rather than two rounds of communication, it's half a round of communication. It's much faster. And in some cases, it's zero rounds of communication: if the transaction has no possibility of a deterministic abort. For example, if the transaction is "give everybody a 25% raise in salary," there is no way that that transaction can abort. So in fact, a transaction like that, where there is no logic in the transaction which would cause an abort, can commit immediately, as soon as it hits the input log, because there is no way for any machine to fail to process that transaction. So in some cases the commit protocol requires zero rounds of communication.
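A rough sketch of the "half round" commit decision described above, under heavy simplifying assumptions: the partition logic, function names, and the stock-check abort condition are all invented here for illustration, and a real system would defer applying writes until the decision (elided below). What it shows is that each participant reports at most once whether its deterministic abort logic fired, and there is no prepare/commit round trip as in two-phase commit; machine failures never enter into the decision at all, since a failed machine just replays the input log later and reaches the same answer.

```python
# Hedged sketch of deterministic "half-round" commit; names are invented.

def run_partition(partition_state, txn):
    """Run one partition's piece of the transaction. Returns False only
    on a *deterministic* abort (here: insufficient stock), which every
    replica would hit identically. A real system would buffer the write
    until the commit decision; that bookkeeping is elided here."""
    if partition_state["stock"] < txn["quantity"]:
        return False          # deterministic abort: same on all replicas
    partition_state["stock"] -= txn["quantity"]
    return True

def deterministic_commit(partitions, txn):
    """Each participant sends one message ("I cannot abort") once it is
    past its abort logic; the transaction commits as soon as all such
    messages are in. No second round, no failure detection."""
    votes = [run_partition(p, txn) for p in partitions]  # one message each
    return all(votes)         # commit iff no deterministic abort fired

p1 = {"stock": 5}
p2 = {"stock": 0}
assert deterministic_commit([p1], {"quantity": 3}) is True   # commits
assert deterministic_commit([p2], {"quantity": 1}) is False  # deterministic abort
```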
In some cases it requires half a round of communication, but it's still much, much cheaper than the full two rounds of communication in two-phase commit. That's good. Okay, so yeah, this sort of summarizes what we just said: if no deterministic abort is possible, then there's no commit protocol whatsoever. And if one is possible, it's half a round of communication, which is much better than two-phase commit, like we said, okay. So the key question now is, how do we do this, right? What we said before is that if we have determinism, life is great: replication is really easy and two-phase commit goes away. The question is, how do we enforce determinism? We said before that one key part of enforcing determinism is ensuring that each machine processes transactions equivalently to the order they appear in the log. But how do we even do that? How do we run transactions concurrently in a way that still ensures that the final result will be equivalent to running every transaction serially, in the same order they appear in the log? That's the question; how do we do that? So let's discuss that now. It turns out this is actually pretty easy, and we have written in my group about five or six different papers that discuss five or six different ways to do this. There's a way to do this via locking, a way to do this via optimism, a way to do this via multi-versioning; all kinds of different ways to do this. So I'll just tell you the simplest one, which is using locking. The simplest one is not the most optimal one as far as speed goes, but it works, and it's pretty straightforward to understand, so let's discuss that one now. The way this works is that as each replica reads the input log, each replica will use locking.
So each replica will request the locks for all data a transaction needs to access, in the order that transactions appear in the log. In other words, if transaction one appears before transaction two in the input log, that means that transaction one will make all of its lock requests before transaction two makes its first lock request. So we have to request locks in the order transactions appear in the log, and we have to grant locks in the order they were requested. If you do that, then that's basically all we need to do. It's fairly straightforward to prove that not only, if you request locks in that order, are you guaranteed to be equivalent to running the transactions in the order they appear in the log, but furthermore you can even prove that this is a deadlock-free protocol as well, because there's no way to get cycles: since we request locks in log order, all of transaction one's locks are requested before transaction two's locks even get started. So you're both deadlock-free and you're also equivalent to the order that the transactions appear in the input log, and that's great. The only problem, the issue we run into though, is: what if the system doesn't know what locks will be acquired before the transaction begins, right? We can't request the first lock of transaction two until the last lock of transaction one has been requested. So if you don't know what transaction one needs to lock, then we can't even start running transaction two until transaction one is done figuring out what it needs to lock, and that's gonna end up being very slow, because we'll end up with no concurrency: transaction two can't run concurrently with transaction one.
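The ordered-locking rule above (request in log order, grant in request order) can be sketched with a toy in-process lock manager. This is my own illustrative model, not the actual Calvin lock manager; the class and method names are assumptions. It shows the deadlock-freedom argument directly: because all of transaction N's requests are enqueued before any of transaction N+1's, the waits-for graph can never contain a cycle.

```python
# Illustrative sketch of deterministic lock scheduling. Locks are
# *requested* in input-log order and *granted* in request order.
from collections import defaultdict, deque

class OrderedLockManager:
    def __init__(self):
        self.queues = defaultdict(deque)   # key -> FIFO of waiting txn ids

    def request(self, txn_id, keys):
        """Called once per transaction, in input-log order, with the
        transaction's full read/write set."""
        for k in keys:
            self.queues[k].append(txn_id)

    def runnable(self):
        """A transaction may run once it heads every queue it sits in.
        Since txn N's requests all precede txn N+1's, cycles (and hence
        deadlocks) are impossible."""
        heads = {k: q[0] for k, q in self.queues.items() if q}
        waiting = {t for q in self.queues.values() for t in q}
        return {t for t in waiting
                if all(heads[k] == t
                       for k, q in self.queues.items() if t in q)}

    def release(self, txn_id):
        """Release all locks held by a finished transaction."""
        for q in self.queues.values():
            if q and q[0] == txn_id:
                q.popleft()

mgr = OrderedLockManager()
mgr.request("T1", ["A", "B"])    # T1 precedes T2 in the input log
mgr.request("T2", ["B", "C"])
assert mgr.runnable() == {"T1"}  # T2 queues behind T1 on B
mgr.release("T1")
assert mgr.runnable() == {"T2"}  # now T2 holds B and C
```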
So if you wanna have concurrent execution of transactions, we can't really wait to request the first lock of transaction two until transaction one finally figures out what it needs to lock, right? So in practice, what we do in this version of implementing determinism is that we don't actually submit a transaction blindly to the system. Instead, if a transaction is submitted to the system which has logic in it for which the system is not able to look at the transaction and figure out immediately what it's gonna access, what it's gonna read or what it's gonna write, then it first does a trial run of that transaction; it makes a guess. It does what we call a reconnaissance phase, a very simple process where it runs enough of the transaction to generate a guess of what it's gonna access. And then, once it creates that guess, only then does the transaction go into the input log that every replica sees, and once they see the guess, they'll all request the locks the guess says it'll need. And if it turns out that they made the wrong guess, because the data changed between the time when it made the guess and when it runs in the official deterministic execution, then every single replica will see that the guess was wrong and also independently decide to restart the transaction. Dan, I think it's probably worth mentioning that DynamoDB, the way they do transactions, is very similar to something like this. Let me see if I can agree or disagree with you. We had a speaker in our group last year, so this didn't come across. So go ahead, go ahead. I mean, my understanding was, like you say, I do all my reads and writes and I store them in variables in the code, and then I call the transaction, and then it basically checks that you get the same thing when you read the same thing again.
But what's the, I mean, so DynamoDB is not a deterministic system, right? No, it is, hold on, where have you been? Yeah, they have transactions now. Well, they do, okay. I guess maybe I'm a little out of date then. Oh, okay. I will send you the video after this. Okay, fine. Yeah, I guess you had a more recent model. Oh, they published it at OSDI. That's why you probably didn't see it. Okay, when? Last year? Last year, okay. Yeah, definitely, I don't think I saw it, yeah. All right, keep going. This year I've been, basically, whatever, you know. Oh, you have a zero-month-old baby, or how old is the baby now? So yeah, I'm not on this year's papers. Anyway, so. I'm only bringing it up to say that this idea of a reconnaissance transaction is not far-fetched; people are doing this. Yeah, not only that. This is Panos Chrysanthis of the University of Pittsburgh. Actually, one of these protocols was also proposed; it's called speculative transactions. Basically, you run the transaction speculatively while you're also acquiring the locks, optimistically or in a pessimistic fashion, and whichever finishes first succeeds. There is a variation of that that was proposed in the hardware community; there it's done very quickly, right? Of course. And yeah, I mean, we're not claiming this is a new idea; we're just using it. First of all, DynamoDB came way after; we published well before DynamoDB. But even then, when we published it, we didn't claim that this part was new, right? We're just using the idea. It's a cool idea. And the comment we make is not to dispute the novelty or whatever. It's a cool idea in the sense that it was proposed and it's implemented in a cool way.
But, you know, the general point is how you compare and contrast them, to see what you learn each time this idea is implemented. Right. I mean, the point I'm making though is that the solution isn't speculation. Speculation is coming to solve a different problem, right? We introduced a new problem by having determinism. Life was great, but we ran into a new problem as a result of this particular locking protocol, which resulted in a reduction in concurrency, because a transaction had to wait for the previous one to figure out what it needs to lock. So speculation just comes along to solve a side problem that we introduced as a result of our locking protocol, right? Absolutely. Because again, this problem was also faced in multidatabase systems, where you were executing at autonomous sites but you wanted to execute transactions in a deterministic fashion. So they used the notion of a ticket, which is similar to this: instead of trying to speculate and get all your locks up front, every transaction goes and visits the site and gets a ticket, and then the tickets are compared to force a serial execution based on these acquired tickets. So the order is realized across all sites in a specific serial order. So there are different ways. And again, it still has deadlocks; although you say here that it's deadlock-free, you still have the rollback in the event that something goes wrong with the speculation. So it's hidden deadlock recovery, but it's cool. What I'm saying does not negate what you're saying. Okay, sounds good. Okay, so that's, okay, I see Andy put a comment here also on the DynamoDB talk. Great, okay. It was FAST 2019, even more reason why you didn't see it. Okay. Yeah. Let's move on, let's move on. Okay, sounds good. So thank you. That's good. Okay, so let's keep on going.
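Before moving on, the reconnaissance ("trial run") pattern from a few paragraphs back can be sketched in a few lines. All of this is a hypothetical toy: the `BuyCheapestWidget` transaction, the database-as-dict, and the restart signaling are invented to illustrate the idea, not drawn from any real system. The key property shown is that a stale guess is detected identically on every replica, so all replicas make the same restart decision with no coordination.

```python
# Toy sketch of the reconnaissance / trial-run pattern; names invented.

class BuyCheapestWidget:
    """Buy whichever widget is currently cheapest. The write set
    depends on the data, so it must be guessed in advance."""
    def compute_access_set(self, db):
        cheapest = min(db, key=lambda k: db[k]["price"])
        return {cheapest}

    def execute(self, db):
        (key,) = self.compute_access_set(db)
        db[key]["stock"] -= 1

def reconnaissance(db, txn):
    """Trial-run (lock-free) just to guess which keys the txn touches."""
    return txn.compute_access_set(db)

def deterministic_run(db, txn, guessed_keys):
    """Official deterministic execution: lock the guessed keys (elided),
    then re-check the guess. If the data changed so the real access set
    differs, every replica sees the same mismatch and independently
    decides to restart."""
    if txn.compute_access_set(db) != guessed_keys:
        return "restart"       # same decision on every replica
    txn.execute(db)
    return "committed"

db = {"w1": {"price": 5, "stock": 10}, "w2": {"price": 3, "stock": 10}}
txn = BuyCheapestWidget()
guess = reconnaissance(db, txn)            # guesses {"w2"}, cheapest now
db["w2"]["price"] = 9                      # data changes before the real run
assert deterministic_run(db, txn, guess) == "restart"
guess = reconnaissance(db, txn)            # re-guess: now {"w1"}
assert deterministic_run(db, txn, guess) == "committed"
assert db["w1"]["stock"] == 9
```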
So additionally, right. So yeah, we're kind of on time here, so let me just breeze through this slide very quickly, at a very high level. I just wanted to quickly mention that we've really covered a bunch of the benefits of determinism already. We talked about removing two-phase commit, removing holding locks during replication, and removing deadlocks. I should also mention it also makes for a design that is much simpler as well. It makes the system much more modular, because the logging and recovery module is totally separate in the system, and the locking is basically separate as well. So you end up with a much more modular system with determinism than in other systems. But that discussion is a much deeper discussion that we can't really spend more time on now. So I just want to give you a general sense of performance before we get to solving a separate problem in the last 10 minutes, just a quick sense of the throughput of the locking protocol we mentioned already, right? We'll see soon some other problems and how to solve those problems. But just for the protocol we've discussed so far: the basic goal is to reduce the period of time holding locks. We said that you hold locks during two-phase commit, you hold locks during synchronous replication, and that reduces the ability to run concurrent transactions, and that's bad. And so you see in a traditional system that as you increase contention in the system, throughput drops, because of the fact that we're holding locks while doing all these things, and that's bad.
Whereas in a deterministic system, because we significantly reduce the period of time that we hold locks, as you increase contention you can actually get the reverse: you actually get better performance, because when there's more contention there's more cache locality, so in many cases you actually see a performance improvement as contention grows rather than degradation. But anyway, we haven't really gotten to the main point yet, one of the main contributions of SLOG, which is this: we said before that we have this input log; in a deterministic system, all transactions go through some sort of Paxos or Raft log at the input of the system, so that all replicas can agree on the input, right? And so even though we got rid of the commit protocol, its latency and its throughput reduction, and even though we got rid of the throughput reduction of replication by replicating outside of transactional boundaries, we still have to pay the latency cost of this consensus protocol for every write transaction. When transactions come into the system, every single one has to go through this Raft or Paxos protocol. And if our replicas are distributed across the world, if we're dealing with a truly global application, with a replica in Boston, a replica in Europe, a replica in Asia, then we're communicating across the whole world for every single write to insert it into the Paxos log. That's a pretty big latency, and that's really unacceptable in practice for many applications. Even though reads can be done fast, writes are slow, and that's not okay. So we have to get rid of this global input log as well, and it turns out you can.
So it gets a little more complicated, and the most recent SLOG paper from last year goes into a lot of detail on that particular protocol, but the basic idea is that we have to replace synchronous replication with a mixture of synchronous and asynchronous replication. So what we'll do is synchronous replication locally, to nearby regions, and asynchronous replication across regions. We're still gonna guarantee full strict serializability and the highest levels of consistency possible. And the way you do that is that for every single data item in the database system, you declare a primary location, right? So every item has some primary location somewhere, and typically it's gonna be where it's accessed the most frequently. A user who lives in Boston will probably have the primary location for his data in Boston, and a user in China will have their data's primary location in China, and so on. So every data item will have a primary location, and linearizable reads and writes to those data items have to be processed by the primary location. If I'm a user in Boston and I wanna read my data in Boston, that's gonna require a simple read or write to a local replica in Boston. But if I wanna read data that has a primary location in China, I can't read the copy in Boston, even though there is a copy in Boston; I can't read it because that data may not be consistent with the data in China, which is constantly being updated. If I want a consistent, linearizable read of that data item, I have to send that read or write to China. So that's the basic background that we're gonna assume: data has primary locations, and data is accessed most frequently near its primary location.
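The home-region routing rule just described can be reduced to a few lines. This is a deliberately tiny sketch with made-up region names and a made-up routing table (`HOME`); it only shows the decision: a linearizable access executes locally when the item's primary region is the local region, and otherwise must be forwarded to the primary, even if a (possibly stale) local copy exists.

```python
# Toy sketch of primary-location routing; names and table are invented.
HOME = {"alice_balance": "boston", "li_balance": "china"}

def route(key, local_region):
    """Where must a linearizable read/write of `key` execute? Replicas
    of the item exist everywhere, but only the home region is
    guaranteed up to date."""
    home = HOME[key]
    return "local" if home == local_region else f"forward-to-{home}"

assert route("alice_balance", "boston") == "local"          # fast local access
assert route("li_balance", "boston") == "forward-to-china"  # cross-region hop
```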
And then what we'll do is get rid of that global Paxos log and replace it with local logs, one per region. And so the goal is gonna be to guarantee the same levels of isolation and consistency when you have transactions that may access data with multiple different primary locations. Say a transaction comes along where I wanna transfer money to a user in China. My data's primary location is Boston, and the Chinese user's primary location is China, so both locations have to be involved in the transaction. How do I get both locations involved while still guaranteeing all the guarantees of high isolation and high consistency that we had before? That's gonna be the challenge. So at a high level, and we only have five minutes left, let's just go through this at a high level. Single-home, meaning single-region, transactions are easy to place. If transaction one comes along and accesses data items A and C, and A and C have their primary location in region one, while B and E have their primary location in region two, and D has its primary location in region three, then T one goes to region one. Transaction two only touches data item D, whose primary is region three, so it goes there. T three accesses B and C, and now we have a problem, right? B has a primary location in region two and C has a primary location in region one. So at a high level, what we're gonna do is split the transaction into pieces: part of T three will be done in region two and part of T three will be done in region one. It'll get split up into pieces, and then we're gonna have to reconstruct the transaction after the fact, to ensure that we still have global serializability even though the transaction was done in pieces in different regions. T four goes to region two, and say T five is also multi-region.
So it'll go to both; it'll get split up and go to both regions, and so on. And then basically each region will receive an input log consisting of the transactions which are relevant to that particular region. So instead of that same global Paxos log we had before, that we saw in the previous slides, now we'll have a local version of that log per region. These input transactions will go into each local log for each region, and then we'll replicate each region's log to the other regions asynchronously. So region one's input log of T one, T three and T five will get replicated asynchronously to region two and region three, region two's log will go to the other two regions, and region three's log will go to the other two regions. So eventually, asynchronously, all the regions will see the logs of all the other regions, but that happens after the fact. The only up-to-date versions of A and C are found in region one, the only up-to-date versions of B and E are in region two, and so on. So then the way the protocol is gonna work, and I only have a few minutes left, but at a high level: single-home transactions, where I use home and region synonymously, commit as soon as they complete at their primary region. So T one, which is single-home and only accesses data items A and C, can commit immediately as soon as it is processed at region one. However, multi-region transactions such as T three and T five, which access data in multiple regions, have to wait for the asynchronous replication of the input logs before they can commit. They have to wait for the logs to arrive, and then they're gonna process the transaction once they have all the relevant log records at each local region.
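The commit rule just described, single-home transactions commit immediately at their home region while multi-home ones wait for the asynchronously replicated logs, can be sketched as a small predicate. The home table reuses the talk's example (A and C in region one, B and E in region two, D in region three); the function names and the `arrived_logs` representation are my own illustrative assumptions, not SLOG's actual implementation.

```python
# Toy sketch of SLOG-style single-home vs. multi-home commit readiness.
HOME = {"A": 1, "C": 1, "B": 2, "E": 2, "D": 3}   # item -> home region

def involved_regions(txn_keys):
    """All regions whose data this transaction touches."""
    return {HOME[k] for k in txn_keys}

def can_commit(txn_keys, region, arrived_logs):
    """Can `region` commit this transaction yet? `arrived_logs` is the
    set of other regions whose local logs (covering this transaction)
    have already been replicated here asynchronously."""
    needed = involved_regions(txn_keys)
    if needed == {region}:
        return True            # single-home: commit immediately
    # multi-home: need every involved region's log entries locally first
    return needed <= (arrived_logs | {region})

# T1 touches A and C, both homed in region 1: commits right away there.
assert can_commit(["A", "C"], region=1, arrived_logs=set()) is True
# T3 touches B (region 2) and C (region 1): region 2 must wait for
# region 1's log to arrive before it can commit T3.
assert can_commit(["B", "C"], region=2, arrived_logs=set()) is False
assert can_commit(["B", "C"], region=2, arrived_logs={1}) is True
```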
So they'll take a little bit longer, where the assumption is that it's better to have most transactions be fast at the cost of having the multi-region transactions be a little bit slower. So that's the basic idea here. And so, just skipping to some experiments, let's see, well, I guess they're both interesting. So roughly speaking, very quickly: if you look at latency, traditional systems that have a global Paxos log, like Calvin or like Spanner, or like the system we described on the earlier slide where we had a global log, those systems have to run consensus on every single transaction. So every single transaction pays the cost of running that consensus protocol. If the protocol is running globally, across, say, 200 milliseconds of round-trip time across the world, then every write transaction has to take at least 200 milliseconds. That's very slow. Whereas in SLOG, most transactions are orders of magnitude faster because they're local, and only the multi-region ones are a little bit slower, or the same order of magnitude as all the transactions in, say, Calvin or Spanner. And if you look at throughput, there's another interesting point here, which is that here we're comparing the throughput of SLOG with Spanner. Spanner is a system that works like a traditional database system, in the sense that it holds locks during communication across regions. In fact, Spanner doesn't actually allow you to have a set of regions that spans more than a radius of 1,000 miles; the write replicas in the Paxos protocol in Spanner have to be within 1,000 miles of each other. So as a result, because it's a traditional system, as you add more contention to the workload, the performance of Spanner drops off a cliff.
So if normalized throughput is one, say, when contention is low, then as you increase contention, because it's holding locks during replication and during coordination in two-phase commit, the throughput of Spanner drops off a cliff. Whereas SLOG doesn't hold locks during replication and doesn't hold locks during the commit protocol. It still goes down a little bit here, the throughput does decline a little bit as you increase contention across the whole world, but it degrades much more gracefully than Spanner. And keep in mind that SLOG here is a truly global deployment, across Asia, sorry, this is actually across East Coast, West Coast and Europe, whereas Spanner is only within 1,000 miles, a much narrower radius. Anyway, so before I get booted off Zoom by Andy, we'll stop here. Show a conclusion slide, come on. Sorry? Show a conclusion slide. Do I have one? I don't. Okay, all right, there you go. Awesome, Dan, thank you. I will clap on behalf of everyone else. All right, we have time for one or two questions. So if you have a question, please unmute yourself and ask away. Panos, Panos, please. I don't have any question. I'll just clap as well. Yes. Anybody else? All right, so Dan. So let me point something out: there was Sasha at some point, I was trying to find his paper, where as part of privacy-preserving work he attempted to do the same thing, using different versions and executing everything locally and then merging the locks. And if the locks were merged in the proper order, then determinism was preserved. I don't know if you're aware of that work. Oh yeah, yes, we cited that work in our first, I think, two papers, the Calvin paper and the one before that that motivated determinism; we cited that work. And yeah, that was an early attempt at it. It wasn't anywhere near the same performance that we can achieve in Calvin and SLOG.
But yeah, that was certainly an early attempt at determinism. I think that one actually didn't even get rid of two-phase commit there, right? Or did it? Right, so I think it was a replicated system that wasn't partitioned. It's been a while, but I think it was one node replicated to another node; it wasn't a partitioned system, it was just a replicated system, if I remember correctly, is that right? That's correct, it was not partitioned. They never noted that, so one key contribution of Calvin relative to that work is that we showed that determinism could even get rid of two-phase commit in a partitioned system as well, which is a very important contribution, cool. Okay, awesome, Dan, thank you so much for doing this. I'm guessing you're in the basement, because we can hear the cheers from your kids upstairs. Wow, you can hear that? Wow, yeah.