Maybe we should get started. It's been a long time since we've all been in the same place, and I hope everybody's doing well. Today I'd like to talk about Spanner. The reason to talk about this paper is that it's a rare example of a system that provides distributed transactions over data that's widely separated, that is, data that might be scattered over different data centers all over the internet. That's almost never done in production systems. Of course, it's extremely desirable to have transactions, which programmers really like, and also extremely desirable to have data spread all over the network, both for fault tolerance and to ensure that a copy of the data is near everybody who wants to use it. On the way to achieving this, Spanner used at least two neat ideas. One is that they run two-phase commit, but they run it over Paxos-replicated participants, in order to avoid the problem with two-phase commit that a crashed coordinator can block everyone. The other interesting idea is that they use synchronized time in order to have very efficient read-only transactions. And the system has actually been very successful.
It's used a lot by many different services inside of Google. It's been turned by Google into a product, a service for their cloud customers. And it's inspired a bunch of other research and other systems, both by the example that these kinds of wide-area transactions are possible, and also more specifically: there's at least one open-source system, CockroachDB, that explicitly uses a lot of the design. The motivating use case, the reason the paper says they first started to design Spanner, was that they already had many big database systems inside Google, but in their advertising system in particular, the data was sharded over many distinct MySQL and BigTable databases, and maintaining that sharding was an awkward, manual, and time-consuming process. In addition, their previous advertising database didn't allow transactions that spanned more than a single server. But they really wanted to spread their data out more widely for better performance, and to have transactions over the multiple shards of the data. For their advertising database, apparently the workload was dominated by read-only transactions, and you can see this in Table 6, where there are billions of read-only transactions and only millions of read-write transactions. So they were very interested in the performance of transactions that only do reads. Apparently they also required strong consistency. They wanted serializable transactions, and they also wanted external consistency, which means that if one transaction commits, and then after it finishes committing another transaction starts, the second transaction needs to see any modifications done by the first. This external consistency turns out to be interesting with replicated data.
All right, so I want to draw out the basic physical arrangement of the servers that Spanner uses. The servers are spread over data centers, presumably all over the world, certainly all over the United States, and each piece of data is replicated at multiple data centers. So the diagram's got to have multiple data centers. Let's say there's three data centers; really there'd be many more. So we have multiple data centers, and then the data is sharded: you can think of it as being broken up by key and split over many servers. Maybe there's one server that serves keys starting with A in this data center, another serving keys starting with B, and so forth: lots of sharding over lots of servers. Every shard is replicated at more than one data center. So there's going to be another copy, a replica, of the A keys and the B keys and so on at the second data center, and yet another, hopefully identical, copy of all this data at the third data center. In addition, each data center has multiple clients of Spanner, and what these clients really are is web servers. So if an ordinary human being sitting in front of a web browser connects to some Google service that uses Spanner, they'll connect to some web server in one of the data centers, and that's going to be one of these Spanner clients. All right, so the replication is managed by Paxos, in fact really a variant of Paxos that has leaders and is very much like the Raft that we're all familiar with. And each Paxos instance manages all the replicas of a given shard of the data.
So all the copies of this shard form one Paxos group, and all the replicas of that shard form another Paxos group, and each of these Paxos instances is independent: it has its own leader and runs its own instance of the Paxos protocol. The reason for the sharding, and for the independent Paxos instances per shard, is to allow parallel speedup and a lot of parallel throughput, because there's a vast number of clients working on behalf of web browsers, and therefore typically a huge number of concurrent requests. It pays enormously to split them up over multiple shards and multiple Paxos groups that are running in parallel. Each of these Paxos groups has a leader, a lot like Raft. So maybe the leader for this shard is the replica in data center one, and the leader for that shard might be the replica in data center two, and so forth. That means that if a client needs to do a write, it has to send that write to the leader of the shard whose data it needs to write. Just as with Raft, what these Paxos instances are really doing is replicating a log: the leader replicates a log of operations to all the followers, and the followers execute that log, which for our data is going to be reads and writes, all in the same order. All right, so the reasons for this setup: the sharding, as I mentioned, is for throughput. The multiple copies in different data centers are there for two reasons. One is that you want copies in different data centers in case one data center fails; maybe power fails to the entire city the data center's in, or there's an earthquake or fire or something, and you'd like other copies at other data centers that are maybe not going to fail at the same time. And there's a price to pay for that, because now the Paxos protocol has to talk, maybe over long distances, to
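To make the sharding concrete, you can think of the mapping from key to shard, and therefore to Paxos group, as a range lookup. This is only a sketch; the boundaries and group names are invented for illustration, not anything from the paper:

```python
import bisect

# Sketch: keys are range-partitioned into shards; each shard is served by
# its own independent Paxos group. Boundaries and names are invented.
SHARD_BOUNDARIES = ["b", "c"]   # shard 0: keys < "b"; shard 1: "b".."c"; shard 2: >= "c"
PAXOS_GROUPS = ["group-a", "group-b", "group-c"]

def group_for_key(key):
    # Find which shard's key range contains this key, then return the
    # Paxos group that replicates that shard across data centers.
    shard = bisect.bisect_right(SHARD_BOUNDARIES, key)
    return PAXOS_GROUPS[shard]
```

Because each group runs its own Paxos instance with its own leader, lookups like this let a huge number of clients spread their requests across groups that commit log entries in parallel.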
talk to followers in different data centers. The other reason to have data in multiple data centers is that it may allow you to have copies of the data near all the different clients that use it. So if you have a piece of data that may be read in both California and New York, maybe it's nice to have one copy in California and one copy in New York, so that reads can be very fast. And indeed a lot of the focus of the design is to make reads from the local, nearest replica both fast and correct. Finally, another interesting interaction between Paxos and multiple data centers is that Paxos, like Raft, only requires a majority in order to replicate a log entry and proceed. That means if there's one slow or distant or flaky data center, the Paxos system can keep chugging along, accepting new requests, even while that data center is being slow. All right. So with this arrangement there's a couple of big challenges the paper has to bite off. One is that they really want to do reads from local data centers, but because they're using Paxos, and Paxos only requires each log entry to be replicated on a majority, a minority of the replicas may be lagging and may not have seen the latest data that's been committed by Paxos. That means that if we allow clients to read from the local replica for speed, they may be reading out-of-date data if their replica happens to be in the minority that didn't see the latest updates. Since they're requiring correctness, requiring this external-consistency idea that every read see the most up-to-date data, they have to have some way of dealing with the possibility that the local replica may be lagging. Another issue they have to deal with is that a transaction may involve multiple shards and therefore multiple Paxos groups: a single transaction may be reading or writing multiple records in the database that are stored in multiple shards and multiple Paxos groups. So those have to
be handled as well: we need distributed transactions. Okay, so I'm going to explain how the transactions work; that's going to be the focus of the lecture. Spanner actually implements read-write transactions quite differently from read-only transactions. So let me start with the read-write transactions, which are a lot more conventional in their design. First, let me just remind you what a transaction looks like. Let's choose a simple one that's mimicking a bank transfer. So on one of those client machines, a client of Spanner, you'd run some transaction code. The code would say: I'm beginning a transaction. Then it would say: I want to read and write these records. Maybe you have a bank balance in database record x, and we want to increase x's bank balance and decrease y's bank balance, and that's the end of the transaction. Now the client hopes the database will go off and commit that. All right, so I want to trace through all the steps that have to happen in order for Spanner to execute this read-write transaction. First of all, there's a client in one of the data centers that's driving this transaction, so I'll draw this client here. Let's imagine that x and y are on different shards, since that's the interesting case, and that each of the two shards is replicated in three different data centers.
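In code, the bank-transfer transaction I just described might look something like the sketch below. The client library here is a toy stand-in I made up so the shape of the code is concrete; real Spanner client APIs look different:

```python
# Toy in-memory stand-in for a transactional client library, invented
# purely to illustrate the begin/read/write/commit shape of a transaction.
class ToyTxn:
    def __init__(self, store):
        self.store, self.buffer = store, {}
    def read(self, key):
        return self.buffer.get(key, self.store[key])
    def write(self, key, val):
        self.buffer[key] = val           # buffer writes until commit
    def commit(self):
        self.store.update(self.buffer)   # apply all writes at commit time

class ToyDB:
    def __init__(self, data):
        self.store = dict(data)
    def begin_transaction(self):
        return ToyTxn(self.store)

def transfer(db, amount):
    txn = db.begin_transaction()         # BEGIN
    x = txn.read("x")                    # do all the reads first...
    y = txn.read("y")
    txn.write("x", x + amount)           # ...then all the writes,
    txn.write("y", y - amount)           # essentially as part of
    txn.commit()                         # the commit (END)

db = ToyDB({"x": 10, "y": 20})
transfer(db, 1)
```

Note the structure: all the reads happen first, and the writes are buffered and only applied at commit. That ordering is exactly what Spanner's read-write protocol, described next, relies on.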
So we've got our three data centers here, and at each data center there's a server that I'm just going to label X for the replicas of the shard holding the bank balance for x, and likewise Y for y's three servers. Spanner runs two-phase commit, totally standard two-phase commit and two-phase locking, almost exactly as described in last week's reading from the 6.033 textbook. The huge difference is that instead of the participants and the transaction manager being individual computers, the participants and the transaction manager are Paxos-replicated groups of servers, for increased fault tolerance. So, just to remind you: the three replicas of the shard that stores x are really a Paxos group, and the same goes for the three replicas storing y. And we'll imagine that for each group, one of the three servers is the leader. Let's say the server in data center two is the Paxos leader for x's shard, and the server in data center one is the Paxos leader for y's shard. Okay, so the first thing that happens is that the client picks a unique transaction ID, which is going to be carried on all these messages, so the system knows which operations are associated with a single transaction. The first thing the client has to do is read. Despite the way the code looks, where it reads and writes x and reads and writes y, the transaction code actually has to be organized so that it does all its reads first and then, at the very end, does all the writes, essentially as part of the commit. So the client does the reads. In order to maintain locks, just as in last week's 6.033 reading, every time you read or write a data item, the server responsible for it has to associate a lock with that data item. The read locks in Spanner are maintained only in the Paxos leader. So when the client transaction wants to read x, it sends a read-x request to the leader of x's shard,
and that leader returns the current value of x and sets a lock on x. Of course, if the lock's already set, then the leader won't respond to the client until whatever transaction currently has the data locked releases the lock by committing; then the leader for that shard sends back the value of x to the client. The client also needs to read y. It got lucky this time because, assuming the client's in data center one, the leader is in the local data center, so this read can be a lot faster. The read sets the lock on y in the Paxos leader and then returns. Okay, so now the client's done all the reads. It does its internal computations and figures out the writes it wants to do: what values it wants to write to x and y. So now the client's going to send out the updated values for the records it wants to write, and it does this all at once towards the end of the transaction. The first thing it does is choose one of the Paxos groups to act as the transaction coordinator. It chooses this in advance, and it's going to send out the identity of which Paxos group is going to act as the transaction coordinator. Let's assume it chooses this Paxos group. I'll put a double box here to say that not only is this server the leader of its Paxos group, it's also acting as transaction coordinator for this transaction. Then the client sends out the updated values it wants to write. So it's going to send a write-x request here, with the new value and the identity of the transaction coordinator. When the Paxos leader for each written value receives the write request, it sends out a prepare message to its followers and gets that into the Paxos log; I'll represent that by a P in the Paxos log. By doing this it's promising to be able to carry out this transaction: that it hasn't crashed, for example, and lost its locks. So it sends out this prepare message and logs the prepare message to
Paxos. When it gets a majority of responses from the followers, this Paxos leader sends a yes vote to the transaction coordinator, saying: yes, I am promising to be able to carry out my part of the transaction. Similarly for the write to y: the client also sends the value to be written to y to y's Paxos leader, and that server, acting as Paxos leader, sends out prepare messages to its followers, logs the prepare in Paxos, and waits for the acknowledgements from a majority. Then you can think of the Paxos leader as sending the transaction coordinator, maybe just the same program on the same machine, a yes vote saying: yes, I can commit. Okay, so when the transaction coordinator gets responses from the leaders of all the different shards whose data is involved in this transaction, if they all said yes, then the transaction coordinator can commit; otherwise it can't. Let's assume it decides to commit. At that point the transaction coordinator sends out to its Paxos followers a commit message, saying: please remember permanently in the transaction log that we're committing this transaction. It also tells the leaders of the other Paxos groups involved in the transaction that they can commit as well, and so now each of those leaders sends out commit messages to its followers too. Actually, I think the transaction coordinator probably doesn't send out the commit message to the other shards until the commit is safe in its own log, so that the transaction coordinator is guaranteed not to forget its decision. Once these commit messages are committed into the Paxos logs of the different shards, each of those shards can actually execute the writes, that is, place the written data, and release the locks on the data items so that other transactions can use them. And then the transaction's over. So, first of all, please feel free to ask questions by raising your hand if you have
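The two-phase commit flow just described can be sketched roughly as follows. This deliberately ignores the Paxos replication inside each group (which is what makes each role fault-tolerant in Spanner), and all the names are invented for illustration:

```python
# Minimal two-phase commit sketch: a coordinator collects prepare votes
# from every participant shard, then tells them all to commit or abort.
# Paxos replication of each role is omitted here.
class Participant:
    def __init__(self):
        self.state = "idle"
    def prepare(self):
        # Phase 1: log the prepare, keep holding locks, and promise
        # to be able to commit. A real shard might vote no.
        self.state = "prepared"
        return True
    def commit(self):
        self.state = "committed"
    def abort(self):
        self.state = "aborted"

def two_phase_commit(coordinator_log, participants):
    votes = [p.prepare() for p in participants]   # phase 1: prepare
    if all(votes):
        coordinator_log.append("commit")          # decision made durable first
        for p in participants:                    # phase 2: commit
            p.commit()
        return True
    coordinator_log.append("abort")
    for p in participants:
        p.abort()
    return False
```

The key ordering detail, matching the lecture, is that the coordinator records its decision durably before telling the participants, so that a recovered coordinator can never forget a commit it already announced.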
questions. Okay, so there are some points to observe about the design so far, which has only covered the read-write transactions. One is that it's the locking that is ensuring serializability: if two transactions conflict because they use the same data, one has to completely wait for the other to release its locks before it can proceed. So Spanner is using completely standard two-phase locking in order to get serializability, and completely standard two-phase commit to get distributed transactions. Two-phase commit is widely hated, because if the transaction coordinator should fail or become unreachable, then any transactions it was managing block indefinitely until the transaction coordinator comes back up, and they block with locks held. So people have in general been very reluctant to use two-phase commit in the real world, because it's blocking. Spanner solves this by replicating the transaction manager: the transaction manager itself is a Paxos-replicated state machine. Everything it does, for example remembering whether it committed or not, is replicated into the Paxos log. So if the leader here fails, even though it was managing the transaction, because the group is Paxos-replicated, either of the other two replicas can spring to life, take over leadership, and also take over being the transaction manager. If the transaction manager decided to commit, any leader that takes over will see a commit in its log and be able to tell the other participants in the two-phase commit right away: look, this transaction was committed.
So this effectively eliminates the problem with two-phase commit, that it can block with locks held if there's a failure. This is a really big deal, because that problem otherwise makes two-phase commit completely unacceptable for any sort of large-scale system that has a lot of parts that might fail. Another thing to note is that there's a huge number of messages in this diagram, and many of them cross between data centers. Some of these messages, the ones that go between the shards, or between a client and a shard whose leader is in another data center, may take many milliseconds. And in a world in which computations take nanoseconds, this is a potentially pretty grim expense. Indeed, you can see that in Table 6, which describes the performance of a Spanner deployment where the different replicas are on different sides of the United States, East and West Coast: it takes about a hundred milliseconds to complete a transaction where the different replicas involved are on different coasts. That's a huge amount of time.
It's a tenth of a second. It's maybe not quite as bad as it may seem, because the throughput of the system, since it's sharded and can run a lot of non-conflicting transactions in parallel, may be very high. But the delay for individual transactions is very significant. A hundred milliseconds is maybe somewhat less than a human is going to notice, but if you have to do a couple of them just to generate a web page or carry out a human instruction, it starts to be an amount of time that's noticeable, starts to be bothersome. On the other hand, I suspect that for many uses of Spanner all the replicas might be in the same city, sort of across town, and there the much faster times you can see in Table 3 are relevant. Table 3 shows that transactions where the data centers are nearby complete in, I think, 14 milliseconds instead of a hundred milliseconds. So that's not quite so bad. Nevertheless, these read-write transactions are slow enough that we'd like to avoid the expense if we possibly can. That takes us to read-only transactions. It turns out that if you're not writing, that is, if you know in advance that all of the operations in a transaction are guaranteed to be reads, then Spanner has a much faster, much more streamlined, much less message-intensive scheme for executing read-only transactions. Okay, so: read-only transactions. Although they rely on some information from read-write transactions, the design is quite different. Spanner's read-only design eliminates two big costs that were present in read-write transactions.
First of all, as I mentioned, it reads from local replicas. As long as there's a replica of the data the transaction needs in the local data center, you can do the read from that local replica, which may take a small fraction of a millisecond, instead of maybe dozens of milliseconds if you have to go cross-country. But again, the danger here is that any given replica may not be up to date, so there has to be a story for that. The other big savings in the read-only design is that it doesn't use locks, it doesn't use two-phase commit, and it doesn't need a transaction manager. This avoids things like inter-data-center messages to Paxos leaders. And because no locks are taken out, not only does that make the read-only transactions faster, it also avoids slowing down read-write transactions, because they don't have to wait for locks held by read-only transactions. Just to preview why this is important: Tables 3 and 6 show roughly a ten-times latency improvement for read-only transactions compared to read-write transactions. So the read-only design gets a factor-of-ten boost in latency, with much less complexity, and almost certainly far more throughput as well. The big challenge is that read-only transactions don't do a lot of the things that were required in read-write transactions to get serializability, so they needed to find a way to square this increased efficiency with correctness. And there are really two main correctness constraints that they wanted read-only transactions to obey.
The first is that, like all transactions, read-only transactions still need to be serializable. What that means, just to review, is that even though the system may execute transactions concurrently, in parallel, the results that a bunch of concurrent transactions yield, both in terms of the values they return to the client and the modifications to the database, must be the same as some one-at-a-time, serial execution of those transactions. For read-only transactions, that essentially means that all the reads of a read-only transaction must effectively fit neatly after all the writes of the transactions that can be viewed as going before it, and it must not see any of the writes of the transactions that we view as going after it. So we need a way to fit all the reads of a read-only transaction neatly between read-write transactions. The other big constraint the paper talks about is that they want external consistency, which is actually equivalent to the linearizability we've seen before. What this really means is that if one transaction finishes committing, and another transaction starts after the first transaction completed in real time, then the second transaction is required to see the writes done by the first transaction. Another way of putting that is that transactions, even read-only transactions, should not see stale data: if there's a committed write from a transaction that completed prior to the start of the read-only transaction, the read-only transaction is required to see that write.
Okay, so neither of these is particularly surprising; standard databases, MySQL for example, can be configured to provide this kind of consistency. So in a way, if you didn't know better, this is exactly the consistency you would expect a straightforward system to provide. You don't have to have it, but it makes programmers' lives much easier; it makes it much easier to produce correct answers. Otherwise, if you don't have this kind of consistency, the programmers are responsible for programming around whatever anomalies the database may produce. So this is sort of the gold standard of correctness. Okay, so I want to talk about how read-only transactions work. It's a bit of a complex story, so what I'd like to do first is just consider what would happen if we did absolutely the stupidest thing, and had the read-only transactions not do anything special to achieve consistency, but just read the very latest copy of the data. Every time a read-only transaction does a read, we could just have it look at the local replica and find the current, most up-to-date copy of the data. That would be very straightforward and very low overhead. So we need to understand why that doesn't work. So: why not just read the latest value? Maybe we'll imagine that the transaction is one that simply reads x and y and prints them: read x, print x, read y, print y. Okay, so I want to show you an example of a situation in which having this transaction simply read the latest value yields incorrect, non-serializable results. We have three transactions running: T1, T2, T3. T3 is going to be our read-only transaction; T1 and T2 are read-write transactions. So let's say that T1 writes x and writes y and then commits, and maybe it's a bank transfer operation.
So it's transferring money from x to y, and we're printing x and y because we're doing an audit of the bank, to try to make sure it hasn't lost money. Let's imagine that transaction T2 also does another transfer between balances x and y, and then commits. And now we have our transaction T3; it needs to read x and y. So it's going to have a read of x, and say the read of x happens at this point in time. The way I'm drawing these diagrams is that real time, wall-clock time, the kind of time you'd see on your watch, moves to the right. So the read of x happens here, after transaction one completes and before transaction two starts. And let's say T3 is running on a slow computer, so it only manages to issue the read of y much later. The way this is going to play out is that transaction three will see the x value that T1 wrote, but the y value that T2 wrote, assuming it uses this dubious procedure of simply reading the latest value that's in the database. And this is not serializable, because we know that any equivalent serial order must have T1 followed by T2, and there are only two places T3 could go. T3 could go between T1 and T2, but it can't fit there, because if T3 were second in the equivalent serial order, then it shouldn't see writes by T2, which comes after it: it should see the value of y produced by T1, but it actually saw the value produced by T2. So that serial order wouldn't produce the same results. The only other order available to us is T1, T2, T3. That serial order would give the same value for y that T3 actually saw, but if it were the serial order, then T3 should also have seen the value of x written by T2, and it actually saw the value written by T1. So this execution is not equivalent to any one-at-a-time serial order. There's something broken about simply reading the latest value; we know that doesn't work.
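We can replay this broken interleaving with a toy key-value store to see the anomaly concretely. This is a sketch with made-up balances (assuming x and y each start at 10 and each transfer moves 1):

```python
# Replaying the broken interleaving: T3 reads "the latest value" with no
# snapshot, so it sees T1's x but T2's y, which matches no serial order.
store = {}

# T1 commits: transfer 1 from x to y (assumed initial balances 10 and 10).
store["x"], store["y"] = 9, 11

t3_x = store["x"]            # T3's read of x lands here: sees T1's x = 9

# T2 commits: another transfer of 1 from x to y.
store["x"], store["y"] = 8, 12

t3_y = store["y"]            # T3's slow read of y lands here: sees T2's y = 12

# T3 observed (9, 12). A serial T1, T3, T2 would have given (9, 11);
# a serial T1, T2, T3 would have given (8, 12). Neither matches.
assert (t3_x, t3_y) not in [(9, 11), (8, 12)]
```

The audit sees a total of 21 instead of 20, exactly the kind of inconsistent cross-transaction view that serializability rules out.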
What we're really looking for, of course, is that our transaction either reads both values as of this point in time, or reads both values as of that point in time. Okay, so the approach Spanner takes to this is somewhat complex. The first big idea is an existing idea: it's called snapshot isolation. The way I'm going to describe it, let's assume, even though this isn't true, that all the computers involved have synchronized clocks. That is, they all have a clock that yields a wall-clock time, like: oh, it's 1:43 in the afternoon on April 7th, 2020. That's what we mean by a wall-clock time. Furthermore, let's imagine that every transaction is assigned a particular time, a timestamp. These timestamps are wall-clock times taken from these synchronized clocks. For a read-write transaction, its timestamp, in this simplified design, is the time at which the transaction manager starts to commit; for a read-only transaction, the timestamp is equal to its start time. All right, so every transaction has a time, and our snapshot isolation system is designed to execute so as to get the same results as if all the transactions had executed in timestamp order.
So we're going to assign each transaction a timestamp, and then we're going to arrange the execution so that the transactions get the same results as if they had executed in that order. Given the timestamps, we need an implementation that will basically honor the timestamps, and show each transaction the data as it existed at its timestamp. Okay, so the way this works for read-only transactions is that each replica, when it stores data, actually keeps multiple versions of the data: we have a multi-version database. Every database record, if it's been written a couple of times, has a separate copy for each of the times it's been written, each copy associated with the timestamp of the transaction that wrote it. Then the basic strategy is that when a read-only transaction does a read, it has already allocated itself a timestamp when it started, and it accompanies its read request with that timestamp. Whatever server stores the replica of the data the transaction needs is going to look into its multi-version database and find the version of the record with the highest timestamp that's still less than the timestamp specified by the read-only transaction. That means the read-only transaction sees the data as of its chosen timestamp. Okay, so that's how the snapshot isolation idea works for read-only transactions, or rather how Spanner uses it for read-only transactions. Spanner still uses two-phase locking and two-phase commit for read-write transactions. The read-write transactions allocate timestamps for themselves at commit time, but other than that they work in the usual way, with locks and two-phase commit.
The read-only transactions, by contrast, access the multiple versions in the database and get the version whose timestamp is the highest one still less than the read-only transaction's timestamp. Where this gets us is that a read-only transaction will see all the writes of read/write transactions with lower timestamps, and none of the writes of read/write transactions with higher timestamps. Okay, so how would snapshot isolation work out for our example, the one from before, in which serializability failed because the reading transaction read values that did not correspond to any single point between the two read/write transactions? Here is that example again, but with snapshot isolation. I'm showing you this to demonstrate that the snapshot isolation technique solves our problem: it makes the read-only transaction serializable. Again we have two read/write transactions, T1 and T2, and a read-only transaction T3. T1 and T2 write and commit as before, but now they allocate themselves timestamps as of their commit times; so in addition to using two-phase commit and two-phase locking, these read/write transactions allocate timestamps. Let's imagine that at commit time T1 looked at the clock and saw that the time was 10. I'm going to use times like 10 and 20, but you should imagine them as real times, like four o'clock in the morning on a given day. So T1 sees that the time is 10 when it commits, and T2 sees that the time is 20 at its commit. I'll write each transaction's chosen timestamp after an @ sign. Now, here's what the Spanner storage servers are going to store.
When transaction T1 does its writes, instead of overwriting the current values the storage system adds a new copy of each record tagged with the timestamp: the database records that the value of x at time 10 is, let's say, 9, and the value of y at time 10 is 11; maybe we're doing a transfer from x to y. Similarly, T2 chose a timestamp of 20, because that was the real time at its commit, and the database remembers a new set of records in addition to the old ones: x at time 20, after another transfer from x to y, and y at time 20 equals 12. So now we have copies of each record at different times. Now transaction T3 comes along. It starts at about this time and does a read of x, and again it's going to be slow, so it won't get around to reading y until much later in real time. However, when T3 started, it chose a timestamp by looking at the current time, and since we know that in real time T3 started after T1 committed and before T2 committed, it must have chosen a timestamp somewhere between 10 and 20. Let's say it started at time 15 and chose timestamp 15 for itself. So when it does the read of x, it sends a request to the local replica that holds x, accompanied by its timestamp of 15: "please give me the latest data as of time 15." Transaction T2 hasn't executed yet, so the highest-timestamp copy of x is the one from time 10, written by T1, and we're going to get 9 for this read. Time passes, T2 commits, and now T3 does its second read, again accompanying the read request with its own timestamp of 15. Now the servers have two records for y.
But again, because the server gets T3's timestamp of 15, it looks at its records and says: aha, 15 sits between these two, so I'm going to return the record for y with the highest timestamp that is still less than the requested timestamp, and that's still the version of y from time 10. So the read of y will return 11. That is, the read of y actually happens late in real time, but because we remember the timestamp, and have the database keep data as of the different times it was written, it's as if both reads happened at time 15, instead of one at time 15 and one later. And now you'll see that this essentially emulates a serial, one-at-a-time execution in which the order is timestamp order: T1, then T3, then T2. That is, the serial order equivalent to the results actually produced is the timestamp order 10, 15, 20. Okay, so that's a somewhat simplified version of what Spanner does for read-only transactions; there's complexity which I'll get to in a minute. One question you might have is why it was okay for T3 to read an old value of y. That is, at the point in time when it issued this read of y, the freshest data for y was the value 12, but the value it actually got was intentionally a stale value, the value 11 from a while ago. So why is that okay? Why is it okay not to be using the freshest version of the data? The technical justification is that T2 and T3 are concurrent, that is, they overlap in time: the time extent of T2 is here and the time extent of T3 is here. And the rules for linearizability and external consistency say that if two transactions are concurrent,
the serial order the database uses is allowed to put the two transactions in either order, and here Spanner has chosen to put T3 before T2 in the serial order. Okay, Robert, we have a student question: does external consistency with timestamps always imply strong consistency? I'm going to guess yes. By strong consistency people usually mean linearizability, and I believe the definitions of linearizability and external consistency are essentially the same, so I would say yes. Another question: how does this not absolutely blow up storage? That is a great question. The answer is that it definitely increases storage costs, and the reason is that the storage system now has to keep multiple copies of records that have recently been modified multiple times. So there's expense both in disk and memory space, and there's an added layer of bookkeeping: lookups now have to consider timestamps as well as keys. The storage expense, I think, is not as great as it could be, because the system discards old records. The paper does not say what the policy is, but presumably it must be discarding old versions. Certainly, if the only reason for the multiple versions is to implement snapshot isolation for these kinds of transactions, then you don't need to remember values too far in the past: you only need to remember values back to the earliest start time of any transaction that's still running now. If your transactions always finish, or are forced to finish by killing them, within say one minute, then you only have to remember the last minute of versions in the database.
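That retention argument can be illustrated with a sketch. The paper doesn't give Spanner's actual garbage-collection policy, so this is purely a guess at the minimum a snapshot-isolation system must keep: everything newer than the oldest active reader, plus the single newest version that reader could still ask for.

```python
def gc_versions(versions, oldest_active_ts):
    """Discard versions that no running transaction can still read.
    versions is a timestamp-sorted list of (ts, value) pairs for one
    record. We keep the newest version at or below oldest_active_ts
    (the oldest active reader may still need it), plus everything
    newer than that."""
    keep_from = 0
    for i, (ts, _) in enumerate(versions):
        if ts <= oldest_active_ts:
            keep_from = i
    return versions[keep_from:]

# Versions of one record; a reader with timestamp 12 is still running.
vs = [(5, "a"), (10, "b"), (20, "c")]
gc_versions(vs, oldest_active_ts=12)  # drops only the version from time 5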
In fact, the paper implies that they remember data farther back than that, because it appears they support snapshot reads, which let you see data from a while ago, you know, yesterday or something; but they don't say what the garbage collection policy is for old versions, so I don't know how expensive it is for them. Okay. So the justification for why this is legal is that the only rule external consistency imposes is that if one transaction has completed, then a transaction that starts after it must see its writes. So say T1 completed at this time and T3 started just after it; external consistency demands that T3 see T1's writes. But since T2 definitely didn't finish before T3 started, we have no obligation under external consistency for T3 to see T2's writes, and indeed in this example it does not. So it's actually legal. Okay, another problem that comes up is that transaction T3 needs to read data as of a particular timestamp. The reason this is desirable is that it allows us to read from the local replica in the same data center; but maybe that local replica is in the minority of Paxos followers that didn't see the latest log records from the leader. Maybe our local replica never even saw these writes to x and y at all; it's still back at a version from time five or six or seven. So if we don't do something clever, and we ask for the highest-version record less than timestamp 15, we may get some much older version that's not actually the value produced by transaction T1, which we're required to see. The way Spanner deals with this is with its notion of safe time.
The scoop is that each replica is getting log records from its Paxos leader, and the paper arranges for the leader to send out log records in strictly increasing timestamp order. So a replica can look at the very last log record it got from its leader to know how up to date it is. If I ask for a value as of timestamp 15, but the replica has only gotten log entries from its Paxos leader up through timestamp 13, the replica is going to make us delay: it's not going to answer until it's gotten a log record with a timestamp of at least 15 from the leader. This ensures that replicas don't answer a request for a given timestamp until they're guaranteed to know everything from the leader up through that timestamp, and so this may delay the reads. Okay. So, the next question. In this discussion I've assumed that the clocks on all the different servers are perfectly synchronized, so that everybody's clock says, you know, 10:01 and 30 seconds at exactly the same moment. But it turns out that you can't synchronize clocks that precisely; it's basically impossible to get perfectly synchronized clocks, and the reasons are fairly fundamental. So the topic is time synchronization: making sure different clocks read the same real-time value. The fundamental problem is that time is defined as, basically, whatever it says on a collection of highly accurate, expensive clocks in a set of government laboratories. We can't directly read them; the government laboratories can only broadcast the time in various ways. The broadcasts take time, so at some later, possibly unknown time, we hear these announcements of what the time is, and we may all hear them at different times due to varying delays. But first I want to consider the impact on snapshot isolation.
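That safe-time delay can be sketched as follows. The class and its names are my own toy model, assuming, as the paper arranges, that log records arrive from the leader in strictly increasing timestamp order; versions are elided to keep the sketch short.

```python
import threading

class Replica:
    """Toy Paxos follower enforcing the safe-time rule: a read at
    timestamp ts blocks until the replica has applied log records
    from its leader up through ts."""

    def __init__(self):
        self.safe_time = 0   # timestamp of the last applied log record
        self.store = {}      # key -> value (multi-versioning elided)
        self.cond = threading.Condition()

    def apply_log_record(self, ts, key, value):
        # Records arrive in strictly increasing timestamp order,
        # so applying one advances the safe time.
        with self.cond:
            self.store[key] = value
            self.safe_time = ts
            self.cond.notify_all()  # wake any waiting readers

    def read(self, key, ts):
        with self.cond:
            # Delay rather than return possibly stale data.
            self.cond.wait_for(lambda: self.safe_time >= ts)
            return self.store.get(key)

r = Replica()
r.apply_log_record(10, "x", 9)
r.read("x", 10)  # safe_time is 10, so this returns immediately
```

A read at timestamp 15 against this replica would block until a log record with timestamp 15 or higher arrived, which is exactly the delay described above.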
What if the clocks are not synchronized, which they won't be? There's actually no problem at all for Spanner's read/write transactions, because the read/write transactions use locks and two-phase commit; they're not actually relying on snapshot isolation, so they don't care, and they will still be serialized by the two-phase locking mechanism. So we're only interested in what happens for a read-only transaction. First, suppose a read-only transaction chooses a timestamp that is too large, that is, far in the future: it's now 12:01 p.m. and it chooses a timestamp of one o'clock p.m. That's actually not that bad. What it'll mean is that when it sends a read request to some replica, the replica will say: wait a minute, your timestamp seems far greater than the last log entry I saw from my Paxos leader, so I'm going to make you wait until the log entries from the Paxos leader catch up to the time you requested, and only respond then. So this is correct but slow: the reader will be forced to wait. That's not the worst thing in the world. But what happens if a read-only transaction's timestamp is too small? This would correspond to its clock either having been set wrong, so it's in the past, or having been set correctly originally but ticking too slowly. This obviously causes a correctness problem; it will cause a violation of external consistency. The read will carry a timestamp that's far in the past, say an hour ago, and the multi-version database will return the value associated with that old timestamp, which may ignore more recent writes.
So assigning a transaction a timestamp that is too small will cause it to miss recently committed writes, and that's a violation of external consistency. So we actually have a problem here. The assumption that the clocks were synchronized is in fact a very serious assumption, and the fact that you cannot count on it means that unless we do something, the system is going to be incorrect. All right. So can we synchronize clocks perfectly? That would be the ideal thing, and if not, why not? So, what about clock synchronization. As I mentioned, time comes from, essentially, the median of a collection of clocks in government labs. The way we hear about time is that it's broadcast by various protocols, sometimes radio protocols. For Spanner, GPS acts as a radio broadcast system that relays the current time from a government lab, through the GPS satellites, to GPS receivers sitting in the Google machine rooms. There are other radio protocols, like WWV, an older radio protocol for broadcasting the current time, and there are newer protocols like NTP, which operates over the Internet and is also in the business of distributing time. So the system diagram is this: there are some government labs whose accurate clocks define a universal notion of time called UTC. We have UTC coming from clocks in labs, then some radio or Internet broadcast; in the case of Spanner, we can think of the government labs as broadcasting to the GPS satellites, and the satellites in turn broadcasting to the millions of GPS receivers out there.
You can buy a GPS receiver for a couple hundred bucks that will decode the timestamps in the GPS signals and keep you up to date with exactly what the time is, corrected for the propagation delay between the government labs and the GPS satellites, and also corrected for the delay between the GPS satellites and your current position. In each data center there's a GPS receiver connected to what the paper calls a time master, some server; there's more than one of these per data center in case one fails. Then there are the hundreds of servers in the data center running Spanner, either as servers or as clients. Each of them periodically sends a request saying "what time is it?" to one or, usually, more than one of the local time masters, in case one fails, and a time master replies with "I think the current time, as received from GPS, is such and such." Now, built into this, unfortunately, is a certain amount of uncertainty. One fundamental source is that we don't know exactly how far we are from the GPS satellites. The radio signals take some amount of time to reach our GPS receiver, even if the GPS satellite knew exactly what time it was, and we're not sure what that delay is. That means that when we get a radio message from the GPS satellite saying "it's exactly 12 o'clock," the propagation delay might have been a little more or a little less than we estimated, so we're not really sure whether it's 12 o'clock, or a little before, or a little after. In addition, every time the time is communicated, there's added uncertainty that you have to account for.
The biggest source is that when a server sends a request to a time master, it only gets a response after a while. Suppose the response says it's exactly 12 o'clock, but a second passed between when the server sent the request and when it got the response. Even if the master had the correct time, all the server knows is that the time is within a second of 12 o'clock: maybe the request was instant and the reply was delayed by a second, or maybe the request was delayed by a second and the response was instant. So all you really know is that the time is between 12:00:00 and 12:00:01. Okay, so there's always this uncertainty, and you really can't ignore it, because the uncertainties we're talking about are milliseconds, and we're going to find out that the uncertainty in the time translates directly into how long the safe-time waits have to be, and how long some other pauses have to be, the commit wait, as we'll see. Uncertainty at the level of milliseconds is a serious problem. The other big source of uncertainty is that each server only requests the current time from a master every once in a while, say every minute. In between, each server runs its own local clock that keeps time starting from the last value it got from the master, and those local clocks are actually pretty bad: they drift by milliseconds between the times the server talks to the master. So the system has to add the unknown but estimated drift of the local clock to the uncertainty of the time. In order to capture this uncertainty and account for it, Spanner has its TrueTime scheme, in which, when you ask what time it is, what you actually get back is one of these TTinterval things, a pair of an earliest time and a latest time: the earliest is the earliest
the time can possibly be, and the latest is the latest the time can possibly be. So when the application makes this library call asking for the time, it gets back this pair, and all it knows is that the current time is somewhere between earliest and latest. In this case, earliest might be 12:00:00 and latest might be 12:00:01; we're just guaranteed that the correct time isn't less than earliest and isn't greater than latest. We don't know where in between it lies. Okay, so when a transaction asks the system what time it is, this is what it actually gets back from the time system. Let's return to our original problem, which was that if its clock was too slow, a read-only transaction might read data too far in the past and miss data from a recently committed transaction. What we're looking for is how Spanner uses these TTintervals and its notion of TrueTime to ensure that, despite uncertainty about what time it is, transactions obey external consistency, that is, that a read-only transaction is guaranteed to see the writes done by any read/write transaction that completed before it started. There are two rules in the paper that conspire to enforce this, and the two rules, which are in section 4.1.2, are the start rule and commit wait. The start rule tells us what timestamps transactions choose: a transaction's timestamp has to be equal to the latest half of the current time. This is the TT.now() call, which returns one of those earliest/latest pairs bracketing the current time, and the transaction's timestamp has to be the latest half. That is, it's going to be a time that's guaranteed not to have happened yet, because the true time is between earliest and latest. For a read-only transaction, that timestamp is assigned as of the time it starts.
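A sketch of what a TrueTime-style interface and the start rule might look like. The fixed 5 ms uncertainty here is a stand-in for the bound Spanner actually derives from time-master synchronization plus local clock drift, and none of these names are Google's real API.

```python
import time
from collections import namedtuple

# The true time is guaranteed to lie in [earliest, latest].
TTInterval = namedtuple("TTInterval", ["earliest", "latest"])

def tt_now(eps=0.005):
    """Toy TT.now(): bracket the unknown true time with a
    fixed +/- 5 ms uncertainty bound."""
    t = time.time()
    return TTInterval(t - eps, t + eps)

def start_rule_timestamp():
    # Start rule: take the latest edge of the interval, a time
    # guaranteed not to have happened yet.
    return tt_now().latest
```

The point of choosing the latest edge is exactly what the lecture says next: it's a timestamp that cannot yet be in the past, which is what commit wait then turns into a guarantee.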
For a read/write transaction, the timestamp is this latest value as of the time it starts to commit. Okay, so the start rule says how Spanner chooses timestamps. The commit wait rule, which applies only to read/write transactions, says that after the transaction coordinator collects the votes, sees that it's able to commit, and chooses a timestamp, it's required to delay for a certain amount of time before it's allowed to actually commit, write the values, and release locks. So a read/write transaction has to delay until the timestamp it chose when it started to commit is less than the current time's earliest. That is, it sits in a loop calling TT.now(), and it stays in that loop until the timestamp it chose at the beginning of the commit process is less than the current time's earliest half. What this guarantees is that, since the earliest possible correct time is now greater than the transaction's timestamp, when this loop is finished, when the commit wait is finished, the transaction's timestamp is absolutely guaranteed to be in the past. Okay, so how does the system actually make use of these two rules to enforce external consistency for read-only transactions? I want to cook up a somewhat simplified scenario to illustrate this. I'm going to imagine that the writing transactions do only one write each, just to reduce the complexity. Let's say there are three transactions: T0 and T1 are read/write transactions and they both write x, and T2 is going to read x. T2 is going to use snapshot isolation and timestamps, and we want to make sure it sees the latest written value. So first, we imagine that T0 writes 1 to x and then commits.
Then T1 also writes x, writing the value 2 to x. We're going to say it's really at the prepare that the transaction chooses its timestamp, so this is the point at which it chooses its timestamp, and then it commits sometime later. And we're imagining, by assumption, that T2 starts after T1 finishes; it's going to read x afterwards, and we want to make sure it sees 2. So, let's suppose T0 chooses a timestamp of 1, commits, and writes the database. Now T1 starts. At the time it chooses a timestamp, it doesn't get a single number from the TrueTime system; it really gets a range of numbers, an earliest and a latest value. Let's say that at the moment it chooses its timestamp, the earliest value it gets is 1 and the latest is 10. The start rule says it must choose 10, the latest value, as its timestamp, so T1 is going to commit with timestamp 10. Now, it can't commit yet, because the commit wait rule says it has to wait until its timestamp is guaranteed to be in the past. So transaction T1 is going to sit there, asking "what time is it? what time is it?" until it gets an interval back that doesn't include time 10. At some point it asks what time it is and gets an interval where the earliest value is 11 and the latest is, I don't know, say 20. Now it says: aha, now I know my timestamp is guaranteed to be in the past, and I can commit. So T1 actually sits there through this commit wait period for a while before it commits. Okay, after T1 commits, transaction T2 comes along and wants to read x. It's also going to choose a timestamp, and we're assuming it starts after T1 finishes, because that's the interesting scenario for external consistency. So when it asks for the time, it asks at a real time after time 11, and it's going to get back an interval that includes that real time.
Say it gets back an interval that goes from time 10 as the earliest to time 12 as the latest. Since transaction T2 started after transaction T1 finished, the real time must be at least 11, so 11 must be at most the latest value. T2 chooses the latest half as its timestamp, so it actually chooses timestamp 12. And in this example, when it does its read, it asks the storage system: I want to read as of timestamp 12. Since transaction T1 wrote with timestamp 10, that means, assuming the safe-time wait does its job, we're actually going to read the correct value. What's going on here is that this example happened to work out, but indeed it's guaranteed to work out, as long as T2 starts after T1 commits. The reason is that commit wait causes T1 not to finish committing until its timestamp is guaranteed to be in the past. So T1 chooses a timestamp and is guaranteed to commit after that timestamp, and T2 starts after the commit. We don't know anything about what T2's earliest value will be, but its latest value is guaranteed to be at or after the current time, and we know the current time is after the commit time of T1. Therefore T2's latest value, the timestamp it chooses, is guaranteed to be after when T1 committed, and therefore after the timestamp that T1 used. Because T2 starts after T1 finishes, T2 is guaranteed to get a higher timestamp, and the snapshot isolation machinery, the multiple versions, will cause its read to see all the writes of lower-timestamp transactions. So T2 is going to see T1's write. That, basically, is how Spanner enforces external consistency for its transactions. Any questions about this machinery? All right.
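The commit-wait argument can be checked with a toy simulation. The FakeTrueTime class, its fixed uncertainty of 5, and the specific numbers are all made up to mirror the example above; this is not how Spanner's coordinator is actually written.

```python
class FakeTrueTime:
    """Toy clock: now() returns an (earliest, latest) pair that is
    guaranteed to bracket the true time, with fixed uncertainty eps."""
    def __init__(self, eps):
        self.t = 0          # the "true" time, advanced manually
        self.eps = eps
    def now(self):
        return (self.t - self.eps, self.t + self.eps)
    def advance(self, dt):
        self.t += dt

tt = FakeTrueTime(eps=5)

tt.advance(5)               # true time 5: T1 starts to commit
t1_ts = tt.now()[1]         # start rule: choose latest = 5 + 5 = 10

# Commit wait: T1 may not finish committing (write values, release
# locks) until its timestamp is guaranteed to be in the past.
while tt.now()[0] <= t1_ts:
    tt.advance(1)           # true time passes while T1 waits
# The loop exits at true time 16, when earliest = 16 - 5 = 11 > 10.

# T2 starts only after T1 has finished, so its start-rule timestamp
# is taken at true time >= 16 and must exceed T1's timestamp.
t2_ts = tt.now()[1]         # 16 + 5 = 21
assert t2_ts > t1_ts        # so T2's snapshot read sees T1's write
```

The final assertion is the whole point: no matter how large the uncertainty eps is, commit wait forces enough true time to pass that any later-starting transaction's timestamp lands after T1's.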
Let me step back a little bit. From my point of view, there are two big things going on here. One is snapshot isolation by itself. Snapshot isolation, that is, keeping the multiple versions and giving every transaction a timestamp, is enough to give you serializable read-only transactions, because snapshot isolation uses these timestamps as the equivalent serial order, and things like the safe-time wait ensure that read-only transactions really do read as of their timestamps, seeing every read/write transaction before that and none after. So there are really two pieces. Snapshot isolation by itself, which is actually widely used and not just by Spanner, doesn't by itself guarantee external consistency, because in a distributed system different computers choose the timestamps, and we can't be sure those timestamps will obey external consistency even if they deliver serializability. In addition to snapshot isolation, Spanner has synchronized timestamps, and it's the synchronized timestamps, plus the commit wait rule, that allow Spanner to guarantee external consistency as well as serializability. And again, the reason all this is interesting is that programmers really like transactions, and they really like external consistency, because these make applications much easier to write. They've traditionally not been provided in distributed settings because they're too slow. So the fact that Spanner manages to make at least read-only transactions very fast is extremely attractive: no locking, no two-phase commit, and not even any distant reads for read-only transactions; they operate very efficiently from the local replicas. This is good for basically a factor-of-ten latency improvement, as measured in tables 3 and 6. But just to remind you, it's not all roses.
All this wonderful machinery really only applies to read-only transactions; read/write transactions still use two-phase commit and locks. And there are a number of cases in which even Spanner will have to block, due to the safe time and the commit wait, though as long as the clocks are accurate enough, these commit waits are likely to be relatively small. Okay, just to summarize: Spanner at the time was kind of a breakthrough, because it was very rare to see deployed systems that offered distributed transactions where the data was geographically spread across very different data centers. People were surprised that somebody was running a database that actually did a good job of this, and that the performance was tolerable. The snapshot isolation and the timestamping are probably the most interesting aspects of the paper. And that is all I have to say for today. Any last questions? Okay. I think on Thursday we're going to see FaRM, which is a very different slice through the desire to provide very high performance transactions. So I'll see you on Thursday.