Alright folks, a warm welcome to all of you. I'm really glad that all of you could make it today and join us in this session, especially on a Friday evening. My name is Piyush Kudil, I'm the head of engineering for a company called Capital Technologies, and it is a pleasure to have Unmesh today with us. He will be talking about the Spanner paper, which is, I would say, one of the modern technological marvels, especially in terms of database systems, distributed systems and whatnot. Before we get on to that, I just want to take a minute to introduce Papers We Love. So folks, Papers We Love is a community of people who come together. We try to meet at least once a month, depending on the schedule and the interest we are able to gather. We get together as a forum and pick up one interesting paper, which could be on any topic. The last few sessions have been around distributed systems, infrastructure, and theory as well, and the Spanner paper is a continuation of that. I strongly encourage and request all of you to follow our Meetup Papers We Love page, keep following the sessions and the updates as and when they come, and we would love to grow this community further. Right. And without taking more time, a quick introduction of Unmesh. Unmesh is a principal consultant with ThoughtWorks. If you have been following technology articles, I'm sure you would have come across at least one article from Unmesh. I personally am a big fan of his writing. I've been attending his meetups and presentations very regularly; they are very enriching, very enlightening, and a lot of my technical concepts have improved after listening to him and reading his articles as well. So without much further ado, over to you, Unmesh. Thank you, Piyush, for the nice introduction.
So, we are going to discuss the Spanner paper today, but let me admit that it's a difficult paper, and I generally struggle with such papers, because there are a lot of things mentioned there that are very difficult to follow unless you see some actual implementations. So what I will try to do today is this: we will talk about things that are mentioned in the Spanner paper, and we'll see that for any distributed storage system, even one that is not globally distributed, like Kafka for example, if it needs to provide certain features like atomic commits and transactional consistency, then there are a lot of commonalities in the design choices. So things which are mentioned in the Spanner paper, you will today find a lot of those also in databases like MongoDB, as well as message brokers like Kafka. There are elements of it which are common, and one of the pieces of work that I'm involved in nowadays is to document these common things. I'm calling it Patterns of Distributed Systems, and I'm trying to document all these things which are common. So we'll go at it that way. I will talk through what the common things are, I will focus on some of the key ones, and then we'll refer back to the Spanner paper. But I will mostly be showing some example code that I have, which resembles Spanner because of these commonalities. So, okay, let me share my screen. While Unmesh is setting it up, folks, I just want to introduce Sripad as well. Sripad is another very knowledgeable technocrat I have come across, and he would be hosting Unmesh and moderating the session going forward. Over to you, Sripad. So Unmesh, in the meantime while you're setting it up: one of the key things Spanner published is being able to offer two-phase commit at a distributed scale, right?
Which no other database at that scale ever offered. At that time, I mean, when the paper was published, yes. At that time, yes, yes. Even today I doubt it, because... Even today there are... But yes, I mean, it was very interesting timing, because the paper came out in 2012, when NoSQL was the buzzword. And here was a globally distributed database which advocates consistency with two-phase commit. I mean, two-phase commit has a very bad rap in the history of the J2EE world, and even later it was almost standard advice to avoid two-phase commit. And then having a globally distributed database which is really encouraging people to use two-phase commit, advocating the benefits that it has, was really a unique thing at the time. Yeah, but even today, right? So let's say adaptations of it like CockroachDB, etc. Maybe I think they run in one data center or maybe one region, but not across data centers or regions, because of particular hardware setup requirements, maybe, but... Right. Yeah, yeah. For people who are interested in this: they touch upon the whole timing question in the paper only slightly. If you are interested, I'll share a paper on PTP protocols and the general issues with timing as such, because that's a very hard thing to get right. They don't talk about it in this paper, but it's highly recommended reading alongside it. Right, right. Yeah, I mean, so one of the things that Spanner talks about, and we'll touch upon that, is the TrueTime API that they have, and the tremendous timing machinery that they installed across the globe to make sure that the time offset across machines across the globe is limited to, let's say, 10 milliseconds. And we'll try to see why that's important. Because, as Sripad mentioned with CockroachDB, there are open source alternatives to Spanner which are also getting popular nowadays, like YugabyteDB and TiDB and CockroachDB. And they don't use TrueTime; they use something like a hybrid clock.
So one of the things that Spanner advocated and really implements nicely is two-phase commit, and two-phase commit with locks, and that's tremendous, because getting scalability and availability with this is a tricky thing. I mentioned briefly some of the patterns, what I call patterns, which you see used in Spanner, and two-phase commit is one of those. Two-phase commit is a pattern to guarantee atomic storage across multiple servers, and there are some very common implementation techniques used to provide this guarantee; Spanner uses some of those, and we'll look at them. The other is reading without locks. Two-phase commit requires locks, and to get better performance, obviously you can't take locks for every request. So reads are implemented without locks. And when you need to implement reads without locks, you need to have some kind of versioning. I mean, that's the most common pattern that is used. So I call that pattern Versioned Value, and Spanner does the same thing: it uses versioned values. So does MongoDB nowadays, and CockroachDB and YugabyteDB. All these modern distributed databases use versioned values. By the way, it's even used in... So Suraj has a question. Suraj? Good, I think. Yeah. Right. So you will see at the implementation level, again, how a versioned storage is implemented is very common, and Spanner has a similar implementation, even if the paper mentions only briefly that they use versioned storage. But the other critical design choice that needs to be made when you use versioned storage is what to use as a version. For example, if you see things like etcd, they will just use an incrementing integer, more like a Lamport clock, as a version, and that's enough for some cases. For a database like Spanner, generally a timestamp is used as a version, the timestamp at which you are storing the values.
And then that opens a lot of problems, because once you have time in the picture, assigned to a particular write or a read, there are a lot of problems that you need to solve. But the first one is when you mix this with a two-phase commit. Because they use time, it also solves a lot of other issues, right? Oh, yeah, absolutely. If you get that time ticking correctly. Yes, yes. So, I mean, generally, if you get a time which is just monotonic, it's an increasing integer in a way, as long as it really syncs. So the cost of time is basically to agree on a common time at any moment with a reasonable amount of error. Right. And that's, I think, the spirit of Spanner: basically, if we can reduce that error bound to an acceptable minimum, then we can use time as a globally coordinated monotonic integer for coordination. And that's, I think, the beauty of it. Yes, yes. And one of the things is that, if you see a database like TiDB, it uses a separate central timestamp server, which assigns timestamps, and every other client or server requests timestamps from it. And that then becomes a bottleneck. So what Spanner does with its timing machinery is make the underlying machinery so robust that you don't need this kind of coordination to get monotonically increasing timestamps. But that's one of the things. At the core structural or design-choices level, these are the things: two-phase commit with locks, reads without locks, and for that you need to have versioned values, and then using a timestamp as a version, then choosing a point in time when you assign those timestamps to writes, and then tying this into the two-phase commit, because you're not doing individual writes; you are doing multiple writes to multiple servers. One question. In the paper, what is mentioned is that even for a read, you have to say that you are reading without a lock. Yes, yes, yes.
Yes, you need to mention that, because otherwise that read will be like any read in a read-write transaction, and then it will always hold locks. So you need to have a read-only transaction, which is marked as read-only. And then, again, the tricky decision that the implementation needs to make is what timestamp to assign for that read. So we'll focus on these aspects, because these are, I think, tricky and complex enough to talk about. One question: in a normal DB workload, we can do a select for update, or at least make the intention to lock explicit. But there's no select-for-read; there's no way to say I don't want to lock, and that has a lot of other issues. Yes, yes. I mean, just adding that API itself is genius work. Yeah, yeah. And then there are these other patterns. So one of the things to note is that the Spanner paper mentions Paxos, and a lot of Google and Microsoft papers mention Paxos for historical reasons. But whenever they say Paxos, it's really a replicated log implementation, which is, I would say, very similar to Raft. Because when they say Paxos, it's really Multi-Paxos, which is a replicated log. So that's one of the key patterns, and that's one of the things to keep in mind when you hear the word Paxos and long-lived leaders and stuff like that: it's mostly Raft. Then there's partitioning, because that's one of the decisions you need to take, how you partition this data across all these servers, and Spanner uses key-range partitioning; we won't go in there. And then one of the key things for any distributed database is to have some place where you keep metadata, as well as where you manage group membership and failure detection and decisions like that. And for that, you will typically have a particular Paxos group, like a group of servers which are kept consistent with some consensus algorithm like Raft. And that's called a consistent core. I mean, I call that the Consistent Core pattern.
Again, very similar things you will see in all the other databases like Mongo, YugabyteDB or CockroachDB, but also in things like Kafka and etcd, and even Kubernetes, by the way, even if it's not a database. So we'll focus on the first two things, and I will try to drive it through code. We'll see what two-phase commit needs, what the design choices are, and what choices Spanner has made. Let's look at an example. So imagine you have, let's say, a hundred servers on which you can store data, and there are equipment bookings which are stored on different servers. I mean, I just chose this example because I didn't want the account and banking example; I wanted something different. So truck bookings go on one server and backhoe bookings go on some other server. And the key thing is that if someone is planning to book multiple pieces of equipment, that booking needs to be atomic. It should not happen that the booking for one, like the truck, goes through but the backhoe can't be booked, or vice versa. Either all should go through or none should go through. That's the requirement of atomic commit across servers. So what we are seeing in this test is there are these hundred servers, and there is a client, Alice. A typical booking goes like this: you first check if a truck is available on a Monday and a backhoe is available on a Monday. And if both are available, you say book a truck and then book a backhoe, where the owner is myself, like Alice. And the requirement is that these two should go atomically even if they are across two servers. And what you can typically expect is that you do a commit when you're done; you say commit this transaction, and then wait for that commit to happen. Now, to implement this atomic commit, you need to make multiple design choices, and obviously Spanner makes all those design choices.
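To make the all-or-nothing requirement above concrete, here is a minimal single-process Python sketch; the names (`Booking`, `reserve`, etc.) are illustrative assumptions, not Spanner's API, and the real system spreads the two keys across servers with a coordinator, which we'll get to.

```python
# Toy sketch of the booking example: writes are buffered, invisible to
# readers, and applied all-or-nothing at commit time.
class Booking:
    def __init__(self):
        self.store = {}          # the visible, committed state
        self.pending = {}        # buffered writes, invisible until commit

    def is_available(self, equipment, day):
        return (equipment, day) not in self.store

    def reserve(self, equipment, day, owner):
        self.pending[(equipment, day)] = owner   # buffered, not visible yet

    def commit(self):
        # all-or-nothing: re-check availability, then apply every buffered write
        if any(k in self.store for k in self.pending):
            self.pending.clear()
            return False
        self.store.update(self.pending)
        self.pending.clear()
        return True

booking = Booking()
if booking.is_available("truck", "monday") and booking.is_available("backhoe", "monday"):
    booking.reserve("truck", "monday", "alice")
    booking.reserve("backhoe", "monday", "alice")
assert booking.commit()                                 # both applied together
assert booking.store[("truck", "monday")] == "alice"
assert booking.store[("backhoe", "monday")] == "alice"
```

The point is only the shape of the client flow: check, reserve, then one commit that either lands both bookings or neither.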
First, you need a persistent coordinator: one server which acts as a coordinator amongst the group of servers that you have. And it should be persistent, because if it crashes and comes back up, it should know what state it was in, so that it can resolve the pending transactions. The servers themselves need to detect that a particular request coming to them is part of a transaction, so that they can act accordingly and decide whether to actually make a value visible to the user or not, and how to hold locks, and stuff like that. You can see here that Spanner can essentially be looked at as a distributed key-value store, and for any action it does, a get or a put, based on the key you are interacting with, you need to pick one coordinator amongst all of the servers that you have. And what happens typically in distributed key-value stores, again, the Spanner paper doesn't explicitly mention this, but you take the first key that the client interacts with, and you mark the server for that key as the coordinator for your transaction. Spanner obviously does that, like any other such database, so it picks up one server as a coordinator. The other thing you need to do is, whenever you start a transaction, you need to mark every action that you are doing, like a put or a get in this case, with that transaction: you need to create a transaction reference. And a transaction reference needs to be uniquely identified, and we'll see how Spanner does it slightly differently. But you need something like this. And then when you send a get request or a put request, put requests are slightly different than gets, but what you will typically do for a get is acquire a read lock, and only if you can acquire the read lock will you read the value from the underlying KV store.
And the thing is that if there is a pending write transaction, your read will block. So, and this is again example code, but you can expect Spanner to have very similar code: only after your lock is granted will your get request be fulfilled. Puts are slightly different. So when you acquire a read lock, if there are multiple read locks, would I be granted a read lock? Yes, you would be. I mean, if there are only read locks held, you would be granted a read lock. So, on the contrary, if there is any read lock, then a write lock cannot be acquired? Yes, a write lock cannot be acquired, but you can have multiple readers. So I can maybe go inside that acquire method. It's very similar to a read-write lock. Yes, it is a read-write kind of lock, but you can see that it's just a data structure. It has owners, which can be multiple, and it will have multiple owners only if reads are happening. And then there's a wait queue that you will typically have in the lock for waiting transactions. And Spanner does this too, and we'll see this gets into conflict situations and deadlocks; obviously, you will get into deadlocks with this. But before going there, you see that the puts that are happening are not really applied immediately; they are just buffered. You see here that there is a transaction state that is maintained, and you just keep the pending updates with yourself. Now, one of the things here is that, because puts are not really applied or stored till your transaction is committed, you don't need to send them to the actual servers. You can defer all of that till the very end, till the commit happens. And one of the things to note here is that typical databases have different consistency guarantees, right, like read uncommitted and other levels. Spanner is strictly serializable. I mean, it's actually more than serializable; it's a linearizable database. So it doesn't allow you to read uncommitted transactions.
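The lock data structure just described, owners that can be multiple only for readers, plus a wait queue for blocked transactions, can be sketched like this (a non-blocking toy, assuming my own names; a real lock table would park and wake waiters):

```python
from collections import deque

# Toy lock-table entry: shared read locks, exclusive write locks,
# and a wait queue for transactions that hit a conflict.
class Lock:
    def __init__(self):
        self.owners = set()      # transaction ids holding the lock
        self.writer = None       # set only when held exclusively
        self.waiting = deque()   # transactions queued behind a conflict

    def try_read_lock(self, txn):
        if self.writer is None or self.writer == txn:
            self.owners.add(txn)         # shared: many readers may own it
            return True
        self.waiting.append(txn)
        return False

    def try_write_lock(self, txn):
        # exclusive: no other reader or writer may hold it
        if self.owners - {txn} or self.writer not in (None, txn):
            self.waiting.append(txn)
            return False
        self.writer = txn
        self.owners.add(txn)
        return True

lock = Lock()
assert lock.try_read_lock("t1") and lock.try_read_lock("t2")  # readers share
assert not lock.try_write_lock("t3")                          # writer blocks
assert list(lock.waiting) == ["t3"]                           # queued behind readers
```

This is exactly the question-and-answer from the talk in code: multiple read locks coexist, but a write lock cannot be granted while any other reader holds the lock.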
And to add to that, and we'll see in some time, your writes are visible only with a particular commit timestamp. Your key is not just a key; it's a key with a timestamp. And because that timestamp is assigned only at commit time, only when you actually commit a transaction, it's absolutely fine even if you just buffer these values on the client side. So the difference between read and write, you can see here, is that reads go to the actual server where the key resides and hold a lock, while writes are just buffered till the very end. And by the way, we have not yet put timestamps and versions in here; we'll do that a bit later. But you will see that even with versions and the timestamps that you get from TrueTime, the core structure remains exactly the same. So we are looking at two-phase commit and essentially the transaction implementation in Spanner, which is very similar. Gets go to the server and acquire locks; puts don't go to the server, or even if they go to the server, they are buffered. In my sample code, they are going to the server. And the other thing is, you mark everything with a unique transaction key, and you pick a coordinator, which keeps track of which keys are part of your transaction so that it can use that later. Once you are, let's say, done with whatever you wanted to do, like reserving and checking availability, you commit. And the commit request essentially goes to your coordinator. The coordinator does a lot of things now. It keeps track of the transaction state, so it knows about this transaction, and it uses a write-ahead log to record every step that it's taking for this particular transaction. The other thing it does is send... So this is where the actual two phases of this two-phase commit get executed. What was done before was just setting up everything that's typically needed for two-phase commit.
So it sends a prepare request to all the participants in this particular transaction. A particular server receives a prepare request. You can see that the gets already acquired their locks; the updates or puts acquire locks in this prepare stage. And they will return a successful result only if they can acquire the locks successfully, and they record all these decisions in their own write-ahead logs. So in the prepare for updates, you can see that you have these pending updates that we had buffered in the put operations. They try to acquire locks, and these will be read-write locks, or rather write locks in this case. And only if they acquire the write locks do they record this pending update into the write-ahead log, just in case this server crashes; at restart it will be able to figure out that there were pending transactions. And then it returns the result, and if all the participants reply successfully, then the coordinator will actually commit the transaction. It will send the commit message to all the participants. And this is where the puts we had buffered will actually be made available in the key-value store. So only after apply will you have these pending updates applied to the key-value store, and we'll see why this is a navigable map and a concurrent skip list; I will comment on that. But only when you commit are these changes actually made available to the end user. So if someone is trying to read at this point, the values are made available. And then it releases the locks. Now, one of the key issues whenever you use locks in two-phase commit is deadlock, whenever you have concurrent transactions going on. You can have a read lock, let's say, in this case: you have Alice checking availability and just calling reserve, and we saw that this reserve is not holding any locks; it's actually just buffered.
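The prepare/commit flow just walked through can be sketched end to end; this is a minimal in-memory sketch under my own assumed names, with the write-ahead log reduced to a list (in a real system it would be durable, and a failed prepare would trigger aborts):

```python
# Minimal two-phase commit sketch over in-memory "participant" servers.
class Participant:
    def __init__(self):
        self.store, self.pending, self.wal = {}, {}, []

    def prepare(self, txn, updates):
        # would also acquire write locks here; False means prepare failed
        self.wal.append(("prepared", txn, updates))  # durable in a real system
        self.pending[txn] = updates
        return True

    def commit(self, txn):
        self.wal.append(("committed", txn))
        self.store.update(self.pending.pop(txn))     # now visible to readers

class Coordinator:
    def commit_transaction(self, txn, writes_by_participant):
        # phase 1: prepare on every participant; all must succeed
        prepared = all(p.prepare(txn, w) for p, w in writes_by_participant.items())
        if not prepared:
            return False   # a real coordinator would send aborts here
        # phase 2: commit everywhere; buffered writes become visible now
        for p in writes_by_participant:
            p.commit(txn)
        return True

trucks, backhoes = Participant(), Participant()
ok = Coordinator().commit_transaction("txn1", {
    trucks:   {"truck_monday": "alice"},
    backhoes: {"backhoe_monday": "alice"},
})
assert ok and trucks.store["truck_monday"] == "alice"
assert backhoes.store["backhoe_monday"] == "alice"
```

Note how nothing reaches `store` until phase 2, which mirrors the point above: readers only ever see values after the commit message is applied.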
But let's say this Bob transaction is going on, and when Alice commits, she will be blocked. And she will be blocked because Bob is holding a read lock. He just has a read-only thing going on here, and he did a get with a read lock. So Alice's transaction will be blocked and won't be able to commit till Bob's transaction commits. And if Bob's transaction does a reserve as well, there will be a circular thing happening there, a deadlock, each depending on the lock held by the other. So you need a conflict resolution policy: if there is a deadlock, what do you do? Spanner also needs to implement that. There are just three policies; I mean, it's not rocket science. We can look at a test maybe. So whenever you create this lock manager, or a transaction manager with a lock manager, you can say what conflict resolution policy you want, so that it will allow only one transaction to go through, and all the other conflicting transactions it will kill: it will throw errors whenever there are requests to acquire locks for those transactions. And as you can imagine, we saw the method for lock acquisition, and you can see here that if it cannot acquire a lock, there is a potential conflict. If you have an error policy, you say that if there is a conflict, you just kill the transactions, or the transaction, causing the conflict. Then there are two other policies, wound-wait and wait-die, and I won't go into a lot of detail here, but what wound-wait says is that it will allow only older transactions to continue: if a transaction is older, it will be granted the lock; if it is a younger transaction, it will need to wait.
But if there is a conflicting transaction, like an older transaction trying to grab a lock that a younger transaction holds, it will kill that younger transaction. And because you are allowed to kill other transactions, that essentially breaks the deadlock. And the general observation is that the wound-wait policy has fewer restarts. One of the issues here is that when you error out or kill other transactions, there will be more restarts, because whenever your transaction fails, you will typically retry to make sure the transaction happens. And if you kill more transactions, there will be more retries, and your system throughput will essentially go down. So wound-wait has proven to have fewer restarts, and so Spanner uses wound-wait. And this is one of the examples of the passing references in the Spanner paper: it mentions that read-write transactions use wound-wait to avoid deadlocks, and this is how you can imagine it being implemented. What happens because of that is, when there are these two transactions, like this Alice transaction happening, and Alice's transaction has a read lock here and a lock here as well, when Bob's transaction tries to get the read lock, it won't get it, and when it then tries to commit, it will fail: it will be wounded. And it will need to retry; you will see that. So this is how you can typically expect a two-phase commit with a deadlock prevention mechanism to look. The other thing I did not talk about... Yogi had a question. Okay. Hey, hi, hi. Yeah, so you mentioned earlier that... I'm not clear whether this implementation here is more of an abstracted implementation of Spanner, because there seem to be some differences between some assumptions made here and what's in Spanner. Yes, yes, yes. So, yeah, it's an abstracted implementation.
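The wound-wait rule described above fits in a few lines. A sketch, assuming transaction ids double as ages (a smaller id means an older transaction, which is how monotonic ids make the rule work):

```python
# Wound-wait conflict resolution: an older requester "wounds" (aborts)
# a younger lock holder; a younger requester simply waits.
# Deadlock-free because waiting only ever happens in one direction:
# younger waits for older, never the reverse.
def resolve_conflict(requester_id, holder_id):
    if requester_id < holder_id:
        return "wound_holder"   # older wins: abort the younger holder
    return "wait"               # younger requester waits, never wounds

assert resolve_conflict(1, 2) == "wound_holder"  # older txn 1 vs younger holder 2
assert resolve_conflict(2, 1) == "wait"          # younger txn 2 waits for older 1
```

In the Bob-and-Alice scenario, whichever transaction is younger gets wounded when the conflict is detected, and the client library retries it with a fresh attempt.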
But I have referred to Mongo and Cockroach, and obviously for Spanner there is no code to look at, but you can expect it to be similar. In the design choices, like when to go to a server, or when to get your coordinator involved, at the very end or from the beginning, there will be differences. But all in all, I think broadly... Yeah, so I think particularly on the read side, Spanner is very different, right? The reads, unless you ask for a transactional read, will not go to the leader. Yeah, yeah. So I will talk about that. Right now, what I just showed is the two-phase commit implementation with reads and writes. Read-only reads are very different: they won't hold locks and they can go to any replica. And the other thing we need to talk about is when a particular version number or timestamp is assigned for a read or a write; that will be very different. But the basic code structure you can expect to be very similar. Yeah, one of the other things, again, the Spanner paper does not explicitly talk about it, but you need a way to order your transaction references, because we talked about wound-wait and wait-die and younger transactions and older transactions. So the transaction reference that you create needs to be assigned some ID. And typically in distributed systems, you will see that these are server IDs, which are unique and monotonically assigned. Spanner does not talk about how they assign these IDs, but there needs to be a way to assign them. And the other thing is, you need a way to track the age of a transaction. You need to know which transaction has been in the system for a longer time and which for a shorter time. You can typically keep all this information in the transaction reference that you pass along.
And instead of a UUID like this, you will have a monotonic ID assigned to a transaction, and you track the age. So any time any action is done by a client on a transaction, you increment the age. And based on that, you can then compare which transaction is older and which is younger. So this is the basic code structure for a two-phase commit which has locks, which happens across multiple servers, with some elements of what Spanner does in here. On this, I have two questions actually. When it comes to writes, how do you differentiate between updating and creating new entities in this protocol? Because creation, in a way, is... You don't really, I mean, because if you consider a key-value store, you're just putting into the store. And the other thing is, because it will be versioned storage, your update is essentially an insert with a new version, basically. Yeah. So we will talk about the versioned values. But just to note some of the design choices: the two-phase commit code structure that I showed, you will see a similar code structure in Kafka, CockroachDB, MongoDB, and obviously Spanner. The design choices differ essentially in when to start interacting with the coordinator; that is one of the things that can differ. In CockroachDB and Kafka, you will see, in Kafka's two-phase implementation, they involve the coordinator right at the beginning. Spanner defers it to the very end, because one of the things Spanner is very cautious about is avoiding calls, or making as many calls in parallel as it can, across a set of servers, obviously because it's designed to be globally distributed, with servers which are far away, and more latency is involved. So it defers talking to the coordinator till the very end. The other choice is whether to use locks or not; again, Spanner uses locks, and that actually simplifies a lot of things, and we'll see how, maybe. But CockroachDB is mostly lockless.
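The monotonic-ID-plus-age idea above can be sketched as follows; the class and field names are my own illustration, and the global counter stands in for the server-assigned monotonic ids the talk mentions:

```python
import itertools

# Sketch of a transaction reference carrying a monotonic id (for breaking
# ties across servers) and an age counter bumped on every client action.
_ids = itertools.count(1)   # stand-in for server-assigned monotonic ids

class TransactionRef:
    def __init__(self):
        self.txn_id = next(_ids)
        self.age = 0

    def touch(self):
        self.age += 1        # incremented on every get/put by the client

    def is_older_than(self, other):
        # higher age = older; break ties with the lower (earlier) monotonic id
        return (-self.age, self.txn_id) < (-other.age, other.txn_id)

t1, t2 = TransactionRef(), TransactionRef()
t1.touch(); t1.touch(); t2.touch()
assert t1.is_older_than(t2)   # t1 has done more work, so it counts as older
```

Because the age is computed by one client and merely carried along in the reference, no cross-server clock agreement is needed for the wound-wait comparison, which is the point made in the discussion below.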
They are now implementing locks to some extent, but mostly they don't use locks. And in the case of Kafka, you can see that there is an implicit lock: Kafka doesn't use any explicit locks, but till your transaction commits, you are kind of blocked from reading anything that's going on on your topic or partition. The other thing, as we saw... Sorry to interrupt, quick question. I think in your code you are mentioning that there is a clock, so there's a timestamp that is also being used as an ID. Just wanted to understand... the timestamp that is used to calculate the age of the transaction, I think it is playing a role. Just wanted to understand, in a distributed setup, who's providing that unified view? Because I think it's a difficult problem across distributed servers. Yeah. So in this case, it's simpler, because the transactional client is calculating the age, and it's based on your monotonic clock, a nanoTime-based clock. So the age is getting calculated only on a single server; it's just passed along to all the other servers. The age is not computed on different servers. So you see... I think I do. Okay. I have some other code in here, but you can imagine the client incrementing this age whenever it does a put or a get using this transaction. So... Good. Yeah, I mean, you can imagine it happening. Okay. I think in the Spanner paper, they've implemented TrueTime for this. No, no, no: TrueTime is just for versioning. For ordering transactions, you still need some way to measure the age of a transaction and order your transaction IDs, which is different from the versioning usage of TrueTime. But then who decides the ordering? Your lock manager will, based on your transaction references. So, if you see, this is a lock manager.
Yeah, I mean, it's not deciding the ordering, but it will use this ordering for breaking ties. So basically, my doubt is whether the timestamp is only used for ordering the transactions, right? The timestamp is not... I mean, I have this code, but if you use this timestamp, then it won't work across servers. This will work across servers because you're essentially ordering on IDs or on age. Okay. Yeah, the timestamp alone won't work across servers. Is it the same in Spanner as well? At least the Cloud Spanner documentation says that they use transaction IDs which can be ordered, and essentially the wound-wait implementation uses that transaction-ID-based ordering. Now, how they assign them, they haven't mentioned. Okay. Yeah. But I mean, you will typically have this use of monotonic IDs; in Paxos or Raft, you have this use of monotonic IDs that you need for breaking ties, which is typically done by using your server IDs, and server IDs are expected to be monotonic, assigned one at a time. So I can imagine Spanner using something very similar. Right. So yeah, these are the design choices that we talked about. And the last one is about write requests. We saw that write requests are essentially not persisted, not visible to the clients, till the commit stage, so they can be deferred, and Spanner does defer sending those to the actual servers till the very end. CockroachDB has a different implementation, which writes these as a kind of intent record, which is used as a lock as well. So it sends every write to the server; Spanner defers that. So with Spanner, you can expect that at commit it will send all the pending writes to the servers as part of the prepare message, not before that.
But the other aspect now, on top of two-phase commit, that Spanner or any distributed database needing lock-free reads adds, is versioning — some kind of versioning so that your writes don't conflict with your reads, particularly if you're doing a read-only transaction. If you're doing a read-write transaction, then there is no choice but to block each other, have conflict resolution, and use locks. But if you have read-only transactions, and you need to allow your read-only transactions to continue without conflicting with your read-write transactions, then you need some kind of versioning implemented. And I have documented that as a pattern called Versioned Value. And again, code-structure-wise, a versioned-value implementation of a storage is very, very similar across all databases, including CockroachDB and MongoDB, and you can imagine it to be the same in Spanner. And most of the modern databases use RocksDB as their storage engine for implementing this versioned storage. Now, the choice that's always tricky is what to use as a version. You can have a single integer used as a version, or you can have a timestamp — a system timestamp — used as a version. Code-structure-wise there won't be any difference. And we'll see that whether you use a single integer, a timestamp given by TrueTime, or a hybrid-clock-like thing as CockroachDB implements, the code will look the same — the same key-value store. Now, one question. When you say lock-free, by definition, does that mean it cannot read the head of the version tree? Because if you start reading the head, that means it is not allowed to be modified. By the head, you mean the latest thing that's going on? Or does it not matter — you'll always read with a version? No, that's a good point.
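The Versioned Value pattern he references can be sketched as a key-value store whose keys carry a version — a minimal illustration, assuming an integer version and the ConcurrentSkipListMap structure he describes later (class names are mine):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentSkipListMap;

// Versioned Value sketch: every put stores (key, version) -> value,
// and a read asks for "the value of key as of version v".
class VersionedKVStore {
    // Composite key ordered by key first, then by version.
    record VersionedKey(String key, long version) implements Comparable<VersionedKey> {
        public int compareTo(VersionedKey o) {
            int c = key.compareTo(o.key);
            return c != 0 ? c : Long.compare(version, o.version);
        }
    }

    private final ConcurrentSkipListMap<VersionedKey, String> store =
            new ConcurrentSkipListMap<>();

    void put(String key, long version, String value) {
        store.put(new VersionedKey(key, version), value);
    }

    // Latest value with version <= readVersion, or null if none exists.
    String get(String key, long readVersion) {
        Map.Entry<VersionedKey, String> e =
                store.floorEntry(new VersionedKey(key, readVersion));
        return (e != null && e.getKey().key().equals(key)) ? e.getValue() : null;
    }
}
```

Because reads only ever look at versions at or below their read version, a concurrent writer appending a newer version never disturbs them — which is the whole point of lock-free reads.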
I mean, the thing there is that one of the guarantees these databases need to provide, when they say reads don't conflict with writes, is this: if you read at a particular timestamp — I'll just take integers as timestamps — if you read at timestamp one and it returns you, let's say, nothing, then any write that happens should happen after one. There should not be any write happening at one. Right. So it's a write-after-read kind of guarantee that you need to provide. And that's the tricky thing without locks, because you read at a particular timestamp without holding a lock, and your timestamps or versions are assigned to writes when the actual commit happens. But that commit now somehow needs to take into account that there was a read at a possibly earlier timestamp or version, so the write should happen at a larger version than that read. And Spanner handles it in a certain way — TrueTime actually helps it, because TrueTime guarantees some amount of consistency across the servers. But Spanner still needs some kind of a wait added to the read request, to make sure that if any write is about to happen at that particular timestamp, it happens before the read response is returned. Or, what it can do is: your client first checks with the server what the safe timestamp to read at is, and then it reads at that timestamp. So it's a very tricky thing to implement without locks, and one of the complex things. All the other databases also have to deal with this — remember, they all say lock-free reads are allowed. And CockroachDB, if you see, implements what they call a timestamp cache, which essentially records every read request in that cache, so any write happening after that read is guaranteed to have its timestamp pushed past that read. But let's look at this.
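The timestamp cache he attributes to CockroachDB can be sketched like this — a simplified, single-node illustration (the real one is a bounded, interval-based structure; names here are mine) of how a later write gets pushed above any earlier read of the same key:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of a CockroachDB-style timestamp cache: remember the highest
// read timestamp per key, and push any later write above it, so the
// answer already given to a reader can never change.
class TimestampCache {
    private final Map<String, Long> maxReadTs = new HashMap<>();

    synchronized void recordRead(String key, long ts) {
        maxReadTs.merge(key, ts, Math::max);
    }

    // A write proposed at proposedTs must land strictly after the
    // latest read of that key.
    synchronized long pushWriteTimestamp(String key, long proposedTs) {
        long readTs = maxReadTs.getOrDefault(key, Long.MIN_VALUE);
        return Math.max(proposedTs, readTs + 1);
    }
}
```

This is exactly the write-after-read guarantee from the discussion above, enforced without the reader holding any lock.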
I think there's an excellent talk — I'll put a link — Eric Brewer has mentioned this in his talk. He differentiates between freshness and consistency, and I think that's a great distinction. So basically, if I want to read without a lock, then I'll always do T minus five, or T minus some duration. That is always guaranteed to give you a consistent read, but not a fresh read. But not a fresh read, exactly. And if you want a fresh read, the only thing you can do is actually ask the server what its latest write is and read at that, or you just wait till the latest write happens — I mean, if you know that you want to read at a particular timestamp, you wait till the server actually has writes at that timestamp — and Spanner does both. So, the same kind of key-value store that we had, which was transactional: now if we want to implement versioning to allow lock-free reads, it's implemented like this. Your storage, which was a simple hash map, you can now imagine to be a ConcurrentSkipListMap. And the key that you see here is not just a string key but a versioned key. So this is the actual key you are writing, and it will always be accompanied by a version. If you implement it with a simple Lamport-clock kind of thing, you can imagine that on every put you increment this version, so that newer versions get assigned to every put. Now, one of the tricky things when you mix this with two-phase commit is this: you increment it, let's say, with every put, but because you're actually storing values across servers in a two-phase commit, which particular version number do you assign? Because when you say atomic commit — in our example there are two, but all the puts, all the values stored as part of that atomic commit — they need to have the same version number.
And typically this fits very well with two-phase commit, where you have a prepare phase, which allows you to respond to a coordinator with your choices. In this case, when you have a versioned key-value store: whenever your coordinator sends you a prepare, you do exactly the same things you did before — you collect your pending updates, you acquire locks, and then write to the write-ahead log to make sure it's failure-safe. And then you respond to the coordinator with the version number that you can assign to this particular write. This version number — again, I use just a single integer, but you can imagine it to be a timestamp. It can be a hybrid timestamp, or, in the case of Spanner, it's a TrueTime-given timestamp — TT.now().latest, the latest time value at this prepare. Now what the coordinator does, typically at the end of the prepare phase, is pick up the maximum timestamp, the maximum version value, across the set of servers — across the prepares — and when it commits, it sends this particular timestamp to all the servers, along with everything else it records. So when your key-value stores apply a particular update, or apply any pending operation, they will always do it with this commit timestamp, which was the maximum amongst all the prepares that happened. So your apply will look like this: essentially you are putting a versioned key with a commit timestamp. And this is exactly what happens in Spanner. When Spanner assigns a timestamp as a version to read-write transactions — it obviously holds locks first for all reads and writes happening in there. And one thing that's important here is what the locks make sure of: that there is no other write happening while you're reading. So it blocks any other writes that would be happening.
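The coordinator step he describes — commit everyone at the maximum of the versions proposed in the prepare phase — can be sketched in a few lines (a simplification; real coordinators also persist their decision):

```java
import java.util.List;

// Sketch: in the prepare phase each participant proposes the version it
// can assign; the coordinator commits everyone at the maximum proposal,
// so all writes of the transaction carry the same commit version.
class CommitCoordinator {
    static long chooseCommitVersion(List<Long> preparedVersions) {
        return preparedVersions.stream()
                .mapToLong(Long::longValue)
                .max()
                .orElseThrow(() -> new IllegalArgumentException("no participants"));
    }
}
```

Taking the maximum is what makes the choice safe: every participant proposed a version it is able to honor, and the maximum is at or above all of them.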
And then it picks up a timestamp across the set of servers — the coordinator picks the latest one — and all the writes in that transaction always happen with this commit timestamp. So in Spanner you can expect exactly this code, because this is the exact code that you will also find in CockroachDB. This is the exact code that you will also find in MongoDB, by the way — if you look at MongoDB's source code, you will find exactly the same thing: finding the max of the prepare timestamps and then assigning it to the commit. So this is what happens when you assign a version. Now the tricky part is when you assign a timestamp as the version, because we said it's just an integer in our case. When you pick up a timestamp, the challenge you have is the clock skew across the set of servers. And one of the other subtle things here: when we say Lamport timestamps, typically within a transaction, or within a set of interactions between a set of servers, you need to maintain some amount of causality. You do a write, then a read, then a write, then a write somewhere else, and if the third write happens after the second write, you need to track that causality with this timestamp — and with your system timestamp you obviously cannot do that. So the alternatives are: either you have a very strict upper bound on how much clock skew there can be, or you use something like a hybrid clock. All the open-source databases which are inspired by Spanner, like CockroachDB and YugabyteDB — and MongoDB, which is not inspired by Spanner but again has very similar elements — use a hybrid clock to track this kind of causality. What Spanner does — you can imagine this to be just a TrueTime now().latest kind of thing, if, let's say, you have a TrueTime API available to you.
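The hybrid clock he mentions can be sketched as a (wall time, counter) pair — a simplified illustration of the hybrid logical clock idea used by CockroachDB-style systems (method names and the long[] representation are mine):

```java
// Hybrid logical clock sketch: a (wallTime, counter) pair that keeps
// moving forward and tracks causality even when physical clocks are
// skewed across servers.
class HybridClock {
    private long wall;    // latest physical time seen so far
    private long counter; // breaks ties within the same wall tick

    // Local event: take physical time if it moved forward, else bump counter.
    synchronized long[] now(long physicalNow) {
        if (physicalNow > wall) { wall = physicalNow; counter = 0; }
        else counter++;
        return new long[]{wall, counter};
    }

    // On receiving a timestamp from another server: never go backwards,
    // and order this event after the remote one.
    synchronized long[] update(long physicalNow, long remoteWall, long remoteCounter) {
        long maxWall = Math.max(Math.max(wall, remoteWall), physicalNow);
        if (maxWall == wall && maxWall == remoteWall) counter = Math.max(counter, remoteCounter) + 1;
        else if (maxWall == wall) counter++;
        else if (maxWall == remoteWall) counter = remoteCounter + 1;
        else counter = 0;
        wall = maxWall;
        return new long[]{wall, counter};
    }
}
```

The system timestamp alone cannot express "this write happened after that one on another server"; the counter and the update-on-receive rule are what encode that causality.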
Any questions on this? Because we have one last thing I want to talk about, and that's how to assign a timestamp to the lock-free read — we briefly talked about it. The trick is that a read-only get comes with a timestamp, but you don't know what timestamp to assign to it. Now, as Sripad said, you always have that puzzle, because one of the guarantees that stronger consistency gives you is that if you read a particular value at a particular timestamp, that will never change. That's like the snapshot-isolation guarantee: when I read at a particular timestamp, no other write will happen at that timestamp, and my response will never change across the set of servers. And one of the things you can do, because you know based on the key... One question, right. So this works at a lock level, or whichever way — but there is always going to be a possibility, because at the end, one way or the other, it's not an instantaneous operation. For example, in a transaction there are ten keys I'm updating; one may get updated before another, and it's just semantics over it, right? So although I started reading at a particular timestamp — what you're saying is, when I do a read without a lock, I'll never see that in-between state, although it exists? Yes, yes. But how? Yeah, so one of the things is — I'll just show one implementation — if you know that you are going to, let's say, this server for reading this key, you ask that server: what is the timestamp at which you have done your latest write? And then you use that timestamp to read, because then it's guaranteed that any other write that happens will always have a higher timestamp than this.
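The "ask the server first" step he shows can be sketched like this — a toy server-side bookkeeping class (names are illustrative) that answers the question "at what timestamp did you last write this key?" so the client can read as of exactly that timestamp:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: for a lock-free read of one key, first ask the server for the
// timestamp of its latest applied write on that key, then read as of that
// timestamp. Any write that commits later is guaranteed a higher one.
class KeyServer {
    private final Map<String, Long> latestWriteTs = new HashMap<>();

    synchronized void onWriteApplied(String key, long ts) {
        latestWriteTs.merge(key, ts, Math::max);
    }

    // Step 1 of the lock-free read protocol.
    synchronized long latestWriteTimestamp(String key) {
        return latestWriteTs.getOrDefault(key, 0L);
    }
}
```

Reading at that returned timestamp gives a stable snapshot without blocking writers: later writes simply land at higher timestamps, above the reader's view.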
There is one catch here, though, and we saw it: there can be in-flight transactions which are prepared but not committed. What you need to do then is, if there are any prepared transactions, wait for their result while doing the read. And that guarantees that you will see all the values that are expected at this particular timestamp. Does any of this track, for a particular timestamp version, whether it was initiated by a particular transaction and whether that transaction is complete — meaning there are no more keys coming beyond this? Yeah, it will check for prepared transactions, because the only danger is from in-flight prepared transactions, which can possibly still be adding more values at this timestamp. If there are no prepared transactions then it's not a problem, because anything that happens after this will happen at a higher timestamp. CockroachDB has a slightly different implementation. What it does is track every read and every write timestamp that happens, and it errors out on a read if that read is happening at a timestamp at which there is an in-flight write — and then that read is supposed to be retried. Retried, yeah. Right. So, just keeping in mind how much time is remaining — no, I think I want to end here, because I wanted to show that, code-structure-wise, this is what you can expect in Spanner. There are other things — maybe we can have a separate session for those — essentially how the replicated log works, how partitioning works, and how metadata is stored with a consistent core. It won't fit in this session. Okay.
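The prepared-but-not-committed catch can be sketched as a safe-read check — a simplified illustration (real systems track this per key range; names here are mine) of when a reader at timestamp T must wait:

```java
import java.util.concurrent.ConcurrentSkipListSet;

// Sketch: a read at timestamp T is only safe once no transaction is still
// prepared (but uncommitted) at a timestamp <= T, because such a
// transaction could still apply values at or below T.
class SafeReadPoint {
    private final ConcurrentSkipListSet<Long> preparedTs = new ConcurrentSkipListSet<>();

    void onPrepare(long ts)       { preparedTs.add(ts); }
    void onCommitOrAbort(long ts) { preparedTs.remove(ts); }

    // True if the reader may proceed at readTs without waiting.
    boolean isSafeToReadAt(long readTs) {
        Long earliestPrepared = preparedTs.isEmpty() ? null : preparedTs.first();
        return earliestPrepared == null || earliestPrepared > readTs;
    }
}
```

A reader would poll (or block on) this check before serving the snapshot; once the in-flight prepares at or below its timestamp resolve, the snapshot at that timestamp is complete and final.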
Yeah, but the crux, again, of what I wanted to highlight: the Spanner paper is unique, for sure, with its TrueTime and other things, but for a lot of the implementation details you will find very similar implementations in practically any storage system that needs to be distributed and provide some consistency guarantee. You will have the same kinds of problems solved in slightly different, but similar enough, ways. So I'm open to questions; I think I will stop here for today. Great. Hey, I had a couple of questions. One is: is app development different when you're using these kinds of databases, given that there is a chance of failures — for example, with batching on the client side before sending the writes to a server, we might have optimistic-concurrency failures in a sense, right? Is it very different? I think app development shouldn't be different. You just need to be aware, because things can go wrong when it's this distributed. But the app-to-server interface, if you see, is a SQL interface — in Spanner it's not full SQL, it's partial SQL. A lot of these databases you will see giving either a Postgres-compatible API or a MySQL-compatible API. So you've got to assume, essentially, that you're working with an optimistic-concurrency system and always attempt retries, I guess. A lot of those retries, I assume, will be part of the client library that's given to you. Okay. It's hidden from you, but it's good to be aware of all of these things when things go wrong. But that's not always possible, right? If the underlying value changes and your business logic depends on it — the price of a product changes — there's no retry which will fix it automatically; at the client side you'll have to have logic which retries, or the customer has to kind of deal with it. Yeah, that's true.
Yeah, but you get the value of this kind of thing, right? For example, one of the things that startled me is when they say explicitly that you can ask for a read-only transaction. Right. That kind of facility — snapshot isolation — is very difficult; even in Postgres it's very difficult to express those semantics, right? What do you mean, why is it? So, if I want to get a repeatable read in Postgres — yeah, I mean you need to have versioning of some sort. Yeah, and that has to be part of your data structures and not supported by the database: "give me this transaction with this read so that I'll always read the same thing; I'm not interested in fresh data." But that's the basis of what multi-version concurrency control is, right? That is how it works. In theory, yes, but in practice it's very difficult to express that in your SQL. I guess I'm not understanding why. So, for example, if I want to say, within a session, I want to read a particular value always — like select something — and once it's read, always give me that value: getting that snapshot isolation, without in-between consistent reads, unless I use a transaction. I thought that is the basis of MVCC — essentially the client-side transaction has the same transaction ID and it reads only the rows that are visible to it. Yeah, yeah, but I think what Sripad is saying is right: it has to be part of the core implementation, and it's beyond SQL to some extent, because that guarantee — that when you're reading at a particular version, no other conflicting write can interfere — is a core part of the way you manage the view from your client to your database. Let me rephrase, actually. So, in SQL, once I start a transaction, it's a locking transaction at a global level.
I cannot lock at a particular version of the data set, because it will always have fresh consistency in mind, right? I don't think that's true. If you have a repeatable-read or a serializable isolation level, the database guarantees that you will see the version of the data that is there at the start of the transaction. Any commits that happen afterwards are not visible to that transaction. That is the guarantee that repeatable read gives. Repeatable read does that, but what I mean to say is that setting these things up is also not that easy — you have to actually set up isolation levels; you have to select those isolation levels when you are making those transactions. In most of those databases something like repeatable read is the default. I think the challenge is when you actually move to multiple servers. When you're working with a single-master database, the version or the timestamp at which you read the data is kind of implicit as part of the transaction. But the TrueTime infrastructure and all is needed to provide that guarantee when you're reading across multiple masters — that was my understanding. And Yogi, going back to your question: why do you think the application, at least for Spanner, has to treat it as optimistic concurrency? With read-write transactions it's almost pessimistic, right? Yes, it is pessimistic with read-write. Yeah, so if you're doing a read-write lock explicitly, then it's effectively like select for update, so in that case, no, I don't think that's an issue. Yeah. So are you saying that all transactions in Spanner are by default read-write? No, no, they're not. You can mark read-only transactions separately, and they're handled separately. Yeah. So it is a choice given to the application developer. Right.
He can read data at a consistent point in time — all the data he reads comes back consistent, but it might not be the latest — or he can do a read-write transaction where he's updating. Yeah. The reason I was asking is: if you think about just a single-master database, and you have a table which is highly contended, where some rows are accessed and written by multiple transactions — in that case you have to make the application logic aware of that fact, and you have to attempt retries at the application level, because the value might have changed and the business logic needs to recalculate the stuff. And so that's why you have to explicitly build for optimistic-concurrency failures, and either bubble that up to the client and have the client retry, or have explicit logic at the application level to manage that and deal with it within some bounds. So providing support for read-write transactions, which is kind of locking and all that, is trying to take that away — right, that's the pessimistic-locking thing, so that goes away — so essentially you're trading off, sort of, availability for consistency in a certain sense. So, compare-and-swap, basically — I didn't see any compare-and-swap kind of thing in this code, because that's also one way to do lock-free writes. What you're saying is the optimistic one, yeah: you track the version that you're going to write to, and if there has been an increment to that version, you fail that write. Yeah, but not in the case of Spanner. That happens, I think, in CockroachDB to some extent. But they are also moving, as I understand, to at least some amount of lock-based implementation, because in their experience as well, when there is contention there are a lot more retries with optimistic concurrency, so they prefer locks in those cases.
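The optimistic, compare-and-swap style write described here can be sketched in a few lines — a toy single-cell version (names are mine) where a write succeeds only if the version has not moved since the caller read it:

```java
// Sketch of optimistic, compare-and-swap style writes: the client reads a
// (value, version) pair, and the write applies only if the version is
// unchanged in the meantime; otherwise the caller must retry.
class OptimisticCell {
    private String value;
    private long version = 0;

    synchronized long readVersion() { return version; }
    synchronized String readValue() { return value; }

    // Returns true if the write applied, false if someone else got in first.
    synchronized boolean compareAndPut(long expectedVersion, String newValue) {
        if (version != expectedVersion) return false;
        value = newValue;
        version++;
        return true;
    }
}
```

This is the scheme that degrades under contention — every conflict surfaces as a failed write and a retry — which matches the observation that lock-based approaches fare better on hot rows.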
So the other question, sorry, if I could. The one thing which was not clear to me: TrueTime and this Spanner implementation at Google depend a lot on the underlying network infrastructure that Google has custom-built, right? They've got private networks which are extremely low-latency, highly available networks — that's actually a key design assumption; the Paxos implementation assumes that the nodes are highly available. So given that constraint, I'm curious how CockroachDB and the others working on the open-source side handle this. Yeah. So the crux of that network latency and TrueTime, I think, is to be able to add waits, which give you things like external consistency — because even with TrueTime, there is an uncertainty window that the API gives you. I have this sample code — I don't have a unit test for it, but you can expect the code to look like this: when you commit a particular transaction at a particular timestamp, I can say, if this is my commit timestamp, I wait — and it's essentially just a thread sleep; I don't do anything. This is happening at the coordinator, you can see: I don't send this commit timestamp to the servers till all the servers in the cluster, across all regions, have actually passed this commit timestamp. And in your typical cloud — I'm going by the CockroachDB configuration — your clock skew can be of the order of 200 milliseconds across the set of servers, so essentially, if you need to do a commit wait...
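The commit wait he sketches can be illustrated like this — a minimal sketch assuming a TrueTime-like interface of my own invention (`TrueTimeish` with an `[earliest, latest]` uncertainty interval; this is not Spanner's real API), with a spin wait standing in for the sleep:

```java
// Commit-wait sketch: with a TrueTime-like API returning an uncertainty
// interval [earliest, latest], the coordinator holds back the commit
// until every clock in the cluster must already be past the commit
// timestamp, which is what buys external consistency.
interface TrueTimeish {
    long earliest(); // no clock in the cluster is before this
    long latest();   // no clock in the cluster is after this
}

class CommitWait {
    static void waitUntilPassed(long commitTs, TrueTimeish tt) {
        // Only announce the commit once even the slowest clock passed it.
        while (tt.earliest() <= commitTs) {
            Thread.onSpinWait(); // stand-in for the real thread sleep
        }
    }
}
```

The wait is bounded by the uncertainty window, which is why the width of that window (milliseconds at Google, hundreds of milliseconds in a typical cloud) decides whether commit wait is practical at all.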
...it will be 200 milliseconds per commit. And what TrueTime gives you is the guarantee that — at least in 2012, when the paper was published — their clock skew in the worst case was 10 milliseconds, and they were working on getting it below 1 millisecond. So essentially, any wait of this kind that they have in their implementation is practically doable. CockroachDB doesn't at present do that — CockroachDB and the other open-source implementations don't make sure that whenever they do a particular write at a particular time, all the servers in the cluster are actually past that timestamp. They don't do that for exactly this reason: you practically can't get clocks below a particular offset value. Actually, I think the paper says 200 microseconds, not milliseconds — I think the 99th percentile is within 10 milliseconds. But it's very low, and that's what is very interesting — yet all these open-source implementations don't have this line. I think CockroachDB did, but as an experimental implementation. So, you know, with PTP — at some point, yes, that actually gets you very close to that kind of accuracy, at least in a single data center. Yeah, but in the case of Google they're saying it's across the world — that's what they needed; even in a single data center it was difficult earlier on. So do you know anybody who's actually used Spanner or CockroachDB in anger, like in production systems? No. The only company I know is Rubrik — basically their core product used CockroachDB for a while, but I'm not sure whether they were migrating away from it or not. At least till three years back they were using it in one of their core systems — JJ is not here, but he had worked on it; he's a common colleague. So yeah. Oh, interesting.
Yeah, because I can imagine that there are only a few use cases for this kind of globally distributed setup, particularly with things like YugabyteDB, TiDB, and CockroachDB. I don't know how production-ready they are in terms of maturity, so I think people are cautious when choosing these. Even without the globally distributed aspect, though — just having a single-data-center distributed setup that gives you transactional guarantees — yeah, it is a useful thing. It is, absolutely, and I think practically everyone is doing that nowadays; even some in-memory data-grid products like Apache Ignite and GemFire give you two-phase commit and this kind of atomic commit. So it is definitely useful — and Kafka has it now, so you can imagine. We're almost at time. Yeah, yeah. Thanks so much for taking the time to go through this complicated, and at the same time interesting, paper. And thanks to all the participants for attending and asking a lot of questions as well. Thanks, everyone. Yeah, thank you. Thanks. Thanks so much. Thank you for organizing. Thanks. Thank you, folks. See you.