Hi everyone, I'm Gaurav. I work at VMware on a data center observability system called Network Insight. In this talk, I'll discuss some of the specific challenges we encountered while building a configuration store layer for this system, and our approach to addressing them.

Just a bit of context. The configuration store layer maintains time series for a large number of objects. Each time series can be thought of as a sequence of states or nodes, where each node holds the state of that object for a particular associated time span. The exact details of this layer are not so relevant for this talk, so I'll focus on three specific aspects of it that should be more generically applicable to many other layers people are building. First, this layer integrates with other subsystems by providing a change feed for every modification happening within the store. Second, we want this layer to provide low-latency point access and mutations, roughly on the order of 5 to 10 milliseconds, due to the nature of the existing guarantees we provide out of this layer. And finally, we should be able to read and write very large objects into the store. We have seen records of up to 100 megabytes in size, and in the future this could go up significantly, almost up to a GB or so. Let's take each of these aspects, which I've bolded, one at a time.

First is the change log. Typically, change logs are used to provide an ordered feed of all the updates happening in the primary store to other subsystems, which in turn react to these changes by taking some specific action. For instance, an indexing system could consume the change log to make the updates searchable, or a caching system could use this feed to keep itself synchronized with the ongoing modifications. The structure of change log keys in FDB roughly follows the pattern shown here: a subspace prefix to isolate the logs, followed by a versionstamp to guarantee uniqueness and strong ordering, with an associated value that contains the metadata about the change itself.

But this design suffers from both read and write hotspots. By design, all the logs are appended at the end of this keyspace, which gets hammered by all the incoming mutations. If you have two or three storage servers holding that shard, they receive all the mutations. In addition, those storage servers have the extra work of splitting the shard as it grows in size and shipping out the older splits to keep the data volume balanced. From the read point of view, this is also the only shard getting all the read requests, which consumes a lot of its CPU cycles. So it is quite a saturation-prone design.

This is a graph we plotted from one of our production systems. We were able to get it by using a feature called locality info that FDB provides, which gives us the range boundaries hosted on each storage server. We use this information in our layer code base to tag each write and emit telemetry for it, which tells us which storage servers each written key lands on. In this case we have a replication factor of 2, and we can see that each write goes to exactly two storage servers. They overlap so precisely that you can see only one color, and for a long duration of time those are the only two storage servers getting all the mutations.
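To make that measurement concrete, here is a minimal sketch of how such per-write telemetry can be gathered with the locality API, assuming the Python bindings; the function name and the way the results would be aggregated into the graph are illustrative, not our exact code.

```python
import fdb

fdb.api_version(620)
db = fdb.open()

@fdb.transactional
def addresses_for_key(tr, key):
    # Locality API: which storage servers currently host the shard containing `key`.
    # Tagging each change log write with this lets us chart where mutations land.
    # (fdb.locality.get_boundary_keys(db, begin, end) similarly exposes shard boundaries.)
    return fdb.locality.get_addresses_for_key(tr, key).wait()

# Example: record the servers behind a just-written change log key.
# servers = addresses_for_key(db, some_changelog_key)
```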
Based on some internal logic, FDB keeps changing the active storage servers for that shard every four to five hours or so, but that's not very helpful in our case: by that time, the storage servers are already saturated and throttling back transactions. What we want is the ability to switch storage servers faster. If we could switch much more quickly, the writes would keep jumping from one storage server to another, and storage servers can accommodate a small burst of mutations because they have about 1.5 GB of buffer space they can use.

To get this fast switching of storage servers, what if we could put some kind of prefix in front of our change log keys? The prefix needs certain properties. It needs to be non-contiguous; otherwise we wouldn't be solving anything. It needs to be unique; otherwise we wouldn't be able to read all the change logs deterministically at read time. And whatever logic generates these prefixes has to be repeatable at read time; otherwise we wouldn't be able to find the data we wrote at write time. If we had such a key, we could use it in the place I've called "bucket", and if the bucket prefix satisfies all these properties, we can achieve the desired behavior.

So we use a simple bucketing function without any bookkeeping or overhead. We take the read version of the transaction, mask off its n lower bits, where n depends on how fast you want the bucket prefix to switch, and then reverse the bits to give an almost random, but still deterministic, distribution that can be repeated at read time. If you use this function, or any similar function, what you get is a good, even distribution and fast switching of the logs among storage servers.

But this has an issue related to ordering. I mentioned earlier that these buckets are based on the read version of the transaction, whereas the changes themselves are ordered by the commit version of the transaction. Consider the example given here: TX1 starts before TX2 and ends after TX2. Looking at the change itself, the commit version of TX1 is higher than TX2's, so its change should come after TX2's. But the read version of TX1 is before TX2's read version, so the bucket derived from TX1's read version could sort before the bucket derived from TX2's, and the change logs could end up in the reverse order. We don't want that; we want the change logs in the same order as the changes themselves.

So we put an additional constraint on these transactions: an invariant that the bucket you are writing your change log to is, in some sense, clean, and that you also dirty the previous bucket, so that there are no out-of-order change log writes. What we've done here is that the first transaction dirties bucket zero, the bucket just before bucket one, by putting a write conflict there, and expects its own bucket, bucket one, to be clean by putting a read conflict on it. Similarly, transaction two dirties the bucket before it by putting a write conflict on B1, and expects its own bucket to be clean by putting a read conflict on B2.
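Here is a minimal sketch of that bucketing function and ordering invariant, assuming the Python bindings. The subspace layout, the bit-width constants, and the per-bucket marker keys used for the conflict ranges are illustrative assumptions, not our exact schema.

```python
import fdb
import fdb.tuple

fdb.api_version(620)

N_LOW_BITS = 23   # assumption: bucket switches every ~2^23 versions (a few seconds)
WIDTH = 40        # assumption: how many bits of the coarse version we bit-reverse

changelog = fdb.Subspace(('changelog',))   # illustrative prefix

def reverse_bits(x, width):
    out = 0
    for _ in range(width):
        out = (out << 1) | (x & 1)
        x >>= 1
    return out

def bucket_for(read_version):
    # Mask off the n fastest-changing bits, then bit-reverse the rest so that
    # consecutive time windows map to widely separated, but deterministic, prefixes.
    return reverse_bits(read_version >> N_LOW_BITS, WIDTH)

@fdb.transactional
def append_change(tr, metadata_bytes):
    rv = tr.get_read_version().wait()
    bucket = bucket_for(rv)
    prev_bucket = bucket_for(max(rv - (1 << N_LOW_BITS), 0))   # previous time window

    # Ordering invariant: my own bucket must still be "clean" (read conflict on its
    # marker), and I "dirty" the previous bucket (write conflict) so that any
    # straggler still targeting it conflicts and retries with a later bucket.
    tr.add_read_conflict_key(changelog['marker'][bucket].key())
    tr.add_write_conflict_key(changelog['marker'][prev_bucket].key())

    # The change itself: bucket prefix + versionstamp, unique and commit-ordered
    # within its bucket.
    key = changelog[bucket].pack_with_versionstamp((fdb.tuple.Versionstamp(),))
    tr.set_versionstamped_key(key, metadata_bytes)
```

At read time, a consumer can recompute the same bucket prefixes for the window it cares about and merge the per-bucket streams, since each stream is still ordered by versionstamp.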
The red line I've drawn creates a conflict between them: after TX2 has committed, when TX1 tries to commit, there is a conflict due to that red line, and TX1 gets retried with a later read version, and therefore a later bucket, and everything stays in order. If we had some way to apply user-defined functions to versionstamps on the server side, we wouldn't need any of this, but unfortunately, at the moment, FDB doesn't provide that kind of functionality. In practice we can expect to see some conflicts due to this, but on the workloads we experimented with it hardly matters; there are very few conflicts, almost negligible.

All right. This is what the scenario looked like after this change, and I don't know if it is visible. If you recall the earlier graph, only two servers were active for long durations. Now they are assigned almost randomly. The first graph shows the mutation rates for all the servers together, and in the remaining three graphs I've highlighted one server in each to show how quickly they switch. It's pretty evenly balanced out, and it changes roughly every few minutes.

The second problem I want to talk about, which again is very generically applicable, is minimizing the latency of transactions. Typically a transaction has four phases: GRV, to get the read version; the reads themselves; the writes; and then the commit. This is the range of latencies we typically see in our systems. Out of these, GRV is something that is optional; it can be removed at some cost. Every transaction has a read version and a write version, depending on whether it does at least one read and at least one write. If we could cache these versions and then reuse them, we could eliminate the GRV call. I'll go into the details on the next slide, but using this we were able to shave about 25% off our transaction latencies, at the cost of some transactions, the read-only ones, seeing slightly stale but still monotonic data. Monotonic in the sense that, because we cache the read and commit versions in a process, a transaction never sees data older than the last transaction that happened; it will see data at least as new as the last transaction.

We have to be careful here, because GRV is not just the mechanism for handing out the read version. It is also the admission control mechanism the proxy uses to throttle incoming transactions: the proxy slows down its GRV responses when the cluster is under load. If we aggressively bypass the GRV call, we would defeat that mechanism.

This is roughly the structure of the simplified version we use. We route all transactions through this run block, shown here as pseudocode. Callers get to choose whether to provide a read version up front. If they provide one, we set it on the transaction before running it; otherwise we refresh the read version in the green block. After applying the mutations and committing the transaction, the commit version is available as well, so we pull it out and use it as the cached value. If you look at the refresh-read-version block, in addition to making the GRV call explicitly, we also measure the latency of the GRV call.
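Here is a minimal sketch of such a run block, assuming the Python bindings. The threshold value, the cache structure, and the error handling are illustrative; the real code has more options, for example for write-only transactions.

```python
import time
import fdb

fdb.api_version(620)

GRV_LATENCY_THRESHOLD_MS = 10   # assumption: somewhere around 5-15 ms, as in the talk

class CachedVersionRunner:
    """Process-wide cache of the last read/commit version seen (sketch)."""

    def __init__(self, db):
        self.db = db
        self.version = None       # last read or committed version observed
        self.use_cache = True     # flips off when the proxy looks loaded

    def run(self, body):
        tr = self.db.create_transaction()
        while True:
            try:
                if self.use_cache and self.version is not None:
                    # Reuse the cached version: skips the GRV round trip; reads may be
                    # slightly stale but never older than the last transaction we ran.
                    tr.set_read_version(self.version)
                else:
                    # Explicit GRV, also timed: GRV latency doubles as the proxy's
                    # admission-control signal, so a slow GRV means stop bypassing it.
                    start = time.monotonic()
                    rv = tr.get_read_version().wait()
                    grv_ms = (time.monotonic() - start) * 1000.0
                    self.use_cache = grv_ms <= GRV_LATENCY_THRESHOLD_MS
                    self.version = max(self.version or 0, rv)

                result = body(tr)
                tr.commit().wait()
                cv = tr.get_committed_version()
                if cv > 0:   # read-only transactions report no commit version
                    self.version = max(self.version or 0, cv)
                return result
            except fdb.FDBError as e:
                if e.code == 1007:   # transaction_too_old: cached version was too stale
                    self.version = None
                tr.on_error(e).wait()
```

In this sketch the cache is a single process-wide value; callers that already hold a suitable read version could equally pass it in, as described above.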
That latency serves as an indicator of whether the proxy wants to throttle incoming transactions. A very simplified approach is: if the latency is greater than some threshold T, consider that the proxy is not willing to accept too many transactions, don't use the cached version, and go to the proxy directly, letting it apply its throttling. T could be something like 5, 10, or 15 milliseconds. If you are under it, you're good; if you go over it, you had better go and ask the proxy and not worry about these optimizations. Note that both of these calls, GRV as well as getting the commit version, already happen implicitly within every transaction, so we are not adding any extra overhead other than making them explicit and caching the values. For write-only transactions, the GRV call doesn't happen, and we have some more options in our code base to avoid the explicit GRV call in that case.

Finally, I want to talk about the third aspect: handling large values. We get these large values from the device configurations we collect from data centers; they come in XML, JSON, protobuf, et cetera. Examples are large switches and firewalls that have many ports and rules, and computed network topologies that have hundreds of thousands of paths in them and tend to become large. Now, FDB has transaction limits on both duration and size: it doesn't allow a transaction to span more than five seconds, and even though the documentation says 10 MB is the upper bound for a transaction, the recommended size, if you read the forums, is closer to 1 megabyte. But we still want consistent writes with atomic visibility, and consistent reads without any stale or partial data. We want to write these large values as if FDB allowed them to be written in one transaction, even though, due to the limits, we cannot do that.

So we follow a multi-step protocol to achieve it. It's pretty standard and simple, but useful. We start by writing a temporary garbage collection row and giving it ownership of a chunk pointer. A versionstamp can again be used as the chunk pointer, but it could be anything, a UUID or whatever you want. One thing to note: each of the colors represents an individual transaction, so the diagram will probably make more sense with that in mind. Once we have the chunk pointer, we break up our data, which could be megabytes or gigabytes in size, and write it against the chunk pointer in multiple transactions, possibly in parallel, depending on how you want to write it. Once this is complete, we shift ownership of the chunk pointer from the GC record to a master record, which is our main lookup mechanism, and delete the GC row in the same transaction. Once this is done, ownership is fixed with the master record and visibility is atomic. If there was any failure before we were able to run that blue transaction, any partial data would remain under the ownership of the GC row, which is eventually cleaned up by a background task. When we delete these records, instead of deleting the data inline, we shift ownership of the chunk pointer back from the master record to a new GC row and clear the master row, all in the same transaction.
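Here is a minimal sketch of the write and delete sides of this protocol, assuming the Python bindings. The subspace names, chunk size, and the choice of a UUID plus a wall-clock timestamp for the GC row are illustrative assumptions; as noted above, a versionstamp would work just as well for the chunk pointer.

```python
import time
import uuid
import fdb
import fdb.tuple

fdb.api_version(620)

CHUNK_SIZE = 100_000   # bytes per chunk; illustrative, kept well under the transaction limits

# Illustrative subspaces; the real layout in our layer differs.
gc_rows = fdb.Subspace(('gc',))       # one row per in-flight or orphaned chunk pointer
master  = fdb.Subspace(('master',))   # record key -> chunk pointer (the main lookup)
chunks  = fdb.Subspace(('chunks',))   # (chunk pointer, index) -> chunk bytes

def write_large_value(db, record_key, blob):
    chunk_ptr = uuid.uuid4().bytes     # a versionstamp could serve as the pointer instead

    @fdb.transactional
    def claim(tr):
        # Step 1: a temporary GC row owns the chunk pointer until the write completes.
        tr[gc_rows.pack((chunk_ptr,))] = fdb.tuple.pack((int(time.time()),))

    @fdb.transactional
    def write_chunk(tr, index, data):
        # Step 2: chunks are written against the pointer, each in its own transaction.
        tr[chunks.pack((chunk_ptr, index))] = data

    @fdb.transactional
    def publish(tr):
        # Step 3: flip ownership to the master record and drop the GC row atomically;
        # only now does the value become visible through the main lookup path.
        tr[master.pack((record_key,))] = chunk_ptr
        del tr[gc_rows.pack((chunk_ptr,))]

    claim(db)
    for i in range(0, len(blob), CHUNK_SIZE):   # could also run in parallel
        write_chunk(db, i // CHUNK_SIZE, blob[i:i + CHUNK_SIZE])
    publish(db)

@fdb.transactional
def delete_large_value(tr, record_key):
    # Delete path: hand the chunk pointer back to a new GC row and clear the master
    # row in one transaction; the chunks stay readable until the background sweep runs.
    chunk_ptr = tr[master.pack((record_key,))].wait()
    if chunk_ptr is not None:
        tr[gc_rows.pack((chunk_ptr,))] = fdb.tuple.pack((int(time.time()),))
        del tr[master.pack((record_key,))]
```

And a correspondingly minimal sketch of the background cleanup task, under the same illustrative layout; the grace period and the single-transaction scan are simplifications, and a real sweep would batch its work to stay within the transaction limits.

```python
GC_GRACE_SECONDS = 30 * 60   # only reclaim pointers whose GC row is at least ~30 minutes old

@fdb.transactional
def sweep_garbage(tr):
    cutoff = int(time.time()) - GC_GRACE_SECONDS
    for kv in tr.get_range_startswith(gc_rows.key()):
        (chunk_ptr,) = gc_rows.unpack(kv.key)
        (created_at,) = fdb.tuple.unpack(kv.value)
        if created_at < cutoff:
            # Range-delete every chunk owned by this stale pointer, then the GC row itself.
            tr.clear_range_startswith(chunks.pack((chunk_ptr,)))
            del tr[kv.key]
```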
Again, what we get out of this is that if there is a concurrent read happening for that record, it will not fail, because the data is still present; it only gets deleted some time later. The reason that concurrent read could otherwise fail is that this is a lot of data to read, and the reader might not be reading the entire record in a single transaction. It might be using multiple transactions, so it does not get MVCC guarantees across them. If we deleted the data inline, that reader could get partial or incomplete data; with this pattern, we don't run into that. The background cleanup just periodically scans for all the garbage rows older than, say, now minus 30 minutes or now minus one hour, and for each garbage row it clears the row itself and does a range delete of all the data under its chunk pointer. And that's about it. If you consider extensions of this, like updates to records, they are pretty simple, and this protocol can easily be made to handle them.

So just to recap, we have discussed some of the common patterns that we find ourselves running into many times when building layers. I've been listening to many talks since the morning, and there have been a lot of mentions of change feeds and latency and so on, so I think these are pretty generically applicable problems. We saw how we used some of the FDB constructs to address these issues. Specifically, for the change log, we made use of versionstamps, conflict ranges, and locality info to first find the problem and then devise a pattern for it. For reducing latency, we used the read and commit versions. And for handling large values, versionstamps and multi-row transactions.

Using these kinds of techniques, we were able to balance out our storage server queues pretty uniformly and remove the throttling that was happening because of them. We were able to reduce our transaction latencies by about 25% under a certain set of conditions where it was OK to get slightly stale reads. By the way, I think I skipped the part where the pattern we used for reducing latency only affects read-only transactions; transactions that involve writes never run into a consistency problem, because they will always conflict and be retried, so they are not sacrificing any guarantees. In theory they may run into slightly more conflicts, but again, we don't see that happening much on our workloads. And finally, we have a pattern where we really do not have any limit on the size of the record we can write into FoundationDB, so we don't have to worry about what that limit is or whether we are affected by it. So I think that's it. Yeah, thank you.