Carnegie Mellon vaccination database talks are made possible by OtterTune. Learn how to automatically optimize your MySQL and PostgreSQL configurations at ottertune.com. And by the Stephen Moy Foundation for Keeping It Real; find out how best to keep it real at stevenmoyfoundation.org. Hi guys, let's get started. So, we're happy today to have Badrish from Microsoft Research. He is a Senior Principal Researcher in the Data Systems Group, which is a new group at MSR. You've been there for a while now, right? Since 2008. Okay. Prior to this, he also worked on another famous stream processing system called Trill. And he's here to talk about FASTER, the new key-value store he's been building at Microsoft for a couple of years now. So, Badrish has a PhD from Duke University. I think you worked with Jun, right? That's right. Jun, yeah. So, Badrish can get to the talk, and as always, if you have questions, please unmute yourself, say who you are, say where you're coming from, and ask Badrish your questions. Feel free to do this at any time — interrupt whenever you want. We want this to be a conversation, so Badrish feels like he's not talking to himself. Okay. All right, Badrish, the floor is yours. Go for it. Thank you for being here. Hey, thanks, Andy. Thanks for inviting me here. I look forward to talking to all of you. As Andy said, happy to take any comments or questions in between as well. So yeah, I'm Badrish. I've been working at Microsoft for the last 13 years or so. For the first half of my career, I was working on stream processing with the Trill system, which is also open source. And over the last couple of years, we built this new key-value store called FASTER, which really started as a project to address the state management problems that we were seeing in Trill, the stream processing system that I worked on earlier. This is joint work with a bunch of amazing collaborators, both from academia and open source. In particular, I'd like to call out two interns, Guna Prasaad and Tianyu Li — Tianyu from MIT, and Guna, who was at UW but has now decided to go to WhatsApp to make systems better there. He basically put his PhD on hiatus to go there. Okay. Anyway. All right. So let's see. Next slide. Yeah. So in the modern cloud edge, what we notice is that there is a huge influx in the kinds of applications that are being written in this new environment. And these applications have a particular pattern of accessing and updating state. In particular, they access, update, and cache huge volumes of small, fine-grained objects. This could be billions of per-device counters, per-ad statistics in Bing, shared app state in serverless architectures, and so on. And often the data does not fit in memory, and it requires fast access and update durability at the same time. In addition to the state itself, we noticed there were other applications that required very simple abstractions like reliable logging and messaging as key building blocks. And these applications needed to work across a wide variety of environments, including hybrid cloud-edge environments as well. So what is FASTER? It's basically an open source library that allows you to accelerate this kind of object storage, indexing, and logging. It's high performance. It is implemented with thread scalability in mind on a single multi-core machine.
It is latch-free and based on a shared-memory design, and the primary version of FASTER is in C#, while we also have a port to C++, which is used in a couple of use cases. But I'll really focus on the C# one for now in this talk. The two subcomponents of the system are FasterLog, which is the underlying log abstraction of FASTER. This turns out to be independently useful, because it allows users to use this hybrid log abstraction — which I'll talk about in the next couple of slides — independently, as a persistent queue. And it's used, for example, in Azure Event Grid and other such services. FasterKV is basically the log with an index on top. And the index happens to be a high-performance, concurrent, latch-free hash index that sits on top. And the key abstraction here is this hybrid log, which basically integrates a standard read-copy-update style log with in-place updates at the tail of the log for performance. You can think of it as an integrated cache with the storage behind it. And it shapes the hot working set in memory using this technique. I'll talk a little bit more about this in the next couple of slides. A distinguishing feature of FASTER is its high performance. It is basically bottlenecked by the CPU cache coherence cost. So on a reasonably modern machine with multiple cores, you get around 150 million operations per second, which is the cache coherence bottleneck of those machines. In particular, you can exceed the throughput of pure in-memory systems when the working set happens to fit in memory. So in some sense, it's a larger-than-memory system that, if your working set happens to fit in memory, will perform really well, similar to a pure main-memory database. So the talk is outlined as follows. We'll start with the single-node version of FASTER and really focus on that. I'll talk about the architecture, some use cases, and give you a peek under the hood of how we pull off some of the smarts that the system offers. And then we'll go on to multi-node FASTER, where I'll spend a significantly smaller amount of time — this is work from the last year, I would say. And there are a couple of papers we have out this year on this topic, including the overall architecture and the distributed recoverability part, which will be appearing at VLDB and SIGMOD this year. All right, let's start with the system architecture. So FASTER basically evolved as a single-node library that you can embed into your application. And the way it thinks of memory is in two distinct segments. One is the in-memory portion that's shown at the bottom here, and then there's tiered storage. And by tiered storage, I mean any sequence of storage devices that are combined to form a unified abstraction. Now, this could include local SSD, it could include RDMA remote memory, it could include Azure premium storage or any of your favorite cloud storage, and so on. As long as they implement this abstraction of what we call the IDevice interface, they become storage providers for the FASTER library. In memory, you have a hash index, and this hash index is carefully optimized to be very performant and latch-free. For example, it has hash buckets that are sized at one cache line each, and there are some additional tag bits that are used inside these buckets to give you better hash resolution.
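Just to make that concrete, here is a rough sketch of such a cache-line-sized bucket. The eight-entry layout follows the FASTER paper's description, but the exact names and bit widths here are illustrative assumptions, not the library's actual definitions:

```csharp
using System.Runtime.InteropServices;

// Sketch of a cache-line-sized hash bucket (names and bit widths are
// illustrative assumptions, following the description in the FASTER paper).
[StructLayout(LayoutKind.Sequential, Size = 64)]
unsafe struct HashBucket
{
    public const int kEntries = 8;       // 8 x 8-byte entries = one 64-byte cache line
    public fixed long entries[kEntries]; // each entry: tentative bit | tag | log address
}

static class HashBucketEntry
{
    const int kAddressBits = 48;                        // logical address into the hybrid log
    const long kAddressMask = (1L << kAddressBits) - 1;

    public static long Address(long entry) => entry & kAddressMask;

    // Tag bits give extra hash resolution: a lookup can skip entries whose
    // tag does not match, without chasing the pointer into the log.
    public static int Tag(long entry) => (int)((entry >> kAddressBits) & 0x7FFF);

    // Top bit marks a "tentative" entry during the latch-free insert protocol.
    public static bool IsTentative(long entry) => entry < 0;
}
```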
And there are some interesting algorithms there on how you can keep this hash table updated in the presence of concurrency. I won't get into the details of that, but this is the first component here. The paper has all the details on how we actually pull that off. The other side, which is probably more interesting from an architectural standpoint, is the hybrid log. The hybrid log is basically our abstraction of storage. It's record-oriented storage, which is divided into a bunch of pages, as shown. And all the pages have a single unified address space, starting from zero, with the page numbers increasing. And there are specific points in this address space that we call out. For example, the begin address is the logical address of the beginning of the log. The head address is the first address in main memory, and the tail address is the tail of the log. In addition, there's something called a read-only address, which separates the in-memory portion into a read-only region and a mutable region. So essentially, the hybrid log allows records that are in the mutable region to be updated in place. And as records eventually migrate into the read-only region, they become ready to be pushed to secondary storage. So that part of the log is effectively immutable, and you start pushing those records out; you're not allowed to make any changes there. Any changes to records in this part of the log undergo what is called a read-copy-update: they come back to the tail. So if you think of the lifecycle of a record, it starts at the tail, makes its way up here while allowing in-place updates, and finally, once it gets into the immutable region, you can again bring it back to the tail if it becomes hot again. So in some sense, it captures the temporal locality of the hot records, and the in-memory portion acts as an integrated cache. And the hash index itself chains all records that collide onto the same (hash offset, tag) pair into a single linked list. So there could be collisions in this linked list, but the chain spans across them, with the index pointing to the latest record and pointers going from newer to older records. In addition, for read workloads, we can have a separate read cache. Think of it as just a smaller instance of the same hybrid log, which allows read records to be placed there for the purpose of supporting read-heavy workloads. Now, this architecture is really intuitive, but the challenges come in when there is multi-threading involved, because you want to keep this entire architecture correct in the presence of multiple threads that are writing to this data structure, even as records are being offloaded to secondary storage and so on. And we'll talk about how we pulled that off using the epoch protection framework shortly. For elements or entries in the read cache, with the traversal of the linked list — would the chain ever point to something that's in the read cache? Like, in your example, it only goes up to the hybrid log on disk, but there really could be anything in the read cache as well, right? Totally, yes, yes. Think of it as just something that gets spliced in between the hash index and the main log, so that if records are really hot, you can reach them with one cache miss, basically. So that's how you would chain it through. But then if there's an update to a colliding key, basically it would invalidate the read cache entry and then point to the tail, and so on. Got it, thanks.
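To make the address-space picture concrete before moving on, here is a minimal sketch (the field names are assumed for illustration) of how a record's logical address determines what an update is allowed to do:

```csharp
// Minimal sketch of the hybrid log's address regions; field names assumed.
enum Region { OnDisk, ReadOnly, Mutable }

class HybridLogAddressSpace
{
    public long BeginAddress;    // logical start of the log (grows after truncation)
    public long HeadAddress;     // first address resident in main memory
    public long ReadOnlyAddress; // boundary between read-only and mutable in memory
    public long TailAddress;     // next address to be allocated

    public Region Classify(long logicalAddress)
    {
        if (logicalAddress < HeadAddress)
            return Region.OnDisk;   // record must be fetched with an async read
        if (logicalAddress < ReadOnlyAddress)
            return Region.ReadOnly; // immutable: updates do read-copy-update to the tail
        return Region.Mutable;      // hot region: update in place
    }
}
```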
So all these threads are now hitting this index, right? So that's the overall architecture. And I think I already talked about the hybrid log regions: there's the read-only region, the mutable region, and the stable on-disk region, as mentioned. So again, it captures temporal locality, because one of the important things was that even if you have a billion objects you're indexing — for example, in Bing, say you're indexing user clicks — the users who are actively clicking on ads may be a small fraction of them, and you want to capture those hot records in the mutable region and get 100-plus million operations per second for those records. But it's fine for the colder records to sit somewhere on secondary storage and eventually get paged in on demand. So before getting into the technical details of how we pull off the concurrency and the multi-threaded nature, let me get into a little bit about the features, the use cases, and the performance characteristics of this architecture. So basically we offer latch-free operations: read, upsert, and delete. The semantics are as you would expect. There are reads; upsert basically inserts the record if it does not exist, or updates it if it does. And then of course deletes, which are handled either as in-place deletes in the mutable region or as tombstone records in the log if the record is not in the mutable region. There are no transactions, but there's an atomic read-modify-write operation, which is a fundamental API that FASTER provides. We call this RMW. It's a pretty powerful primitive for aggregation, and lots of our use cases actually use RMW as a building block. A very simple example would be maintaining per-user counters of how many times someone clicked on a particular ad. You would have an RMW operation that just increments the counter for that particular user. The way it's exposed in the API is quite straightforward, as this example shows. I don't know how much you can read from here, but basically you create a device — this tiered storage device — and you create an instance of the key-value store. And then the way you interact with FASTER is in the form of a session. So you construct a session to FASTER, and the session is nothing but a sequence of operations to it — a linearizable sequence of operations. These could be upserts, reads, and read-modify-writes. You can see in this example, I created a session, and then I perform an upsert followed by a read, and then a couple of read-modify-writes. And when I created the session, I provided a lambda here that defines the increment operator — the merge operation for a key. So given a key and an old value, how do I compute the new value when you do an RMW operation? So the RMW could be count = count + 1, or in this example, it is simply a + b — a summing kind of aggregate. So the two RMWs here will each add 10 to that particular key, so obviously the final output would be the old value plus 20, which is what's happening in this example. And in the simple example, I'm just upserting a value and making sure that you read what you upserted.
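Since the slide is hard to read, here is roughly what that example looks like in code. This is a sketch modeled on the FASTER C# samples; the exact types and signatures may differ across library versions:

```csharp
using System.Diagnostics;
using FASTER.core;

// Sketch modeled on the FASTER C# samples; signatures may vary by version.
using var device = Devices.CreateLogDevice("hlog.log");  // tiered storage device
using var store = new FasterKV<long, long>(
    1L << 20, new LogSettings { LogDevice = device });   // 2^20 hash buckets

// The lambda is the RMW merge operation: given old value a and input b,
// the new value is a + b (a summing aggregate).
using var session = store.For(new SimpleFunctions<long, long>((a, b) => a + b))
                         .NewSession<SimpleFunctions<long, long>>();

long key = 1, value = 100, input = 10, output = 0;
session.Upsert(ref key, ref value);   // insert if absent, else overwrite
session.Read(ref key, ref output);    // output == 100
session.RMW(ref key, ref input);      // atomically add 10
session.RMW(ref key, ref input);      // atomically add 10 again
session.Read(ref key, ref output);
Debug.Assert(output == value + 20);   // old value plus 20, as in the talk
```

Note that the session, not the store, is the unit of interaction — that is what makes the per-session recoverability guarantees discussed next possible.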
So the interesting thing here is that you could have many such sessions talking to FASTER, and we will talk about how we give you recoverability guarantees on a per-session basis, using the notion of prefix recoverability. And just to give you a heads up, the way it works is that out of these sequences of operations that you're issuing per session, there are prefixes that become persistent, and you're lazily notified that certain prefixes of this session are now persistent. And the system takes care of this guarantee internally, through a mechanism that I will talk about shortly. We also support async operation. So C# has this task framework — or, more generally, coroutines, which you may be familiar with from C++. So everything is async, the disk access is async, and we can easily saturate, for example, modern NVMe drives using this model. We use very asynchronous, overlapped I/O in the system, so that you can saturate the roughly 750,000 IOPS of a modern NVMe drive using around four to eight cores of a machine. So it is pretty optimized for the use case where your cache is backed by something like NVMe devices. And we can also dynamically grow the index size while operations are happening on the system. So again, the multi-threading is designed to be able to handle index growth as well. And we'll talk about the underpinnings of the system in the next couple of sections. Right, so I was talking about the IDevice abstraction, and that's actually a generalized abstraction: users can in fact write their own IDevice layers. For example, we have had people write devices for Azure storage — Azure page blobs in particular. And you can now mix and match them. For example, you could have a local SSD that tiers down into Azure storage, so that you can use your SSD for reads, but all the writes write through all the way down into cloud storage, so that that becomes your point of truth. You can also shard these devices, you can tier them, and so on. So you can essentially create these meta-devices that still expose the IDevice abstraction and work with FASTER unmodified. And it takes care of all the sharding and the tiering under the hood. I mentioned a little bit about incremental checkpointing and recovery, and we support that. In this example, we simply take the first 20% of the log and compact it. So you basically check for liveness, and all live records are written back to the tail. This is standard log compaction, and it is supported. And then you can take checkpoints at any time — these are asynchronous checkpoints — and you can also expire old records and so on. So basically it provides you all the fundamental primitives to implement a key-value store, based either on explicit deletion of keys or on expiry-based deletion of keys. One of the new features that we are adding this year is secondary indexing. As you noticed, we only support point lookups, point updates, and read-modify-writes, but we are also adding the ability to build secondary indexes over the FASTER log. So for example, you could have a range index that sits to the side and allows you to run range queries on either keys or values. Or you could do what is called subset hash indexing, which is an interesting way of collecting subsets of data and making them very quickly accessible — based not on the key itself, but on some properties in your value. So for example, I could have a secondary index that indexes all the value.pet fields, and then somebody could come in and ask: give me all records with pet equal to dog. The secondary index would hash-chain all those records, so that you don't have to do a full scan to get all the records with pet equal to dog.
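To illustrate the idea — and this is a conceptual sketch only, not the actual FASTER secondary-index API — a subset hash index essentially maintains, per property value, a chain of log record addresses:

```csharp
using System;
using System.Collections.Generic;

// Conceptual sketch only -- not the actual FASTER API. A subset hash index
// maps a property extracted from the value (e.g. value.pet) to a chain of
// log record addresses, so a query like pet == "dog" follows one chain
// instead of scanning the whole log.
class SubsetHashIndexSketch<TProp> where TProp : notnull
{
    readonly Dictionary<TProp, List<long>> chains = new();

    // Invoked as records are appended to the log at a given logical address.
    // (The real design threads the chain through the log records themselves.)
    public void OnInsert(TProp property, long logicalAddress)
    {
        if (!chains.TryGetValue(property, out var chain))
            chains[property] = chain = new List<long>();
        chain.Add(logicalAddress);
    }

    // Addresses of all records whose value has the given property,
    // e.g. Query("dog") for value.pet == "dog".
    public IEnumerable<long> Query(TProp property) =>
        chains.TryGetValue(property, out var chain)
            ? (IEnumerable<long>)chain
            : Array.Empty<long>();
}
```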
And it turns out that this is pivotal for Durable Functions, which is our serverless offering, where we allow users to write queries on their state. So these are secondary queries that you can now run using this infrastructure. I actually mentioned a couple of use cases already, but just to summarize some of the main ones: this includes the Trill use case, where Trill is used as the backend for Azure Stream Analytics, and they're using FasterKV to externalize reference data. I mentioned this Bing use case earlier, and that's one of the scenarios we're looking at, where Azure Stream Analytics would process these reference data lookups and things like that in a streaming pipeline. Event Grid is another use case of FASTER, where they're using not the KV, but just the log, as a routing and notification service on the edge. And here the fact that FASTER is written in C# makes a lot of difference, because C# works anywhere the .NET framework can run. This could run on Raspberry Pis, it can run on Linux devices, it can run on honking multi-core servers — all of the above. So it can run on your private cloud, so that you don't have to deploy a service on Azure to use it. Since it's a library, Azure Event Grid on Edge, for example, is an offering that allows you to do notifications and routing in private cloud settings. And this is where they find it very useful to use FASTER as a building block. With serverless, I think one of the interesting recent developments is that we have a new stateful serverless system from Azure called Durable Functions. Durable Functions has been around for a while, and it supports different state backends. For example, you could use Azure Tables as a backend, or you could use SQL Server as a backend for state. Some of us have built a new backend for Durable Functions called Netherite. Netherite is basically a backend for Durable Functions, along with a programming paradigm for serverless, which allows you to write C# or Python or other code in a way that is failure-agnostic. So you can write code using the async framework, and it will get parallelized and executed behind the scenes using the serverless infrastructure of Durable Functions and Azure Functions. And Netherite in particular uses both FasterKV, for keeping the function state, and FasterLog, for replayable messaging. And there are quite a few GitHub use cases. I won't get into the details, but we have a steadily growing community of people who are using both the FasterLog component and the KV component in various scenarios. And it also drives usage of cloud resources as well. I mentioned a little bit about scalability. Here's a very simple graph that shows where FASTER stands compared to some standard multi-threaded systems out there. For example, FASTER gives you linear scalability on one machine when the working set happens to fit in memory, as in this example.
And we compared it to the Intel TBB hash table. It turns out that while it does pretty well on one socket, it keeled over as you went on to run it on a multi-socket machine. And we'll get into the details of how we do our multi-threading in the next part of this talk. Masstree and RocksDB are not exactly apples-to-apples comparisons, because they also support range indexing — although now we actually do too, with secondary indexing — but the point here is that we are so much better that if your use cases really are a good fit for the paradigm, or the API that we offer, the performance benefits are definitely worth it. So now I'll give a brief peek under the hood of how FASTER works. I won't get into too many details, but I'll give you some tidbits of how the system works under the hood. The first piece of technology that is used behind the scenes is epoch protection. A lot of you are already familiar with it, but I'll give you a very brief primer and explain how we use this technology. The basic need for epoch protection is that we want to avoid any coordination between threads in the common case, where they can just hunker down and run in bursts of operations, while still having some mechanism to agree on shared system state. So very briefly, let's say there were four threads. There's a global epoch counter, which is the current epoch of the system, and this counter can be bumped by any thread in the system. So you can go from E to E+1 at any point. On the other hand, each thread keeps a thread-local counter, which is copied from E. It's the local version of the counter. So for example, you can see that each of these threads is independently refreshing its local counter value from the global current epoch, which is shown at the bottom. So they can be lagging behind, in some sense. An epoch C is said to be safe if all the thread-local epochs are greater than C. For example, these are the safe epochs: one becomes safe as soon as the final thread leaves one. So at this point, epoch one becomes safe, and so on. So you can see that the safe epochs are lagging behind the current epoch. And that's how the system is essentially lazily moving ahead. Now, how do we use these? The basic idea is that we have this notion of a trigger action. The idea is that you can associate a trigger, or function callback, with any epoch bump from, say, C to C+1. This trigger action will be executed later on in the future. It's kind of like a promise: I'm going to execute it when C becomes safe. So this very simple concept significantly simplifies synchronization in a multi-threaded system. For example, let's say I wanted to issue a function F when a shared status variable becomes active. It's as simple as saying: I'm going to set the shared status to active, bump the current epoch, and associate the bump with the trigger function F. And I know that F will eventually be executed when all the threads have seen the active status.
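Here is a minimal sketch of that epoch framework with trigger actions. The names are illustrative assumptions; FASTER's real implementation pads per-thread state to cache lines and drains actions more carefully:

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;

// Minimal sketch of epoch protection with trigger actions; illustrative
// names, no cache-line padding, and a simplistic drain list.
class EpochSketch
{
    long currentEpoch = 1;
    readonly long[] threadLocalEpochs; // one slot per registered thread
    readonly ConcurrentQueue<(long epoch, Action action)> drainList = new();

    public EpochSketch(int numThreads) => threadLocalEpochs = new long[numThreads];

    // Threads periodically copy the global epoch into their local slot.
    public void Refresh(int threadId)
    {
        Volatile.Write(ref threadLocalEpochs[threadId], Volatile.Read(ref currentEpoch));
        DrainSafeActions();
    }

    // Bump the global epoch and register a trigger action: a promise that
    // 'action' runs once the prior epoch becomes safe, i.e., once every
    // thread has refreshed past it.
    public void BumpEpoch(Action action)
    {
        long prior = Interlocked.Increment(ref currentEpoch) - 1;
        drainList.Enqueue((prior, action));
    }

    // Epoch c is safe if every thread-local epoch is greater than c.
    long SafeEpoch()
    {
        long min = long.MaxValue;
        for (int i = 0; i < threadLocalEpochs.Length; i++)
            min = Math.Min(min, Volatile.Read(ref threadLocalEpochs[i]));
        return min - 1;
    }

    void DrainSafeActions()
    {
        long safe = SafeEpoch();
        while (drainList.TryPeek(out var entry) && entry.epoch <= safe
               && drainList.TryDequeue(out entry))
            entry.action();
    }
}
```

The shared-status example from the slide then becomes: write status = active, then call BumpEpoch(F); F fires only after every thread has refreshed, and has therefore had a chance to see the active status.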
The other piece that we use as a building block here is this notion of marking, where a thread can mark an operation or a phase as complete. And then it can also check whether all the other threads have completed their action, and if yes, advance to the next phase. Again, you use the epoch protection framework to drive this. And it's used, for example, to create asynchronous state machines in the system. So the entire system as a whole can move from one state to another while threads are lagging behind. In some sense, you don't have to coordinate all threads and move the state machine in lockstep across all of them. This mechanism is used extensively in the FASTER system under the hood — think of it as a generalized distributed-system building block, or I should say a distributed multi-threaded-system building block. We use it for memory safety, index resizing, log buffer maintenance, the checkpointing state machine (which uses marking), and so on. There's a quick example I had. Let's see how much time we have — 30 minutes. Maybe I'll just give you this example in brief. So think of multiple threads which are talking to this hybrid log. And we talked about these read-only markers that identify what is read-only and what is mutable. Let's say that we decided to move the read-only marker from its current value to something further ahead; it's basically getting incremented. We want to do this without taking any locks, obviously. So thread one's view might be that the read-only marker is here, so it thinks that this record is actually mutable. That's thread one's view. On the other side, as soon as this read-only marker is updated, thread two might think the record is immutable, because it sees that the record is in the immutable region. And so it might think: okay, I'm going to do a read-copy-update and write it here. And boom, you have a bug — because the update that the first thread did in place is lost. Think of the update as a read-modify-write operation; we don't want any read-modify-write to ever be lost in this scenario. So essentially, the problem here is that the two threads did not agree on whether a particular record was mutable or not. One obvious way to fix this would have been to take a latch, but what we instead do is use epoch protection to solve the problem. So you set the read-only offset to the new value K, and you simply bump the epoch with a callback, which will set the safe read-only offset to K. So the safe read-only offset lags behind the real read-only offset. The safe read-only offset is the boundary that is agreed upon by all threads: anything below it is definitely read-only to everyone, and anything at or above the real read-only offset is definitely mutable to everyone. We call the span in between the fuzzy region, because some threads think it's mutable and some do not. You just have to be careful when dealing with this region, because you know that not everyone agrees on its mutability status. And the way you do that is by simply going pending: if you get a read-modify-write there — hey, I'm going to hold off, because I know that it's not yet safe for me to do that operation. And it's very similar to an async I/O that you might have issued anyway when you go to stable storage. So that's the high-level idea of how this whole thing works.
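In terms of the epoch sketch above, the marker shift might look like the following — again with assumed names. The safe offset lags the real one, and an update landing between them (the fuzzy region) goes pending rather than updating in place:

```csharp
using System.Threading;

// Sketch of the latch-free read-only marker shift, using EpochSketch above.
class HybridLogMarkers
{
    public long ReadOnlyAddress;     // advanced eagerly by one thread
    public long SafeReadOnlyAddress; // lags until all threads have seen the new value

    public enum UpdateAction { InPlace, Pending, ReadCopyUpdate }

    public void ShiftReadOnlyAddress(EpochSketch epoch, long k)
    {
        Volatile.Write(ref ReadOnlyAddress, k);
        // Trigger action: once every thread has refreshed (and thus seen the
        // new marker), publish k as safe; records below it may now be flushed.
        epoch.BumpEpoch(() => Volatile.Write(ref SafeReadOnlyAddress, k));
    }

    public UpdateAction ClassifyUpdate(long address)
    {
        if (address >= Volatile.Read(ref ReadOnlyAddress))
            return UpdateAction.InPlace;        // all threads agree: mutable
        if (address < Volatile.Read(ref SafeReadOnlyAddress))
            return UpdateAction.ReadCopyUpdate; // all threads agree: immutable
        return UpdateAction.Pending;            // fuzzy region: hold off and retry,
                                                // much like an async I/O
    }
}
```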
Now I'm going to get very briefly into the recovery part of FASTER as well, because that's an interesting contribution in its own right. It turns out that this notion of recovery is applicable not just to FASTER: any kind of database system could potentially use this technique. The problem here is that, if you remember, the mutable region is lost on failure. And you really don't want to lose too much data on failure. So you basically want to take periodic checkpoints of the system. Now, the obvious way of supporting durability is by using a write-ahead log. So, hey, I'm going to do an update, I'm going to write it to a write-ahead log, and life's great, right? I mean, use group commit and all's good. Unfortunately, you go down from 150 million operations per second all the way to around 10 to 15 million operations per second, because of the contention on the write-ahead log. And that's not acceptable, particularly if you're taking infrequent checkpoints and you really want performance to be very high in between those checkpointing intervals. Our approach actually gives you something slightly better than a simple traditional database with fine-grained commits. We call this prefix recovery. And the idea is — if you recall the sessions that I told you about — we have clients establishing sessions, issuing sequences of operations. What prefix recovery says is: I'm going to commit prefixes of operations to the system. In other words, all operations before a particular time get committed, and none after. And so the world is moving in these coarse-grained increments, and you're committing prefixes of the operations you're issuing — in terms of issue time, and not necessarily completion time. And this is very compatible with reliable messaging systems. For example, say I'm reading data from Kafka, and I want to be able to prune my Kafka input log as these operations get successfully committed into FASTER. And this can be done, because FASTER will eventually tell you that, hey, everything before sequence number 100 is now committed. So I can prune my Kafka log at that point. So this is a very nice prefix-oriented abstraction. Unfortunately, if you try to implement prefix recovery naively, you would require either a write-ahead log, which as we saw is not scalable, or an atomic commit log of operations, which again is not scalable because it becomes a bottleneck. What we instead do is something called concurrent prefix recovery, or CPR. Think of it as a multi-threaded version of prefix recovery. So these are the various threads that are talking to the database. And basically you're going to create this kind of staggered line in the sand, where different threads independently decide when they move from database version V to V+1. And that essentially identifies the line in the sand where the database version has changed. And you can tell each such session: you're committed until that prefix. All the operations before the prefix are committed, and none after — that's the nice guarantee. And in particular, from a recovery perspective, something that is before the line in the sand on one session can never have seen the world that's after the line in the sand. So you're recovering, in effect, to a strict prefix of your database. The key to scalability here is that it is the system, and not the user, that chooses these exact CPR points per thread. Because if we asked the user to say, hey, I want to make the change at this particular point, then you have to block, because the other threads have to be aware of that.
But if you allow the system to choose these lines in the sand for the various threads, it turns out that you can do this with epoch protection, without having to block the system in order to find the line. And we actually pull that off in the system. And there's a paper on this which talks about the details. So yeah, this is sort of the secret sauce, right? This is why you can recover through it and still be durable. So the gist of the idea is that everybody can commit, but then you figure out what's the high-water mark up to which you're going to commit — like, for the yellow region, that boundary. This is what each thread thinks is getting committed, right? But in actuality — say I want O4 committed for that first thread; you may actually only be committing up to O5. How should we think about this? Right. So when the user issues an operation, we separate operation completion from operation commit. The operation completes in memory and comes back to you. So you can actually forge ahead on uncommitted state if you want. But then you will get a notification that you've actually committed some prefix. And, for example, if you're going to expose the value to the outside world, you would await that commit. And we'll tell you that, hey, the operation 100 that you issued some time back — you're committed until that point, and so on. So it's a lazy commit that's happening in the backend. So I guess my question is — of course, that's what the first version of Mongo did, but they were doing that for everything. This is obviously more low-level, so the people running code against FASTER know what they're doing. So my question would be: what percentage of the commit calls that you're servicing are asynchronous versus synchronous and therefore need to wait? Like, are you getting insane performance because 99% of people don't need to block until it's been flushed to disk? Right. Yeah, I think it depends on the use case. For example, in streaming use cases, typically people don't care about committing before completing the operation, because there's something else that's actually taking care of the replay and checkpointing of the streaming system itself. But if you look at more general, Cosmos DB-style use cases, you do care about commit. And for users who don't want to explicitly think about operation completion versus commit, you can actually issue an operation and just set a wait-for-commit parameter to true. Then we take care of everything under the hood, and it becomes an async operation that would just appear to be a traditional database to you. Okay, cool.
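From the caller's side, that looks roughly like the following — a sketch based on the FASTER C# session API, where method names and signatures may differ across versions:

```csharp
using System.Threading.Tasks;
using FASTER.core;

static class CommitExample
{
    // Sketch of completion vs. commit; based on the FASTER C# session API,
    // though exact names and signatures may differ across versions.
    public static async Task RunAsync(
        FasterKV<long, long> store,
        ClientSession<long, long, long, long, Empty, SimpleFunctions<long, long>> session)
    {
        long key = 42, input = 1;

        // The RMW completes in memory and returns immediately; the caller
        // may forge ahead on this not-yet-committed state.
        session.RMW(ref key, ref input);

        // Before exposing results to the outside world, await durability:
        // this returns once a checkpoint has persisted the session's
        // prefix, including the RMW above.
        await session.WaitForCommitAsync();

        // Elsewhere, checkpoints are taken periodically, e.g.:
        await store.TakeHybridLogCheckpointAsync(CheckpointType.Snapshot);
    }
}
```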
So I'm going to probably just skip through what happens in the actual checkpointing. There's basically a checkpointing protocol — a state machine — which uses the epoch framework to go from version V to V+1. There's a prepare phase, and then there's an in-progress phase, and so on. And in the system, both the V and the V+1 versions are present in the log. Eventually we will commit, or basically persist, all the V and V+1 versions from memory, and then during the recovery phase, we will eliminate the V+1 versions, so that you're recovering to a consistent prefix — which is version V of the database. And that's the protocol that happens behind the scenes: going through the state machine to pull off this whole recovery model. The paper has all the details on how this works. But basically, you can take these snapshot checkpoints, and as you go through this phase, the system is temporarily in a read-copy-update mode, where you don't allow those V versions to get overwritten until they're actually persisted on disk. But once the protocol is over, you reopen the database to in-place updates, and things go on as before. So it orchestrates this whole flushing to disk, so that the V versions are all correctly recorded. And we also support incremental snapshots. In some sense, you can capture only the changes that happened since your last commit, so that it becomes quite efficient to take a sequence of such commits without having to take a full in-memory snapshot every time. So that's basically what I wanted to say about single-node FASTER. Like I said, that's a pretty mature system at this point, and it's used in multiple different scenarios. But now I'm going to give you a very brief summary of where we are taking the system, over the last year or so. And that is specifically in the direction of remote FASTER. The goal here is to take this in-memory — or I should say embedded — key-value store, FASTER, and allow access to it from remote clients. So the clients could be in any language, and they are talking to FASTER, which sits on remote servers, or a farm of servers. The architecture is shown here: you have sharded compute nodes that are talking to FASTER, which could be a sharded cache with storage for each shard. And the goal is to get the same throughput as FASTER, but accessed remotely. It turns out that this is not as simple as it seems, because the first time we tried this out, the network became a huge bottleneck for us — at least with a naive way of integrating the remoteness. Particularly given that we are operating at 100-plus million operations per second, it is much harder and more challenging to design this end-to-end system, starting from the client all the way to the server, to get that high performance. But after a lot of work, we managed to pull this off. In some sense, what we now have is the ability to access FASTER remotely through remote sessions, and still get the same server throughput of 100-plus million operations per second per machine. There are a couple of papers that talk about how we pulled it off, and that'll be on the next slide. And the nice thing is that you inherit all the nice FASTER features of tiered storage, checkpoints, and so on. Right — is it RDMA into the hash index, or what are you accessing through RDMA? Or is it a protocol API kind of thing? Right. So there are two ways RDMA could be used. One is when the remote client talks to the server: the hash table is on the server side, and the client's operations that are being sent to the server could be sent using RDMA. But the other thing we can also do is use this IDevice abstraction, where the hash table sits on the client's side, but it just talks to remote memory using RDMA. Think of it as a virtual device that happens to access remote storage through RDMA.
So these are the two ways that you could access remote storage through RDMA. So which do you use? We actually have both implemented. The Shadowfax work specifically does the RDMA on the client-to-server side, but there's some work in progress, which we haven't published yet, that looks at using RDMA as a device abstraction. So that's its status. And why not DPDK — or whatever the equivalent is in Windows — to take the kernel-bypass route? Yeah, you could use DPDK as well. We just haven't looked into that yet. Okay, all right. Right. So Shadowfax is this elastic client-server prototype that we created, and it was actually built on FASTER C++, although recently we have ported similar technology to C# as well. It uses standard TCP on cloud VMs to get high performance. Think of it as using micro-batching to get larger batch sizes and then pushing them through to the server. And you can get low latency, but there's a trade-off: if you want 100-microsecond latency, you cannot get 100-plus million operations per second of throughput. So you have to make a trade-off between latency and throughput. And the latency-throughput curve is better for RDMA compared to TCP, of course, but TCP is more prevalent in modern data centers. So that's what we are going with in the first architecture. The graph here shows how plain FASTER and FASTER with Shadowfax — that is, with remote access — are actually pretty similar to each other. This is a simple deployment with one client and one server machine, with, say, 64 threads on the client talking to 64 threads on the server. So that's the design of this thing. The basic reason Shadowfax is able to get good performance is that you eliminate shuffling at the network layer, because FASTER is not a partitioned design, right? It's a shared-memory design, and we push that all the way down to the network layer. So the network connections are made from clients to specific server threads, and those server threads access shared memory directly. So again, you're depending on CPU cache coherence on the server side to get the data, as opposed to trying to do any form of shuffle on the server side. And this is important compared to traditional architectures like Seastar, which some of you may be familiar with, which is the standard client-server architecture. In fact, when you compare against Seastar doing no-ops — it doesn't even run a key-value store, it just uses Seastar, which does nothing and drops tuples on the floor — it still does not scale, because it actually does this shuffle operation on the server side. And the shuffle operation ends up being super unscalable at hundreds of millions of operations per second of throughput. But the FASTER-Shadowfax architecture is able to scale all the way up to 70-plus million operations per second in the same setting. So what do you mean by the shuffle? Is it basically that one thread from the client sends a request to the server, and then any thread on the server can run it? Well, it's that thread that runs it, right? So there's no shuffle, in the sense that in the Shadowfax design, one thread on the client talks to one thread on the server, which does the operation. Correct, yeah. The question was: what is the shuffle? Oh, got it — the opposite of what you're doing. Yes, exactly. The opposite of what I just said is shuffle.
So that's where you dispatch the request to a specific thread, which actually does the work for you and then comes back. That's the traditional design for these kinds of very performant systems — sharding within the server. The details, again, are in the paper, which is actually available on arXiv. So go ahead and take a look if you want the low-level details on how we pull this off. The other piece here is the recovery part in a distributed system. And I guess I'll try to wrap up in maybe five minutes. Is that okay, Andy? Yeah, I mean, yeah, yeah, man. Okay, go ahead. You have plenty of time. Oh, great. So we talked about this client-server thing, right? I think one of the challenges that comes up is recoverability in this setting. In some sense, we had this very nice property with CPR, where you could have these sessions that create a sequence of operations, and we were lazily committing the prefixes of these operations. And we could pull this off on a single failure domain, which was a single machine: because there's a single failure domain, you're going to entirely crash, and you're going to come back. And when you come back, you come back to a prefix of all your sessions, and that's all nice and good. But if you're trying to run FASTER on multiple machines, so that you have multiple clients talking to multiple servers, extending this to that setting becomes non-trivial. And that's where we extended CPR to what we call DPR. DPR stands for distributed prefix recoverability, and I'll give you a very brief summary of what this is. This is work that was actually led by Tianyu from MIT, who was an intern here; he'll probably have more to say on this at some point, and it is going to appear at SIGMOD. So be prepared for that one. I'm just going to give a very high-level overview of the problem statement here. So typically, if you have compute and storage — compute talks to storage, and storage is remote — it can take milliseconds to talk to storage. This compute could be Azure Functions, it could be client applications, and so on, and storage could be things like Azure Blobs and so on. Typically, both reads and writes are served from the durable storage in this architecture, with no cache, and it's pretty slow, right? There is no cache in this system. So to solve this problem, people introduce caches, where a cache sits between the compute and the storage. In some sense, the FASTER hybrid log can be thought of as one such example, although this isn't really specific to FASTER. So you have a cache — typically this cache could be something like Redis, or it could be a main-memory database — and it's just periodically checkpointing data into storage. Reads are often served from the cache, but the writes typically go all the way down to durable storage. This works great if you're in a single failure domain. But in fact, you can do even better in a single failure domain, right? You could just say: hey, I'm going to let operations complete, and then let them commit lazily by taking a commit dependency. That is, I'm going to say that it's completed, but it doesn't actually commit until the previous operations commit. And this is basically what CPR does on a single machine, and that works fine. But things can go terribly wrong if you shard the data. So for example, you have these three compute nodes, which are now talking to, say, four sharded database instances.
And let's say that each of these sharded database instances has a cache in front of it, just for that shard. We call this a cache-store, which is basically a combination of a cache that talks to a store. And you can think of this as a sharded caching tier. Obviously, you want compute to be talking just to the cache as much as possible. And if you are only worried about reads, then this is great, and you can already do this today. You can serve reads using Redis or any other cache, and all the writes have to go through to storage. But the question is: can you do even better? What about writes? Is it possible, for example, that compute node one could write something to this cache — say, on this shard — and then compute node two could read that piece of information, without waiting for that piece of information to go to storage? That would be a nice thing to have. Unfortunately, you can't really do that today. If you did it without commit, then you're going to have loss of data; and if you waited for commit, that would mean that the write would have to wait for milliseconds before the reader could actually see it. So you really are not in good shape either way. So the goal in this scenario is, first of all, to return operations immediately, just like we had in the single-node FASTER case. You want to return immediately, before commit, but we also want to include writes, not just reads. You want to be able to write something and come back immediately, and you want other clients to be able to read that write — or you want to be able to read your own write — but with a strong notion of recoverability. In some sense, if I have a session, I should be able to recover to a known committed prefix of that session, where the session could span shards. So for example, one session could first read from A and then write to B, and another session could be reading from B and then writing to C. And all of that should work out such that, for any given session, you actually get prefix recoverability. And obviously, what this means is that when there's a shard failure, some rollbacks have to happen, because a particular session might actually have read something that another session wrote. But we know that this is fine, because even on a single machine, we are providing prefix recoverability lazily. So people know that any operations they have issued that have not yet committed may possibly have to be redone. So there's some rollback that could be involved, because we are in multiple failure domains. The short answer — without getting into technical details — is that you can actually pull this off in a very nice way. In some sense, what we do is allow users to write sessions that span the shards. So for example, you have operations on C, B, A, D, C, and so on. And then we provide prefix recoverability guarantees. So we might say: hey, everything until D is now committed; and later, everything until B is now committed. So you provide nice recoverability guarantees on that session. Conceptually, it's no different from CPR, because you're issuing operations to the system and you're getting commits of prefixes of those operations — except that this is now happening in a distributed system. And how do you pull that off? That's a non-trivial challenge, and you have to track dependencies and things like that.
And that's all there in the SIGMOD paper; I won't get into the details of how we pull it off. But it's a pretty cool extension, I would think, of the CPR architecture, and it's more general than FASTER, because you could potentially apply it to any sharded cache talking to a durable store, where you want the caches to be able to absorb the reads as well as the writes. So I think I'll end here and basically say that, yeah, the FASTER project started with this very simple library that offered a concurrent latch-free design based on a generalized epoch framework, with FasterKV and FasterLog being the two main artifacts. They support secondary indexing. We have recently gone in the direction of remote access, ideally without any performance loss, and we've explored that space pretty deeply over the last year. And then there are the novel recovery techniques, for both a single-node database and multi-node sessions, where the sessions can span multiple shards of your database. These techniques are quite general, and they apply beyond FASTER as well. And yeah, the system is open source, and links to all the research papers are on the website shown here. So that's pretty much all I have. Thanks. Okay, awesome, thank you. I will applaud on behalf of everyone else. So we have a couple of questions in the chat that came up. Mikael, do you want to go first? Sure. I made some calculations on the 125 million operations per second. Mikael — I know who you are; for everyone else, he's the creator of NDB. Thank you for coming. Yeah. Thank you. That's correct. Yeah, I'm usually playing around with 100 gigabits per second. So I was wondering whether you actually managed to send those 256 bytes 125 million times. That, to me, means 400 gigabits per second. Is that correct? Yeah, so it depends on the size of the payloads, right? At some point, you're going to saturate the bandwidth between the client and the server. So in the examples that you're... In the graph that you presented, it said 256 bytes. So that's why I asked. It was the Shadowfax line. Right, the Shadowfax line. Yes, so I think it was 256 bytes or 100 bytes. It probably was 256 bytes if the graph said that. Yeah, so this was really on a network which had very high bandwidth. I would have to ask Chinmay about the details of exactly how he ran it. I think he was running it on CloudLab, if I remember right, and they did have very high-bandwidth connectivity in that setting. So what was the question again? Yeah, I was just wondering if my calculations were correct, and the answer is yes, if I understand correctly. Yeah, I think so, yes. And remember that this is not a case where you have hundreds or thousands of clients, right? It's a more controlled scenario, where we have these clients and the servers. The scalability really is driven by the number of clients that are talking to a given server. And we can really push that to its limit in that particular setting. If you had thousands of clients talking to FASTER, then the overheads of multiplexing those clients would become the bottleneck. I can imagine that, yeah. Okay, awesome. All right, next question from Juncheng Yang. Hey, Juncheng, how are you doing? Hi, so I have a couple of questions. The first one is: in the last part of the talk, you talked about using FASTER as a cache.
Is it really meaningful to think about recovery for a cache? When the cache, say, fails and comes back again, you can have its data, right? How do you deal with that? That's a great question. Yeah, so FASTER should be thought of as a Swiss Army knife of sorts. In some sense, there are some use cases which use it as a cache, in which case they just don't use the recovery part of it — unless they want to start with a hot cache, right? So it's possible that you have scenarios where you want to recover the contents of the cache, so that you don't have to wait for the cache to re-warm after a recovery. In that case, they would perform a recovery; they would take periodic checkpoints. However, the main reason for taking checkpoints is really for when you're using FASTER as a key-value store, where you really want committed prefixes — where you know that it is not being backed by a real database, but is being used as the database itself. But it's used in both scenarios, so both are important, I would think. But say when the cache goes down, the traffic will be sent to, say, another machine. I mean, for the updates, how do you make sure, when the old machine comes back, that stale data can be removed or detected? Right, so the scenario I'm talking about is where FASTER is used as the entry point for your system, where all the operations are going through the FASTER API. So when you fail, you wait for a new FASTER instance to boot. Or, if you have a secondary, you can have secondaries that are constantly absorbing the log of FASTER and keeping themselves up to date, so that when the primary fails, the reads can switch over to the secondary, and the secondary takes over the log on storage and becomes the new primary at that point. Okay, thank you. Another question is: you use the hash table as the indexing structure, and you're talking about small objects, and you use external devices such as SSDs. How do you maintain such a huge hash table in memory? Yeah, that's a great question as well. So one of the initial designs was that we were keeping all the keys in the hash table in memory. That quickly became unscalable, because with billions of keys, there's no way you can keep all the keys in memory. So what we have is a design where we don't keep keys in memory; we just have eight-byte entries in memory. So there's an eight-byte overhead per key that is kept in memory. Again, that's also not entirely scalable, because you're still paying eight bytes per key — with a billion keys, that's still around 8 GB of index memory. And when I say an eight-byte overhead, the reason it suffices is that we allow collisions: all the colliding records for a given eight-byte entry are present on the log. So it comes at the cost of additional I/O in the presence of collisions. Now, one of the newer designs that we are looking at, for really low-memory use cases, is to have a two-level hash index, where you keep a hot index for just the hot keys, and a cold index — a two-level index — for just the cold keys. So if you have, for example, a million hot keys and a billion cold keys, it's okay to spend eight bytes per key for those million records, but for the billion cold keys, you probably don't want to spend eight bytes per key. This is something we are actually examining right now, and it's the topic of a summer internship for this summer as well. Okay. Thank you.
We may have a paper on that, or something similar, in that case. All right. So one more question from my side. When I look at the recovery architecture, if I understand correctly, you basically trust the storage layer to be the safe layer. So you could more or less view this as a number of shared-memory machines where you partition the data, but the actual data resides on the storage. So basically a shared-disk implementation, in a sense. Is that correct, or have I misunderstood something? Right, the point of truth is storage, yes. So basically, think of the mutable region as a cache that treats storage as the point of truth. Now, since storage is a tiered hierarchy, there could be some tiers of the storage that are on a local NVMe drive — it could be a local SSD — which may actually not survive failure, because if your Azure machine, for example, goes down, the replacement might come up on a different machine. So you actually lose that. So if you want true recoverability, you have to tier down to a storage layer that is actually accessible across the different machines. And that would be the shared tier that you talked about. Typically it would be something like Azure page blobs, which can be accessed across multiple machines, and which itself does triple redundancy, for example, so that you get redundancy of the data stored at that tier. And for us, it's just a tier in the entire storage hierarchy. Okay, thanks. Okay, anybody else? Otherwise, I'm going to be selfish with my questions. Okay, so quickly: for the data structure you're planning on using for the secondary indexes — I mean, I don't know the FishStore paper, I haven't read that one yet — is it a tree data structure, or are you using some trickery with the hash index for range scans? Right, so the data structure we are using for secondary hashing is actually just a hash table. In fact, it's a mild variant of the FASTER hash table, which chains together multiple records with the same key. And that's what the FishStore paper talks about. For range indexing, you can use anything off the shelf — you could use a Bw-tree or any variant thereof. We are agnostic to what range index you would use. FASTER exposes record IDs to the secondary index API, and you could basically plug in whatever secondary index you wanted. So if I wanted to support range scans, I'd have to bring my own tree — supply my own data structure? Yes. You guys wouldn't provide one? We want to provide one; we have not done that yet. What we have is the hash one, yeah. Okay, and I understand that. Do you think you'd use a Bw-tree again, or something different? That's a great question — that's up for grabs. We are looking at alternatives as we speak. The Bw-tree looks like a great architecture; maybe we just use that, or maybe there's something better. I need to think more about that. I mean, there's a C++ and a C# version of FASTER. I guess the first question is: how much code do they share? Are you having to re-implement things for the two different languages? And does that mean the C++ repo is always sort of lagging behind the C# one? Yeah, so we had to re-implement it, because one of the reasons for using C# is that it runs anywhere, right? It runs on pretty much every device on the planet. So that's the reason for its attractiveness.
So we couldn't, for example, just use P/Invoke or some remote interop to talk to C++. The C++ version definitely lags behind in some sense — the latest features are not yet there in the C++ version — but we do have developer resources currently looking at bringing it back up to date. So the work that's happening this year is to really bring C++ back up to the same feature set as C#. And I think we do want both, because, for example, people who use RocksDB and other such systems today are not going to use C# — that's out of the question. So to target those kinds of use cases, we still care about C++. So I think both are important. It's hard enough to build one, right? But now you've got to maintain two. I mean, it's a stripped-down system that's pretty bare bones, right? So that makes it easier, but definitely keeping the same features is hard. Okay, so my last question would be: what's a use case that someone has brought to you, saying, hey, I think I want to use FASTER, where you recognized that they should not be using FASTER for it — and you told them so? Right, right, it's a great question. So I think there have been cases where people have wanted to do range queries. They say, hey, I have maybe a non-trivial fraction of my workload being range queries, so let me just go ahead and do a full scan, and let me just use FASTER for this. And it initially works well — as long as you don't have many range queries — but as the fraction of range queries grows, the cost of full scans just starts killing you, and the cost of things like garbage collection also starts killing you in that kind of environment. So if you have a non-trivial set of range queries, then I would say that, until we have this whole secondary-index support, it doesn't make sense to use FASTER in that scenario. I think that's the first thing that pops into my mind. The other thing is transactions, of course. If you want to do multi-record transactions, people don't even bother using the system. I tell them right at the start: hey, if you think that your system will require transactions — multi-step transactions specifically — then this is probably not a good fit for you. Right, okay. All right, so we have one last question, from Mary. Do you want to unmute yourself? All right, I'll read it for her. How often do you see Trill used as part of FASTER deployments? You mentioned it in Azure Stream Analytics — is it a common pattern? Yeah, that's a good question. So FASTER in Trill is actually very popular within the data center, within Azure Stream Analytics. As far as the open-source version of Trill is concerned, of course, you can use it with FASTER, but we haven't seen many use cases where people have used FASTER as a library together with Trill as a library outside ASA. But within the ASA service, it makes much more sense, because ASA provides this high-level, fully managed service to users, right? And the more COGS they can save behind the scenes, the better. Because of FASTER, they can support much larger working sets in their streaming pipelines. So they use Trill behind the scenes and use FASTER to offload the cold state, so that they can potentially pack more instances onto the same machine.
And so this is the thinking behind the reference data lookup capabilities in ASA. So I think that's the main use case that I've seen. And specifically there, I guess one of the key drivers is the ads use case, which I think I referred to. We also talk about it in the Trill paper — this whole ads indexing pipeline. We had a demo several years back on the ads pipeline at Microsoft, which uses streaming technology, and that pipeline has heavy memory usage, right? So that's one of the driving use cases: you just want to push out the cold data so that the hot working set stays in memory, and you can run your pipelines using a much smaller number of machines than before. Okay. Okay, I guess, thank you so much for doing this. This has been an awesome talk. I really appreciate it.