Okay, cool. Everyone, I'm Ritchie, I work at Uber, and I spent the last two years working on M3DB, which is Uber's distributed time series database. My talk today is about seeing if I could re-implement something that looks like M3DB, but using FoundationDB as the core storage layer instead of writing a storage system from scratch like we did. The thing I built is only a prototype, and I'm not actively developing it, but it's a proof of concept that you can build really high write throughput systems on top of FoundationDB: in this case, millions of writes per second, roughly 10x time series compression, and about 2,000 lines of Go. A quick disclaimer: do not use any of this code anywhere for anything important. It was just a prototype to do some benchmarking and figure out different architectures for FoundationDB. There are a lot of cut corners, TODOs, and missing features, but I think it does prove the point.

Very specifically, what I was trying to do was see whether I could build a system with the same API, compression, and performance characteristics as M3DB, but as a more or less thin layer over FoundationDB. When I say high write throughput time series data, this is the interface I'm talking about: when you write, you say "this is my time series ID, and here's my value," where the value is a timestamp and a float. When you read (I think there's actually a mistake in this method on the slide), you provide the series ID plus a minimum and a maximum timestamp, and it returns all the data points in that range.

This is my first attempt, which is what the FoundationDB documentation says to do for time series data: you take a bunch of writes, batch them up into one transaction, make the key the time series ID plus the timestamp so it's unique, and pack the float into the value. This works, but you get terrible compression. You're repeating the time series ID in every key, you don't get prefix compression in FoundationDB, and on top of that you're not compressing the values, the floats, at all. The write throughput also just isn't that great compared to what modern time series databases can do, because you're generating a lot of transactions.

The second thing I tried, which is not a good idea and I don't recommend it, was to see if I could do Gorilla compression directly in FoundationDB. I knew it would be slow, but I wanted to see if it was possible. If you're not familiar, Gorilla compression is one of the most common forms of streaming time series compression, used for time series data specifically in the observability space. The interesting thing about Gorilla compression is that it operates on individual bits: if a value hasn't changed since the last time you encoded it, you can encode it in Gorilla with just a single bit. That's pretty hard to map directly onto the FoundationDB interface. What I ended up doing was taking the entire state of the streaming encoder and serializing it to FoundationDB. Then, on every write, I would rip that state out, hydrate an in-memory encoder, write the next value into the encoder, and serialize the state back. Some of you are shaking your heads; yes, it was not a good idea. If you know anything about how storage systems and B-trees work, you'll realize that this will not have good performance, which it did not, but you do get really good compression.
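To make that first naive approach concrete, here's roughly what it looks like with the FoundationDB Go bindings. This isn't the prototype's actual code; the Point type, the function names, and the ("ts", seriesID, timestamp) key layout are just illustrative, a minimal sketch of one key per data point.

```go
// Naive layout: one FoundationDB key per data point.
// Key   = ("ts", seriesID, timestamp), packed with the tuple layer
// Value = 8-byte IEEE-754 float
package naive

import (
	"encoding/binary"
	"math"

	"github.com/apple/foundationdb/bindings/go/src/fdb"
	"github.com/apple/foundationdb/bindings/go/src/fdb/tuple"
)

type Point struct {
	Timestamp int64
	Value     float64
}

// WriteBatch stores one batch of points for a series in a single transaction.
func WriteBatch(db fdb.Database, seriesID string, points []Point) error {
	_, err := db.Transact(func(tr fdb.Transaction) (interface{}, error) {
		for _, p := range points {
			key := tuple.Tuple{"ts", seriesID, p.Timestamp}.Pack()
			val := make([]byte, 8)
			binary.BigEndian.PutUint64(val, math.Float64bits(p.Value))
			tr.Set(fdb.Key(key), val)
		}
		return nil, nil
	})
	return err
}

// Read returns all points for a series with timestamps in [minTS, maxTS).
func Read(db fdb.Database, seriesID string, minTS, maxTS int64) ([]Point, error) {
	out, err := db.ReadTransact(func(tr fdb.ReadTransaction) (interface{}, error) {
		kr := fdb.KeyRange{
			Begin: fdb.Key(tuple.Tuple{"ts", seriesID, minTS}.Pack()),
			End:   fdb.Key(tuple.Tuple{"ts", seriesID, maxTS}.Pack()),
		}
		kvs := tr.GetRange(kr, fdb.RangeOptions{}).GetSliceOrPanic()
		points := make([]Point, 0, len(kvs))
		for _, kv := range kvs {
			t, _ := tuple.Unpack(kv.Key)
			points = append(points, Point{
				Timestamp: t[2].(int64),
				Value:     math.Float64frombits(binary.BigEndian.Uint64(kv.Value)),
			})
		}
		return points, nil
	})
	if err != nil {
		return nil, err
	}
	return out.([]Point), nil
}
```

Even in this tiny sketch you can see the compression problem: the series ID and full timestamp are repeated in every key, and every value is a raw 8-byte float.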
So after trying those two really naive approaches, I figured that if I was going to do this, the architecture would probably have to look a lot more like M3DB's, but with some of the lower pieces of the stack replaced by FoundationDB instead of our custom storage system.

Just to briefly go over M3DB's architecture: writes are coming in constantly, and once they arrive at an individual M3DB node, two things happen in parallel. They get Gorilla-compressed in memory and buffered, and at the same time they're written in uncompressed form to the commit log, so that if a node dies and comes back up, it can read its commit log, restore the in-memory compressed state, and keep chugging. Then, as an asynchronous background process, the compressed blocks that have already been buffered and compressed in memory are written to disk as immutable files. If you're familiar with systems like Cassandra, you can think of the in-memory part as the memtables and the immutable files on disk as the SSTables. Pretty straightforward. If you think about where your data lives at any point in time, your most recent time series writes live in two places, your commit log files and your in-memory mutable encoders, and those cover the same data. Data that's a little bit older lives in the immutable files.

Mapping this over to the FoundationDB world, I decided to take a similar approach and put a semi-stateful layer in front of FoundationDB. I wrote a Go program that looks a little like this, and I realize there's a lot going on in this picture, but the idea is basically the same as what we do with M3DB. Writes come in, they get buffered in memory, and I'm doing Gorilla compression synchronously there. At the same time, they get batched together and written in uncompressed form into commit logs, but the commit logs live in FoundationDB. So the node doesn't have to be stateful: if you're running this on Kubernetes, for example, and a node dies and comes back, it's completely stateless. It has a node identifier, it goes and finds its commit logs in FoundationDB, and you don't have to worry about any of that statefulness. Then, like I said before, there's an asynchronous process that's constantly reading data out of the buffers and writing the compressed chunks out, except I'm writing them to FoundationDB instead of disk. For each time series you have a metadata entry that tells you, for this time series, here are all the chunks I have and how big they are, and you have simple zone maps that tell you the first and last timestamp in each compressed chunk, which helps you figure out which chunks to read when you need to serve queries.

Implementing what I call the semi-stateful commit logs is pretty straightforward. You accumulate writes into batches and then dispatch them to FoundationDB as one large chunk, so you may have a thousand writes that translate into one large chunk that goes into FoundationDB.
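Here's a minimal sketch of that dual write path, under my own assumptions about naming and batching. The real prototype's types and chunk format differ, but the shape is the same: compress the point into the in-memory buffer, append it to a pending uncompressed batch, and write the full batch to FoundationDB as one commit log chunk.

```go
// Dual write path: every write goes into the Gorilla-compressed in-memory buffer and
// into an uncompressed batch; a full batch is written to FoundationDB as one commit
// log chunk under a ("commitlog", nodeID, index) style key.
package node

import (
	"bytes"
	"encoding/binary"
	"math"

	"github.com/apple/foundationdb/bindings/go/src/fdb"
	"github.com/apple/foundationdb/bindings/go/src/fdb/tuple"
)

// memoryBuffer stands in for the Gorilla-compressed in-memory buffer
// (sketched in more detail further down).
type memoryBuffer interface {
	Write(seriesID string, timestamp int64, value float64) error
}

type Write struct {
	SeriesID  string
	Timestamp int64
	Value     float64
}

type Node struct {
	db       fdb.Database
	nodeID   string
	buffer   memoryBuffer
	pending  []Write // uncompressed writes for the next commit log chunk
	batchCap int
	logIndex int64 // monotonically increasing commit log chunk index
}

// Write compresses the point into the in-memory buffer and appends it to the pending
// commit log batch; once the batch is full it is flushed to FoundationDB as one chunk.
func (n *Node) Write(w Write) error {
	if err := n.buffer.Write(w.SeriesID, w.Timestamp, w.Value); err != nil {
		return err
	}
	n.pending = append(n.pending, w)
	if len(n.pending) < n.batchCap {
		return nil
	}
	chunk := encodeCommitLogChunk(n.pending)
	_, err := n.db.Transact(func(tr fdb.Transaction) (interface{}, error) {
		key := tuple.Tuple{"commitlog", n.nodeID, n.logIndex}.Pack()
		tr.Set(fdb.Key(key), chunk)
		return nil, nil
	})
	if err == nil {
		n.pending = n.pending[:0]
		n.logIndex++
	}
	return err
}

// encodeCommitLogChunk uses a trivial length-prefixed layout; the real format just
// needs to be cheap to scan start-to-finish during recovery.
func encodeCommitLogChunk(writes []Write) []byte {
	buf := new(bytes.Buffer)
	for _, w := range writes {
		binary.Write(buf, binary.BigEndian, uint16(len(w.SeriesID)))
		buf.WriteString(w.SeriesID)
		binary.Write(buf, binary.BigEndian, w.Timestamp)
		binary.Write(buf, binary.BigEndian, math.Float64bits(w.Value))
	}
	return buf.Bytes()
}
```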
The commit log chunking format is optimized to be read from start to finish quickly for recovery; you can't jump around in it randomly. The two operations you need to support for commit logs are: fetch all the undeleted commit logs so you can replay them when you're doing a recovery, and delete all commit log chunks before a certain time. That way, as you flush your compressed data to FoundationDB, you can start cleaning up your commit log chunks, otherwise they accumulate forever. This is the key format I picked for my commit logs, and it's very simple. Once you turn this into a more distributed system you'd have a host identifier in there too, but it's "commitlog" followed by a monotonically increasing commit log index.

In terms of commit log cleanup, you have a background process that loops forever. It waits for a new commit log chunk to be written with a specific index. Looking at this diagram, for example, we would wait for commit log chunk number four to be written. Once that happens, we kick off the flush process, which loops through all the compressed chunks that live in the buffer and flushes them to FoundationDB; I'll go into the details of that in a minute. Once that's done, you know that all the data in commit log chunks one, two, and three now lives in FoundationDB in a separate, readable form, so you can just delete commit log chunks one, two, and three. The system is constantly repeating this process. This is what it looks like in code. I don't need to go into the details, but basically it waits for a commit log rotation, calls flush on the buffer, and then truncates the commit log using a token that came back from the rotation.

So are we a database yet? Not really. We can write commit logs, clean up commit logs, and read commit logs, but that's not actually useful to anyone. People want to be able to read their writes. This is where the buffering system comes in. Like I said before, it's very similar to the memtables in Cassandra and similar systems: we're buffering and compressing time series data in memory before flushing it to FoundationDB. And when you write to FoundationDB, you have the option, because of the transactional nature, to either merge with an existing chunk or create a new one, and update your metadata at the same time. All writes and reads have to flow through this buffer system to get a consistent view of the data.

This is more or less what the buffer system looks like in my prototype. It's basically a gigantic synchronized hash map where the keys are the time series IDs and the values are arrays of Gorilla encoders; I'll explain why it's an array in a second. We'll start with the write pathway, which is very simple. A write comes in and you look up the time series ID in the hash map. If there are no encoders there, you create a new one, encode the value, and you're done. If there's an array of encoders there, only the last one is considered mutable, so you grab the last one, encode the value into it, and you're done. This system doesn't account for out-of-order writes, so for this specific implementation you always have to write values with timestamps higher than the previous ones. There's no reason it couldn't support out-of-order writes; it just makes the implementation a little more complicated, so I didn't do it for demo purposes, but you could.
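A minimal sketch of what that buffer write path can look like, with made-up names: a mutex-protected map from series ID to a slice of encoders, where only the last encoder accepts writes. The encoder here is a stub, and the sealed flag and out-of-order check are my own simplifications of what's described above.

```go
package buffer

import (
	"errors"
	"sync"
)

// encoder stands in for a streaming Gorilla encoder; the real one appends
// delta-of-delta timestamps and XOR'd float bits to a bit stream.
type encoder struct {
	sealed        bool  // set when a flush marks this encoder immutable
	lastTimestamp int64 // used to reject out-of-order writes
}

func (e *encoder) encode(timestamp int64, value float64) {
	// Bit-level Gorilla encoding elided.
	e.lastTimestamp = timestamp
}

var errOutOfOrder = errors.New("out-of-order write")

type Buffer struct {
	mu     sync.Mutex
	series map[string][]*encoder
}

func NewBuffer() *Buffer {
	return &Buffer{series: map[string][]*encoder{}}
}

// Write looks up the series and encodes the value into its last (mutable) encoder,
// creating a fresh encoder if the series has none or its last one has been sealed.
func (b *Buffer) Write(seriesID string, timestamp int64, value float64) error {
	b.mu.Lock()
	defer b.mu.Unlock()

	encoders := b.series[seriesID]
	if len(encoders) == 0 || encoders[len(encoders)-1].sealed {
		encoders = append(encoders, &encoder{})
		b.series[seriesID] = encoders
	}
	last := encoders[len(encoders)-1]
	if timestamp < last.lastTimestamp {
		return errOutOfOrder // this prototype doesn't accept out-of-order writes
	}
	last.encode(timestamp, value)
	return nil
}
```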
Then this is where things get a little bit interesting, which is the flush API. This is what actually gets all of your data out of the buffer and into FoundationDB, and there's a very specific contract the flush API has to implement for everything to work: once the flush method completes and returns successfully, all writes that were in memory when the flush function started have to be durably persisted to FoundationDB. The way it works is that it loops through each series in memory, and for each one it marks the current encoder as immutable, so it seals the current compressed chunk, and then flushes all of the immutable encoders to FoundationDB. Once it's flushed all the immutable encoders, it can evict them to free up memory for more writes.

When I say flushing to FoundationDB, two things have to happen. One, you have to actually store the compressed data chunk, but you also have to have an index over the chunks: you have to be able to say which chunks you have and what time ranges they cover. In a production system you'd probably have a slightly more sophisticated zone map than this, but in my implementation I just have a slice of metadata. It contains the first timestamp and the last timestamp, which is really useful for satisfying queries and also for deciding which chunks make sense to compact together. It also has the size, because you don't want any individual chunk to be too large or too small, so you can use the size to decide which chunks make sense to merge together as well.

This is where I think FoundationDB really starts to shine, which is doing the flushes transactionally. When I flush a compressed time series block to FoundationDB, the first thing I do is read the existing metadata for the series being flushed. Then I use the series metadata to decide whether the data being flushed should be merged with an existing chunk or written out as a new independent chunk. If we're merging, we read the existing chunk we need to merge with; if not, we skip that. Then we write the new merged chunk, or just the new chunk, to FoundationDB, and finally we update the series metadata with the new chunk information, which points at either the merged chunk or the completely new chunk we just wrote. And you can do all of this as a single ACID transaction with strict serializability.

I think this was the big aha moment for me when I was using FoundationDB. Other people have built systems that look like this on top of Cassandra or other things, and I've implemented basically this entire system on a custom storage engine, and you just can't do this. It's really hard to get all of this stuff right, and it's really easy with FoundationDB. It's almost like you're programming against an in-memory system: it's just really easy to get it right and know that it's going to do the right thing. To me, that was really, really powerful.
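Here's a rough sketch of what that flush transaction can look like with the FoundationDB Go bindings. The key layout, the JSON metadata encoding, the merge threshold, and the mergeChunks helper are all assumptions for illustration; the important part is that the five steps happen inside a single db.Transact call.

```go
package flush

import (
	"encoding/json"

	"github.com/apple/foundationdb/bindings/go/src/fdb"
	"github.com/apple/foundationdb/bindings/go/src/fdb/tuple"
)

// ChunkMeta is the simple "zone map" entry kept per chunk.
type ChunkMeta struct {
	ChunkID        int64
	FirstTimestamp int64
	LastTimestamp  int64
	SizeBytes      int
}

type SeriesMeta struct {
	Chunks []ChunkMeta
}

const maxChunkBytes = 16 * 1024 // assumed merge threshold

func flushChunk(db fdb.Database, seriesID string, compressed []byte, first, last int64) error {
	_, err := db.Transact(func(tr fdb.Transaction) (interface{}, error) {
		// 1. Read the existing metadata for the series being flushed.
		metaKey := fdb.Key(tuple.Tuple{"meta", seriesID}.Pack())
		var meta SeriesMeta
		if raw := tr.Get(metaKey).MustGet(); raw != nil {
			if err := json.Unmarshal(raw, &meta); err != nil {
				return nil, err
			}
		}

		// 2. Decide whether to merge into the last chunk or write a new one.
		n := len(meta.Chunks)
		mergeWithLast := n > 0 && meta.Chunks[n-1].SizeBytes+len(compressed) <= maxChunkBytes

		var chunkID int64
		data := compressed
		if mergeWithLast {
			// 3. Read the existing chunk we're merging with.
			chunkID = meta.Chunks[n-1].ChunkID
			existing := tr.Get(fdb.Key(tuple.Tuple{"chunk", seriesID, chunkID}.Pack())).MustGet()
			data = mergeChunks(existing, compressed)
			meta.Chunks[n-1].LastTimestamp = last
			meta.Chunks[n-1].SizeBytes = len(data)
		} else {
			chunkID = int64(n)
			meta.Chunks = append(meta.Chunks, ChunkMeta{
				ChunkID: chunkID, FirstTimestamp: first, LastTimestamp: last, SizeBytes: len(data),
			})
		}

		// 4. Write the (merged or new) chunk.
		tr.Set(fdb.Key(tuple.Tuple{"chunk", seriesID, chunkID}.Pack()), data)

		// 5. Update the series metadata in the same transaction.
		newMeta, err := json.Marshal(meta)
		if err != nil {
			return nil, err
		}
		tr.Set(metaKey, newMeta)
		return nil, nil
	})
	return err
}

// mergeChunks is a stand-in for decoding both streams and re-encoding them as one.
func mergeChunks(a, b []byte) []byte { return append(append([]byte{}, a...), b...) }
```

If two nodes ever raced on the same series, FoundationDB's conflict detection would abort and retry one of the transactions, which is exactly the property that's hard to get right on a custom storage engine.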
Finally, just to walk through how the reads work. You've probably figured this out by now, but for a given time series, we look up and read the latest version of the series metadata out of FoundationDB. Then we use the metadata to determine which chunks contain data for the requested time range: this is the interval the user wants to read, here are the intervals the chunks cover, and we do a simple interval intersection to see which chunks contain data we care about. You also need to check whether any in-memory encoders that haven't been flushed to FoundationDB contain data for the time range. Then you merge across all those compressed chunks and return a merged stream to the user. It's basically the same algorithm as merging k sorted arrays; there's nothing complicated there.

In conclusion, building a semi-stateful distributed system on top of FoundationDB is a lot more work than building a stateless one, but it's a lot less work than building a distributed system from scratch, so I recommend it. If you have a problem that doesn't seem like it would work well on FoundationDB, but you can find a way to model it as a semi-stateful system instead, that's probably a better idea than writing a custom storage engine from scratch. Like I said before, this will do millions of writes per second and give you industry-competitive compression in about 2,000 lines of code. Obviously it would be a lot bigger if I spent three to six months productionizing it, but it's not far off. FoundationDB may never beat a purpose-built storage system, but it's a lot easier to program against than the operating system, file system, network, and physical hardware. It just makes things really easy, and you can spend a lot more time focusing on the things that are core to your system and less time dealing with problems FoundationDB has already solved for you.

So that's it. I think I have a few minutes for questions if anyone wants to ask, but that's all my slides.
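To make the chunk-selection step of that read path concrete, here's a tiny sketch of the interval intersection. The metadata type and names are illustrative, and the in-memory encoders and the k-way merge of the decoded streams are omitted.

```go
package read

// ChunkMeta mirrors the per-chunk zone map entry: first and last timestamp covered.
type ChunkMeta struct {
	ChunkID        int64
	FirstTimestamp int64
	LastTimestamp  int64
}

// chunksForRange returns the IDs of chunks that may contain data in [minTS, maxTS].
func chunksForRange(chunks []ChunkMeta, minTS, maxTS int64) []int64 {
	var ids []int64
	for _, c := range chunks {
		// Two closed intervals overlap iff each one starts before the other one ends.
		if c.FirstTimestamp <= maxTS && c.LastTimestamp >= minTS {
			ids = append(ids, c.ChunkID)
		}
	}
	return ids
}
```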