Databases, a database seminar series at Carnegie Mellon University, is recorded in front of a live studio audience. Funding for this program is made possible by OtterTune and Google. We're super happy today to have Joran Dirk Greef. He is the co-founder and designer of TigerBeetle, a new transaction processing database written in Zig, and that's a major point of the talk he'll give today. I'm excited because it's super hard to build a transaction processing system, and he's going to try to get into why his is really good. As always, if you have any questions for Joran, please unmute yourself wherever you are and feel free to do this at any time. I'd rather have a conversation with him than have him talking for an hour. Joran, the floor is yours. Thank you for being here. I should say you're in South Africa right now and it's 11:30 p.m. We appreciate you staying up late. Awesome. Thanks, Andy. It's a good day to database tonight. Incredible to be with you all, and I look forward to our discussion questions as we take a tour of TigerBeetle. TigerBeetle is a distributed database, open source under Apache 2, designed to count anything at scale. For example, financial transactions, in-app purchases, API billing, rate limiting, even game scores, page views, or likes. So any kind of balance or business event really, and any kind of count where there's a vector or some quantity of value that's moving between two or more parties. To do this, TigerBeetle gives you double-entry accounting primitives out of the box and a single binary that you can run to have a highly available cluster with mission-critical safety, strict serializability, and performance. With a tightly scoped domain, we've gone deep on the technology to do new things with the whole design of TigerBeetle: our global consensus protocol, our local storage engine, the way we work with the network, disk, and memory, the testing techniques we use, and the guarantees that TigerBeetle gives the operator. For example, one of the most surprising is that there is no dynamic allocation of memory during runtime. So after startup, we don't call malloc or free. Instead, we make the technology exciting so that deploying TigerBeetle can be as boring as Tetris with square boxes. So I promise you some surprises, some treasures new and old. In fact, we wrote a blog post about TigerBeetle's static allocation, and I loved the fun Hacker News comment that said, welcome back to the 70s. To which I reply, it's good to be back. So let's roll up for the magical memory tour, because so many design decisions in TigerBeetle have to do with how we work with memory. How we think about memory explains questions like: why scale up before scale out, why use direct I/O to skip the kernel page cache, why use cache-line-aligned fixed-size data structures, why write a new storage engine instead of using RocksDB or LevelDB, why io_uring is so exciting, why decouple database performance from memory allocation, and my favorite, why Zig. So Andy, are you ready? Go for it. Look, you pay for whatever copyright thing you violate here, it's all you. I just figured I'd ask you the question first, so then we get the Q&A going back and forth. So yeah, let's cross the road, because our first stop is why design a ledger database. And the short answer is that double-entry accounting at scale can be a killer workload. It's a hard problem for a number of reasons. First, you can't sacrifice anything. As engineers, we like trade-offs.
Sometimes you can trade latency for throughput, or strict serializability for availability. But when you're tracking something as valuable as account balances, you have to have both strict serializability and high availability. And you have to have both high throughput and low latency. So if you have a big business, or if your business has strict SLAs, there's nowhere to hide. You need mission-critical safety and performance. You want the system to be easy to operate, to be predictable, and to just work. The second reason is less obvious, and it's that the nature of double entry is contention. If you have a million customers but one bank account, then you have a million debits to different customer accounts. But all million debits must still credit the same bank account. So this interacts badly with row locks. So you see people reaching for sharding. But with double entry, sharding across rows or databases is a showstopper, because then you can't easily execute global transfers across accounts, at least not with predictable performance, and critical business logic becomes complicated. You also might not have enough accounts for sharding to make a difference. For example, for a high-volume payment switch with only three to four banks participating, you might find even a single account approaching the limits of an ad hoc solution. So finally, the hardest part is that you're not only tracking money movements within a system, you're often also tracking money movements between systems with different stacks. You're in the world of distributed systems, doing two-phase commit across ledgers. Now you have four balances — debits and credits, pending and posted — to control liquidity for in-flight transactions as money moves between ledgers (there's a sketch of these four balances just below). But to get in-flight transactions right, your distributed database has to be able to deal with subtle issues like your clock sync protocol getting asymmetrically partitioned. So this is where the clock sync partitions are not aligned with your consensus protocol. So clock sync is broken, but your consensus is running. And then there's the risk that your transactions get timestamped too far into the future, causing money to be locked up. So getting these primitives right to track transfers within and between systems is not easy. The story goes like this. You start with a tried and tested SQL database. It can do hundreds of thousands of transactions per second as raw material. Then you write your ledger logic in the application around this. But when you look at the finished product and how long it took you to build it, you're spending a few thousand dollars a month on hardware only to hit the wall at 100 transactions per second, or, if you're lucky, a thousand. Now you've got 10,000 lines of code around your SQL core, and your financial invariants are no longer enforced or protected by the database. So you're building a ledger database, but you don't know it. And for bonus points, there's no Jepsen report. On top of this, the world is becoming more transactional. We've already seen the cloud moving from hourly to per-second billing. Now APIs need to do usage tracking, games are doing more in-app transactions. Even energy providers are starting to switch energy back and forth into the grid more and more frequently to arbitrage energy prices. The faster you can settle, the more you can serve people, the more money you can make. So volume is increasing across sectors, and it's a hard problem getting harder. How do you solve a problem like this? We wanted a methodology.
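To make those four balances concrete, here is a minimal sketch in Zig of what an account with pending and posted debits and credits might look like. The type and field names are illustrative assumptions, not TigerBeetle's actual schema.

    const std = @import("std");

    // Hypothetical account with the four balances mentioned above:
    // debits and credits, each split into pending (in-flight) and posted (final).
    const Account = struct {
        id: u128,
        ledger: u32,
        debits_pending: u64,
        debits_posted: u64,
        credits_pending: u64,
        credits_posted: u64,

        // Net posted balance, assuming a credit-normal account (an assumption for
        // illustration; real ledgers distinguish debit-normal and credit-normal).
        fn balance_posted(self: Account) i128 {
            return @as(i128, self.credits_posted) - @as(i128, self.debits_posted);
        }
    };

    test "pending funds are reserved before they are posted" {
        var account = Account{
            .id = 1,
            .ledger = 1,
            .debits_pending = 0,
            .debits_posted = 0,
            .credits_pending = 0,
            .credits_posted = 0,
        };
        // Phase one of a two-phase transfer reserves liquidity as pending...
        account.credits_pending += 100;
        // ...and phase two posts it, releasing the reservation.
        account.credits_pending -= 100;
        account.credits_posted += 100;
        try std.testing.expectEqual(@as(i128, 100), account.balance_posted());
    }

The pending columns are what let a ledger control liquidity for in-flight, two-phase transfers without double-spending the same funds.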
We asked ourselves two questions. First, what design will serve our three goals of safety, performance, and predictability, or operator experience? Second, how can we develop the design in a tenth of the time? In other words, velocity. So in terms of design goals: for a new distributed database to make sense, it's quite a big investment. We wanted to increase performance by three orders of magnitude, with 10 times less hardware and 10 times more safety. We wanted TigerBeetle to be so safe that developers start to feel nervous if they're using anything else. So we defined clear fault models for network, storage, memory, and CPU. Crucially, we also adopted NASA's Power of Ten rules for safety-critical code, by Gerard Holzmann. The result is that we now have more than 3,000 assertions in TigerBeetle to ensure that everything either runs correctly or shuts down safely. Based on experience with assertions over the last seven years, even before TigerBeetle, I've come to see that fail fast, fail safe is a good philosophy to have (there's a tiny sketch of this style of assertion below). And it also makes particular sense for a distributed database like TigerBeetle, where assertions can help you detect deadlocks in production while at the same time protecting you from them, since after the replica restarts, it probably won't hit that rare case again. So your system self-heals and you learn from it. And then it gets better for everybody else using it, instead of letting these kinds of bugs lie latent and dormant. So in terms of design, we looked at different architectures. We're pretty lucky here, because right about the same time two years ago, July 2020, when we started TigerBeetle, Martin Thompson of LMAX was speaking about the state of the art in the evolution of financial exchange architectures and the value of seeing these as replicated state machines. An easy way to understand what a replicated state machine is, is to say it backwards: very loosely speaking, it's a machine with some state that you replicate through a log. You replay the log operations in order through the state machine function, which is really a pure function that takes the current state plus the new operation, applies the new operation to the old state, and then returns the new state. And if you do this across all machines in the cluster, then all machines reach the same state. But how do you take this architecture and solve for performance? How do you design a ledger to be a thousand times faster than ad hoc ledgers? And how do you do this on commodity hardware for a tenth of the price? So it's worth motivating why this is important as a goal, because performance is a spectrum you can trade. Performance gives you a margin of safety to cope with growth and avoid high utilization, because that's where Little's Law kicks in and latencies start to skyrocket. You also don't have to cut corners. For example, in some parts of the world, high-throughput ledgers can only keep up by running in volatile memory. If the power goes, the transactions are all lost and have to be recovered from banking partners. So you phone a friend. But if you have a thousand times more performance in your ledger, then you can take the heat and trade this performance for stable storage and a safer system. Performance also buys you cost efficiency. This is surprising, but you can have a smaller hardware footprint and start to survive in challenging environments. For example, a friend of ours benchmarked TigerBeetle achieving 94,000 transactions per second on a Raspberry Pi on an SD card.
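As a tiny illustration of that fail-fast, Power-of-Ten style (my sketch, not TigerBeetle's code): assertions guard preconditions and postconditions, and every loop and batch has an explicit upper bound.

    const std = @import("std");
    const assert = std.debug.assert;

    // Hypothetical helper: apply a batch of credits to a running balance,
    // failing fast rather than continuing in a corrupt state.
    fn apply_credits(balance: u64, credits: []const u64, limit: u64) u64 {
        assert(credits.len <= 8192); // Explicit upper bound on all loops and batches.
        var new_balance = balance;
        for (credits) |credit| {
            assert(credit > 0); // Precondition: a zero-amount credit is a bug upstream.
            new_balance += credit;
        }
        assert(new_balance >= balance); // Postcondition: the balance only grew.
        assert(new_balance <= limit); // Postcondition: stay within the configured limit.
        return new_balance;
    }

If any of these assertions fires, the process shuts down and restarts from a known-good state, rather than quietly continuing with a violated invariant.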
And what I love about the particular SD card they used, and I looked it up, is that it's advertised as having superior write speeds of up to 30 megabytes a second. And I didn't see what the asterisk was for. And it's also waterproof, temperature proof, and x-ray proof. So the motivation for performance is cool stuff like this, to be fast on small, slow hardware. And how do you do this for a replicated state machine? The answer is mechanical sympathy. There are four parts to this. It's the four physical resources of a computer: network, disk, memory, CPU. These four resources have two characteristics in common, latency and bandwidth. And I think the best way to make the CPU go fast is just to think about memory. The best way to think about memory is to think about disk. The best way to think about disk is to think about network. So let's work our way up to disk and memory, starting with the network. And here, in terms of network, when we looked around at existing ledgers, they were built around SQL databases with no double-entry primitives. So what we saw is that the application would make around 10 to 20 physical SQL queries for every one logical financial transaction. We actually looked at an open source payment switch — we did consulting work, we traced all the SQL — and it's pretty tough to optimize. It sounds like a lot of queries, but it needed to make that many. It was about 18. So if you're good, you can really get this down to one physical SQL query per logical financial transaction. Of course, you can even push past that; there's lots of stuff you can do. But then you start to run into the problem of contention, as we saw before, where row locks block access to hot accounts. So we therefore asked the question: how can we go from one database query per financial transaction to one database query per 10,000 financial transactions? Because if you can do that, you've not only made the system a thousand times faster, but also 10 times cheaper. So we call this the payments equation. And what we did with TigerBeetle was to change the way that the application talks to the database. It's a simple idea. Instead of the client sending one financial transaction in a single database query, you send on the order of 10,000 transactions in a single database query. So: one database query, and you've done 10,000. Another database query, and you've done another 10,000. And you can do this because financial transactions are small. They're an ID, the IDs of the debit and credit accounts, and the amount of money moved. If you leave room for user data to link up to external systems or databases, you can think of a financial transaction as two CPU cache lines, or 128 bytes — or, if you're using an Apple M1, it's just one cache line. So we're future proof. But you can pack 10,000 of these transactions together into about a meg and send them over the network to the database. The database can write this to the disk. Disks and networks really like it if you give them bigger pieces of work to do; you get more throughput. And counterintuitively, you also get better latency, because your system doesn't build up queues. So it's like the Eiffel Tower. If you only allow one person into the elevator at a time, then people are going to queue outside the ticket office. But if your elevator can accommodate like 10 people, it makes sense to let those 10 people in at a time to enjoy the view. (A sketch of what a 128-byte transfer and a batch of them might look like follows below.)
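Here is a minimal sketch, again in Zig and again with hypothetical field names rather than TigerBeetle's real wire format, of a 128-byte transfer and the arithmetic for packing a batch of them into roughly a megabyte.

    const std = @import("std");

    // A hypothetical fixed-size, 128-byte transfer: two cache lines on most CPUs,
    // one cache line on an Apple M1. extern struct gives a defined memory layout.
    const Transfer = extern struct {
        id: u128,
        debit_account_id: u128,
        credit_account_id: u128,
        user_data: u128, // Room to link back to an external system or database.
        amount: u64,
        timestamp: u64,
        ledger: u32,
        code: u16,
        flags: u16,
        reserved: [40]u8, // Padding up to exactly two 64-byte cache lines.
    };

    comptime {
        // Fail the build if the layout ever drifts from 128 bytes.
        std.debug.assert(@sizeOf(Transfer) == 128);
    }

    test "a batch on the order of 10,000 transfers fits in about a megabyte" {
        const batch_size = 8190; // e.g. what fits in a fixed 1 MiB message body.
        try std.testing.expect(batch_size * @sizeOf(Transfer) <= (1 << 20));
    }

Because the records are fixed-size and self-describing, a client can lay thousands of them out contiguously and the whole batch travels as one network message and one sequential disk write.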
So we prototyped this design back in July 2020, sketched all the performance components over like a five-hour session on a Sunday afternoon. And after five hours, we realized we could achieve on the order of a million transactions per second on a single CPU core, with scale up before scale out. So kind of like the Frank McSherry paper, "Scalability! But at what COST?" Single-core systems, if you design them right and work with the memory well, are really muscular. So batching of client requests was the big one. Everything is a batch. And it's just a question of whether it's a batch of one or a batch of 10,000. So this kind of changed the way I see interfaces today. Now I see everything as a batch. If you're doing consensus, it's a batch. The only question is whether it's a batch of one or of 10,000. And this idea, I think, is really powerful. If you're designing systems, just make your interface support an array of stuff. That way you can always just put one thing in. But if you've got more, it just works better with the hardware. (There's a small sketch of what such a batch-oriented interface looks like below.) But how do you do state machine replication within the cluster? We've sort of solved the problem of how you get the transactions from the client to the database. How do you now replicate within the cluster? How do you ensure that all machines in the cluster have a totally ordered log that you can run through your state machine? So here, we didn't just want to assume Raft on the basis that it's popular. Instead, we looked around, and we reached for Brian Oki, Barbara Liskov, and James Cowling's Viewstamped Replication, or VSR. If you know Raft, then Raft is in fact, as I was saying to Andy earlier, like a grandchild of VSR. It's a descendant. It's in the lineage. It's a subset of VSR, except that Raft has a leader election, or view change, algorithm that is random, and that dates back to Paxos, which is unfortunate. However, and we'll go into the reason why, the 2012 revision of VSR's view change algorithm is newer. It was written two years before Raft, but Raft kind of missed it. So Raft went with the random view change of the '88–'89 VSR and Paxos, while 2012 was already another iteration by Barbara Liskov and James Cowling. It's not in Raft, but VSR has got it. It's this view change that encodes more information into the protocol, so you get more predictability. And the big thing is that this means you don't have Raft's problem of dueling leaders. So there's also no randomized election timeout to mitigate this, as you would have in Raft. So you get faster fault detection and faster fault isolation with VSR. If the leader fails, you detect it quicker. You can also switch over to the next leader a few tens of milliseconds quicker. So if you care about latency, it's a better view change. It's better for distributed systems. I did an interview with James Cowling, and he gave some fantastic explanations of everything they were thinking that went into this view change. So it's really worth it. Again, you don't have Raft's problem of dueling leaders, so there's no risk that you'll get stuck in another leader election loop, which can happen with Raft. So now, with the concept of a replicated state machine, and with VSR as our champion replication and consensus protocol, we have a global consensus protocol for the replicated log. And we have a local storage engine, which we'll look at a bit later. The local storage engine, obviously, is to store the state that the state machine produces as we run operations through the state machine function. So at this point, we're pretty far along with the design.
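As a tiny illustration of the "everything is a batch" interface idea — a hypothetical signature, not TigerBeetle's client API — the entry point simply takes a slice, so a batch of one and a batch of 10,000 go through the same path.

    const std = @import("std");

    // A minimal stand-in for the 128-byte transfer sketched earlier.
    const Transfer = extern struct { id: u128, amount: u64, padding: u64 };

    const CreateTransfersResult = struct { created: u32 };

    // Hypothetical batch-oriented API: the interface always takes a slice.
    // A single transfer is just a batch of length one.
    fn create_transfers(batch: []const Transfer) CreateTransfersResult {
        std.debug.assert(batch.len <= 8190); // Batches are bounded, never unbounded.
        var created: u32 = 0;
        for (batch) |transfer| {
            std.debug.assert(transfer.id != 0);
            created += 1;
        }
        return .{ .created = created };
    }

    test "a batch of one and a batch of many use the same interface" {
        const one = [_]Transfer{.{ .id = 1, .amount = 10, .padding = 0 }};
        const many = [_]Transfer{
            .{ .id = 2, .amount = 20, .padding = 0 },
            .{ .id = 3, .amount = 30, .padding = 0 },
        };
        try std.testing.expectEqual(@as(u32, 1), create_transfers(&one).created);
        try std.testing.expectEqual(@as(u32, 2), create_transfers(&many).created);
    }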
Still got to code it, right? But we have strict serializability and high availability, thanks to VSR. But you have to be careful with all these things to think of the network not only in terms of latency, which is what you see discussed so often; you have to also think in terms of bandwidth. So you see papers always talking about a star topology for replication. That's where the leader replicates in parallel to all the followers. But this divides your available bandwidth throughput by the number of followers of the leader. So for example, take just a small one-gigabit NIC at the leader with four followers. If you use a star topology, then you can only do 250,000 transfers per second. Obviously, a million transfers is 128 megs, which is roughly what a one-gigabit NIC can push per second. So that's your million divided by four — there's your 250,000. You've also quadrupled your leader's load, and you've made your network load unbalanced. The leader is doing far more work, and this increases the risk of cascading failures. So here, instead, what we use is a ring rather than a star topology for replication. This is where we arrange the cluster machines, or replicas, in a ring. The leader replicates to the next machine, and so on. And then each machine sends a small ack reply directly back to the leader. So there's no RPC here. This is full-on message passing. The leader can send to any machine at once. It doesn't expect a reply; they then send a separate message back. So you get this really interesting routing topology where you can do multipath routing. It's really pretty cool. And that's also interesting with Raft versus VSR. If you look at the papers closely, you'll see Raft assumes request-response, like HTTP. That actually isn't great, I think, for consensus. VSR is full-on message passing, which is a little bit broader. The terms are very similar, obviously, but VSR just guides you. This is how we do it for TigerBeetle. So again, this is kind of like chain replication, which is great for throughput. But then for latency, to handle partitions or machine failures when the ring is broken, you obviously want to blend this a little bit with a star topology. So you retransmit to other machines, either proactively or reactively. Lots of techniques you can use. So what's great with VSR here is that you can also predict who the next leader is going to be. It's not a random view change. So with VSR, you can have your current leader prioritize synchronous replication. Obviously, you're replicating to all the replicas in the cluster, but some of your replication must be synchronous for durability guarantees. Some of it doesn't have to be synchronous; it can be async. So what you do then is you take VSR and you say: right, I know who the next leader is going to be. They need to have everything. So prioritize synchronous replication to the next predicted leader, and then you can do async replication to the others. So this gives you the durability you need, on the machines you need it on most, before you reply back to the client. But how much intelligence do you have in the topology? Or are you just going around the ring? Meaning, if I know another box is on a different switch, and there are three of the boxes on that switch with it, do I just send the message to the leader within that box and then it replicates locally? Are you doing anything like that, or is everyone treated as a first-class citizen?
We don't have that yet, but we've been thinking about it, because what's pretty interesting is, as an operator, if you know that replication is just following the ring — you provision your cluster replicas 0, 1, 2, 3, 4, and they're ordered like that — you can actually just lay your replicas out to match your physical network topology, whatever makes the most sense for replication. Those are features we want to do; it would be pretty cool just to play with this more. We've also got some things where we're doing a little bit of gossip, getting more awareness of faults in the network, so that if you see the ring isn't going to work, you switch to star. This reminds me of NuoDB, which had a similar gossip protocol — I was going to say it's sort of something like this — where, I think, their transaction executors would know what's resident in a cache somewhere, so when you do an update you can invalidate it. I mean, this is a shared-nothing system, right? Yeah, I mean, the state is fully replicated, but we don't gossip the actual in-memory control-plane stuff on each replica; they have their own view of the network. We're not doing full-on gossip with that. No, no, it's more like, if you have an update on the front-end node, then you'd use a gossip protocol to broadcast the update to a logical record on other nodes. You're doing replication, you're going around a ring; theirs was more of a scatter kind of thing. It's a shared-everything system where every node has a complete copy of the data and is just making local copies. Yes, absolutely. We're definitely replicating all the data to every node. And then it's just a question of how you do that efficiently, because if the leader gets it to every machine — there was a paper that came out, I forget the name, but it was basically the idea of how do you share the load of the leader, let's introduce proxies that can do the replication. It's a lot of the same thing, just different ways to do it. But I really like chain replication. It's a fantastic technique. And here, the only problem is chain replication doesn't have consensus. You have to manually reconfigure when the topology breaks. Here you just integrate it with consensus, and you get high throughput, plus you can fix up your topology however you like; it's pretty easy. Yeah, so obviously this is still work in progress, but these are the ideas. The ring is there for sure, and the retry and retransmit is there. And also what was important on the networking side here is just the philosophy that network events from remote machines shouldn't be able to trigger memory allocations in TigerBeetle. If someone sends you a message, it cannot cause a memory allocation. I think that idea is pretty foreign to people, but I think it's so important. How do you build a safe, secure system if someone can just poke you and your database is going to allocate? So with TigerBeetle, we wanted to totally decouple ourselves from external stimuli. So we're going to get into that now. We wanted to decouple performance from memory allocation. So as your performance increases, resource usage must stay the same. I have a question about Raft, if it's okay to interrupt. Sure. Okay. I don't know about Viewstamped Replication, but I'll definitely check it out. My question is about Raft.
So, what I understood from the ring-or-star topology is that the leader does send messages to followers to begin the replication process, but also followers send messages to other followers. So it's not just the leader sending messages. From what I understand of Raft, only the leader is sending messages, because the leader's log becomes the source of truth. So if the leader is getting some transaction, the algorithm is such that — Yes. Yeah. So how are you solving that? So we'll get to that. I've actually got a diagram where we show exactly the difference with Raft's leader — you know, which log do you trust? But the easy answer is just that Raft really is VSR. So all these techniques, you can do them with Raft. You maybe have to rip out that RPC and get to a proper message bus, which would make Raft better, I think. But it doesn't change any of the correctness. Raft is VSR; the whole correctness comes from there. The only unique things with Raft are really the paper presentation, the RPC, everything in a single table, plus the view change, which just comes from Paxos — it's random. I'm just being very reductionist, but I think those are roughly right. I would say also that a bunch of people made Raft implementations in a bunch of programming languages — there was never a canonical libPaxos the way there's a bunch of Raft implementations — and I think that led to the wider adoption of Raft over Paxos. Yes, yes, exactly. And Raft is great — I mean, because VSR is great. So if you love Raft, it's really VSR that you like. It's just that we don't know this is the lineage. We get it from Brian Oki; he pioneered consensus a year ahead of Paxos, and it was all there. It just needed Raft to come along and repackage it in a paper that we could appreciate. I just wish that more people knew about Brian. Another question: does Raft really wait? I was under the impression that a Raft leader can send messages and does not have to wait for responses from the followers. Yeah, quite right. Either way, that also works. Got it. Okay. Thank you. And that's the thing with all these things: you'll see with TigerBeetle what we do. So VSR is obviously the view change, and it's just because that's the historical name for this thing — that's why we call it VSR. But we then take Protocol-Aware Recovery for Consensus-Based Storage, and we take the control protocol from there. We also implement that, and I'm going to show you that just now. So what you see as TigerBeetle's VSR is a whole lot of stuff. And that's the thing with Raft: you can take Raft and then take a whole lot of papers to mix in how you do reconfigurations. And same with Paxos — Multi-Paxos, Raft, VSR, they're all related. Heidi Howard has a great paper showing how you can just mix and match and learn from all these systems. So we take Heidi Howard's Flexible Paxos quorums and we just apply them to VSR, because it's really the same problem: distributed consensus.
So yeah, but again, we just want to decouple all this — the performance from allocations — so that memory usage is constant regardless of throughput. So it's like cgroups, but in your database: as the operator, you tell TigerBeetle how much memory it can use, and that's what it will use. This keeps the operator and the database in control, and it means that TigerBeetle has a limit on all network resources to safely handle overload. We're also thinking about networking in terms of memory, because network bandwidth is starting to overtake per-core memory bandwidth. So often you'll get these DIMMs advertised at, I don't know, 20,000 megabytes a second — 20 gigabytes a second, right? But per-core CPU memory bandwidth is actually closer to around six gigabytes a second, or about 20 for the M1. So if, like most Linux machines, you're dealing with six, there are a lot of networks that are faster than that. So we were really careful to reduce memory copies. We receive from the kernel's TCP receive buffer into one of our statically allocated buffers. And then, because the TigerBeetle client has already done the work of arranging the data in the buffer carefully, it's almost ready to append as-is to the log. So we've thought through all this stuff to really try and minimize memcopies. All we have to do at the leader is literally just tweak the 128-byte header that's in the buffer. And then the leader has to re-checksum the buffer — obviously that's now kind of akin to a memcopy — but there are no further copies beyond this. So you can take a look at the on_request method in the TigerBeetle source, src/vsr/replica.zig; you can see this in action and actually read through the whole VSR protocol, everything I've been speaking about. So let's move on from network to disk. What's nice with the larger batch size that we use for networking is that it also plays better with the disk for sequential write throughput. So you can amortize fsync across thousands of requests for very efficient group commit. It's like group commit on steroids, but the database doesn't have to have any memcopy overhead to construct the group, since this work is offloaded to the client. So you literally call recv from the kernel into a statically allocated buffer, and there's your group commit done. Obviously, these clients are API gateways, so they're big enough, getting enough volume, that they can batch for you. At the same time, when we write to disk or read from disk, we also use direct I/O. And I really love direct I/O. I know Linus doesn't — I love his rants about it — but I love direct I/O and I always have. It's crucial for performance: you have to avoid that expensive memcopy to or from the kernel's page cache. Even more so for safety: you just can't build a safe database today without it, because the kernel's page cache can swallow I/O errors and report dirty pages as clean, in what has become known as the infamous fsyncgate. So as far as we know, direct I/O is really the only way to handle this correctly. But if your page cache is managed by the database, now it's also really cool, because you can get huge speedups by avoiding a copy completely whenever you hit the cache — especially if this is a cache for an LSM tree, where you're doing lots of read amplification. If you were going through the kernel page cache, you'd be doing copies across all the levels in your tree.
Now, if you own your page cache, which is awesome if you're a database, you can then change your cache interface to allow synchronous functions. If you know they're just going to work on the data synchronously, it's pretty safe: you can give them a constant read-only page pointer instead of doing a page copy. So at one point I actually worked out how much throughput we would have lost if we hadn't done this for TigerBeetle, and it was something so scary that I won't repeat it here. But this is why, as far as I understand, it's so important to avoid memcopies: every time you do a copy, it's not just the cost of the copy itself for the CPU — you're also potentially flushing parts of the CPU's L1 and L2 caches, and that de-optimizes other parts of your system. So memory bandwidth across the whole database system is really important. But I think how we see and work with the disk from a safety perspective is where TigerBeetle really starts to diverge from most databases. While we use many of the same performance techniques — what I've shown you is nothing new — we have a stricter storage fault model. And this is because most databases are not really designed to survive bit rot, misdirected reads or writes, or latent sector errors. So you see checksums, but that's really because power loss and torn writes are about as far as most go. So there's some checksumming, but not as much testing. So I thought we could look at a few examples of storage faults, starting with the easiest. So let's try this out, right? Bit rot sounds simple. You just use checksums, right? Everybody agree? Okay. So there's a classic quote from Jeff Bonwick of ZFS about checksums: a block-level checksum only proves that a block is self-consistent; it doesn't prove that it's the right block. Reprising our UPS analogy, we guarantee that the package you received is not damaged; we do not guarantee that it's your package. So that's misdirected writes, which is pretty nasty when it happens — also pretty rare. But much worse is this research from UW-Madison that won a Best Paper at FAST '18, and it showed that many distributed databases have latent correctness bugs in their write-ahead log. I've chatted with engineers about this, and I always ask them this question: okay, so what happens — and I'm going to show you the example now — this is really a latent correctness bug in the write-ahead log for, I don't know, just name a database. And it's even with respect to very simple single-sector faults. It's not even misdirected I/O as in the Bonwick quote, right? So the trouble is that they always interpret a checksum failure in any portion of the write-ahead log as a torn write after power loss, and then they truncate the log from that point on. So the database starts up, reads in its write-ahead log, sees a checksum failure, goes, aha, that was a torn write, that was power loss, right? And then truncates the rest of the log. FoundationDB goes even further, because it knows that sometimes this is a torn write, but sometimes there's what looks like a valid entry further on. So FoundationDB will even zero the rest of the log. They found correctness bugs like that, which is pretty cool. We've got that in TigerBeetle too.
But still, the issue is that you see a checksum mismatch in the write-ahead log, and the database is going to truncate the rest of the log. So I think we're all good on that point, and I think you can see another problem. Disks fail, and they can flip bits in a single sector, regardless of whether your database is still running. Or the operator just shuts it down safely — all good — and you start it up again. But it's possible for this kind of bit rot to occur in the committed portion of the write-ahead log, after the transaction was already acked by the whole cluster. So in this case, most databases would confuse this with power loss again, and they're going to truncate even the committed transactions from that point on, which is really undermining your consensus quorum. This can then propagate through the consensus protocol, and this can cause global cluster data loss. So if you want to read all about this: Protocol-Aware Recovery for Consensus-Based Storage. It's kind of like one of those silent, slow-moving things — like Bob Dylan said, a Slow Train Coming. This paper won Best Paper at FAST 2018, but it's still picking up. It's going to take maybe 10 years, and then everybody's at panic stations. But you read the paper the first time, and it's like, oh, okay. You read it the second time, and you start to see that this is a real disaster. And there's so much other great stuff in there that we're not touching on here. But this is something where you have to change the whole database — how you design your write-ahead log — if you want to fix this. So we follow the paper in this regard, and we store headers for batches in our write-ahead log out of band. We have all the batches in one write-ahead log; this is the actual operation data, and it's also got the header in each batch. Then we take all those little headers, and we've got a small little write-ahead log just for headers. And this enables TigerBeetle to differentiate between torn writes through power loss and bit rot in the middle of the committed log. (There's a rough sketch of this classification below.) There's also a lot more that we do for this. So we kind of took the idea, with our whole storage fault model, and multiplied it out. And the code for this that DJ wrote is something that we're super excited about. We literally used a matrix to enumerate all possible combinations of storage faults against our log recovery logic, to then represent these cases in code and use compile-time verification to check that all cases are handled. So this was going beyond the paper, like I said, and DJ and I spent a ton of calls and correspondence on this. We call these our walk-and-talks — walking and talking through all kinds of failure scenarios. Since we're looking at repairing faults in the write-ahead log here, and coming back to the earlier question, another issue with Raft is something like the RAID 5 problem, where a leader can only be elected, according to Raft, if it has a pristine log. So Raft has this trick that it's only going to elect someone with the most up-to-date log. It sounds great, but it's not great in practice if you have a storage fault model, because there are cases where your cluster can become unavailable prematurely, because Raft has no protocol to repair the leader's log. It always trusts the leader's log, and the followers catch up from that. But what do you do if — just like the RAID 5 issue — you have a cluster of three, and maybe one node is down and the two remaining replicas have sector faults? Then your cluster is done.
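To make the out-of-band header idea concrete, here is a rough, hypothetical sketch — not TigerBeetle's actual recovery code — of how a separate header log lets recovery distinguish a torn write at the tail from bit rot in the committed middle of the log.

    const std = @import("std");

    const Header = extern struct {
        checksum: u128, // Checksum over the batch body.
        op: u64, // Log position of the batch.
        size: u32,
        reserved: u32,
    };

    const Fault = enum { none, torn_write, corruption };

    // Hypothetical classification: the same header is stored twice, once inline
    // with the batch in the main WAL and once in a small header-only WAL.
    fn classify(
        header_from_wal: ?Header, // Header read alongside the batch (null if unreadable).
        header_from_header_wal: ?Header, // Redundant copy from the out-of-band header log.
        body_checksum: u128, // Checksum recomputed over the batch body on disk.
    ) Fault {
        // If the redundant header says this op was written, but the body or inline
        // header no longer matches it, this is corruption of a committed entry:
        // repair it from another replica rather than truncating the log.
        if (header_from_header_wal) |header| {
            const wal_matches = header_from_wal != null and
                header_from_wal.?.checksum == header.checksum;
            if (!wal_matches or body_checksum != header.checksum) return .corruption;
            return .none;
        }
        // No redundant header: the entry was never fully written, so a mismatch at
        // the tail really is a torn write after power loss and may be truncated.
        if (header_from_wal == null) return .torn_write;
        return if (body_checksum == header_from_wal.?.checksum) .none else .torn_write;
    }

The key point is that truncation is only safe in the torn-write case; corruption of an already-acked entry must be repaired from the rest of the cluster instead.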
So for example, in this diagram of the Raft scenario, the black boxes are empty holes in the log. We've got three replicas. And here Raft won't be able to elect a leader; the cluster will remain unavailable, even though there's enough durability for every operation. You can see that for every operation the quorum intersection property is holding: you've got two out of three, two out of three. So TigerBeetle here can elect a new leader, even if the new leader has a faulty log. This is what we get from protocol-aware recovery — it's that control protocol. All you need is for each operation to have enough of a quorum across logs, and the quorum doesn't need to always include the leader. So this means that with TigerBeetle, or with the control protocol, you get higher availability, because you're fully utilizing the durability you're paying for. I just think it's important — it's not a big change. I mean, it's a bit of work, but we should be doing this, because you've got the durability in the protocol. So to give another storage fault example: latent sector errors. This is where you periodically can't read sectors. Maybe you can't write them, but usually it's that you can't read them, because the disk might remap the sector if you can't write it. These are interesting for two reasons. First, they're not so rare. A study by Bairavasundaram et al., an analysis of latent sector errors in disk drives, found that 3.4% of disks exhibit LSEs in a 32-month period. The second reason is that they make the disk look like the network. With these LSEs, you temporarily can't read a sector. Suddenly it's like a network fault, where you might be temporarily partitioned from disk sectors. And this means that to work with the disk properly, you really need to see the disk as a distributed system — and that's just one disk. I mean, this sounds like super theoretical stuff, right? But if you are designing critical file systems, then it makes sense to just do this right, because these LSEs happen. And just one of these means you might not be reading your disk correctly at startup. This is especially true for copy-on-write file systems. You've got copy-on-write trees, they've got root nodes, and you atomically switch out the root node to move everything to a new tree. If you get an LSE, you might not see that new root node, and your file system might start up on an old state. And that's why it's so important: you actually have to use consensus quorums — read and write quorums — as you switch in your root. I'll show you a diagram now of what it looks like. So here you've got two trees, copy-on-write; the little block in the top right of the square is the root sector. Some file systems call this the superblock. So you've got all your copies, and then finally, when you're ready to commit your whole file system state atomically, you write your new superblock, or root sector. And usually what file systems do is write a few copies of this for durability, because it's such a critical sector. Then at startup, they'll look at all the well-known locations for these root sectors and literally just pick the sector with the highest sequence or version number. And I think you can see where we're going, because if you don't use quorums for this, how do you know that you've read all the sectors you're supposed to? You can't just ignore one that you couldn't read, because that might be the latest version.
So you have to have some guarantee that you're reading enough root sectors — that your read quorum intersects with the write quorum from when the trees were switched out. Otherwise, you might temporarily see an old version of the tree as the newest, if some newer root sectors are temporarily partitioned through an LSE. So to understand this better, and to actually see the code for how we solve this with TigerBeetle's superblock — it's quite fun code — you can take a look for yourselves at superblock.zig in the TigerBeetle source. (A simplified sketch of this quorum read follows further below.) But of course, it's not enough to have a write-ahead log. If your state is larger than memory, then you can't keep everything in memory. You need to page your state to disk, and also page it in when you need it. So you need a storage engine. And as with our consensus, I think you can guess what we did: for TigerBeetle, we wrote our own storage engine. We had good reasons — here are the top five. First, we wanted to solve our storage fault model; the existing engines didn't. Second, we wanted to integrate the write-ahead log of the storage engine with that of the replication protocol, so that we could solve, again, that example I showed you, protocol-aware recovery for consensus-based storage. You have to integrate your global consensus protocol with the local storage engine if you want proper distributed database correctness. Third, we wanted to guarantee deterministic storage. Deterministic storage means you can do deterministic simulation testing, which is quite huge — we'll cover that later. But you can also do byte-for-byte verification across replicas that they reach the same state. This is like online verification for production systems, just giving you peace of mind that all the replicas are an exact copy of the whole state. And this deterministic storage also means that if you have sector failures, or you need to recover parts of the state on one of the replicas, you know that the state on all the others is the same. So now you can do much faster recovery, because you've got smaller diffs between machines — because you've got deterministic storage, rather than everyone just haphazardly writing to different places on disk according to the thread scheduler. So all these benefits: online verification, deterministic simulation testing, faster recovery, which you always want to optimize for. The fourth reason was that if we could do it ourselves, we could really focus on memory — that's what we love. We could go for maybe an order of magnitude more efficient memory usage, with static memory allocation. And lastly, there's research like the SILK paper — fantastic research on LSMs — showing that RocksDB and LevelDB have these one-second-plus write stalls. All of a sudden, your client request is going to hit a one or two second stall, just because compaction is not always scheduled or not always incremental. So we wanted to totally eliminate compaction write stalls. We wanted a deterministic compaction schedule in our LSM tree, for extremely tight bounds on latency. This is something we're actively working on now. So if you take a look — the storage engine was recently merged, but this is one of the things that you'll see PRs coming in for soon. The design is all there, and it's pretty interesting implementing it.
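Coming back to the root-sector quorum idea from a moment ago, here is a simplified, hypothetical sketch — the real superblock code in TigerBeetle is considerably more involved — of reading all copies, tolerating the ones you can't read or that fail their checksum, and only trusting the highest sequence number if enough copies are readable.

    const std = @import("std");

    const SuperBlock = struct {
        sequence: u64, // Version number, incremented on every atomic root switch.
        checksum: u128,
    };

    const copies_max = 4; // e.g. four well-known locations on disk (an assumption).
    const quorum_read = 3; // Chosen so the read quorum intersects the write quorum.

    // Hypothetical recovery: copies[i] is null if copy i could not be read
    // (a latent sector error) or failed its checksum.
    fn recover(copies: [copies_max]?SuperBlock) !SuperBlock {
        var readable: u32 = 0;
        var latest: ?SuperBlock = null;
        for (copies) |copy_or_null| {
            const copy = copy_or_null orelse continue;
            readable += 1;
            if (latest == null or copy.sequence > latest.?.sequence) latest = copy;
        }
        // If too few copies are readable, the read quorum may not intersect the
        // write quorum, so the "latest" we see could be stale: refuse to start.
        if (readable < quorum_read) return error.QuorumNotReached;
        return latest.?;
    }

    test "a single unreadable copy does not hide the newest root" {
        const copies = [copies_max]?SuperBlock{
            .{ .sequence = 7, .checksum = 0xA },
            null, // Temporarily partitioned from this sector by an LSE.
            .{ .sequence = 8, .checksum = 0xB },
            .{ .sequence = 8, .checksum = 0xB },
        };
        const latest = try recover(copies);
        try std.testing.expectEqual(@as(u64, 8), latest.sequence);
    }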
So if you want to learn more about the TigerBeetle storage engine, you can also check out our 10-minute lightning talk; you can find it on our YouTube channel. That's the 10-minute version; there's a longer five-hour video that you can also find there, and that one is complete with code walkthroughs and pair programming that Isaac and I did. But another surprise about TigerBeetle, I think, is that when it comes to network and disk, all our I/O with the kernel is without syscalls. Syscalls carry the cost of context switches and cache misses. So instead, we use io_uring to submit I/O and receive I/O completions. Disks and networks have become so fast these days that the cost of the context switch is about the same as the I/O it submits, so you can halve your throughput if you're not careful. It's also why we use a thread-per-core design: in the past, you had to be multi-threaded if you wanted async I/O; these days, it's more efficient with io_uring, where you get to use the kernel's thread pool. And I think that's the best part about io_uring for me: now Linux has a first-class async I/O API, and you can unify all your networking and disk I/O. So shout out to Jens Axboe for being awesome, because with io_uring, it's definitely a good day to database. So io_uring is just one example of how to design for memory, given that network and disk have become so fast. Another is the renewed focus on memory in the gaming industry, the rise of data-oriented design for extreme performance. Andrew Kelley did a fantastic talk at Handmade Seattle last year on this. For io_uring — are you only using it for disk, or are you also using the experimental network support, or no? We use it for disk and for network. The network stuff was actually in from, I think, kernel 5.5 or 5.6, so that's what we support. It's like a year ago. I didn't realize it was in already. Thanks. Yeah, that's pretty good. And there's a ton of new stuff for network, all kinds of things like registered buffers — that's even been in a while — and there are just boatloads going into io_uring, more experiments and stuff. But the basic networking and disk support has been there for some time. So yeah. But I think if you've got data-oriented design for memory, and then, from a safety perspective for memory, you take this all the way, then the rule for static allocation of memory is common. It appears in most safety-critical coding guidelines — again, like NASA's Power of Ten rules for safety-critical code. We've adopted it for TigerBeetle's coding guideline; you can read it in our repo, it's called Tiger Style. And the reason for doing static allocation is simple. If you force your code to live in a fixed, pre-allocated area of memory, you get a deeper understanding of the domain. You're actually thinking about everything; you're thinking about memory. And this not only saves syscalls in the data plane, but makes it easier to verify memory use, protect against fragmentation, and detect deadlocks. So the point is that there is a systems limit to all these things anyway. If you don't plan for them, then you're going to pay for it in ops later, whereas static allocation makes capacity planning for the SRE easier, since all limits are explicit. In other words, for every kind of resource usage in TigerBeetle, large or small, there's a fixed limit that's been pre-calculated. And I think it's also important to understand what we mean by static allocation.
So we don't mean it in the sense of the implementation, as a C compiler would do static allocation in the binary. Rather, it's the systems principle that all memory — indeed, all resource usage — must be thought through and have a realistic limit. This limit should be known at compile time or at program startup. If the user wants to tell you on the command line how much RAM to give the block cache, that's also fine. But when you're actually up and running, at that point there's no more dynamic memory allocation. That's how we see it. (A minimal sketch of this pattern follows below.) And we do this for everything in TigerBeetle — this is not just a userland slab. So there's the memory for the messages for every possible permutation of the consensus protocol; we've worked that all out, and it's pretty light, low footprint. All the memory for the structs for LSM-tree compactions — everything is statically allocated. And then we lay this all out: all the code and data types, everything is laid out by hand, padding minimized, optimized for spatial locality and to reduce cache misses. Because generally, the less memory you use, the less you thrash the cache, the faster you go. So again, there's a misconception that static allocation is wasteful. I haven't seen this to be the case with TigerBeetle. Our storage engine is not only statically allocated, but extremely efficient. It's pretty cool just to look at the code and see how efficient it is. A few hundred megs of RAM — just the bookkeeping structures, excluding bloom filters and caching and all that stuff — are enough to address more than 100 terabytes of storage. And that's a lot more than storage engines that do dynamic memory allocation. So I think it's just that because we think about memory so much, we tend not to waste it. And once you have limits in place, you can then assert and test that these limits are not exceeded. So this is a force multiplier for fuzzing. What you don't limit, you can't test. What you do limit, you can. So if you're fuzzing, now you can detect rare leaks that might otherwise only have been detected in production. So at this point on our tour, we're coming in to land. We've looked at network, disk, memory. We come to the last part of our methodology, which is velocity. How do you develop a distributed database design in a tenth of the time? How do you get the courage you need to write a consensus protocol and a storage engine? It took Postgres almost 30 years to get some of this right. So how do you do this, distributed, in two years instead of 10? And the first part of the answer, at least in terms of development velocity — anyone want to guess? How do you optimize development velocity? You choose a language like... like Zig. Thanks, Andy. Good guess. So yeah, this is Zig. And I think we couldn't have done this so quickly without Zig. The tooling, type safety, compile-time code execution, and compile-time and runtime safety of Zig are an order of magnitude better than C. I also believe that game development is a great business test for a language, and here Zig's ecosystem is blossoming. Zig has an incredible performance culture, a huge shared focus on memory, no hidden allocations. You can handle allocation failure safely — out of memory is something we just have to handle, right? There's just no way we can panic. Zig allows you to do that. The explicit control of memory layout and alignment that Zig gives you is just perfect for direct I/O, much nicer than in C.
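As a minimal sketch of the principle (my illustration, not TigerBeetle's allocator), everything is sized from explicit limits and allocated once at startup; after that, the hot path never calls the allocator.

    const std = @import("std");

    // Explicit limits, known at startup (constants here; they could equally come
    // from the command line before the system starts running).
    const messages_max = 128;
    const message_size_max = 1 << 20; // 1 MiB per message.

    pub fn main() !void {
        // One backing region, created once at startup.
        const backing = try std.heap.page_allocator.alloc(u8, messages_max * message_size_max);
        defer std.heap.page_allocator.free(backing);

        var fba = std.heap.FixedBufferAllocator.init(backing);
        const allocator = fba.allocator();

        // Carve out all message buffers up front...
        var messages: [messages_max][]u8 = undefined;
        for (&messages) |*message| {
            message.* = try allocator.alloc(u8, message_size_max);
        }

        // ...and from here on, the data plane only ever reuses these buffers.
        // A request that arrives when all buffers are busy is rejected or queued
        // within another fixed limit, rather than triggering a malloc.
        std.debug.print("allocated {} buffers of {} bytes, no further allocation\n", .{
            messages_max,
            message_size_max,
        });
    }

Because the limits are explicit constants, they can also be asserted and fuzzed against, which is exactly the force multiplier described above.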
And then you get this rich choice of allocators and test allocators. There's a whole ecosystem of allocators in Zig, a whole culture around different types of allocators. So it seems like a risky bet, picking a young language. But remember, Zig is really like a frontend for LLVM, same as what Rust uses. And in our experience — we were watching Zig for two years before we made the decision — the team are fantastic. Things are early, but the quality is really good. We also realized that our roadmaps would coincide in future in terms of stability: TigerBeetle is going to take time to get to production, and so will Zig. And we wanted to invest for the next 30 years, rather than pay a language tax for the lifetime of TigerBeetle. So I think Zig is the right language for the next 30 years. And, you know, give or take a year or two, that's fine. We'll go with Zig. It's our only dependency, and we're really happy with it being a force multiplier for development velocity. However, most of the time that you invest in a database goes into testing. It's the slow feedback cycle. So how do you amplify test velocity? For example, even with Jepsen, as incredible as Jepsen is: if you want to find a bug that takes three years to manifest, you still need to test for three years. And if you find it, you might not be able to reproduce it on the next run, because Jepsen and most databases are not deterministic. So we literally asked the question: how can we speed up time? How can we find bugs and replay them again and again as we switch on debug logs? And the answer was inspired by FoundationDB — the awesome simulation work they've done — as well as the work that James Cowling and others did at Dropbox. I don't know if you spotted it: James Cowling of Dropbox is also one of the authors of Viewstamped Replication. So that just sold it for us, right? They did some incredible work on simulation testing at Dropbox that was a big part of the inspiration. So we said, well, let's literally speed up time. We took that abstraction of the replicated state machine — networking, stable storage, even the clock source — and made them all deterministic. So you can shim the message passing and the I/O, and you can literally just tick time. You can run a whole cluster of TigerBeetle replicas and clients, but in the same process, as a pure simulation, all from a single 64-bit seed that, whenever you want, you can just drop into Slack to share with your team. The simulator can simulate a whole universe of network faults and latencies, just like Jepsen. But it can also do storage fault injection and simulate storage latencies. And it can do this in a way that is protocol-aware, so that we can actually corrupt all replicas in the cluster, in different places in their logs, and test that TigerBeetle runs smoothly. Time in the simulator speeds up and literally becomes a while-true loop. So we're literally just testing as fast as the CPU can tick time. The simulator also controls the world: it's a state checker that can look into all the replicas, check their state transitions at the instant they take place, and use cryptographic hash chaining to prove causality. I think this is my favorite part of TigerBeetle, just that it's a deterministic distributed database. (A toy sketch of the idea follows below.)
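To give a flavour of what a deterministic simulator's outer loop looks like (a toy sketch, nowhere near the real simulator's scope), everything is driven by one seed and one tick counter, so a failing run can be replayed exactly by re-running the same seed.

    const std = @import("std");

    const Replica = struct {
        commit_count: u64 = 0,

        fn tick(self: *Replica, fault: bool) void {
            // Stand-in for one tick of the real protocol: a fault tick drops the
            // message instead of committing it.
            if (!fault) self.commit_count += 1;
        }
    };

    // Tiny deterministic PRNG (splitmix64) so the whole simulation depends only
    // on the 64-bit seed, never on wall clocks, threads, or real I/O.
    fn next(state: *u64) u64 {
        state.* +%= 0x9E3779B97F4A7C15;
        var z = state.*;
        z = (z ^ (z >> 30)) *% 0xBF58476D1CE4E5B9;
        z = (z ^ (z >> 27)) *% 0x94D049BB133111EB;
        return z ^ (z >> 31);
    }

    fn simulate(seed: u64, ticks: u64) u64 {
        var rng = seed;
        var replicas = [_]Replica{ .{}, .{}, .{} }; // A whole "cluster" in one process.
        var tick: u64 = 0;
        while (tick < ticks) : (tick += 1) {
            for (&replicas) |*replica| {
                // Faults (network drops, storage errors, latency spikes) are all
                // drawn from the seeded PRNG.
                const fault = next(&rng) % 100 < 10; // ~10% injected faults.
                replica.tick(fault);
            }
        }
        return replicas[0].commit_count;
    }

    test "the same seed replays exactly the same history" {
        try std.testing.expectEqual(simulate(0xDEADBEEF, 10_000), simulate(0xDEADBEEF, 10_000));
    }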
In the first three weeks, we were able to find and fix 30 distributed systems bugs in our consensus. I think that's averaging five a day: find and fix, find and fix. And the simulator is also the reason that we felt confident to launch a bug bounty challenge called Viewstamped Replication Made Famous. We pay out up to $8,192 for correctness bugs in TigerBeetle's consensus, within the scope of the bounty. So we're now simulating TigerBeetle's new storage engine. We have some performance regressions there that we're fixing, but things are looking pretty good, and we're looking towards our first production release next year. So the last thing I want to show you is the business logic in TigerBeetle, as we come to the actual state machine. This is really the last stop on our magical memory tour. It's important to understand that everything we've talked about in terms of network, disk, memory, static allocation, and testing is all part of TigerBeetle's VSR library. One day, hopefully, it's going to be an open source library. It's completely abstracted away from your business logic in the state machine. We were intentional about this. When you're in the state machine, you don't see static allocation. This is thanks to Zig comptime. So you get first-class objects, and you also get a really simple programming model where all the business logic is synchronous. For example, here's how TigerBeetle processes a transfer between two accounts, moving some money. It's extremely simple to code business logic like this. You're protected by the consensus protocol, and TigerBeetle is all around you, but your code in the state machine is the simplest possible code. Once a batch of operations is in the log of a quorum of replicas, each replica can then execute it through its state machine function. You take the current state, apply the new operations, return the resulting state, and it goes into the storage engine. And again, here, we've taken care to use fixed-size, cache-line-aligned structs for all account and transfer data types. So when processing a batch like this, there's no deserialization and no memcopies. You've received it from the kernel, you've got it in a buffer, and the state machine logic is literally just iterating through it — no deserialization. So we require little-endian architectures, and then you can do this. And Zig has got fantastic, really fine-grained casting and alignment in the type system for this. (A toy sketch of iterating a batch without deserialization follows below.) So you might be wondering how we can make the business logic synchronous. And the answer is that just before we commit to the state machine, we have a prefetch phase. It looks through all the operations and ensures that all the data dependencies are prefetched — page-faulted, if you like — asynchronously into the database cache. So ideally, it should appear as if the data is all in memory already. "The execution engine shouldn't have to worry about how data is fetched into memory," says Andy Pavlo. This means that by the time the business logic in the state machine gets to run, it's all completely synchronous and simple — nice and easy to reason about. And again, all the network, consensus, disk, storage engine, and static allocation is outside of the state machine, so the state machine doesn't have to know about it. And so you see, we're learning from the professor here, I hope.
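As a toy sketch of that zero-deserialization idea (hypothetical types, and skipping the validation a real system needs), a received buffer of fixed-size, little-endian records can be viewed directly as a slice of structs and iterated synchronously.

    const std = @import("std");

    const Transfer = extern struct {
        id: u64,
        debit_account: u64,
        credit_account: u64,
        amount: u64,
    };

    // Hypothetical state machine step: the buffer received from the network is
    // reinterpreted in place as transfers, with no parse or copy step. This
    // assumes the wire format matches the host layout (little-endian).
    fn commit_batch(buffer: []const u8) u64 {
        std.debug.assert(buffer.len % @sizeOf(Transfer) == 0);
        const transfers = std.mem.bytesAsSlice(Transfer, buffer);
        var total: u64 = 0;
        for (transfers) |transfer| {
            std.debug.assert(transfer.debit_account != transfer.credit_account);
            total += transfer.amount; // Apply the operation to the state here.
        }
        return total;
    }

    test "a batch is applied straight from its wire representation" {
        const batch = [_]Transfer{
            .{ .id = 1, .debit_account = 10, .credit_account = 20, .amount = 100 },
            .{ .id = 2, .debit_account = 20, .credit_account = 30, .amount = 50 },
        };
        const batch_slice: []const Transfer = &batch;
        try std.testing.expectEqual(@as(u64, 150), commit_batch(std.mem.sliceAsBytes(batch_slice)));
    }

The prefetch phase described above is what makes it safe for this loop to be plain synchronous code: by the time it runs, every account it touches is already resident in the cache.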
If you stick to it — if you keep all the complexity out of the state machine — then you can make your state machine generic, and your database can become like a database framework, kind of like a distributed systems Iron Man suit. So you can take TigerBeetle's financial accounting state machine out, and then you can put your own whole new state machine in. And then you get Redis or something, right? You get all the benefit of TigerBeetle, but for your domain. So the late, great Fred Brooks said that the programmer, like the poet, works only slightly removed from pure thought-stuff: he builds his castles in the air, from air, creating by exertion of the imagination; few media of creation are so flexible, so easy to polish and rework, so readily capable of realizing grand conceptual structures. And on behalf of the TigerBeetle team, I'd like to dedicate our talk to Fred and say that if there was ever a dream of a silver bullet, at least for databases, then let it be the pure thought stuff and castles in the air of old, powerful abstractions and static allocation, and new deterministic simulation testing. So we're excited for the future of these things, and hope you'll join us for the journey as the road goes ever on. All right, so, awesome — applause for that. We're a little over time, so maybe one question from the audience. Is Victor still here? Victor has a question in the chat. All right, so Gavin, do you have a question? Yeah, can you hear me? Yes. Okay. First, thanks, Joran, for the io_uring library in Zig. It's fantastic to have that in the standard library. I have a question about the architecture of your storage engine. So you mentioned that you have things set up where you sort of just do the receive into a buffer, right? And that's almost zero copy from the kernel. Do you have it set up where you're doing, like, fixed buffers or registered buffers in your page cache, and you're registering those with io_uring, and you're just sort of sending those across? Yeah, thanks, Gavin. I loved your blog posts on io_uring, so I'm glad you're using it. That's awesome. So there is a copy from the kernel's TCP receive buffer into our buffer in user space. That is a memcopy, but from then on, no more. We're not using registered buffers yet, though — we could, and that would make things more efficient. We just got things up and running. And at the time when we started this, when we did that io_uring work for the Zig standard library, registered buffers, I think, were pretty new. So I wanted to just get it into the standard library. I did the core of what you need for network and disk I/O, and I didn't do a lot of the extra fancy stuff. I figured people would contribute that — and they have, they've contributed that as open source since then. So yeah, it'd be pretty cool to use that, and to talk to you about it. Thanks. Got it. Thank you.