Hey folks, so this is going to be a talk about two projects that I've been working on. One's a pretty new one that I basically started a couple months ago, and that's going to be around io_uring, which is a new kernel interface that, if you're working on Linux, is basically going to completely replace all of your thread-pool-based IO as well as epoll.

So the recent project I've been working on is called Rio, and that's an io_uring library. io_uring is extremely exciting; in my opinion it's the most exciting thing that's happened in the Linux kernel in the last 10 years. And I will be applying it to my database, sled, which I've been working on for the last four years. sled is an embedded database for the Rust community. Let's get started.

So, who am I? I've basically been working on stateful systems in the Rust ecosystem for the last six years or so. My first Rust project was the RocksDB bindings that everybody's now using, in like every blockchain project. It's kind of scary, but be careful what you build: you never know how it's going to be used. Sometimes these things are good; sometimes, you know, be careful about market externalities. I previously worked at a bunch of social media companies and infrastructure companies in the US, working on container orchestrators, distributed systems, distributed databases, some serverless infrastructure, dataflow things. I basically like low-level infrastructure and building things that other people use to build whatever they want with. And yeah, for fun, I also like to build and destroy distributed databases. Also, I teach Rust workshops.
I've taught workshops at Mozilla, Microsoft, BMW. If you need a Rust workshop, I'm happy to give you a three-day workshop or a specialized one, beginner or advanced. And I'm unemployable, so...

So, I like databases because they basically involve working on all these kinds of low-level problems where, if you have managers who hired you to do them, they're unlikely to tell you how to do your job. I like front-end work too, but if you do it for your job, there are a lot of people who kind of tell you what to do. I think I got driven to distributed systems and databases because when you get into these weird areas, there are just fewer people to tell you how to do your job, and that appeals to me. And I got hooked.

So, some interesting techniques in database engineering. Lock-free programming: it's like juggling chainsaws, basically having shared mutable state that you don't use mutexes to regulate. Lots of replication and consensus, distributed systems stuff, lots of race conditions that you have to test for. Yeah, you have to do a lot of correctness testing: we can only build things whose important characteristics we can actually test, so that also puts a lot of restrictions on how we architect our systems. And I'm also pretty interested in self-tuning systems. Basically, the more that people have to configure a system, the more likely they are to misconfigure the system, and I love tuning things a lot. So databases are kind of all these things combined, and that's what brought me here.

So, I started this project sled about four years ago to basically have a personal project where I could just apply all these things that I like thinking about, and to implement these papers that I was reading. It's a pretty simple API.
It basically looks like a BTreeMap, except you never need a mutable reference to it. You can open it as if it were a file; you can insert things, get things, iterate, remove things. When you drop it, it fsyncs, and it also fsyncs periodically. It basically acts like a BTreeMap from bytes to bytes, and it also handles all the multi-threading issues, and the reliability issues if you crash. So it's basically just a better map.

I think that Rust is the best language for building databases in, for a number of reasons, one of which often gets overlooked. I mean, everyone always talks about safety and unsafety, but one of the things that appeals to me, as someone who is more than willing to build unsafe things, is the fact that Rust has a lot more information at compile time than C and C++ around aliasing. This is something that will actually allow Rust to approach Fortran performance over time, as we unlock more and more LLVM features around it. I mean, there's a reason we're still using 50-year-old Fortran libraries in all of our linear algebra stacks: it's because Fortran is able to optimize much more aggressively than C and C++. It's difficult to get that kind of performance in C and C++, but Rust is able to get the same kinds of optimizations as we can get in Fortran. So strictly from a performance standpoint, Rust is a superior tool. Correctness? Yeah, blah blah blah, everyone always talks about unsafety. I write a lot of unsafe code.
I build lock-free databases. I try to do it in a responsible way, with Miri where possible and lots of other tools. And anyway, I can still use the C and C++ performance and debugging tools. Another big thing, though, is that I can accept other people's code really cheaply. Whereas if I'm working in a C or C++ codebase, it takes a lot more energy to review code, because you have to keep a lot more things in your mind, you have to familiarize yourself with more of the code that they're touching, and it's just much more expensive to accept code from other people. So for what is essentially a one-person project with a few awesome contributors, it's a really nice language for not spending too much energy on the normal maintainership stuff.

Okay, some other things about sled. It's pretty fast to compile. When you just want a database, you don't want to bring in a dependency and then have hundreds of new dependencies that slow down your project. So it's pretty small, it compiles pretty quickly, and it should just let you solve your database problems without too much fuss.

One thing that has been really critical (so it's good that there was a profiling talk right before this one): I lean really heavily on different kinds of profiling. When I build things, I like to build profilers into the systems themselves, so that I can basically trace where a big number is coming from; I can see through the system where things are starting to slow down, where bottlenecks appear. So when I make changes in my code, I can see how they affect pretty much the whole stack. And this works on any system; it doesn't rely on perf or anything. And it can be turned off through a compile-time flag. So this is a technique for performance-critical systems: just having a profiler built in is a wonderful thing to give yourself. I also work a lot with flame graphs.
I wrote a tool, available here, that basically makes flame graphs easier to use, so you don't have to mess with Perl scripts or anything, and it works pretty well. So if you do a lot of flame graph work, this is a very nice tool: you basically just type `flamegraph my-workload` and it'll spit out something like this.

Anyway, how fast is it? Very fast. It can do over 17 million operations per second on a 95-percent-read, 5-percent-write workload, which is somewhat representative of most transactional workloads. So it's very fast, but it's definitely still beta, so don't store your primary data in it yet. There's an SRE proverb: never use a database that's less than five years old; you will always regret that decision, no matter what. So, lucky for you, sled turns five this year. This is an exciting year for the project. Now is when things are really starting to get more and more productionized, and now is when I'm starting to get on stages and tell people about it, because the next year is really exciting.

Cool, how does it work? It basically has a lock-free index that maps from keys to locations on disk or in memory; this is how we look things up. It's loosely based on the Microsoft Bw-tree. It has a page cache that translates things in memory into the on-disk representation, also lock-free, and based on another Microsoft project called LLAMA. It uses log-structured storage, which plays very nicely with modern flash. That's based on Sprite LFS, a pretty ancient log-structured file system, but it works pretty well, because it basically gives us a nice clean framework for doing garbage collection over segments, and we tend to write large segments at a time. And of course we use io_uring for writing these huge buffers; this is going to be the second half of the talk, and I'll cover it in much more detail. This is also exported as its own crate, rio.
I squatted that name for like six years, and only now do I actually have a really good use case for it. I'm very excited to be publishing this. And we also have the cache, based on windowed TinyLFU, which is basically what all of the Java ecosystem relies on for its high-performance caching needs: any database that's written in Java is, for the most part, going to be using the Caffeine library, and this is a technique that it relies on. This will soon be exported as the bear kind crate.

So, we need to avoid blocking wherever possible, because we want to support lots of threads. If a thread just wants to read some things, ideally it shouldn't have to acquire a reader lock first, because, you know, maybe a bunch of writers are arriving at the same time as the readers. I won't focus too much on this, because this is basically a 20-minute lightning talk, but the important thing is that anything should be able to do what it wants to do without blocking. Otherwise you're going to start seeing huge latency spikes, and those will backlash through your system.

So, to set a value for a key: it's organized as a tree structure, so we go from the root node to the responsible leaf, and then we mutate that leaf. But it's important how we do this, because it cannot block; yeah, we don't want that latency, that's bad. So we use a technique called RCU:
read, copy, update. This is a technique that gets used all over the place, including in OS kernels. The high-level idea of what happens here: we read an atomic pointer; we make a local copy of the data that it points to; we change the copy in the way that represents our mutation; and then we attempt to install our mutation by using an atomic compare-and-swap operation on that atomic pointer. If we fail, we basically retry and go back to step one. But if we succeed, at some point we want to destroy the old data, and the way we do that safely is with the crossbeam-epoch crate. The real problem here is that if we immediately freed the data that we just made inaccessible, other threads might still hold a reference to it. So we have to delay the destruction of that data until all threads that might have witnessed it have finished their work, and that's what we use crossbeam-epoch for.

So readers don't wait for writers: they just read that atomic pointer and get their data. And writers proceed optimistically: they read, copy, update, and if they fail, they retry. The trade-off is that instead of taking out a mutex, we just go ahead, and the bet we're making is that retries will cost us less than the price we would pay for acquiring a mutex on every update. If we have to start retrying a lot, then it makes sense to start dynamically taking out a mutex instead. And we can actually detect this at runtime: we can measure how often we experience contention, and only start acquiring mutexes for writes when we're in a high-contention situation. So this is an example of an auto-tuning system: you measure contention as you experience it, switch over to using locks when they prevent contention, or when they prevent wasted work, but usually just proceed optimistically, and it goes really fast.

However, this is just the in-memory part. We're building a database, and ideally the things that we do in memory are
mapped to things on disk. The key constraint here is that the ordering in memory has to match the ordering on disk. So if we have a bunch of threads doing read-copy-update, and then after they do the read-copy-update they log to disk, it's possible to have a race condition here. Imagine one thread tries to delete a key and another thread tries to change its value, and let's say that the delete should happen before the mutation. The deletion gets through here: it deletes the old value. Then the mutation does the same dance and installs a new value. But then there's a race where they log their updates out of order. If we then crash and recover the database, even though we mutated things in one order in memory, we recover something that no longer has the data. That's data loss, so we basically can't have this.

So how do we make the things on disk match the order in memory, even though we're not using any locks? All of this has to be lock-free, but we still have to have our order on disk match the order in memory. So what we do is we use a
So we try to reserve a slot at the end of the log That can later on either be canceled or filled in with data and the important part here is that that Reservation at the tip of the log Happens after we did our read but before we did our compare and swap operation Later on when we know if our compare and swap failed or succeeded If it failed we basically can fill that conditional log reservation and with like zeros But if our compare and swap succeeded it means that the log reservation also happened at a point that is linearizable and At that point we can actually fill in the mutation on disk such that it matches the order in memory So even though we don't have any locks We're still able to basically have things lined up And this is a really interesting technique that you can use for basically Linearizing all kinds of things without locks you basically have like some conditional slot here that gets taken out between the read and the operation That's that's actually installing a new version if that operation succeeds you do the conditional thing so sled also supports reactive subscription and Basically the way that that works is it also takes out like a subscriber a possible subscription Slot here and it only actually fills it in after the compare and swap So this is how we keep things on disk matched up with in memory Cool. How do we get fast IO? One of the main things is we just write things eight megabytes at a time all writes are non-blocking Even though our write functions are not marked as async functions. They're async They basically just queue things up into an eight megabyte buffer and later on we write the whole thing at once We also support out-of-order writes There's there's no head of line blocking at recovery time. 
We figure out the right order. And of course we use io_uring, which I'll start talking about now.

So, io_uring is basically an interface for fully asynchronous communication with the kernel. What we were used to before this: we have to do a syscall any time we want to do a read or a write or an epoll wait, so we're basically going to need to do a lot more context switches, and people have already talked at length about some of the reasons why we might want to avoid those, so I won't go too much into it. But how did io_uring actually come about? It started with a much more modest goal, which was to improve the old AIO interface. Linux actually already had async disk operations, but it was very restricted: it only worked for files that were opened in O_DIRECT mode, which many databases did do, but other people tend not to want to use O_DIRECT. O_DIRECT basically skips the operating system's page cache, so you don't have to pay the costs of the page cache, but then you don't get your reads cached in memory either, and, importantly, you have to do all of your IO operations based on the disk's block size. Most people don't want to do all their reads and writes aligned to 512-byte blocks; it's a lot of effort to add to everybody's program, and the kernel normally does this for you under the hood. But you had to do it if you wanted to use AIO. The effect was that nobody actually used AIO; everyone just used thread pools that would do blocking IO under the hood while exposing an async interface, but that's much more work and much slower. So io_uring began initially as an effort just to replace that, but in effect it's much more ambitious.

So, in the first version of io_uring we've basically got vectored reads, vectored writes, fsyncs, and the ability to poll file descriptors and work with eventfd devices. The very top row is the Linux version, so 5.1 came out
with support for the basic file operations. But as you can see, over time we're starting to get many more file operations, and then also network ones: connect, accept, send message, receive message. And this is the reason why it's also going to start replacing epoll. We've been using epoll for our network IO, but this is going to let us just submit operations to the kernel, and it's going to execute them asynchronously. It has a number of advantages that I'm going to cover now.

So the way it works is with two ring buffers. There's a submission ring buffer, which we basically just load up with those operations that we saw before. They take different arguments depending on the operation: fsync just requires a file descriptor, but a vectored read would require a buffer to put the read into, as well as where in the file you want to read from, et cetera. So the submission queue is just those operations that you want to submit. They get executed out of order, and this is a way to dramatically increase throughput. Then the results of those operations get put into a completion ring buffer. Interestingly, after you set this thing up, you can run it with zero syscalls. There's a mode that you can configure it with called SQPOLL, and this basically creates a kernel thread that polls the submission queue, looks at those operations that you just submitted, executes them asynchronously, and then fills the results into the completion queue, which you can also read without doing any syscalls. This is all mmapped and shared between user space and kernel space, so we can now do all of those operations with zero syscalls. This is increasingly important in the post-KPTI, Spectre-and-Meltdown-mitigation world, where our syscalls became
a lot slower, especially if you're doing a ton of them. Now we don't need to do any of them. Cool.

The rio crate is pretty simple. You just create a Rio instance and do, say, a write_at, where you pass a buffer and a file descriptor. An important API design question here is: does it work with async, or does it work with threads? This is pretty much the first question anyone asks when they see a Rust library: can I use this thing at all, or would I have to completely change my architecture to take advantage of it? Rio doesn't force you to make that decision. It returns a concrete object that just happens to implement Future. So you can just call wait on the completion: you can send off a bunch of operations from a thread, batch up the completions, and then later wait on them, just like you would if you were spinning up a bunch of threads that you join on later. So you can do a similar kind of pattern with this blocking wait method. Or you can just .await it, because it's also a Future. By returning concrete futures that work both ways, we don't force as many API constraints on our users. I think everyone's kind of tired of seeing libraries that they wish they could use, but that target the other side of the async/threads divide; we can target both.

We can also do accepts. So here's a simple proxy; well, in this case it's really just an echo service, because we're running this proxy method on the same stream.
We read a bunch of bytes, and then we write the same thing right back. So this is just a simple example of how you might describe an echo service; you can also do this in other ways, but it works with network stuff.

Operations are executed out of order by default. However, you can chain them together by setting a flag that links them, and what this does is basically allow you to specify that the kernel should not begin the next operation until the previous one has finished. This lets you do things like submit a bunch of writes to a file and then link that to an fsync. You submit all of these to the kernel, and the kernel will just do all of the writes, and after the writes finish, it will do an fsync. Then you really only have to look at the completion for that fsync: you've basically submitted a whole topology of IO operations to the kernel, and it just tells you after they're all done. It's a really beautiful, low-interaction way to communicate with the kernel. You can also chain, say, a connect call to some service, to sending a bunch of bytes, and then chain that to a receive. Just by using these links, you submit a whole client operation to the kernel, and you just get the completion back when it has sent you a response. It's really beautiful.
You can also attach timeouts to all of these.

So, if you're working on top of a kernel, programming languages are effectively just DSLs for orchestrating syscalls. You know, a program takes input, and programs are useful based on their output, if you're only looking at the program rather than its effect on your life. But really, as long as we're not working in the embedded world (or even in the embedded world to some extent, if you're using a real-time operating system), we are using programming languages to interact with the world around us, and part of the reason I'm so excited about io_uring is that it totally changes this conversation. We're able to just submit topologies of interesting dependent operations to the kernel; the kernel does them, and then we find out later how they went. It's sort of like splitting things so that user space is the control plane and the kernel is the data plane. I don't know, I'm just really, really excited about this whole change. It's really cool.

Interestingly, one possible direction this could evolve in is integrating BPF as a way to execute a little bit of logic in between chained calls. For example, we can accept a socket, where we don't yet know what the file descriptor of that new client is going to be, but we can use BPF to then do a read on that same file descriptor and write some stuff back to it. So with BPF, we're going to be able to do even more interesting things without any interaction between user space and kernel space. It's an amazing system.

Okay. Lots of people have been getting extremely good results with it. Everybody loves it; anyone who's doing extremely high-performance systems right now.
They are getting ridiculously cool results. And yeah, if you're interested, try out Rio for io_uring: if you have a Linux kernel 5.1 and up, you can start to use some of the operations. And sled is for everybody, if you need to store some things. So these are the projects that I'm excited to be talking about today. We have some cool results that I already talked about, and we want to do a lot more things, so if you want to help out: I'm on GitHub Sponsors. As I said, I'm unemployable, so I'm really trying to just work on open source right now through donations, and I'm trying to keep making useful things for other people. Also, if you want to talk about distributed systems, come to our Discord channel; it's actually pretty active. And yeah, if you're interested in helping out, I love talking about this with other people, so come and join us. I also do Rust trainings. Thanks!