Okay, guys, this is the last talk, actually. We are getting into very unsafe and very scary territory. So the last one is going to be: how do you actually interact with something which is completely memory-unsafe, or completely memory-crazy, as RDMA? And this guy here, Andrea, is going to guide us through all these difficulties and corner cases. So a round of applause for Andrea.

All right, folks. Thanks for sticking around, by the way. That's fantastic. Yeah, I'm going to talk about some really, really wildly unsafe stuff. Let me give you some context. I'm Andrea, and I am a PhD student at ETH Zurich. We do systems research of various kinds. And at some point, I ended up playing with high-performance networking: kernel-bypass ways to do better than what we can do with TCP over a kernel. And I stumbled upon RDMA, which is a pretty well-known technique, at least in the area, for doing fast networking without having to rely on the kernel TCP stack. Additionally, RDMA is an interesting piece of technology because it allows writing and reading from memory on a different machine without that machine's CPU being involved. And I'll explain what I mean by that. So here's a bit of an executive summary of what the talk is going to be about: we're going to see how we can use some of Rust's ownership and safety semantics to make RDMA look a little safer than we might expect. And here's the problem. We have hardware, and this hardware has direct access to program memory. So this is a vector you have allocated, and there's some hardware, a network card, that can go and write directly to it without the CPU even knowing about it. This seems to be in conflict with safe Rust, where we are guaranteed, if we're writing safe Rust code, that we don't have data races. A data race, quick refresher, is defined by the Rustonomicon as two or more threads concurrently accessing a location of memory.
One of them is a write, so one thread is writing to this location, and one of them is unsynchronized. This means that essentially we don't know what the result of the operation will be, because we don't know in which order they will apply. In fact, in some cases, we might see a partial write, because we end up reading the memory location midway through the write operation. So, scary stuff. And now we are also talking about not even just two threads, but a thread on a CPU and a piece of hardware, a network controller, that can go and directly access memory independently of the CPU. So we'll see how, if we think about hardware operations as a thread of control, as if there was another thread on the CPU, we can use Rust ownership semantics to make this look a little safer. All right, so there we go: leverage ownership, and this will allow us to prevent data races between CPU and hardware.

All right, let's have a look at what RDMA is. In fact, I'm going to talk specifically about ibverbs, which is a pretty standard interface for RDMA hardware. RDMA stands for, again, remote direct memory access. And ibverbs is the library that lets user-space code interact with RDMA hardware to do remote memory operations. So here's a simple scheme of what's going on. We have two compute nodes. They are connected by maybe an InfiniBand network; even TCP works these days. And there's a program running on both, and a NIC, a network card, connected to each that connects them together. And with RDMA, we can do things like the following. The CPU can tell the NIC: here's a buffer, the shaded area on the left. Please go and grab it from my memory, don't ask me about it anymore, and go and write it into the remote process's memory directly. And as you can see, node two here is the recipient of this operation and has no control over it. Its CPU is not involved at all.
And this is really interesting for some high-performance work, because it means that we can do networking without having to spend any CPU cycles on it. Though, of course, from a Rust program's perspective this is pretty scary: we have some buffers in memory that are being written and read by something that's not even another thread of control. It's just some hardware. And of course the opposite might be true as well. The CPU may say: here's a buffer. It will register it with the NIC and say: here's where you can write data that comes from somewhere else. And we can have an operation that writes, for example, from node two to node one without the CPU being involved in this at all.

All right, so I lied a little bit. Of course, computers are complicated, and there is not just one address space for everything. Every process has its own virtual address space that's mapped to the physical address space, the RAM of the machine. And so what we need is for the NIC, the network card, to know this mapping. We are saying: here's a buffer in our application; it represents some data. We need to tell the network card what physical address in RAM this data corresponds to. So when we set up buffers with an RDMA-enabled NIC, we need to tell it: there's an address in virtual space, and here's the corresponding address in your physical RAM; go ahead and write to it and read from it whenever you want. In addition, we also need a way for the process to communicate with the NIC directly. And ideally, we probably don't want to involve the kernel in this: if you can do it directly from user space, why not? And the NIC can already access physical memory directly; we have established this. So what we do is allocate a queue, which is just a memory region that the process and the NIC agree upon, and enqueue operations in this memory region where the NIC can go and grab them. All right, introduction done.
We have said that we have these mappings between memory regions, buffers in our process, and whatever the network card understands. These are encoded as buffer descriptors, which are essentially just structs that contain some information: the pointers and the length. And if we want to do, for example, a send operation, where we take some piece of memory on the local machine and copy it to a remote machine, we need to figure out a couple of things. First of all, how to request operations. As I said, there's a queue area where we can encode operation requests, and we have two things there: a transmit queue, which holds the operations we have requested the hardware to do, and a completion queue, which is the list of things that have completed. The card, whenever it completes an operation, will enqueue something in that list to let us know that it's done. This is, of course, fully asynchronous, which is also one of the really cool properties of working with a system like this. So let's say we want to send a piece of data. We enqueue the operation request. This will be picked up by the NIC. Maybe we enqueue another one; we can do this asynchronously and don't have to wait for the operations to complete. And then, at its leisurely pace, the RDMA hardware can go and grab it, perform the operation, that is, go and read directly from our memory and write it into the remote machine's memory, and once it's done, it's gonna enqueue a completion. So it's gonna say: yeah, we are done with that. As you can see, we have a pointer to the descriptor that represents the piece of data we're working with.

All right, there is a converse to this, which is: how do we receive data from somebody else? We need to tell our hardware where to put the data that comes in, right? And so what we do is keep a list of what we call posted buffers. These are just areas of memory that we have decided are available for receiving data from somewhere else.
So we say: okay, let's post a couple of buffers, because we are expecting to receive data from somebody else. And whenever the NIC receives information, it doesn't have to consult us anymore, because we already told it where to go and write the data. So the CPU is not involved, again. It's gonna grab the first posted buffer that's available, make the write into our memory, and then post a piece of information on the receive queue. And it's gonna say: okay, that's done.

All right, so we have a send and we have a receive, but that sounds like a roughly normal messaging system. The thing is, it's kind of tricky to use these APIs directly. The ibverbs APIs are C-based and pretty low-level. So Claude in my group has done some fantastic work in wrapping a lot of these abstractions into object-oriented, C++-style code. And his PhD has been on running massive join operations on thousands of cores based on this code. So let's take a look at how Infinity, Claude's library, deals with this. We have the buffer abstraction, a class that represents the fact that we have a descriptor and associated memory with it. We can of course construct a new one. Constructing it just means: hey, allocate a thousand bytes and register them with the card; let the kernel know that we don't want it to move this memory; do all the bookkeeping, essentially. And then once we have that, we can of course go and take a look inside, and this should be terrifying to any self-respecting Rust programmer: we are grabbing a raw pointer to the contents of this memory, and we can write and read from it whenever we want. And if we want to perform operations on the receiver end, we post a receive buffer, which means we are getting ready to receive something, so we need to inform the NIC where to write the data to. And then we just wait for something to show up: we call receive, which will return as soon as something is available.
Sender side, similarly: we instantiate a request token, which essentially represents the fact that we are enqueuing this operation and we need to wait for it to complete. And we call send. And then, through the request token, we can determine when the card notifies us that everything has been done. Cool.

All right, so we have a C++ library. Let's wrap it in Rust. The strategy is a pretty typical one: just use bindgen to generate the mappings. The talk before this one was a good introduction to some of the tricky bits of how to map C++ semantics into Rust. Claude's library doesn't make much use of advanced C++ features, so it was relatively straightforward to work with. Here's a straightforward wrapping of the buffer class in Rust. We just maintain a pointer to the C++ data structure. We can construct a new one, which just calls through to C++. And then maybe we want to be able to read and write to it, so we can grab a mutable reference to the underlying memory as a slice and go and read and write through it. Here I'm showing an implementation of DerefMut; there is also an implementation of Deref that lets us have multiple immutable references to the struct.

All right, so if you've seen that and you think about Rust ownership semantics and data races, that should be a little worrisome. We have a way to construct a buffer, and we can grab a reference to it and read and write through it. But it turns out there's somebody else that can do this too, right? The network card, as soon as we've constructed the buffer and posted it, also has access to this data. Now, that's a data race, isn't it? So here's an idea: what about we think of the NIC as just, you know, something else that can have ownership of Rust memory locations?
And what we can do is this: we start out with Rust owning the descriptor to the buffer, and whenever we are ready to enqueue it and say, okay, we are gonna perform an operation on it, rather than just saying here's a pointer to it, go ahead and enqueue it, we relinquish ownership of this buffer from the Rust side. So we just give it to the NIC, and it's gonna vanish from our Rust world, essentially, for a little bit, until the card tells us it's done, that is, until we are guaranteed that the card is not gonna try to write or read into it anymore. And when that happens, okay, then we get ownership back.

All right, that's the theory; how do you write this? We have the buffer definition, as we've said. Pretty simple. We create an into_raw function that lets us take a buffer, an owned, typed Rust buffer, and make it vanish. As you can see, this call takes self by value, so we are moving the buffer in. Buffer is not Clone, importantly. And whenever we call into_raw, the buffer will disappear from our Rust world. After that, we have from_raw, which is the opposite direction: we are essentially reconstructing a typed buffer in Rust from thin air, from a raw pointer. All right, how do we use that? It looks maybe a little bit terrifying, but at least now we have a clear boundary of when we, as Rust programmers, have control of the buffer and when the rest of the world, the hardware, has control of it. To perform a send operation, for example, we want to be able to call send. QueuePair here is the interface to the NIC. And what we do is have send, which, as you'll notice, is not unsafe. It takes the buffer by value and takes its ownership away, so we are relinquishing the buffer here. And you'll notice that the return type doesn't contain the buffer anymore, so we are losing track of the buffer here.
And we construct this request token, which lets us keep track of what's going on with the operation, but doesn't let us access the buffer at all. And we call the into_raw function that we defined before. So here the buffer is disappearing into NIC nirvana, I guess. All right, done that, we want to return something for the user to be able to get the data back, so we return the request token. Cool.

All right, so we've seen essentially this operation: we enqueue a new operation, which is the equivalent of losing track of the buffer. And then at some point we get a completion. So how do we handle the completion? Once we're done sending the buffer, we don't want to lose control of it forever, because then we'd just leak memory over and over again. Once the operation is done, we want control back. And the way we do this is through this request token, which is just an opaque pointer into C++ land. We call wait_until_completed, which is not gonna return until the operation is completed; there are also non-blocking functions for this. And only once the operation is completed do we get back the buffer. This means that we can now reuse it. So we lost track of it for a little bit, ownership was transferred to the RDMA hardware, but now it's back to us. All right, so we have the request token that we keep track of, and note that the function again takes ownership of self. And then we wait until it's completed and call from_raw, the way we reconstruct a strongly typed Rust buffer. Cool.

All right, how do we use this? Here's a simple example. It's probably not super realistic, but it should give you a vague idea of what the API looks like. We do all the global RDMA initialization, and then let's write a receiver: we construct a new buffer, we post it so that the NIC knows where to write the data, and then we wait for something to show up.
And as you can see, whenever we actually receive something, we get this receive buffer back, which is our buffer, the one we posted before. But importantly, between these two calls, we had lost control of the buffer. The sender is similar, just the other way around. We instantiate a buffer, write some data into it, post it, and call send, and here is where we lose ownership, so we have no risk of writing to it while the card is operating on it. And then we wait until it's completed and get the ownership back. So this is the way we went about this, which I believe is a kind of a cool way to extend the Rust ownership semantics to also include things that are not on the CPU anymore: it's hardware.

Here's a couple of comments on things I came across. As you noticed, the send operation, for example, has this unsafe block. And if you read the Rustonomicon, unsafe marks the piece of code that's unsafe, where we're doing unsafe operations and should be careful. So maybe when we start writing unsafe, we are careful that at the boundaries of this block we make sure that everything is safe again. Another example: the request token's wait_until_completed. There's this unsafe block that separates unsafe land from safe land. Unfortunately, if you look at the unsafe boundary, things are complicated. Whenever we call send, we relinquish the buffer; this is a safe function. Whenever we call wait_until_completed, we get it back. So we know that things are gonna work out, as long as only these two functions exist. But wait, what if another operation gets implemented? Somebody else comes in, doesn't really realize what's going on, and they implement Clone for the request token. This means that we can arbitrarily duplicate the request token as many times as we want and get back as many copies of the same buffer as we want. And now, okay, we are back to the start, right?
Data races galore; in fact, undefined behavior, because of multiple mutable references. Also, I lied: there is another bug in this slide, which is this. I made the function take a mutable reference rather than self by value, and that alone makes everything unsafe again, because it means we can generate arbitrary numbers of buffers out of a single request token. Oops. So what's the deal here? Safety is non-local. All the reasoning we need to do to make something safe, to wrap an unsafe library with safe code, requires global reasoning. We need to think about all the other ways we interact with that piece of data; we cannot just think about it from a single point of view. The Rustonomicon gives a fantastic explanation of why this happens. It's really well done and really complete, with a good example. But here's the summary. To be able to write safe abstractions, we introduce invariants, for example, that something is only owned by us or by the NIC. And then we rely on these invariants to write nice-to-use, safe Rust code. Unfortunately, safety depends on all of the invariants that we have introduced anywhere in our unsafe and safe Rust code base. So, what can we do about this? We cannot really keep track of all the things that happen in the whole program. So use ownership and privacy, that is, private members, to limit the scope these invariants apply to. The only thing that hands out a buffer in our API is this from_raw function, and the only entity that can go and do things with the buffer is our wrapper. If we had other entry points to this, we would be in trouble, because we would need to think about all of them every time we made any API change. So try to limit the scope of your invariants. And, yeah, I think that's pretty much it. A cool way to wrap a C++, unsafe-galore RDMA library, and Rust helped us do it in a kind of nice way. And as a final word, I'm just gonna remind all of you that the Rustonomicon is fantastic.
The quality of the explanation has really improved over the years, and I strongly recommend reading it cover to cover if you're doing any unsafe Rust. Right, that's it; time for questions. Thank you.

Right, the question is whether the underlying C++ library is asynchronous. Yes, it's really very much designed to be. Posting an operation is itself a synchronous operation, because it's very cheap, but the operation itself will happen asynchronously under the covers; in fact, you're not even aware of it. So in that sense, it's really asynchronous. You have seen blocking calls to wait for things; that's not how you would actually write an application. You typically check what has shown up since the last time you looked, in an asynchronous fashion, a bit like how a non-blocking socket might look. It really depends on the application design, then.

All right, that's a very good question. So: do we really need to make the buffer completely disappear from safe Rust when we give it out, given that if we keep it private, nobody has access to it anyway? That's fair. That was a choice made mostly out of convenience in interacting with the C++ API, but I guess the strong point here is: you can totally do what you just said, as long as you think about precisely where your boundary is, and as long as your boundary is consistent among all the places where you do this. Yeah, that's a totally good point. So what you're saying is that doing that would allow us to fix one of the problems, the mutable reference to the buffer. If the underlying FFI buffer type wasn't Clone, then that would provide a guarantee for us. I believe it is by default, unfortunately, but that's a very good point. The bindgen bindings are good, but they require a lot of care in determining which of the things bindgen thinks are safe in fact are. So by default, I would recommend not relying on bindgen's idea of what's safe in the code it has generated.
It will probably try to generate a Clone implementation for you, even though it's totally unsafe to do so. All right, very good question. Yeah, let's go. So the question is: I've done seemingly really dangerous things; how do we go about testing them, and what tools could we use for this? And, I guess, word of warning: I'm a researcher. I have an engineering background, so I've done industry for a few years, but now I'm a researcher, which means that my interest in testing is limited to saving me time, rather than making things completely and super stable in production. That said, I think various forms of deterministic but randomized testing are really good for something like this. If you can introduce random waits, or random thread interleavings between you and the card, and see what happens, that definitely helps. There's the deterministic crate, for example, that does a good job at this. Otherwise, yeah, it's really tricky; I don't think I have a good answer on what I would use to actually prove the safety of this. There are really cool research-level approaches to doing composable safety proofs of complex Rust programs that use wild unsafe features. They are not easy to use yet, so I think that's possibly one of the areas where I'm excited to see what academia comes up with next. There's really cool work out of my university that will let you write proofs in Rust itself, because Rust already gives you so much that doing formal verification is a lot easier now, because Rust is so cool.

Right, the question is: is that TLA+? It isn't. You could use TLA+ and try to prove that something like this is safe, but you would then have to translate it to Rust, which is always potentially error-prone. The work I've been talking about is specifically for the Rust language; it's a language embedded in Rust, so you literally write proofs in Rust, which is really awesome.
So, if you're interested in verification and things like this, have a look. It's really cool. All right, thank you very much. Oh, all right: how do I handle the drop of the token? Right, good question. Drop safety is hard; read about it. It's harder than you think if you haven't read about it before, or at least it was for me. Dropping a token, for me, isn't a problem. Leaking is not one of the safety properties of Rust; there's no guarantee that you will not leak. So dropping a token for us just means leaking the buffer. It's undesirable behavior, so it's one of those cases where you might reach for something like #[must_use], but it's not strictly unsafe, which is a bit of a cop-out of an answer. There you go. More time? No, okay. Thank you very much. Sorry about that; I'm happy to talk more offline.