All right, welcome. Is everyone sufficiently caffeinated? So, as Bill said, I'm Jacob, and I'll be your captain today during this rather fast-paced journey through the Linux API for splicing sockets together that we call Sockmap. So fasten your seat belts and let's go.
Who am I? Why am I here talking to you? I work at Cloudflare on the Linux OS team, where we deal with rolling out fresh kernels into production. That's my one role. I also have a second role, which is more important to this talk: I am a co-maintainer of this Linux API, or feature, called Sockmap.
A couple of things about this talk. It will be good if you have a little bit of networking knowledge and you know the basics of BPF. It will definitely be easier to follow if you know a little bit about how containers are built and what technologies we use there. What do we aim to achieve today? Well, I hope that after this talk you'll know that Sockmap exists and you'll be able to say whether it fits your use case, whether it could be applicable to your workloads. And if so, you'll also feel ready and know where to look to learn more.
Our agenda for today is pretty packed. We're gonna start by taking a look at what Sockmap can do for us. Then we're gonna actually find out what Sockmap is, because at this point we don't know yet. Then we're gonna see how to set it up. After that, since Sockmap operates on sockets, we're gonna look at how you get your sockets into Sockmap. Then we're gonna go over the configurations that we can set up with Sockmap to splice sockets, and we're gonna finish it off with some real-life use cases.
So what can Sockmap do for you? Or actually, what can Sockmap do for container networking? Well, say you have two network namespaces in Linux and two processes, each in its own network namespace, that would like to talk to each other over sockets, over the network, kind of as if you had two pods in Kubernetes that are on the same node, right?
Then, in order for these processes to exchange messages, a message needs to go all the way down through the network stack to the virtual networking layer, probably a veth pair or a veth with some routing in between, and then up the whole network stack again to reach the destination process. So there is some overhead to that. What Sockmap can offer here is a bypass, where we create a shortcut path between the two sockets. So the messages flow straight from the input socket, the source socket, into the target, the destination socket. That way we save on some of the overhead of passing through the network stack.
So let's actually see how that looks in action. We can set it up ourselves without too much effort. If you have a Linux VM or a machine running Linux, we will need two network namespaces. We need to put a virtual link between them, a veth pair. We bring our interfaces up, so now we have L2 networking between these network namespaces. We assign some addresses, so we now have L3 networking between them. In one of the network namespaces, we're gonna run a TCP server that's just gonna be listening for incoming connections. In the other one, we're gonna run a TCP client that's gonna be sending requests and waiting for a response from the server. So it's kind of like a request-response benchmark simulating an RPC workload, and our benchmark tool is gonna measure the latency of a request-response round trip. So this is our baseline. What I got when I ran it on my laptop was something like 5.8 microseconds.
Now we're gonna repeat the same experiment, but this time using the Sockmap bypass, right? To set that up, first we need to load a couple of BPF programs and create a BPF map. We're gonna talk about these programs and maps later. Then we need to do a little bit of setup. We actually need to attach one of the programs to the BPF map. And we need to do one more thing.
We need to create a cgroup so that we'll be able to get our hands on the sockets and configure the splicing. Now one more thing: to get our hands on the sockets, we also need to attach another BPF program to that cgroup. We're gonna talk about that later as well. So now that we have this setup, we move our shell into the cgroup, so all child processes are also gonna be started inside that cgroup. And once again, we start our TCP server in one network namespace, we run the client in the other, and we measure the latency. For this setup, I got something like 4.7 microseconds on average. So there was less code in the network path, and the latency went down. I mean, if you're running an RPC workload, a double-digit percentage improvement without modifying the application, yeah, I would consider it perhaps. And you also don't have to believe me. Maybe I made a mistake running this benchmark. You can grab it and run it yourself; the code is on GitHub.
Now that you know what Sockmap can do, let's talk about what Sockmap actually is. What do we mean by that? When we say Sockmap, I actually mean two things. First of all, it's a collection, a data type implemented inside the Linux kernel for holding references to socket objects. The data type is exposed to your programs, to user space, as a BPF map. BPF maps, as we know, are key-value stores that user space programs and BPF programs can use to exchange data. And as I mentioned, it holds references to sockets, but these are weak references, which means that Sockmap doesn't keep the socket objects alive.
The second meaning behind Sockmap is that it's an API with which we can enforce policy, which is just a fancy term for filtering packets, and redirect, or steer, messages and packets between two sockets. These policies and redirect rules are expressed as BPF programs, which we're gonna learn more about later.
And these BPF programs, in order to be activated, need to be hooked somewhere into the socket layer.
All right, we have a rough idea of what Sockmap is, so now let's see how we set it all up. To splice two sockets together, we first of all need the sockets, right? These need to be connected, or in other words established, sockets that we're gonna be operating on. It doesn't matter whether we operate on sockets that have been actively opened, when we're making an outgoing connection from our process, or on sockets that have been passively opened, when we're just accepting an incoming connection. What is important is that it is a connected socket. We have a socket, we're gonna leave it there. What kinds of sockets can we operate on? Well, not every kind of socket, but the ones that are probably most used in network workloads, and even some of the more exotic ones like vsock, which is used to communicate between the hypervisor and the guest VM.
Right, once we have a socket, we need to create our BPF map that is gonna be populated with references to the sockets. We create it just like any other BPF map, either directly using the bpf() syscall or, more probably, using a wrapper library of your choice, like ebpf-go, which Cilium uses, or maybe libbpf, the reference implementation of a BPF wrapper library. Or you can create it from the command line as well; that's what I use for experimenting.
Now, we actually have two flavors, two variants of this Sockmap container. One is called Sockmap, which is keyed by an integer value, so that's a pretty simple type. We have a number, and that maps to a socket. The other one, which is a little bit more flexible, is called Sockhash, where your key is an arbitrary binary data blob that maps to a socket reference. The data blob can be anything you want. It can be a string, it can be a serialized 4-tuple, yeah.
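To make the two flavors concrete, here is a minimal user-space sketch that creates either map type through the raw bpf() syscall. The helper name `create_sock_map` and the `struct conn_key` layout are hypothetical, chosen only for illustration; the call typically needs elevated privileges and returns -1 otherwise.

```c
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/bpf.h>

/* Hypothetical Sockhash key: an IPv4 connection 4-tuple.
 * Any fixed-size binary blob would work as a Sockhash key. */
struct conn_key {
    __u32 saddr;
    __u32 daddr;
    __u16 sport;
    __u16 dport;
};

/* Create a Sockmap or Sockhash with the bpf(BPF_MAP_CREATE) command.
 * Returns a map file descriptor, or -1 with errno set on failure. */
static int create_sock_map(enum bpf_map_type type, __u32 key_size,
                           __u32 max_entries)
{
    union bpf_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.map_type = type;
    attr.key_size = key_size;          /* must be 4 for BPF_MAP_TYPE_SOCKMAP */
    attr.value_size = sizeof(__u64);   /* socket maps take 4- or 8-byte values */
    attr.max_entries = max_entries;

    return (int)syscall(__NR_bpf, BPF_MAP_CREATE, &attr, sizeof(attr));
}
```

A caller would use `create_sock_map(BPF_MAP_TYPE_SOCKMAP, sizeof(__u32), 2)` for the integer-keyed variant and `create_sock_map(BPF_MAP_TYPE_SOCKHASH, sizeof(struct conn_key), 1024)` for the blob-keyed one. In practice a wrapper library such as libbpf would hide these details.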
One thing to note: these types should not be confused with something called SockArray, which also exists in the Linux kernel.
All right, so we have a socket and we have our BPF map. Then we also need a BPF program that will express our redirect logic. We load the program using the bpf() syscall, and there are two kinds of programs that we're gonna use for splicing sockets with Sockmap. We're gonna see later which one we use in which situation. That BPF program is actually gonna use the Sockmap we have created, because that's how the Sockmap BPF programs select the sockets when redirecting packets. Also, once loaded, BPF programs need to be activated, and we activate BPF programs by attaching them somewhere. In this case, we attach our BPF programs to the Sockmap.
Okay, once we have all that set up, we can finally insert our socket into our BPF map, and we can do that with the bpf() syscall's map update element command (BPF_MAP_UPDATE_ELEM). One thing to note, very important: insert sockets into Sockmap only once you have attached your BPF programs to it. Otherwise, your redirect logic won't operate on your sockets.
So let's look a little bit more at the last step, right? How do we actually get sockets into the Sockmap? In the simplest case, we have a user space process that owns both the BPF map, the Sockmap, and a socket. This is the simple scenario, where your process just makes a bpf() syscall and inserts a reference to a socket into the Sockmap. But this is usually not the case in real life. Usually, there will be one process that has access to your BPF map and another process that has access to the sockets on which you actually wanna operate. What can you do then? Well, you can resort to something like a voluntary socket handover, where you have these two processes talking over a Unix socket, and one process sends a file descriptor for your connected socket over to the other.
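The voluntary handover described above relies on SCM_RIGHTS ancillary messages over a Unix socket. A minimal sketch, with the hypothetical helper names `send_fd` and `recv_fd`, might look like this:

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Pass file descriptor `fd` over the connected Unix socket `chan`.
 * Returns 0 on success, -1 on error. */
static int send_fd(int chan, int fd)
{
    char byte = 'F';  /* one byte of real data travels with the descriptor */
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union {
        char buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align;  /* ensures correct alignment of buf */
    } ctrl;
    struct msghdr msg = {0};
    struct cmsghdr *cmsg;

    memset(&ctrl, 0, sizeof(ctrl));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctrl.buf;
    msg.msg_controllen = sizeof(ctrl.buf);

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(chan, &msg, 0) == 1 ? 0 : -1;
}

/* Receive a file descriptor sent with send_fd().
 * Returns the new descriptor, or -1 on error. */
static int recv_fd(int chan)
{
    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union {
        char buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align;
    } ctrl;
    struct msghdr msg = {0};
    struct cmsghdr *cmsg;
    int fd;

    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctrl.buf;
    msg.msg_controllen = sizeof(ctrl.buf);

    if (recvmsg(chan, &msg, 0) != 1)
        return -1;

    cmsg = CMSG_FIRSTHDR(&msg);
    if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
        cmsg->cmsg_type != SCM_RIGHTS ||
        cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
        return -1;

    memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}
```

The process that owns the connected socket would call `send_fd()`, and the process that owns the Sockmap would call `recv_fd()` and then insert the received descriptor into the map.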
That, of course, requires you to be able to modify both processes. If you don't have that possibility, then you can also resort to stealing file descriptors for sockets from another process, that is, getting a duplicate file descriptor using the pidfd_getfd syscall, which has been in Linux for some time. However, it's a privileged operation.
Finally, there's one more way to get your hands on the sockets and insert them into Sockmap. This is what we have seen in use in our demo in the beginning. You can attach a special BPF program type called SockOps to a control group. This program will be called whenever a connected socket is created, either passively or actively, and it will be able to take this socket object and put it into Sockmap.
All right, we have a little bit of an idea of how we set it up. Now let's look at the scenarios in which we can actually redirect data from one socket to another. There are four different redirect scenarios in which we can set up Sockmap.
The first one we're gonna call send to local. This is the scenario that we have already seen, where two local processes communicate with each other. We have an input socket to which one process writes, and then the data is redirected and is received, or read out, from another socket. If you're familiar with other mechanisms for splicing sockets, this is similar to creating a socket pair, or if we move away from sockets, it's like a pipe. How do we set it up with Sockmap? Well, we'll have a BPF program in between that will be doing the redirecting, and this will be an SK_MSG BPF program that we attach to the SK_MSG verdict hook. That program will call a helper, which will select a socket, our output socket. In addition, we need to set an ingress flag on the helper. So this is what a sample program like that looks like. Yep, it will likely have some logic that makes a decision about which socket to select, and then it will call the helper I mentioned.
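The slide itself is not part of this transcript. As a rough sketch of a send-to-local program, assuming libbpf's headers and a build with clang's BPF target, it could look like the following. The map name `sock_map`, the program name, and the choice to keep the peer socket at index 0 are all hypothetical.

```c
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* Hypothetical Sockmap holding the sockets we splice together. */
struct {
    __uint(type, BPF_MAP_TYPE_SOCKMAP);
    __uint(max_entries, 2);
    __type(key, __u32);
    __type(value, __u64);
} sock_map SEC(".maps");

/* SK_MSG verdict program: runs when a socket in sock_map sends a message. */
SEC("sk_msg")
int redirect_to_local(struct sk_msg_md *msg)
{
    /* Assumption: the receiving socket sits at index 0. A real program
     * would derive the key from the message metadata, such as ports. */
    __u32 peer_key = 0;

    /* BPF_F_INGRESS queues the message on the peer's receive path
     * instead of sending it down the network stack. */
    return bpf_msg_redirect_map(msg, &sock_map, peer_key, BPF_F_INGRESS);
}

char LICENSE[] SEC("license") = "GPL";
```

The helper returns SK_PASS when the redirect succeeds and SK_DROP otherwise, so its result can be returned directly as the verdict. For a Sockhash, `bpf_msg_redirect_hash()` plays the same role with a pointer to the key instead of an integer index.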
We cannot redirect from any kind to any kind in this configuration, just from TCP sockets to any other socket kind but vsock.
Then a different configuration is send to egress. In this configuration, one process sends a message from a socket, but we actually don't pass it down the network stack. Instead, we redirect it to another socket and send it out, or pass it down to the network stack, from that other socket. So this is kind of like splicing from a pipe to a socket. In this scenario we again use an SK_MSG BPF program to do our work. We set it up the same way as before. The only difference is that we don't set the ingress flag this time when we call our redirect helper. This configuration is really limited. We can only use it TCP to TCP. Yeah, I guess it could be extended further. We just haven't had any use cases for that.
Now, something completely different. A whole different family of redirect scenarios, starting with ingress to egress. In what situation does that happen? This is useful in a situation where you have an L7 network proxy, a proxy that simply reads from one socket and writes the data out to another socket, and it doesn't touch or modify the data that it pushes from one socket to another. Examples of such programs would be the socat utility or systemd-socket-proxyd. What Sockmap can offer here is an offload mechanism. Instead of invoking your process every time you need to copy data from one socket to another, you can set up Sockmap so that the copying happens within the kernel. Right, so as I mentioned, the data comes in from the network stack into the socket receive queue, we redirect it to our output socket, and it's passed out through the network stack again. So it would be like splicing from a socket through a pipe to another socket.
In this scenario, we need to use the other kind of BPF program I mentioned, the SK_SKB program, because this time, if you're a little bit familiar with the network internals of Linux, we're operating on packets, not on messages, hence the name. We attach it to a different kind of hook dedicated to this type of program, and similarly we also call a BPF helper to choose the target socket. This is the most versatile configuration that we support today. You can redirect from any kind of socket to any other.
Finally, the last redirect setup. When data comes in, when a packet comes in from the network stack, we receive it in one socket, but then we don't read it out from that socket. Instead, we redirect it and read it out from another socket. So it's like splicing from a socket to a pipe. Again, it's the same kind of program as before, but because we're going with the data to user space, we now need to set the ingress flag when invoking our redirect helper. Here we also have almost full flexibility, any to any but vsock. If you know what vsock is for, redirecting to vsock and reading it out doesn't really make sense, so that's why it's not supported.
All right, I know it's a lot and it's a little bit complicated. So here's a cheat sheet that you can go back to if you ever need to set it up yourself.
All right, so let's finish it off with a quick look at some real-life use cases. Right, this is Cilium + eBPF Day, so let's look at how Cilium used to use splicing with Sockmap. Cilium had this setup where they would do transparent L7 filtering. We would have an application running in one container and sending out data, and what Cilium could do was redirect these messages so they are instead read out by a transparent proxy.
The proxy then looks at the message and makes a decision whether we should pass it or maybe drop it. If the decision is pass, then we're gonna send it back down, but through the same socket that the application is using, because that socket may have some additional setup. So can you maybe spot the redirection patterns that we saw here? Well, the first one is send to local, where our application sends a message from a socket but it's received by our proxy process, while the other one is send to egress, where we send from one socket but the message actually egresses into the network stack from a different socket.
Another real-life use case that we know about comes from ByteDance, the company behind TikTok. What they had was a client-server architecture where the client and the server were communicating over Unix sockets, and they needed to migrate it to a setup where the server keeps running on the host while the clients have been moved into VMs. It didn't make sense for them to rewrite all the clients to switch away from Unix sockets, because that was too much effort. So instead they decided to build a bridge, a proxy that would bridge the Unix socket clients with a server that is now on the host and so has to use some different network protocol to communicate. For that, they used Sockmap to offload the proxy work into the kernel so that it has less overhead. Then they actually iterated on this design, and they even managed to avoid the overhead of virtual networking, vNIC networking from the guest VM to the host, by adding support for vsock to Sockmap. So using Sockmap, they managed to build this exotic proxy that splices together Unix sockets and vsock sockets. And I think you can probably spot the pattern here, because it's just ingress-to-egress redirection, but in both directions.
So if that fits your use case, where do you go to learn more? I definitely recommend the documentation in the kernel.
It links to some test code that shows more advanced uses of this API. You should also definitely check out Daniel and John's talk about how you can use Sockmap together with kTLS. If you want a different point of view on the L7 proxy use case, check out my colleague Marek's work. Sockmap can also be used for completely different stuff, steering connections to sockets. I have another talk about that, so if you're interested, give it a look.
You can reach out to me on social media. I'll be here till Friday, so feel free to chat me up. I also hang out on some networking mailing lists and on the Cilium Slack. The slides and code are on GitHub. So thank you for your attention.