Anyway, as Renee said, I'm Kristof. I'm here to talk about OpenVPN DCO, or data channel offload. Before I forget: the work to implement this on FreeBSD was sponsored by Netgate, so thank you very much, Netgate. They enable me to indulge in some of my favorite hobbies, like having a house and eating food.

So, let's talk about really interesting things, like OpenVPN. What is it? Well, OpenVPN is a VPN. We've all digested this shocking information. It was originally developed by James Yonan, with a first release in 2001, so this is a project with some history. Not as old as FreeBSD, but it has some history. It can do peer-to-peer setups, it can do client/server setups, it can do shared-key, certificate, and username/password-based authentication. It can do all of the things. It runs on really obscure platforms like Windows and Linux, but also on common things like OpenBSD and DragonFly BSD.

Now, obviously I'm talking about it, so there is a problem with it. The problem is that it's slow. Like, really slow. Well, really slow by modern standards. Why is it slow? It's slow because it's a user space process. It does the usual thing that VPN applications used to do: you run an application and it tries to send data somewhere. You're SSH-ing into something, or you're serving something over a web server, whatever it may be. Your application copies the data to the kernel with a send or a write or whatever system call, which gets handed to the tun interface, which hands it straight back to OpenVPN to copy it back down into user space to do the encryption, the signing, the packetizing, all of the things that it needs to do to make it a VPN. And then it copies it back into the kernel so the kernel can send it onwards, hopefully onto the wider internet to where it's going. So you end up having to do a whole bunch of extra copies. You also have to do your cryptography in user space, which means that if you have shiny hardware accelerators like Intel's QAT, you can't really use them.

So what is DCO? Data channel offload: we put the data channel in the kernel. So we get rid of the copy, well, one redundant copy, between user space and kernel space. We also get to use hardware accelerators, like, as I said, QAT. The exception there is that there is some hardware acceleration you can use in user space, which is AES-NI, the bit that you all have on your hardware, not the shiny QAT things. To do this, I've implemented a virtual network interface with the wonderful name if_ovpn. We're going to talk a little bit about how that thing works and about what the exciting problems were along the way, and all the way at the end, to keep you in suspense, I have some performance numbers as well.

So let's all ignore my wonderful graphical skills, but basically what it amounts to is this: now, when your application tries to send something, so it does its write or send or whatever system call, it goes into the kernel, gets routed, and the kernel decides that it needs to go out the VPN. The VPN interface will do the encryption, the hashing, all of the things that it needs, it will make the routing decision, and the packet will go straight out whatever interface you have connected to the internet. There's still a user space OpenVPN process, and it's still very much involved in things, but it doesn't see the data. So there's still an interface between OpenVPN and the kernel bits, but that is a configuration interface, so it is used relatively infrequently. The good thing about this is that it is much faster.
I promise we'll get numbers. The bad things: well, there are some limitations. Most of these limitations are not really inherent in the new design. They're just a result of the OpenVPN project deciding that, hey, we've got 20 years of cruft, let's get rid of some of it when we build this shiny new thing. So you only get AES-GCM or ChaCha20-Poly1305 for encryption. No more RC4 or whatever other old nonsense they still supported. Now, I'm not a cryptographer, so don't look to me for advice on what the best cryptographic algorithm is, but I do generally take the Henry Ford view here: you can have your Model T in any color you like as long as it's black, and you can have your encryption in any algorithm you like as long as you like AES-GCM.

So what do you lose? You lose compression. You lose fragmentation support. You lose the layer 2 support that OpenVPN has, where you can connect things via layer 2 rather than layer 3. You can only do a subnet topology. And you don't get the traffic shaping built into OpenVPN, because OpenVPN can do traffic shaping. You can still shape traffic, you just have to use the usual dummynet and your firewall to do it. As I said, OpenVPN used this as an opportunity for a clean break. There is no protocol change to get DCO, so you can interoperate with a DCO server and a non-DCO client. Your client does need to be OpenVPN 2.4 or greater to actually understand the 2.6 protocol, which is what you really ought to be running anyway.

There are some considerations in the implementation that we'll go through in some further detail. I'm not going to read through the list here, because we'll cover all of them. The first consideration, and it is a limitation on FreeBSD as well, is that you can tunnel OpenVPN on top of UDP, which is what you should be doing, but you can also tunnel it on top of TCP. The reason they support this is because sometimes firewalls do dumb things. As a firewall maintainer, I feel kind of obligated to say that this is a firewall configuration problem, not a firewall implementation problem. You know, it's somebody else's fault. The FreeBSD implementation of DCO is UDP only. There are two reasons for this, and they both boil down to: I am lazy. The first one is, really, we didn't care enough about running it over TCP to spend the effort to implement it. The second one is that we need to hook into the UDP socket on the kernel side, and on FreeBSD there's the very convenient udp_set_kernel_tunneling(), which sets a filter function so that any time a UDP packet comes in on that socket, our new code gets to grab it, look at it, and do things with it. There's no equivalent to that for TCP, so implementing this for TCP would have been more challenging. See the original reason: I am lazy, and we're not going to do any work if we don't have to. Between me and you guys and the wider internet: the people who implemented the Linux side of this did implement TCP, and they later told me that if they'd known how much of a pain it was going to be, they wouldn't have done it. As is only right and proper for software engineers, because we do these things not because they are easy, but because we thought they were going to be easy. So that's an additional limitation. It's a design choice. There's no inherent reason why we couldn't do DCO acceleration for TCP on FreeBSD. It just needs someone to sit down and do the work, which is the reason we don't have a lot of things: someone needs to sit down and do the work.
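To give a rough idea of what that kernel-side hook looks like, here is a minimal sketch of attaching a filter function to a UDP socket with udp_set_kernel_tunneling(). Everything named my_dco_* is made up for illustration, and the exact callback signature should be checked against netinet/udp_var.h for the FreeBSD version you're building against; treat this as a sketch of the mechanism, not the actual if_ovpn code.

```c
#include <sys/param.h>
#include <sys/mbuf.h>
#include <sys/socket.h>
#include <sys/socketvar.h>
#include <netinet/in.h>
#include <netinet/in_pcb.h>
#include <netinet/udp.h>
#include <netinet/udp_var.h>

struct my_dco_softc;	/* hypothetical per-interface state */

/* Hypothetical helpers, standing in for the real parsing/crypto code. */
static bool	my_dco_is_data_packet(struct my_dco_softc *, struct mbuf *, int);
static void	my_dco_input(struct my_dco_softc *, struct mbuf *, int,
		    const struct sockaddr *);

/*
 * Filter callback, called for every UDP datagram arriving on the socket.
 * Returning true means "I took ownership of this mbuf, don't bother user
 * space with it"; returning false lets the packet follow the normal socket
 * path, so the user space OpenVPN process still sees control traffic.
 */
static bool
my_dco_udp_input(struct mbuf *m, int off, struct inpcb *inp,
    const struct sockaddr *sa, void *ctx)
{
	struct my_dco_softc *sc = ctx;

	if (!my_dco_is_data_packet(sc, m, off))
		return (false);		/* control channel: pass through */

	my_dco_input(sc, m, off, sa);	/* data channel: handled in the kernel */
	return (true);
}

/* Attach the filter to the socket whose file descriptor user space handed us. */
static int
my_dco_attach_socket(struct socket *so, struct my_dco_softc *sc)
{
	/*
	 * The third argument is an optional ICMP handler; the last is the
	 * context pointer passed back to the filter on every packet.
	 */
	return (udp_set_kernel_tunneling(so, my_dco_udp_input, NULL, sc));
}
```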
On top of that, the next issue, the next consideration, is the multiplexing. OpenVPN uses a single connection. They've learned the painful lesson of FTP: for the love of God, put everything in a single connection, otherwise every single NAT implementation is going to curse you for decades to come. So it's a single connection, but there are multiple types of data passing through it. You've got control data, which is key negotiation, authentication, that sort of stuff, and you have your data channel. As the name data channel offload implies, we only offload the data channel. So the control data needs to go to user space, and the actual data gets to stay in the kernel.

There's a single socket. The way this works is: when OpenVPN starts and it tries to connect to a server, or it starts and it becomes a server and a client connects to it (it's basically the same implementation), there is a single UDP socket. The user space OpenVPN code will send the data that it needs to send to the server to do the initial connection negotiation and the initial setup. And once it's at the point where it's ready for data to start flowing, it will pass the file descriptor through the ioctl interface that connects the user space process to the kernel driver. It passes the file descriptor into the kernel, the kernel follows a couple of pointers, and it sets up the UDP filter function so that it can catch the data that it's interested in. Any data it doesn't know what to do with, so anything that's not the data path (anything that's a control packet, or for a peer it hasn't seen yet, anything like that), it will just let pass through, and that follows the usual socket data path: the select loop in user space will wake up, it will do a read on the socket, and it will just get the data. Anything that the kernel can handle just doesn't get passed to user space anymore; the kernel does its thing with it. So the tiny little modification that we needed to make to the network stack is that the UDP filter function used to just unconditionally grab the data, and now it can say, I have taken ownership of this data and you don't need to bother user space with it, or it can say, no, I don't want it, let it pass through the usual flow as if this filter function weren't there. That's the small modification we needed to make; other than that, the kernel infrastructure was quite useful and just worked.

Next consideration: the locking design. The whole point of DCO is to make this faster, so you need to be a little bit careful about how you do your locking. If you take a mutex when the data comes in, process it, and then release the mutex afterwards, you're not going to be able to make use of your multicore system, and that's kind of a shame. I chose a relatively simple design, a read-mostly lock. The concept is that any time we need to access configuration and state data (what is my key, what is my remote and destination, stuff like that), we take a read lock so that the configuration can't be modified while we're doing this processing. And the advantage of the read-mostly lock is that you can have multiple readers at the same time.
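As a rough illustration of that locking pattern, here is a minimal sketch using FreeBSD's rmlock(9). The my_dco_* names are invented for illustration; the real if_ovpn code is structured differently, but the shape of the hot path versus the configuration path is the same idea.

```c
#include <sys/param.h>
#include <sys/lock.h>
#include <sys/rmlock.h>
#include <sys/mbuf.h>

struct my_dco_softc {
	struct rmlock	lock;	/* protects keys, peers and other config */
	/* ... keys, peer list, addresses ... */
};

static void
my_dco_init(struct my_dco_softc *sc)
{
	rm_init(&sc->lock, "my_dco");
}

/* Hot path: can run on several cores at once, each holding a read lock. */
static void
my_dco_transmit(struct my_dco_softc *sc, struct mbuf *m)
{
	struct rm_priotracker tracker;

	rm_rlock(&sc->lock, &tracker);
	/* look up the peer and key, encrypt, hand the packet onwards */
	rm_runlock(&sc->lock, &tracker);
}

/* Rare path: configuration change coming in over the ioctl interface. */
static void
my_dco_set_config(struct my_dco_softc *sc)
{
	rm_wlock(&sc->lock);	/* exclusive: briefly stops the data path */
	/* install new keys, add or remove peers, ... */
	rm_wunlock(&sc->lock);
}
```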
So you can have multiple cores processing packets at the same time. When we make configuration changes, we take the write lock, and while we're making a configuration change, the data can't flow. Configuration changes are relatively rare, so mostly that works out really well. There are two exceptions to this. The first is counters: any time a packet comes in or goes out, or we fail to allocate memory or something, we increment a counter. Fortunately, the kernel has a counter framework for this, where we track these numbers per core, and it's only when we actually want to read the numbers that we go and ask every single core, hey, how many packets did you see? How many packets did you see? How many packets did you see? And we add them all up. That way, there's no interaction between cores for counters. The other issue is that the OpenVPN protocol contains a packet ID, a sequence number, and there is replay protection in the protocol: you need to keep track of whether you have seen a given ID or not. That is a lot more difficult to do in this setup, so there's a separate mutex for it, and the replay protection is disabled by default to make things nice and fast.

I've already mentioned the configuration interface between user space and the kernel. I decided to use nvlists there to get an extensible interface. The usual old-style way to do this is you define a structure with several fields, an IP address, ports, whatever you need to pass through, and you copy that structure between user space and the kernel. That's nice and efficient, and it's relatively easy to implement, until you need to extend the configuration interface because you want to do something new, and then it becomes really, really painful. nvlists have a type-length-value encoding, so you can add fields, and if one end of the interface doesn't know the new field, it just ignores it and everything is fine. The Linux implementation uses netlink. FreeBSD has now also grown a netlink subsystem, so it would have been really nice to use that. Unfortunately, at the time I was doing this work, netlink wasn't quite there yet, so we have a difference in implementation there between Linux and FreeBSD, because other than that, the user space side of things is very similar. Fortunately or unfortunately, there's also a Windows implementation, so the user space bits of OpenVPN were always going to have to deal with different communication protocols to the kernel. It's not that big of a deal, but any time you can do the same thing as someone else, it tends to be an advantage.

Another issue is routing. Imagine that you have an OpenVPN server and it has five clients connected to it, and client A wants to send something to client B. The client will send it to the server, and the server then needs to forward that to the correct client. But this is a tunnel, not a single broadcast domain; it can't just go, oh, I'll put it on the OpenVPN interface and it'll turn up at the correct client, because that's not how tunnels work. So essentially we need to do a second routing lookup. Once the network stack has decided that, hey, this packet needs to be routed out of the OpenVPN interface, the OpenVPN code still needs to figure out which of the tunnels, which of the connections it has to clients, it needs to send the packet down. So there's a second routing lookup there. If you're following along in the code, that's ovpn_route_peer().
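As a simplified illustration of that second lookup: each connected client has a peer entry that records, among other things, its address inside the tunnel, and transmit picks the peer whose VPN address matches the packet's destination. The names here are made up and the real ovpn_route_peer() has more to deal with; this is just the idea.

```c
#include <sys/param.h>
#include <sys/queue.h>
#include <netinet/in.h>

/* Hypothetical per-peer state; see if_ovpn.c for the real structures. */
struct my_dco_peer {
	TAILQ_ENTRY(my_dco_peer) entry;
	uint32_t		peerid;		/* ID assigned by user space */
	struct sockaddr_in	remote;		/* outer address: where the UDP packets go */
	struct in_addr		vpn_addr;	/* address inside the tunnel */
};
TAILQ_HEAD(my_dco_peer_list, my_dco_peer);

/*
 * The network stack already decided this packet leaves via the OpenVPN
 * interface; now pick the tunnel, i.e. the peer whose VPN address matches
 * the destination.
 */
static struct my_dco_peer *
my_dco_route_peer(struct my_dco_peer_list *peers, struct in_addr dst)
{
	struct my_dco_peer *peer;

	TAILQ_FOREACH(peer, peers, entry) {
		if (peer->vpn_addr.s_addr == dst.s_addr)
			return (peer);
	}
	return (NULL);	/* no matching client: the packet gets dropped */
}
```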
There's a special case in the code, a shortcut: if there's only one client, the answer is pretty obvious. If anyone's interested, you can go look at the code. Essentially, when we add a new peer to the connection, user space will tell the kernel: here's the file descriptor, this is the ID of the peer, this is the remote IP address and port, but it will also say, and this client has this particular VPN address, the address inside the tunnel. And we route based on that.

The next little problem, the next consideration, is key rotation. Every so often, after X amount of time or X amount of bytes, you want to select a new encryption key. All of the cryptographic complexity is handled in user space, which is wonderful, because cryptography is complicated and I am a man of limited brain. So the negotiation is done by user space, and eventually user space decides that, hey, it's time for you to have a new key, and it tells the kernel that, hey, there's a new key, with the new key command. Fortunately, the OpenVPN packets actually contain a key ID, so any time we receive a packet, we can work out which key we need to use to decrypt it. So the key rotation is: we add a new key, we start using the new key, but we still keep the old one, so if a packet comes in that was encrypted with the old key, we can still decrypt it. And then after a certain amount of time, when hopefully everyone has stopped using the old key, we can delete the old key ID. So ideally, and in practice, you can swap keys, you can rotate your keys, without actually having any traffic interruption.

Oh yes, one of the new features there, well, one of the things that we added, is that the kernel can also trigger a key rotation. Basically, the kernel can tell user space that, hey, it's time to do a key rotation. The reason we needed this is that because the data path is fully in the kernel, user space doesn't actually know how many bytes have gone through the link. There are counters that you can ask for, but user space can't very well sit there and poll the kernel every five seconds: hey, how many bytes have you sent? How many bytes have you sent? So the kernel has the ability to say, hey, I have exhausted about half of the traffic that I'm supposed to use, half of the sequence numbers that I can safely use; user space should probably think about starting a key rotation now. If I remember correctly, that is one of the small features in OpenVPN that was driven from the FreeBSD side of things.

When the project started, when I started looking at OpenVPN DCO, there was already an implementation for Linux that was mostly functional, and one for Windows that was starting to get functional. So most of the features were already there, but while you're doing the implementation, you notice things, like the traffic counters and this key rotation, that need a little bit of polish. Which arguably is an argument you can use with upstream projects: it's good for you to support FreeBSD or OpenBSD or NetBSD or whatever, because the more different platforms you're on, the more you're going to notice these little issues, and it might make your overall product better. I have to say that the OpenVPN people were very pleasant to work with. They were very welcoming of patches; they reviewed them and accepted them very quickly.
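Going back to the key handling for a moment, here is a minimal sketch of the key ID idea: two key slots, with the ID carried in each packet selecting which one to use, so a new key can be installed while the old one is still accepted. Again, the names are invented and the structures are simplified; this is not the actual if_ovpn code or its new-key/delete-key commands.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stddef.h>

struct my_dco_key {
	bool		valid;
	uint8_t		keyid;		/* key ID carried in the packet header */
	uint8_t		material[32];	/* AES-GCM key material (simplified) */
};

struct my_dco_keyslots {
	struct my_dco_key primary;	/* used for sending new packets */
	struct my_dco_key secondary;	/* previous key, kept for receiving */
};

/* Install a freshly negotiated key; keep the old one around for a while. */
static void
my_dco_new_key(struct my_dco_keyslots *slots, const struct my_dco_key *newkey)
{
	slots->secondary = slots->primary;	/* still decrypts old traffic */
	slots->primary = *newkey;		/* encrypt with the new key now */
}

/* On receive, pick the slot matching the key ID found in the packet. */
static struct my_dco_key *
my_dco_find_key(struct my_dco_keyslots *slots, uint8_t keyid)
{
	if (slots->primary.valid && slots->primary.keyid == keyid)
		return (&slots->primary);
	if (slots->secondary.valid && slots->secondary.keyid == keyid)
		return (&slots->secondary);
	return (NULL);	/* unknown key ID: we can't decrypt this packet */
}
```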
Moving on, another consideration is VNET. For those of you not familiar with it, it's also called VIMAGE, network stack virtualization. Basically, it's a FreeBSD jail with its own IP stack. There was no immediate use case for this at Netgate, but I added support for it anyway, because from my experience with pf, it makes automated testing so much easier. You don't need to worry about having multiple hardware instances or spinning up VMs because you want a server and two clients to run an automated test. You can spin up a jail and start an OpenVPN server in it, you spin up another jail and start an OpenVPN client in it, and you can have them chat to each other. It is a wonderful way to do tests. There are some example tests in /usr/tests/sys/net/if_ovpn. I've done talks about this testing framework before, so I won't belabor it too much, but if nothing else, if you don't use it for anything else, VNET makes automated testing so much easier, and you should definitely consider supporting it in whatever other network feature you implement.

I promised you numbers, right? I have numbers. These were performance tests done by a colleague of mine at Netgate, on, obviously, Netgate hardware. The 4100 is a bit of a middle-of-the-line device with an Intel Atom CPU, dual core, 1.8 GHz, with QAT offload. And we've been kind of unfair with the numbers, because user space with if_tun, so that's what OpenVPN used to be before DCO, actually has cryptographic acceleration: it has AES-NI to use in user space. We see 207 megabits through that. If we do DCO in pure software mode, so no cryptographic acceleration, that's DCO with one hand tied behind its back fighting against OpenVPN in user space with both of its hands, and it's still slightly faster. Slightly, but it still wins. If you do DCO with AES-NI, so the fair apples-to-apples comparison, you see 750 megabits versus 200, and that is not a fair fight anymore. If you use QAT, Intel's offload engine, which you can now use because you live in the kernel, you see a little north of a gigabit.

One question, yes. That is a good question: how does that compare to IPsec? I don't have the numbers in front of me. I think this is actually faster than IPsec; I seem to remember that right now OpenVPN DCO is basically the fastest thing you can get for a VPN. But I don't have the numbers in front of me, and that comes with the disclaimer that I might be misremembering. What I am quite confident about, but I also don't have the numbers here, is that this is actually faster than WireGuard in kernel mode. Now, in the interest of fairness, that is because DCO does AES-GCM and WireGuard is ChaCha, and ChaCha is just a slower algorithm. But I tend not to emphasize that point too much: DCO is faster.

We have another question, yes. The question is what the packet size is. I don't know; as I said, I didn't run these numbers. I think this is just straight iperf, so they're going to be 1500-byte packets, I think. I'm pretty sure there's more elaborate information somewhere on the Netgate website. If it's not on the website, there are numbers internally, so if you're interested in speccing out hardware, chat to those people, they can help you. I did some internal testing on amd64 hardware that I have locally, and I saw a performance improvement of about a factor of three.

So now that I've got you all very excited for this, where can you get it? It's in OpenVPN 2.6.0, which was released earlier this year, January 26th.
Don't get OpenVPN 2.6.0, though; get 2.6.6, which was released on the 15th of August, like last month, to get the nice fresh one. Basically, with OpenVPN 2.6 you can get DCO. The kernel side is in FreeBSD 14. This hasn't been backported to 13 or 12; if you want DCO, you need FreeBSD 14, which is coming out any month now, soon. You can also get it on Linux and Windows, but you wouldn't be here if you cared. Let me also advertise my customer a bit more: you can get it in pfSense Plus 22.05, and obviously all more recent versions as well. So it's been deployed in the field with pfSense since the middle of last year already. It mostly works; I get very few complaints. Either the customers can't catch me or it actually works. I'm pretty sure it actually works, because if I remember correctly, I'm actually running it locally to VPN into Netgate.

And I believe I have covered everything now. So thank you very much for being here, for being generous hecklers. If you have any questions, you can get me here, my email address is kp@FreeBSD.org, and I'm on Mastodon as @kp@bsd.network. Feel free to reach out if you have any questions.

When I first presented this at AsiaBSDCon, I had people asking, you know, can we get this on OpenBSD? And the answer is that, you know, if you do the work, yes you can. A slightly less glib answer is that if you look at the FreeBSD implementation, you should be able to rewrite that for OpenBSD. It's not going to be that different, but it's going to be different enough that it'll be a separate implementation. As I said before, I found the OpenVPN people to be very friendly, so if you come to them with patches, if you point them at, hey, we have a DCO driver, it lives over here, and here's a patch to enable it in your software, I'm pretty sure they're going to say, oh, thank you, we'll take that. I had it in my notes somewhere, but I forgot to quote it: the FreeBSD implementation is about 2,500 lines of code. So it's not massive.

Yes, Michael, you had a question. The question is how this performance compares to other operating systems. Again, I don't have the numbers in front of me, so take this with a grain of salt, I might be misremembering, but my recollection is that FreeBSD is the fastest out of all of them. I'd have to look up the numbers to make a strong statement, so take it as: random guy remembers something and he might be wrong, because I know that performance numbers can get quoted and argued about and taken way out of context.

Yes. So this question is: does it require client-side changes? No, none whatsoever. There is no protocol change. You can use this on one end of the connection and not the other end; it is strictly an implementation change. It interoperates: you can have an OpenVPN FreeBSD client doing DCO and a Windows server running the old tap style, and it will still work. You'll get a much smaller performance benefit, obviously, but, you know, we live in a fallen universe and those are just the limitations that we have to live with.

Anyone else? Yes. You mean the top two here? Yes. Well, it's partially context-switching overhead, and it's partially that in the DCO software case we're not using any of the cryptographic acceleration. We're not using AES-NI, whereas the user space line is using AES-NI.
So what I should have done here is have the if_tun numbers without AES-NI as well, because those would have been much lower; I'm guessing here, but you probably would have seen something like 50 to 100 megabits for if_tun without AES-NI. So in some ways this is also an advertisement for AES-NI: it really does help to accelerate your cryptographic operations. Line one and line two are kind of an unfair comparison; you really want to be comparing line one and line three, and then comparing line three and line four is an advertisement for Intel's QAT.

Yes. So the question is why we lost layer 2 support and fragmentation support. Those were policy choices by the OpenVPN project: they chose not to support those in the new system. The motivation for discarding those features is mostly that it makes the kernel implementation, and the interface towards the kernel, simpler, and it was a good opportunity to make a break with the past and get rid of some things that they don't want to support anymore. I don't know if they're really at the point where they want to fully discard layer 2 support, for instance, but right now their policy is very much: if you want layer 2 support, you have to run the slow version. So there's nothing inherent in the DCO concept that says you cannot have layer 2. It's just additional implementation effort, and bugs in the kernel are worse than bugs in user space, so you try to keep your kernel as simple as possible.

Anyone else? If you do have questions later, feel free to find me in the hallway somewhere. I'd say tackle me, but please don't tackle me; just come say, hi, I have a question. All right, thank you very much for coming. Thank you.