Hi, I'm Max. This talk is about high-speed traffic encryption on x86-64 with Snabb. I'm an open-source hacker and I've been working on the Snabb project since 2014; some of you may have heard of it, and I'll go into it a bit later in this talk. I also do consulting on software networking in user space, protocols, software optimization, et cetera.

For the last couple of years I've been working on a project called Vita. Vita is a high-performance site-to-site VPN gateway. It's fully open source and it's hackable; by hackable I mean that it has a very small and hopefully easy-to-understand codebase. It runs on generic x86-64 server CPUs and Linux.

Vita is based on Snabb. Snabb is a toolkit for writing fast networking applications in user space. This mode of operation is also referred to as kernel bypass; you've all heard of it, I guess, but basically it means the data path completely avoids the Linux kernel. Snabb applications, including Vita, and, this is the important bit, Snabb itself, are written in a high-level programming language called Lua. This is possible thanks to a super fast implementation of Lua called LuaJIT, of which we have a fork called RaptorJIT that specifically targets heavy-duty server applications. I'm giving a talk about RaptorJIT tomorrow in the minimalistic languages devroom, in case you're interested.

By the way, Vita was funded through the NLnet Foundation. They are really, really cool, and I suggest you check them out if you need funding for any open-source project.

So let me start by showing you some typical Snabb code. Snabb programs are divided into modules that we call apps, which have a number of input and output links, and which basically process packets in a loop. This example shows how to read packets while the input link is not empty, check whether their time-to-live has expired, and, if it has not expired, forward the packet to the output link.
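That app loop can be sketched roughly like this. To be clear, this is an illustration, not Vita's actual code: the helpers `empty`, `receive` and `transmit` mirror the names of Snabb's link API, but here they are stand-ins over plain Lua tables so the sketch is self-contained.

```lua
-- Minimal stand-ins for Snabb-style links: FIFO queues over Lua tables.
local function new_link () return {head = 1, tail = 1} end
local function empty (l) return l.head == l.tail end
local function receive (l)
   local p = l[l.head]; l[l.head] = nil; l.head = l.head + 1; return p
end
local function transmit (l, p) l[l.tail] = p; l.tail = l.tail + 1 end

-- An app that decrements TTL and forwards packets while their TTL lasts,
-- in the spirit of the Snabb app loop described above.
local CheckTTL = {}
CheckTTL.__index = CheckTTL

function CheckTTL.new ()
   return setmetatable(
      {input = new_link(), output = new_link(), time_exceeded = new_link()},
      CheckTTL
   )
end

function CheckTTL:push ()
   while not empty(self.input) do
      local p = receive(self.input)
      p.ttl = p.ttl - 1
      if p.ttl > 0 then
         transmit(self.output, p)        -- still alive: forward it
      else
         transmit(self.time_exceeded, p) -- expired: hand off for ICMP handling
      end
   end
end
```

In real Snabb code the links are ring buffers provided by the engine, packets are C structs rather than Lua tables, and the engine calls each app's push method on every iteration of its main loop.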
If the TTL has expired, we transmit the packet on the time_exceeded link, where it will be received by another app that handles ICMP, for example.

So what does high performance mean in this context? At the moment Vita terminates three million packets per second on a single CPU core, which translates to about five gigabits of IMIX traffic per core. These numbers are full duplex, so this is actually six million packets being processed per second on a core: three million being encapsulated and three million being decapsulated. The medium-term performance goal for Vita is to be able to do 100 G on a generic x86 server that you can buy off the shelf. At five gigabits per core, that is on the order of twenty cores, so here I'm betting on increasing core counts, obviously, and I'm thinking that maybe a Zen 2 with 64 cores might just be able to do it.

So how does Vita do it? In Snabb land we like to write software that is both fast and simple: we think that simple designs translate to efficient designs, and we don't think that fast programs need to be complex. We also like to avoid vendor lock-in wherever possible. For Vita, this means we avoid extensions such as Intel QuickAssist or proprietary crypto cards to get the performance, and rely on x86-64 and its commonly supported architecture extensions. Commonly supported here means that more than one vendor produces CPUs that can do that stuff.

Vita's most obvious CPU hog is, obviously, encrypting and decrypting packets, basically crunching numbers. For that we rely on two x86 extensions: AES-NI and AVX2. AES-NI provides CPU instructions to do a round of AES, basically, and AVX2 is a SIMD extension; you've probably heard of it, SIMD stands for single instruction, multiple data, which you heard about two talks ago. This code snippet shows how we use a dynamic assembler called DynASM, which ships with LuaJIT, to implement AES-GCM using the mentioned instruction set extensions. For
route lookups using longest-prefix match, we use an optimized Poptrie implementation. Again, here we use DynASM to generate the lookup routine, but everything else about this implementation is written in high-level Lua: we have a high-level Lua implementation of all the surrounding code, and then the lookup routine, the part that actually needs to be fast, is generated at runtime. The reason DynASM is cool for this sort of stuff is that it lets you generate code based on algorithm parameters, and even on CPU features, at runtime. So we can say: you want to use this key size for your LPM lookup? We'll just generate an assembly routine that does that lookup really fast. Both the Poptrie and AES-GCM implementations are upstream for you to reuse; with Snabb we maintain a library of all this stuff which you can basically plug and play with.

We also wrote a simple and fast IPsec ESP implementation in Lua. ESP stands for Encapsulating Security Payload; that's kind of the standard IPsec encapsulation format. Here I'm showing how to represent packet headers as C structs in Lua code using the foreign function interface; in Lua you can then access these as if they were native Lua objects, meaning object-dot-field-member.

Another thing I thought was cool is the compiler for pcap filter expressions, the tcpdump language for matching packets. There's an implementation of that language called pflua, developed by Igalia, which is included in Snabb, and it extends the pcap filter language to be able to express match-action pipelines. This is another example of runtime code generation, which is really prevalent in Snabb: we have some DSL and we compile it to native code, or, for example, I've recently written an eBPF backend for it, to be able to run those filters as XDP programs. Either way, I feel this is a really robust way of writing this sort of program.
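To make the code generation idea concrete without pulling in pflua or DynASM, here is a toy version of the same technique: take a tiny filter description and compile it into a specialized Lua function at runtime. The "DSL" here is invented for illustration; pflua compiles real pcap expressions, and DynASM emits machine code rather than Lua source.

```lua
-- Toy "filter compiler": turns a list of field/value constraints into
-- specialized Lua source, then compiles that source into a function.
-- pflua and DynASM apply the same idea to pcap expressions and to
-- machine code, respectively.
local function compile_filter (constraints)
   local checks = {}
   for _, c in ipairs(constraints) do
      -- Each constraint becomes a hard-coded comparison in the generated code.
      checks[#checks + 1] = ("p.%s == %d"):format(c.field, c.value)
   end
   local src = "return function (p) return "
      .. table.concat(checks, " and ") .. " end"
   -- loadstring on Lua 5.1/LuaJIT, load elsewhere.
   return assert((loadstring or load)(src))()
end
```

The generated function contains only the comparisons this particular filter needs, nothing interpretive; pflua goes much further, parsing real pcap syntax and letting LuaJIT trace-compile the generated Lua down to machine code.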
For one thing, you don't make mistakes doing bit-level poking at packets, and for another, it's efficient, since the filter is compiled into code specialized for exactly this expression rather than interpreted by a general-purpose program. There's still a lot of optimization potential left unclaimed here: for example, we could compile these expressions using SIMD. That's currently not done, but it's very much feasible.

Now, the way security associations, which are flows, basically, work in ESP presents some constraints with regard to parallelization. For security reasons, every packet transmitted over a security association has a unique, monotonically increasing sequence number. So if you want to distribute the work of processing one security association, that's one flow, across more than one core, you run into a problem where you end up having to synchronize the cores in one way or another in order not to reorder packets. This is a known issue in implementing IPsec, and there are papers written on this topic. And we really do want to use multiple cores, because doing three million packets per second on one core is nice, but it only makes sense if you can scale it out in some way.
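To illustrate the constraint: each security association carries a single monotonically increasing counter, so per-packet sequence numbering is inherently serial per SA. This is a minimal sketch in plain Lua with invented names; real ESP maintains a sliding anti-replay window (RFC 4303) rather than just the last sequence number seen.

```lua
-- One security association, reduced to its ordering state: the sender
-- assigns the next sequence number, the receiver rejects anything not
-- strictly newer. Two cores sharing one SA would have to synchronize
-- on `seq`, which is the problem Vita sidesteps by giving each core
-- its own SAs.
local function new_sa ()
   return {seq = 0, last_seen = 0}
end

local function next_seq (sa)      -- sender side: serial by construction
   sa.seq = sa.seq + 1
   return sa.seq
end

local function accept (sa, seq)   -- receiver side, grossly simplified
   if seq > sa.last_seen then
      sa.last_seen = seq
      return true
   end
   return false                   -- replayed or reordered: drop
end
```

Because `next_seq` mutates shared state, splitting one SA across cores means either locking or out-of-order sequence numbers; the scale-out design described next avoids both.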
For Vita, we decided to sidestep that issue rather completely, by imitating a scale-out architecture internally. That is, we pretend that every core is its own node with its own address, and do the scaling inside the program the way a network engineer might do it in a network, instead of trying funky Intel CPU core synchronization tricks, which always end up being complex and slow. At this point we have moved the problem into the network layer, and can let two common network device features take care of it in hardware. The first is receive-side scaling (RSS), which I guess is well known here, and which lets us distribute flows received on the private interface onto separate security associations for each core. The other is VMDq, originally a virtualization extension, which lets us aggregate the separate security associations received on the public interface before forwarding them to the respective core.

The next slide shows a high-level overview of this architecture. There are two queues here, Q1 and Q2, which run on separate CPU cores. On the left you have the private interface running in RSS mode; it has one address, and it splits the flows onto the queues. On the right we have the public interface, where each of the cores, or queues, has a separate public address. This means that each queue can negotiate security associations independently, and process an even chunk of the traffic without any synchronization with the other queues. So we just don't have that problem anymore.

All right, so, on drivers. The Snabb way is to write simple network card drivers in Lua, even when the hardware does not always make that easy. Luke Gorrie had a talk on the subject.
That was one or two FOSDEMs ago, I think, and I hear that nowadays he is soldering a network interface card himself. For my part, I can say that recently I've worked on XDP and Intel AVF drivers for Snabb, and hence Vita. The immediate goal of AVF, and more prominently XDP, is to make Vita easily deployable in the cloud: the idea is that if we have some very common, prevalent interface that we can rely on to be available in the cloud, we can easily deploy Snabb applications there.

On XDP, I can report that the initialization sequence is a bit heavy for my taste, personally; to me it's easier to initialize a reasonable hardware NIC than XDP. But overall it was a fun hack, and as good a reason as any to read kernel source code, if you ask me. I hit some limitations with XDP, which mostly have to do with conflicting memory allocation models between XDP and Snabb. However, the work with kernel upstream on these issues looks promising, and at this point I want to give kudos to Björn Töpel for helping out with that; that's pretty great. If you're interested in the topic, I have a blog post on how to do XDP without libbpf; you can check that out.
Right, so there's the issue of authenticated key exchange. As you might have guessed, we did something different there as well, since we didn't want to do IKE. Authenticated key exchange is kind of the tricky bit of the whole thing. You want to cycle security associations often, without losing packets; you want to cycle them often to be able to provide strong forward secrecy. And while this is a low-throughput part of the system, it is quite complex, and by far the biggest attack surface you can find. If you want to get a feeling for that, check out the IKE RFCs; they're huge.

So I ended up with a simple pre-shared-key protocol based on the Noise protocol framework, which I can really recommend; it's quite modern and quite clean. If you are in need of some cryptographic key exchange, TLS-like things, I think you should look at it. They have a really good community where you can figure out how to do this sort of thing in a modern way. For that we use a minimal set of modern cryptographic primitives: our DynASM-based AES-GCM implementation, the sandy2x implementation of Curve25519, which is written in assembler, I think, and the BLAKE2 hash reference implementation, which is written in C.

Alternatively, I plan to support full IKEv2. SWITCH engineer Alexander Gall has developed a strongSwan plugin to provide interoperability with Snabb. So basically you could use strongSwan if you really want to use IKE, and we would have a plugin for strongSwan to be able to consume the security associations negotiated by it.

And lastly, Snabb comes with a fairly complete YANG library. Vita manages configuration and runtime state using a custom YANG model.
That means you can query and update configuration, and also runtime state, using YANG RPCs, and of course you also get the configuration validation that comes with the model. Below is an example of querying the runtime state of a running Vita application.

All right, so that's it from me. You are welcome to get involved with the project, both Vita and Snabb; we're on GitHub. If you want more of the gritty details, I try to journal as much as possible of this on my blog, where I go deeper into certain subtopics. I also offer support and consulting on both Vita and Snabb via my consulting company, Interstellar. If you have any questions, please ask them now, in the hallway, or shoot me a mail. Yes, please, go ahead.

Mostly size. Size and complexity. Sorry, the question was: what's the advantage of doing packet processing in Lua, as opposed to doing it in C with DPDK, for example, or VPP? Me personally, I started with Lua because I was hired to work on that, and I guess the answer for most people who work on DPDK is the same, just the other way around. But for me it really boils down to size, and to one goal that we had with Snabb. Just to repeat: Vita is based on Snabb, written in Snabb; it's not a separate thing, it uses Snabb as its toolkit. One goal that we had from the very beginning is that we want non-hardcore programmers to be able to do network programming with performance. We have some use cases where network engineers use Snabb as sort of a debugging and introspection shell, where they write little programs to debug their traffic. We really had this idea that it should be less expensive, more approachable, less complex, and I think if you compare the codebase sizes, you will see what I'm talking about.
It's really a big difference in just size and scope. So we want to keep it simple, and for a simple networking tool that means we also want to use a simple language. And C is not bad. Yes, please.

On big, 1500-byte packets, probably; on 60-byte packets, no. Encryption on modern CPUs, especially something widely supported like AES-GCM, where you have AES-NI instructions for it, is actually quite fast: a single core can encrypt beyond 20 gigabits per second easily. But you never hit that rate when you're processing small packets, because to hit it you have to give the cipher a long chunk to work on, so you really end up being bottlenecked by the usual packets-per-second issue. And the kernel is really bad at that. Anybody else? Yeah, please.

That's a good question. There are no actual cryptographic primitives written in Lua; the primitives are all either assembly or the respective C implementations. We use Lua to drive the dynamic assembler: you write Lua code that generates assembly code, but what you actually run in the end is not Lua, just the generated assembly. And why is the AES constant-time? Because the AES-NI instructions on x86 are constant-time, and that's what we execute. Timing leaks are definitely something we have to make sure we don't introduce, but yeah, that's basically how we know.