So, I'm here to talk about kernel TLS and hardware TLS offload in FreeBSD 13. This has been a long-time effort. It started initially with Netflix, and now Mellanox and Chelsio are also involved. My name is, like said, Hans Petter Selasky, also called HPS, or hselasky at FreeBSD.org. I started out a long time ago with USB, and now I'm doing full-time networking with Mellanox. And here we have...

Drew Gallatin. I started off a long time ago with FreeBSD/alpha and networking and various things, and here we are with TLS. Why don't you go ahead, Hans?

Sure. So, I have a little question here. How many of you have ever been to a cryptography course at university? Put your hands up. Yeah, there's a couple of people here, so that's good. So you may be familiar with Bob and Alice. I need to say a little bit about why we do cryptography. Bob and Alice are the two famous characters that try to exchange a secret message, and there's also the evil guy that tries to eavesdrop and tamper with the data. Quickly summarized, cryptography is making a numerical message depend on a small pre-shared key. You can use it right or you can use it wrong. Usually it's good: it prevents leaking data to others, and you can also use it for checksumming. Unfortunately, cryptography can also make your data disappear faster when you lose your keys, so be careful.

TLS may be familiar to a lot of you. It's short for Transport Layer Security. I'm trying to go very easy, starting in the morning. It's used behind HTTPS on port 443, and it can support multiple crypto codecs, like AES. We're mostly going to focus on AES, because that's the main standard with TLS 1.3. It can also support different key exchange protocols.
So, you may know about Diffie-Hellman and RSA, and there might be more that we don't know about yet. Like mentioned, TLS is a protocol, and it runs on top of TCP. Then on top of TLS you can run other protocols, as shown in the slide. This is just to give you a picture, and I'm going to dig a little bit into details.

When I started doing TLS work, this was a black box. What is TLS? Trying to find documentation was not so easy, so I started with the code. TLS has a small header that encapsulates the data. We have 13 bytes typically, but it's variable. In the beginning we have a type; it can be data, handshake, or alert. For example, an alert happens when someone is tampering with the data: the receiver identifies, oh, I cannot decrypt this packet, and sends an alert message to terminate the connection. Then you have the major and minor numbers, and for TLS 1.2 it's three and three. This actually dates back to the times of SSL version 3, so it's not one and two as you might expect, but three and three. The TLS length is a 16-bit number, so you can actually encapsulate up to 64 kilobytes, minus one to be exact, but a lot of applications limit this to 16 kilobytes. That's just a kind of legacy thing. You need to know that when you use TLS you might need to use smaller blocks; you cannot do such big blocks. After the length you have a variable nonce. It's usually eight bytes with some crypto codecs, depending on the algorithm, and sometimes it's not present. And then you have the data.

So this is TLS 1.2. The difference for TLS 1.3 is basically that you don't have a nonce. The nonce is kept on the side track of the protocol; it's maintained by the hardware or software, to save bandwidth. And as you can see here, the major and minor numbers are still three and three. So why is that?
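The TLS 1.2 record layout just described can be sketched in a few lines of Python: 5 fixed bytes (type, major, minor, 16-bit length) followed by an 8-byte explicit nonce, which together give the 13 bytes mentioned. This is only an illustration; the function name `parse_tls12_record`, the `nonce_len` parameter, and the returned dictionary are made up for this sketch and are not part of any FreeBSD or OpenSSL API. The content type numbers are the ones the TLS record protocol defines.

```python
import struct

# Content types defined by the TLS record protocol.
CONTENT_TYPES = {
    20: "change_cipher_spec",
    21: "alert",
    22: "handshake",
    23: "application_data",
}

HEADER_LEN = 5  # type (1) + major (1) + minor (1) + length (2)


def parse_tls12_record(buf, nonce_len=8):
    """Split one TLS 1.2 record into header fields, explicit nonce and payload.

    Hypothetical helper for illustration; nonce_len=8 assumes a cipher with
    an 8-byte explicit nonce, and 0 can be passed when no nonce is present.
    """
    if len(buf) < HEADER_LEN:
        raise ValueError("truncated record header")
    # Network byte order: three single bytes, then a 16-bit length.
    ctype, major, minor, length = struct.unpack("!BBBH", buf[:HEADER_LEN])
    body = buf[HEADER_LEN:HEADER_LEN + length]
    if len(body) != length or length < nonce_len:
        raise ValueError("truncated record body")
    return {
        "type": CONTENT_TYPES.get(ctype, "unknown"),
        "version": (major, minor),   # (3, 3) on the wire for TLS 1.2
        "length": length,            # covers nonce, encrypted data and trailer
        "nonce": body[:nonce_len],
        "payload": body[nonce_len:],
    }


# A made-up record: application data, version 3.3, zero nonce, 4 payload bytes.
record = struct.pack("!BBBH", 23, 3, 3, 12) + bytes(8) + b"abcd"
parsed = parse_tls12_record(record)
```

Note that the 16-bit length counts everything after the 5-byte header, so the nonce is part of the record length rather than part of the fixed header.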
That's, again, because we have a lot of routers and equipment that only support TLS 1.2, and if you try to change these numbers, then maybe that equipment won't support the protocol and your packets will be dropped. So instead of putting the first byte, the TLS type, in the beginning, we now put it after the TLS data. That's not shown in the picture, and this is just a curiosity at the moment.

I'm going to say a few words about AES. AES is a relatively old algorithm. It started in Belgium, and it wasn't called AES at first; you can read about it on Wikipedia. It basically processes 16 bytes at a time, and it can be used as a stream version. That means you can stop encrypting, keep the state, and resume encrypting later. So basically, when you're encrypting a stream, you can do it byte by byte, because it uses the output from the previous block to encrypt the next block, and in the beginning there is an initialization vector that you use to start with. I can also mention that FreeBSD supports the non-stream version of AES too.

Transport Layer Security is implemented in FreeBSD by OpenSSL. I know there are other alternatives, like LibreSSL, but I'm going to focus on what we have in FreeBSD at the moment. You may be familiar with the term AES-NI; that's the AES New Instructions set. It's a CPU offload for AES, and it makes it run faster. Then we also have software kernel TLS. That means instead of doing the encryption in user space, we do it in the kernel, and I will return to why we do that later in the presentation. There's also something called the OpenCrypto Framework, or OCF. That is basically for a PCI card where you can DMA the data in and get the encrypted data back, and that's for the kernel. Then we have yet another technology.
It's called TCP offload engine, or TOE. That means we send only the TCP data to the network card, and the network card will do both the TCP and the TLS in the same operation. And then we have NIC kernel TLS. That is when we're sending full TCP frames with data to the NIC, and the NIC will then decipher the header and do the TLS encryption packet by packet. This way you can also do TSO, so you can put down a big chunk of data to the NIC, and the NIC will do both the fragmenting of the frames and the encryption at the same time.

So I'm going to look a little bit into OpenSSL. You can look at OpenSSL like a filter. It's based around something called a BIO structure. It's like a source and sink for data, and you can hook them together to make a chain of filters: you can read from a file, do encryption, and output to a socket. With this framework all data must have a pointer in user space, so it's passing around pointers. It's zero-copy inside OpenSSL, but when you do a socket send, there will be a copy into the kernel. I will talk more about this later on as well.

OpenSSL and kernel TLS. We have a guy at my office, not in Norway but in Israel, called Boris. He made 16 patches to support something called kernel TLS. It's like an offload for doing the encryption in the kernel instead of in user space. He did it initially for Linux, and now we also have this in FreeBSD. The API is very simple: it's socket options. You have a set-socket-option that turns on TLS for TX, and you have a set-socket-option where you can switch the backend you are using in the kernel. So you can, for example, say: I want to use the NIC offload, I want to use the OpenCrypto Framework.
I want to use something else. So you can switch around which backend you're using. I have a link here which shows when FreeBSD support was added. It's revision 351522, and this is the cumulative work of many people. John Baldwin, sitting here in the front, did the push-button work and got it into the tree, but, like Drew will mention in his part, there are a lot of people involved.

Okay. As Hans was saying, a lot of people were involved in this. Back in 2014, 2015, Netflix made a commitment to protect the privacy of its users and to start encrypting, via TLS, the streams in which we send the movies to your devices. The problem was, this is really expensive; how are we going to do this? Scott Long had the idea, and Randall Stewart did the initial implementation of kernel TLS. The idea was to preserve our normal sendfile pipeline. Instead of what a lot of people do, which is to read the data from the kernel into a web server and then write the data from the web server back into the kernel, we want to avoid that extra step and just be able to do a sendfile. By doing TLS in the kernel we can preserve that same pipeline, and we can still use async sendfile, and everything looks basically the same except for the crypto step. The idea was to do it as efficiently as possible. A huge amount of time was spent by Randall making it efficient, and then even more time was spent by me, coming along after Randall, making it even more efficient.

In order to do some of these things for TLS, we needed some enhancements to mbufs. I'm going to talk about the not-ready flag and about unmapped mbufs, and then Hans is going to continue on with send tags for NIC TLS. So what's the not-ready flag?
This is something that Gleb Smirnoff, also from Netflix, came up with to support async sendfile. The idea is that when you are doing sendfile, you're reading stuff from disk, and whenever you read stuff from disk there's a chance you're going to block. So rather than having nginx block and lose an nginx context, and have to have a thread pool, the idea is that nginx uses async sendfile. What happens is sendfile submits the mbuf into the socket buffer, and it issues a disk read to fill the pages that are attached to the mbuf. But when it puts the mbuf in the socket buffer, it marks it not ready. What that means is, when TCP is processing the socket buffer looking for things to send, it has to stop when it runs into a not-ready mbuf. When the disk interrupt handler runs and the pages are now there, it marks the mbufs ready and calls the TCP ready routine, which then calls TCP to re-examine the socket buffer and send anything which has been marked ready. In that way you can avoid having an nginx thread pool where you have lots of contexts blocking.

The handy thing is that I realized this allows for a very simple way to add a stage to that pipeline. After the pages come in from disk, you can leave them not ready and call a crypto routine, which will then mark them ready once they are encrypted. In that way we can use the not-ready flag for kernel TLS as well.

The next thing I want to talk about is unmapped mbufs. What that really means is an mbuf that's basically pointing to an array of physical pages; in fact the structure is still kind of named that way, but it's really just physical addresses. I initially thought of it for sendfile and not for TLS. The idea was that in sendfile you have one mbuf pointing to one page. So for 64K you've got 64K divided by 4K, or 16 pages. So basically, for every 4K in a
socket buffer, you're walking a new mbuf and taking a new cache miss, and TCP walks the socket buffer chains a lot, especially when processing ACKs and doing things like that. If you can combine all of these into an array, so that instead of a single mbuf referencing just 4K it references 16K in the case of TLS, or something like 100K in the case of non-TLS, you can reduce these cache misses by a large factor. Even in our unencrypted workloads, at the time when I introduced this, it reduced our CPU usage by something between 5 and 20 percent, depending on the machine.

The other handy thing, which I realized later, is that it also provides a nice way to work with TLS. By enhancing this just a little bit, adding space at the front for the 13 bytes that Hans was talking about for the beginning of the TLS record, and adding some space at the back for the end of the TLS record, all of a sudden you've got a single atomic way to refer to a TLS record. That's really handy for being able to do reference counting for TCP retransmits with NIC TLS. The reason that's important is that TLS records don't always line up with TCP segment sizes. What can happen is TCP gets an ACK for up to a certain point in the stream, and that point might be in the middle of a TLS record. TCP then says, hey, I'm done with everything up to this point, go ahead and free it. The problem is that if we need to retransmit the very next piece, the last part of the TLS record, then in order for the NIC to be able to retransmit that, it's got to see the front part. So if we didn't have these mbufs, we'd have had to come up with some more expensive reference-counting scheme to prevent the front of the TLS record being freed, so that the NIC could again DMA it down and recalculate the checksum for the first part of the TLS record.
Basically, for our software TLS implementation, we pass the data from user space into the kernel, or from sendfile into the kernel, and the kernel does the TLS framing. Like I was hinting at before, the mbufs are marked not ready while they're waiting to be encrypted. The mbufs are queued onto a per-CPU kernel TLS worker thread, and once that worker thread encrypts the data, it marks it ready and it's ready to go to TCP.

Yes, so I'm going to talk a little bit about something called mbuf send tags. This is basically a pointer. When you're doing NIC TLS offload, you're allocating a resource on the NIC to hold the crypto key and the crypto cursor. The send tag is kind of owned by the NIC, and it allows the network interface to decide if the packet coming in needs the special processing or not. The reason we put this send tag in the mbuf is that it needs to be very fast. We cannot do a lookup in a hash table or mess around with five-tuples down in the NIC driver. Also, we need to be able to traverse technologies like VLAN and lagg, that's short for link aggregation. You can imagine that when a packet goes out, it might not always go out on the same NIC. If you reconfigure your lagg, for example, the packet can suddenly change to another interface, and it's very unfortunate if suddenly your unencrypted traffic goes straight on the wire. So we added mechanisms that detect route changes in both VLAN and lagg, and that also check whether the underlying network device supports NIC TLS offload, to prevent unencrypted data going on the wire.

The API for send tags is very simple. We basically have four methods, and these are function pointers in the network structure, struct ifnet, in FreeBSD. You can allocate a send tag, you can modify it, you can query it, and you can free it. The allocate function is recursive. So you basically ask your route interface:
I want to have a send tag for TLS. Then it checks the capabilities of the network interface: do I have TLS support or not? If I don't have TLS support, we return a failure. And this is recursive, so if you have a VLAN on top, it asks the VLAN first. For lagg, it's so that lagg uses something called a hash of the five-tuple, usually called a flow ID, and this information is not always present at the beginning of the connection. So sometimes, before you can allocate a crypto tag, you need to wait for a few packets to be exchanged so that the socket can record which is my hash, which is then my output network interface under lagg, and again, which is my destination output ring in the network interface. This is usually called a Toeplitz hash, and we use the seven least significant bits to switch the packets onto the TX rings. But maybe a lot of you are familiar with that.

From the network stack perspective, things are very simple. You basically set the send tag pointer, which is the snd_tag in the packet header of the mbuf, and then you also need to set a checksum flag for the send tag. We tried to avoid increasing the size of the mbuf so that it would use another cache line, and unfortunately we had to share the send tag with the receive interface pointer, so it's in a union. To avoid leaking the receive interface into send tags, for example when you do a ping you might get back the receive interface pointer, we have a flag: we abuse a checksum flag to indicate whether you have a send tag or not. This is maybe too much detail for you.

From the network driver perspective, it basically does the opposite. It checks if the checksum flag is set, and it does a container-of on the send tag you specified in the mbuf. This usually gets you the per-network-interface specific structure that contains, for example, the destination TX queue or a copy of the so-called flow ID. And it can do a simple check: is this packet still valid for this interface or not?
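The recursive allocation and the driver-side validity check described above can be modelled roughly as follows. This is a toy Python model under stated assumptions, not the FreeBSD code: the classes `SendTag`, `Nic`, `Vlan`, and `Lagg`, the method names, and the modulo-based port selection are all invented for illustration, and the real interfaces are C function pointers in struct ifnet.

```python
class SendTag:
    """Hypothetical stand-in for the NIC resource holding key and cursor."""

    def __init__(self, ifp, flowid):
        self.ifp = ifp          # physical interface that owns the tag
        self.flowid = flowid    # copy of the flow ID seen at allocation time


class Nic:
    """Physical interface that may or may not support TLS offload."""

    def __init__(self, name, tls_capable):
        self.name = name
        self.tls_capable = tls_capable

    def alloc_send_tag(self, flowid):
        if not self.tls_capable:
            return None         # no TLS support: report failure
        return SendTag(self, flowid)

    def packet_valid(self, tag, flowid):
        # Cheap check: does the tag still belong to this NIC and this flow?
        return tag is not None and tag.ifp is self and tag.flowid == flowid


class Vlan:
    """Stacked interface: just passes the request to its parent."""

    def __init__(self, parent):
        self.parent = parent

    def alloc_send_tag(self, flowid):
        return self.parent.alloc_send_tag(flowid)


class Lagg:
    """Link aggregation: needs the flow ID to know the output port."""

    def __init__(self, ports):
        self.ports = ports

    def alloc_send_tag(self, flowid):
        if flowid is None:
            return None         # flow ID not recorded yet: try again later
        port = self.ports[flowid % len(self.ports)]  # assumed selection rule
        return port.alloc_send_tag(flowid)


# Example stack: a VLAN on top of a lagg with one capable and one plain NIC.
nic0 = Nic("nic0", tls_capable=True)
nic1 = Nic("nic1", tls_capable=False)
stack = Vlan(Lagg([nic0, nic1]))

tag = stack.alloc_send_tag(4)        # flow 4 maps to nic0 in this model
no_tag = stack.alloc_send_tag(5)     # flow 5 maps to nic1, which lacks TLS
too_early = stack.alloc_send_tag(None)
```

In this model, a packet carrying `tag` that shows up at `nic1` after a lagg reconfiguration fails `packet_valid`, which is the situation where unencrypted data must not go on the wire.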
And this is in the fast path.

Here you can see a set of different use cases for data flow. The good old case is that you're using a socket write; that is all the way to the left, with OpenSSL. You have an unencrypted buffer inside OpenSSL, you do the encryption in user space, you copy the encrypted buffer into the kernel, and then the NIC reads from the kernel buffer and puts it on the wire.

In the second case, where you have software kernel TLS, you have the unencrypted buffer in user space. You write it into the kernel via a system call, and the kernel will then encrypt it. Like Drew said earlier, we have a per-CPU thread that reads unencrypted data marked M_NOTREADY from the socket buffer, encrypts it, and sets the ready flag. Then the encrypted buffer goes on to the NIC, and the NIC puts it on the wire.

Then with NIC kernel TLS, as you can see, you have an unencrypted buffer in user space, you write it into the socket buffer, and after it goes into the socket buffer we write it straight to the NIC. I can also mention that there is some magic going on here. When you do a write system call to the socket buffer, for every system call you do, the kernel will add a TLS header and trailer. So it will automatically wrap your transmitted data with the TLS header and trailer, and it's all seamless to user space. You just write the unencrypted data and it's automatically encapsulated. Then, after the data is in the kernel with this additional header and trailer, it goes to the NIC, and the NIC does the encryption.

So this eye chart here basically shows the data flow for sendfile, and in particular the data flow for sendfile with software kernel TLS. One thing you'll notice is that for every hundred gigabits, divide by 8 for gigabytes, it's 12 and a half gigabytes a second.
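That conversion is easy to check with a short sketch. The `passes` parameter is an assumption added for illustration: it counts how many times each transmitted byte is written to or read from host memory. Four passes would correspond to a disk write into memory, a CPU read, a CPU write of the encrypted data, and a NIC read; two passes would correspond to skipping the CPU read and write.

```python
def gbits_to_gbytes(gbits_per_s):
    """Convert a rate in gigabits per second to gigabytes per second."""
    return gbits_per_s / 8


def memory_bandwidth_gbytes(link_gbits, passes):
    """Memory traffic in GB/s if every byte sent crosses memory `passes` times.

    `passes` is a modelling assumption, not a measured value.
    """
    return gbits_to_gbytes(link_gbits) * passes


print(gbits_to_gbytes(100))              # 12.5
print(memory_bandwidth_gbytes(100, 4))   # 50.0
print(memory_bandwidth_gbytes(100, 2))   # 25.0
```

The model ignores everything else that touches memory, such as protocol headers and TCP bookkeeping, so it gives a lower bound rather than a prediction.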
So when you're doing sendfile with software kernel TLS, basically the data flow is: you bring things in from the disks into memory, and then the CPU has to read everything you just brought into memory to do the crypto. Once it does the crypto, it's got to write it back out into memory, and then from memory the NIC has to DMA-read it. So basically you multiply your bandwidth by four: if you want to do 100 gigabits, you've got to have 400 gigabits of memory bandwidth, or 50 gigabytes a second, which is just about as much as a Broadwell Xeon can do. The nice thing about NIC TLS is, see these green arrows that go up and down? Now you don't see them anymore, because all of a sudden you don't have to do this memory read and memory write anymore, and your memory bandwidth requirements are cut almost in half. This is important because certain CPU vendors like to segment their product lines by memory channels and memory speeds, so you can maybe go down a product line if you can do NIC TLS. Next slide.

So here you can see what you would expect if you ran tcpdump with a modified iperf that supports TLS offload in the kernel. If you run tcpdump on the iperf client sending the data, you will see the unencrypted data: zero one two three four five six seven eight nine, zero one two three four five six seven eight nine. On the server side you will see the encrypted data.
So this is exactly the same packet: this is on the client side and this is on the server side. This is just to show you what you can expect from NIC TLS and also software TLS offload.

We hit some issues when trying to implement NIC kernel TLS. As I said already, the NIC is messing with the TCP data, but it already does so with TSO. Those of you familiar with TSO, large send offload, know that the NIC already updates the sequence number when it fragments big chunks of data, and now it's also encrypting the TCP data as it goes along. But who says we have to follow the OSI model for everything?

Retransmission of TLS packets: as Drew said, there is a need for re-sending the beginning of a TLS record if you're retransmitting a TCP packet that starts in the middle of a TLS record, and a packet can actually cross TLS records. As you might remember, I said in the beginning of the talk that we have a 16K maximum length for TLS records. That basically means that if you need to retransmit one byte at the end of a TLS record, you need to dump almost 16K down to the NIC before you can retransmit that last byte, in order to get the right crypto state. But in the good case you don't do this. So you might want to consider using something like RACK, or otherwise try to get the retransmission rate as low as possible when using TLS, to minimize the PCI bandwidth used.

Then we have some benchmarks, so let's see what's first. Yeah, Drew.

So this is in the benchmark section, but it's not actually a benchmark. This is data from a Netflix circa-2016 hundred-gig server. It's basically a Broadwell-based Xeon with 16 cores and 32 threads, four really fast NVMe drives, and one Chelsio T6 NIC. The case on the left is without kernel TLS, and the blue bars are the bandwidth we're serving out of the NIC.
So as you can see, we're maxing out around 40 gigabits a second, and the CPU is at an average of 75 percent, because it kind of zooms between a hundred and fifty. It's a case where it's basically memory-bandwidth bound, and it's miserable for clients because it speeds up and slows down, speeds up and slows down, and never really finds a sweet spot. It's just easier to say 75 and stop there, but, you know, I ramble.

The second case is the case that Netflix runs today. Basically, we serve at 90 gigabits a second. That's our target bandwidth for a hundred-gigabit cache, to allow for a little bit of extra capacity on the link for other things. We're at a little less than 70% CPU with software TLS. The rightmost is with NIC TLS, and as you can see, the CPU is cut almost in half for the same bandwidth. Again, this is NIC TLS on a Chelsio T6. That's not available right now in head; it's something that John Baldwin has patches for on GitHub, and you can talk to him after the presentation if you want more information on that.

So here we have some benchmarks with Mellanox NIC TLS. The orange line here shows software kernel TLS. That's basically using the AES instructions in the kernel to do encryption of the packets going on the wire, and as you can see, as you go up in number of threads you start congesting the CPU, because doing software encryption is relatively heavy even if it's done in the kernel. The blue line on the bottom is what you would get with plain text. The CPU usage rises linearly, because on this machine you have 28 cores available. You can see here what happens as you go up to 28.
It's almost linear, and then it rises a little bit. The gray line shows NIC TLS with the not-yet-on-the-market Mellanox ConnectX-6 Dx. You can see it uses a little bit more CPU, and that's likely because it's encapsulating with smaller TLS records. It needs to encapsulate every 16 kilobytes, while if you use plain text you can do 64 kilobytes at a time with regular TSO.

This is also interesting for those of you that do virtualization. You can imagine a virtualized environment where you don't have to do encryption in software or on the CPU at all. A weak ARM processor can maybe also be a target for such applications. You only need to switch the packets around, and then the NIC will do everything for you, or almost everything, except the TCP stack. We still want to do the TCP stack in the kernel.

As I already mentioned, Mellanox has a web page you can go to. Theoretically we can support up to 16 million simultaneous TLS records, no, TLS streams, at the same time, at either 25 gig, 50 gig, 100 gig, and now there will also be 200 gigabits per second. There are various configurations you can use: you can use two 100-gig ports or you can have one 200-gig port, so there are different ways you can get 200 gig. And yeah, you maybe don't want to say how you get 200 gig over at Netflix?

Well, I'll talk about that in the next hour.

Yeah. So then we have a little bit about Chelsio's hardware TLS offload. Do you want to say a few words, John Baldwin, or...?

So basically the T6 NIC supports TLS 1.1 and 1.2, and the fairly unique thing is it can do both CBC and GCM. John has the kernel TLS support for the TOE mode of Chelsio in progress.
That's not something we use at Netflix, but it's interesting to a lot of people. The really interesting thing is that the OpenCrypto Framework ccr Chelsio driver is already usable in the tree. It's basically the only thing we're talking about now for hardware offload that you could actually use at this second if you're running head.

So we've reached the end of our talk, and I would like to know, are there any questions?

Can you go back to the slide with the unmapped mbufs? Just because of my understanding problem: instead of a chain with pointers, you're using just an array to point everywhere? Okay. So isn't it generally good advice to avoid that kind of queue with pointers and use arrays in modern systems?

Yes. It's something that Linux does with their skb pages, I think that's what they call it. It's something that I wanted to do in FreeBSD for years.

Okay, thank you.

Thank you.

Morning. You mentioned virtualized environments. How will the NIC crypto offload be available to an OS running inside the hypervisor machine?

Well, that's really a good question. I will just repeat: so, in virtualized environments with Mellanox NICs, they provide virtualization inside the NIC. The NIC has multiple virtual PCI functions, and you can give these virtual PCI functions to your virtualized instance. This way you have all the DMA rings inside your virtual machine, and the NIC will actually read from these rings directly. Unlike Intel adapters, which don't support this so well, with Mellanox cards you can really do it at large scale: you can split up one physical network card into many smaller virtual PCI functions and just hand them out to your virtual machines. Does that answer your question?
Thank you. Any more questions?

So, it's pretty rare. What's in head right now is TLS up to 1.2. I actually have a patch for TLS 1.3 that's working and has served real Netflix customer traffic, and they've also run TLS 1.3 with NIC TLS. As for a new TLS version, at this point things are becoming ossified. You can tell how ossified they're becoming because 1.3 has to masquerade as 1.2. So I think that's going to slow down a little bit. It's just my personal opinion. But you do lose flexibility, and that's definitely true. At least from a Netflix perspective, except for people watching on a web client, most of the clients are upgraded at a fairly slow pace. I mean, one of our problems is I'm still supporting TLS 1.0 because of a grandma's smart TV she bought in 2010 or whatever, right?

Okay, final question.