Hi, I guess it's time to start. I know you guys just had lunch and you would rather take a nap than listen to me rambling about IPsec, but I'll try to be as entertaining as possible.

All right. My name is Mike Belopuhov. I'm a developer with the OpenBSD project and I'm working for .vantronix secure systems, which produces networking appliances based on OpenBSD. As this Soviet-times Polish poster says, "What have you done to fulfill the plan?" So why should you be listening to me? Well, I have implemented most of the stuff I'll be covering in this talk, and I guess this would be the right time to ask me questions if you have any. But the things I will be talking about were implemented over the course of multiple releases and over a couple of years, and some details might already be forgotten, so I ask you to please excuse me.

Right, so let's get started. First of all, I want to talk about the new AES-NI instruction set that was implemented in Intel Westmere and newer cores. In fact, we'll be talking about two instruction set extensions, AES-NI and CLMUL. Intel introduced a bunch of new SSE instructions; SSE is a part of the FPU, not the CPU. The instructions they introduced were AESENC, AESDEC, AESENCLAST and AESDECLAST (the "last" ones are for the final round of AES encryption and decryption), and two instructions that facilitate the AES key expansion procedure, AESKEYGENASSIST and AESIMC. These instructions operate on XMM registers. And I forgot to mention the instruction that performs carry-less multiplication, PCLMULQDQ. It takes two XMM registers (XMM registers, if you don't remember, are 128-bit registers), but it uses only 64 bits of data from each, and it performs a multiplication of 64-bit vectors, basically, producing a 127-bit result. So, as I mentioned, these
are SSE instructions, but we wanted to use them in the kernel to be able to accelerate the AES algorithms used in IPsec processing. Normally floating-point arithmetic is not used in the kernel, and there's a very good reason for that. The thing is that the CPU and the FPU have different contexts. That means that when a task switches away from a CPU, it doesn't necessarily switch away its FPU context, so the FPU context can be preserved between task switches. And that means that whoever calls an SSE instruction first needs to save the old FPU context, so that it can possibly be restored afterwards, and prepare a new one. This mechanism was reasonably new for our kernel, and for that purpose we have implemented a locking interface, fpu_kernel_enter and fpu_kernel_exit, which denotes a critical section where the thread that has taken this lock, basically, cannot be involuntarily switched away, and which must also save the existing context and prepare a new one so that it can proceed with those SSE instructions. Another caveat is that it cannot be safely used in interrupt context. The thing is that saving the FPU context might require you to send an IPI and wait for the result, to synchronize with the other CPUs.

And if you were wondering whether an AES instruction set is unique to Intel and AMD Bulldozer CPUs: no, it's not. The newest Oracle SPARC T4 CPU, which is a continuation of the UltraSPARC T line, also features an AES instruction set, although
it's not compatible with AES-NI. It was also announced that the ARMv8 architecture and the POWER7+ architecture would feature AES instructions as well.

Now let's take a look at the new encryption mode for the AES cipher that was recently developed and standardized by different committees. This is a combined authentication and encryption transformation, which means that while the data is processed, the authentication tag is calculated at the same time as the encryption happens. I denoted it here as a message authentication code, although that's not exactly correct, but that's what most people are used to. In AES-GCM this piece is not called a message authentication code, it's called an authentication tag, but we can consider the two to be synonyms for this talk, just so that we don't get confused.

So what kind of MAC does it generate? It generates a 128-bit MAC, not truncated. There were actually a couple of variations of this AES-GCM standard specified that differ by the length of the tag. 128 bits means 16 bytes, so the algorithm is in fact called AES-GCM-16, and they have also specified AES-GCM with, if I'm not mistaken, 12 and 14, which basically just take this MAC and truncate it. That's a normal practice; HMAC hashes are also truncated in IPsec. But in OpenBSD we are using the non-truncated version, because most of the requirement documents actually specify only the full version; nobody really cares about the truncated ones. It's also possible to use an authentication-only version, where you just don't encrypt the plaintext, you don't generate the ciphertext, and in this case the algorithm is called AES-GMAC. Essentially this algorithm is just a combination of AES used in counter mode and a special hashing function, GHASH, that I will be talking about later. So where is it used?
It is used in the new MACsec standard for layer 2 encryption in Ethernet networks, in the Fibre Channel Security Protocols to encrypt Fibre Channel, and in IPsec. It's also specified for SSH and TLS; I'm not sure about OpenSSL and TLS, but we don't implement it in SSH yet. And then NSA Suite B, which is a suite of cryptographic algorithms, in fact a document published by the NSA that describes which algorithms should be used for which purpose: NSA Suite B endorses GCM as the preferred mode of AES. In fact, the older version of this presentation said "as a preferred mode of AES for network encryption", but I went and looked it up again and I haven't found any other modes that they actually endorse, so they say AES should only be used in GCM mode. It's also optional in the USGv6 specification, the United States Government IPv6 compliance specification, which is a list of requirements for US government contractors. It was released a couple of years ago, and its basic purpose is to set the rules for contractors selling IPv6-enabled equipment to the United States government. So everyone who sells this equipment has to comply, and right now, or like a couple of months ago, GCM was optional, but it can perfectly well be promoted to a "must". Well, it all started with a new standard, in fact.

I made a few slides to just briefly talk about how AES-GCM operates and why it is different from the other modes. Here we define the inputs and outputs of the mode. The inputs are a secret key for AES, obviously 128, 192 or 256 bits long; an initialization vector (in fact, in GCM the initialization vector is called a nonce, the same way it's called in AES-CTR); a plaintext of up to 64 gigabytes; and optional additional authenticated data. This data is provided to the cipher and will only be authenticated but
not encrypted. That's another thing that makes this mode a little bit special. The outputs of the mode are a ciphertext of the same length as the plaintext (AES-GCM uses AES-CTR, and counter mode turns a block cipher into a stream cipher, so the length of the ciphertext is the same) and an authentication tag that is 128 bits long.

So this is the high-level view of which data is processed by the mode. Let me draw on it a bit; I'm talking about this part here, at the top. This is basically a structure that resembles, for example, an ESP packet. It has a header, for example it has an SPI, the security parameter index, and ESP, for example, has an initialization vector, and the data, obviously. The way the cipher works is that we take the initialization vector, we take the header, we take the keys (they're actually not depicted in the picture), and we supply them to the GCM encryption engine along with the plaintext. The output will be the ciphertext, that is the encrypted data, and the ICV, which is the authentication tag. The header fields, like the sequence number, are copied untouched and unencrypted, but they are also authenticated, because they are supplied as additional authenticated data. That's the crucial difference here, because in AES-CBC with HMAC the SPI of the ESP packet is not authenticated. GCM fixes this by providing the notion of additional authenticated data.

Now let's take a look at a bit more low-level picture depicting how the GCM mode works. Frankly, I was trying to find a simple picture; this is the simplest I have found, and I have tweaked it to be even simpler, so it might not be 100% correct, and it also requires a bit of explanation. The way to look at this picture is that this part here, on the right, is essentially AES counter mode. We take the IV, we increment the counter.
We encrypt it with our generated keys, because in counter mode you encrypt the counter block; you don't encrypt the plaintext itself. That gets you what they call the keystream, and then you XOR it with the plaintext, and that's how you get the ciphertext in AES-CTR. So this part just denotes AES-CTR. Now let's take a look at what happens on the left.

The first thing that happens is that the IV is taken, a counter value is appended to it, and it gets encrypted to produce the first block, which is used specially in this mode: it's saved, and later it will be used together with the GHASH value here. For now it is just saved; we store it in the context and don't do anything with it. Then we start encrypting the plaintext with counter value one. GCM works on 128-bit blocks of data, so as long as you have that amount of plaintext to process, you run the AES-CTR algorithm, but for every block you also send the ciphertext (you can see this feedback kind of arrow) to GHASH and you hash it. Oh, I forgot to mention: first, after we are done with the IV and generating this block, we feed the additional authenticated data to GHASH, and it stays in the hash context there. And now, when we feed our blocks of ciphertext to GHASH, every time we hash a block we save the result in the context. In the end, when we are done processing the plaintext, we XOR our saved initial counter block (in fact, that's what it's called) with the GHASH value that we have, and this is how we get the authentication tag. Pretty simple.
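As an illustration of the left-hand side of that picture, here is a minimal Python sketch of GHASH and the final tag XOR. It is not the OpenBSD code: the function names gf128_mul, ghash and gcm_tag are made up for this example, AES itself is left out, and both the hash subkey H (AES applied to an all-zero block) and the encrypted initial counter block are assumed to be computed elsewhere and passed in as 16-byte strings.

```python
# Sketch of the GHASH side of GCM only; the AES-CTR side is NOT implemented here.
# All names below are hypothetical and chosen for this example.

# Reduction constant for x^128 + x^7 + x^2 + x + 1 in GCM's bit ordering.
R = 0xE1 << 120


def gf128_mul(x: int, y: int) -> int:
    """Multiply two 128-bit field elements bit by bit (the slow, portable way)."""
    z, v = 0, y
    for i in range(128):
        if (x >> (127 - i)) & 1:  # GCM numbers bits from the most significant end
            z ^= v
        v = (v >> 1) ^ R if v & 1 else v >> 1
    return z


def ghash(h: bytes, aad: bytes, ciphertext: bytes) -> bytes:
    """Hash the AAD, then the ciphertext, then one block holding both bit lengths."""
    hkey = int.from_bytes(h, "big")
    y = 0
    for data in (aad, ciphertext):
        for off in range(0, len(data), 16):
            block = data[off:off + 16].ljust(16, b"\x00")  # zero-pad a short last block
            y = gf128_mul(y ^ int.from_bytes(block, "big"), hkey)
    lengths = ((len(aad) * 8) << 64) | (len(ciphertext) * 8)
    y = gf128_mul(y ^ lengths, hkey)
    return y.to_bytes(16, "big")


def gcm_tag(h: bytes, encrypted_j0: bytes, aad: bytes, ciphertext: bytes) -> bytes:
    """XOR the saved encrypted initial counter block with the GHASH value."""
    s = ghash(h, aad, ciphertext)
    return bytes(a ^ b for a, b in zip(s, encrypted_j0))
```

With an empty AAD and an empty ciphertext, ghash returns all zeros, so the tag is simply the saved encrypted counter block. The 128-iteration loop in gf128_mul is the bit-by-bit multiplication that makes a plain software GHASH slow, and it is the part that a carry-less multiply instruction or lookup tables would replace.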
Let's talk about the implementations that we did in OpenBSD. There are in fact two implementations: a portable one written in C, so that we could get a feel for the cipher, run the test vectors and check that we got the algorithm right, and a second one written in SSE assembly with C glue code that makes use of the AES-NI and CLMUL instruction extensions.

The portable implementation is divided into several parts. In fact, there are three parts. One is not described on this slide, which is AES-CTR; we just used the AES-CTR that we had previously implemented in our cryptographic framework. Of the two remaining components, one is GHASH, wrapped into a more high-level API so that we can divide it into different stages, like init, setkey, reinit, update and final. It's called AES-GMAC and is implemented in sys/crypto/gmac.c and .h. It's a straightforward implementation. It uses 32-bit integers and does the XORs in C, it's all in C, and it's very slow, because the multiplication in the finite field, which is one of the core techniques in GHASH, has to be done bit by bit: you multiply vectors bit by bit, and doing these bit accesses at run time kills performance big time. There exists a method to improve performance by a large amount by implementing tables of pre-computed coefficients.
So it means that in the initial phase, when you set up your cipher context, you pre-compute a whole table of coefficients, and then you just look them up and use them in the XORs. This is not implemented yet, but it is something that is reasonably needed, because we don't have the accelerated version on any architecture other than, in fact, amd64.

The other part of the mode that needed to be implemented is the actual function doing the GCM processing in the cryptographic framework. I will be talking a bit about the framework itself in a few minutes, but I must say that the cryptographic framework naturally works with mbufs or iovec-type structures, and basically what we need in the cryptographic framework is a function that traverses the chain and calls the right functions at the right moment in the right order. This is what the software crypto combined routine is doing. Unfortunately, there is also a simplification I made, because the whole thing is rather complicated and I didn't want to complicate it even more from the start: I just use m_copydata and m_copyback, the mbuf data-handling routines, instead of traversing the mbuf chains and doing basically, well, it's not exactly zero copy, because you still have to load the XMM registers, but it's close to that. Right now
I'm doing an unneeded copy, but that can be optimized.

Let's take a look at the assembly. The assembly was in fact written by Intel. We actually requested them to release it under the BSD license and they did, and thanks to them for that, because it saved a lot of trouble for us. The file is sys/arch/amd64/amd64/aes_intel.S, and it only exists for amd64 because it hasn't been ported to the 32-bit architecture; that requires a different function calling convention and a whole lot of other things that we didn't do, so if someone can pick up this task, that would be a great thing. I know that FreeBSD have taken this file and split it into more usable parts, adding more C code to simplify it, but I haven't looked very closely. I know that they still have a lot of assembly code for key generation, I think they didn't port that, but that's certainly something to look at. Unfortunately, I think they don't implement the GMAC interface. GMAC requires additional XMM registers, and i386 has only eight XMM registers, if I'm not mistaken, while amd64 has 16, so that's also a difference that needs to be overcome.

This file implements a bunch of functions. aesni_set_key does the key expansion, it generates the key schedule. Why is it important to have key expansion written in assembly? Well, it doesn't matter for an IPsec tunnel that creates a connection once a month and then stays and runs for a month. But what if you have a VPN gateway that serves hundreds or thousands of connections daily? Key expansion is an extremely slow procedure. In fact, OpenBSD uses the Blowfish key expansion code for password encryption, because it takes a lot of resources to compute. Then aesni_encrypt and aesni_decrypt, that's clear.
They basically implement the rounds of AES. The ECB function is not used, because ECB is not used in IPsec, or anywhere else in fact, and shouldn't be used. Then CBC, which is used in IPsec; CTR, same thing; and two functions that I have mostly put together myself, although I took some of the Intel code from the white paper, which implement the so-called reduction step in the Galois hash function. I'm sorry, I forgot to tell you that GCM stands for Galois/Counter Mode. Galois was the mathematician who started finite field theory, so the cipher mode is named after him. That's an interesting fact.

Okay, so how does this assembly wrap up into a driver for the cryptographic framework? There is a C file that actually implements this whole interface, and it's essentially a wrapper around the assembly. Currently it supports accelerated AES-CBC, AES-CTR and AES-GCM-16. It has been there since OpenBSD 4.9, and then I slacked a bit on committing this GCM stuff, so the GCM stuff was committed only in 5.1. Because of the way the cryptographic framework operates, if you are using AES-CBC with HMAC-SHA1, that HMAC-SHA1 cannot be accelerated by this driver, at least right now. So if you are using this transformation and you still want to use AES-CBC from AES-NI, you basically need to call software crypto for all your HMAC routines, which the driver does as well.

The future projects in this area: clean up the assembly and maybe make it faster. AES is specified as a big-endian cipher, like most of the other cryptographic algorithms, so there are lots of endian conversions going on, in fact not in the AES code but in the GCM code, and maybe some of them can be dropped. Port to i386: well, I talked a bit about it, I don't want to dwell on this. Implement AES-XTS: it's not hard to implement AES-XTS, and the only reason it wasn't done is that we don't have consumers for this.
The only possible consumer is softraid, the softraid crypto mode. I will be talking about it a bit later, but I just want to mention that right now it's not possible for softraid crypto to simply use the AES-XTS provided by the AES-NI driver. In fact, I can mention why right now, because time is running out: softraid does not go through the crypto thread. It calls the crypto framework's invoke method, which I will be talking about a bit later, directly, and what happens is that it loses process context, and you can't use AES-NI outside of process context, in interrupt context.

Evaluate AVX. AVX is Advanced Vector Extensions, a new instruction set extension released by Intel in the new CPUs, I think Sandy Bridge and up. AVX is a very interesting thing. It extends the XMM registers to 256-bit YMM registers, and it updates most of the SSE instructions dealing with floating-point math to operate on the 256-bit registers. It also duplicates almost every single SSE instruction with the same instruction with a V prefix that takes one more argument, in which to save the result of the operation. This is a crucial thing for people doing things like automatic vectorization in compilers, because they basically get side-effect-free instructions. Maybe it's possible to gain something for AES-NI from AVX, I'm not sure. AVX requires additional support from the operating system.
That is also lacking in OpenBSD, so if somebody can step up and write the code...

Okay, let's talk just a bit about the cryptographic framework, so that we understand what's behind this whole thing. It was designed and mostly implemented by Angelos Keromytis for OpenBSD 2.7; that's 10 or 15 years ago. The main purpose of the framework is to give consumers, which are kernel services like IPsec, access to hardware crypto accelerators. So it's a framework that allows you to write unified drivers for cryptographic accelerators, and it gives the IPsec stack and other consumers within the kernel a unified interface to access them. The implementation includes the kernel API, which is described in the crypto(9) man page, and the userland interface, which is done through a character device, /dev/crypto, described in the crypto(4) man page. The framework operates on mbufs, which come from the network stack, and on uios, which are basically a sort of extension on top of iovecs, a kernel version of the iovec filled with the information provided by userland. Every write that userland programs make translates into a uio: whether the write was issued with an iovec or just by supplying a plain buffer, the kernel always translates it into a uio.

Example drivers include software crypto; VIA xcrypt, a set of instructions that implement AES-CBC mode in one instruction; AES-NI; and PCI crypto accelerators like ubsec, which is a Bluesteel, later bought by Broadcom, crypto accelerator, and hifn, for example, by a small company.

Let's take a quick look at the kernel API. To register a cryptographic accelerator with the framework, the driver for the accelerator has to call the crypto_register function, providing three
pointers: newsession, freesession and process. newsession and freesession deal with context creation and context destruction, and process does the actual processing of the mbuf or uio chains. The consumer, the code that actually wants to get its hands on the cryptographic services, starts with crypto_newsession, then calls crypto_getreq to get a cryptographic descriptor or a number of descriptors; for example, for AES-CBC plus HMAC-SHA1 you would need two descriptors, one for encryption and one for the HMAC. When the consumer is done filling the crypto descriptors with information, it calls crypto_dispatch to send those requests to the cryptographic framework. The consumer has to supply a callback function to the dispatch routine, which will be called during the completion phase: when the cryptographic transformation has happened, you might need to do further processing, like in the IPsec stack for example, and this is where you supply the pointer for the callback. Then crypto_freereq when you're done with the descriptors, and crypto_freesession when you're done with the session.

Now let's take a look at how crypto_dispatch works. crypto_dispatch, in fact, doesn't call the process routine by itself. Instead it creates a workq task for the crypto thread to deal with, so basically the only thing that crypto_dispatch does is queue a workq task. When the crypto thread is woken up by the workq subsystem, it calls the crypto_invoke function, whose only purpose is to look up the hardware driver and call its process routine. The actual work happens in the driver processing the request, and here, on this picture,
you can see I specifically added this comment to denote that the crypto_done function might in fact not be called directly here. What happens with the PCI cryptographic accelerators is that you fill the DMA ring with the packets you want to process, you signal the card to start doing so, and you exit. Then the interrupt happens, and the interrupt handler does the queue completion, and the final step in this queue completion routine is to hand the crypto descriptor back to the cryptographic framework by calling crypto_done. crypto_done will not call anything like the callback directly. It will enqueue a new task for the same crypto thread once again, for it to call the callback, and the only thing that the crypto thread will do is call the callback.

So why is it done this way? Why are we not calling these functions directly? Well, there are several reasons; I'll try to be fast. The main reason is that this part is the network stack and this part is the network stack. For example, in the ESP case, the top one is ESP input and the bottom one is the ESP input callback. These parts run as part of the IP software interrupt, which has the interrupt priority level IPL_SOFTNET. The cryptographic accelerator runs at IPL_NET, which is higher than that. The problem is that in interrupt context, or in fact anywhere else, you can't lower your SPL level unless you know the previous one, and because of this function chain, at the point when crypto_done is invoked you actually don't know which SPL level was there. In the case of an interrupt it's not even possible to do it correctly. So that's why we basically have to jump through the hoops and call it this way.

Now just a brief slide about the old PCI and PCI-X crypto accelerators. In fact, the latest Broadcom version, which is PCI Express, is in essence a PCI-X device with a PCI-X to PCI Express bridge.
So it's still a PCI-X device. It was a great idea back in the 90s when they decided to offload crypto to a PCI device, although even back then they understood that having lots of small bus transactions, doing crypto for tiny packets, doesn't bring much of a performance improvement. These devices are usually capable of doing symmetric crypto like DES, 3DES or AES in some particular mode, like AES-CBC, and you can't make the device do another mode unless it supports it. They can also perform an HMAC operation in the same round of DMA processing; usually it's MD5 or SHA-1. They usually have a random number generator, which I didn't mention, and a public key engine; the only purpose we have found for it in OpenBSD is to use it to perform modular exponentiation, a procedure that is used in RSA.

Recently we have figured out that the whole work that was done to support these crypto accelerators is actually not exactly rewarding, because most of this hardware is not 64-bit DMA capable. So you need to implement bounce buffers.
So you need to take buffers from the lower memory, or you need to run it in a system with an IOMMU and things like that. We figured this out just in time for enabling bigmem on amd64, which is the possibility for the kernel to access memory above 4 gigabytes.

Also, these drivers contain some extra logic to handle, for example, the initialization vectors and other parameters, and that logic can be wrong. Recently, when those IPsec allegations started (you might have heard of them), we started looking into that, and we found that the initialization vector processing in these drivers was in fact wrong. What was happening is that the drivers implemented the old way: not re-initializing the initialization vector for every packet, but doing that only for the first one, and for the later packets taking the last bytes of the previously encrypted packet. Apparently the old specification for CBC, for IPsec I mean, stated this way of processing. But all the newer specifications say that if you do that, and you can control the input and look at the output, you can correlate the ciphertext that you produce with the plaintext that generated it, and you can get close to recovering the key. The only reason why you would use these devices right now is heavy modular exponentiation. It is still very expensive in software and it uses large chunks of data, so it's still relevant if you run, for example, SSL.

Okay, I'm out of time right now, so I just want to mention what we want to do in the future with the OCF. Unfortunately, I didn't have enough time to talk about extended sequence numbers, but I have added support for them to the cryptographic framework and to the IKEv2 daemon, and we want to add support to other drivers, including AES-NI.
We want to use multiple workqs instead of the crypto thread to make it perform better on SMP machines, we want to rethink locking, we want to improve GMAC performance, the C version, and we want to convert softraid to use workq.

Okay, unfortunately I ran out of time. So what's the plan, do we take questions, or is there no time for questions? Nobody knows, so we can do some questions. Does anyone have any questions? Please speak up.

[Audience question] All that asm code was from Intel. Does it work on AMD CPUs, which from what I know also include those instructions?

You mean that... yeah, I didn't try; I haven't seen those CPUs yet. I mean, I nearly forgot that AMD exists. Funnily enough, it should: they implement the same set of instructions. Any other questions? All right.