Okay, good morning everyone. My name is Ard Biesheuvel. I work for the ARM kernel team, and today I will be talking about the crypto subsystem in Linux: why it is the way it is, how it's being used, and some things that are wrong with it that we're trying to improve. Feel free to raise questions during the talk if anything is unclear or needs clarification; I will also leave some time at the end for questions, so whatever you prefer. People that have worked with me mostly know me from other stuff I do in the kernel and in other open source projects. I maintain the EFI subsystem, and I also contribute to the EFI implementations that we have for ARM and for QEMU in the TianoCore project. The reason I got involved with EFI in the first place was because I was porting secure boot to ARM, so I actually came from the security side into firmware rather than the other way around. The reason I'm giving this talk and not Herbert, the maintainer of the crypto subsystem, is that, well, he lives in Australia and he just had a new kid, so travel around this time was not feasible for him. To clarify my background a little bit: these are the first patches I ever got into the kernel. The first one is basically a backport of an OpenSSL fix to ARM code that we had adopted from the OpenSSL project. So even before I got involved, we were already working with Andy Polyakov, who is, well, most people in this audience probably know him, an OpenSSL guru who wrote the accelerated software implementations for all the architectures that OpenSSL supports. And even before I got involved we already had SHA-1 and AES implementations for ARM that we adopted. When I got involved, working for Linaro before I joined ARM, I started working on crypto in the OpenSSL project, and then, working with Andy, we got some of the NEON code, the SIMD code, into shape to get it into the kernel as well.
Yeah, after this we got the accelerated instructions in the ARMv8 instruction set, so the 64-bit flavor of ARM has special instructions for AES, et cetera. So I implemented support for those in the kernel, and also the actual emulation of those instructions in QEMU. And these days there's a whole set of algorithms that we have instructions for, and these are all supported in the Linux arm64 port. So before we go on and look at what the crypto subsystem is used for, I'd like to clarify what it's not. There's no desire to be complete and implement every kind of cryptographic algorithm that you can imagine. Every time someone contributes a new algorithm, you're expected to clarify what it's actually going to be used for, because otherwise we maintain stuff without a user, which is a waste of everyone's time. It's also not a playground. I don't mean that in a bad way, but if you want to experiment with crypto, the kernel is not the best place to do it, because the way it's being used is very specific to certain use cases. And also, well, we have 4k or 8k stacks, we have preemption, we have to disable preemption when using SIMD registers, and there are lots of different things that make it, well, not the best venue for playing around with crypto. Also, it's not a popularity contest. There is support in the kernel for modes and ciphers that are, well, kind of stale: triple DES, MD5, that kind of thing. But as long as they're still being used in remote file systems and other bits and pieces, we have to support them. We can't follow fashion. And even for new stuff like Chinese crypto, Russian crypto, NSA crypto: if there's a Wi-Fi protocol, a Bluetooth protocol, something that actually relies on it, we're going to have to implement it if we want to be interoperable. So it's not very constructive to have political or ideological debates about the actual algorithms.
It's more about creating the environment where people can use and implement whatever they want. Okay, so it was added in 2002, which is around the time I started playing with Linux as a user, so way before I ever got involved. And things were a bit different then. Everything was 32-bit in the first place. I mean, there was Alpha and maybe some other architectures, but for all intents and purposes, everything was 32-bit. The stacks were just direct mapped. VLAs, variable length arrays on the stack, are something we restricted recently; we don't support those anymore either. And indirect calls: so function pointer tables, that kind of model of doing abstractions, was fine at the time. And now with Spectre, Meltdown, speculation issues, those are actually very costly, especially on x86. So at the time this was all fine, and today people are running into issues because of it. And that's why the crypto subsystem sometimes gets some not entirely undeserved criticism, but it's good to point out where this came from in the first place. The three primary users today. So there's the block layer: there's the dm subsystem, with full encryption of entire block devices, and the verity and integrity pieces that are used for authentication. And then there's file system level encryption, where individual files and file names are encrypted individually. There's the networking stack, where it all got started with IPsec, but there's now also the Wi-Fi encryption layer that uses it. And something that got contributed fairly recently is TLS offload, so the HTTPS handling, being able to do that basically inside the kernel; that also relies on the crypto subsystem. And then the third one, which is a bit of an odd one, is the device node interface for crypto accelerators.
So the crypto subsystem implements lots of algorithms in software, but the abstractions were designed in such a way that the caller doesn't really care whether the algorithm is backed by software or by an offload engine. And in order for user space to be able to use that, there's an AF_ALG socket type that you can use. This is all fine, but one thing that's kind of unfortunate is that it doesn't distinguish between whether you're actually talking to an accelerator or whether you're doing the crypto in software. And when you're jumping through the hoops of using a device node and calling into the kernel just to run the same set of instructions that you could easily run in userland, it doesn't really improve anything, and the fact that this is a userland ABI means that we have to support it forever. So, well, if I had been involved at the time, I might have pushed a bit for putting a limit on it. IPsec was designed by committee. I work in firmware, so I know what it's like to work on things that were designed by committee: very complex, very over-engineered. People are aware of TLS downgrade attacks, different levels of crypto, et cetera; IPsec is a bit similar. These are lists of encryption modes and authentication modes that you can negotiate when you set up the connection, and these are all moving parts that you can potentially subvert, with a man-in-the-middle attack or some other way, to negotiate a mode that the peers didn't actually intend to negotiate, but which is weaker or which can be exploited in one way or another. Yeah, so this is the reason we have the abstract interfaces. The mode you actually negotiate is runtime variable, so we can't support this with a library interface, because you would have to have a giant switch statement and a giant set of function calls to have all these different things supported.
So that's why the API is very abstract and very flexible. And actually the dm subsystem relies on this as well: it's actually userland that defines the algorithm that you're going to use for encrypting your block device, and it's just passed into the kernel as a string. Then the kernel parses that and instantiates the algorithm. This is the full set of components that we provide. Some of them I'll go into in a bit more detail, but this is the full list. So there's encryption; there are hashes for fingerprinting, signing, that kind of stuff. We have skcipher, which stands for symmetric key cipher. Symmetric key is where you use the same key for encryption and decryption. We have the akcipher abstraction, which is for asymmetric key, so public key encryption; actually, it's barely used, which is kind of strange. We have KPP for public key agreement, so to establish a symmetric key over an insecure channel. And then there are some other pieces: compression, which is used in IPsec, I think, and in a few different places. And then we have something for random number generators, which kind of collides with the driver stack for that; I think we mostly do pseudo random number generators here, rather than the true entropy ones that the RNG driver stack supports. Okay, so this is a short sequence that explains the use. Keep in mind that it's designed to operate on network packets or disk blocks, and that it's designed to be backed by various different implementations that could either be synchronous in software or asynchronous on an offload engine. So I'll go through these step by step. A scatterlist, for people who are not familiar with it, is just a scatter-gather list: a list of (address, length) pairs that describe a buffer in memory, or a set of buffers that are logically concatenated. And this is one of the impediments when you're using in-kernel crypto casually.
Because when you're using a scatterlist, you have to use linear addresses, so addresses that are in the kernel direct map. Today, with mostly 64-bit architectures, that sounds a bit backwards, but at the time, 32-bit architectures didn't actually map all of memory. So if you wanted to access an arbitrary physical page, either you had to be lucky enough that it was already covered by lowmem, or you had to go and kmap it and kunmap it, et cetera, et cetera. And because offload engines don't care about virtual addresses either, it's actually quite logical for the use case at hand to use scatterlists and not anything else. Yeah, so that means you can't use vmalloc buffers, because they're not covered by the linear map. It also means that, with the vmapped stack change, you can't create scatterlists that contain pieces of the stack; well, you can debate whether that was a good idea to begin with. So when the vmapped stack changes went in, like two years ago for arm64 and the year before that for x86, we ran into a lot of issues, especially with the offload engine drivers that don't get that much test coverage from the typical CI projects, where the fact that the stack now lives in the vmalloc space means that the scatterlist points to a bogus address, and things crash, et cetera. So taking disk encryption as an example: XTS is a mode of AES used to encrypt disk blocks. This is what you would typically do once, when you mount your encrypted block device. You allocate an skcipher transform, and what you get is an abstract transform which is backed by some implementation of AES in XTS mode. And in the crypto algorithm API there's lots of logic that tries to locate one if one already exists. If no algorithm is registered that actually produces this transformation, it goes and looks for modules that might implement it.
If no such module exists, it will actually parse the string and say, okay, this is XTS, and then it will try to find what we call a template. This template is invoked with whatever is within the parentheses as the arguments, and then the template will instantiate it. In this particular case it's actually backed by another chaining mode, ECB, and the ECB template. So there's a lot of complexity under the hood that produces some implementation of XTS, whichever way the platform can provide it. This means, for instance, that this tfm is an abstract type: depending on which algorithm actually produced it, its size is different and its contents are different. There are some standard fields, but part of it is owned by the particular implementation. Yeah, so the algorithm is really a runtime variable thing: in the IPsec example it's one of a list of many, and in the same way dm-crypt and other block encryption stacks provide it as a runtime thing. There's also a downside. Especially in the dm-crypt case, for some reason they decided to support a combinatorial expansion of every mode with every algorithm, et cetera, et cetera. But if you read the XTS spec, it's only defined for AES, and the way the AES blocks are chained together is really designed for AES; using it with anything else is not a great idea. The fact that we expose it to userland actually means that if there's one guy in the world who decided to start using it, we're stuck with it forever, basically. So, yeah, the combination of those two things makes it a bit unwieldy to maintain. So, CRYPTO_ALG_ASYNC: I included this here because people tend to get a bit confused about how you actually set the flags. We have synchronous and asynchronous implementations. A synchronous implementation is guaranteed to return immediately after you call it.
And an asynchronous implementation may actually return a status code that says crypto operation in progress, and then you provide a callback that gets called at some later point. And this is just a (mask, value) pair where you say: okay, I want to filter on the async bit. So CRYPTO_ALG_ASYNC is the mask, and the value is zero; you want the async attribute to be zero, which means only synchronous implementations. Setting the key. Well, this is rather straightforward. The only thing I'd like to note is that at the time, this was a seamless match with the algorithms, but today we have, especially in the MAC area, keyed hashes that actually use a different key for every request. And unfortunately the API has some constraints about when you can call setkey and when you can call the encryption routines. The encryption routines you can call from almost anywhere; setkey you can only call from process context. So if you look at how Poly1305 is incorporated, it's slightly messy, but I have some ideas on how to improve that in the future. This is the actual encryption call. In the block I/O example, this is what you do for every block, so you do it many times. And there's another thing you have to allocate: there's a request structure, and the request is another abstract thing, and the size and makeup of it are an implementation detail of the algorithm that happened to provide the transformation you asked for. Yeah, the reason it makes sense to allocate it from the heap is because the stack goes out of scope, and for asynchronous implementations you have to be able to access this at any time until the request is actually completed. So in summary, the transform/request API is tailored to the way we use it in the kernel today, with IPsec and the block I/O. But for doing casual stuff, it's not really that useful. So, also triggered by the Zinc debate, we started to provide library APIs for certain things.
So there was already a SHA-256 implementation floating around somewhere, used for kexec, I think. So we pulled that into lib/crypto. There was some other code that was only used casually, where we didn't have any accelerated implementations in the first place, so it's much more straightforward to just provide a library version and get rid of the complicated transforms and algorithms entirely. I'm currently in the process of extending these with some other algorithms, but I'll get back to that later. So we're adding more and more algorithms to the library interfaces that can be used for certain other use cases. Okay, getting back to the symmetric key cipher; that's what's most widely used in the kernel. We used to have blkcipher and ablkcipher, an asynchronous block cipher, but not all symmetric key ciphers are block ciphers, and you can use a block cipher to create a stream cipher, so the naming was a bit off. Also, it's more useful to have one abstraction that can do both synchronous and asynchronous. So we just got rid of blkcipher and ablkcipher and have only skcipher today. We also have cipher, which is basically an internal building block of the crypto API that leaked into other code; I'll have an example of that later. It's software implementations only, and it uses virtual addresses, so it looks appealing if you want to do some crypto and you don't want to do the whole request and transformation dance. But actually it's never the right abstraction to use if you're not implementing a crypto template or a crypto mode itself. Then we have AEAD, which is basically another flavor of a symmetric key cipher, and I think it's actually a mistake to have two different types for this. I see someone nodding in the audience. So I actually intend to merge these in the future, and we have had some discussions on the list about this already.
So basically it's doing encryption, but while doing the encryption you also sign the ciphertext, so when you do the decryption you can check the signature first to make sure it hasn't been tampered with. And like with skciphers, if you're doing AEAD operations you should not cobble together your own implementation using the cipher abstraction. So this is some code taken from the Wi-Fi subsystem, and of course, because of the history there, we had four copies of this floating around, and just recently we got rid of the last two. On the left side you see the original code. Basically, what you're doing in CCMP mode is going through a network packet, which is divided into blocks; num_blocks is the number of 16-byte blocks that make up the network packet. Then it calls into the cipher encryption routine block by block, so the for loop really goes over every 16 bytes. The first encrypt call keeps encrypting the same block, XORing in the data each time, so it's basically computing a MAC signature, and the other call does the actual encryption of the payload. We replaced it with the code on the right, which just defines some data structures and then passes them into the crypto subsystem. The AEAD we have in the crypto subsystem is called ccm(aes), and there are various ways it can be instantiated, which actually illustrates why you should be using the abstraction and not open coding it.
I wrote an implementation for arm64 a couple of years ago, using the AES instructions, which does the whole transformation, including the MAC and the encryption, basically interleaved: it's one big algorithm that does all this processing at the same time. On a fast ARM chip it runs at around four cycles per byte, and interestingly, because the MAC part of the algorithm is strictly sequential, it's even faster on low-end hardware: three cycles per byte on hardware with short pipelines. But because of the plumbing, if you don't have a complete implementation, the CCM template will be instantiated, and it will find separate implementations of the encryption part and the MAC part and put those together. So if you rely on the original implementation we just saw, which only calls the ciphers directly, then even if you use the same accelerated instructions, you end up at around 28 cycles per byte. So it's almost an order of magnitude slower just because of the way the calls are wired together, while in the end you're using the same instructions for it. Before I go on, I would like to explain something about AES. Most people may be familiar with this, but I'll explain it nonetheless. AES has been unbroken for 20 years, I suppose; there are some attacks on reduced-round versions and some other academic vulnerabilities that don't affect the real world. What does affect the real world is that in order to implement it securely in C code, basically it's either going to be fast or it's going to be secure.
You can have an implementation that is very slow and time invariant, but the normal implementation we use for generic code, so for any architecture that doesn't have its own special implementation, is aes-generic, which is C code, and the way the field arithmetic is accelerated there, with lookup tables, results in an exploitable correlation in the lookup table access time. AES is made up of a number of rounds, and in the first round the lookup table indices are based on the key XORed with the plaintext. So if you know the plaintext and you know the timing, you can infer bits of the key, and already 15 years ago this was shown to be exploitable: if you have enough samples, if you send millions of network packets to some unsuspecting host, you can infer the key it's using from the timing variances in the response times. So there was another user in the kernel using the cipher API directly: the TCP Fast Open feature. Basically, these days your browser has like 10 or 20 concurrent connections to the same server, to load all the different parts of a page quickly, and this relies on an extension where, on the first connection, the server gives you a cookie, a cryptographically generated thing, and all the other connections that you make right after present the same cookie. Then the server can say, okay, this is all the same guy, and it just starts sending data immediately rather than doing the full 3-way TCP handshake. And this was implemented using the cipher API. This is the IPv4 version, where it just takes the source and destination addresses, which is 8 bytes of data, pads it with zeros, encrypts it, and then truncates it to the cookie size, which is 8 bytes. In the IPv6 case we actually do it twice, because the addresses are longer, and basically this is CBC-MAC again, the same MAC that we saw in CCMP: you encrypt something, then you XOR something into the
ciphertext, and then you encrypt the whole thing again. So on x86, if you have the special instructions, this is actually what you get when you call crypto_cipher_encrypt_one(). First there is this indirect call, because it's an abstract API, and then it will actually go and do kernel_fpu_begin() and kernel_fpu_end(), which stacks the whole SIMD context to memory, then calls the AES instructions, and then unstacks it again. Actually, fairly recently, I think about six months ago, they finally implemented an optimization on x86 where the restore, the kernel_fpu_end() part, is done lazily. That means that if you're doing many of these in sequence, it only gets unstacked when you finally return to userland. But at the time this was implemented, it was really, really costly, especially since, if you're doing this for IPv6 the way we were doing it, you're doing this back to back. So you're really moving a lot of data and doing a lot of stuff you don't need, by using this cipher abstraction. And if you don't have the AES instructions, then the previous slide kicks in. Using AES for a MAC is a terrible idea to begin with, because we know AES is susceptible to known-plaintext timing attacks, and MAC input is plaintext by definition. So it was just very poorly chosen, and someone with a bit more crypto clout would have implemented it quite differently. What we ended up doing is switching to SipHash, which is just a very fast keyed hash algorithm. The thing about MACs, keyed hashes, is that they don't have to be collision resistant. So even if you manage to create collisions with this algorithm, it doesn't really matter, because it's only the key that you want to hide; the peer already knows its own source and destination addresses. So SipHash was a much better choice.
Stream ciphers. Yeah, so there are different classes of length-preserving symmetric encryption. Symmetric encryption means that you can encrypt and decrypt again: there's one key that is used for both, and there's some reversible operation, so knowing the plaintext you can get the ciphertext, and knowing the ciphertext you can get the plaintext. There's a special class of these called stream ciphers, where the reversible operation at the base is just XOR, and the key stream that you XOR with is what is defined by the particular algorithm. This key stream is defined by the key, which is the thing you have to keep secret, and then there's a nonce or a salt or something, just an arbitrary value that you mix in with it to make sure that you don't get the same output every time. The problem with this is that, because the basis of the algorithm is XOR, if you reuse the same key and IV pair you will get the exact same key stream. So if you have two different packets that use the same IV, and you XOR the ciphertexts together, what results is the XOR of the plaintexts. This is one of the catastrophic failures of stream ciphers, and why you have to be careful choosing your IV. Yes, so they're both typically generated randomly, but it's worth noting that the key has to be a secret: when you use a random key, you're using randomness because you want it to be very difficult to guess, otherwise it's not a secret. For the IV, randomness is often used, but not because it needs to be really random or secret, just to reduce the likelihood that you use the same value by accident. Actually, in most cases, using a counter and managing the key and the counter correctly is a much better choice. So the Google Android phones, at least the high-end ones, have full device encryption these days, and this is usually based on SoCs whose CPUs have these special instructions. And they had some
suggestions, some proposals, to do the same on phones that don't have these special instructions. They were looking for something that doesn't use AES, because AES without the instructions is both slow and, well, maybe insecure. The first proposal was based on ChaCha20, which is a stream cipher. For network encryption this is kind of okay, because things are short-lived: you send the packet, receive it, and then it's gone again. For block encryption, you will always use the same key as long as the blocks are live, and you don't have space to store an arbitrary IV, so the IV is typically the sector number. Under the assumption that you can only see one version of each sector at any one time, they said that, okay, maybe ChaCha20 is suitable, even though it is susceptible to the stream cipher problem. But in practice it turned out that if you have managed NAND flash, with wear leveling and block relocation under the hood, it may actually be the case that you have different generations of a sector floating around somewhere on the device, which could be used to defeat the encryption. I'll speed up a little bit. Yes, so there were some other attempts at doing this. There was an NSA-designed block cipher that was contributed, and then they decided not to use it, so we removed it again. And they ended up with a completely new invention, which is quite nice, called Adiantum. It's a very complicated construction of almost-universal hashes and stream ciphers and block ciphers, et cetera, but they're actually shipping that on the low-end phones and it's providing security, so it's quite a nice achievement actually. Let's move on to this. Yes, so there's been some talk about WireGuard on the list recently. The main difference between WireGuard and the other users that we have in the kernel is that there's no algorithmic agility whatsoever: all the algorithms are defined by the protocol. So that means you don't need these
abstractions. You don't need string parsing and other moving parts; you can just do function calls, especially if you decide not to support offload engines. There's a downside, however. First of all, the algorithms they selected were all designed by Daniel J. Bernstein, who is kind of royalty in the community, but it's kind of dubious nonetheless. And also, the actual IPsec mode that is being reused is explicitly presented as a fallback: it says, okay, everybody use AES, but if we ever find a problem with it, let's have a plan B, and this is the plan B. So ChaCha20-Poly1305 is the plan B for doing IPsec, and WireGuard uses it as its only, primary encryption mode. So I did some benchmarks comparing these different modes. First of all, the dark blue is with jumbo frames; benchmarking a VPN stack with jumbo frames doesn't make a lot of sense, it's just to illustrate the performance difference. So we have AES-GCM, and then there's a new algorithm that was selected by the CAESAR competition at the beginning of the year as a kind of successor to GCM, one that gets rid of GCM's rather awkward MAC part; it is also based on the AES instructions, but uses them in a completely different way. And as you can see, even with the normal packet size they're already a bit faster. The performance of these algorithms is a lot higher if you have the instructions, and higher performance means longer battery life for mobile devices. So, yeah, I think this is something we might want to revisit at some point if you really deploy this at scale. We've done a lot of testing; Eric Biggers, one of the co-authors of Adiantum, is doing a lot of work on the test cases, and there's a lot of janitorial work going on as well. But I think I covered most of this, so I'll just leave this up here while we take some questions. Any questions? Hi, thanks for the presentation. So there was recently a debate about the hardware and Zinc, related to
not providing support for hardware cryptographic acceleration for IPsec. How do you see this going in the future? I mean, will hardware vendors have to develop this, or will this be synchronous only and rely on the existing crypto API to go through the existing IPsec stack? I think it needs to be driven by demand. People are building accelerators for this mode already that can be used in IPsec; we already have some drivers in the kernel that support this, and, well, I guess the new NXP chips have this as well. So I think it makes a lot of sense to do a proof of concept with actual hardware, and if we can see that it has some kind of improvement in terms of power draw or some other performance metric that we choose, then it's something that we should definitely revisit. But given the controversy that was already surrounding it, I think it was a wise choice to just disregard it for the moment, just get the thing in, and then start making changes to it afterwards. Okay, so do you think that the crypto API will eventually absorb this somehow? WireGuard is somehow evolving in parallel with the crypto API; do you think this will be absorbed? I mean, well, it depends a bit on the author. The author knows a lot about crypto, and we'd be very eager to have his input on improving the crypto API, rather than having a huge stack of stuff on the side next to it. So I think there are very good ideas there that we can leverage for other things as well, so I think, yeah, they'll probably merge in the future. But having indirect call based APIs is costly these days, and we didn't have that problem 15 years ago, so we'd have to make some changes to the crypto API as well, maybe using static calls or static branches, and do other things to compose these algorithms rather than having it all rely on function pointers. Thank you. Any more questions? We're running a bit late, but I think it's fine because
we have quite a short day today, so we can shift things down. Hi, thank you for the talk. My life experience has shown me that it's important to distinguish between pain you must suffer, that is inevitable, and pain that, well, you can do something about. From learning the crypto API, and from your presentation, there are some things which people don't like but can't change, but also some things that we can do something about. For example, my pet peeve is the way the API behaves when you actually don't want to use async. So my question is, maybe not specifically to you, but as a representative of Herbert: is there openness to patches that will try to fix some of these? Because I think there are things we can change, and there's stuff like the AEAD that you mentioned before being unified with the skcipher stuff; that just makes sense, it would make for a much cleaner API. Thank you. Well, I think Herbert himself is not very active as a contributor, but he's quite open to making improvements. So it's up to the people that care enough about these things to start fixing them. I myself started to make some changes: once I knew I was going to talk here, I did some research, and while doing that research I ended up improving some things and fixing some things, and I kind of have a mental list of things I want to look into in the future. And if other people feel the same way, we should obviously sit together and work out a bit of a roadmap. I'm sure Herbert and other contributors are up for this; I know Eric Biggers is also very eager to make improvements. So, yeah. Okay, thank you very much. Any more questions? No? Let's thank the speaker. Thank you.