Hello everyone. I'm Boris Brezillon and I've been working for Free Electrons for almost four years now. As part of my job at Free Electrons, I do a lot of work around Marvell SoCs, and one of the drivers I have developed for Marvell SoCs is the crypto engine driver, which is what I'm going to talk about today. So let's see what we'll cover during this talk. First, I'll introduce various cryptographic concepts. This will be a short introduction to those concepts; I don't pretend to explain everything about cryptography here. Then I'll explain what kind of services the in-kernel crypto API provides, so you can use those services. After that we'll dig a bit into the crypto API to develop a crypto engine driver. And finally, I'd like to share the experience I had while developing the Marvell crypto engine driver. So let's start with the basic concepts. Of course, if you've already heard about Alice and Bob, I guess you don't need this introduction, but for those who don't know these people, I'll try to sum up a bit what cryptography is about. Cryptography is about ensuring that a communication between two people, or more than two people, is protected. This protection is separated into three main concepts. The first one is confidentiality, which means that no one else can spy on the communication happening between those people. The second one is integrity, which means no one else is supposed to modify the content of the communication, or at least, if someone modifies the content, the people involved can notice and decide to close the channel. And the last one is authentication: making sure that when someone sends you a message, this message is actually coming from that person and not from someone else. How do we do that? Basically, we're just manipulating input and output data and adding some metadata to it, and thanks to that we can ensure those three things.
So the first thing we'll see in cryptography is the cryptographic hash, or what is sometimes called a digest. This is the part which is supposed to guarantee the integrity of your data. Hashing functions basically operate on an arbitrary amount of input data and generate a unique fixed-size output, which is most of the time shorter than the input data. For example, you have SHA-2, SHA-1, MD5, and a lot more. The main thing we want from a hash algorithm is that the probability of getting the exact same hash from two different inputs is as low as possible. Ideally, we'd like this probability to be zero, but that's almost impossible. We also want it to be impossible for a user to regenerate the input data from the hash. And the last thing we want is that if we make a tiny modification in the input stream, the resulting hash should be as far as possible from the initial hash. The next thing we need when we're doing cryptography is the cipher element. A cipher is supposed to ensure confidentiality. It basically uses a key to encrypt and decrypt data. You can have two types of ciphers: stream ciphers, which operate on an arbitrary amount of data, and block ciphers, which are meant to be used on fixed-size blocks. Then you also have the notion of symmetry and asymmetry. Symmetric ciphers use the same key for both decryption and encryption, which means all the people who want to take part in the communication have to share the exact same private key. Of course, that's not really secure, because if one of these people gets this key stolen, then the whole communication is insecure. To address that, we also have asymmetric ciphers, which use a pair of public and private keys. The private key is kept secret and only the creator of the key keeps it. They can then send the public key to anyone who wants to communicate with them.
And all the messages are encrypted using the public key, and only the owner of the private key can decrypt them. So we tend to use asymmetric ciphers to enforce security when we don't know exactly who we will communicate with afterwards. But it's also a lot more expensive than symmetric ciphers, so it's kind of a compromise here. A few examples: for symmetric ciphers we have AES, and for asymmetric ciphers we have, for example, RSA. So I said that we have stream ciphers and block ciphers. Even when we use block ciphers, most of the time we want to be able to use them on an arbitrary number of blocks, and in order to do that in a secure way, we have to choose a specific block cipher mode. A block cipher mode is just something that describes how you can encrypt or decrypt several blocks of data. For example, you have the ECB block cipher mode, which is pretty simple: it just takes a key and some input data, does some operation on it, and provides the output data encrypted with the key. And you have more advanced cipher modes, which take the result of the previous encryption and use it as an initialization vector to encrypt the next block, which ensures that the data you're transmitting is actually obfuscated, so you cannot guess exactly what has been transmitted by observing the communication. Another thing we have in cryptography is what we call a MAC, which stands for message authentication code. This is used to authenticate who actually sent the message. It takes a private key, does some transformation on the input data, and generates some output data, which is put next to the data so that the receiver can verify that the sender actually is who it claims to be. Most of the time MAC algorithms are based on hash algorithms, and that's why we call them HMACs.
And then we have the advanced block we use in cryptography, AEAD (authenticated encryption with associated data), which does all of the things we've seen so far in a single step. These kinds of algorithms take some data and generate output that ensures integrity, confidentiality, and authentication at once. Most of the time they are based on simple blocks like HMAC, CBC, and so on. So this was really a short introduction to those crypto concepts. If you want to know more about the things I'm talking about here, you should watch the talk from Gilad Ben-Yossef; I think he is giving a talk this week, so if you want to go see his presentation, I think it's a good one. Now let's switch to what we are really interested in here: the Linux crypto framework. Let's see a bit how it works internally before seeing how to use it and how to develop a driver. Internally, everything in the crypto framework is about transforming data in order to generate something else, and that's why you'll see a lot of places where we talk about transformations. There are two main objects. We have the transformation implementation, which is the base class that implements a specific algorithm, and then you have transformation objects, which are the instances provided by a specific transformation implementation once someone asks to create such an instance. If I had to make an analogy with an object-oriented language, I'd say that the transformation implementation is actually a factory which is able to generate transformation objects, and these are the objects the crypto user will use to do all kinds of transformations. So everything in the crypto framework inherits from these two interfaces: crypto_alg and crypto_tfm.
The crypto framework supports a bunch of algorithms, so you have a lot of cipher algorithms, hash algorithms, AEAD algorithms, HMAC, and you also have one, I don't know exactly why it's here, but you also have compression algorithms, which have nothing to do with crypto at all; I think they fit well in the crypto framework and this is why they decided to implement them there. So these are the base classes of crypto objects, and based on that you are able to generate complex objects which combine different elements. For example, you can generate an HMAC function which internally uses SHA-1, or you can use the AES block cipher in CBC mode, or you can even create an AEAD algorithm based on HMAC-SHA-1 and CBC-AES. As you can see, we always start from simple blocks and then build something more complicated from there. So how can you use the crypto framework? The first thing you do when you want to do some crypto operation is allocate an algorithm instance, what we call the TFM object. This is done with a call to crypto_alloc_ followed by the algorithm-type suffix: you have skcipher for symmetric key ciphers, ahash for hash algorithms, and so on. To this function, you pass the algorithm name; we will see later what should be passed here. Then you pass a type. I'm still not sure what this type is about, because obviously the type is already known from the function suffix, but still you have a type parameter. And then you have a mask, which tells the crypto framework what kind of implementations you want to avoid. For example, you may want to avoid implementations which do things asynchronously, because when you trigger a crypto operation you want the result to be available when the function returns; that's the kind of thing you can ask for when you instantiate the TFM object.
Once you have this TFM object, you can create crypto requests, and this is done with the request allocation functions. So again, always prefixed with the algorithm type. With this request object, you can assign a specific callback that will be called every time a crypto operation is complete. You can also pass a few flags, for example whether you allow the crypto framework to backlog the request if there are already too many requests inside the crypto queue, and things like that. So that's the init part of the creation. Once you have that, you'll want to set the context of your transformation object, and most of the time the context is about setting the private key, or the public key, or whatever you want to use for this crypto operation. Here you use crypto_, then the algorithm type, then set_ and the context you want to set; most of the time it's crypto_something_setkey. Once you have set up everything, you can start passing data to the crypto instance. To do that, you use the request_set_crypt function, to which you pass the input buffer, the output buffer, and the length of the buffer, and then you can trigger the crypto operation. This is done using crypto_, then the algorithm type, then the name of the operation: for a cipher, for example, it will be encrypt or decrypt. And you can do that as many times as you want. Say you have several blocks of data to encrypt; you can repeat that over and over until you're done encrypting or decrypting data. And once you're done, you can free both the request and the TFM. So that's how it's done. Let's see a real example of code to see how it all fits together. The code I'm going to show here is not really meant to be used in a real setup; it's just to show all the different steps you have to go through, and you shouldn't base your development on it if you really need to develop something that uses the crypto API.
So let's look at the right of the screen, at the encrypt function, which is the main encrypt part. This function is supposed to encrypt some data. You pass it the private key you want to use; you pass it the data you want to encrypt, which serves as both the input buffer and the output buffer; and you pass the size. The first thing you have to do is allocate the crypto context, then set up the request and so on. So we call the init function, which calls crypto_alloc_skcipher, allocating a symmetric key cipher instance. We then use skcipher_request_alloc to create a new request. Then we specify the callback we want to be called when the operation is complete. And finally, we initialize a completion element. Once we are done with that, we set up the key, using crypto_skcipher_setkey. We also set up the data to pass to the crypto API: you pass it using a scatterlist, and then we just call skcipher_request_set_crypt to say, this is my input buffer, this is my output buffer, and this is the length of the buffer. And once this is done, you just trigger the crypto request by calling crypto_skcipher_encrypt. By design, the crypto API is asynchronous, which means that when this function returns, you are not guaranteed that the operation is actually done. This is why the crypto API can return -EINPROGRESS or -EBUSY, and in this case you shouldn't consider that a real error; you should just wait for the actual operation to be done. This is simply done by calling wait_for_completion on the completion object you initialized in the init function. Once this is done, you get back the error code, clean up everything, and return it. So this is a simple example of the different steps you have to follow to use the crypto API inside the kernel. Of course, this was just a dummy example.
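The flow just described might look roughly like the following kernel-internal sketch. This is illustrative only, not compilable standalone: the helper names (my_encrypt, my_ctx) are made up, callback signatures vary between kernel versions, and a real driver user should check every return value:

```c
#include <crypto/skcipher.h>
#include <linux/scatterlist.h>
#include <linux/completion.h>

struct my_ctx {
	struct crypto_skcipher *tfm;
	struct skcipher_request *req;
	struct completion done;
	int err;
};

static void my_complete(struct crypto_async_request *areq, int err)
{
	struct my_ctx *ctx = areq->data;

	if (err == -EINPROGRESS)	/* backlogged, not finished yet */
		return;
	ctx->err = err;
	complete(&ctx->done);
}

/* Encrypt `len` bytes of `buf` in place with `key` (hypothetical helper). */
static int my_encrypt(const u8 *key, unsigned int keylen, u8 *iv,
		      void *buf, unsigned int len)
{
	struct my_ctx ctx;
	struct scatterlist sg;
	int err;

	ctx.tfm = crypto_alloc_skcipher("cbc(aes)", 0, 0);
	if (IS_ERR(ctx.tfm))
		return PTR_ERR(ctx.tfm);

	ctx.req = skcipher_request_alloc(ctx.tfm, GFP_KERNEL);
	if (!ctx.req) {
		crypto_free_skcipher(ctx.tfm);
		return -ENOMEM;
	}
	skcipher_request_set_callback(ctx.req, CRYPTO_TFM_REQ_MAY_BACKLOG,
				      my_complete, &ctx);
	init_completion(&ctx.done);

	err = crypto_skcipher_setkey(ctx.tfm, key, keylen);
	if (err)
		goto out;

	/* Same scatterlist as source and destination: in-place encryption. */
	sg_init_one(&sg, buf, len);
	skcipher_request_set_crypt(ctx.req, &sg, &sg, len, iv);

	err = crypto_skcipher_encrypt(ctx.req);
	if (err == -EINPROGRESS || err == -EBUSY) {
		/* Asynchronous completion: wait for the callback. */
		wait_for_completion(&ctx.done);
		err = ctx.err;
	}
out:
	skcipher_request_free(ctx.req);
	crypto_free_skcipher(ctx.tfm);
	return err;
}
```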
If you want to look at real examples, you just have to grep for crypto_alloc_ and you will find the different parts of the kernel which use crypto. For example, we have dm-crypt, which does disk encryption; we have various network protocols which use crypto, file systems, device drivers, and I guess you can find more. So really, the crypto API is used all over the kernel. Now we have a specific kind of user: some people want to use the crypto engines from user space, and they pushed for a long time to have something inside the kernel to expose those crypto features to user space. We currently have two competing solutions: the cryptodev approach, which is not mainline, so that's not so good, and AF_ALG, which is mainline. Let's see a bit how both solutions compare. The first solution to have emerged is cryptodev, and it's actually taken from the OpenBSD world. This solution exposes a device under /dev, and every time you want to do something from user space, you just use ioctls to set up the context, then send data, and so on. The main problem with this solution is that it's maintained as an out-of-tree module and will never be mainlined, simply because we already have something in mainline which exposes crypto features to user space, and this is called AF_ALG. Instead of exposing things through a /dev/something device, it exposes things through a netlink-like socket. And the main problem with AF_ALG is that most user-space crypto libraries are not using it. For example, OpenSSL has an out-of-tree engine for AF_ALG, which is maintained separately, but on the other hand it has native support for cryptodev. So you'll have to make a choice: whether you want to add an out-of-tree module inside the kernel, or whether you want to add support for AF_ALG into OpenSSL by compiling an out-of-tree engine.
So before that, I didn't actually try to understand how they work internally, but I tried to see what kind of performance both of them can provide. There is a claim, or at least it's written everywhere on the internet, that cryptodev performs better than AF_ALG. So I wanted to check whether, with my crypto engine driver, the Marvell one, this was the case. And it is the case: for 8K blocks on the AES-CBC algorithm, you see that cryptodev performs better. But you also see that for really small blocks, it's almost useless to use the in-kernel implementation compared to a user-space implementation. Then I decided to test the user-space implementation, which is probably using the processor's optimized assembly instructions. It's worth looking at the results: as you can see, even cryptodev is outperformed by the software implementation. So I decided to run the same test with a few threads in parallel, because my engine queues things at the DMA level, and the more requests it has, the better it scales. I actually had to create 128 threads to get better results with the in-kernel implementation. And interestingly, that was achieved with the AF_ALG implementation and not with the cryptodev implementation. Still, when you are asking yourself whether you should use cryptodev or AF_ALG, the first thing you should actually ask yourself is whether you want to use either of them at all, because when you compare the results to the pure software implementation, it's not so good. I also had a look at the CPU consumption, because every time you use an external engine, one of the things you want is to offload the CPU. And actually, even the version using the hardware engine uses quite a lot of CPU: around 60% of the CPU, compared to 100% when you use the pure software implementation.
So really, I don't think it makes sense in most cases to use your hardware engine from user space. The only case where it might be interesting is when you have a lot of requests coming in parallel; other than that, you should think twice before trying to use those hardware engines from user space. Now let's have a look at how to develop a crypto engine driver. From the crypto API's point of view, a crypto engine is just an implementation of a specific algorithm, and the crypto API does not distinguish between pure software implementations and those which use dedicated engines. So developing a crypto engine driver is just about implementing and registering a crypto_alg interface. Most of the time you don't implement crypto_alg directly; you implement something that inherits from crypto_alg, because as I said, crypto_alg is the base class, the base interface, and you inherit from it. So you implement it and register it using crypto_register_something, where the something is the type of crypto algorithm. We will see a simple example with a CBC-AES driver. I chose this one because it's quite simple; if you want to look at hash algorithms or advanced AEAD algorithms, I recommend that you look at existing drivers to see how they're implemented. The first thing you have to do when you implement crypto engine support is fill in the crypto_alg structure. The first field you have to fill is the crypto algorithm name. This one is standardized and you always have to use the real name of the algorithm: for example, this is CBC based on AES, so "cbc(aes)" is the name you have to pass. Then you fill in the driver name. This time you don't use parentheses, you just use dashes, combined with the name of the driver, so something like "cbc-aes-xxx", where xxx is just the name of the driver. The next field you have to specify is the priority.
The priority is supposed to represent how well the engine performs compared to other implementations. By convention we have this rule: hardware engines always have higher priority than arch-optimized implementations, which have higher priority than plain C. So normally hardware engines have a priority of 300. Then you pass some flags specifying what kind of algorithm you are implementing and the different things the algorithm supports. For example, if your algorithm does things asynchronously, you have to pass CRYPTO_ALG_ASYNC. Or, if the crypto engine is not directly accessible from user space, you have to pass CRYPTO_ALG_KERN_DRIVER_ONLY, which lets the system know whether it should expose the algorithm through the user-space interface. The next thing you have to do, for block algorithms, is specify the block size this algorithm operates on. Then you have the context, which represents your driver-specific context. Every time you have a new instance, a new context is allocated, and this context is actually allocated by the crypto framework, so the crypto framework needs to know how much room you need for your own private data; this is why you have to specify the context size. And the last thing you have to specify is the constructor and destructor methods, which are called every time a new instance is created or destroyed. So let's have a look at the part of the crypto algorithm which is type specific. Here the skcipher algorithm inherits from crypto_alg, and you have to implement a few fields in this skcipher algorithm. The first one is setkey, because this is a cipher algorithm. Then you have to implement two methods, encrypt and decrypt, and provide some information about the key size, and the IV size if the algorithm needs an IV.
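Put together, the registration might look like the sketch below for a hypothetical "xxx" engine. This is kernel-internal, illustrative code, not compilable standalone: the callbacks are empty stubs, and a real driver would program the hardware in them:

```c
#include <crypto/internal/skcipher.h>
#include <crypto/aes.h>

/* Driver-private per-TFM context; the framework allocates cra_ctxsize
 * bytes for it. */
struct xxx_ctx {
	u8 key[AES_MAX_KEY_SIZE];
	unsigned int keylen;
};

static int xxx_setkey(struct crypto_skcipher *tfm, const u8 *key,
		      unsigned int keylen)
{
	/* Stash or program the key; driver specific. */
	return 0;
}

static int xxx_encrypt(struct skcipher_request *req)
{
	/* Queue the request to the hardware; typically returns
	 * -EINPROGRESS and completes asynchronously. */
	return 0;
}

static int xxx_decrypt(struct skcipher_request *req)
{
	return 0;
}

static struct skcipher_alg xxx_cbc_aes = {
	.setkey		= xxx_setkey,
	.encrypt	= xxx_encrypt,
	.decrypt	= xxx_decrypt,
	.min_keysize	= AES_MIN_KEY_SIZE,
	.max_keysize	= AES_MAX_KEY_SIZE,
	.ivsize		= AES_BLOCK_SIZE,
	.base = {
		.cra_name	 = "cbc(aes)",	   /* standardized name */
		.cra_driver_name = "cbc-aes-xxx",  /* driver-specific name */
		.cra_priority	 = 300,		   /* hardware engine */
		.cra_flags	 = CRYPTO_ALG_ASYNC |
				   CRYPTO_ALG_KERN_DRIVER_ONLY,
		.cra_blocksize	 = AES_BLOCK_SIZE,
		.cra_ctxsize	 = sizeof(struct xxx_ctx),
		.cra_module	 = THIS_MODULE,
	},
};

/* At probe time: err = crypto_register_skcipher(&xxx_cbc_aes); */
```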
Then, inside your functions, you just do what has to be done to encrypt, decrypt, or assign the key to the context. All those functions have to be driver specific, because each engine handles things differently. Once you have done that, you just call crypto_register_something, and that's all; you should be done, and your crypto engine should be exposed. When the framework exposes the crypto engine and someone wants to use it, they either ask the crypto API to allocate an instance based on the usual algorithm name, here "cbc(aes)", and then, depending on all the engines registered in the system and based on the priority, the crypto API decides which one should be used: the engine with the higher priority is chosen over the others. Or, if you really want to use one specific engine, because you know this is the one you want, you can pass the driver name directly when you request a crypto instance. But usually users don't want to have to specify a specific engine; they just pass the algorithm name. So that's all I have on the crypto framework part. Now I'd like to share a few things that have been complicated to deal with during the development of the driver. First, the framework is really complex, and actually it's complex for a good reason: when you look at the number of algorithms supported in there and how easy it is to add a new algorithm to the framework, it explains why it became so complex. Also, one of the good things is that even though it's really complex, you have an extensive test suite, so every time you add a driver for a crypto engine, you can run the test suite and make sure your driver actually behaves correctly. The bad aspect is that it's so open that most of the time you have several ways to do the same thing.
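The two ways of selecting an implementation look like this (kernel-internal sketch; "cbc-aes-xxx" is a made-up driver name):

```c
#include <crypto/skcipher.h>

/* Let the framework pick the highest-priority "cbc(aes)" implementation
 * registered in the system: */
struct crypto_skcipher *tfm = crypto_alloc_skcipher("cbc(aes)", 0, 0);

/* Or force one specific implementation by its cra_driver_name: */
struct crypto_skcipher *hw_tfm =
	crypto_alloc_skcipher("cbc-aes-xxx", 0, 0);
```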
And one example is how you can implement crypto engine support for an skcipher algorithm: you actually have two ways right now. I think skcipher is the right way to do it, but we still have drivers which have not been converted yet. Also, how to inherit from a base class is not really clear; you have at least two ways to do it. Sometimes there is a union inside the base class, and inside the union you have the different interfaces, and sometimes the child class embeds the base class directly. So it's not clear what kind of inheritance you're supposed to use here. It's also hard to tell what the good practices are, because when the framework evolves, the new drivers switch to the new method, but the old drivers usually stay the way they were. And the last thing is that there are some aspects you discover the hard way. For example, I didn't know at the beginning that the completion callback has to be called with softirqs disabled. Which is actually logical, because one of the big users of the crypto API is the network stack, and in the network stack almost everything is done in softirq context when it comes to handling packets. So when you do a crypto operation, the completion needs to happen in softirq context. But well, this is not the only subsystem which is not completely consistent, and I'm probably not the right person to complain about that, because in the NAND framework it's even worse. And that's not something we can address easily: it requires cooperation from both the maintainer and all the driver developers and maintainers, so it takes a lot of time to migrate all drivers to the new way of doing things. Another problem I had when developing and testing the driver is that there is no way to use polling when you are under heavy crypto load.
Which means that even though you have NAPI activated at the network level, which allows the network stack to do all the packet handling in a polled, threaded context, you still have the crypto API generating a lot of interrupts, which kind of defeats the whole purpose of NAPI. In our driver we address that manually by creating a threaded IRQ and then doing a bit of polling after the last crypto request has completed, to see if the next one can be handled right after. But it's still a bit hacky, and I wonder if we should find a better solution, like maybe adding something to support a NAPI-like interface in the crypto API. I'm not sure about that; it's really a question to the audience. And the last point we fought with was the load balancing issue. In Marvell SoCs you usually have two crypto engines which are exactly the same, and with the priority-based selection of the engine you cannot use the second one: every time, the first one is chosen to do the work. Also, say you have different crypto engines in the same system which are not exactly the same, but could still be used in parallel; then actually only one of them will handle the requests, because only the one with the highest priority is chosen when something needs to be done. So the question is: should we introduce some kind of load balancing mechanism? It's not so simple to do, because the framework was not designed for that from the beginning. If we want to do it, we need to introduce a new concept, a crypto engine concept, where each crypto engine is able to expose different algorithms. We also need to find a way to assign a load to each request, because each request will be queued to a specific engine.
So we can either find a way to calculate something rather complicated based on the request type, the request length, and the engine it is queued to, or we can just do something as simple as: the load equals the length of the request. And of course we need to keep track of the total load of each crypto engine. The last thing we need to handle is how to migrate requests from one engine to another, because every time you want to switch to a different engine, you need to allocate a new context, which is driver specific, and then allocate a new request, which is also driver specific. So this is just an idea of how it would look; there is really no proof of concept or anything like that. But I think we can keep the upper layer unchanged, meaning that the crypto user could still request the crypto API to allocate a specific instance based on the priority, and then below that we could add the notion of a load balancer, which would gather all crypto engines implementing a specific algorithm and then, at the load balancer level, decide which one should be used depending on the current load on each engine. So yeah, those are some ideas. That's all I have about the crypto API and the Linux stuff. Do you have any questions, suggestions, or comments? Okay, so the question, it's a comment actually: you want to access the hardware engine directly from user space, so you're not using Linux at all. Yeah, actually it should give good performance compared to the numbers I gave here, because it's completely bypassing the kernel stack and all the overhead you have when switching from user mode and so on. Actually, I'm not the one you should ask; you should talk to Herbert. Yeah, yeah, why not? I mean, anyway, it would be an asynchronous crypto engine, because you would queue some requests and then wait for those requests to complete. So I think it would fit pretty well in the crypto API.
I can tell you that if you are using an FPGA to provide crypto to user space, you would also get poor performance, because you still have the whole stack to go through every time you want to do a crypto request, and that's where the overhead is. No, actually, so the question is what has to be done to enable the crypto manager self-tests. It's just a config option to enable: by default all the tests are disabled, so just enable the option and it's done. Yeah, absolutely not. Okay, so the question is: did I consider power consumption when doing my tests? The answer is no. And the comment is that some engines consume less power than the CPU, so even if the CPU is a bit more loaded, in the end it could be interesting to use the crypto engine. Is that right? Yeah, I don't know; I would have to do more tests to check that. Yeah, okay, I think the whole crypto API is based on scatterlists, so it should already be ready for DMA. I think what you want is to have everything allocated consecutively, and I don't know if there is a solution to allocate the crypto request and the data together. Actually, it is the crypto user which allocates the data to be encrypted or decrypted or whatever. So you want to have an interface to ask the driver to allocate the data? Okay, I don't know if it's planned, but yeah, that would be a solution if you want to use a pre-allocated pool in order to have things in a specific area of memory. Any questions? Any more questions? Yeah, so I didn't mention that it's a Cortex-A9 with four cores, so basically in this test it's only using one core, this one, sorry, it's only using one core, and when you have the 128 threads it's using the four cores to do the operation. But it's still performing better than the hardware engine. There is a file under /proc, something... yeah, I have no answer to that. Sorry, sorry, I can't hear you.
Actually, the OpenSSL tests only go up to a maximum of eight kilobytes, so I didn't test on more than eight kilobytes. But since the engine operates on 2K blocks, I think it shouldn't change that much in our case. Sorry, I can't hear you, can you come closer? Yeah, we could do that, yeah, it should. Okay, yeah, how does it work? Yeah, it's an algorithm which is arch-optimized, so it's using specific assembly instructions; it's basically implemented as an skcipher algorithm which relies on those specific instructions. Is that what you were asking? Yeah, so every implementation which is arch-optimized is also exposed as a crypto algorithm, using the same interface. Yeah, maybe I'll take one more, okay. Yeah, I think there is a way to use zero-copy; I didn't dig into it, okay, but not the other way around. Okay, I don't know, you should ask the crypto maintainer about that, sorry. Thanks.