Hello everyone and welcome to this presentation about writing your own kernel crypto accelerator driver. A little bit about myself first: I worked for nine years for Texas Instruments on the Linux kernel development side of things. I was the lead for the baseport team for about five years. I have something like 600 patches merged upstream that I've personally written myself, and about 60 of these are in the crypto drivers section, which I am going to talk about today. I'm also maintaining a couple of related driver subsystems in upstream Linux. And here's also my LinkedIn profile link if you want to contact me for any reason. This presentation is split in three sections. First there's an introduction talking about basic crypto concepts and what these things mean. Then in the second part I'm talking about the meat of this presentation: basically implementation details for the drivers and what you need to consider when doing this kind of thing. And the third part is some test results, since I've done some testing for the crypto drivers that I've been dealing with. So, starting with the first part, the introduction. What is cryptography? Cryptography is basically some pretty complex mathematical algorithms to convert data into something unintelligible. So you use some mathematical tools to convert data, plaintext data or binaries or emails or whatever, into a format that people who are not authorized cannot read. Cryptography is used for three main purposes. Authentication: you certify that whoever is sending the data, or whoever is the original author of the data, is actually the person they claim to be. Then confidentiality: non-authorized people cannot read the data. And then we have integrity, so nobody can modify the data either. And this is useful in many situations.
For example, if you are serving software binaries, nobody can modify them and insert viruses or whatever into the data. And we use different cryptographic algorithms for different use cases. First, about authentication: typically asymmetric ciphers are used for this, like RSA, DSA and so on. We have two different keys for the encryption/decryption process: a public key, one key that can be shared freely, and then a private key which is only used by the receiver of the data. This can be used for applications like digital signing, secure boot and so on. There's a figure on the right-hand side: we have a plaintext here, the sender uses the public key to encrypt the data, converting it into ciphertext which basically cannot be read by anybody except the one that has the private key. And with the private key you can decrypt the data and get the original plaintext out of that. Then confidentiality: we are using symmetric ciphers here, like AES, DES and so on. The main benefit of these symmetric ciphers compared to asymmetric ones is that they are much faster to execute; the mathematical algorithms used for this are usually much quicker to run. The main drawback is that we have this private key that both sender and receiver must know somehow, and the main issue is how to actually share this key so that it doesn't end up with somebody else, because then that party can also decrypt the data that we try to keep secret. Applications like HDD encryption, secure messaging, IPsec and so on are using this kind of setup. There's the figure again on the right-hand side; it's basically similar to the previous one, but the same key is used by both sender and receiver. Then for integrity purposes we are typically using hash algorithms like MD5, SHA and so on. Applications for this are things like image integrity checking, password storage and so on.
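To make the integrity property concrete (illustrated here in userspace Python with the standard hashlib module, not kernel code): even a tiny change in the input produces a completely different hash result, which is how tampering is detected.

```python
import hashlib

# Digest of the original data, e.g. a software binary being served.
original = hashlib.sha256(b"some software binary").hexdigest()

# A single changed byte yields a completely different digest,
# so any modification of the data is detectable.
tampered = hashlib.sha256(b"some software binarz").hexdigest()

print(original != tampered)  # → True
```

The same idea extends to keyed hashes (HMAC), where a key is mixed into the calculation so that only key holders can produce a valid result.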
For these types of algorithms it's impossible to generate the original data from the ciphertext. Basically we are just taking the plaintext, running the algorithm on it, and we get this sort of ciphertext out. This should probably be named "hash result" here, but I kept the figure similar to the previous two so you can see the difference. And for these integrity algorithms we have this optional key, so you can either use a key or not. For example, if you are using tools like sha256sum in a Linux operating system, you don't provide a key; it's using the base algorithm to just calculate from the plaintext, and then you will always get the same result, for everybody that's using the same tool. So, going to the implementation side of things. Here I brought this kind of simplified system architecture diagram of what people might have available for their use. We have this SoC where we have a bus, a CPU and a couple of crypto accelerator blocks. Basically you want to utilize these accelerator blocks to make your crypto operations faster. So how do you do that? That's basically the main topic of this talk. These crypto accelerators quite often utilize some sort of DMA to transfer the data from the accelerator block to memory and back, and the CPU is just controlling the setup of the system. Then, talking about the crypto drivers themselves, there's basically a couple of main concepts that people need to think about. These are the transform, which is basically a single algorithm implementing some sort of cryptographic operation; this can be encrypting, decrypting or hashing of data. And then there's the request, which is basically a single crypto handling request containing data. A request is always passed to a transform, and the transform does its magic on the request and provides the results back.
And a single transform typically provides a few different operations which are used to process the data for these requests. In most cases, if we are using this kind of hardware crypto acceleration, we need to, or we are interested in, using the asynchronous completion mechanism, because the DMA is doing stuff in the background and we can do some other processing at the same time. One important thing to note is that both of these provide a kind of context saving area, and this is used during the whole lifetime of the transform or the request. It is quite easy when you are implementing things to mix these two up, and you get some really strange problems, like overwriting some other memory buffer or whatever. These are quite nasty to debug, so please don't do that. Here is a high-level crypto sequence diagram. This is basically using a single driver with a single transform, and we are sending data to the driver to do the work. So what happens is that the user first issues modprobe to probe the device driver which provides us the transform. We get to the driver probe function, we allocate and create the transform, and then we register that created transform back to the crypto core. Once the transform has been registered, the user can open, for example, the socket interface using an AF_ALG type of socket, and just send data over the socket to the crypto driver, which does the processing, and then we get the results back. Here you see that I've sent a couple of, actually two, simultaneous or parallel packets to the driver; they are processed and we get the results back here. Once we have done everything that we want with the data, we can close the socket, and then also, if needed, we can remove the whole driver, and it will unregister the transforms and release all the memory that was allocated for this purpose. Here is a list of kernel APIs for creating a new algorithm.
There are a few different register functions for this. The first one here is for registering a symmetric cipher, like AES and Triple DES. Then we have the same for hash algorithms, and the third one is for AEAD, which is typically used for things like IPsec. There are plenty of others available also, but these are basically the ones that I'll be focusing on in this presentation. They are used for the very basic operations: encrypting data, decrypting, creating hash results and then running an IPsec tunnel. Once the proper register function has been selected, you just need to figure out how to fill the alg container which is passed to all of these register functions. For hash operations we need to register these six API calls. The first one is init: we initialize the hash hardware state and any driver internal data for the request. Update is basically the most important one: we are passing the data for the hash to this call. And then in final we close the current hash and return the result to the user. So final can basically be the stage where you actually do all the processing, if you have cached all the data from the update calls somewhere, and then you just close this here. Digest is a combination of init + update + final, and the crypto core basically requires that you implement that one. The last two are export and import: with these you can save the current state of the hash operation and continue it later. These two are typically the most difficult to implement, because you need to fetch the hash state somehow out of your actual hardware block and store it into the provided data buffer, and then you need to be able to import that back to the hardware once you need to continue. Then a couple of notes about hashes. Both export and import must be implemented.
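As an analogy for these six callbacks (sketched in userspace Python with hashlib, purely to illustrate the semantics, not the kernel API): init creates fresh state, update feeds data incrementally, final produces the result, digest does all three in one call, and copying the state plays roughly the role that export/import play in the kernel.

```python
import hashlib

h = hashlib.sha256()          # "init": fresh hash state

h.update(b"hello ")           # "update": data can arrive in chunks
h.update(b"world")

saved = h.copy()              # "export"/"import" analogy: snapshot the
saved.update(b"!")            # state and continue it independently later

result = h.hexdigest()        # "final": close the hash, return the result

# "digest" = init + update + final in a single call
oneshot = hashlib.sha256(b"hello world").hexdigest()

print(result == oneshot)  # → True
```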
That said, they might be tricky on some hardware, and if it is not possible to implement export/import from the hardware, it is likely that you need to resort to a software fallback, at least in cases where export and import are going to be used. Then we need to register the proper state size for the transform. Using too small a size will ensure some really interesting problems, and I spent my own share of time debugging these kinds of things, because you end up with memory allocation problems if you use the wrong state size here. Also, one thing to note is that using too large a size will get the algorithm rejected by the crypto core. And the maximum size is actually quite small, maybe a couple of kilobytes, so basically you cannot save too much data in this state. Then here's a kind of optimization trick: use a software fallback for small payload sizes, because setting up DMAs, IRQs and so on can be pretty expensive per packet. You can actually get a pretty large performance boost in some use cases if you do this. Also, one thing to note is that data will be sent over in multiple chunks, so you basically get repeated update calls to your driver and you need to be aware of that. It might actually require quite complex buffering to do this kind of hash driver. Then for the cipher and AEAD operations, we basically need to register the setkey, encrypt and decrypt calls. Setkey, pretty obviously, sets the key for the cipher, and encrypt and decrypt are used to process a chunk of data. For these ciphers and AEADs the whole chunk of data is actually passed in a single call, so you don't need to care about buffering that much in this case. Additionally, AEAD needs to register setauthsize. This is the authentication data size for the AEAD: with AEAD we are both creating ciphertext and also creating authentication data for the additional data that is passed with the calls.
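To show what the authentication size controls, here is a deliberately toy sketch of the AEAD output layout in Python: a fake, XOR-based "cipher" plus an HMAC tag over the associated data and ciphertext, truncated to the chosen authentication size. The XOR step is not a real cipher and this construction is not a usable AEAD; it only illustrates the ciphertext-plus-tag split.

```python
import hashlib
import hmac

def toy_seal(key, plaintext, assoc_data, authsize=16):
    # Toy "encryption": repeating-key XOR. NOT a real cipher; it just
    # stands in for the ciphertext-producing step of an AEAD.
    ciphertext = bytes(b ^ key[i % len(key)] for i, b in enumerate(plaintext))
    # The tag authenticates both the associated data and the ciphertext,
    # truncated to the registered authentication size (cf. setauthsize).
    tag = hmac.new(key, assoc_data + ciphertext, hashlib.sha256).digest()[:authsize]
    return ciphertext + tag

key = b"0123456789abcdef"
sealed = toy_seal(key, b"hello world", b"packet header", authsize=16)
# Output length = plaintext length + authsize
print(len(sealed))  # → 27
```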
Also, with ciphers and AEADs you need to register the proper state and request sizes; you will face really interesting memory handling problems if you don't do that. A cipher is typically easier to implement than a hash because the data is passed in a single chunk. So if you are starting to write some sort of crypto accelerator driver, you should most likely start with some cipher algorithm like AES, try to get that working first, and then you can improve your driver and get more things working. And also, with small payloads, use a software fallback, similar to what I proposed in the hash part. Then about the testing support. The Linux kernel actually has a pretty nice set of testing functionality available for crypto accelerator drivers. The first is the self-tests that are done by the crypto core. There are a couple of config options in the kernel you can just enable to get these, and they are executed when the driver probes and registers the transforms that it supports. Results for these are seen immediately in the boot log if any failures are noticed, and the status is also visible from the proc filesystem, in the crypto file. There's one thing about these crypto tests if you're writing your own driver: it is quite easy, when you are writing a new hardware driver, to actually hang your device in a way that it is waiting for data from the hardware block, and it's not coming back because you have used wrong sizes for something, and things like that. In that case it might be useful to use this kind of hack in the kernel. I actually tried to upstream this patch at some point, but it was rejected because it's possible that in some cases you actually may want to wait for a very long time for the results from the crypto core. But if you are testing your own driver, you can try this patch out, and it will basically time out if your driver doesn't provide the results in a reasonable amount of time.
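The crypto file in procfs is a plain-text list of key : value stanzas, one per registered transform, and the selftest field carries the pass/fail status. A small parsing sketch (the record below is a hardcoded sample of the typical shape, so the snippet is self-contained; on a real system you would read /proc/crypto instead):

```python
# Sample stanza of the shape /proc/crypto uses; hardcoded here so the
# sketch runs anywhere. Real usage: open("/proc/crypto").read().
SAMPLE = """\
name         : sha256
driver       : sha256-generic
module       : kernel
priority     : 100
selftest     : passed
type         : shash
blocksize    : 64
digestsize   : 32
"""

def parse_records(text):
    """Split blank-line-separated stanzas into dicts of key: value."""
    records = []
    for stanza in text.strip().split("\n\n"):
        entry = {}
        for line in stanza.splitlines():
            key, _, value = line.partition(":")
            entry[key.strip()] = value.strip()
        records.append(entry)
    return records

recs = parse_records(SAMPLE)
print(recs[0]["driver"], recs[0]["selftest"])  # → sha256-generic passed
```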
Then there's also the tcrypt test module, which can be used to measure basically the throughput of your driver; it will test your transform with different data sizes, and you can provide the number of seconds the test is going to run. You basically use this by modprobing the tcrypt module with a couple of parameters: mode, and seconds of execution time for the test. Mode basically selects the transform used for the test. You can check the source code of the tcrypt module to see what the different numbers are, but as a quick reference, 600 is for AES and 423 for SHA. Then you can also do OpenSSL testing, either using the AF_ALG interface or /dev/crypto. This cryptodev is an out-of-tree kernel module which you can install, and then you can use crypto operations similar to the BSD cryptodev implementation. With OpenSSL you can test, for example, the speed of certain operations and select the desired implementation for the algorithms: either this /dev/crypto, or AF_ALG, or just the software implementation in the OpenSSL library itself. Then one more important test is using some sort of IPsec tunnel. I've been using strongSwan myself. Once the tunnel is up, you can use iperf3 or something similar on top to test the throughput of your crypto operations with IPsec. Here are some driver optimization tips; some of these should be pretty standard for anybody who has been dealing with device drivers before. Combine processing if possible: with small data chunks, multiple interrupts or multiple DMA transfers, always try to combine those somehow. That will basically save plenty of time if you can avoid the execution of one interrupt and the setup work for it, and the same goes for the DMA. Because with these crypto operations, for example in IPsec, we are processing something like 1.5 kilobyte chunks of data, basically the networking MTU size.
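To put a rough number on that per-packet overhead (a back-of-the-envelope calculation, assuming a 1 Gbit/s line rate and 1500-byte chunks):

```python
# At 1 Gbit/s with MTU-sized (~1500 byte) chunks, every per-packet cost
# (IRQ handling, DMA setup) is paid tens of thousands of times per second.
line_rate_bps = 1_000_000_000   # 1 Gbit/s
chunk_bytes = 1500              # roughly the networking MTU

chunks_per_second = line_rate_bps / 8 / chunk_bytes
print(round(chunks_per_second))  # → 83333
```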
And if you are transferring data at, for example, 1 gigabit per second, you get quite a large number of blocks that you need to process. Then parallelism is obviously one important thing; that is quite obvious with the IPsec example, where you can queue a number of packets to be processed by the hardware accelerator at the same time. Then also, as I mentioned before, software fallback usage: with the kernel, if you are opening this kind of crypto channel, it will use the same driver for any data that goes through the same channel. And if you usually process these kinds of 1.5 kilobyte chunks but occasionally get like 50 bytes of data to be processed, all of that goes through your driver. So it can be really beneficial to use a software fallback for these small data sizes. Then one thing to try to avoid is scheduling, because switching the context from the kernel to potentially even userspace or something similar is going to be expensive. So if you are finalizing one request, you should check the queue to see whether you have more data to be processed, and just process that data immediately. That can also help improve things. Then the last part of my presentation is about test results. I've used a couple of Texas Instruments platforms for testing things out. I have an AM57xx EVM, which is a Cortex-A15 dual-core SoC running at 1.5 GHz CPU speed, on the ARMv7 architecture. It has NEON acceleration available, so the NEON kernel crypto drivers can be used with this board, and it also has the TI OMAP family crypto IPs in use, so these can be used for crypto acceleration. Then I have a J721E EVM, which is a Cortex-A72 dual core running at 2 GHz, on the ARMv8 architecture. The CPU core has the crypto extensions enabled, so we can use the standard ARM CE drivers for the crypto operations. And then it also has this TI SA2UL crypto accelerator block, which provides a number of algorithms that we can use.
So what I've done in the testing: I've basically tested both hardware accelerated and software mode crypto with the tcrypt module. It's basically just probing the tcrypt module with specific modes; I've used 600 for AES and 423 for SHA. It will provide you plenty of results, but I've just captured the 128-bit key results for this. I also slightly modified the tcrypt module, because normally it only executes data sizes up to 8 kilobytes or so; I modified it to run up to 64 kilobytes, so that it shows the benefits of hardware crypto acceleration better. And CPU load is also measured additionally in all tests. So, looking at the results on the AM57xx. This one is the SHA hashing speed. Looking at the numbers, the yellow and green curves are for the NEON operations, and red and blue are for the hardware accelerator operations. On the right-hand side we have the CPU load, going up to 100% here, and on the left-hand side we have the bandwidth, actually the throughput, for the operation with tcrypt; it goes up to 200 megabits here. And on the bottom we have the block size, from 16 bytes up to 64 kilobytes. What we can see from this figure is that with the NEON acceleration the CPU load is 100% all the time. That's kind of expected, because we are using a pure software implementation to run all the crypto operations. The bandwidth actually starts pretty low, so it is kind of expensive to run these very small payloads, and it goes up quite steadily to about 100 megabits or so, and it saturates there from around 2 kilobytes of block size upwards. The hardware acceleration, however, shows kind of different behavior here. The bandwidth also starts really low and gradually gets higher, going up to almost 200 megabits. But what is more important is the CPU load: it starts basically at 100% here with small block sizes.
Basically, setting up all the DMAs, IRQs and things like that is going to consume all the available processing power we have. Then it starts to go down, and it actually gets pretty low once you get to these decent-size blocks; it's maybe 5% CPU load or something like that in the end. IPsec is an interesting case: for IPsec we get something like 1.5-kilobyte blocks, so you see that the CPU load is already getting lower at that point, and the hardware acceleration line and the software line are getting closer in bandwidth. So you will see some benefit in the CPU loading of the system here. The next slide is about the AES algorithm, so it's a cipher. Here we have a figure pretty similar to the previous one. The bandwidth and the CPU numbers are on the left-hand side; basically it's 100 megabits here, or 100% CPU loading. And again we have the green and yellow lines for the software algorithms and red and blue for the hardware acceleration. The NEON operations basically consume 100% CPU all the time, and the throughput goes quite rapidly to 100 megabits per second as the block size is increased, and saturates there. For the hardware accelerated case we see a similar thing to what happened on the previous slide: the CPU load starts from 100% and starts to go down as the block size increases. Here you also see that the blue line doesn't get that high, which means that the hardware accelerator block itself is saturated on performance, so it cannot provide any more throughput here. Here are the results for the J7. This is again quite similar to what was seen on the AM57, but the numbers are quite a bit different. Again, the CPU load is on the right-hand side; for the CE operation, which is the software one, it goes to 100% all the time, obviously. And the throughput for the software operation gets almost to one gigabit per second here at the high end.
With the hardware accelerated block, similar to before, the CPU load gradually gets lower as the block size increases. But the hardware is designed so that the maximum throughput of the SHA block is around 200 megabits, so it is getting saturated quite a bit lower here than what the software can provide. And here we have similar numbers for the AES algorithm. Again the software is 100% loaded, and the throughput gets to somewhere around 2.4 gigabits per second or so. With the hardware accelerator the CPU load starts getting lower, again around 1.5 kilobyte chunks, and the throughput saturates somewhere around 300 megabits per second again. I talked about the software fallback implementation earlier, and here you can see the reason to do that: basically all the numbers we have up to maybe 1 kilobyte block sizes are very low for the hardware accelerator compared to the software implementation, and still the CPU load for both is going to be about 100%. So basically, if you use a software fallback for this area, you get similar performance to what we get with the CE driver directly. Yeah, that's actually the last slide in my presentation, so thanks everyone, and I believe there's going to be a Q&A session after this.