Hello, I'm Juliana. I've been working at NXP for almost eight years, and for the last two I've been part of the Linux security team working on the CAAM driver, the NXP driver for crypto acceleration. Today I'll talk about the crypto engine framework, which I discovered while working on dm-crypt and backlogging support for CAAM.

During the talk I'll cover this content: what crypto engine is and what it can be used for, how it can be used, and we'll talk about its API. I'll present the improvements I've added to the framework, what performance we gained by using the new features, and how a driver can be updated to include these improvements. Next we'll discuss some ideas for enhancement, and at the end we'll draw some conclusions.

For what crypto engine is, I'll start by telling you one of its use cases. I found out about crypto engine while adding support for dm-crypt in the CAAM driver. In the crypto layer, the usual practice for a hardware engine driver is to put requests from dm-crypt into its own queue, where a thread or workqueue handles them, and to finalize each request when the encryption or decryption finishes. However, the old method needed each hardware engine driver to implement and maintain its own queue and thread for processing requests. In 2016, Linaro introduced the crypto engine framework, which implements the queue and thread that push requests to the hardware as the hardware becomes free, so that drivers could use it. At the same time, it avoids duplicated code in hardware engine drivers.

So crypto engine is a queue manager. It manages asynchronous requests in the form of crypto_async_request, and it supports skcipher, akcipher, AEAD, and hash requests. More importantly, crypto engine manages all crypto requests that ask for backlogging, that is, those that have the CRYPTO_TFM_REQ_MAY_BACKLOG flag set. The purpose of MAY_BACKLOG is to make the crypto request reliable.
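As a reminder of how a caller asks for backlogging, here is a minimal kernel C sketch (the request, context, and `my_done` callback names are placeholders, not code from the talk): the flag is passed when setting the request's completion callback.

```c
/* Sketch: how a caller (e.g. the block layer path) marks a request
 * as backloggable. "req", "ctx" and "my_done" are placeholders. */
skcipher_request_set_callback(req,
		CRYPTO_TFM_REQ_MAY_BACKLOG | CRYPTO_TFM_REQ_MAY_SLEEP,
		my_done, ctx);

err = crypto_skcipher_encrypt(req);
if (err == -EBUSY) {
	/* queued into the backlog; the completion callback will run later */
} else if (err == -EINPROGRESS) {
	/* accepted and in flight */
}
```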
Without the backlog flag, if the hardware queue is full, the request will simply be dropped, which is acceptable in the networking case with IPsec, for example. The block layer, on the other hand, should always use the backlog flag and stop sending more requests to the crypto API until the congestion goes away. So crypto engine queues such a request for later execution; it doesn't drop it. It also keeps the order of requests, which is one important aspect for the dm-crypt case. In the kernel there are already a lot of drivers that use crypto engine, and here is a list of these drivers.

Next we'll talk about how to use crypto engine and discuss its API. As I mentioned, crypto engine is a queue manager and has its own queue, so the first step is to remove any request queue and the thread, workqueue, or tasklet used to queue requests in the driver. Crypto engine only manages crypto_async_request, but we can retrieve the original request by using container_of. Also, crypto engine has a specific function for each of these async request types, as we'll see later on.

There are two structures in crypto engine that each driver must include. We have struct crypto_engine, which must be added to a device descriptor structure. Here there are some callbacks that the driver can provide: prepare_crypt_hardware, to prepare the hardware when it becomes active, and unprepare_crypt_hardware, to disable the hardware when there are no requests to do. As for the do_batch_requests callback, we'll see later what it can be used for, though I think its name says it all, actually. The next structure is crypto_engine_ctx, as we see here, which must be added to the crypto transformation context. This holds the crypto engine operations: prepare_request, if the driver needs to do some preparation before handling the current request, and unprepare_request to undo it when finalizing a request.
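To make the two structures concrete, here is a hedged kernel C sketch; the `my_*` driver types are invented for illustration, while the crypto engine members follow include/crypto/engine.h.

```c
#include <crypto/engine.h>

/* Driver private data: the engine lives next to the device state. */
struct my_drv_private {
	struct crypto_engine *engine;
	/* ... device registers, job ring state, etc. ... */
};

/* Transformation context: crypto_engine_ctx must be the FIRST member,
 * because the engine only sees the transform and casts its context. */
struct my_tfm_ctx {
	struct crypto_engine_ctx enginectx;	/* .op.prepare_request,
						 * .op.unprepare_request,
						 * .op.do_one_request */
	/* ... driver-specific key material, descriptors ... */
};
```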
Drivers must provide a function to execute each request by setting do_one_request, so this last callback, do_one_request, is mandatory.

In the next couple of slides I'll show you how the CAAM driver integrates with the crypto engine framework. From now on, driver code is highlighted in blue, and orange is used for crypto engine code. First we'll talk about how the needed structures are declared and initialized, and then we'll go through the execution flow.

So first, we add the crypto engine to the driver's private data. Next, we call the crypto_engine_alloc_init function to allocate and initialize the engine. After the initialization of crypto engine, we can start it by calling crypto_engine_start. If there are any problems or errors while queuing requests, we can stop the engine by calling crypto_engine_stop, or remove it with crypto_engine_exit, as we have here in the CAAM driver. The crypto_engine_exit function from the crypto engine API first calls the stop function and then frees the engine.

Next, in order to be able to send requests to crypto engine, we need to have a crypto_engine_ctx structure. This must be added as the first member of the transformation context. It must be the first member because crypto engine only manages async requests: it cannot know the underlying request type, and thus only has access to the transform structure, so it is not possible to access the context using container_of. In addition, the engine knows nothing about the transformation context structure from the driver. Therefore the engine assumes, and actually requires, the placement of the known member crypto_engine_ctx at the beginning.

Before transferring any request, we have to fill in the crypto engine operations. We have to fill this crypto_engine_ctx by providing functions for prepare_request, unprepare_request, and do_one_request. As I mentioned, the last one is mandatory. In CAAM, we only register do_one_request; as you can see here, we have skcipher_do_one_req for do_one_request.
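The setup steps above can be sketched roughly as follows (error handling trimmed, `my_*` and `priv`/`ctx` names invented; the crypto engine calls are the real API):

```c
/* Probe time: allocate, initialize, and start the engine.
 * The second argument selects a real-time pump thread. */
priv->engine = crypto_engine_alloc_init(dev, true);
if (!priv->engine)
	return -ENOMEM;

ret = crypto_engine_start(priv->engine);

/* Transform init: register the mandatory do_one_request.
 * prepare/unprepare may stay NULL if, as in CAAM, that work is
 * done in the encrypt and finalize paths instead. */
ctx->enginectx.op.do_one_request = my_do_one_request;
ctx->enginectx.op.prepare_request = NULL;
ctx->enginectx.op.unprepare_request = NULL;

/* Remove time: crypto_engine_exit() stops the engine, then frees it. */
crypto_engine_exit(priv->engine);
```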
The prepare and unprepare steps are already implemented in the encrypt callbacks and the finalize callbacks.

This is the execution flow for a crypto request with crypto engine. When a driver receives a crypto request via its encrypt callback, it transfers the request to crypto engine, which adds it to the engine queue. Then crypto engine starts pumping requests by calling the do_one_request callback from the driver, which submits the request for execution to the hardware accelerator. Now we wait for completion. When a request has finished encrypting or decrypting, the done callback from the driver is called, and from it we call the crypto_finalize_*_request function from the crypto engine API.

In the next couple of slides we'll go through this execution flow with CAAM driver samples showing how to call the crypto engine API. We use as example an skcipher request and the encrypt callback. This is the skcipher encrypt in the CAAM case. We transfer the request to crypto engine by calling crypto_transfer_skcipher_request_to_engine. This goes into crypto engine and calls the generic crypto_transfer_request, which enqueues the request in the engine queue; as we can see here, we have crypto_enqueue_request. Then the main execution loop starts in crypto engine, which is crypto_pump_requests here. A request is removed from the crypto engine queue, with crypto_dequeue_request here, and this invokes the do_one_request callback from the driver. This goes into the driver callback; in CAAM we have skcipher_do_one_req, which sends the request to hardware for execution by calling caam_jr_enqueue. When the request has finished its execution on hardware, the done callback, skcipher_encrypt_done, is called in the driver. This triggers the completion of the request in crypto engine: it calls crypto_finalize_skcipher_request and goes into the generic crypto_finalize_request, which marks the request as complete, as we can see here.
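The round trip just described can be sketched in kernel C like this; the `my_*` helper names are placeholders for driver internals, and only the crypto engine calls are the real API (in CAAM the hardware submission would be caam_jr_enqueue):

```c
/* 1. The algorithm's encrypt callback hands the request to the engine. */
static int my_skcipher_encrypt(struct skcipher_request *req)
{
	struct my_drv_private *priv = my_get_priv(req);	/* placeholder */

	return crypto_transfer_skcipher_request_to_engine(priv->engine, req);
}

/* 2. The engine's pump loop calls this; we push the job to hardware. */
static int my_do_one_request(struct crypto_engine *engine, void *areq)
{
	struct skcipher_request *req =
		container_of(areq, struct skcipher_request, base);

	return my_hw_submit(req);	/* placeholder for hw submission */
}

/* 3. The hardware-done (interrupt) path completes the request. */
static void my_encrypt_done(struct my_drv_private *priv,
			    struct skcipher_request *req, int err)
{
	crypto_finalize_skcipher_request(priv->engine, req, err);
}
```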
Now, before talking about and explaining the improvements I added to crypto engine, let me tell you about some problems I found while exploring it. When trying to add backlogging support to CAAM, I made performance measurements on crypto engine and realized that the framework was sending requests to hardware one by one. It uses a cur_req variable, which keeps the request currently in execution, and it does not send a new one until this is set to NULL on finalize. So crypto engine was serializing execution, even if the hardware has support for multiple in-flight requests.

Also, as I mentioned, crypto engine accepts only async requests. So any non-crypto-API request that the hardware is capable of doing, like split key generation or RNG (these are some examples from the CAAM driver case), cannot be sent via crypto engine. These are still sent, but directly to hardware. So there were cases when crypto engine sent a request for execution while, from the driver, a non-crypto-API request was sent to hardware, and the crypto engine request would return an error, even if that request had the MAY_BACKLOG flag set.

Also, the crypto engine queue size was hard-coded to 10 entries for regular requests. If the queue reached 10 entries, non-backlog requests would be dropped, even if the hardware could execute them, in case it has more entries than the crypto engine queue. For example, CAAM can execute up to 512 requests.

For all these problems I found solutions. I added a retry mechanism to be able to enqueue a request back if the hardware cannot execute it at that specific moment. Using this retry mechanism, we can send multiple requests for execution; actually, we can send requests until the hardware says it's full. When the hardware returns full, we stop, and crypto engine will restart pumping requests when hardware becomes available, meaning when a finalized request has executed. Some hardware accelerators have support for executing multiple linked requests at once.
I added support for this use case, but it's experimental; I couldn't test it, since I don't have any hardware with batch support. Also, the crypto engine queue size is not hard-coded anymore. It can be set when the crypto engine structure is initialized, with a value based, maybe, on hardware capabilities.

Now, this is the old execution flow of crypto engine, which was sequential. When a second request was transferred to crypto engine, it was enqueued. In the main loop, before sending it to hardware for execution, the previous request was checked; recall the cur_req variable I mentioned before. If that request was still in progress, the second request was not sent to hardware. Crypto engine exited and waited for the first request to finish.

Now, this is the new execution flow of crypto engine. It sends requests to hardware for execution until the hardware queue is full. If the driver has support for the retry mechanism, the request is put back in front of the crypto engine queue, to keep the order of requests. If the hardware queue is full, we requeue the request regardless of the MAY_BACKLOG flag. For backwards compatibility, if retry support is not available, crypto engine will work as before.

At the beginning I showed you how to use crypto engine and how to integrate its API into a driver. Let's see what changes are necessary to use the new features. First, we need to use the new crypto_engine_alloc_init_and_set function. This initializes crypto engine and also sets the crypto engine queue size, as we can see here; this is not hard-coded in crypto engine anymore, it can be set in the driver. As in this example, we set the crypto engine maximum queue length to a number based on the hardware capabilities, here the number of job ring entries. The new crypto_engine_alloc_init_and_set function also sets the retry_support variable, which was added for backwards compatibility.
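A hedged sketch of the new initializer (the `MY_HW_DEPTH` constant is invented; in CAAM the queue length follows the job ring size, and the argument order follows include/crypto/engine.h):

```c
/* Allocate, initialize, and configure the engine in one call. */
priv->engine = crypto_engine_alloc_init_and_set(dev,
			true,		/* retry_support: requeue on -ENOSPC */
			NULL,		/* cbk_do_batch: no batch support here */
			false,		/* rt: pump thread not real-time */
			MY_HW_DEPTH);	/* qlen: engine queue size, e.g.
					 * based on job ring depth */
if (!priv->engine)
	return -ENOMEM;
```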
By default, retry_support is false and crypto engine works as before: we send requests to hardware one by one from crypto_pump_requests, complete the request on crypto_finalize_request, and so on. To support multiple in-flight requests, retry_support must be set to true in each driver, as we have here in CAAM. This will send requests to hardware until it is full. If do_one_request, as we can see here, returns an error, the driver must not free the request, since it will be enqueued back into the crypto engine queue: we call crypto_enqueue_request_head here, which puts it at the front of the queue to keep the order of requests. When all drivers support the new retry mechanism, I think we can remove this retry_support variable.

Before a commit in 2017, the crypto API was using the -EBUSY return value to indicate both a hard failure to submit a crypto operation to a hardware accelerator, when the latter was busy and the backlog mechanism was not enabled, as well as a notification that the operation was queued into the backlog, when the backlog mechanism was enabled. Having the same return code indicate two very different conditions depending on a flag was both error-prone and required extra runtime checks. Therefore, the return code used to indicate that a crypto request failed due to the hardware being busy was changed to -ENOSPC in the crypto API. The same should be done in each driver. In CAAM, here, we return -ENOSPC in case the hardware queue, that is, the job ring, is full. Now in crypto engine, if the hardware queue is full, we return -ENOSPC, and the request is enqueued back regardless of the MAY_BACKLOG flag. If the hardware returns any other error, like -EIO or -EINVAL, these are fatal errors that cannot be recovered from, and in this case crypto engine gives up on the request. For example, in the CAAM driver we use -EIO in case the job descriptor is broken, so there is no possibility to fix the job descriptor. This is a fatal error, so we shouldn't requeue the request.
If we did, the request would just be passed back and forth between crypto engine and hardware.

The new do_batch_requests callback executes a batch of requests, as its name says. It has the crypto engine as an argument, as we can see here, for cases when more than one crypto engine is used. The crypto_engine_alloc_init_and_set function initializes crypto engine but also sets the do_batch_requests callback. In crypto_pump_requests, if the do_batch_requests callback is implemented in a driver, it will be executed. The linking between the requests is done in the driver, if possible. do_batch_requests is available only if the hardware has support for multiple requests; we can see this here. As I previously mentioned, this is experimental and I haven't tested it so far.

Now, let's look at the performance results of crypto engine usage before and after adding the new feature, the retry mechanism. Since my work with crypto engine started while adding support for backlogging in CAAM, the first results are for dm-crypt. I set up an encrypted partition with cryptsetup for dm-crypt and created files with different sizes on an i.MX 6Q SABRE board. I used, as you can see here, dd to create the files. Maybe it's not the most precise tool for performance testing, although I used the fdatasync flag to be sure that the data is physically written to the file before finishing. We have here the speed in megabytes per second, and on the X axis we have the file sizes, also in megabytes. Higher is better, so it's clear the new crypto engine flow is better than the old one. We have here almost double the speed for small file sizes, and here we have a 60-70% increase for 2 GB or 4 GB of data.

Next, I used the tcrypt speed test. I wanted to have an in-kernel test, with no user space to kernel space transition overhead. tcrypt is the crypto testing module intended for self-testing algorithm implementations.
Here we have the results for AES-CBC with a 128-bit key size, run on the same i.MX 6Q SABRE board. The speed is in megabytes per second, and here we have the chunks of data sent for execution. As can be seen, we have very big differences for small chunks of data; here it's like a 600-700% increase, and for very big chunks of data we have a 50% increase here, or a 100% increase here for 4 KB. The same tcrypt speed test I ran on i.MX 8, which has a newer version of the CAAM module than i.MX 6. Here we also have better results with the new version of crypto engine: around 200% higher speed for small chunks of data, and a 10-20% increase for big chunks of data.

Even though we clearly have better performance with the new crypto engine, we still need to investigate further why some performance results are not as expected. Why do we have such a big difference between targets, a 50% gain on i.MX 6 and a 7-10% gain on i.MX 8? Also, a big difference between chunks of data: we have a 200% increase for 256 bytes of data and a 10% increase for 16 kilobytes of data.

Also, I believe we can still extend and enhance crypto engine. We could add support for requests other than async ones. As I've mentioned during this talk, there are drivers which have non-crypto-API requests, like RNG in CAAM, or split key generation, which are not accepted by crypto engine right now. So those drivers need to have two paths: one to transfer the request to crypto engine, and another one to submit the request directly to hardware. For a performance increase, I believe we can try to remove some of the locks used now for all three major operations of crypto engine: transfer, submit request, and finalize. But this, I think, will need more time to investigate and thorough testing.

To conclude this talk, I believe it's fair to say that crypto engine is easy to use, and the new features add performance, up to 500% higher speed for some use cases.
Crypto engine is definitely the de facto queuing mechanism for backlogged requests, and not only those. Crypto engine can be used in multiple applications, if a driver wants to support disk encryption or file system encryption, or dm-integrity checking of block devices, and the list can go on. Also, Herbert Xu, the crypto subsystem maintainer, said in one of his reviews that crypto engine is the queuing mechanism that any crypto driver can use to implement queuing. That's all for me. Thank you. And if you have further questions, let me know.