Hello everyone, thank you for coming to my talk. My name is Ignat, and today we're going to talk about Linux disk encryption and how to speed it up.

First, a little bit about myself. I work at Cloudflare, where I do performance and security. I'm passionate about security and cryptography, and I also enjoy low-level programming — the Linux kernel is one of my favorites. I also work on bootloaders and other scary low-level C stuff.

Okay, let's go. Before we start talking about dm-crypt, and before we even start talking about encrypting data at rest in general, let's review the modern operating system storage stack from a high-level perspective, to understand where you can apply encryption these days.

This is a simplified version of the storage stack. At the top you have your applications, which implement your business logic. The applications read and write data in files and send them to the file system. The file system translates these files into blocks and sends them to the operating system's block subsystem, which routes them to the appropriate storage device drivers, which in turn talk to the actual storage hardware and store the data. So when we start thinking about where we can apply encryption —
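The stack I just described, sketched top to bottom:

```
applications             (business logic; read/write files)
        |
file system              (translates files into blocks)
        |
block subsystem          (routes block I/O)
        |
storage device drivers   (talk to the hardware)
        |
storage hardware
```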
— we can apply encryption at each of these layers. First, you can simply buy self-encrypting disks — there is a standard for this called Opal — and these will transparently encrypt the data for you. Second, you can implement encryption in the block subsystem of the operating system; the well-known examples here are LUKS and dm-crypt, which will be the main topic of today's presentation, and there are also BitLocker on Windows and FileVault on macOS. Then you can encrypt data in the operating system's file system layer; known examples here include eCryptfs, which is an older system, and ext4 encryption, which has since developed into a module called fscrypt and is now supported on ext4. And finally, you can do encryption in the application layer: just add code to your application which encrypts data before sending it to the file system, and you're done.

Each of these approaches has its pros and cons. The pros of storage hardware encryption: it's simple — everything is handled by the hardware, so it requires very little configuration. It's fully transparent to applications: they do not know or care whether your disk encrypts the data. And it's usually faster than the other layers, because you don't waste your main host CPU cycles doing the encryption. There are downsides, of course. Most of these implementations are proprietary, so you have no visibility into the implementation and thus no auditability, which sometimes leads to poor security. Recent findings showed that some self-encrypting drive implementations were so bad that Microsoft, for example, quite recently decided to switch to software-based disk encryption by default in Windows.

Then we have block layer encryption — encryption in the operating system's block subsystem.
It's similar to hardware disk encryption in that it requires little configuration and is fully transparent to your applications. The advantage is that if you run an open-source operating system like Linux, it's open and auditable. The downsides: like any other block-storage encryption, it requires somewhat specialized crypto, because conventional stream ciphers are not well suited to encrypting random-access block storage. It may have a performance impact, because now you're doing the actual encryption on your host CPU, spending additional CPU cycles. And unlike with hardware encryption, your encryption keys are stored in the main RAM of the operating system, which makes them more vulnerable to various RAM-based attacks.

Then we have file system layer encryption. Its advantages: it's also mostly transparent to applications, and again, if you're using an open-source operating system, it's open and auditable. Another advantage is that it's more fine-grained: for example, you can have different directories in your file system which are encrypted or not encrypted, or directories with different encryption keys, so different users of the operating system can even have their own encryption keys for their data. You also get more choice of crypto and potential integrity support, because at the file system layer the operating system has more context — it sees whole files rather than independent random blocks of data — and can do more things. The downsides: as with any other software encryption, you may have a performance impact, and your encryption keys are in RAM. It's more complex to configure, because all that fine-grained flexibility requires more configuration. And another downside is that you may have unencrypted metadata.
For example, free space and file sizes are not encrypted when you use file system layer encryption.

And then, again, you can have application layer encryption. It's relatively open and auditable — if you own the source code of your application, you can just read it. It's fine-grained: you can implement exactly what you want, and you have full crypto flexibility — you can implement any crypto you like. The downsides are the same as for the other software-based encryption solutions: your encryption keys are in RAM, and you actually have to write the code — your application has to implement the encryption itself, so you have to develop and maintain it. You may also have unencrypted metadata: your file sizes and free space will not be encrypted. And, ironically, one of the downsides is that same full crypto flexibility: not every application developer knows how to properly use low-level cryptographic primitives, so your implementation of encryption may actually end up insecure, even if you use a modern library like OpenSSL.

Okay, so let's switch back to operating system block layer encryption, and on Linux that means LUKS and dm-crypt. Cloudflare is a SaaS company, and like many other such companies we prefer operating system block layer encryption. Basically, we want encryption as a feature of the platform: it should be transparent to the applications, regardless of whether an application supports encryption or not. On the other hand, we don't want to depend on potentially vulnerable implementations of hardware disk encryption — but we also don't need the flexibility of file system layer encryption.
So block layer disk encryption is the sweet spot that we — and many other companies — use, and because we run Linux, we use LUKS and dm-crypt.

So what are LUKS and dm-crypt? To talk about dm-crypt, we first need to review the device mapper in Linux. Device mapper is an interesting framework, and what it does is essentially the following. Again, you have your applications, which read and write files on the file system. The file system normally translates these files into blocks and sends them down the stack to the block device drivers to be stored. The device mapper framework can insert itself in between, intercept these blocks as they travel between the file system and the block device drivers, and provide some additional functionality. For example, we have dm-raid, which can create software RAID arrays; we have dm-mirror, which can back up data; and the topic of this presentation is dm-crypt, which transparently encrypts the data.

If we zoom into dm-crypt: again, we have the file system and the block device drivers, and dm-crypt inserts itself in between. When the file system wants to write a block, dm-crypt intercepts the request, encrypts it, and sends it down the stack. When the file system wants to read some data, dm-crypt intercepts the ciphertext read from the block device driver, decrypts it, and sends the plaintext to the file system. One thing it does well is that it doesn't implement its own cryptography: it uses the well-known, standardized Linux kernel Crypto API, which has been around for some time, is open source, and has been sufficiently reviewed to be considered more or less secure.

So this is all well and good, and we enabled disk encryption everywhere at Cloudflare. But it wasn't without problems.
We started to see some performance degradation and started to investigate. For the purposes of this talk, we won't present the real data we got from production. Instead, to surface the problems in dm-crypt itself, we'll present a benchmark which tries to avoid any bias around particular hardware, is easy to reproduce on a laptop, and still shows the problems in dm-crypt.

So, to avoid the bias of a specific disk or hardware, what can we use? We can use the fastest disk out there, which is no disk at all. On Linux it's very easy to create a RAM-backed disk with the brd module, and this is what we do here: we create a 4-gigabyte RAM disk. Next, we allocate a 2-megabyte file for the LUKS detached header — I'll come back to why in a second. Then we format our newly created RAM disk as LUKS using our detached header, and finally we create a dm-crypt instance on top of the RAM disk. So this is our test storage stack.
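The setup just described can be sketched in a few commands. This is a hedged sketch, not a verbatim reproduction of the talk's slides: device names, module parameters, and header size may need adjusting on your system (LUKS2, for example, defaults to a larger header than 2 MB), and everything here needs root.

```shell
# Create a 4 GB RAM-backed disk (/dev/ram0) via the brd module
# (rd_size is in KB: 4 GB = 4194304 KB)
sudo modprobe brd rd_nr=1 rd_size=4194304

# Allocate a small file to hold the LUKS metadata (detached header)
fallocate -l 2M crypthdr.img

# Format the RAM disk as LUKS, keeping the header in the detached file
sudo cryptsetup luksFormat /dev/ram0 --header crypthdr.img

# Create the dm-crypt instance (/dev/mapper/encrypted-ram0) on top
sudo cryptsetup open --header crypthdr.img /dev/ram0 encrypted-ram0
```
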
At the bottom we have the RAM disk, with a dm-crypt instance on top. For the purposes of this benchmark we use no file system, again to avoid biasing our results with a particular file system implementation. In this setup we can either write directly to the dm-crypt instance, so our data is transparently encrypted and decrypted, or we can read and write directly to the underlying raw device, to get the raw I/O performance and compare it with the encrypted performance.

This is why we use the detached LUKS header: if we didn't, by default LUKS would use the first couple of megabytes of our underlying RAM disk to store metadata for the dm-crypt instance, and when reading or writing directly to the RAM disk, bypassing the dm-crypt instance, we could accidentally erase that metadata and spoil our experiment.

Okay, so let's add some workloads. First we'll measure the sequential read/write throughput of our underlying RAM disk. We use this fio command, and the result is somewhere around 1 gigabyte per second of read throughput and 1 gigabyte per second of write throughput. Because we use a RAM disk, reads and writes perform almost the same, so this checks out.

Now let's see how the same workload performs on the dm-crypt instance, with transparent Linux disk encryption involved. When we do that, we see the throughput drop to about 150 megabytes per second in both directions, which is roughly seven times slower. Yes, we use software disk encryption, so we may expect some performance degradation — but probably not seven times.

But what should we expect? The cryptsetup utility has a handy benchmark command, which benchmarks the specific crypto algorithms you can use for disk encryption on Linux, and we ran it on the default algorithm, aes-xts.
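The measurements above can be reproduced roughly like this. The fio parameters are an assumption — the exact job from the talk isn't shown — but these flags describe a direct, mixed sequential read/write job, and both commands need root:

```shell
# Raw throughput of the RAM disk (no encryption involved)
sudo fio --filename=/dev/ram0 --readwrite=readwrite --bs=4k \
         --direct=1 --runtime=30 --time_based --name=plain

# Same workload, but through the dm-crypt instance
sudo fio --filename=/dev/mapper/encrypted-ram0 --readwrite=readwrite --bs=4k \
         --direct=1 --runtime=30 --time_based --name=crypt

# Raw speed of the cipher implementations available for disk encryption
sudo cryptsetup benchmark -c aes-xts
```
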
It showed that our test system can perform about 1.8 gigabytes per second of encryption and 1.8 gigabytes per second of decryption of pure crypto. So let's take the worst-case scenario: assume we read the whole disk and then decrypt it — read, then decrypt, as two sequential steps. With 1 gigabyte per second of reads and 1.8 gigabytes per second of decryption, we could reasonably expect that system to give us around 700 megabytes per second of throughput — and basically the same for writes, because the cipher is symmetric. But what we actually see is around 150 megabytes per second in each direction, way below even this pessimistic expectation.

So we tried to improve it. We tried different things. We tried different cryptographic algorithms, but aes-xts seems to be the fastest, at least on x86. dm-crypt actually has some additional options — performance flags, they call them — `same_cpu_crypt` and `submit_from_crypt_cpus`. We tried playing with these, but we didn't get any reasonable performance boost. We also tried file system level encryption, and it ended up being even slower — and again, potentially less secure, because we would end up with unencrypted metadata. So yeah, we were desperate: we tried everything but could not squeeze a single extra bit out of our disk encryption. So we decided to ask for help. We asked the community.
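As a quick aside: the back-of-the-envelope estimate above — read at 1.0 GB/s, then decrypt at 1.8 GB/s as two serial steps — works out as the harmonic combination of the two rates. A quick check, assuming the two figures measured above:

```shell
# Serial two-step model: read the data at R GB/s, then decrypt at D GB/s.
# Effective throughput is R*D/(R+D), printed here in MB/s.
R=1.0   # raw RAM-disk read throughput, GB/s (from the fio run)
D=1.8   # aes-xts decryption throughput, GB/s (from cryptsetup benchmark)
awk -v r="$R" -v d="$D" 'BEGIN { printf "%.0f MB/s\n", r*d/(r+d)*1000 }'
# prints: 643 MB/s
```

That's roughly 640 MB/s — the same ballpark as the ~700 MB/s expectation quoted above, and still more than four times what dm-crypt actually delivered.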
We wrote up our findings and sent them to the dm-crypt mailing list, but the only thing we got back was this reply: "If the numbers disturb you, then this is from lack of understanding on your side. You are probably unaware that encryption is a heavyweight operation."

At this point I wondered: is encryption really that heavyweight? So I decided to do some scientific research — and by scientific research I mean I typed "is encryption expensive" into Google. Surprisingly, one of the first meaningful results I got was a blog post from my own company, where a fellow engineer had studied how costly encryption is for Cloudflare, but in the context of TLS. Cloudflare terminates and processes a lot of TLS connections, so encryption in TLS is very important to us as well. One of the conclusions my colleague reached was that TLS is very cheap, even at Cloudflare's scale: in that study we used less than 30 percent of our CPU to do the encryption, which is not that bad. And disk encryption uses similar algorithms to TLS — so why should it be more expensive, if the cost is supposedly the crypto?

Based on this, I decided to look into the dm-crypt implementation in more detail. Again: we have the file system, we have the block device drivers, we have our dm-crypt instance in between, and it uses the Crypto API. It turns out that when the file system wants to write some data, it sends a write request, which is intercepted by dm-crypt. But dm-crypt does not process it immediately: instead, it queues it onto a kernel workqueue called kcryptd, for processing at some later time. When that time comes, dm-crypt sends the request over to the Crypto API — but the modern Crypto API is asynchronous as well, and it has its own workqueues.
It usually has one workqueue per CPU, so our request may end up on one of those queues, get queued, then actually get encrypted and returned to dm-crypt. But dm-crypt again does not dispatch it immediately: it queues it once more, this time into a different structure — a red-black tree — where dm-crypt sorts the requests so that they arrive at the device in roughly sequential order. Then a dedicated thread called dmcrypt_write takes the requests from that red-black tree and dispatches them to the block device driver.

Something similar happens for reads. When the file system issues a read request, it gets intercepted by dm-crypt, but not processed immediately: it gets queued on yet another kernel workqueue, called kcryptd_io. At some more convenient time it gets dispatched to the block device drivers, where the ciphertext is actually read. The ciphertext is returned to dm-crypt, but dm-crypt again does not process it immediately: it queues it on our already familiar kcryptd workqueue, and at some later point dispatches it to the Crypto API, where it can get queued yet again. Only when the Crypto API actually decrypts the request does it get returned to the file system. That's a lot of queuing to handle one read or write request.

Last year I was at SREcon in Singapore, and engineers from Google gave a very interesting presentation on the relationship between queuing effects and tail latency in general software systems. It resonated with me, and one of my takeaways was that a significant amount of tail latency is due to queuing effects. I encourage you to watch that presentation in its own right, but in a nutshell: in this system, we may queue a read or write request up to four times before it actually gets processed.

Now, I assume no malicious intent, and I assume that if these queues exist, there is probably a reason. So I decided to do some git archaeology. Luckily, the whole history of the Linux kernel source is in the
VCS, so you can try to reconstruct the reasons why these queues exist.

The kcryptd kernel workqueue has been there from the very beginning, since 2005, when the dm-crypt code was first merged into the mainline kernel. It was initially only involved for read requests, with a comment in the code saying it "would be very unwise to do decryption in an interrupt context". That made sense in 2005: back then the Linux Crypto API was not asynchronous, so you could actually end up doing decryption in an interrupt context when reading data — and you shouldn't be doing a CPU-intensive operation in an interrupt context. So that queue did make sense.

Then some more queuing was added in 2006 to reduce kernel stack usage. The kernel stack was quite limited back then, and to avoid overflowing it, this asynchronous behavior — offloading writes to separate threads — was introduced, according to the commit comments. The I/O sorting in the red-black tree was added around 2015, with a comment saying it's better for spinning disks, because spinning disks prefer getting sequential I/Os; it also mentions that it's better for the CFQ scheduler, which by now has actually been removed from the Linux kernel.

And it turns out we were not the only ones experiencing performance degradation from this extensive queuing: also around 2015, there were commits to optionally revert some of the queuing, by adding those runtime flags we tried earlier — `same_cpu_crypt` and `submit_from_crypt_cpus`. So someone had already seen degradation caused by all this queuing. There are things to reconsider here, right?
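To recap the request flow reconstructed above (pseudocode, not actual kernel code):

```
write(block):
    queue on kcryptd workqueue             # dm-crypt defers the request
    -> Crypto API (may queue again on its per-CPU workqueues); encrypt
    -> insert into red-black tree          # sort into sequential order
    -> dmcrypt_write thread dequeues       # dispatch to block device driver

read(block):
    queue on kcryptd_io workqueue          # defer the I/O submission
    -> block device driver reads the ciphertext
    -> queue on kcryptd workqueue          # defer the decryption
    -> Crypto API (may queue again); decrypt
    -> return plaintext to the file system
```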
Most of this code in dm-crypt was added with spinning disks in mind. Back in the day, spinning disks had disk I/O latency much higher than the scheduling latency, so you could solve problems by adding extra queues or threads, because the context switches they introduce cost a negligible amount of latency compared to disk I/O. That is no longer true for modern fast storage. Sorting I/O requests in dm-crypt probably also violates the "do one thing and do it well" Unix principle: sorting I/O requests is a task for the I/O scheduler, not for a module doing transparent data encryption. And these days the kcryptd workqueue in dm-crypt may be redundant, because the modern Linux Crypto API is asynchronous by itself and can handle the case where you send it data from an interrupt context.

So we decided to do a cleanup: throw away all this extra complexity, which seems outdated, and turn dm-crypt back into a simple module which does one thing — encrypts writes and decrypts reads. We also wanted to take it to the extreme and make sure we would not be queuing in the Linux Crypto API either, so we wanted a synchronous Crypto API module to process the requests.

As a result, we came up with a simple patch to dm-crypt which bypasses all the queues and async threads, gated behind a new runtime flag. With the Crypto API it's a bit more complicated. By default, the Linux kernel may have several implementations of the same crypto algorithm, each with a configured priority, and the kernel selects a particular implementation based on that priority — whichever it considers the best or most performant for the case at hand. But we didn't want to take chances: we wanted it to use AES-NI, synchronously. AES-NI is the hardware AES implementation in x86 CPUs.
So it should be the fastest option on an x86 platform. But the synchronous AES-NI implementation is marked as "internal" with respect to the Crypto API: it is only allowed to be called by other Crypto API modules, not by external code like dm-crypt. There's also another problem with AES-NI: it requires the FPU on x86, and the FPU may not be available in some interrupt contexts, because the kernel does not preserve FPU state in all cases, especially when switching between interrupt contexts and kernel code.

So what we came up with is just another Crypto API module, called xts-proxy. It's a dedicated synchronous aes-xts module, but it doesn't implement any crypto of its own. What xts-proxy does is act as a switch: when it receives an encryption request, it checks whether the FPU is available in the current context. In most cases — 99 percent of them — the answer is yes, and it simply forwards the encryption request to the internal synchronous AES-NI xts implementation in the kernel. It can do that because it is itself a Crypto API module now, so it's allowed to use other internal Crypto API modules. In the very rare cases where the FPU is not available, it forwards the encryption or decryption request to the generic software AES implementation in the kernel, which does not require the FPU, although it is slower — it can be much slower. But the advantage is that this module is fully synchronous: there is no queuing involved when processing data through it.

Okay, so let's redo our experiment. The nice thing about our patch is that it's enabled or disabled using a runtime flag, so you don't have to reboot or stop the world to reconfigure: you can do it live, on a production or test system, under a live workload. So here we just relaunch our benchmarking workload. The next thing we need to do is ensure that the kernel has our new xts-proxy module available.
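The dispatch logic of that xts-proxy module, as described, boils down to this sketch (pseudocode; the function names are illustrative, not the actual kernel symbols):

```
xts_proxy_encrypt(request):
    if fpu_usable():                       # true in ~99% of calls
        return aesni_xts_encrypt(request)  # internal synchronous AES-NI xts
    else:
        return generic_aes_xts_encrypt(request)  # FPU-free software fallback
                                                 # (much slower, but always safe)
```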
We load it, and then we actually enable the functionality with a scary long command. In a nutshell, it reconfigures our dm-crypt instance and does two things. First, it tells dm-crypt not to rely on the Crypto API's priority configuration — don't let the kernel choose an implementation for you — and to explicitly use the xts-proxy module. Second, it enables the runtime flag introduced by our patch, which tells dm-crypt to bypass all the queues it has and process all requests synchronously. And finally, after reconfiguring, you need to do a suspend/resume cycle on the device for the new configuration to take effect.

And here is the result. At the point where we made the configuration active — where we reconfigured dm-crypt to use our new setup — our read throughput immediately more than doubled. And because we're using a RAM disk, where reads and writes have similar performance, and because AES is a symmetric cryptographic algorithm, where encryption and decryption perform the same, we see a similar picture for write throughput: as soon as we enable our new configuration and patch, the write throughput doubles as well.

Just to make sure we're not imagining things, here is a snapshot from our real production systems. Before, we ran the test on a RAM-backed disk for benchmarking purposes, but this is an actual result from a production server, and here we see our monitoring system.
It measures the perceived SSD I/O latency from the application's perspective — basically, the I/O wait time. This is one SSD in our production fleet: the yellow line shows the I/O wait time for the raw SSD, and the green line shows the same I/O wait latency for the dm-crypt instance on top of that SSD. When we reconfigure dm-crypt to use our patched runtime flag and the new xts-proxy crypto module, these latencies converge, and we see almost no difference between the I/O wait time of the raw disk and that of the dm-crypt instance on top.

And to make sure we're not chasing ghosts, and that there is a real production service impact, here is another comparison from production. This is a three-way comparison of servers performing our CDN workload. On these graphs you see the p99 response latency of the service which fetches customer data from our cache, if we have it, and delivers it to our customers. We compare three distinct servers in the same data center. The green line is a server with an unencrypted disk — no dm-crypt at all. The red line is another server, same hardware, same model, but with dm-crypt added: you can see the spikes in p99 cache response time. And the blue line is yet another server with dm-crypt, but with our patches enabled: it is almost indistinguishable from the unencrypted server in terms of Cloudflare cache p99 response times. So with respect to this service, we get disk encryption essentially for free.

Yeah, I think that's all I had for today. In this presentation we introduced a simple patch for dm-crypt which may improve dm-crypt performance — and thus transparent disk encryption on Linux — by 200 percent, or in some cases even 300 percent. The nice thing about it:
It's fully compatible with stock dm-crypt. It doesn't add any fancy crypto, so you can enable it on a disk which was already encrypted — the crypto stays compatible. And it can be enabled and disabled at runtime, without any service disruption, so it's easy to test, to do A/B testing or other kinds of comparison, or to disable it again if you see performance degradation.

We also reassured ourselves that modern crypto is fast and cheap. If you see a performance issue with an encryption system, don't immediately blame the encryption itself: look around to check whether your architecture is suboptimal and whether some of its assumptions need to be revisited — the performance degradation is likely elsewhere. In this specific case, extra queuing can be harmful on modern low-latency storage. dm-crypt was probably designed with spinning disks in mind, but now that we've moved to SSDs and NVMe disks, we have to reconsider the assumptions made for dm-crypt in particular and the storage stack in general.

There are some caveats with the current patch as well. From our testing, the patch improves performance on small-block-size, high-IOPS workloads; if you have a workload with a larger block size — more than two megabytes — our benchmarks actually showed worse performance. The whole setup presented here assumes hardware-accelerated crypto, because the presented xts-proxy module supports only x86 platforms. And finally, your mileage may vary: don't jump in and immediately enable this flag on your systems. Measure first and compare results before any widespread deployment. And do let us know the results — it would be interesting to hear how it improves, or worsens, performance on your workloads.

And finally, here are some useful links.
The first link is to the cryptsetup project — the userspace portion of LUKS and dm-crypt, and of Linux disk encryption in general; it also has a wiki with more information. The second link is the man page for the low-level dmsetup utility; you may need it to enable our custom flags if you use our custom patches. The third link is my blog post describing all the information I presented here, and more, with links and in a more digestible, readable format; you can also copy and paste the commands from there to reproduce the whole setup I presented and walk through it yourself. The fourth link is to our company repository on GitHub, where we published these patches, so you can grab them and try them out yourself. And the fifth link: recently, a reworked and reviewed version of the patch to dm-crypt was accepted into the mainline Linux kernel. It was modified a little bit: instead of having one flag, as in this presentation, it now has two flags — `no_read_workqueue` and `no_write_workqueue` — so you can independently control whether to bypass the dm-crypt workqueues for reads and for writes. So if you have Linux kernel 5.9 or above, you don't need the patches anymore; you can do this with a mainline kernel. And the latest release of the cryptsetup userspace utility includes support for these new flags, so you don't even have to mess with the dmsetup utility to configure it.

Well, that's basically it. Thank you for your attention, and I'm now ready to answer any questions you may have. Have a nice day, stay safe, and bye!