Howdy. We're here from Cloudflare to discuss our story on memory encryption. I'm Derek, and I work on the infrastructure security team at Cloudflare, based out of Austin, Texas. And I'm Brian. I'm a hardware engineer on Cloudflare's hardware team, also based out of Austin. Our primary focus is designing what our next-generation server platform looks like and how we can make it highly secure without impacting performance. Before we get started, here's one quick slide on who we are as a company and our global presence, to help visualize the scale we operate at and the level of work it takes to design our server platforms. A little about who we are: our network spans more than 200 cities in over 95 countries, including 17 cities in mainland China. We have interconnects with over 8,800 networks globally, including major ISPs, cloud services, and enterprises. We serve over 27 million internet properties and are used by approximately 13% of the Fortune 1000. More than 1 billion unique IP addresses pass through Cloudflare's network every day. We operate within 100 milliseconds of 99% of the internet-connected population in the developed world, and of over 95% of the internet-connected population globally. For context, the blink of an eye is 300 to 400 milliseconds. We serve 14 million HTTP requests per second on average, with more than 17 million HTTP requests per second at peak. We consistently handle approximately 4.6 million DNS queries per second; that's around 400 billion queries per day and about 11.9 trillion queries per month. We bring up these numbers because everything we do is at scale, and at the same time, security is of the utmost importance. So we wanted to talk a little about how we handle encryption here at Cloudflare. We encrypt data in its different states. We encrypt data at rest, including our cache on disk. 
One of our engineers recently made a post about patching the dm-crypt Linux kernel module, because he found that while the cost of SSDs and flash drives had gone down, the module was still built for spinning disks. So a patch was created to remove all the extra queueing and asynchronous behavior and revert dm-crypt to its original purpose: simply encrypting I/O requests as they pass through, since we're using much faster storage than we were 10 years ago. We also encrypt data in transit. We have even collaborated with the Internet Engineering Task Force on evolving and standardizing the latest version of TLS. This helps address some of the older cryptographic problems and design flaws in TLS that created the conditions for attacks like Heartbleed, POODLE, and BERserk. But what about data in use? This is data being processed by one or more applications: data currently in the process of being created, updated, appended, or deleted. It also includes data being viewed by users accessing it through various endpoints, and it is susceptible to different kinds of threats depending on where it sits in the system and who is able to use it. While we use different methods to protect data in use, we are always challenging ourselves to find better protections. One of our concerns is that someone could steal one of our servers out of a data center or a colo. And it doesn't necessarily have to be a Mission Impossible-style snatch and grab. How many of you have received reports of racks that were left unlocked, or worse, missing a door? I joke about these things, but they do happen. Racks can be left unprotected, and sometimes controls can be bypassed. So if someone were to steal one of our servers, the question becomes: what exactly could they pull off of it? 
So when we first started discussing the concept of protecting — or further protecting — data in use, we wanted to address how we could protect memory at our current and future scale. The reason this is important is that data in memory is stored in the clear. This can leave data vulnerable to snooping by unauthorized administrators or to different methods of probing. DIMM memory modules gradually lose data over time after power is removed, but they do not lose it all immediately. We've seen from a number of research papers that memory modules can potentially retain at least some data for up to 90 minutes after power loss. "Well, a reboot will generally take care of flushing memory caches, right?" That's exactly what cold boot attacks were designed to defeat: reboot the machine and dump the contents of the retained physical memory, via firmware modifications, which an attacker can then inspect. Newer non-volatile memory technologies exacerbate this problem, since an NVDIMM chip can be physically removed from a system with its data intact: it uses NAND flash to store a copy of its contents, similar to a hard drive. Without encryption, any stored information — sensitive data, passwords, or secret keys — can be easily compromised. So do these attacks really happen? Cold boot attacks, as mentioned previously, were first talked about more than a decade ago and have started making a comeback, with recent research papers introducing new methods to defeat the DDR memory scrambling that was used to obfuscate data written across the memory bus. By monitoring memory bus transactions, attackers listen and look for objects that may be secret in nature — think passwords, TLS keys, etc. Since the data was merely obfuscated via XOR and not encrypted, these attacks were not very sophisticated, leaving DRAM exposed to memory extraction techniques. 
And then we had RAMBleed, which allowed an unprivileged attacker to read out memory belonging to other processes by leveraging the Rowhammer bit-flipping effect. Common hardware mitigations such as Target Row Refresh introduced other potential attack vectors, like TRRespass. Increasing the DRAM refresh rate leads to fewer bit flips, but there is a power and performance trade-off. And while ECC memory does complicate the attack, it does not prevent it. So we started looking into ways of better protecting memory, and we started looking at enclaves. Memory encryption and isolation can be achieved with enclaves. It can be done in software only, but hardware manufacturers have built hardware-assisted trusted execution environments to help create security boundaries, isolating software execution at runtime so that sensitive data can be processed in a trusted environment, such as a secure area inside an existing processor or a trusted platform module. But enclaves were really meant to process and run only small pieces of code, not an entire OS. While there have been research papers showing how you can do it, they come with performance trade-offs. The enclave page cache is also limited to 128 to 256 megabytes, and there is still a performance cost to enabling it. At the same time, application refactoring is required not just to enable but also to use the enclave itself. And there has been a string of recent vulnerabilities: things like Load Value Injection, a transient-execution attack that injects attacker data into a victim program and steals sensitive data and keys from an enclave; more recently CacheOut, a newer speculative-execution attack capable of leaking data from caching mechanisms, including enclaves; and SGAxe, a further evolution of CacheOut in the form of an enclave side-channel attack. 
So we made a series of blog posts earlier in March about our next-generation server hardware, which we labeled Gen X for our 10th generation, based on the AMD Rome architecture. We spoke about thermal design power, improvements in L3 cache, and overall performance tuning. But we were surprised by some of the included security features, which weren't readily available from other manufacturers. In this case, it was Secure Memory Encryption. Secure Memory Encryption (SME) is an x86 instruction set extension introduced by AMD in 2016 — so it's been around for a few years — providing page-granular memory encryption support using a single ephemeral key, with a new key generated by the processor on every boot. A page that is marked encrypted will be automatically decrypted when read from DRAM and encrypted when written to DRAM. And while there have been a handful of presentations and papers on Secure Encrypted Virtualization, also known as SEV, it wasn't a feature we would use, as we typically do not isolate with hypervisors. The SME components are fairly straightforward. There's an AES-128 encryption engine embedded in the memory controllers that can transparently encrypt and decrypt data in main memory once an encryption key has been provided by the secure processor. Then you have the AMD Secure Processor, an on-die 32-bit ARM Cortex-A5 CPU that provides cryptographic functionality for secure key generation and key management. You can think of it like a mini hardware security module: it uses a hardware random number generator to generate the 128-bit AES keys used by the encryption engine. The AES algorithm uses the physical address as a type of tweak, similar to a nonce. It is hardware-isolated, so keys are never sent in the clear outside of the system-on-a-chip, and it runs its own secure OS in firmware. So how does it work? It works by enabling a bit in a model-specific register, a control register that configures this x86 instruction set extension. 
This enables the ability to set a page table entry encryption bit; here we have the documentation from the official AMD developer's manual. Support for SME can be determined through the following CPUID function: bit 0 of EAX indicates support for SME — again, per the relevant AMD documentation. Here you can see the validation output on a test box. You can validate that it's turned on by grepping the kernel message buffer (dmesg) for SME. You can view the EAX register contents using the cpuid utility to show support for the feature in the processor, and validate that bit 23 in the MSR is set. So here's how an actual write works after memory encryption is enabled. A physical address bit, known as the C-bit (for "encrypted" bit), is used to mark whether a memory page is protected. The operating system sets that bit of the physical address to one in the page table entry to indicate the page should be encrypted, which causes any data assigned to that memory space to be automatically encrypted when written to memory. So a page is allocated, the page is zeroized, and the encryption bit in the PTE is toggled — if it's set, clear it; if it's clear, set it. That takes a series of steps: flush the translation lookaside buffer, flush the memory caches, update the PTE, and then flush the TLB again. When data is read, the secure processor provides the key to the AES engine to decrypt the data; the operating system sets the bit of the physical address to zero in the page table entry to indicate the page should be decrypted. And this is how standard SME works. And while it would be great to mark the pages we want encrypted ad hoc, we wanted to ensure that all memory was encrypted by default. That's when we looked into Transparent SME. As the name suggests, all memory is encrypted, and it's done transparently in the background, invisible to the OS. 
All traffic going to the memory controller is encrypted, regardless of the value of the encrypt bit on any particular page. This includes instruction pages, data pages, and the pages of the page tables themselves. And no application changes are required — no need to refactor applications to ensure they use encrypted memory. It's a BIOS/UEFI option that, when enabled, sets the MSR bit to active. Then your OS can activate memory encryption by default by setting the relevant kernel config flag and supplying mem_encrypt=on on the kernel command line. So now that we knew it was active, we wanted to test whether it actually worked. We built and loaded a kernel module specifically for memory testing. It allocates a page of memory, zeroes out the allocated memory, and issues a set_memory_decrypted() function call against the allocated memory. This call removes the encryption bit associated with the buffer under test. It doesn't actually decrypt the contents of the memory buffer — it just marks it as not encrypted. This can then be compared against a reference buffer to determine the state of Secure Memory Encryption. Then we check whether the allocated memory is still all zeros. If SME is enabled, the memory will no longer read back as zeros (we see ciphertext); if SME is disabled, it will still be all zeros. So here we load the kernel module and get an error — and we do that intentionally. We want the load to fail on purpose so that the module doesn't have to be unloaded before re-running the test, while still capturing the diagnostic output. With the module failing to load, we can still see the contents of the memory buffer. We can view the module's console output, where we see the printout of the actual hex dump. The printout shows the beginning of the buffer before the call to set_memory_decrypted(). 
It checks that the buffer matches the zeroed reference buffer across the page size before the call — and after, the buffers do not match. So now we know it works. But how does it perform? That brings us to our performance testing, as well as rolling it out to production. And I'll hand it off to you, Brian, to go over the results. All right, thanks. So, yeah, as Derek said, now that we knew the feature worked, our next step was to test how, if at all, it would affect performance. We ran a series of benchmarks in the lab, and then, based on those results, we took it to production. The Gen X servers that we're running this on have eight 32 GB DIMMs running at 2933 MHz, and we're using the EPYC 7642 processor, which has 48 cores and 96 threads, running in NPS4 (nodes per socket = 4) mode. We're a Debian shop — we run Debian 9 on these servers — and our kernel version is 5.4.12. So the first test we ran was STREAM, the industry-standard memory bandwidth test. We used the standard stream.c available from the University of Virginia, but one change we did make was to increase the data set size to around 5 gigabytes. The reason is that these processors have a large 256 megabyte level 3 cache, and we didn't want that big cache skewing the results. In the graph you're looking at here, you can see that depending on which sub-benchmark you look at, we saw anywhere from a 2.6% to a 4.2% performance loss from enabling SME. The next test we ran was the cryptsetup command. This tool is normally used to encrypt disks, but it has a built-in cryptography benchmark that we can use as a quick test of CPU and memory performance. On this test, we saw less than 1% performance loss from activating SME — this benchmark is not particularly memory-bandwidth constrained. And then finally, we ran a custom web traffic benchmark developed by our performance team, which uses Cloudflare Workers to generate web traffic from one host to another in the lab. 
Here again we saw roughly a 1% performance hit when transferring a small 10 kilobyte image from one host to another, using 256 concurrent clients. So, encouraged by these results, we went ahead and activated SME on a host in production and compared it to the performance of a host sitting right next to it in the rack — both in the same colo, one with SME off and one with SME on. This graph is a snapshot of a recent interval of web traffic to the servers; this is nginx requests serviced. You can see the performance of the two servers tracks pretty closely. Over the time period shown here, the host with SME enabled averaged around 5% fewer requests per second serviced. Thanks for that, Brian. So what's next? Some of our future work includes a fleet-wide rollout. We currently have this enabled in one colo, and we are pleased with the results, so we're planning on rolling this out fleet-wide. The ball is also in Intel's court for Total Memory Encryption; that spec was released back in 2017, and we've recently seen Intel make some progress on deploying it in some of their future processors, so we'll be excited to test that as well. At the same time, more research: we love testing CPUs, so we're looking to see if ARM has an equivalent, as we believe full memory encryption is a technology that will be widely adopted. We're also looking into some newer AMD features for memory encryption, like Secure Nested Paging, and seeing if it can protect container runtimes. So to summarize: first, memory attacks will happen, and they will continue to get more sophisticated even as we continue to create countermeasures for them. Full memory encryption is available. It's an added security feature that doesn't require code refactoring, it was surprisingly easy to turn on and test, and the overhead isn't as bad as we thought. 
In the majority of test results, performance decreased by only a nominal amount — actually less than we expected. The official white paper on SME even states that encryption and decryption of memory through the AES engine does incur a small amount of additional latency for DRAM memory accesses, although it is workload-dependent. Across all 11 data points, our average performance penalty was only 0.699%. Even at scale, enabling this feature reduces the worry that data could be exfiltrated from a stolen server. So for up-to-date info on what we're working on, please feel free to follow our blog, where we consistently publish content on new technologies and topics relevant in this space. And with that, we thank you. Thanks for watching. Hi, Derek and I are here to answer questions. We had some good ones coming in through the Q&A. Angie asks: what's the overhead of encrypting memory? Hopefully we addressed that to your satisfaction, Angie. We ran a bunch of synthetic tests and saw anywhere from 0% up to about a 4.6% hit on the tests we were running. And then in our live production environment, it's around 4% to 5% overhead, as in fewer requests per second serviced by nginx with the SME feature turned on. Fernando asks: do other chip manufacturers have similar instructions, like Intel or ARM? And do you manage this on smaller devices, not just desktops? So yeah, as we addressed, Intel has a spec for Total Memory Encryption, as well as Multi-Key Total Memory Encryption, which is similar to AMD's. They do have it on some smaller chipsets too — AMD does, actually. As for ARM, we're still investigating. They don't share the same x86 instruction sets that Intel and AMD do, but we're in discussions to find out if they have a relevant equivalent. 
Yeah, and there were some press releases from Intel based on a security summit they held earlier this year, where they said they are getting the infrastructure in place for their implementation of this — patches to the Linux kernel and things like that — and that the feature would be supported in hardware only on future chips. They haven't really made a specific announcement about when that will happen. But that's a good question; we got that one from a couple of people. Yeah. "Any experience using SEV with VMs or containers you can share?" So we're not looking at SEV; while we've seen extensions of it for containers, we're more interested in the Secure Nested Paging feature in AMD chipsets, which we're looking into as it protects those container runtimes. "I'm not sure if you can answer, but has Cloudflare ever noticed a server, or various components, suddenly going offline and then coming back online?" Not one we can really answer, but it is a concern of ours, and hence we wanted to protect against these sorts of physical attack vectors, this being one of them: if this were to happen, how could we ensure that our hardware was protected as best as possible? "Is there any reason to use transparent SME instead of SME, if the OS (Linux) supports SME?" We wanted it to just happen transparently in the background and have it be BIOS-controlled — we didn't want to turn it on ad hoc. And since we automate a lot of our configuration when we deploy hardware, we figured it was a lot easier to just turn this on everywhere. And then we already answered the Intel question. "The comparison between SGX and SME is not right here." You're right — it is actually a different threat model. The reason we brought it up is that it was something we considered at the onset. 
We didn't want the overhead of actually enabling enclaves, because there were limitations if we were looking at protecting all memory — we didn't want that limited memory space. Most SGX machines, from what we found, allocate a maximum of 128 megs of memory for the EPC, the enclave page cache, and that would be shared across all enclaves. We didn't want to arbitrarily choose what to run in the enclave; we were more broadly worried about memory protection. So while it's not the same threat model, it was still the same concern. That's all the questions we have. Yeah, I think we addressed them all. Oh, wait a second, there's a second page. Let's try to get to some more of these if we can. You see at the top where you can click to page 2 of 3? Yeah. "Would it make sense to provision per-user memory encryption keys?" That's something that we have thought about and are still looking into — so yeah, that's potentially a use case too. "Can virtual machines choose to utilize keys independent of the host?" Yeah, that's a feature of Secure Encrypted Virtualization: if you are running hypervisors, the host OS runs its own encryption key and every VM has an encryption key assigned to it. "SME doesn't have cryptographic integrity. Can you give some thoughts on whether your threat model includes attacks on integrity? If it doesn't, should it?" Yeah, that's a tough question. It does, and I think that's something we've addressed with the vendor specifically. As I said, if we're trying to do some sort of measurement, then it's either going to be a combination of other features that we enable, or something we defer back to the vendor on. So, Michael asks: you were using 32 gig DIMMs — oh, sorry — yeah, you were using 32 gig DIMMs; do you anticipate any additional problems with higher-density memory? 
I don't think the feature is going to work significantly differently with a different memory config — though I don't have any measurements to base that on — because it is all in hardware, all built into the memory controllers. I would expect it to work the same way with 64 gig DIMMs. So the last one I'll take is: can an attacker tamper with the boot process to disable memory encryption? No — we're layering that with other features, things like a hardware root of trust, to prevent that. That's a big thing too, and there are mitigations you can deploy to limit that as well. That's something we're going to look at sharing in a future talk. Thanks, everybody, for the questions. That was great. Yep. Thank you.