Hi, I'm Zane Weissman, and I'm a PhD student at Worcester Polytechnic Institute, on land traditionally inhabited by the Nipmuc people, occupied by European settlers and called Worcester, Massachusetts, since the 17th century. And I'm Thore Tiemann, PhD student at the University of Lübeck in the north of Germany. Today we are presenting our work, JackHammer: Efficient Rowhammer on Heterogeneous FPGA-CPU Platforms, which is the result of a collaboration between the two of us, Daniel Moghimi, Evan Custodio, Thomas Eisenbarth, and Berk Sunar. In this presentation, we will explore tightly integrated FPGA-CPU microarchitectures designed for low-latency memory sharing in server and cloud environments, and demonstrate how these new architectures enable new cache side-channel attacks and powerful Rowhammer attacks.

Cloud services have become essential to the way we use computers. What you may not know is that all of these cloud service providers now offer access to field-programmable gate arrays, or FPGAs, to customers who might use them to accelerate tasks like cryptography, rendering, or machine learning. But in the age of Spectre and Meltdown, we know that hardware security is a major concern when multiple users share the same hardware in the cloud. So what does hardware sharing look like on a modern FPGA-CPU system? We studied two similar FPGA platforms, which are both supported by the Intel Acceleration Stack. On one hand, we have the Intel Arria 10 Programmable Acceleration Card, or PAC for short, which you can see on the right. It comes as a PCIe expansion card. On the other hand is the integrated Arria 10, which resides in one package with a Xeon processor; you can see it in the middle. Don't worry if you find this figure complex; we'll explain it in detail later and show it again whenever it may be helpful.

For the rest of our talk, there are a couple of things to keep in mind. First, we have to keep in mind the different address spaces we have to deal with. DRAM and caches are addressed using physical addresses, while software processes running on the CPU use virtual addresses. Most of you probably know this. Since the FPGA is considered a peripheral, it addresses memory using IO virtual addresses, like any IO device does. These IO virtual addresses may be equal to physical addresses if the system does not make use of an IO memory management unit, or IOMMU. When an IOMMU is used, however, the addresses are truly virtual, or, if the device is attached to a virtual machine, equal to the guest operating system's physical addresses. To ease DMA operations on the FPGA, buffers accessible by the FPGA need to be contiguous in its address space. Whether or not the IOMMU is enabled, the FPGA driver makes use of huge pages whenever the requested buffer size exceeds 4 kilobytes; the IOMMU will handle address translation regardless of page size. So the second thing to keep in mind is that we can rely on huge pages for all of our experiments. The last thing that needs consideration is the caches and their behavior when the CPU or FPGA reads, writes, or flushes memory. During this talk, we will give insight into the behavior of the caches, which we ask you to keep in mind, since we need it to understand why JackHammer performs significantly better than software Rowhammer.

So let's come back to the platforms we worked on, to give you a first understanding of the Intel Acceleration Stack terminology and its technical realization. In the figure, you see again the two FPGA platforms we had access to.
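As a quick aside on the huge-page point above, here is a minimal user-space sketch of how a process typically obtains a 2 MiB huge page on Linux; it assumes huge pages have been reserved in the kernel (for example via /proc/sys/vm/nr_hugepages) and is only for illustration, since in our experiments the buffers are requested through the FPGA driver instead.

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <stdint.h>
#include <sys/mman.h>

#define BUF_SIZE (2UL * 1024 * 1024)   /* one 2 MiB huge page */

int main(void)
{
    /* Request a huge-page-backed buffer, so it is physically contiguous. */
    void *buf = mmap(NULL, BUF_SIZE, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (buf == MAP_FAILED) {
        perror("mmap");                /* e.g. no huge pages reserved */
        return 1;
    }
    ((volatile uint8_t *)buf)[0] = 0x42;   /* touch the page to back it */
    munmap(buf, BUF_SIZE);
    return 0;
}
```

Within such a 2 MiB page, the low 21 bits of the virtual address equal the low 21 bits of the physical address, which is what makes huge pages so convenient for the cache-set and DRAM addressing discussed later.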
On the right, you see the Programmable Acceleration Card, or PAC for short, which contains an Arria 10 GX FPGA and is connected via PCIe to its host CPU. The FPGA itself is split into two regions. The proprietary blue region can be seen as the firmware of the FPGA. It is designed by Intel and provides maintenance functionality, like programming the FPGA or providing sensor data for temperature or supply voltage. In addition, it introduces an abstraction layer called the FPGA Interface Unit, or FIU, which abstracts away the complexity of the available communication buses and in exchange exposes the Core Cache Interface, or CCI-P, to a user's hardware design. The green region, on the other hand, is the partially reconfigurable region, where a user's hardware design is placed. Intel calls a user's hardware design an Accelerator Functional Unit, or AFU.

On the left, you see the integrated FPGA, which is an Arria 10 GX integrated into a Xeon CPU package. The FPGA is connected via two PCIe ports and one UPI port. UPI is Intel's proprietary Ultra Path Interconnect, used to tightly couple CPUs on multi-socket systems. Its main purpose is to realize the cross-socket coherency protocol that keeps all caches in the system in a coherent state. The blue region of the integrated FPGA additionally features a coherent cache and an IOMMU, because requests sent over UPI could otherwise bypass address translation. As I just mentioned, the integrated FPGA is included in the same package as the CPU. The systems we investigated have several cores with core-private L1 and L2 caches and a shared L3 cache, which is the last level cache, or LLC. Memory requests sent off by the FPGA are primarily served by the LLC and are only answered by the system RAM if the request misses the LLC. If the request is sent from the integrated FPGA via UPI, the FPGA cache serves the request if possible, and only forwards it to the CPU's LLC on a cache miss.

On the software side, we have the operating system, followed by the Open Programmable Acceleration Engine, or OPAE for short, which includes drivers for the FPGA and exposes an API to the user's software applications. The Core Cache Interface, CCI-P, can be understood as the FPGA-side API, as it is exposed to the AFU to be used for communication. There are two types of communication available: memory-mapped IO, or MMIO for short, and direct memory access, or DMA for short. The FPGA uses DMA to read from or write to main memory. This way, the FPGA can receive or send bigger chunks of data to or from the CPU. While all MMIO communication runs over PCIe, the FPGA can configure the communication channel to be used for DMA. The addresses provided for DMA operations are in general IO virtual addresses, but not all of our setups make use of the IOMMU, which is why we operate on physical addresses. Back when we wrote the paper, this was also true for the Intel vLab; however, this changed around December 2019. According to the documentation, CCI-P allows an AFU to provide caching hints when sending off DMA requests. For DMA reads, there are two, invalidate and shared, which hint the CPU and FPGA caches to cache the requested data accordingly. For write requests, there is a third option called write push invalidate, which hints the CPU to cache the data in the LLC but invalidates it in any FPGA cache. All caching hints are available for DMA operations sent over UPI, while requests over PCIe only support the invalidate hint.
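To make the software side of this stack a bit more concrete, here is a rough host-side sketch of sharing a DMA buffer with an AFU using the OPAE C API as we understand it; the CSR offset, the AFU behavior, and the omitted error handling are assumptions for illustration, not the exact code used in our experiments.

```c
#include <stdio.h>
#include <stdint.h>
#include <opae/fpga.h>   /* OPAE C API from the Intel Acceleration Stack */

#define BUF_SIZE (2UL * 1024 * 1024)  /* > 4 KiB, so the driver uses a huge page */
#define CSR_SRC  0x0100               /* hypothetical AFU CSR for the buffer address */

int main(void)
{
    fpga_properties filter = NULL;
    fpga_token token;
    fpga_handle fpga;
    uint32_t num_matches = 0;

    /* Find and open an accelerator (the loaded AFU). */
    fpgaGetProperties(NULL, &filter);
    fpgaPropertiesSetObjectType(filter, FPGA_ACCELERATOR);
    fpgaEnumerate(&filter, 1, &token, 1, &num_matches);
    if (num_matches == 0) { fprintf(stderr, "no AFU found\n"); return 1; }
    fpgaOpen(token, &fpga, 0);

    /* Share a buffer with the AFU: the driver pins it and returns an
     * IO virtual address the AFU can use for DMA. */
    void *buf = NULL;
    uint64_t wsid = 0, ioaddr = 0;
    fpgaPrepareBuffer(fpga, BUF_SIZE, &buf, &wsid, 0);
    fpgaGetIOAddress(fpga, wsid, &ioaddr);

    /* Tell the AFU where the buffer lives via MMIO (CSR offset is made up). */
    fpgaWriteMMIO64(fpga, 0, CSR_SRC, ioaddr);

    fpgaReleaseBuffer(fpga, wsid);
    fpgaClose(fpga);
    fpgaDestroyToken(&token);
    fpgaDestroyProperties(&filter);
    return 0;
}
```

The 2 MiB request size means the driver backs the buffer with a huge page, consistent with the behavior described earlier.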
Thore will now take a closer look at the caches and the caching hints, but let me already say that we found that the blue regions of all tested platforms, including those in the Intel vLab, silently drop and therefore ignore certain caching hints.

Right, but before I go into the details of our findings concerning cache attacks, let me quickly refresh your knowledge about two important cache attacks, Flush and Reload and Prime and Probe. To run a Flush and Reload attack, the attacker requires access to a flush instruction and the capability to precisely measure timings. In addition, the attacker needs to share memory with the victim. The attack then works in three phases. First, the attacker flushes the shared memory. By this, the cache line is removed from all caches in the system. In the second step, the attacker waits for the victim to execute and maybe access the shared memory. After the victim has executed, the attacker accesses the shared memory and times the access latency. If the access is fast, the victim must have accessed the shared memory, as it is in the cache again. If the access is slow, the data is served from RAM and therefore was not accessed by the victim.

In cases where attacker and victim do not share memory, or if the attacker has no access to a flush instruction, Flush and Reload is not an option. Instead, the attacker may launch a Prime and Probe attack if she shares a cache with the victim. For Prime and Probe, the attacker needs to collect an eviction set, that is, a set of addresses that, when accessed, completely fill the cache set the attacker wants to monitor. The attack then runs in three phases again. In the first phase, the attacker primes the cache set by accessing all addresses in the eviction set. The second phase is again a waiting phase, where the attacker waits for the victim to execute. During execution, the victim might access an address that maps to the cache set monitored by the attacker. In this case, one of the addresses in the attacker's eviction set gets evicted from the cache. In the last phase, the attacker probes the eviction set by accessing all addresses and timing the overall latency. If the probe runs fast, the victim did not access an address that maps to the monitored cache set. If the probe runs slow, the victim must have accessed an address that maps to the same cache set and thereby evicted an address of the eviction set, which is now served from slower DRAM.

So let's have a look at what an FPGA-based attacker can do. We will first focus on the PCIe scenario, where the attacker has DMA access via PCIe. This scenario is applicable to the integrated FPGA and the PAC. We assume the victim to be located on the CPU in this case. We started off by measuring access times to see if an attacker can distinguish different locations of a cache line. To do so, we implemented a timer on the FPGA by incrementing a register at every clock cycle. This is similar to measuring timings on the CPU using the timestamp counter, with the advantage that our timer runs uninterrupted and in parallel with all other operations performed on the FPGA. We used the timer to measure access times to addresses we placed in the LLC and in main memory. As you can see in the histogram, we have two distinct peaks for the two locations. And even though they are close together, we found that requests served from the LLC are clearly distinguishable from those coming from DRAM, because our timer is precise and outliers occur rarely when the LLC serves the request.
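For reference, the three phases of Flush and Reload boil down to a few lines on the CPU side; a minimal sketch, assuming attacker and victim share the probed cache line (the FPGA attacker discussed next has no flush instruction and uses its clock-cycle counter instead of the timestamp counter).

```c
#include <stdint.h>
#include <x86intrin.h>   /* _mm_clflush, _mm_mfence, __rdtscp */

/* Minimal Flush and Reload probe: returns the reload latency in cycles
 * for one cache line shared with the victim. */
static inline uint64_t flush_reload(volatile uint8_t *addr)
{
    unsigned int aux;
    _mm_clflush((void *)addr);          /* 1. flush: evict the line everywhere */
    _mm_mfence();
    /* 2. wait for the victim to run here ... */
    uint64_t start = __rdtscp(&aux);    /* serialize, then take a timestamp    */
    (void)*addr;                        /* 3. reload the shared line           */
    uint64_t end = __rdtscp(&aux);
    return end - start;                 /* fast => the victim touched the line */
}
```

A threshold between the cache and DRAM peaks of a timing histogram like the one just shown is then used to classify each reload as a hit or a miss.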
Let's now focus on FPGA-based attackers that have access to UPI. This scenario is only applicable to the integrated FPGA. In contrast to the previous measurements, we now expect to see three distinct peaks, and when using UPI, we can in fact see three distinct peaks. Also, we can see that the FPGA cache has a constant answer time without outliers. This is because the cache is part of the blue region and because of the precision of our timer. The other peaks are quite narrow as well, which is caused by the high-bandwidth UPI link and the low system load. Together with the PCIe scenario, this is an important finding on the way to running cache attacks against the CPU, since we can do the required measurements for step three of Flush and Reload and Prime and Probe. However, we still either need a flush instruction or some way of priming the LLC to perform one of the attacks. Even though we can send DMA requests via UPI, we cannot directly send UPI messages, since we are bound to CCI-P. Because of this, and because the FPGA does not offer a flush instruction, an FPGA attacker cannot run flush-based attacks but has to rely on eviction-based attack techniques.

However, during the previous measurements we found that DMA reads do not alter the location an address is cached at. So we started digging deeper and analyzed the caching hints CCI-P gives us control over. We especially had high hopes for the write push invalidate flag, which is supposed to load data directly into the last level cache. Here you can see a histogram of CPU access times to addresses that were previously written to by the FPGA using DMA writes, with the caching hint set to write line invalidate. This flag is supposed to invalidate the cache line and write the data to main memory. For reference, we included the histogram for ordinary memory accesses, denoted MA, in black. As you can see, access times to addresses that were written via PCIe are lower than one would expect when accessing DRAM, and it turns out that these requests were served by the last level cache instead. This is also true for most accesses to addresses that were previously written via UPI, but some timings are also higher than an access to main memory. These requests are served from the FPGA's cache instead of DRAM. So the caching hint does not lead to the expected behavior, but rather behaves like what we would expect to see with the write push invalidate flag. We also tried the other two caching hints that are available for write requests and found similar behavior. From this experiment we conclude that caching hints for write requests are not properly implemented at the time of writing: independent of the provided caching hint, the data is always written to the last level cache.

At first sight, this sounds like good news, because we were looking for a way to prime the last level cache. However, it turns out that the behavior we see is not caused by the caching hints we add to the DMA requests. Instead, what we see is a CPU feature called Data Direct I/O, or DDIO for short, that allows peripherals to write directly to the last level cache. And since DDIO in its default configuration can only write to 10% of the ways per cache set in the last level cache, this does not allow us to mount a full cache attack against the CPU. However, we can at least provide a proof of concept of the channel we found by constructing a covert channel to send data from the FPGA to the CPU. Our covert channel works like the Prime and Probe attack, but with the victim willingly sending data.
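On the receiving side, such an eviction-based channel comes down to the probe step; a minimal sketch, assuming an eviction set for the targeted LLC set has already been constructed and a platform-specific latency threshold has been calibrated.

```c
#include <stdint.h>
#include <stddef.h>
#include <x86intrin.h>   /* __rdtscp */

/* Probe an eviction set: access every address and time the whole pass.
 * A slow probe suggests someone else (here, the FPGA sender) displaced
 * one of our lines from the monitored LLC set; the same accesses also
 * re-prime the set for the next round. */
static uint64_t probe(volatile uint8_t **evset, size_t n)
{
    unsigned int aux;
    uint64_t start = __rdtscp(&aux);
    for (size_t i = 0; i < n; i++)
        (void)*evset[i];               /* touch every eviction-set address */
    uint64_t end = __rdtscp(&aux);
    return end - start;                /* compare against a threshold      */
}
```

In the covert channel described next, a probe above the threshold is decoded as a 1 and a probe below it as a 0.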
To set this up, we had one software process construct an eviction set for a specific cache set in the last level cache and monitor it like an attacker would in a Prime and Probe attack. The FPGA, on the other side, writes to an address that maps to the same cache set whenever it wants to send a 1 and stays quiet whenever it wants to send a 0. This way, the software receives a 1 whenever it sees an increased probe latency and receives a 0 otherwise. To ease the decoding, we had the FPGA send at half the speed the software was probing, which results in two probes per bit. Also, the FPGA uses repetition encoding, sending each bit three times, to reduce transmission errors. You can see the resulting software measurements in the upper plot. We can see that each peak is followed by a slope because of the second probe, and there are always three peaks per block. On the bottom, you can see the decoded signal. Even this simple covert channel achieves a throughput of 94.98 kilobits per second. This could easily be increased by a factor of six if the FPGA doubled its sending speed and omitted the repetition encoding.

After we found that DMA reads do not change the location of a cache line, we started exploring Rowhammer attacks originating from the FPGA. This turned out to be fruitful and resulted in the JackHammer attack that Zane will explain to you now.

JackHammer is our implementation of the Rowhammer exploit that runs on Intel FPGAs and attacks the host system's main memory. To understand how it works and why it's so effective, let's first take a look at how Rowhammer works in general. Rowhammer attacks rely on the attacker having access to memory that's physically located near memory used by the victim. I'll get into how memory is laid out in a minute. The idea behind Rowhammer is that the attacker accesses one address, then the other, then the first, then the second, and back and forth until a bit in the victim's memory changes value, without the attacker ever directly accessing that memory. Then, when the victim accesses that memory next, she reads the wrong value from memory.

So what's going on in a Rowhammer attack? First, the attacker has to allocate and identify memory located near memory used by the victim. Memory is organized into banks, each of which contains many rows, and all reads and writes at the hardware level target one row at a time. Every physical address is statically mapped to a certain row and bank by XORing certain bits of the physical address, depending on the exact system and memory configuration. The actual corruption of the memory is likely caused by an electromagnetic effect. Each bit of memory in DRAM is stored in a capacitor gated by a transistor. The voltage stored in the capacitor determines whether the bit will be read as a one or a zero. But the transistor allows a little current through the capacitor even when the memory is not being directly read or written, so it is the duty of the memory controller to refresh every row of the memory no more than 64 milliseconds after the last access or refresh. The prevailing hypothesis for the way Rowhammer works is that every nearby row access causes a little bit of extra current through the capacitor, making that 64 millisecond refresh interval no longer sufficient to preserve the values stored in memory. So the simplest, and one of the most effective, defenses against Rowhammer attacks is to just decrease that row refresh interval, for example to 32 or 16 milliseconds.
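On the CPU, the back-and-forth access pattern just described is the classic hammering loop; a minimal sketch, assuming the two aggressor addresses map to different rows of the same bank (the round count and flush-per-access pattern are illustrative).

```c
#include <stdint.h>
#include <x86intrin.h>   /* _mm_clflush */

#define HAMMER_ROUNDS 1000000L

/* Classic CPU hammering loop: alternate accesses to two aggressor rows,
 * flushing each line so that every access really reaches DRAM. */
static void hammer(volatile uint8_t *aggr1, volatile uint8_t *aggr2)
{
    for (long i = 0; i < HAMMER_ROUNDS; i++) {
        (void)*aggr1;                   /* activate aggressor row 1 */
        (void)*aggr2;                   /* activate aggressor row 2 */
        _mm_clflush((void *)aggr1);     /* evict, or the next read hits the cache */
        _mm_clflush((void *)aggr2);
    }
}
```

The flushes in this loop are exactly the overhead that the FPGA avoids, which is what the performance comparison in the next part comes down to.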
This is an option that can be found in the BIOS of most server and many consumer motherboards. A lower row refresh interval means the memory spends more time refreshing rows that aren't being used, so it has a small impact on maximum read and write throughput. But in exchange, it offers substantial protection against Rowhammer attacks as well as against randomly occurring memory faults. Rowhammer has been shown to work on lots of DDR3 memory chips and far fewer DDR4 chips. In some cases, it can also work on memory with error-correcting codes, or ECC. For the purposes of this research, we studied DDR3 without ECC, because the effects of Rowhammer on DDR3 are more easily quantifiable, making it easier to compare across platforms and implementations.

We're primarily concerned with Rowhammer attacks mounted from the FPGA against the RAM of the host system. Our Rowhammer implementation for Arria 10 FPGAs, which we call JackHammer, relies on the CPU for the addresses of the aggressor rows that it will use for the attack. Once the CPU sends a start signal, it runs automatically, repeatedly accessing the specified addresses and, given the right addresses and hardware configuration, causing bit flips in the RAM.

So how does the performance measure up to a CPU Rowhammer attack on the same system? The figure on the left shows distributions of hammering rates, that is, the number of memory accesses per second, achieved by our JackHammer attack running on an Arria 10 PAC and by a Rowhammer attack running on its host CPU, an Ivy Bridge i7. The PAC achieves remarkably consistent memory throughput at nearly double the typical rate of the CPU. In the figure on the right, you can see how this translates to the rate of bit flips in the victim row. In many cases, JackHammer is able to flip four times as many bits per second as a CPU Rowhammer attack on the same system.

So what makes JackHammer so much more effective? Well, recall that the Rowhammer exploit is based on non-linear effects. In every row refresh interval, the attacker is trying to push the bit voltages in the victim memory as far away from the correct value as possible, but at the end of the refresh interval, the row is refreshed and the voltages are reset. So just doubling the rate of memory accesses can make the chance of a successful bit flip during a particular refresh interval substantially higher. But how does the FPGA access the memory so much faster? In short: caching, or more precisely, not caching. When the FPGA reads memory, it does not place a new entry in the CPU's cache. The CPU, of course, caches all of its memory accesses by default. This means that for Rowhammer to work, the CPU has to flush each address between memory accesses. The FPGA has no such limitation.

So let's remove that limitation for the CPU and see what happens. This figure shows memory access rates during a Rowhammer attack from the Arria 10 PAC and from the CPU, with caching enabled and disabled in the kernel. In the left two columns, you can see that with caching enabled, JackHammer runs about twice as fast as the CPU attack. In the right-hand columns, with the memory pages marked as uncacheable, JackHammer runs about 50% faster than it does with caching enabled. But wait, the FPGA doesn't load anything into the cache, so what's saving so much time? Well, the FPGA can read from the cache, but with the memory set to uncacheable, the CPU's memory subsystem doesn't waste any time checking the cache for a copy of the requested data and sends the request straight to main memory.
But what we're really interested in here is the rightmost column, where you can see that with the cache out of the way and no time wasted flushing memory, the CPU Rowhammer attack runs faster than JackHammer does with the cache in place, and almost as fast as JackHammer without the cache. All of those flushes in a regular CPU Rowhammer attack really add up. But what do these performance metrics really mean for a practical attack?

We constructed a fault injection attack against the RSA signature function in the wolfCrypt library. The vulnerability was reported and fixed in December of 2019. In this attack, the victim first sets up an RSA key with wolfCrypt, while the attacker allocates the shared memory that the FPGA will use for the attack. The FPGA runs the attack by repeatedly accessing the main memory, flipping one of the bits in an intermediate value used in the RSA signature algorithm. This causes the signature itself to be faulty, and the faulty signature leaks part of the RSA private key. The rest of the key can be recovered with either a valid signature computed on the same message with the same private key, or with the message itself and the public key. This fault injection against the RSA algorithm has been public knowledge for a couple of decades and is commonly called the Bellcore attack. We tested this attack with and without the standard defense against Rowhammer of an increased row refresh rate. In the best case, with a standard row refresh interval of 64 milliseconds, JackHammer caused a fault on average 25% faster than a CPU Rowhammer attack. But with the refresh rate doubled, JackHammer was 185% faster.

We took a couple of shortcuts in our setup of this attack, neither of which affects these performance measurements. The first is that we manually located the memory to be used by the attacker and the victim. In a realistic scenario, an attacker would use a technique known as page spraying, allocating and deallocating memory pages until the right setup is reached. Page spraying can also be assisted by using cache attacks to monitor the memory usage of the victim program. The other simplification we make is that we directly flush the memory used by the victim program. Realistically, an attacker would use an eviction set to force the memory out of the cache and cause the victim to read from memory, where the Rowhammer attack has caused a fault.

In conclusion, we identified and verified timing leakages on the Arria 10 platforms. We analyzed the caching hints that are available and constructed a covert channel from the FPGA to the CPU that reaches 94.98 kilobits per second. We also showed that FPGAs can be used to accelerate Rowhammer performance by 25% and used this to demonstrate a fault injection attack against the RSA implementation of wolfSSL, which resulted in the CVE you can see on the screen. Thanks for your attention. Feel free to read our paper for more details, and if you have any questions, come and ask them during our Q&A session at CHES 2020 or send us an email.