Hello everyone, thanks for tuning in to our talk about our work with the title "Let's Take It Offline: Boosting Brute-Force Attacks on iPhone's User Authentication through Side-Channel Analysis", which was a collaborative work with my colleagues at Ruhr University Bochum, namely Oleksiy Lisovets, Thorben Moos and Amir Moradi. Okay, so what is the work all about? In today's world, the smartphone has become the most important device in our everyday life. We store all our pictures on it, we use it as a credit card, we monitor more and more health data with it, and basically all our communication is done on and stored on the small device in our pocket. The iPhone is one of the most widespread and popular smartphones, with a market share of 14% in the first quarter of 2020. If you don't own one yourself, you definitely know someone who does. As a result, it is an interesting target for practical security analysis. And although there is vivid research going on, with software exploits and new vulnerabilities being published quite frequently, very little research has been done on implementation attacks. So we asked ourselves: what about physical attacks? In particular, are side-channel attacks feasible on such compact architectures, and what benefit can an adversary gain compared to software exploits? As a first step, we needed to figure out which iPhone model to consider. For this, we had two main requirements. The first requirement, which is necessary in order to perform a power side-channel attack at all, is electrical access to the cryptographic hardware engine. Basically, we need to be able to communicate with the engine and trigger an arbitrary number of encryptions with chosen data. To do so, we needed to be able to run arbitrary code on the application processor of the iPhone, which can be achieved through software exploits.
As we did not want to focus on finding new software exploits, which are very hard to come up with, we needed a model for which such exploits are publicly available. The second requirement is that, as there is hardly any related research in this direction, the device should, as a first step, not be protected by advanced countermeasures against power side-channel analysis. As we expected Apple to have applied such countermeasures in newer devices, we decided to go with a rather old iPhone series. For the iPhone 4, both requirements were fulfilled: there already exists a public boot ROM exploit, which we could leverage to gain electrical access to the crypto engine, which, to be more precise, is an AES engine. So we decided to start our research on Apple's iPhone 4. Let's now take a closer look at which hardware-fused secret keys are stored on the iPhone 4. In other words, what could we try to extract? The two main relevant keys are the so-called group identifier key, the GID key, and the unique identifier key, the UID key, which are both 256-bit keys for the AES engine. The GID key is the same for all processors in a class of devices and is used to encrypt the system software. The UID key, on the other hand, is unique for every device and is used for user authentication and for encrypting and decrypting data. So these are the two main keys we were initially interested in recovering. As we learned, the UID key is used for user authentication, so it is an interesting target to recover. But what exactly do we gain from extracting the UID key? To understand this, let's first take a brief look at how user authentication works on the iPhone. As can be seen here, the user enters her or his passcode, which is then entangled with the UID key to derive another key. This derived key is then used to decrypt, or unwrap, the so-called class keys, which are in turn used for data decryption. Deriving the key for unwrapping the class keys is done by a custom KDF.
The important thing here is that the AES engine is invoked iteratively, with the passcode as the initial input and the UID key as the AES key. The described process has two implications. (a) It binds the passcode search to a single device, as the UID key is burned into the silicon and is unique for every device. (b) As the iteration count, that is, how often the AES is invoked, is set such that every passcode guess takes 80 milliseconds on the device, a brute-force search of the passcode is slow. For the iPhone 4, this results in an iteration count of 50,000. Note that a further measure to prevent brute-force attacks is an increasing time interval between subsequent passcode inputs. This is a purely software-based restriction, which can already be bypassed by existing software exploits, so extracting the UID key is not needed to get around it. So what would be the advantage of extracting the key? Knowing the UID key effectively decouples the passcode search from the single, performance-restricted iPhone. With the key in hand, an adversary can run the passcode search on an arbitrary amount of arbitrary hardware in parallel, which enables the search at a much larger scale. So what would an attack scenario look like? The adversary maliciously gains access to the victim's iPhone 4. The adversary then downloads the system keybag from the device and uses existing software exploits to gain oracle access to the AES engine. To extract the 256-bit UID key, a classical CPA is performed. With the UID key in hand, the adversary can now perform a brute-force attack on the passcode at a large scale, for example on several GPUs in parallel. Note that a passcode candidate is validated by checking whether unwrapping the class keys succeeds or fails, which is basically a format check on the decrypted data. Finally, the adversary can boot the system regularly, unlock the iPhone with the recovered passcode, and access all its personal data.
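To make the 80 ms figure concrete, the following Python snippet estimates the worst-case time of an exhaustive on-device search over numeric PINs. It only uses the two figures quoted above (80 ms per guess and 50,000 iterations on the iPhone 4), assumes every guess costs exactly 80 ms, and ignores the software-enforced delays between attempts, which, as noted, can be bypassed anyway. The function names are illustrative.

```python
# Figures quoted in the talk for the iPhone 4.
SECONDS_PER_GUESS = 0.080  # ~80 ms per passcode guess on the device
ITERATIONS = 50_000        # KDF iteration count on the iPhone 4
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def implied_seconds_per_iteration() -> float:
    """Average cost of one KDF iteration implied by the two figures above."""
    return SECONDS_PER_GUESS / ITERATIONS


def worst_case_seconds(digits: int) -> float:
    """Time to try every numeric passcode of the given length on the device,
    assuming a flat 80 ms per guess and no software-enforced delays."""
    return (10 ** digits) * SECONDS_PER_GUESS


if __name__ == "__main__":
    print(f"~{implied_seconds_per_iteration() * 1e6:.1f} us per KDF iteration")
    for d in (4, 6, 11):
        t = worst_case_seconds(d)
        print(f"{d:2d}-digit PIN: {t:,.0f} s = {t / 86400:,.2f} days = {t / SECONDS_PER_YEAR:,.1f} years")
```

For an 11-digit PIN this comes to 8 × 10^9 seconds, roughly 250 years, which is why moving the search off the device matters.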
Let's dive a bit more into the details of the attack, meaning how the previously described steps can be performed. Let's assume we have already downloaded the system keybag from the device and now want to talk to the AES engine by running code on the application processor. For this, we start the iPhone in DFU mode, which can be entered by pressing a certain button combination during the boot process. After entering DFU mode, we use the publicly known SHAtter exploit to disable signature checks in the boot ROM. Afterwards, we transfer a patched first-stage bootloader, in which signature checks are again disabled, before we transfer a second-stage bootloader, which enables us to jump to a defined address where we can put data over USB. At this address, we can then upload our measurement payload, which talks to the AES engine and sends chosen data. This fulfills the first requirement for performing our CPA to extract the secret UID key. Now we have our measurement payload uploaded to the device. As a first step, and because of the simplicity of the required setup, we decided to perform EM measurements. Hence, we next needed access to the application processor, and we needed a trigger that fires at a constant time distance to the encryption. For that purpose, we disassembled the iPhone to expose the CPU and used the volume-down button as an interface for the trigger: we removed the button and connected a wire to have a low-latency interface to a CPU GPIO, which we could then drive from the measurement payload. Finally, instead of using the battery, we supplied the phone with an external power supply. So what are the crucial functionalities we implemented in our measurement payload, which is executed in the second-stage bootloader? To sum up, we query encryption and decryption of chosen plaintexts or ciphertexts, and we choose a key slot for the AES.
There are actually three slots: a slot for the UID key, a slot for the GID key, and a custom key slot, through which we can give a chosen key to the AES engine. Finally, we drive the exposed GPIO pin high and low through memory-mapped I/O to trigger the oscilloscope. So we basically have everything we need to perform our first measurements for the CPA. So how is the targeted application processor structured? The iPhone 4 uses Apple's A4 SoC, which provides a 32-bit ARMv7-A core clocked at 800 MHz. Luckily, there is already pretty extensive work by Chipworks and iFixit on reverse-engineering the chip structure, which can be found by following the link under the picture shown. The important point here is that the RAM is located directly on top of the CPU. In the picture, the CPU is the centered rectangle, and the RAM dies are the two rectangles on top. So removing the packaging to directly access the chip surface is impossible, and the only way to go is to measure on top of, or in close proximity to, the packaged CPU. Of course, this way we do not expect to capture any local emanation of the engine itself, but rather emanations of the overall power distribution network. For collecting our EM traces, we used an RF near-field probe by Langer EMV, which is able to capture frequencies from 30 MHz to 3 GHz. We further amplified the signal with a 30 dB amplifier and finally captured the traces with a LeCroy WaveRunner oscilloscope with 2.5 GHz bandwidth, sampling at 40 GS/s. So this is the equipment used for our experiments. As a first step, we used this equipment to collect some EM traces while placing the probe close to the CPU. An overlay of 500 EM traces can be seen in the picture on the left-hand side, where we can directly identify a heavy misalignment of the traces, which complicates any further statistical analysis.
This strong misalignment was present although we had already established a low-latency interface to the CPU for triggering the oscilloscope, and our measurement code runs single-threaded on one core in bootloader context. This implies that the code running on the CPU core does not run synchronously with the I/O coprocessors or the I/O peripherals. As a result, we needed to align the captured traces. We observed that the misalignment often appears in groups, meaning there are several traces shifted by the same offset. As a first step, we unified these groups by coarsely aligning them so that the strongest peaks overlap. Afterwards, a more fine-grained alignment to a reference trace was done. Here, the offset was computed by means of the minimal Euclidean distance between two traces. Even after this fine-grained alignment, some outliers remained, which we needed to filter out, losing around 20% of the measured traces. The result after aligning and filtering can be seen on the right-hand side of the picture. Note that we did not expect any leakage during the strong peaks, so we adjusted the y-range to gain a better resolution at the points where we expected leakage. The next step was to identify a good power model to compute hypothetical power values for performing a CPA. As we can select the custom key slot as input to the AES engine, we could try different power models and test them for good correlation between hypothetical power values on the one side and actual power values on the other. We assumed the AES engine to be fast, and it should be possible to change keys frequently. As a consequence, we assumed a round-based architecture, which can be seen here, as it offers a good trade-off between throughput and area. Considering this architecture was an ad hoc choice, and once we chose it, we stuck with it throughout all our experiments.
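The two-stage alignment described above can be sketched in Python as follows. This is a simplified illustration, not the authors' code: it coarse-aligns each trace individually by its strongest peak (rather than group by group), then refines the offset within a small radius by minimizing the Euclidean distance to a reference trace over a fixed window, and finally discards traces whose best distance is still above a threshold. The function names, the search `radius` and the `max_dist` threshold are hypothetical choices; on real traces the threshold would have to be tuned.

```python
import math


def strongest_peak(trace):
    """Index of the sample with the largest absolute amplitude."""
    return max(range(len(trace)), key=lambda i: abs(trace[i]))


def fine_shift(trace, ref, center, radius, window):
    """Try shifts s in [center - radius, center + radius] and keep the one that
    minimizes the Euclidean distance between trace[i + s] and ref[i] for i in
    the reference window [start, stop). Shifts reading outside the trace are
    skipped; if no shift is valid, the distance is infinite."""
    start, stop = window
    best_s, best_d2 = center, math.inf
    for s in range(center - radius, center + radius + 1):
        if start + s < 0 or stop + s > len(trace):
            continue
        d2 = sum((trace[i + s] - ref[i]) ** 2 for i in range(start, stop))
        if d2 < best_d2:
            best_s, best_d2 = s, d2
    return best_s, math.sqrt(best_d2)


def align(traces, ref, window, radius=3, max_dist=1.0):
    """Coarse-align each trace by its strongest peak, refine the offset by
    minimal Euclidean distance, and drop outliers whose best distance still
    exceeds max_dist. Returns (aligned windows, shift of each kept trace)."""
    start, stop = window
    ref_peak = strongest_peak(ref)
    aligned, shifts = [], []
    for trace in traces:
        coarse = strongest_peak(trace) - ref_peak
        s, dist = fine_shift(trace, ref, coarse, radius, window)
        if dist > max_dist:
            continue  # outlier: no candidate shift matches the reference
        aligned.append([trace[i + s] for i in range(start, stop)])
        shifts.append(s)
    return aligned, shifts


if __name__ == "__main__":
    ref = [math.sin(i / 3.0) for i in range(100)]
    ref[50] = 10.0  # strong peak used for the coarse step

    def delay(sig, s):
        """Return sig delayed by s samples (advanced if s < 0), zero-padded."""
        return [0.0] * s + sig[:len(sig) - s] if s >= 0 else sig[-s:] + [0.0] * (-s)

    junk = [float((37 * i) % 11 - 5) for i in range(100)]  # matches nothing
    aligned, shifts = align([delay(ref, 4), junk, delay(ref, -3)], ref, window=(10, 90))
    print(shifts)  # [4, -3]
```

The junk trace is dropped, mirroring the outlier filtering that cost about 20% of the real traces.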
After trying different power models, we identified that the best-fitting power model was the Hamming distance between two consecutive 128-bit round states of the decryption, that is, the transition occurring at the depicted register. Having found a good power model for a fixed position of the probe, we wanted to try different positions and test which one works best under the given power model. For this, we performed a scan over the complete CPU by dividing the area into a 25 × 25 grid and collecting 150,000 EM traces at each grid point. The heat map of the maximum correlation over the grid can be seen in the right-hand picture. Here, we can identify clear peaks, and with this, we were able to identify the best position for performing our attacks. So now we get to the fun part, namely the actual attack. Of course, the 128-bit model is not suited for performing an actual attack, as the key guess would be far too complex. So we needed to further reduce the power model, and through further analysis we found that we could consider only the transitions of byte chunks of the 128-bit round-state register while retaining good enough correlation with the actual power values. This enabled us to guess and verify bytes of the last round key individually. As the A4 chip features an AES-256 engine, we have to repeat the attack on the previous round transition, meaning from round 13 to round 14: after recovering the last 128-bit round key, we use it to recover the previous 128-bit round key, again in a byte-wise divide-and-conquer fashion. Having two consecutive round keys, we can simply compute the main key by means of the key schedule. Here we can see the CPA results for exemplary key bytes of the last two 128-bit round keys when the AES runs with the UID key. Each figure shows the maximum correlation over the number of traces considered, for each of the 256 key candidates. The correct key candidate is marked in black.
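Computing the main key from two consecutive round keys works because the AES-256 key expansion (FIPS-197, Nk = 8) can be run backwards: each word satisfies w[i] = w[i-8] ⊕ T(w[i-1], i), hence w[i-8] = w[i] ⊕ T(w[i-1], i). Round keys 13 and 14 are the words w[52..59], which is enough to walk back to w[0..7], the 256-bit main key. The Python sketch below is an illustration under the assumption that round keys use the FIPS-197 byte order; it builds the S-box from its GF(2^8) definition to stay self-contained, and its function names are illustrative.

```python
def _gf_mul(a, b):
    """Multiply in GF(2^8) modulo the AES polynomial x^8 + x^4 + x^3 + x + 1."""
    p = 0
    while b:
        if b & 1:
            p ^= a
        a = (a << 1) ^ 0x11B if a & 0x80 else a << 1
        b >>= 1
    return p


def _make_sbox():
    """AES S-box: multiplicative inverse in GF(2^8) followed by the affine map."""
    sbox = []
    for x in range(256):
        inv = 0 if x == 0 else next(y for y in range(1, 256) if _gf_mul(x, y) == 1)
        s = inv
        for n in range(1, 5):
            s ^= ((inv << n) | (inv >> (8 - n))) & 0xFF
        sbox.append(s ^ 0x63)
    return sbox


SBOX = _make_sbox()
RCON = [0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40]  # enough for AES-256 (i // 8 = 1..7)


def _xor(a, b):
    return [x ^ y for x, y in zip(a, b)]


def _t(prev, i):
    """Value XORed into w[i-8] to obtain w[i] in the AES-256 expansion (Nk = 8)."""
    if i % 8 == 0:
        t = [SBOX[b] for b in prev[1:] + prev[:1]]  # SubWord(RotWord(prev))
        t[0] ^= RCON[i // 8 - 1]
        return t
    if i % 8 == 4:
        return [SBOX[b] for b in prev]  # SubWord(prev)
    return prev


def expand_key_256(key):
    """Forward key expansion: 32-byte key -> 60 four-byte words w[0..59]."""
    w = [list(key[4 * i:4 * i + 4]) for i in range(8)]
    for i in range(8, 60):
        w.append(_xor(w[i - 8], _t(w[i - 1], i)))
    return w


def round_key(w, r):
    """Round key r (0..14) as 16 bytes: words w[4r..4r+3]."""
    return bytes(b for word in w[4 * r:4 * r + 4] for b in word)


def invert_key_schedule_256(rk13, rk14):
    """Recover the 256-bit main key from round keys 13 and 14 (words w[52..59])."""
    w = [None] * 60
    last = rk13 + rk14
    for j in range(8):
        w[52 + j] = list(last[4 * j:4 * j + 4])
    for i in range(59, 7, -1):  # walk the recurrence backwards down to w[0]
        w[i - 8] = _xor(w[i], _t(w[i - 1], i))
    return bytes(b for word in w[:8] for b in word)


if __name__ == "__main__":
    key = bytes(range(32))
    w = expand_key_256(key)
    print(invert_key_schedule_256(round_key(w, 13), round_key(w, 14)) == key)  # True
```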
As the CPA results show, we need a high number of traces to recover these bytes: for some key bytes we need 30 million traces, and for others 270 million traces. Recovering the last two round keys, and hence the whole 256-bit UID key, was possible with 300 million traces in total. We also recovered the GID key, with similar results and a similar number of traces needed for recovery. We assume the reason for needing such a large number of traces is a bad signal, meaning a low signal amplitude and a high noise level, as we could not place the probe close to the CPU surface. Apple could also have implemented dedicated countermeasures against side-channel attacks here. As we needed a high number of traces for recovering the UID key from EM traces, and we anyway assumed we only captured emanations of the overall power distribution network and no local effects, we wondered whether a classical power measurement would work any better. Here we placed a 1 Ω shunt resistor in the VDD path of the core power supply of the A4 processor. The power supply for the CPU is generated on the PCB by a buck converter from the main power supply. We effectively cut the power supply open by removing the inductors of the buck converters, as can be seen in the top figure. We then removed the capacitor shown in the bottom picture, which we had previously identified as a smoothing capacitor, and powered the CPU at 1.35 V by connecting the three pads to an external power supply. In the same path, we inserted the shunt resistor, removed the DC shift by means of a DC blocker, and amplified the signals with an AC amplifier. The auxiliary board for that purpose can be seen in this figure. Now we were able to measure the AC voltage fluctuation over the CPU. Again, results for two exemplary round keys are shown in the right-hand figures. For recovering the complete key, we needed up to 200 million traces, which is somewhat more efficient than the EM measurements, but not drastically so.
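To illustrate the byte-wise divide-and-conquer CPA, and why a low signal amplitude combined with high noise inflates the number of traces, here is a simulated sketch in Python. It is deliberately simplified and is not the attack code: each simulated trace is a single sample whose value is the Hamming distance between a ciphertext byte and the register byte before the last round (obtained through the inverse S-box), plus Gaussian noise. ShiftRows is ignored, so both bytes are taken at the same position; the key byte, trace count and noise level are arbitrary illustration values.

```python
import math
import random


def _gf_mul(a, b):
    """Multiply in GF(2^8) modulo the AES polynomial x^8 + x^4 + x^3 + x + 1."""
    p = 0
    while b:
        if b & 1:
            p ^= a
        a = (a << 1) ^ 0x11B if a & 0x80 else a << 1
        b >>= 1
    return p


def _inverse_sbox():
    """Build the AES S-box (GF(2^8) inverse + affine map) and return its inverse."""
    inv_sbox = [0] * 256
    for x in range(256):
        inv = 0 if x == 0 else next(y for y in range(1, 256) if _gf_mul(x, y) == 1)
        s = inv
        for n in range(1, 5):
            s ^= ((inv << n) | (inv >> (8 - n))) & 0xFF
        inv_sbox[s ^ 0x63] = x
    return inv_sbox


INV_SBOX = _inverse_sbox()


def hamming_weight(x):
    return bin(x).count("1")


def hd_hypothesis(c, k):
    """Hypothetical leakage for key guess k: Hamming distance between ciphertext
    byte c and the register byte before the last round (ShiftRows ignored)."""
    return hamming_weight(c ^ INV_SBOX[c ^ k])


def simulate(key_byte, n_traces, noise_sigma, seed=1):
    """Single-sample traces: true HD leakage plus Gaussian noise."""
    rng = random.Random(seed)
    cts = [rng.randrange(256) for _ in range(n_traces)]
    leaks = [hd_hypothesis(c, key_byte) + rng.gauss(0.0, noise_sigma) for c in cts]
    return cts, leaks


def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def cpa_byte(cts, leaks):
    """Correlate each key guess's hypotheses with the traces; return the best
    guess and the absolute correlation of every guess."""
    scores = [abs(pearson([hd_hypothesis(c, g) for c in cts], leaks)) for g in range(256)]
    return max(range(256), key=scores.__getitem__), scores


if __name__ == "__main__":
    cts, leaks = simulate(key_byte=0x3C, n_traces=2000, noise_sigma=1.0)
    best, scores = cpa_byte(cts, leaks)
    print(hex(best))  # 0x3c
```

Raising `noise_sigma` lowers the correlation of the correct guess, so more traces are needed before it separates from the 255 wrong candidates; this is the effect the talk blames for the tens to hundreds of millions of traces on the real device.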
A problem with the power measurements was that in the captured raw traces we could not identify any communication or I/O peaks, so in this case we had to perform our analysis without any pre-processing. Now that we have extracted the UID key, we can perform the offline passcode recovery by means of a brute-force search. For this, we implemented the following steps. First, we parse the system keybag. Second, we generate and pre-process a batch of passcodes; the pre-processing takes the arbitrary-length passcode and runs PBKDF2 to compress it to a sequence of 32 bytes. Third, Apple's custom KDF is executed to derive a batch of keys, which are candidates for unwrapping the class keys from the system keybag. Finally, each key of the batch is validated by trying to unwrap the class keys from the system keybag following the standard RFC 3394. If unwrapping the class key works, we have found the passcode. We implemented the second and third steps to run in parallel on multiple GPUs, using a bitsliced implementation of AES for the third step. So what do we gain compared to an on-device brute-force search? As already mentioned, the significant advantage of our approach is that once we have extracted the UID key through a CPA, we can scale the search to multiple GPUs. On the left-hand side you can see our initial results on multiple NVIDIA GeForce RTX 2080 Ti cards. For example, for an 11-digit numeric PIN, a brute-force search would take several lifetimes on the device, whereas an offline brute-force search on 8 RTX 2080 Ti cards takes only roughly a month. Meanwhile, thanks to the hashcat team, the offline brute-force search on GPUs has been improved significantly: a 10-digit PIN can now be guessed in around 6 hours on 8 Radeon RX Vega 64 cards. hashcat even provided a cool demo of their plugin for offline brute-force search, which can be found under the given link.
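The four steps above can be sketched end-to-end in Python. Everything device-specific is a placeholder: a 4-round HMAC-SHA256 Feistel network stands in for AES (so the snippet needs only the standard library), the PBKDF2 hash, salt and iteration count are made up, keybag parsing is omitted, and `derive_candidate_key` is a hypothetical chained-encryption stand-in, not Apple's actual custom KDF. The key-wrap part follows the RFC 3394 wrap/unwrap structure with the block cipher swapped out; a wrong candidate key is rejected because the recovered integrity value does not equal the default IV A6A6A6A6A6A6A6A6.

```python
import hashlib
import hmac

DEFAULT_IV = bytes.fromhex("A6A6A6A6A6A6A6A6")  # RFC 3394 default initial value


def _xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


# --- Stand-in 128-bit block cipher (NOT AES): 4-round Feistel with HMAC-SHA256 rounds ---
def _round_fn(key: bytes, rnd: int, half: bytes) -> bytes:
    return hmac.new(key, bytes([rnd]) + half, hashlib.sha256).digest()[:8]


def toy_encrypt(key: bytes, block: bytes) -> bytes:
    left, right = block[:8], block[8:]
    for rnd in range(4):
        left, right = right, _xor(left, _round_fn(key, rnd, right))
    return left + right


def toy_decrypt(key: bytes, block: bytes) -> bytes:
    left, right = block[:8], block[8:]
    for rnd in reversed(range(4)):
        left, right = _xor(right, _round_fn(key, rnd, left)), left
    return left + right


# --- RFC 3394 wrap/unwrap structure, using the stand-in cipher ---
def wrap(kek: bytes, key_data: bytes) -> bytes:
    n = len(key_data) // 8
    a, r = DEFAULT_IV, [key_data[8 * i:8 * i + 8] for i in range(n)]
    for j in range(6):
        for i in range(n):
            b = toy_encrypt(kek, a + r[i])
            a, r[i] = _xor(b[:8], (n * j + i + 1).to_bytes(8, "big")), b[8:]
    return a + b"".join(r)


def unwrap(kek: bytes, wrapped: bytes):
    """Return the unwrapped key, or None if the integrity check fails."""
    n = len(wrapped) // 8 - 1
    a, r = wrapped[:8], [wrapped[8 * (i + 1):8 * (i + 2)] for i in range(n)]
    for j in reversed(range(6)):
        for i in reversed(range(n)):
            b = toy_decrypt(kek, _xor(a, (n * j + i + 1).to_bytes(8, "big")) + r[i])
            a, r[i] = b[:8], b[8:]
    return b"".join(r) if hmac.compare_digest(a, DEFAULT_IV) else None


def derive_candidate_key(uid: bytes, passcode: str, iterations: int) -> bytes:
    # Step 2: PBKDF2 compresses the passcode to 32 bytes (hash, salt, count are placeholders).
    pre = hashlib.pbkdf2_hmac("sha256", passcode.encode(), b"placeholder-salt", 1, dklen=32)
    # Step 3: hypothetical stand-in for the custom KDF: encrypt repeatedly under the UID key.
    blocks = [pre[:16], pre[16:]]
    for _ in range(iterations):
        blocks = [toy_encrypt(uid, blk) for blk in blocks]
    return b"".join(blocks)


def search(uid: bytes, wrapped_class_key: bytes, digits: int, iterations: int):
    """Step 4: try every numeric PIN; a successful unwrap identifies the passcode."""
    for guess in range(10 ** digits):
        pin = f"{guess:0{digits}d}"
        class_key = unwrap(derive_candidate_key(uid, pin, iterations), wrapped_class_key)
        if class_key is not None:
            return pin, class_key
    return None


if __name__ == "__main__":
    uid = bytes(range(32))  # pretend this is the UID key recovered via CPA
    wrapped = wrap(derive_candidate_key(uid, "0042", 10), bytes([0x5A] * 32))
    print(search(uid, wrapped, digits=4, iterations=10)[0])  # 0042
```

The loop is sequential here; the point of the attack is that, with the UID key known, this candidate loop can be split across as many GPUs as are available.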
Now, as the iPhone 4 is a rather outdated model, an interesting question is whether our considerations also apply to newer models. For the iPhone 4S, the iPhone 5 and the iPhone 5C, our attack can be tried without major modification: for gaining code execution and querying encryptions from the AES engine, we could simply use the checkm8 exploit instead of the SHAtter exploit. From the 5S onwards, the situation changes. Here Apple introduced the Secure Enclave Processor, the SEP, which has its own AES engine and UID key and which is used in the authentication process. So we would need code execution not only on the application processor but on the SEP as well. For the 5S, 6, 6S, 7, 8 and X, there already exist known vulnerabilities, and for the iPhone 7 there is even a publicly available tool to gain code execution on the SEP. We would like to mention that we have already started research on the iPhone 7 and will hopefully soon see whether the attack is also possible for this series. So, was it worth it? We showed that SCA can be performed on a compact architecture like an Apple iPhone. By extracting the UID key, we showed that the passcode search can be accelerated dramatically. The only drawback is that the considered iPhone 4 is a rather old model, but the research community can take our work as a proof of concept and as guidance for the analysis of more up-to-date series. Finally, we would like to thank Apple for their support and kind communication during our work. Thanks for following this talk, and if you have any questions, feel free to reach out any time. You can contact Oleksiy, Thorben, Amir or me. Thanks.