In this talk I'll be presenting our work entitled "Cutting Through the Complexity of Reverse Engineering Embedded Devices". This research is joint work by members of the Automotive Security Group at the University of Birmingham: myself, Sam Thomas, Jan, George, Zitai, Mihai and Flavio.

So let's begin. Suppose you're tasked with checking if some device contains vulnerabilities. Where would you start? Would you look at the hardware? Well, we could go from the hardware perspective and attempt to obtain the schematics, and try to identify different interfaces to interact with the device, for example JTAG or UART, or peripherals exposed over Wi-Fi or LTE. But in general, for an arbitrary device we won't always have access to schematic diagrams or a list of all possible interfaces. Furthermore, some of the interfaces that are available might need to be reactivated. So all of this requires quite a bit of manual work and intervention.

So how about going from another angle: the software, or the firmware? Well, in this case, the firmware of what? Typically, these devices have multiple components running different pieces of software or firmware. The baseband, for instance, might use a real-time operating system, while the interface exposed to the end users might run some variant of Android. And to complicate matters, the source code for the firmware will rarely be available, and it won't easily be downloadable from a vendor's website either. So in this case, to analyze the device, we'll need to reverse engineer some aspects of it before we can even answer the question: does it contain any vulnerabilities?

Reverse engineering is an iterative process, where we gradually infer facts about a particular device or system under analysis by observing how it responds to certain stimuli, or by drawing conclusions from reading the disassembly of its firmware. So in the first instance, where we interact with the device directly, we attempt to treat it as a black box: we supply some inputs and observe what we get as output. Of course, this can only take us so far, so in almost all cases we'll dig deeper and open the black box, so to speak. In this case, we'll end up having to reason about the firmware and some of its components. But then we can ask questions such as: what code is executed when I supply X as input? Or: I know input Y should cause the device to do Z, so how does it perform Z?

So as a general methodology, reverse engineering a device happens in three steps. First, we'll need to identify its interfaces: those that we can interact with externally, such as Wi-Fi, and those that offer us a means of looking at what's going on under the packaging, for example JTAG or trace ports. Next, we'll use one of those interfaces to obtain its firmware. Now, for some devices the firmware may be available online, but in reality we have no idea if the firmware running on the device and the one we can download from the vendor's website are indeed the same, so we'll have to dump or extract it to make sure. Finally, we'll analyze the device. This can be done statically, for example by loading the firmware into a tool such as IDA or Ghidra and then reading the disassembly, or by performing dynamic analysis, interacting with the device using hardware such as a USRP. But in general, we'll use a hybrid approach, since both have advantages and disadvantages. So great, we have a good methodology for performing our analysis. Unfortunately, things aren't so simple.
If we choose to go the static analysis route, the firmware itself will really just be another black box. To make matters worse, even industry-standard tools such as IDA Pro or Ghidra do not perform well out of the box when analyzing such firmware. Sometimes they miss functions or disassemble things using the wrong instruction set, for example when we have ARM and Thumb instructions in the same firmware.

On the other hand, when analyzing the firmware using a dynamic approach, many devices will have nice debug features removed. Although there are certainly plenty of devices that have some form of tracing hardware, those interfaces will generally be quite limited. For example, they might be quite coarse-grained and only log the addresses of basic blocks executed in temporal order. And this is a big problem when a single component of a device might execute many tasks concurrently, as their control flow will be interleaved, and so we'll have to have some way to make sense of the trace. These mechanisms can also be limited in other ways too. For example, we might only be able to obtain traces up to a certain size, or if the firmware executes a tight loop, we might observe that some trace packets get dropped. Finally, when we have such limitations, it's often attractive to only trace part of the firmware, but which part should we trace? It can also be the case that while we can restrict tracing to certain portions of the firmware, we can only define a limited number of exclusions and inclusions.

So in the face of all of these challenges, how can we analyze devices effectively? Well, that's the question we attempt to answer with this work. We propose a principled framework called Incision that attempts to add some order to the chaos of analyzing embedded device firmware by simplifying upfront reverse engineering tasks, tasks such as obtaining a relatively complete and correct idea of the control flow of the firmware and locating regions of the firmware that correspond to particular functionalities. Our framework is built around the open-source disassembler Ghidra, so that it neatly fits into standard reverse engineering workflows.

So how does Incision work? Well, at a high level, it treats reverse engineering as a feedback loop that combines static and dynamic approaches in tandem. The main idea is that instead of treating the construction of a knowledge base, or database, for our firmware and our reverse engineering goals as decoupled, we can simultaneously improve our knowledge representation of the firmware being analyzed and also identify areas of the firmware we want to analyze in more depth later. In the first step, working from a very basic database of the firmware, i.e. just the auto-analysis performed by our disassembler, we form a reverse engineering goal, which we encode using a domain-specific language. We call this a policy. This might be something along the lines of: the device deals with the LTE protocol, so where does it process layer X? Next, we use that policy as input to a process we call region inference, which essentially tries to match parts of the current firmware representation against our policy. It also contains a second stage, which groups disparate regions that successfully matched into larger functionality groups akin to compilation units. We use the regions that are identified to build trace configurations to perform tracing when our trace mechanism is limited in some way.
Next, we perform trace capture, and then we use the trace as input to a two-stage process of control flow extraction. This essentially identifies the different flows inside the temporally ordered trace of basic block addresses, outputting the regions corresponding to context switching logic and task code. Finally, we integrate the extracted control flow information from our trace back into our database. This process might produce some database conflicts due to incorrect disassembly configurations; however, we attempt to resolve those automatically. And now we arrive back at the beginning of the feedback loop, and decide if our improved knowledge database is good enough, or if we should continue and pose further queries to the process. If we choose to continue, then we update our reverse engineering policy and follow the process again.

So now, with a high-level idea of the workflow our framework enables, I'll describe each component in more detail, starting with the firmware database. The database, or knowledge base, is essentially just a Ghidra project. We assume this will be given as input to Incision, and will probably be created manually. The process of doing so is generally not too difficult, and really just relies on us loading the firmware, i.e. defining the basic memory mapping and layout in Ghidra, and then hitting the auto-analysis button.

As for the execution tracing, we assume that the trace mechanism will be hardware based, as we're operating on real devices. But we don't assume that it's perfect, so it might be lossy, i.e. sometimes events might be dropped. If that is the case, we assume that we can detect if and when that occurs. We also assume that the tracing mechanism might be limited in a number of ways. For example, it might be limited in terms of the number of ranges we're able to trace, or in terms of how much we can actually capture in one go. So in all, to trace the firmware we'll have a trace configuration as input, which we can view as a list of intervals of the form (start address, end address). And as output, we'll get a list of program counter values, ordered temporally, representing the basic blocks that got executed.

Next, I'll describe reverse engineering policies. As I mentioned earlier, a policy is just an encoding of our initial reverse engineering goals. We base them on the observation that many of these goals can be expressed using simple queries if we treat the firmware and its traces as a kind of knowledge base. And by simple we mean these queries are not about relationships; they're more about whether particular patterns exist. So concretely, a policy is a vector, for want of a better term, of indicators that point to a particular part of the firmware or trace matching some criteria. On the left, we have a policy for matching a cross-referenced logging string near a call to memcpy, and on the right, we have a policy matching a repeated load from a memory-mapped peripheral register, which we would identify within a trace. Within Incision, we encode policies using a simple Python-embedded DSL, which is exposed via Ghidra's scripting interface. We match policies by locating regions of firmware and parts of traces corresponding to indicators. Repeated indicators, such as our polling-loop example from a couple of slides ago, are matched using the database and traces, while other kinds of indicators are matched using just the database, such as references, labels and symbols.
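To make the idea of a policy a little more concrete, here is a rough sketch of what such a vector of indicators might look like as a Python-embedded DSL. To be clear, the class and field names below (Policy, StringRef, CallTo, RepeatedLoad), as well as the debug string and register address, are illustrative stand-ins I've made up for this explanation, not Incision's actual API.

```python
from dataclasses import dataclass, field
from typing import Any, List

# Illustrative indicator types: each points at a pattern we expect to find,
# either in the Ghidra database (strings, calls) or in a captured trace.

@dataclass
class StringRef:
    substring: str        # a referenced debug/logging string containing this text

@dataclass
class CallTo:
    symbol: str           # a call to a named (or previously recovered) function

@dataclass
class RepeatedLoad:
    address: int          # repeated loads from a memory-mapped register, matched in a trace

@dataclass
class Policy:
    name: str
    indicators: List[Any] = field(default_factory=list)

# A policy in the spirit of the talk's examples: a logging string referenced
# near a call to memcpy, plus a polling read of a peripheral register.
lte_layer_policy = Policy(
    name="lte-layer-processing",
    indicators=[
        StringRef("PDCP"),          # hypothetical debug string hinting at an LTE layer
        CallTo("memcpy"),           # copying of an inbound PDU
        RepeatedLoad(0x40001000),   # hypothetical peripheral register address
    ],
)
```

Region inference would then try to match each of these indicators against the current database and traces, as described above.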
We rank the strength of indicator matches by density; hence we consider firmware regions with many matches that are close together by address to match more strongly. Indicator matches are assigned the bounds of the functions that enclose them. We group matches if they occur within a small delta, i.e. if the bounds of the functions are quite close together. We base our approach to matching on how firmware is composed. That is, while a firmware blob contains everything from library code and task code to OS components, each of those components will be contained within its own region of the firmware. Hence, related functionality will have a natural grouping within the overall firmware blob. Further, the sub-components of each of those regions will also have a natural grouping by functionality, as the structure of the compilation units that make up each of those components will almost always be preserved. We can visualize it like so: the firmware blob is many compilation units concatenated together. Within each of those larger components, assuming the firmware developers followed reasonably good development practices, we'll also observe this grouping by functionality, and so each compilation unit will contain functions that perform related tasks. This is why we tie matched indicators to function boundaries.

So we'll often end up with many matches, and these will form the regions for our trace configuration. However, some devices will be limited in terms of the number of regions that can be traced at the same time. If we have more matched regions than regions we can trace, how do we form a good trace configuration? We rely on the observation that related functionality tends to have good spatial locality, and merge nearby regions. To perform this merging, we use an algorithm called agglomerative clustering. This works by iteratively merging regions that are closest together by some distance metric; in our case, we use the distance between the start and end addresses of each region.

So, after obtaining a trace, we now need to make sense of it. Recall that a trace is just a list of addresses corresponding to the executed blocks, ordered by the time at which they were executed. So we need to perform some kind of processing to extract the real control flow for each task and each piece of context switching logic. We do this by locating so-called boundary patterns, which we use as a heuristic to identify when a trace starts reporting control flow for a different task or thread of execution. So if we view this box as the whole trace, then what we actually have is many sub-traces concatenated together. To find the boundaries of these sub-traces, we start by following the execution of the whole trace using the disassembly from our database. When we observe a transition between two blocks that isn't explained by the disassembly, we consider that point a possible context switch, and then consider the following blocks as part of a new sub-trace.

Now, with these flows, we attempt to discern between context switching code and task code. We do this because context switching code tends to dominate our traces, and so by excluding its bounds from our trace configurations, we obtain better traces in the presence of limited tracing capabilities. We discern the two types of code using two methods. The first is based on how we enter interrupt handlers: we end up with a pattern such as this, where the flow on the left does not end in an indirect call, yet the next block executed corresponds to a function entry point.
So unless we observe an error in the trace, we must have transitioned to the right-hand block by non-standard control flow, i.e. a context switch. Similarly, when exiting such a routine, i.e. a context switching routine, we observe another pattern, where we move from the flow on the left to the one on the right by an indirect branch that doesn't land at the beginning of a function or basic block, i.e. more non-standard control flow. Unfortunately, these two heuristics are not enough to discern between task switching code and task code, since traces might contain dropped packets or have excluded regions. Therefore, we also attempt to find when such patterns are repeated across the entire trace, which tends to happen very frequently, and for that we try to locate repeated sub-flows within our traces. Since our traces tend to consist of hundreds of thousands of events, finding repeated patterns naively is infeasible. Fortunately, the number of unique events in a trace is usually smaller than the trace itself by many orders of magnitude. And due to the simplicity of those events, i.e. that they're just basic block addresses, we can encode our traces as strings, which can be analysed using very fast and efficient data structures, namely suffix arrays and longest common prefix arrays. We use the longest common prefix arrays to find sub-flows in our trace that share a common prefix. Similarly, by inverting our trace and then computing the same data structures, we can find repeated suffixes. (A rough sketch of this idea appears at the end of this part.)

So, after separating out the different kinds of flows in our trace, we finally rejoin the separated sub-traces into larger task-level traces. Now, assuming a few trace packets are missing, we attempt to stitch the task-level control flow together by following the corresponding disassembly in our database. So, for example, if one sub-trace ends on address A and the next starts on A plus 1, we would stitch the two together. There are, of course, some edge cases, for example when one trace ends on a call instruction and the next resumes on a function start, which we need to somehow resolve. Luckily, such edge cases can easily be worked out.

So finally, we end up with a collection of control flow graphs that we can integrate back into our firmware database, which we do using a process called feedback-driven refinement. This process essentially completes the feedback loop, integrating our dynamic trace information and our static views of the firmware. It works by adding any new control flow edges discovered from our trace into the database, and is followed by an auto-analysis pass to build cross-references and discover further functionality reachable via the newly added edges. This process might trigger some database conflicts due to incorrect disassembly and so on, but we can mostly work around those automatically, and if not, flag them as needing manual intervention, which, as we'll show later, is rare, and when it is required, is easy to resolve. Finally, we close the feedback loop by selecting a new policy based on our updated database, or stop and analyze the firmware in more depth using other methods.

Lastly, I'll talk about how we evaluate Incision. Well, we use three metrics for our evaluation: real-world usability, correctness, and human effort. We assess each component of Incision separately and also use it to perform some real-world reverse engineering tasks.
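Before getting into the details of the evaluation, here is a rough sketch of the repeated sub-flow search described earlier. It encodes a toy trace over its unique block addresses, builds a suffix array naively (fine for illustration, but a real implementation over hundreds of thousands of events would use an O(n log n) or linear-time construction), and uses the longest common prefix of adjacent suffixes to report the longest repeated sub-flow. The addresses in the toy trace are made up.

```python
def longest_repeated_subflow(trace):
    """Find the longest repeated run of basic-block addresses in a trace.
    Illustrative only: builds the suffix array naively, whereas a real
    implementation would use a much faster construction."""
    n = len(trace)
    # Map each unique block address to a small symbol, as described in the talk.
    symbols = {addr: i for i, addr in enumerate(dict.fromkeys(trace))}
    encoded = [symbols[a] for a in trace]

    suffixes = sorted(range(n), key=lambda i: encoded[i:])   # naive suffix array

    def lcp(i, j):
        # Longest common prefix of the suffixes starting at i and j.
        k = 0
        while i + k < n and j + k < n and encoded[i + k] == encoded[j + k]:
            k += 1
        return k

    best_len, best_at = 0, 0
    for a, b in zip(suffixes, suffixes[1:]):   # adjacent suffixes share the longest prefixes
        l = lcp(a, b)
        if l > best_len:
            best_len, best_at = l, a
    return trace[best_at:best_at + best_len]

# Toy trace: the sub-flow [0x100, 0x104, 0x1f0] (e.g. scheduler code) repeats.
trace = [0x100, 0x104, 0x1f0, 0x200, 0x204, 0x100, 0x104, 0x1f0, 0x300]
print([hex(a) for a in longest_repeated_subflow(trace)])   # ['0x100', '0x104', '0x1f0']
```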
The firmware dataset we use is composed of 10 firmware images built on top of two open-source real-time operating systems, FreeRTOS and Zephyr, and two end-user firmwares, which we extracted from real devices: a vehicle body control module (BCM), which is a bare-metal firmware based on the V850ES architecture, and a Huawei R216h Wi-Fi hotspot. For the latter, we analyze the baseband component of the device; it runs VxWorks, which is a real-time operating system, and is based on the ARM architecture.

The first experiments we'll look at cover correctness. In our first experiment, we evaluate region identification and grouping. To do so, we use the baseband firmware, since it's the most complex in our dataset and its trace mechanism is less than ideal, and therefore a good test of Incision. The mechanism is limited in a number of ways: for example, we can only capture traces up to one megabyte in size, it drops event packets, and the number of regions we can restrict our trace to is limited. Full details of the tracing setup can be found in the appendix of our paper. We obtain a ground truth for this experiment by manually reverse engineering the firmware with the aid of embedded debug strings and some leaked symbols we found online. We identified regions of the firmware that process different LTE layers. We then formed policies we expected to match each of those regions, and used them as input to Incision's region identification and grouping algorithms. We used the area of overlap between our ground truth and the regions identified by Incision to judge its correctness. For the policies tested, we found that the regions identified by Incision matched well with those we identified manually: their overlap with our ground truth was very high and exceeded 90% in all but one case.

In this next experiment, we evaluate our control flow extraction algorithms. For this, we constructed 10 firmware images based on FreeRTOS and Zephyr. We used open-source firmware so we could easily test different system configurations, such as task switching frequency and the number of tasks running concurrently. This allowed us to simulate various potential worst-case conditions. We also attempted to simulate a worst-case scenario by including some task code that repeatedly performs indirect calls, in an effort to mimic task switch behavior and therefore induce false positives. To capture traces, we emulated each firmware using QEMU and recorded a basic block trace. Since these traces were effectively unbounded, we sampled 5 non-overlapping sub-traces of 50,000 block addresses for each firmware. To evaluate our algorithms, we established a ground truth by identifying the regions of each firmware that perform task switching and interrupt handling, and those that correspond to task-level code. For each firmware configuration, we assessed if Incision was able to discern between the task code and context switching code, and if it did so correctly. As you can see from our experiments, we found that for firmware based on FreeRTOS, Incision worked flawlessly. However, for the firmware based on Zephyr, while in most cases it was able to identify the regions of code performing interrupt handling and task switching and to discern them from task code, we found that where indirect calls outnumbered task switches, Incision failed to identify those task switches correctly. We believe that a real trace exhibiting such behavior is likely to be rare. In the next two experiments, we evaluate the utility of our feedback-driven refinement method.
This is the mechanism we use to close our reverse engineering feedback loop. We use both real-world firmwares for this task and capture a number of traces for each. We use those traces to measure the degree to which a baseline database is improved by our feedback mechanism. To do so, we measure the number of correctly identified function starts after integrating the control flow edges extracted from each trace into each firmware's database. We establish correctness once again by manual analysis. We also break down the newly identified function starts by the mechanism through which they were discovered, in order to measure the effect of performing a further static analysis pass on the database after integrating the newly discovered control flow edges. As we can see, for both firmwares, integrating control flow information extracted from traces increases the number of correctly identified function starts. For the BCM, the increases do not change after triggering an additional analysis pass, while for the baseband, most of the improvements can be attributed to this step.

In addition to evaluating the effectiveness of our feedback mechanism, we also analyze the database conflicts and errors it can induce. To do so, we use the same firmware and traces as in the last experiment, and count the number of errors detected when merging the extracted control flow into each firmware's database. We provide a breakdown of these errors: those that our technique can address automatically, and whether those fixes are correct, and those that require manual intervention, which we assess in a separate experiment. For the traces from the BCM, we encountered no errors. For the baseband, we found 53 errors in the first trace of 88,000 blocks and 188 in the second trace of 200,000 blocks. Of those, we found that for both, over half could be fixed automatically, and of those fixed, only one was reported as incorrect. In fact, after some investigation, we found that Ghidra actually caused this incorrectness due to its non-returning function detection pass, which it performs by default as part of its auto-analysis.

In this next set of experiments, we evaluate Incision's usability on real-world reverse engineering tasks. For the first task, we attempt to emulate our BCM firmware from reset. Since this firmware relies on a number of peripherals, and our emulator does not provide support for them, our objective is to discover satisfying values for checks against peripheral registers that would otherwise cause the firmware to stall in an infinite polling loop. We use Incision to aid us in identifying the location of these checks, and to measure its effectiveness we record the number of basic blocks we're able to execute following identification of each check, and report the improvements to the firmware's database due to Incision. We perform this experiment iteratively. We attempt to capture all control flow in our traces, and halt capture once our trace buffer becomes full, for example when it becomes saturated by the presence of a polling loop. We use Incision's region identification and policies to identify the location of the check in our firmware database, manually identify the values required to exit each polling loop, and then provide these values in the corresponding register during the next iteration of our analysis. The following graph visualizes the number of unique blocks traced after each of those iterations.
As we can see, we identify a number of stall states and are able to use Incision to bypass them all, thus showing Incision's effectiveness when aiding a common reverse engineering task.

In this next experiment, we use Incision to aid in identifying how and where cryptographic keys are stored and used within our baseband firmware. In addition to traces, we also use RAM dumps to perform this analysis. We use Incision as a means of improving our firmware database to make the search for keys, and the functions that use them, easier. This is based on the fact that a better database will contain higher-quality code and data cross-references, as well as more functions. Since our objective is to identify where keys are stored and used, we use Incision to help locate references to high-entropy buffers that are referenced from regions of firmware handling the LTE protocol (a rough sketch of such an entropy scan appears at the end of this part). To evaluate Incision's effectiveness, we measure the reduction in the number of buffers to analyze due to the database refinements it performs. We also manually confirm that some of those buffers do indeed correspond to cryptographic keys. To perform our task, we first take two traces using general policies to help improve the initial database. We then identify the regions of firmware corresponding to task switching code, and use those to form trace configurations that exclude the bounds of those regions in subsequent traces. We then take a number of additional traces using policies targeted at capturing traces of different LTE layers. We find that overall, the resulting database and the identified functionality regions we obtain allow us to reduce the number of potential key buffers to analyze substantially, from thousands to tens. As a result, we were able to correctly identify a number of LTE session keys and the routines that process them, again demonstrating Incision's ability to reduce the effort of performing otherwise tedious reverse engineering tasks.

In this last experiment, we assess the human effort of using Incision. We recognize that performing such an experiment objectively is difficult, hence we rely on a metric we can concretely measure: the number of tasks Incision offloads to its operator and the complexity of those tasks. We do this by measuring the manual intervention required to rectify database errors and conflicts arising from our feedback-driven refinement algorithm. We rank the tasks by their complexity: those that require no effort, those that require low effort, and those that we consider high effort. We classify low-effort tasks as simple changes to our firmware database: the first kind requires adding a new function start, and the second requires re-creating a function start due to, perhaps, a disassembly error or an error induced by an auto-analysis task. We use the traces from our baseband for this task, and recall that most of the errors in those could be addressed automatically. For those that could not, we fixed them manually and then categorized the effort required using the classification outlined on the last slide. Overall, we found that all errors that could not be rectified automatically fell into the low-effort category, and of those, the most common, by a very small margin, were due to us having to create a new function start. None of the errors were high effort.
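Before wrapping up, as an aside on the key-finding task: one standard way to flag high-entropy buffers in a RAM dump is a sliding-window entropy scan, sketched roughly below. The window size, stride, and threshold are arbitrary illustrative choices, not necessarily the parameters we used, and the toy RAM contents are made up.

```python
import math
import os

def shannon_entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    counts = [0] * 256
    for b in data:
        counts[b] += 1
    total = len(data)
    return -sum((c / total) * math.log2(c / total) for c in counts if c)

def candidate_key_buffers(ram: bytes, base_addr: int,
                          window: int = 16, stride: int = 16,
                          threshold: float = 3.5):
    """Slide a window over a RAM dump and flag high-entropy regions as
    potential key material. Window, stride and threshold are illustrative."""
    hits = []
    for off in range(0, len(ram) - window + 1, stride):
        chunk = ram[off:off + window]
        if shannon_entropy(chunk) >= threshold:
            hits.append((base_addr + off, chunk))
    return hits

# Toy example: mostly zeroed RAM with one random-looking 16-byte buffer.
ram = bytes(64) + os.urandom(16) + bytes(64)
for addr, chunk in candidate_key_buffers(ram, base_addr=0x08000000):
    print(hex(addr), chunk.hex())
```

In our actual analysis the candidates flagged this way are then filtered further by whether they are referenced from the LTE-handling regions of the refined database, as described above.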
So in conclusion: the errors we found were low effort, and even though our traces consisted of hundreds of thousands of events, the errors that needed manual attention occurred very infrequently, indicating that the manual overheads induced by Incision are minimal.

So, to conclude this talk: we presented Incision, a framework for reverse engineering embedded device firmware. We evaluated our framework and its components based on three criteria, correctness, usability and human effort, and also showed that it's effective when applied to real-world reverse engineering tasks by analyzing two end-user devices. Thank you for watching our talk.