So, the next talk: Benjamin Kollenda and Philipp Koppe. They will refresh our memories, because they already gave a talk at 34C3, where they talked about the microcode ROM, and today they're going to give us more insights on how microcode works and more details on the ROM itself. Benjamin is a PhD student with a focus on software attacks and defenses, and together with Philipp, they will now abuse AMD microcode for fun and security. Please enjoy.

Thank you. So as mentioned, we were able to reverse engineer the AMD microcode and the AMD microcode ROM, and I'm going to talk about our journey, what we learned on the way, and how we did it. This is joint work with my colleagues at Ruhr University Bochum. A quick outline of how we're going to do this: we're going to start with a quick crash course on microarchitectural basics and what microcode actually is. Then I'll talk about how we reconstructed the microcode ROM and what we learned along the way. Then I'll quickly give some examples of the applications we implemented with the knowledge we gained from the second step. And lastly, I'll talk about the framework we used, how it works, and what we can do with it. This framework is also available on GitHub, along with some other tools, so you're free to continue our work.

Okay, so when I'm talking about microcode, you can think of it essentially as firmware for your processor. It serves multiple purposes. For example, you can use it to fix CPU bugs that exist in silicon and that you want to fix late in the design phase. Microcode is used for instruction decoding. I'll cover this one a bit more. It is also used for exception handling. For example, if an exception or an interrupt is raised, microcode has the first chance of modifying this interrupt, ignoring it, or just passing it along to the operating system. It's also used for power management and some other complex features like Intel SGX. And most importantly for us, microcode is updateable.
This is used to patch errors in the field. You might remember the Spectre and Meltdown patches; those were microcode updates.

So your x86 CPU takes multiple steps to execute an instruction. The first step is decoding the x86 instruction into multiple smaller micro-ops. These are then scheduled into the pipeline and dispatched to the different functional units, like your ALU, AGU, and multiplication and division units. For our purposes, the decode step is the most interesting one. In the decode step, you have an instruction buffer that feeds instructions to certain decoders. You have short decoders that handle really simple instructions. The long decoders can handle some more advanced instructions. And finally, there is the vector decoder. The vector decoder handles the most complex instructions with the help of microcode. So the microcode engine is essentially the vector decoder.

The microcode engine, in essence, consists of a microcode ROM that stores the instructions for the microcode engine; think of these as your standard instructions. Then there's also a writeable memory, the microcode RAM. This is where the microcode updates end up when you apply a microcode update. And of course, around the storage, there's a whole lot of things that make it actually run. For this talk, you only need to know about the match registers. Match registers are essentially breakpoint registers. So if you write an address from inside the microcode ROM into a match register, then whenever this address is fetched, execution control is transferred to the microcode RAM, so our patch gets executed.

The microcode updates are usually loaded by the BIOS or by the kernel. Linux has an update driver. Sometimes the BIOS updates it as well. They have a pretty simple structure: a partially documented header, followed by the actual microcode that is loaded into the CPU. And the microcode is organized in something called triads.
Each triad has three operations, essentially x86 instructions, but with some differences. And lastly, you have a sequence word. The sequence word indicates which microcode instruction should be executed next. We have the options of executing just the next triad, executing another one by branching to it, or just saying, OK, I'm done with decoding this instruction, continue with x86 code.

These updates are protected by some weak authentication, which we were able to break. So we can create our own, we can analyze existing ones, and we can apply these to your standard laptop and desktop. However, there can only ever be one update loaded at a time. And when you reboot your machine, this update will be gone.

Also, for the talk, we are going to look at microcode, and we represent this microcode using a register transfer language. It is heavily based on x86, so I'm just going to cover the differences between the two. Most importantly, microcode can have three operands for an instruction, in comparison to x86, which usually only has two. So you can specify a destination and two source operands. Also, microcode has certain bit flags that need to be set, and we express these with annotations. For example, .c means this instruction also updates the carry flag based on the result. Then you have the instruction JCC, which is a conditional branch. The first operand denotes the condition upon which this branch is taken, in this case, branch if the carry flag is one, and the second operand indicates the offset to add to the instruction pointer. Then we also have some sequence word annotations: next, complete and branch. It should also be noted that the internal microcode architecture is a load-store architecture. You can't use memory operands in other instructions like you can on x86. You always need to load and store memory explicitly.

Now I'm going to talk about how we managed to recover the microcode ROM. The microcode ROM is baked into your CPU.
You can't change it anymore. It is defined in the silicon during the fabrication process. In this picture, you can see a die shot taken with an electron microscope. This is one of three regions that contain the bits for your microcode operations. If you zoom in a bit more, each of these regions consists of four arrays, and these are further subdivided into blocks. Really interesting is array two, which is a bit smaller than the other ones. But it has some structures above it, which have a different visual layout. This is SRAM, which stores the microcode update. So this is runtime-reprogrammable memory that is still pretty fast. So the microcode RAM is located right next to the microcode ROM, which also makes sense from a design standpoint.

Just an overview of how we went about it. We started with pictures and then used an OCR-like process to transform them into bit strings, which we could then process further. The bit strings were arranged into triads. We could already guess that we got the individual triads right, because within a triad there were data dependencies all over the place. But between triads, there were no or very few data dependencies, so the ordering of the triads was still wrong. And this was the major part that we had to reverse engineer: mapping a physical address of a triad, which we gathered from the ROM readout, to the virtual address that is used inside the microcode update or the microcode ROM. Once you have reverse engineered this, you can just do a linear sweep disassembly of the microcode ROM and arrive at human-readable output.

But this recovery was a bit tricky, because we required pairs of physical and virtual addresses. Gathering these is a bit harder: we went through the existing updates, but we could only find two such pairs. These pairs were actually easy to find, because every update replaces a certain triad inside your microcode ROM. And this triad is usually also placed in the microcode update.
So by matching the triad at the address this update replaces with the microcode ROM readout, you can just get your two data points. But we had to get more data points. So we generated these mappings by matching the semantics of triads in the microcode ROM readout with the semantics we observe when we force execution of a certain microcode address.

For gathering the semantics of the read-out microcode, we implemented a simple microcode simulator. Essentially, it works on the triad level: you give it an input state and a triad, and it calculates the output state. Input and output state are comprised of the x86 state, which is your standard registers, and also the internal microcode registers. There are multiple temporary registers that get reset for every new x86 instruction that is executed. But they can also be modified by microcode, of course. Our emulator supports all known arithmetic operations, and we have a whitelist of operations that do not produce any observable change in state, just so that we could process more triads and gather more data points. In total we gathered 54 additional data points, which turned out to be enough to recover the whole mapping.

This mapping essentially works like this: you have the four different arrays that map to individual blocks, the blocks in these arrays are then again permuted a bit, and then the triads inside these blocks have some table-based permutations. This is not an obfuscation. It is just that, from a hardware design standpoint, it can make sense to route it a bit differently.

Also, now that we can actually map a certain address to the microcode ROM readout, and since we know the addresses of different x86 instructions from our earlier experiments, we can look at the implementation of instructions. So let's start with a pretty simple one: SHRD, shift right double, which essentially takes a register, shifts it by a given amount, and shifts in bits from another register.
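As a point of reference for the walkthrough of the microcode, the architectural behaviour of a 32-bit SHRD can be sketched in a few lines of Python. This is only a model of what the x86 instruction is documented to do for counts of 1 to 31, not a transcription of the microcode shown in the talk; the function name `shrd32` is made up for illustration.

```python
MASK32 = 0xFFFFFFFF

def shrd32(dest, src, count):
    """Model of 32-bit SHRD: shift dest right, fill from the low bits of src.

    Returns (result, carry). carry is None when the masked count is 0,
    because the instruction then leaves destination and flags unchanged.
    """
    count &= 0x1F                      # the CPU masks the count to 5 bits
    if count == 0:
        return dest & MASK32, None
    # Low bits of src enter at the top of the result.
    result = ((dest >> count) | (src << (32 - count))) & MASK32
    # The carry flag receives the last bit shifted out of dest.
    carry = (dest >> (count - 1)) & 1
    return result, carry

result, carry = shrd32(0x12345678, 0xABCDEF01, 8)
print(hex(result), carry)
```

The two shifts in the sketch correspond to the kind of shift operations one would expect to find in the implementation.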
So of course you would expect a lot of shifts and rotates in its implementation, and this is exactly what we're seeing here. You have two shift-right operations, and you can see regmd6 and regmd4. These are placeholders: the microcode engine can replace certain bit combinations with the registers that are used in the x86 instruction. For example, this one would be replaced by ECX or EAX, depending on what you wrote in x86.

At this point we can also already gather more information about microcode than we previously knew, because we know, okay, this is a source, this is also a source, and this is a destination. But this source, which indicates the shift amount, was previously unknown, because it is a high temporary microcode register, and we found out that these usually serve a specific purpose. If you write to them, sometimes the CPU behaves erratically, sometimes it crashes, sometimes nothing happens. But in this case this seems to be the shift count, and the shift count is given by the third operand of the instruction. So in this case we already learned, okay, if we want to read the third operand of an instruction, we need to read t41. And this is how we went about recovering more information about microcode. The rest of the implementation is essentially concerned with implementing the rest of the semantics of the x86 instruction and updating the flags correctly.

Okay, so now let's look at an instruction that is a bit more complicated. Let's check out RDTSC. RDTSC returns the internal cycle counter in EDX and EAX: the upper part ends up in EDX, the lower part in EAX. So in the end we want to see writes to these registers, potentially with a shift somewhere in there. But first the CPU needs to fetch the cycle counter from somewhere. So in the beginning we have two load-style operations. This one is a proper load, which we identified, and this one is unknown. But despite the fact that we do not know the instruction, we know the target.
The result of this instruction will end up in t9, and the result of this instruction will end up in t10. So we can follow the uses of these two registers. For simplicity I'm going to start with t10. For t10, we later found out that this operand essentially denotes a specific internal register. And if you play around with these bits, you notice that this bit combination encodes CR4, the x86 register CR4. You can also address CR1 and CR2. If you look further, t10 is then ANDed with this bit mask. And if you look in the manual, you find out that this bit in CR4 is the bit that determines whether RDTSC is available from user space or not. So this is the check whether this instruction should be executed.

So now let's just keep in mind that t9 holds some other value loaded from some other internal register; we will come back to this one a bit later. For now let's follow execution. This triad is essentially a padding triad. This is a common pattern we see. So let's look at where this branch takes us. This branch takes us to a conditional branch triad. And if you look a bit further up, this AND instruction actually updated this flag. So this is a conditional branch that determines whether this check was successful or not. So it branches towards the error triad or to the success triad.

But here we already see the exit. We see a write to RDX, or EDX in this case, with a shift of t9 by 32 bits, which is exactly what you would expect in order to write the upper 32 bits of the timestamp counter to EDX. And you have an unknown instruction, but we know, okay, we move something from t9 to EAX, which is the lower 32 bits.

But we are not done here, because we can still look at the error path that is taken if the access is denied. So if you scroll a bit down, we can see a move of an immediate into a certain internal register.
And this immediate actually encodes a general protection fault; it is an interrupt code that tells the exception handler that this is a general protection fault. Later, this triad branches to this address, and if you look at the uses of this address, we can find other immediates that also correspond to x86 exceptions. So now we have learned how we can actually raise our own interrupts. We just need to load the code we want into this specific register and branch to this address.

Now we have learned a lot about how we can actually write microcode, but it's also interesting to see how certain instructions are implemented. So let's look at a pretty complicated one, WRMSR. WRMSR essentially writes some data it is given to a model-specific register. These model-specific registers differ between CPUs, between vendors, sometimes between revisions, and they implement non-standard extensions or pretty complex features. For example, you trigger a microcode update by writing to a model-specific register.

The address you want to write to is given in ECX. And now we can see ECX is read, and it is shifted by 16 bits into t10. So again we follow the uses of t10, and we see it is XORed with this bitmask. This bitmask is C000, which actually denotes a namespace of the model-specific registers. In this case this should be an AMD-specific namespace. And of course this one again sets some flags, and you can see a conditional branch depending on these flags to what should be the handler for this namespace. Next we have another XOR that uses a different bitmask, in this case C001. C001 is the namespace where the microcode update routine is actually located. So again we branch to this handler. And if we just continue on, there are more such operations followed by more branches, and this continues until everything is dispatched to the correct handler.
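The dispatch pattern just described (shift ECX right by 16, XOR against a namespace constant, branch if the result is zero) can be sketched in Python as follows. The handler labels are hypothetical names invented for this sketch, and the table only lists the two namespaces mentioned in the talk; the real microcode continues with more comparisons.

```python
# Upper 16 bits of the MSR address select a namespace. The handler labels
# are placeholders for this sketch, not names recovered from the ROM.
NAMESPACE_HANDLERS = [
    (0xC000, "handler_c000"),
    (0xC001, "handler_c001"),   # contains the microcode update MSR
]

def dispatch_wrmsr(ecx):
    """Pick a handler for WRMSR based on the MSR address in ECX."""
    t10 = (ecx & 0xFFFFFFFF) >> 16          # shift ECX by 16 bits into t10
    for mask, handler in NAMESPACE_HANDLERS:
        # An XOR result of zero means the prefix matched; in microcode
        # this would set the zero flag and a conditional branch follows.
        if t10 ^ mask == 0:
            return handler
    return "handler_other"                   # remaining comparisons omitted

# 0xC0010020 is the AMD microcode patch loader MSR.
print(dispatch_wrmsr(0xC0010020))
```

XOR followed by a zero-flag branch is simply an equality test, which is why the same two-instruction pattern repeats once per namespace.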
And this is how WRMSR is implemented internally, and RDMSR is going to be implemented pretty similarly, because it does a similar kind of thing.

Okay, so now I have shown you how we actually went about reconstructing the knowledge we currently have. And now I'm going to show you what we can actually do with it. For this I'm going to quickly cover what applications we wrote in microcode. We wrote a simple configurable RDTSC precision. This means a certain bitmask is ANDed with the result of RDTSC, so you can reduce its accuracy, which can sometimes prevent timing attacks. We also implemented a microcode-assisted address sanitizer, which I'll cover in a second. We also have some basic microcode instruction set randomization, and some microcode-assisted instrumentation. What this means is you can write a filter for your instrumentation in microcode itself. So instead of hooking an instruction, instead of debugging your code or emulating it, you can just say: whenever this instruction is executed, filter whether this is relevant for me, and if it is, call my x86 handler, entirely in microcode, without changing the instruction in RAM. We also implemented some basic authenticated microcode updates. The usual update mechanism is weak; that's how we got our foot in the door in the first place, so we improved upon it a bit.

We also found out that microcode actually has some enclave-like features, because once we're executing in microcode, your kernel can't interrupt you, your hypervisor can't interrupt you, and any state you want visible to the outside world you actually need to write explicitly. All these microcode-internal registers are not accessible from the outside world, so any computation you perform in microcode cannot be interfered with, and you can implement a simple enclave on top of this.
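Of the applications listed above, the configurable RDTSC precision is the easiest to illustrate. The Python sketch below models the behaviour described in the talk: check the CR4 bit that restricts RDTSC to privileged code, AND a mask onto the 64-bit counter, and split the result into EDX and EAX. It is a behavioural model only; the function and parameter names are assumptions made for this example, and the mask value is arbitrary.

```python
CR4_TSD = 1 << 2          # CR4.TSD: when set, RDTSC is restricted to ring 0
MASK64 = 0xFFFFFFFFFFFFFFFF

class GeneralProtectionFault(Exception):
    """Stands in for the #GP raised on the error path."""

def rdtsc(tsc, cr4, cpl, precision_mask=MASK64):
    """Return (edx, eax) for a reduced-precision RDTSC.

    tsc            -- current 64-bit cycle counter
    cr4, cpl       -- control register value and current privilege level
    precision_mask -- bits that survive; clearing low bits lowers precision
    """
    if (cr4 & CR4_TSD) and cpl != 0:
        raise GeneralProtectionFault("RDTSC disabled for user space")
    value = tsc & precision_mask
    edx = (value >> 32) & 0xFFFFFFFF   # upper half via a shift by 32
    eax = value & 0xFFFFFFFF           # lower half
    return edx, eax

# Drop the lowest 8 bits of the counter.
edx, eax = rdtsc(0x123456789ABCDEF0, cr4=0, cpl=3,
                 precision_mask=MASK64 & ~0xFF)
print(hex(edx), hex(eax))
```

Clearing the low bits means two reads that fall within the same 256-cycle window return the same value, which is what makes fine-grained timing measurements harder.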
So our hardware-assisted address sanitizer variant is based on the work by the original authors. Address sanitizer is a software instrumentation that detects invalid memory accesses by using a shadow map, a shadow memory that just says which memory is valid to be read and written. The authors propose hardware-assisted address sanitizer, which essentially does the same checks but uses a new instruction, and that instruction should raise a fault if an invalid access is detected. These are the algorithms they proposed. The details are not important; what is important is that in essence it's pretty simple. You load from a certain address, perform some operations on the value, and if the check on the shadow value fails after these operations, you just report the bug.

The advantages of hardware-assisted address sanitizer are, for example, that you get better performance out of it, because you only have a single instruction and maybe you can do some fancy tricks inside your CPU that are faster than using x86 instructions. You get more compact code, and you have the possibility of runtime configuration, which is a bit hard with software address sanitizer.

We implemented our hardware address sanitizer variant by replacing the BOUND instruction. BOUND is an old instruction that is no longer used by compilers, because it is in fact slower to use BOUND than to perform the checks with multiple x86 instructions. We changed the interface: the first argument is the register which holds the address you want to access, and the second argument holds the size you want this access to be, so one byte, two bytes, and so on. This instruction is a no-op if the check succeeds, so if there's no bug it just continues on like nothing happened.
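To make the shadow-memory check concrete, here is a minimal Python sketch of the classic software address sanitizer check for accesses of up to 8 bytes that stay within one 8-byte granule. It is not the microcode the speakers wrote and not necessarily the exact algorithm from the slides; the shadow map is modelled as a plain dictionary from granule index to a signed shadow value, and the function name is made up.

```python
def is_invalid_access(shadow, addr, size):
    """Classic ASan-style check for one access of `size` bytes at `addr`.

    shadow maps (addr >> 3) to a signed value:
      0       -- all 8 bytes of the granule are addressable
      1..7    -- only the first k bytes are addressable
      < 0     -- the whole granule is poisoned (redzone, freed memory, ...)
    Granules missing from the dictionary are treated as fully addressable,
    which is a simplification for this sketch.
    """
    k = shadow.get(addr >> 3, 0)
    if k == 0:
        return False
    # The last byte touched must lie below the addressable prefix k.
    return (addr & 7) + size - 1 >= k

shadow = {
    0x1000 >> 3: 0,    # fully valid granule
    0x1008 >> 3: 4,    # only bytes 0..3 valid
    0x1010 >> 3: -1,   # poisoned
}
print(is_invalid_access(shadow, 0x1008, 4))   # access inside the valid prefix
print(is_invalid_access(shadow, 0x100A, 4))   # runs past the valid prefix
```

With the modified BOUND interface, the address and the size would arrive in the two operands, and a `False` result corresponds to the no-op case.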
However, if we detect an invalid access, we can take a configurable action. We can, for example, just raise your normal page fault, or we can raise the bound interrupt, which is a dedicated interrupt that is only used for this, or we can branch to an x86 handler that either performs additional checking, for example whitelisting, or generates a pretty error report for you. Most importantly, this is a single instruction. We also do not dirty any x86 registers. There are some intermediate results you need to store somewhere, and in software you usually do this in x86 registers, which increases register pressure and may cause spilling, so overall your performance gets worse.

We also found out that we're actually faster than doing the checking using x86 instructions. So just by moving the implementation from the x86 level to microcode, which in some way is still kind of like software, we already improved the performance. On top of this you get better cache utilization, because you have fewer instructions and thus fewer bytes in the cache, so you occupy fewer cache lines, and it is also really easy to tell which is testing code and which is your actual program code.

Lastly, I'm going to give you a rough overview of the framework which we used during our development and which you can also find on GitHub. Early on we found out that we were probably going to need to test a lot of microcode updates, because in the beginning you just throw everything at the CPU and see how it behaves, and we wanted to do this in parallel. So we developed a small custom OS called Angry OS and deployed it to mainboards. These mainboards are just old AMD mainboards. All of them are hooked up to a Raspberry Pi via serial for communication and via GPIO. With the GPIO you can reset the board, power it on, power it down, and just have remote control of this mainboard, and then you can connect to the Raspberry Pi from anywhere on Earth and just deploy and play around with it. This was the first version.
In the beginning we didn't really know much about electronics, so we used one Raspberry Pi per mainboard, and it turns out Raspberry Pis are more expensive than these old mainboards. But we improved upon this, and now we are down to one Raspberry Pi for four or five setups. You only need three GPIO ports per mainboard. You connect each of these to optocouplers, just to separate the voltage levels, and then you connect one side of the optocoupler to the GPIO and the other side to your reset pin or to your power pin, and for input, to know whether your board is up or down, you connect the power LED. That way you can save a lot of space and a lot of money. And if you're really constrained, you can just remove the power LED sensing, because usually you know the state your setup is in.

As I already said, we wrote our own custom operating system, and it is intentionally really, really minimal, because the major feature we wanted is control over every instruction that's going to be executed from a certain point on. We are playing around with instruction decoding, and if we execute an instruction that we did not intend, we might crash the CPU, we might go into an invalid state, and we would not even know which instruction caused it. Angry OS essentially only listens on the serial port for something to do. What it can do is apply an update. These updates, the custom microcode updates, are streamed via serial. We can also stream x86 code, which is then run by Angry OS. This is just so we do not need to reflash the USB stick every time we want to update our testing code. And of course all the errors are reported back to the Raspberry Pi, and thus they are forwarded to us.

The framework we use most importantly has a microcode assembler and a pretty verbose disassembler. The disassembler generates the output I showed you earlier, and using this you can just quickly write your own microcode.
We also included an x86 assembler, because we wanted to rapidly test different pieces of x86 testing code. Using this framework we were able to disassemble the existing updates, and we also used it to disassemble our ROM after we reordered it, and also during the process when we fed it to our emulator. And we can also create the proper binary files that can be loaded by the Linux driver. We modified the stock one to just load any update you give it, without checking if it's for the correct CPU ID and all these things, just for testing purposes. It's also available. And of course the framework can control Angry OS to make testing easier. And we implemented a really basic remote execution wrapper, so you can work on a remote Raspberry Pi as if you were using it locally.

This brings me to the end of the talk. In conclusion, we can say the ROM seems to have opened up a lot of new possibilities. We learned a lot about how microcode works. We learned how to actually use it properly, instead of just inferring from the really small data set that we have from the updates, or from the random bit strings we sent to the CPU while observing what happened. But there's a lot left to do. So if you really want to hack on it, just get in contact. We're happy to share our findings with you. And as I said, the framework, Angry OS, the example programs that we implemented, and some other stuff like the wiring are available on GitHub. So that's it, and I'll be happy to answer any questions you might have.

Thank you very much. So we have 10 minutes for questions. Please line up at the microphones. We start with this one, microphone number two.

Hi, thanks for a nice talk. A few questions about your hardware address sanitizer. As I understand it, you don't need the source code instrumentation, because the microcode is responsible for checking the shadow memory, right?
The original hardware address sanitizer instrumentation is also based on a compiler extension that inserts the new instruction, because it doesn't usually exist, and that also inserts bootstrap code that initializes the shadow map, and also instruments your allocators to update the shadow map during runtime. We essentially need the same components, but we do not need the software address sanitizer component that essentially inserts 10 or 20 x86 instructions before every memory access. So yes, we still need the compile-time component, and we are still source code based in essence.

And I didn't see, maybe I missed the numbers: how much faster is it than this initial version?

You mean the initial hardware address sanitizer version or the software address sanitizer?

I mean, let's say KASAN, the kernel address sanitizer for the Linux kernel, which is the usual one, and your approach?

We only performed a microbenchmark on Angry OS. We essentially took the instrumentation as emitted by the compiler for a certain memory access, which is your standard software address sanitizer, and compared it to our version using only the modified BOUND instruction. So I really can't talk about how it compares to KASAN or some real-world implementation, because we only have a prototype and basic instrumentation.

Thank you very much. Okay, microphone number four please.

Thanks for the talk. Did you find any weird microcode implementations? I don't mean security-wise, just that you weren't expecting it to be implemented that way.

So the problem is there's a lot of microcode to begin with. You have F000 triads, each of which has three operations, so you have a lot of ground to cover, and we also have readout errors. Sometimes you're seeing bit flips, which kind of slows you down, because you always need to consider, okay, maybe this is just something else, maybe this address is wrong. And sometimes you have a dust particle that knocks out an entire region.
So we only looked at the components we were pretty sure we had recovered correctly, and we only looked at a really tiny subset compared to all of the microcode ROM. It's just not feasible to go through it and look at everything. So no, we didn't find anything funny, but we also wouldn't know what funny looks like, because we don't know what the official spec for microcode is.

Interesting. We have one question from the internet, from the signal angel please.

Yes: which AMD CPU generations does this apply to?

This is still based on the work from our first talk, and it only works on pretty old ones: K8 and K10, CPUs produced until 2013. This was the last year AMD produced anything like that. Newer ones use some public-key-based cryptography, from what we can tell, and we haven't yet managed to break it. The same goes for Intel: they seem to be using public key cryptography, and we haven't gotten a foot in the door yet.

Thank you. We go on with microphone number three please.

Yeah, thank you. I would like to know how complex the microcode programs you could write could be. So what's the complexity of new operations you could implement?

The only limiting factor is the size of your microcode update RAM, but this one is really, really limited. For example, on K8, where we performed the majority of our experiments, we are limited to 32 triads, which comes down to 96 instructions, and you also have some constraints on these instructions. For example, the next triad will always be executed, no matter what. Some operations can only go in the second slot, some can only go in another slot, so it's really, really hard, and to our knowledge you're also limited to loading 16-bit immediates instead of 32-bit or even 64-bit immediates. So your whole program grows really fast if you're trying to do something complex. For example, our authenticated microcode update mechanism is the most complex one we wrote.
It nearly fills up the RAM, and we used TEA, the Tiny Encryption Algorithm, because that was the only one we managed to fit in, mostly due to the S-boxes and other constants we would otherwise need to load. So it's really small.

Thank you. Microphone number one.

So you said the microcode is used for instruction decoding, and it needs to emit the micro-ops to the scheduler and the micro-op queue in some way. Did you find out how that works?

In essence we are not actually executing code inside the microcode engine. From what we understand, the microcode engine is just some kind of software-based recipe that describes how to decode an instruction. So you don't actually get execution; you just commit instructions into the pipelines that do what you want. And because we have some control flow possibilities that are actually inside the microcode engine, because you can branch to different addresses, you can conditionally branch and loop, you kind of get execution. But in essence you just commit stuff into the pipeline and the CPU does what you tell it to.

One more question, microphone number two please.

How did you take the picture of the internals of the CPU? Did you open it?

We worked together with Chris. He's our hardware guy. He has access to the equipment to delayer it, to take high-resolution optical shots, and to also take shots with a scanning electron microscope. So I think about five or six CPUs were harmed in the making of this paper.

So we have one more, last question. Microphone number two please.

Okay, are you aware of the research done by Christopher Domas, where he mapped out the instruction set of x86 processors?

Yeah, you mean sandsifter.
Yeah, we actually talked with him, and yes, we are aware that there's essentially a map of the instruction set. And maybe you can combine the two, because in the beginning we reverse engineered where certain x86 instructions are implemented in the microcode, so if you plug these two together, you kind of map out the whole microcode ROM at the same time that you map out the whole instruction set. However, there are some components of the microcode ROM that are most likely not triggered by instructions, for example things like power management, or everything that is behind a WRMSR or RDMSR. WRMSR is a single instruction, but depending on the arguments you give it, it just branches to totally different triads. The microcode update itself is implemented in microcode, and this one is a huge chunk you wouldn't even find without brute-forcing all combinations for all instructions, which is not really feasible.

Thank you, thank you Benjamin and