Hi everyone. It's great to be here at DEF CON. I'm here to talk to you today about breaking the x86 instruction set architecture. Before I get into any of that, my name is Christopher Domas. I'm a cybersecurity researcher at a company called the Battelle Memorial Institute. I really, really like working here. It gives me a chance to encounter a lot of very hard problems. But I think what I like the most about working here is that the problems and people I encounter during the day give me a lot of really interesting ideas for the fringe areas of cybersecurity to explore in my free time. And that's what I wanted to show you today: one of these little side projects I've been working on, on and off, for the last couple of months. So this whole thing is based around trust. And I'm going to start this presentation off with a really, really obvious statement. We don't trust software. And we shouldn't. Software can be horrible. So we audit software to make sure that it works the way we expect it to work. We reverse engineer software to make sure there are no secrets hidden inside of it. We don't want Minesweeper dialing out to a Russian website. We break software. We try to find vulnerabilities and exploit them in order to harden it and make things more resilient. And despite all this effort, we still don't trust software. So we sandbox it so that when something goes wrong, at least it's contained. But the processor itself, you know, the thing responsible for actually running our software and enforcing all of our security checks, the processor itself we pretty much just blindly trust. And we don't really have a lot of choice here, because we don't have reverse engineering or introspection or auditing tools for the processor the way that we do for software.
Really the best we have for the processor is a bunch of processor specifications and reference manuals, a bunch of documents that tell us this is how your processor operates, and we're just supposed to take it on faith that that's how the processor actually operates. And I think that's kind of crazy, right? Because we would never do that with software. You never just download some totally random executable, open up the readme, see that it says "totally not a virus" and think, yeah, I guess that's totally not a virus, I can go ahead and run this. But that's exactly what we're doing with hardware right now. Why? Hardware has all the same problems as software. So why do we trust hardware but not software? You worry about secret functionality in software; does hardware have that? Absolutely. Intel infamously had what was called the Appendix H documents, which described all the secret pieces of the x86 architecture and were only available to trusted Intel sources. What about bugs? You worry about bugs in software; does hardware have bugs? You've probably heard of several of these. The F00F bug in the Intel Pentium could allow an unprivileged user to lock the system. The FDIV bug in the Pentium: every couple billion floating point operations, the Pentium would be off by a fraction of a percent. That cost Intel about $500 million. The TSX bugs in the Haswell architecture: Intel had to completely disable transactional memory support in Haswell because of these bugs. Skylake has hyperthreading data corruptions. The Ryzen processor is crashing and faulting under heavy FMA3 operation loads. Modern processors have tons and tons of bugs in them. What about vulnerabilities? You worry about vulnerabilities in software; hardware has got those too. The SYSRET vulnerability allowed a user to escalate into kernel level permissions on almost every major operating system.
The cache poisoning and memory sinkhole attacks allowed infiltrating x86's protected system management mode. So vulnerabilities exist in processors too. The point here is we should stop blindly trusting our hardware. We don't trust software; we shouldn't trust hardware either. What kinds of things do we need to worry about when we're talking about trusting hardware? The thing I was mostly interested in for this research was hidden instructions inside of the processor. Like maybe there's secret functionality that could give us a back door or powerful access to the processor internals. Now I know that sounds almost conspiracy theory-esque, but it's not all that far off from reality. There are actually some interesting historical examples of this in the x86 architecture. So for example, early x86 chips had what was called the ICE breakpoint instruction. This was an instruction that wasn't documented anywhere that would switch the processor over into a very privileged ICE mode. Those same processors also had this LOADALL instruction, another undocumented secret instruction that would give you access to hidden pieces of the processor registers that you normally couldn't reach. Or, for a more recent example, you may have heard about the apicall instruction in Microsoft's x86 emulator. Basically what Microsoft did was they backdoored the UD0 x86 instruction and turned it into something entirely different without ever releasing that information. And that caused a fairly serious vulnerability in their emulator. So these things can exist in real life. And to sort of highlight that point, if you actually go into your processor's documentation, if you look up the Intel Software Developer's Manual, Volume 2, near the end of it you'll see dozens and dozens of tables that look something like this. What these are are the opcode maps. These are supposed to enumerate all the instructions that your processor supports.
It's basically saying when the processor sees this byte, it's going to execute this instruction. When it sees this byte, it's going to execute this instruction. So that's what we're seeing in the opcode maps. But if you look really closely at these opcode maps, you'll notice something. There are gaps here and there. So this is a document that's supposed to tell us everything the processor does. This is a document that we're basing all of our trust for this processor on. But it's leaving things out with these gaps. That's not really a good start for trust, right, if we're relying on this document and it's intentionally not telling us certain things about the architecture. So I wanted to find a way to actually audit the processors that are in all of our computers. I wanted to find a way to figure out what's really on my processor. So I wanted to find these hidden instructions. How do we do that in this architecture? The challenge with x86 is that the instruction format is very, very complex. In x86, you can have one byte machine instructions: increment EAX is hex 40. You can also have 15 byte machine instructions: an add qword with a CS override, a complicated memory access, and a complicated immediate value is a 15 byte x86 instruction. And you can have everything in between as well. So if we actually look at a worst case scenario and assume all the instructions are 15 bytes just for simplicity, you're looking at something like 1.3 undecillion possible instructions on this architecture. Now if I just want to find one hidden secret instruction in all of that, that's going to be really, really difficult. So the obvious approaches for trying to sort through all of these instructions just don't really work. You might think, let's just try all of the x86 instructions. That might work for a RISC chip with a fixed length instruction set, but it's not going to work for x86. There's way, way too much.
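As a quick check on that figure, here is a small sketch of the arithmetic. It assumes the same simplification as the talk (every instruction treated as exactly 15 bytes); the execution rate used for the time estimate is a made-up number for illustration only, not something measured or stated in the talk.

```python
# Size of the search space if every instruction were exactly 15 bytes long.
MAX_INSTRUCTION_BYTES = 15


def search_space(num_bytes: int) -> int:
    """Number of distinct byte strings of the given length."""
    return 256 ** num_bytes


total = search_space(MAX_INSTRUCTION_BYTES)
print(f"{total:.3e}")  # 1.329e+36, i.e. about 1.3 undecillion

# Hypothetical throughput, chosen only to give a sense of scale.
ASSUMED_INSTRUCTIONS_PER_SECOND = 1_000_000_000
SECONDS_PER_YEAR = 365 * 24 * 3600
years = total / (ASSUMED_INSTRUCTIONS_PER_SECOND * SECONDS_PER_YEAR)
print(f"about {years:.1e} years at the assumed rate")
```

Even at that generous assumed rate, brute force over the full space is far out of reach, which is why the rest of the search has to be selective.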
You might think, well, what if we just do random instructions and see if we find anything interesting. The problem is you could run random instructions for the next 10 years and you'd still cover such a tiny, tiny portion of that search space that it wouldn't be very useful. You might think, well, we've got this documentation that kind of tells us what these instructions look like. Why don't we guide our search based on the documentation? But the big problem with that is, like we already established, the documentation can't be trusted. If you're basing your search on the documentation, you're kind of doomed to fail from the start. The other thing you might think is, well, there are these gaps in those manuals. Why don't we just explore those gaps? But those gaps really only tell you the first byte of the instruction. If you were looking for a secret instruction that's 15 bytes long, you've still got a long way to search even basing it on the documentation. So we don't want to do that. We want to find a better way to search through this instruction set. Really our goal is to find the bytes in the instruction that actually matter so that we can fuzz those and ignore all the bytes in the instruction that don't matter. So I want to throw out an observation here: the meaningful bytes in an x86 machine instruction impact either that instruction's length or its exception behavior. Basically the meaningful bytes change something important about the instruction, and the bytes that don't really matter don't change anything important about the instruction. So I came up with this idea of a depth first search algorithm that would let us search through all the reasonably distinct instructions inside of the x86 instruction set. The way that this is going to work is we're going to guess an instruction. Now we don't know how long this instruction is. Let's just guess 15 bytes of zeros.
Then I'm going to execute this instruction, and from that execution I'm going to see how long this instruction actually was. You'll see that this instruction is two bytes long. So we're going to increment the very last byte of this instruction and then repeat the process. Execute the instruction, observe its length, increment the last byte. Execute, observe the length, increment the last byte. Very, very simple algorithm here. But eventually, when you do one of those increments and execute the instruction, you'll find that the length changes. Our instruction just went from two bytes to three bytes. So what we do when the length changes is we move to the new last byte and increment it. Then execute the instruction, observe its length, increment the last byte. Over and over and over again. So if you repeat this process, this is what your instructions kind of look like. You can see that through this incrementing approach, we very gradually drill down into the instruction set by generating more and more complex instructions as we go along. So at some point you're going to have tried all possible values for that last byte and you're going to see that the length still didn't change. So when you've done that, when the last byte is FF, 255, you're going to increment it one more time, let it roll over, and then you're going to move back a byte. That's going to be our marker byte. So then we're going to increment that marker, so it becomes one, and we'll repeat the process. Execute the instruction, observe its length. If the length hasn't changed yet, and it didn't here, increment the marker again and just repeat that over and over and over. And eventually in this situation that marker will roll over again. So you move back a byte now. If the marker rolls over, move back. Marker rolls over, move back. Just keep that up.
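The search loop as narrated here can be sketched in a few lines. This is a minimal illustration of my reading of the description, not the actual tool's code: `tunnel` and `toy_length` are hypothetical names, and `toy_length` is a made-up stand-in for "execute the instruction and observe its length" (in the real search that answer comes from the processor). The sketch also includes the final rule of the walkthrough, that the marker jumps back to the end of the instruction whenever the observed length changes.

```python
from typing import Callable, Iterator

MAX_LEN = 15  # longest possible x86 instruction


def toy_length(buf: bytes) -> int:
    """Made-up length oracle standing in for real execution.

    Toy rules (NOT x86): opcodes below 0xF0 are one byte long. Opcodes
    0xF0..0xFF take an operand byte, and operand values >= 0x80 pull in
    one extra displacement byte.
    """
    if buf[0] < 0xF0:
        return 1
    return 3 if buf[1] >= 0x80 else 2


def tunnel(length_of: Callable[[bytes], int]) -> Iterator[bytes]:
    """Yield candidate instructions in the order the tunneling search visits them."""
    insn = bytearray(MAX_LEN)  # initial guess: 15 bytes of zeros
    marker = 0                 # index of the byte currently being incremented
    last_len = None
    while True:
        length = length_of(bytes(insn))
        yield bytes(insn[:length])
        # Length changed: move the marker to the end of the instruction.
        if length != last_len and marker < length - 1:
            marker = length - 1
        last_len = length
        # Increment the marker byte; on rollover, move back a byte and
        # increment that one instead, repeating if it rolls over too.
        insn[marker] = (insn[marker] + 1) & 0xFF
        while insn[marker] == 0:
            marker -= 1
            if marker < 0:
                return  # the first byte rolled over: search complete
            insn[marker] = (insn[marker] + 1) & 0xFF


visited = list(tunnel(toy_length))
print(len(visited), "candidates visited")
```

On this toy instruction set the walk visits a few thousand candidates, while enumerating every distinct one, two, and three byte encoding would take over half a million. The saving comes from not re-exploring trailing bytes that were already observed to have no effect on the length.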
You keep incrementing these bytes one by one and observing the length, and eventually, when you increment the marker and execute the instruction, you're going to find that the length changed again. So at this point we move back to the end of the instruction and start the whole process over, incrementing that last byte and seeing how long the instruction is. So this is a sort of neat algorithm, this tunneling algorithm. It lets us search through the instruction space really quickly, skipping over the bytes that don't matter and focusing on those that do. That basically lets us exhaustively search the bytes in the instruction that actually matter. And effectively what that does is it reduces your search space in this architecture from 1.3 times 10 to the 36 instructions down to about 100 million, something you can actually scan in about a day. So that's pretty neat. I was excited that I found a way to quickly search this ISA. But there's a catch here that I kind of glossed over. Namely, in order for this algorithm to work, we need to know the length of an x86 instruction. That's not as easy as it sounds. We can't just disassemble the instruction to see its length, because we might be dealing with undocumented instructions or misdocumented instructions. So the simple approach here, if you're familiar with x86, might be to use the x86 trap flag. The way the x86 trap flag works is you've got an instruction and you set the trap flag before executing that instruction. Then you execute that instruction. The processor throws an exception and gives control to your trap handler. Now that trap handler can look at the new instruction pointer and compare it to the original instruction pointer, and that difference tells you the instruction length. So that's how you could try to do this with the trap flag. But the trap flag has a really big limitation here. Namely, it fails to resolve the length of faulting instructions. So why do we care about faulting instructions?
Well, it's necessary if you want to search through privileged instructions. So let's say we're executing this in ring three, the least privileged mode of execution on the x86 processor. We still want to figure out if there are instructions that exist only in ring zero. So for example, loading a control register can only be done in the kernel, in ring zero. There are instructions that can only be executed in a hypervisor, like VMENTER, or instructions that can only be executed in system management mode, like RSM. So regardless of what mode of execution or privilege level we're scanning at, we want to be able to resolve the instructions that occur in any other mode. So we've got to come up with a better approach for figuring out the length of an instruction. What I came up with for this was a sort of page fault analysis. The idea here is to choose an arbitrary instruction. We don't actually know how long this instruction is right now, but we want to figure out how long it is. So we're going to map two pages into memory. The first page is going to have read, write, and execute permissions, and the second page is only going to have read and write permissions. Then what we do is we place our instruction in memory so that the first byte of the instruction is on the last byte of the executable page and the rest of the instruction is at the beginning of the non-executable page. Then you basically just jump to the instruction and see what happens. Internally, the processor's instruction decoder is going to grab that first byte, that 0F here, and it's going to see that 0F is not an instruction by itself. It needs more bytes. So it tries to grab the next byte of the instruction, but now it sees that that next byte is on a non-executable page, so the processor throws a page fault. Specifically, the processor is going to generate a page fault exception with the fault address in the CR2 register. So CR2 points to the address of that second page.
So whenever we receive a page fault with CR2 set to the address of the second page, we know that the instruction is longer than what we have in the executable page. So in this situation I know the instruction is longer than just 0F. So we move the instruction back a byte and repeat the process. Execute the instruction. The decoder is going to grab that first byte. It's going to see the instruction continues beyond 0F, so it's going to grab that next byte. That works now that this byte is in an executable page. It's going to see that with 0F 6A it still needs more bytes to finish the instruction. And then it's going to try to grab the next byte. Again it's going to fault here because that byte is in a non-executable page. So we basically just repeat this process over and over. As long as we receive page fault exceptions with CR2 set to the second page address, keep moving the instruction backwards one byte. And eventually what will happen when you execute the instruction is that the entire instruction resides in the executable page, and then the processor could do a lot of different things. The instruction could run. The instruction could throw a different kind of fault. It could throw a page fault as well, but it's going to have a different address in the CR2 register. Regardless of what happens here, all of those situations mean that this instruction has now been successfully decoded, so it must reside entirely in the executable page. That means that we know the instruction length at this point. We know how many bytes the instruction decoder consumed. What we don't actually know at this point is whether or not this instruction really exists. The processor has to decode even non-existent instructions. Fortunately it's pretty easy to check whether this instruction exists at this point. Say the instruction does not exist on the processor, like it's really not there.
Not just that it's undocumented. If the instruction is really not there, the processor generates an undefined opcode exception at this point. So if we receive anything other than a UD exception, then we know that this instruction actually exists on this processor. So this is kind of a neat approach. It lets us resolve the length for successfully executing instructions, faulting instructions, and privileged instructions, say, ones that can only execute in ring zero, ring minus one, or system management mode. So I threw all this functionality into a little process that I called the injector. The injector basically does this page fault analysis and the instruction search operations. The injector has a really big problem though. It's fuzzing the device that it's actually running on. So how do we keep the injector from crashing itself with these random instructions that it's generating? Well, the first step is we're going to restrict ourselves to running inside of ring three, and that's not such a big limitation, because even in ring three, using this approach, we can find out whether or not certain instructions exist in more privileged rings as well. But it does mean that we're not going to accidentally crash the entire system through these random instructions, unless you've got a very serious processor bug. The next thing we're going to do is hook all of the exceptions that the instruction could generate. In Linux that means we're going to hook the segfault exception. We're going to hook illegal instruction exceptions. We're going to hook floating point exceptions, bus errors, and traps, and whenever we receive one of these exceptions the process is basically going to clean up after itself. By that I mean that the process is going to reload all of the registers to known good values.
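The page fault length analysis and the undefined opcode check described above can be illustrated with a small simulation. This is only a sketch: real code would map two pages and catch faults from the processor, whereas here `run_straddling` is a hypothetical stand-in for "jump to the instruction and see what fault comes back", `toy_decoder` is a made-up decoder (not real x86 behavior), and `NX_PAGE_ADDR` is an arbitrary illustrative address.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

NX_PAGE_ADDR = 0x7000_1000  # hypothetical address of the non-executable page
MAX_LEN = 15


@dataclass
class Fault:
    kind: str                  # "PF" (page fault), "UD" (undefined opcode), "OTHER"
    cr2: Optional[int] = None  # faulting address, only meaningful for "PF"


Decoder = Callable[[bytes], Tuple[int, bool]]  # returns (length, exists)


def toy_decoder(insn: bytes) -> Tuple[int, bool]:
    """Made-up hardware decoder: 0F xx yy is 3 bytes, and 0F FF yy is unimplemented."""
    if insn[0] == 0x0F:
        return 3, insn[1] != 0xFF
    return 1, True


def run_straddling(insn: bytes, exec_bytes: int, decoder: Decoder) -> Fault:
    """Simulate executing insn when only its first exec_bytes bytes are executable."""
    length, exists = decoder(insn)
    if length > exec_bytes:
        # The fetch crossed into the non-executable page before decoding finished.
        return Fault("PF", NX_PAGE_ADDR)
    # Fully decoded: a missing instruction raises #UD. "OTHER" lumps together
    # running normally, other faults, and page faults at a different address.
    return Fault("OTHER") if exists else Fault("UD")


def probe(insn: bytes, decoder: Decoder = toy_decoder) -> Tuple[int, bool]:
    """Recover (length, exists) using only the observed faults."""
    for exec_bytes in range(1, MAX_LEN + 1):
        fault = run_straddling(insn, exec_bytes, decoder)
        if fault.kind == "PF" and fault.cr2 == NX_PAGE_ADDR:
            continue  # instruction is longer: expose one more byte and retry
        return exec_bytes, fault.kind != "UD"
    raise ValueError("no instruction decoded within 15 bytes")


print(probe(bytes([0x0F, 0x6A, 0x00]) + bytes(12)))
print(probe(bytes([0x0F, 0xFF, 0x00]) + bytes(12)))
```

Note that `probe` never looks inside the decoder. It learns the length purely from how many bytes had to be made executable before the page fault at the second page stopped, and learns existence from whether the final fault was an undefined opcode exception.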
Reloading the registers means that no matter what happened with the instruction that we generated, we reset the system to a known good state. Step three, to prevent other types of crashes, is that we set all the general purpose registers on the processor to zero before we try to execute the instruction. What that does for us is it handles an arbitrary memory write instruction like this one. This instruction adds the value 9102 into some location in memory. We don't want that to actually hit something that our program needs. We don't want that location in memory to fall into our process's address space. So by loading all the registers with zero, we ensure that the memory address portion of this instruction resolves to zero. But that's not quite enough just yet, because x86 has some complicated addressing modes. We can have an address like EAX plus four times ECX plus some weird long offset. So even though we can ensure that the part on the left resolves to zero, the part on the right might still fall into the process's address space, in which case you could corrupt yourself beyond repair. So we don't want that to happen. Fortunately we're actually in good shape here because of the way that the tunneling search works. Essentially, tunneling ensures that those offsets are constrained. If we have a four byte offset, three of those bytes will always be zero when we're searching these instructions. So there's a range of different values we can get for that offset, but none of those actually fall into the normal Linux process address space, meaning that our process won't accidentally corrupt itself. Now those instructions will segfault, but that's perfectly okay. We catch the segfault and can recover from it. So we've handled faulting instructions at this point. What about non-faulting instructions, the instructions we generate that actually do run? How do we handle those?
We need some way for the analysis to continue after these instructions. So imagine that you randomly generated an instruction like jump backwards 50 bytes. The processor would jump backwards 50 bytes and then just start executing random garbage that could corrupt your process beyond recovery. So we don't want that to happen. But here's where the trap flag can actually help us. Right before we execute an instruction, we're going to set the trap flag. So what will happen is that when that instruction executes, a trap will trigger that gives control over to our program, and our program can restore the registers to a known good state again. So with all of these things, limiting ourselves to ring three, handling exceptions, initializing registers, keeping registers set to known good values, and trapping execution, the injector survives. Basically it won't crash itself now. So we've got an effective way to search the x86 instruction space. Now the next question is how do we make sense of these instructions that we're actually generating? For this I designed what I call the sifter process. It's kind of a wrapper around the injector. The sifter's job is to look at what the injector is doing and pull out anomalies from the injector. So what do I mean by anomalies? What's an anomaly in x86? Well, I thought what I really want to find is somewhere where the actual processor execution deviates from the processor specifications. So my idea here was we could use a disassembler as a sort of ground truth for this search, because a disassembler is presumably written based on the processor specifications. It's a good abstraction of those specifications for us. For this work I used the Capstone disassembler. It's a really great tool. So by comparing the results of the actual execution to the disassembled results, we can start pulling out interesting things from the architecture. For example, we can find undocumented instructions now.
Basically, for an undocumented instruction, the disassembler doesn't recognize a byte sequence, but that byte sequence generates anything other than an undefined opcode exception when it executes. That means that this instruction actually exists on the processor even though it's not in the documentation. We can find software bugs this way. The disassembler recognizes an instruction, but the processor says that that instruction's length is something different. That's usually indicative of a software bug. And with the search we can also find hardware bugs, although there's no really good heuristic I came up with for these hardware bugs. Basically, when everything goes haywire, you can dive in and figure out what's really going wrong on this system. So I want to give you a quick demo. I'm not sure how well this is going to show up on our projectors, but I'm going to run the sifter tool. What we see here is our tool fuzzing the x86 processor. On the top half of this image we're seeing the instructions that the injector is generating. On the right side you'll see the actual machine code, the raw bytes that the processor is executing. The part highlighted in white down here is the observed length of the instruction on this processor. And then on the left we see what Capstone, our disassembler, our ground truth, thinks this instruction actually is. And the sifter is basically just watching for differences between those two sides, and whenever it finds a difference, whenever it finds an anomaly, it's going to toss it into this little section down here so that we can investigate it in a little bit more depth. So after a little while it'll spit out a couple of anomalies. I'll talk a little bit about what these actually are momentarily. But the tunneling approach that we're using right now is very, very thorough. It does a really good job of searching the x86 instruction space, but it's also not very fast. It takes about a day for this scan to complete.
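The two comparison rules the sifter applies (undocumented instruction, and length disagreement pointing at a software bug) can be written down as a small classifier. This is a sketch under assumptions: the `Observation` record, its field names, and the `classify` function are hypothetical, and the example observations are illustrative inputs rather than real scan output. It assumes the execution result and the disassembler result have already been collected for each byte sequence.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Anomaly(Enum):
    NONE = "no anomaly"
    UNDOCUMENTED = "undocumented instruction"
    LENGTH_MISMATCH = "length mismatch (likely software bug)"


@dataclass
class Observation:
    raw: bytes                    # bytes handed to the processor
    observed_length: int          # length resolved by executing the bytes
    undefined_opcode: bool        # True if the processor raised #UD
    disasm_length: Optional[int]  # None if the disassembler rejected the bytes


def classify(obs: Observation) -> Anomaly:
    if obs.undefined_opcode:
        # The processor says the instruction is not there. The talk gives
        # no rule for this case, so the sketch does not flag it.
        return Anomaly.NONE
    if obs.disasm_length is None:
        # Runs on the processor but is unknown to the disassembler.
        return Anomaly.UNDOCUMENTED
    if obs.disasm_length != obs.observed_length:
        # Both agree it exists but disagree about its size.
        return Anomaly.LENGTH_MISMATCH
    return Anomaly.NONE


# Illustrative observations (made up for this sketch):
hidden = Observation(bytes.fromhex("0fa7c1"), 3, False, None)
misparsed = Observation(bytes.fromhex("66e900000000"), 6, False, 4)
print(classify(hidden).value)
print(classify(misparsed).value)
```

Hardware bugs are deliberately left out of the classifier, since the talk notes there was no good heuristic for them beyond investigating when everything goes haywire.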
So if we want just quick results, we can change the fuzzing mode that we're using. I added totally random fuzzing into this, and it doesn't do as good a job searching, but it does let you get very, very quick results. So if we change over to totally random fuzzing mode, we'll start to really see a lot of anomalies appearing in the processor. Basically what every one of these indicates is an instruction that exists on my x86 chip but that we're not told about in the reference manuals. And I think that's kind of scary, that all of these things are on my processor but nobody's acknowledging what they actually are. So if you let this run for a day it'll quit and dump out all the results. On most modern systems it'll come up with 1 to 3 million interesting things to look at. Now that's a lot to actually go through by hand and try to make sense of. So I built this summarizer tool, which again I don't know how well it's going to show up on the monitors, but what the summarizer does is it tries to condense all that information and pull out the most meaningful parts for you. So for example, what our summarizer is saying here is that it found 32 different instructions that start with 0F 18 that appear to not be documented. Or if I dive into one of these instructions, I can drill down in a little bit more depth. So it's telling me that the instruction 0F A7 C1 existed on this processor, but none of the three disassemblers that we gave that byte sequence to understood what that instruction was. That's a really strong indication that this is an undocumented instruction sitting on my processor. So basically this gave us a way to systematically scan our processors for secrets and bugs. So I scanned eight of the systems in my test library and I want to share with you the things that we found, because we found some really interesting things. First, we found hidden instructions in every single processor that we scanned.
We found ubiquitous software bugs, hypervisor flaws, and some very, very serious hardware bugs as well. When I set out to do this I was trying to find hidden instructions on the processor, so let me share the hidden instructions with you first. When I scanned an Intel Core i7 chip, a chip manufactured in 2012, these are some of the hidden instructions that I found: 0F 0D, 0F 18, 0F 1A, 0F AE. Most of these instructions are documented for certain combinations of bits inside of the instruction, but even instructions without that specific combination, the ones that aren't documented, still run. Now for some of these, Intel has recently updated their processor specification to document these instructions. So for example, 0F 18 was added to the processor documents in December of 2016. The thing is, I was scanning a chip made in 2012. These instructions were sitting on these processors for a very, very long time before they were acknowledged by anybody. This was another set of instructions I found on that Intel chip: DB E0, DF, F1, C0, D0, D2, F6, F7, a whole wide range of instructions which have absolutely no documentation in the reference manuals. So that's a little worrisome to me. So I started scanning other chips. I scanned an AMD Geode processor, and what I found when I started scanning processors from other manufacturers is that there was actually a lot of overlap in the undocumented instructions between these processors, which kind of indicates that there's some collaboration going on between different manufacturers as to what these undocumented instructions are actually going to be doing. So the really interesting parts, I thought, were the places where there was no overlap, where this was really a unique set of instructions only on that processor. For the AMD chip I scanned, the part where there was no overlap was 0F 0F 40 80 followed by some byte.
Now AMD documents some versions of that last byte, but the vast majority of those instructions aren't documented. It's just functionality hidden in our processor. Next I scanned a couple of VIA Nano chips. The unique parts of the VIA architecture were the instructions around 0F A7. So 0F A7 is a unique VIA extension to the instruction set called the PadLock instructions. You can actually look up VIA's PadLock reference manual and see that all of these instructions have to do with cryptographic functions. However, the PadLock reference manual doesn't acknowledge the existence of C1 through C7 as that final byte. So these are cryptographic instructions that are doing something on the processor, but there's no hint as to what they're actually doing. So what do these do? For some of these instructions, if you Google around, you'll find that people have stumbled across them in the past and reverse engineered what they do, basically by looking at the register differences before and after an instruction executes in order to find out what that instruction must be doing. But some of these instructions have absolutely no record at all. They're not in the documents or the processor specifications, and they don't appear anywhere online. So that's, I think, scary in terms of trust for these things. So those are the hidden instructions I found. This thing also ended up turning up a bunch of software bugs. This isn't what I set out to find, but some of the results we got were sort of interesting. The issue here is that the sifter process is forced to use a disassembler as its ground truth. And it turned out every single disassembler that we tried for our ground truth was absolutely littered with bugs. Most of the bugs we encountered with this tool only appeared in a couple of tools, so they weren't all that interesting. But some of the bugs we encountered appeared in all the tools that we checked. Those can actually be used to an attacker's advantage.
So two of the more interesting ones I found were this jump and call instruction. In the 64 bit version of the x86 architecture, the E9 instruction is a jump instruction, E8 is a call instruction, and 66 is supposed to be what's called a data size override prefix. The idea is that the 66 prefix on these instructions is supposed to change the default operand size of the instruction. Now the default operand size here is 32 bits, so 66 is supposed to change that to either 16 or 64. But Intel processors, for whatever reason, silently ignore that 66 prefix on these instructions alone. So why does that matter? Well, it turns out that because of that, everyone is parsing these instructions wrong. I tried this in IDA, Valgrind, GDB, objdump, Visual Studio, Capstone, and QEMU, and every single tool I tried is parsing these instructions incorrectly. And here is sort of how we can use that to our advantage as an attacker. So here we're seeing how IDA parses this instruction. You'll see that IDA thinks this is a four byte instruction. It saw the 66 override and thought that that changed the operand to 16 bits instead of 32 bits. So IDA is getting the wrong length for this instruction. Here's Visual Studio's view of the instruction. Visual Studio actually recognizes that 66 didn't change the operand size, but it thinks that the 66 caused the destination of this jump to get truncated to 16 bits. That's also not the correct behavior. So Visual Studio isn't able to resolve the target of this jump. So you can actually use this for some malicious things by throwing off these disassemblers. I made a little malicious program and then had objdump try to analyze it. You'll see here that objdump is misparsing that jump instruction. It gets the wrong size for that jump instruction. So now objdump thinks we have a jump followed by an add, followed by an add, followed by a movabs.
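To make the disagreement concrete, here is a small sketch that decodes a `66 E9` jump under the two interpretations described above: the prefix honored (16 bit displacement, four bytes total, which is what the tools assumed) versus the prefix ignored (32 bit displacement, six bytes total, which is what the talk reports for Intel processors). The function name `decode_jmp` and the byte sequence are made up for illustration; this is not the program from the demo.

```python
import struct
from typing import Tuple


def decode_jmp(code: bytes, honor_prefix: bool) -> Tuple[int, int]:
    """Return (instruction length, relative displacement) for a 66 E9 jump."""
    if code[0] != 0x66 or code[1] != 0xE9:
        raise ValueError("expected a 66 E9 jump")
    if honor_prefix:
        # Prefix treated as shrinking the operand to 16 bits.
        (rel,) = struct.unpack_from("<h", code, 2)
        return 4, rel
    # Prefix ignored: the displacement stays 32 bits.
    (rel,) = struct.unpack_from("<i", code, 2)
    return 6, rel


# Hypothetical byte sequence: a 66 E9 jump followed by filler bytes.
code = bytes.fromhex("66e902000000") + bytes(8)

tool_len, tool_rel = decode_jmp(code, honor_prefix=True)
cpu_len, cpu_rel = decode_jmp(code, honor_prefix=False)

# A linear disassembler resumes decoding right after the jump, so a
# different length shifts every instruction boundary that follows.
print("tool view: next instruction at offset", tool_len)
print("cpu view:  next instruction at offset", cpu_len)
print("cpu view:  jump target at offset", cpu_len + cpu_rel)
```

Because the two views disagree on where the next instruction starts, bytes that one view treats as the tail of the jump are decoded by the other view as the start of new instructions, which is the desynchronization the demo relies on.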
But because objdump got the wrong size for this instruction, it's messed up all of the following instructions that it tries to disassemble. So that's useful to us, because I embedded a malicious instruction inside of the operands for one of these movabs instructions. What this jump is really going to do is jump into the middle of one of these instructions and execute malicious code that objdump can't see. And to highlight the implications of that, I wrote this little program that you can run in QEMU. So it's a popular emulator. So what I'm going to do here is SSH into our QEMU VM and run my little program. And as far as QEMU can tell when this program runs, it's just totally benign behavior. It prints out, I'm totally benign. No matter how many times you run this, QEMU is going to see only benign behavior. But that's because QEMU is misparsing a jump instruction in this code. So when you run this on bare metal, you find it's actually a malicious program. It prints out, I'm actually malicious. And the neat thing about this is there's not any QEMU detection logic inside of this program. The only thing that allowed this is one misparsed jump instruction that everybody is getting wrong. So I think that was kind of interesting. So I was curious about why everybody was misparsing these jump instructions. And my best theory right now is that AMD actually obeys that override prefix on this, whereas Intel ignores that override prefix in their architecture. And that's a little bit troubling when we can't agree on a standard for our processor architecture. The last time that happened, when Intel deviated from AMD's specifications just a little bit, it resulted in the very serious vulnerability called the SYSRET bug that allowed kernel privilege escalation. So you might think, well, why don't we update all of our tools to fix this issue? Why don't we just do it the way Intel does?
They have 95% of the market share. Well, you can do that, but then AMD is going to be vulnerable. There's really no winning here. There's no correct answer. And I think it just kind of highlights the impractically complex nature of this architecture when tools can't even analyze a simple jump instruction correctly. So switching gears a little bit. Early on in the development of this, I was getting fairly tired of waiting a day for my scans to complete. So I wanted scans that could run a lot faster. I wanted to enumerate the instructions in the instruction set in an hour instead of a day. So I added multi-core support to this fuzzer. And I thought, wouldn't it be neat if I could run this thing on 20 cores? So I rented an Azure instance and ran this tool, but I very quickly found out that my tool wasn't running correctly on my Azure instance. And that turned out to be because there's a small bug in the Azure hypervisor. Namely, Azure doesn't emulate the trap flag correctly if the trap flag is set on the CPUID instruction. So the idea here is when you execute CPUID, it causes a VM exit and the hypervisor takes over. Then the hypervisor is supposed to emulate that CPUID instruction. And then the last thing it's supposed to do is check whether or not the trap flag was set. If it was, it should inject the trap into the guest VM. Azure forgets that last step. And what that looks like is something like this. So I've got a little program here. And I'm going to run it on bare metal. It's basically going to check that CPUID trap behavior. And it's telling me that it's going to execute a CPUID, NOP, NOP instruction sequence. And the trap flag is going to be set on the first instruction, the CPUID. Now it expects to get a trap at that first NOP. And sure enough, when it ran the code, it got a trap at that first NOP instruction. But now what I'm going to do is SSH into my Azure instance and try to run that exact same piece of code.
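The emulation steps just described can be modeled with a toy simulation. This is only a sketch of the logic as the talk explains it, not hypervisor code; the function name and the `hypervisor_checks_tf` switch are assumptions made for illustration.

```python
def run_guest(instructions, hypervisor_checks_tf: bool):
    """Return the index of the instruction at which the single-step trap
    is reported, or None if no trap is ever delivered.

    The guest is assumed to have set the trap flag before the first instruction.
    """
    trap_flag = True
    for i, insn in enumerate(instructions):
        if insn == "cpuid":
            # VM exit: the hypervisor emulates CPUID on the guest's behalf.
            # A faithful hypervisor then injects the pending single-step trap.
            trapped = trap_flag and hypervisor_checks_tf
        else:
            # Executed natively: the CPU raises the trap after the instruction.
            trapped = trap_flag
        if trapped:
            # The trap is reported with the instruction pointer at the next instruction.
            return i + 1
    return None
```

Running the sequence `["cpuid", "nop", "nop"]` with the check enabled reports the trap at the first NOP; with the check skipped, the trap is only noticed after the first NOP executes natively, so it is reported at the second NOP.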
What you'll see is we get different results inside of the Azure hypervisor. When I executed CPUID, NOP, NOP, we expected a trap on that first NOP, but Azure gave us a trap on that second NOP. That's a very small bug in the hypervisor. So this isn't a security-critical bug. It's not an especially big deal. It would be pretty hard to run across this in a normal situation. But it's always a little bit troubling when the hypervisor can't faithfully emulate the underlying hardware. So that sort of brings us to the last class of issues I found with this processor scanning tool. We found some hardware bugs. And hardware bugs are always troubling no matter how small they are, because a bug in your hardware means you have that exact same bug in all of your software. Hardware bugs are very difficult to find and very difficult to fix. So I started out scanning some Intel processors, a Core Pentium and a Core i7 processor. And I didn't actually find anything too interesting on these. I did find the F00F bug on the Intel Pentium. That's a malformed instruction that can totally lock the processor. But that's a very old, very well-known bug. So that was kind of anticlimactic. But it did sort of validate that this tool is capable of finding these malformed instructions that could have serious consequences. Next, I scanned a couple of AMD systems. On several of these AMD systems I noticed some odd behavior. Basically, they could generate an undefined opcode exception before they completed the instruction fetch. So I dove into AMD's specifications, and that's not the correct behavior for these processors. Basically, a page fault during the instruction fetch should take priority over an undefined opcode exception. So that was an erratum until, when I was getting these slides ready to present, I found that somebody at AMD must have found this recently, because they updated their documents in March of 2017 to have this tiny little footnote down here.
This footnote basically says, okay, this behavior is allowed. And I think that's a bit of a cop-out. If you've got a processor erratum and you update your documents to allow that erratum, is it still an erratum? I think it is, but maybe it's not anymore. I scanned a Transmeta processor. Transmeta is not so popular anymore, but they were kind of popular a few years ago. I found some interesting behavior on a Transmeta processor: with the instructions 0F 71, 72, or 73, the Transmeta would give us a floating point exception during the instruction fetch. So this is also the incorrect behavior. Basically, if you had an FPU exception pending and executed one of these instructions, you'd receive an FPU exception in the middle of that instruction fetch. The correct behavior here, if that last byte of the instruction is on an invalid page, is to page fault. So again, this is a very minor erratum, but it's always a little unsettling when there are bugs in our processors. But that takes us to our last finding, and I think this is the most interesting one. We actually found on one specific processor what appears to be a halt and catch fire instruction. So a halt and catch fire instruction, if you haven't seen this, well, it can mean a lot of things, but in this case it's a single malformed instruction that, when we execute it in ring three, the least privileged realm of execution on these processors, completely locks the processor. So I wanted to make really sure this wasn't a kernel bug instead. So I tested this on two different Windows kernels and three different Linux kernels, but then I really, really wanted to be sure this wasn't a kernel bug. So I actually wrote a small loadable kernel module for Linux that would hook all of the interrupts in the interrupt descriptor table and dump serial debug information out whenever interrupts occurred.
I was a little bit worried that maybe this instruction was causing what's called an interrupt storm to make it look like the processor was locked when it wasn't really locked, but this also seemed to validate that the processor is completely locked when we execute this instruction. So unfortunately I found this about two weeks ago during one of my scans, which means the vendor has not had time to respond to this issue. So there are no details available right now on the actual chip affected, the vendor, or the actual instruction format, but I did want to give you a demo of this thing in action. So I've got Debian booted on this processor. You can see it can run pretty much any normal program. It can run Hello World and print out Hello World. But when I run a.out, it's going to do something different. Now the only instruction in a.out is this one malformed instruction. That's all there is. And when I run it, the processor locks. Now I don't mean the process locks. I mean the processor itself is completely locked up at this point. You can try Ctrl-C to break out of this program. You can try to change run levels. The system won't respond anymore. You can even try the Linux magic SysRq keys in order to just crash or reboot the system, and you'll get no response, because that processor is done executing instructions at this point. So I was pretty excited about this. As far as I know, this is the first such attack on x86 found in 20 years. Thanks. So the original Intel halt and catch fire instruction was that Pentium F00F bug on the original Pentium in 1997. So I think that's pretty cool. And I don't want to try to make people panic. This is on a very specific chip that is not in widespread use. So it's not like the sky is falling here. I think the interesting part about this is that we have a tool that can actually find these kinds of very serious bugs.
And because it is a serious bug, it should be apparent, I think, that this is a very serious security concern if an unprivileged user on this processor can initiate an entire processor DoS that's going to lock everyone else out of the system. So it is a really neat thing. And I'm hoping, if everything goes well with responsible disclosure, I'll be able to tell you more about that within the next month, if you stay tuned. So I didn't just want this to be an academic thing. I wanted people to actually be able to use this tool and scan their processors. So this is now open sourced on GitHub. That's github.com/xoreaxeaxeax/sandsifter. And I encourage you to try this thing out, because we don't really know what's in our processors. So you should use this to audit your processor, find the secret instructions in it, break disassemblers, break emulators and hypervisors, find these halt and catch fire instructions. Like, my system didn't literally halt and catch fire, but yours might. And I think that would be awesome. So you won't know unless you actually try this out on your specific system. And I want to highlight that I've only scanned a few systems with this, and this is really just a fraction of what I found on those few systems. This is just what I could cram into a 45-minute presentation. So if that's what I found on these few scans, who knows what you'll find on your computer. So check your system. If you're not sure about the results that this thing is generating, if it dumps out all these weird instructions, you can send me those results. I'd be happy to try to dissect them for you. But the real point here is we need to stop blindly trusting the specifications. If we don't trust software, we shouldn't trust hardware. We need tools that can actually audit our processors and make sure that they're really doing what we're told they're doing. And that's what sandsifter lets us do.
It sort of gives us a really important, critical first step in introspecting this black box that's at the center of all of our systems. So again, you can find this on GitHub right now. That's the sandsifter project. You can find some other fun things that I tinkered with over the last few years on there. Like, I wrote a single instruction C compiler for no reason. I wrote some tools for manipulating program control flow to show images in IDA. A couple of years ago I released an architectural exploit on x86. And there's lots of other random things that I've played around with over the last few years that you can check out on there. If anybody has feedback or ideas, I would absolutely love to discuss it with you. So grab me after the talk, or you can reach out to me. My name is Christopher Domas. I'm on Twitter at @xoreaxeaxeax, or same thing at gmail.com. And like I said, I'll hopefully be releasing all the details for that processor vulnerability within the next month. So thank you, everyone.