Good afternoon. Y'all enjoying this conference? Come on! Yeah! So Christopher Domas is going to do his talk. He's going to be running a little long, using all of his time, so he's asked that any questions be held until after the talk. So with that, go ahead, Chris. Alright, welcome everyone. We'll go ahead and get started. So what I wanted to talk to you about today is honestly something that I didn't think was possible, but it should be pretty interesting. So, you know, the term "hardware backdoor" gets thrown around a lot these days, so much that it's kind of lost all meaning. But what I want to look at today is not the things that we're normally worried about. It's not the Management Engine. It's not the Platform Security Processor. This is something totally different that I think wasn't on anyone's radar, something we didn't really see coming. But like any good presentation, I think we've got to start off with a disclaimer. Basically, all of this research was done on my own. This does not reflect the opinion of my current employer or any previous employers. But with that, my name is Christopher Domas. I'm a cybersecurity researcher. I've tinkered with a lot of different things in the past, but recently, the last few years, what I've been focused on is low-level processor exploitation. It turns out there are a lot of fun things we can do with that. So in order to frame the discussion for the rest of the day, I want to start off with a demo of exactly what we're going to be talking about. So what we have here is... let's see if I can pause that. What we have here: I'm logged into the system as a regular, unprivileged user. I'm going to open up a very simple C file. All it does is load some values into the EAX register and then execute all of these bound instructions. Now, if you're not familiar with the x86 bound instruction, it is a pretty simple instruction.
All it does is take two arguments and check whether the first argument is within the bounds provided by the second argument. Now, if you look carefully at these bound instructions, what you'll see is that they use some kind of weird values, basically what look like random values on the right. What that means is that these instructions don't actually have access to the memory that they're trying to use. Now, in x86, when you don't have access to the memory that you're trying to use, you get a general protection exception. Every single one of these bound instructions is going to throw one of these general protection exceptions. In Linux, you see that as a segfault. Despite knowing that that's what's going to happen at the end of this program, we are going to try to execute a shell. We'll exit out of here and we will run this program. We'll compile it, launch it, and exactly what we expect to happen will happen: we'll get a segmentation fault, because those bound instructions don't have access to the memory that they're trying to use. I'm still the exact same user I was before, still logged in as delta. Now I'll make one tiny change to this program: I'm going to add one x86 instruction. It's an instruction so obscure and unheard of that it doesn't actually have a mnemonic associated with it. It has to be written in raw machine code: 0F3F. What this instruction is going to do is fundamentally change the nature of all the subsequent bound instructions, so that now, instead of their original functionality, instead of checking whether or not one value is within the bounds of another, they are going to reach into kernel memory and modify the system itself. So I'll compile this and re-execute it. And all of a sudden, we'll see I am now root. So the rest of this presentation is going to be a very long, convoluted journey into how exactly I stumbled across this. And the whole thing begins with this idea of x86 rings.
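To make the bound semantics concrete, here is a small Python sketch of the check the instruction performs. This is illustrative only: the function name and exception type are mine, and the real instruction raises a hardware #BR exception on an out-of-range value, or faults on the memory access itself as described above.

```python
def bound(index, bounds):
    """Sketch of the x86 BOUND instruction's check (illustrative only).

    BOUND compares a register against a pair of signed bounds loaded
    from memory; an out-of-range value raises a #BR exception. If the
    bounds operand points at inaccessible memory, the memory access
    itself faults with a general protection exception first, which
    Linux reports as a segfault.
    """
    lower, upper = bounds  # the two dwords at the memory operand
    if not (lower <= index <= upper):
        raise OverflowError("#BR: bound range exceeded")
    return index  # in range: execution just continues

# In range: no exception, execution continues.
print(bound(5, (0, 10)))  # 5
```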
So in the beginning on x86, about 30 years ago, there was no concept of separation of privileges. All the code running on the processor executed with the exact same privileges as all the other code, and it was basically chaos. But then we added the idea of rings of privilege to the processor in order to separate out permissions. The idea behind rings of privilege was that at the lowest level we would have ring zero. That would be the most privileged realm on the processor, where the kernel would live; only really trusted code would execute in ring zero, and it would have complete access to the hardware. A little less privileged than that was ring one, less than that was ring two, and then way out here in the outermost ring is ring three. That's where the user code lives. Basically all the code that we don't trust to have access to the hardware executes way out here. So this was the fundamental idea behind privileges on x86: if ring three wanted to do anything interesting, it would have to go through very specific, carefully checked hardware privilege access mechanisms in order to ask ring zero to do something for it. And all of our other security on x86 is based off of this fundamental premise of separation of rings. And things got kind of interesting as people started digging deeper. We found out we needed things more privileged than ring zero: the hypervisor had to have more access than the kernel normally would, so we called that ring minus one. We found out some things had to be more privileged than that, system management mode, which we called ring minus two. A couple years ago, some researchers found another core off the x86 core that they called ring minus three, because it could do things that the x86 core couldn't do. So things were interesting. But if you've been following any of this research, sort of in the back of everyone's mind is this question: can we go further?
Are there even more layers to this puzzle? So I set off to explore that idea. A lot of times when I'm starting out on new research, I find an interesting place to begin is with patents. Sometimes you can find really interesting information that people don't document publicly, but you might be able to find bits of useful ideas inside of patents. So keeping this whole ring model of privileges in mind, imagine my surprise when I was reading through a patent and saw, just sort of nonchalantly buried in the middle, this quote: "Additionally, accessing some of the internal control registers can enable the user to bypass security mechanisms, for example, allowing ring zero access at ring three." That's a little bit alarming from a security perspective. You're telling me we've had 30 years of relying on rings to provide our privileges on x86, and there's just some way to circumvent all of this? So they go on to say: "In addition, these control registers might reveal information that the processor designers wish to keep proprietary. And for those reasons, the various x86 processor manufacturers have not publicly documented any description of the address or function of some of these control MSRs." So this was really interesting. I was really, really excited about this and wanted to run with the idea. So, like any rational person, I went out and bought 57 computers to start doing some basic research on. Based on the patent's filing and its owner, I had some idea of which processor I might specifically want to focus on for this research, but patents are a funny thing: IP gets bought and sold by different companies, and ideas trickle through the industry. So I wanted to cast a wide net and look at a bunch of different systems for this research. But what I eventually settled on was the VIA C3 processor.
So these are interesting processors targeted at an embedded, low-power market. You can find these often in point-of-sale systems, kiosks, ATMs, gaming (since we're in Vegas, you might want to start poking around), digital signage, healthcare, digital media, industrial automation, and they're in PCs and laptops as well. So I eventually pulled this specific system off my shelf. It's a thin client running a C3 Nehemiah core, and this is what we're going to look at for the rest of the presentation. At the very end of the presentation, I'll look a little bit more at what other processors might be affected by this issue, but this is the target of this specific research. Now, I wasn't able to find a developer manual for this processor. So when we're left in the dark like that, a good place to go is to follow a trail of patent breadcrumbs: see what information we can derive from patents, and see if there's anything useful in there that can give us hints as to how to move the research forward. I had to dive through a lot of patent literature, and I want to give a little glimpse into what that's like. This isn't actually a quote from the patents that I ended up using, but it is something I stumbled across during this research. The quote says: "Figure 3 shows an embodiment of a cache memory. Referring to Figure 3, in one embodiment, cache memory 320 is a multi-way cache memory. In one embodiment, cache memory 320 comprises multiple physical sections. In one embodiment, cache memory 320 is logically divided into multiple sections. In one embodiment, cache memory 320 includes four cacheways, i.e. cacheway 310, cacheway 311, cacheway 312, and cacheway 314. In one embodiment, a processor sequesters one or more cacheways to store or to execute processor microcode." My head just exploded when I read stuff like this.
It's so convoluted and so wrapped up in legalese that it's hard to understand what the heck you're even reading. And just to put that into perspective, this one four-page patent contained the phrase "in one embodiment" 147 times. So starting your research this way, when you really want to dive into things, is a really, really painful process, but if you're persistent, it can pay off. I eventually settled on six different patents that seemed to have some basic ideas that could point me in the right direction for this research. So I want to highlight some of the key ideas that I got out of those six patents. One is that, at the time, it looked like VIA was embedding a non-x86 core alongside the x86 core in their C3 processors, and this non-x86 core was a RISC architecture. The patents didn't have a consistent terminology for this, but I just started calling it the deeply embedded core, or the DEC, on these processors. They also talked about something they call the global configuration register. It's basically a register that gets exposed to the x86 core through a model-specific register, and they say that this global configuration register can activate the RISC core. They also talked about an x86 launch instruction. It was basically a new instruction added to the x86 ISA where, once the RISC core is activated, you can start a RISC instruction sequence through the launch instruction, according to the patents. So putting all these ideas together, if our assumptions about what this deeply embedded core can do are correct, you could essentially use this as a sort of backdoor, a means of surreptitiously circumventing all of the processor security checks. So that's absolutely ideal, worth exploring more for security purposes. There are three patents that give us some initial ideas on how the overall mechanism might work.
One patent tells us that a model-specific register can be used to circumvent processor security checks. Another patent tells us that a model-specific register can be used to activate a new x86 instruction. And another patent tells us a launch instruction can be used to switch to a RISC instruction sequence. So if we put those three things together and try to fill in the gaps, we end up with a sequence that looks something like this: we have to find a model-specific register bit that, when toggled, will activate a new x86 instruction that didn't previously exist; running that instruction will activate a RISC core on the processor; and that core should be able to bypass the processor's security protections. So where do we begin? Well, let's look at the very first step in that chain, these model-specific registers. If you're not familiar with the idea of MSRs in x86, they are basically 64-bit control registers used for a wide, wide variety of things. They're used for debugging, performance monitoring, cache configuration, feature configuration; basically anything that doesn't have to do with general computing can get tossed into the MSRs. And unlike the x86 registers you might be more familiar with, they aren't accessed by name, they're accessed by address. So instead of EAX and EDX, we have addresses from zero to four billion for accessing our MSRs. Basically, you load an address into the ECX register and then execute the read MSR (rdmsr) or write MSR (wrmsr) instruction in order to access an MSR. Now, there's some saving grace here. We think that maybe we can use this to circumvent the ring protections on the processor, but these MSRs are only accessible from ring zero code to begin with. So maybe the rest of this stuff we can do from ring three, but in order to activate that bit in the first place, we need ring zero execution. So that's good news. Although, maybe we don't need ring zero execution.
I'm going to revisit this idea later in the talk, but so that we can move the research forward, let's assume for now that we have one-time ring zero access on that processor, just to enable this backdoor feature, and we'll revisit that concept later. So let's look at these model-specific registers some more. Well, the patent basically comes right out and tells us that the various x86 processor manufacturers have not publicly documented any description of the address or function of some of the control MSRs. So that's a challenge for us. I think there's a bit in one of these MSRs that might activate something cool, but it's probably not going to be documented. Step one, in order to explore this idea, is just to figure out which MSRs are actually implemented by the processor; regardless of what the documentation says, which MSRs are actually there? This one's pretty easy to solve. Basically, you can set the general protection exception handler on the processor to point to some handler under your control; you can do that with the LIDT instruction. Then you load up an MSR address that you want to check. Let's say you want to figure out: does MSR number 1337 exist on my processor? Well, load that number into the ECX register and then execute the read MSR instruction. Now, if you don't get a fault from that read MSR instruction, you know that MSR exists. If you do get a fault, if your handler gets control, you know that MSR doesn't exist. So this is a really easy way to figure out exactly which MSRs actually exist on your processor, regardless of what any documentation says. When you do this on the VIA C3, you end up with an alarming number of MSRs, way more than would normally be on an x86 processor. We found 1,300 implemented MSRs on that processor, which is far too many to analyze. I think one of these is going to activate a new x86 instruction, but that's too many to go through by hand. So step two is figuring out which MSRs here are actually interesting.
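The enumeration loop from step one can be sketched in Python. The `rdmsr` callback here is a hypothetical stand-in for the real probe, which runs in ring zero with a fault handler installed via LIDT; a fault on the read means the MSR is unimplemented.

```python
def enumerate_msrs(rdmsr, addresses):
    """Find which MSRs are implemented by attempting a read of each.

    `rdmsr` models the RDMSR instruction: it returns a value for an
    implemented MSR and raises (like the #GP fault caught by the
    installed handler) for an unimplemented one.
    """
    implemented = []
    for addr in addresses:
        try:
            rdmsr(addr)          # load addr into ECX, execute RDMSR
        except OSError:
            continue             # faulted: this MSR does not exist
        implemented.append(addr) # no fault: this MSR exists
    return implemented

def fake_rdmsr(addr):
    # Hypothetical stand-in for hardware: only 0x1107 exists here.
    if addr != 0x1107:
        raise OSError("#GP: unimplemented MSR")
    return 0

print([hex(a) for a in enumerate_msrs(fake_rdmsr, [0x1105, 0x1106, 0x1107])])
# ['0x1107']
```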
Which ones should I be exploring for this research? So I came up with the idea of a timing side channel attack against the MSRs, where basically what I would do is calculate how long it took to access each of the four billion possible MSRs. What that looks like is this: you've got the read MSR instruction, and on either side of it you've got a serialized read timestamp counter (rdtsc) instruction that lets you see exactly how long your read MSR took to execute, and that shows you how long it took to access the MSR. If you run this code, it looks something like this: on the x axis I've got the MSR numbers, and on the y axis I've got the access time for each MSR. Green is MSRs that are implemented on the processor; red is MSRs that are not implemented. This side channel attack actually gives you some really, really interesting insights into how the processor works under the hood that we normally would never have access to. But what we can do with this is form a hypothesis. I'm going to theorize that this global configuration register, this really powerful register that the patents talk about, is probably unique: there are probably not several similar versions of this register within the MSRs; there's probably exactly one of these on the processor. So I can start to look for MSRs with unique access times, like the ones circled in red. And when we actually start to do that, what we find is that there are relatively few MSRs that are truly unique on this processor. In fact, out of the 1,300 MSRs that are there, we identify 43 that actually seem interesting and worth exploring more. So that's good; it seems like we're making progress. I've whittled the number of MSRs down from initially 4 billion to 43 to actually explore in this research. But it's still a lot to tackle by hand: that's 2,752 bits worth of MSRs to check.
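The unique-access-time filter can be sketched as grouping MSRs by their measured timing and keeping the singletons. The sample cycle counts below are invented for illustration.

```python
from collections import Counter

def unique_timing_msrs(timings):
    """Keep only MSRs whose access time (in cycles) is unique.

    `timings` maps MSR address -> serialized rdtsc delta around the
    rdmsr. MSRs sharing a timing with others likely belong to a common
    bank; a one-off timing suggests a one-off register, which is what
    we would expect the global configuration register to be.
    """
    counts = Counter(timings.values())
    return [msr for msr, t in timings.items() if counts[t] == 1]

# Made-up cycle counts: three MSRs share a timing, one stands alone.
sample = {0x1105: 90, 0x1106: 90, 0x1107: 312, 0x1108: 90}
print([hex(m) for m in unique_timing_msrs(sample)])  # ['0x1107']
```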
Now, my theory is that one of these bits is going to activate a new x86 instruction. So I want to figure out: when I toggle these bits, did a new instruction appear on the architecture? Well, x86 is a really complicated instruction set, and it's hard to estimate how many instructions could actually be in x86, but a rough upper bound would be somewhere on the order of 1.3 undecillion possible instructions. So I want to figure out whether one of these MSR bits created a new instruction, but I've got to search through 1.3 undecillion instructions in order to find that new instruction, if it appeared at all. Even taking a really optimistic estimate, if we could scan through a billion possible x86 instructions every second, you can do some quick Fermi calculations: 1.3 undecillion, divided by 1 billion instructions per second, divided by 60 seconds a minute, divided by 60 minutes an hour, divided by 24 hours a day, divided by 365 days a year, means it's going to take about an eternity to scan the entire instruction set. So that's not reasonable. Making things worse, that's for one scan. We'd have to scan for each of these 2,752 bits; that's about 2,752 eternities in order to find which bit creates a new instruction. Fortunately, there's a better solution. I released a tool last year called sandsifter, which was kind of neat. It found a smart way of searching through the x86 instruction set, using page fault analysis and a depth-first search algorithm, in order to scan all of x86 for the most probable instructions in about a day, and it can be used to find things like undocumented instructions or new instructions appearing. But it still can't be run 2,752 times if it takes a day to scan a single bit. So what we can do instead is try to toggle each of those 2,752 candidate MSR bits one by one. But these are configuration bits. These control the inner workings of the processor, and I have no idea what these bits actually do.
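The "eternity" estimate checks out with quick arithmetic:

```python
# Back-of-the-envelope check of the "eternity" claim: scanning
# ~1.3 undecillion (1.3e36) candidate instructions at a very
# optimistic one billion instructions per second.
instructions = 1.3e36
per_second = 1e9
seconds = instructions / per_second
years = seconds / (60 * 60 * 24 * 365)
print(f"about {years:.1e} years")  # about 4.1e+19 years
```

Even at a billion instructions per second, a single exhaustive scan would take on the order of 10^19 years, which is why brute force is off the table.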
So a lot of them are going to lock the processor, or freeze it, or panic the kernel, or reset the system entirely. Trying to toggle 2,000-some bits one by one can't be a manual process; we need some sort of automation to make this work. So the system I came up with looked something like this. We have a target, a C3 processor, hooked up to a relay: basically, a wire is soldered onto the power switch on the target, and that's hooked up to a relay. The relay is hooked up to a master system, so the master can power cycle the target through the relay. The target also network boots through a switch from the master, and the master can then SSH into the target. The master tells the target: toggle this MSR bit. Then the master checks: did something become unstable, did the system stop responding, did the kernel panic? If not, it's going to try to toggle the next MSR bit. If so, it's going to power cycle the target. This way the master can repeatedly go through and identify exactly which of those 2,752 MSR bits can be activated without the target becoming unstable. Over the course of about a week, through hundreds of automated reboots, we were able to identify exactly which bits we can actually turn on on that processor. After that, what we do is turn on all the bits that we possibly can, get all of those bits on at once, and only then do we actually run sandsifter; only then do we scan the processor for new instructions. So this is what that looks like, using sandsifter for this purpose. Like I said, sandsifter uses page fault analysis and a depth-first search in order to search through the x86 ISA. We're watching sandsifter in the middle of a scan here, probably about 12 or 15 hours in, and what you can see it doing is generating machine code in order to search through the feasible x86 instructions on this processor.
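The master's toggle-and-watch loop from the rig described above might be sketched like this; the relay, SSH, and liveness checks are reduced to injected callbacks, and all the names are hypothetical.

```python
def survey_bits(candidate_bits, toggle_bit, target_alive, power_cycle):
    """Classify which MSR bits can be set without destabilizing the target.

    For each candidate bit, the master asks the target (over SSH in
    the real rig) to toggle it, then checks whether the target still
    responds. If it hung or panicked, the relay power-cycles it via
    the wire soldered to the power switch, and the bit is recorded
    as unsafe.
    """
    safe, unsafe = [], []
    for bit in candidate_bits:
        toggle_bit(bit)
        if target_alive():
            safe.append(bit)
        else:
            unsafe.append(bit)
            power_cycle()  # relay reboot restores the target
    return safe, unsafe

# Simulated target that locks up when bit 7 is set.
state = {"alive": True}
result = survey_bits(
    [3, 7, 9],
    lambda b: state.update(alive=(b != 7)),
    lambda: state["alive"],
    lambda: state.update(alive=True))
print(result)  # ([3, 9], [7])
```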
And if you let sandsifter run for long enough, eventually it will spit out something new in that lower window there. After about a day, sandsifter finds exactly one new instruction on that architecture: 0F3F. So this must be the launch instruction that the patents are talking about, the new instruction that got activated by these MSR bits. Through GDB and some trial and error, I was able to figure out that the launch instruction is basically a jmp eax instruction. Now, I activated all of these possible MSR bits in order to find the launch instruction, so I'm curious which bit was really responsible for activating it; from that, we can figure out which MSR is the global configuration register. Fortunately, now that I've identified the 0F3F instruction, I no longer need to run sandsifter for additional scans. What I can do is activate each of those MSR bits one by one, and after activating each one, I'll try to execute the launch instruction. If it doesn't work, then that was the wrong bit. If suddenly the launch instruction appeared on the architecture, that means I found the bit that activated it. Using that approach, we were able to find that MSR 1107 was the MSR that enabled the launch instruction, and specifically it was bit zero inside of MSR 1107. Now, I suspect that this is going to open the door to another architecture on the processor, which will let me bypass those ring protections once and for all and circumvent all the processor security checks. So, because of how much power that single bit has potentially just enabled, I started calling bit zero of MSR 1107 the god mode bit. At this point, we've figured out the god mode bit, and we've figured out this hidden launch instruction in the x86 ISA. The next question is: how do I actually execute instructions on this new RISC core that we've enabled? We can dive into the patents to try to figure this out.
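The bit-narrowing pass just described (set one candidate bit, try the launch instruction, clear the bit, repeat) can be sketched as a short loop; the callbacks are hypothetical stand-ins for the real MSR writes and the attempted 0F3F execution.

```python
def find_enabling_bit(candidates, set_bit, clear_bit, launch_works):
    """Identify which single MSR bit makes the launch instruction decode.

    Activate one candidate (msr, bit) pair at a time and try to execute
    the launch instruction; if it no longer faults as an invalid opcode,
    this was the enabling bit.
    """
    for bit in candidates:
        set_bit(bit)
        appeared = launch_works()  # try to execute 0F3F
        clear_bit(bit)
        if appeared:
            return bit
    return None

# Simulated processor: only bit 0 of MSR 0x1107 enables the instruction.
state = {"set": None}
found = find_enabling_bit(
    [(0x1106, 0), (0x1107, 1), (0x1107, 0)],
    lambda b: state.update(set=b),
    lambda b: state.update(set=None),
    lambda: state["set"] == (0x1107, 0))
print(found == (0x1107, 0))  # True
```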
The patents hint at this idea that instructions are fetched out of memory and then passed to different decoders, depending on whether you're in x86 or RISC mode. I had to go through a lot of trial and error to figure out exactly what that might look like under the hood, but this is what I ended up with. Essentially, an instruction is fetched out of the instruction cache and passed to some sort of pre-x86 decoder. That decoder breaks apart the components of an x86 instruction, and then those components get passed to a check, and that check determines: am I in RISC mode or not? Namely, has that launch instruction just been executed? If the answer is no, those components are all passed on to a further x86 decoder, and the instruction executes as x86. If the answer is yes, one of the components of that instruction, a 32-bit constant value, is torn out and passed over to the RISC decoder in order to execute as a RISC instruction. So basically, with this setup, there needs to be some x86 instruction where, if the processor is in RISC mode, it can pass a portion of itself over to the RISC core on this chip. And since this instruction, whatever it is, sort of joins the x86 and RISC cores, I call it a bridge instruction. It basically gives me a way to feed instructions over to the RISC core. So the next question is: how do we find this unknown x86 bridge instruction that will let me execute RISC instructions? It's not easy, but it should be sufficient just to detect that a RISC instruction has been executed. So how can we detect that a RISC instruction has been executed, given that I have no idea what these instructions can actually do or what their execution will look like? Well, an easy way would be this: if our theory is right, if this RISC core really can circumvent processor security checks, then there should be some RISC instruction.
I don't know what, but some RISC instruction, even when executed in ring three, should be able to corrupt the system, and corruptions are usually pretty easy to detect. That should look like a processor lock, or a kernel panic, or a system reset. Basically, if I observe any of those behaviors, that means I've executed one of these mysterious RISC instructions, because none of those things should ever happen when normally executing ring three x86 instructions. So I tore apart sandsifter to help me with this. I ripped out the core of sandsifter and changed it to run in a brute-force mode. It's still executing x86 instructions, but before each x86 instruction that it generates, it's going to execute the launch instruction in order to switch to RISC mode. And what it's trying to find is some x86 instruction that corrupts the system. That's actually what we just saw here: we saw the processor lock when sandsifter hit exactly the right combination of x86 and RISC instructions. Once we observe that processor lock, that means we've found this bridge instruction, this x86 instruction that can execute a RISC instruction. It takes about an hour of fuzzing to get here; we just saw a short snippet. But it turns out that this bound eax instruction is the bridge instruction. It's what's going to let me feed instructions over to the RISC core. And specifically, the 32-bit constant value used in the bound instruction appears to be the RISC instruction that gets executed on the deeply embedded core. That's a pretty easy thing to check: basically, I can see that for some specific 32-bit values the processor locks every single time, and for other 32-bit values nothing seems to happen, very, very consistently. So now we've found the bridge instruction, and I know how to send instructions over to that deeply embedded core. The next question is: what do we want to execute on this alternate instruction set? What do these instructions even look like?
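One plausible way to wrap a RISC instruction in the bridge is sketched below: a launch instruction followed by a bound instruction whose 32-bit constant is the RISC payload. The exact `bound eax, [disp32]` encoding (opcode 0x62 with a disp32 ModRM) is my assumption about the form; the talk only establishes that the constant embedded in the bound instruction is what reaches the RISC core.

```python
import struct

LAUNCH = bytes([0x0F, 0x3F])  # the undocumented launch instruction

def bridge(risc_insn):
    """Encode a hypothetical `bound eax, [disp32]` bridge instruction.

    0x62 is the BOUND opcode; ModRM 0x05 is mod 00, reg eax, rm 101,
    i.e. a 32-bit displacement in 32-bit mode. The RISC instruction
    rides along as that little-endian 32-bit constant.
    """
    return bytes([0x62, 0x05]) + struct.pack("<I", risc_insn)

# Launch into RISC mode, then feed one RISC instruction across.
code = LAUNCH + bridge(0xA7719563)
print(code.hex())  # 0f3f6205639571a7
```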
What architecture am I even dealing with? Ideally, moving forward, I would just assume that this other architecture is probably some known, common architecture, probably something like ARM or PowerPC or MIPS; it doesn't make a whole lot of sense to invent an architecture from scratch. So we could assume that this other architecture is some common architecture, and I could try to execute a common architecture's instructions on this other core. For example, if I thought maybe I was dealing with ARM, I might try to execute an "add 1 to r0" instruction. The problem I encountered was that for some of these very, very simple RISC instructions I was generating, the processor would lock. So if I generate an instruction like "add 1 to r0", and I try to execute that on this RISC core, and the processor locks, that probably means I'm not dealing with the architecture I thought I was dealing with. I'm probably not dealing with ARM if that simple instruction locked everything up. I was actually able to rule out 30 different common architectures this way. And I still think it's most likely that this other architecture, this RISC core we're dealing with, is derived from some common architecture, but it seemed to be modified enough that I couldn't identify it. So that forced me to move forward treating this thing as a black box, basically as some totally unknown architecture that we've never seen before. That means we have to reverse engineer the format of these instructions for this deeply embedded core. I spent probably the bulk of this research trying to reverse engineer the format of these instructions, and I started calling them the deeply embedded instruction set, or the DEIS (pronounced "dice"). So the question is: how do we begin reverse engineering a totally unknown instruction set? Ideally, what I would do is execute one of these RISC instructions and observe its results.
The challenge, though, is that I have absolutely no knowledge of this instruction set architecture that I'm dealing with, and I probably can't observe the results on the RISC core. For example, if I did generate an instruction like "add 1 to r0", the question is: how do I view r0 from my x86 core? How do I even detect that I've made any changes to the RISC core? Fortunately, there is an approach for this. The patents actually suggest that these two cores share a register file, or at least partially share one, which means I may not be able to view all the effects of my RISC instruction, but I should be able to view some of the effects from the x86 core. For example, the patents show a case where an ARM and an x86 core share some of their registers. So that means that from the x86 core I can actually see some of the effects of these instructions. What this would look like, then, is this: I generate an initial state for the processor. I generate a bunch of values for the processor registers, maybe random values, maybe set all those registers to be pointers to various buffers, and I record that information. I generate some buffers in userland memory and in kernel memory, and record those buffers; basically, I record all the information I can about the system state. Then I execute that launch instruction, that 0F3F instruction, in order to switch on the RISC core. After that, I execute the bridge instruction, the bound eax instruction that lets me send a RISC instruction over to the RISC core, and I generate an arbitrary RISC instruction to try out. After that instruction executes, I record the resulting system state: all the registers, all the buffers, and everything else. And what I want to see is: did something change on the system? Did the input state and the output state differ in any way?
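Recording and diffing the system state around each candidate instruction reduces to something like the following sketch, where a snapshot is just a mapping from register and buffer names to values (the sample values are invented):

```python
def state_diff(before, after):
    """Compare snapshots taken around a candidate RISC instruction.

    Any key whose value changed is an observable effect of the
    instruction, visible from the x86 side through the shared
    register file or the recorded memory buffers.
    """
    return {name: (before[name], after[name])
            for name in before if before[name] != after[name]}

pre  = {"eax": 0x1111, "edx": 0x2222, "buf": b"AAAA"}
post = {"eax": 0x1111, "edx": 0xDEAD, "buf": b"AAAA"}
print(state_diff(pre, post))  # edx changed; eax and buf did not
```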
Unfortunately, we run into even more challenges here. I'm dealing with a totally unknown instruction set that probably has unfettered access to ring zero, so it is really, really easy to accidentally generate instructions that cause kernel panics, or processor locks, or complete system reboots. In practice, I could only generate about 20 RISC instructions before the system became unrecoverably corrupted and I had to reboot it and start over. Even after optimization, it took about two minutes for one of these systems to boot. So some more rough approximations indicated it was going to take months and months of fuzzing to gather enough data to reverse engineer this instruction set. So I expanded my initial setup. Instead of fuzzing one target system, I bought as many of these systems as I could on eBay, which turned out to be seven of these thin clients, and I hooked them all up. If you look carefully at this, what you'll see is that each of these systems has a little green wire coming out of the chassis; that wire is soldered onto the system's power switch. All of those wires go to a relay module, and that relay module is hooked up to a master system over USB. All of these systems boot over the network from the master system. So the master can boot up these systems, SSH into them, and assign each of them fuzzing tasks for this RISC architecture. It can then record all the results of those fuzzing tasks for offline analysis. And when it detects that one of these systems has become corrupted, when it stops responding or the kernel panics, the master can use those relays to reboot the target. We can actually see an example of this in action. What I'm going to do is tell the master to start a fuzzing task. It's going to think for a little bit as it generates some tasks for the targets. And then, if you listen to the relays, you can hear each relay clicking on, and you can see the relay lights.
And if you look closely at those target systems, you can see the green LEDs on them coming on one by one as each of those systems boots up. For brevity's sake, I'm going to cut this video a little bit short, but if you'd like to see a full demo afterwards, you can grab me. In about a minute, when these systems come online, we'll be able to see the master start tasking the targets with fuzzing jobs, and we'll be able to see logs coming back to the master. Eventually the master will detect that some of these systems have frozen, and it will power-cycle them through the relays so that this process can continue. So this was slow fuzzing work. It's fairly laborious. I let the system run for about three weeks. I collected 15 gigabytes of logs across 2.3 million different state diffs, for about 4,000 total hours of compute time. But it's really exciting once you have the final results. We can start sifting through these logs in order to try to find something interesting. Basically, I'm curious: according to these logs, can I actually do anything to ring zero from ring three? And it turns out we can. We can pretty quickly start to find some things that shouldn't exist in a secure world. For example, with the instruction A7719563, if you look at the EDX register, I was able to read control register zero into EDX. Control register zero is supposed to only be accessible by ring zero, but we just read it from ring three using one of these RISC instructions. And we're not limited to reading ring zero data. If we look here, instruction 8AB4 was actually able to write debug register zero: you'll see that the EBP register got written to DR0. DR0 should only be accessible to ring zero, so we're circumventing the ring protections through this RISC core. So at this point, it's time to start thinking about a payload: what we should actually have this RISC core do for us.
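Sifting the logs for those ring crossings can be as simple as filtering the state diffs for fields that ring-3 code should never be able to touch. A minimal sketch, assuming each log entry is an (instruction, diff) pair like the fuzzer above would produce:

```python
# State fields that only ring 0 should be able to read or write: the control
# and debug registers. Any RISC instruction whose diff touches one of these
# crossed a ring boundary from ring 3.
PRIVILEGED = {"cr0", "cr2", "cr3", "cr4", "dr0", "dr1", "dr2", "dr3", "dr6", "dr7"}

def ring_crossings(results):
    """Find fuzzed instructions whose observed effects touched ring-0 state."""
    hits = []
    for insn, diff in results:
        touched = PRIVILEGED.intersection(diff)
        if touched:
            hits.append((insn, sorted(touched)))
    return hits
```

Against the real logs, this kind of filter is what surfaces entries like A7719563 (CR0 read into EDX) and 8AB4 (EBP written to DR0).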
And really, once you can reach through the ring boundaries, the sky's the limit; you can do whatever you want. But I thought it would be useful to have some sort of easy-to-demonstrate payload, and a good demonstration would simply be elevating the current process to root permissions. That would look something like grabbing a structure called the global descriptor table, and parsing out of the GDT the descriptor for the FS segment. That FS segment can point you to a structure in kernel memory called the task struct. If you grab a certain field from the task struct, you get a pointer to what's called the cred struct, and from there you can actually set yourself to have root permissions, as long as you can reach directly into kernel memory to grab and modify all of this information. Now, there are really only a few pieces here, those parts highlighted in red, that require us to cross ring boundaries. Things like addition and bit manipulation we can technically do on the x86 core; we don't need this RISC core to do that for us. But it was kind of fun having this unknown core execute instructions, so I thought it'd be a little bit more impactful if I wrote this entire payload for the deeply embedded core and never used x86. So I've got 15 gigabytes of logs and I want to try to build this payload, so it's time to start sifting through those logs, trying to find some primitives to use. This sort of feels like building a ROP chain: you've got these tiny little pieces of functionality, you have an overall idea of what you're trying to accomplish, and you're trying to figure out how to piece together those little pieces in order to form your final payload. So we can start finding those little pieces that we need. For example, A331, or A313, was an instruction that was able to read the global descriptor table. That's the first piece of our payload. I'm able to find a kernel read instruction inside of those logs.
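The pointer walk described above can be sketched abstractly. Every offset below is an invented stand-in (the real values depend on the kernel version and were presumably recovered for the specific target), and `read32`/`write32` stand in for the kernel-read and kernel-write primitives found in the logs:

```python
# Hypothetical offsets for illustration only; real values are kernel-specific.
TASK_PTR_OFFSET = 0x00   # where the task pointer lives relative to the FS base
CRED_PTR_OFFSET = 0x08   # task struct field holding the cred pointer
UID_OFFSET      = 0x04   # uid field inside the cred struct

def escalate(read32, write32, fs_base):
    """Walk FS base -> task struct -> cred struct, then zero the uid (uid 0 = root)."""
    task = read32(fs_base + TASK_PTR_OFFSET)
    cred = read32(task + CRED_PTR_OFFSET)
    write32(cred + UID_OFFSET, 0)
    return cred
```

In the actual payload, every one of these reads, writes, and even the additions is done with the embedded core's own instructions rather than x86.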
With D407, you'll notice the low byte of EBP got read out of a kernel memory buffer. And I can actually modify kernel memory too: E2B7 was able to write a single byte into kernel memory. Now, this is really promising. If I can write a byte into kernel memory, it means I can do precision exploitation through the deeply embedded core. But in the end, I decided that sifting through logs like this by hand just doesn't scale. I want to be able to write more robust payloads for this deeply embedded core, so I wanted some sort of automated approach: some way to extract behavior patterns from these state diffs in order to identify the bit patterns inside of these instructions. So I built a tool called the collector. The collector basically does automated reverse engineering of completely unknown instruction sets. The way the collector works is it looks at those state diffs recorded by the fuzzer and tries to identify basic operations: loading immediate values into registers, transferring one register to another, reading memory, modifying memory, shifting registers, any number of arithmetic and bitwise instructions. The collector tries to identify those just by looking at differences between the input state and the output state across RISC instructions. Then it bins instructions based on what effects it saw those instructions having. So it might spit out, well, here is a bin of instructions that all exhibited this property of transferring one register to another. And after it has these instruction bins, it looks for patterns within the bins. For example, one of the first things we might want to figure out is how the register values for each of these instructions are encoded inside of the instruction. For example, where is EAX encoded in this instruction?
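The binning step could look something like this. A rough sketch only, with a deliberately tiny set of behavior classes; the real collector recognized many more operations:

```python
def classify(diff, before):
    """Guess a coarse behavior class for one instruction from its state diff."""
    for reg, (_old, new) in diff.items():
        if reg == "buffers":
            return "memory_write"
        # Did the changed register's new value come straight from another register?
        for src, val in before.items():
            if src != reg and val == new:
                return "register_transfer"
        return "load_immediate"
    return "no_effect"

def bin_instructions(results):
    """Group fuzzed instructions by the behavior class their diff suggests."""
    bins = {}
    for insn, diff, before in results:
        bins.setdefault(classify(diff, before), []).append(insn)
    return bins
```

Each bin then becomes the input to the pattern analysis: instructions that all do "the same thing" but differ in their operands.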
So for each of these instructions, it would look at which bits might represent EAX and which might represent EDX. That's what I have highlighted in purple: for each individual instruction, the bits that might be encoding the registers the collector saw being used by that instruction. Then the collector tries to identify patterns across the entire bin. We can see that these two middle columns are the only consistent piece of this pattern; in other words, these two middle columns must be the thing that encodes the input and output registers for these instructions. And we can do that for all sorts of different facets of an instruction encoding. You can try to figure out which bits are used to encode the opcode of the instruction. It's not a perfect process. Ideally we would see totally consistent patterns across the entire bin, and that's not what we see here, but the collector will try to pick the things that are most common. In other words, in this situation it'll say, well, these are the most likely bits that encode your opcodes. It can even try to find things like don't-care bits in the instructions, or bits that seem to follow some sort of statistical pattern even though we can't necessarily tell what they are. Then it mashes all this information together in order to automatically derive the bitwise encoding of each possible instruction on this deeply embedded core. Looking at the bins that it creates, and at what sort of functionality we might want to get out of the deeply embedded instruction set, these are the encodings it came up with for a variety of different instructions. We've got instructions to move registers around and to load the global descriptor table. Basically, we have a primitive assembly language now, so we can write payloads for this deeply embedded core.
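The core of that pattern mining, finding the bit positions that never vary within a bin (the likely opcode bits) versus the ones that do (likely operand bits), can be sketched in a few lines:

```python
def constant_bits(bin_insns, width=32):
    """Across a bin of same-behavior instructions, find the bit positions that
    hold the same value in every instruction (the likely opcode bits) and what
    that value is. The remaining, varying bits likely encode operands."""
    same_mask = (1 << width) - 1
    first = bin_insns[0]
    for insn in bin_insns[1:]:
        same_mask &= ~(insn ^ first)      # clear any bit that ever differs
    return same_mask, first & same_mask   # (which bits are fixed, their values)
```

This is only the cleanest case; as noted above, the real data is noisier, so the actual collector had to settle for "most common" patterns rather than perfectly consistent ones.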
Now that we know the instruction encodings, we could write some payloads by hand if we wanted, but I thought it'd be cooler if I wrote an actual assembler for these things, so that I could write some of these programs at a higher level. So I wrote the DICE assembler, a custom assembler just for this unknown instruction set, which assembles these different primitives into their raw binary representation and then wraps them in one of these x86 bridge instructions, so that the DICE instruction can be sent over to the deeply embedded core for execution. Now we're ready to revisit that payload idea. We can use the LGD instruction to read out the global descriptor table. We can use some of our other DICE instructions to parse that descriptor field, to grab a pointer to the task struct, to grab a pointer to the cred struct, and to write to the cred struct, circumventing the ring protections and modifying kernel memory in order to give ourselves root-level access on the system, all using nothing but instructions specifically for this unknown architecture embedded alongside our x86 core. The output of this, after you run it through the assembler, looks something like this. Basically, we activate that deeply embedded core with the launch instruction that's on the left. Then we execute all these bound instructions; each bound instruction sends a single instruction over to the deeply embedded core for execution by the RISC processor. These instructions are going to circumvent the processor security mechanisms in order to grant the current process root permissions, and then we launch a shell. So let's revisit that demonstration from the beginning and walk through it in a little bit more depth. Here we have our complete payload doing the steps that we just talked about. We load the address of the payload into the EAX register, and then we execute this launch instruction on the C3 processor.
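The shape of such an assembler can be sketched as below. To be clear, both the opcode table and the bridge byte layout here are invented for illustration; the real encodings are the ones the collector derived, and the real bridge is the bound instruction described in the talk.

```python
import struct

# Toy opcode table: these encodings are made up for illustration, standing in
# for the bit patterns the collector recovered from the fuzzing logs.
OPCODES = {
    "mov": 0x0,   # rd <- rs
    "ld":  0x1,   # rd <- [rs]
    "st":  0x2,   # [rd] <- rs
    "add": 0x3,   # rd <- rd + rs
}

def assemble(op, rd, rs):
    """Pack one primitive into a 32-bit RISC word (hypothetical bit layout)."""
    return (OPCODES[op] << 24) | (rd << 8) | rs

def bridge(risc_word):
    """Wrap a RISC word so the x86 core ferries it to the hidden core. The byte
    layout here is a placeholder, not the real bridge-instruction encoding;
    0x62 is the x86 BOUND opcode."""
    return b"\x62" + struct.pack("<I", risc_word)
```

A payload is then just a launch instruction followed by a sequence of `bridge(assemble(...))` blobs emitted into the program.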
After that, we've got our actual payload: all of these bound instructions, the x86 instructions that are able to communicate with the RISC core. Each bound instruction sends a single RISC instruction over to the deeply embedded core, and at the very end of this sequence, we launch a shell in order to hopefully gain root permissions. So if we exit out of here and run this program... we'll compile it first, and we'll double-check exactly who we are again, so we are just a regular unprivileged user. But when we run the program, we use that deeply embedded core in order to gain root permissions. So, thanks. Now, I want to toss this out here only sort of tongue-in-cheek. Researchers have explored down to what we were calling ring minus three, but I want to pitch this as sort of a ring minus four. It is in some ways more powerful than our previously known ring minus three. It is a core co-located with the x86 core. Unlike ring minus three, it has unrestricted access to the x86 core's register file, and it shares a lot of the execution pipeline with the x86 core, which gives it more power than ring minus three. But at the same time, the whole thing is nebulous: when we're this deep, the whole ring model has completely broken down and is sort of nonsensical anymore. The point is, we've now gotten direct ring three to ring zero hardware privilege escalation on x86. That has never actually been accomplished before. Everything else where we've come close has had to rely on code in the operating system, or on other bugs, to work as a launching point. This is a purely hardware approach to accomplishing this. So the good news is that, fortunately, we need initial ring zero access: everything else in this payload happens in ring three, but we have to activate the backdoor using one-time ring zero access to set the God mode bit. At least in theory. But I wanted to poke around with this a little bit more.
Here we're looking at a different system, another system I had sitting on my shelf. This is a VIA C3 processor, but it's a Samuel 2 core instead of the Nehemiah core that I was using. What we just saw is the system cleanly boot; it's a freshly booted, totally unmodified operating system. I'm going to log in as a regular unprivileged user. It's not the fastest processor, but once we get to a prompt, what I'm going to do is insert the msr kernel module in order to gain access to the MSRs. You'll notice I'm sudo here, but I'm not writing anything; I'm only using this to read things out to show you. I don't have the rdmsr tool installed on here, so we're just going to use the hexdump tool to dump out that global configuration register, MSR 1107. The low bit of that register was the God mode bit. If you look at that low byte, what you see is 1101011. The low bit there, the God mode bit, is on by default when the system boots. That means that on the Samuel 2 core, any unprivileged code that knows what it's doing can escalate to kernel-level privileges at any time. And when you totally break down x86 ring protections and privileges like this, all of our other protections fall like dominoes. Antivirus does nothing when you can just reach directly into ring zero. ASLR and DEP don't help you anymore when you can just circumvent the ring privilege model. Code signing and control flow integrity, kernel integrity checks: none of this helps once rings mean nothing. But there are mitigations. None of them are great, but there are opportunities here. One is we could update the processor microcode in order to lock down the God mode bit; just don't let people set the God mode bit to begin with. We could update the microcode to disable any microcode assist on the bridge instruction, so that you couldn't send instructions over to that RISC core.
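Checking that bit yourself is straightforward on Linux: the msr module exposes `/dev/cpu/N/msr`, where the MSR number is the read offset. A small sketch (the MSR number comes from the talk; whether bit 0 is the God mode bit on your stepping is something to verify, not assume):

```python
import os
import struct

GOD_MODE_MSR = 0x1107   # the global configuration register named in the talk

def god_mode_bit(raw8):
    """Extract the low bit from an MSR value read as 8 little-endian bytes."""
    (value,) = struct.unpack("<Q", raw8)
    return value & 1

def read_msr(msr, cpu=0):
    """Read one MSR via the Linux msr module (needs root and `modprobe msr`)."""
    fd = os.open(f"/dev/cpu/{cpu}/msr", os.O_RDONLY)
    try:
        return os.pread(fd, 8, msr)   # the MSR number is the file offset
    finally:
        os.close(fd)
```

So `god_mode_bit(read_msr(GOD_MODE_MSR))` returning 1 would correspond to what the demo shows on the Samuel 2: the backdoor enabled by default at boot.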
You could update the operating system and firmware in order to disable the God mode bit during boot, and then periodically make sure that that bit has stayed off and nobody has enabled the backdoor. But at the end of the day, the point is kind of moot. This is an older processor, it's not in widespread use, and I don't really want to throw VIA under the bus here. I think their target market is embedded, and I think this was probably a useful feature for their customers; it was just a very flawed implementation in some of their earlier iterations of the processor. So instead, let's take this as a case study. The reality is that backdoors do exist. This is not just conspiracy-theory stuff, but they don't have to remain invisible. We do have the techniques necessary to find these. Looking forward, even though this is not a common processor, I do think this is a big deal. This is not just a big deal for the C3, and not just a big deal for x86; I think this is a big problem for all of computer engineering in general, because this hardware that we're using to protect all of our secrets and do all of our computation, we have no way to introspect. So whether or not backdoors are real in a general sense is sort of irrelevant; that question is going to haunt us as long as we can't introspect our own hardware. I hope that's what people will take out of this. I hope that when you see something that seems off limits, when you see these weird hints in patents (because the patents that I saw were just the tip of the iceberg of some of the weird stuff that's out there), people will dive in deep and really see what's under the hood. Because that's how security works. That's how we build trust in these systems. Along those lines, I've open sourced all of the research that I used for this. All the tools, techniques, code, and data I've demonstrated today are freely available, and I hope people will use them to scan their own systems.
I hope people will use this as a starting point for much, much deeper processor security research. And along those lines, I want to quickly throw out this other idea that I didn't have time to cover in any depth during this presentation. I demonstrated this side channel attack on MSRs, and I actually found some really cool uses for it, totally outside of what we just looked at, and some kind of unsettling results on a number of processors. That's a topic for another presentation: I'll be talking about this tool, Project Nightshift, specifically tomorrow at DEF CON if you're around, and I'll show some of the totally different things I found using this approach. I hope you'll check that out if you have time. In the meantime, all of this stuff is available on my GitHub: that's github.com/xoreaxeaxeax. This project I'm calling Project Rosenbridge, and all the backdoor research is there. Sandsifter, the processor fuzzer that I used heavily throughout this, is also available there. I wrote a single-instruction C compiler a few years back, I've written some cool control flow stuff, there are some system management mode exploits on there, and a lot of other fun stuff I've tinkered with over the last few years is all available there. So I hope you'll check it out. What I'd really love is some feedback and ideas on these topics. If you want to reach out, I'll be right next to the stage after this presentation, or you can ping me at xoreaxeaxeax on Twitter, or the same handle at gmail.com, if you have any feedback or anything else you'd like to discuss. So thank you, everyone, for coming. And that's it.