Hello. I'm Mike Rapoport. I work for IBM Research in Haifa, Israel, and I investigate ways to make containers more secure in general. In particular, the last project I worked on is about how we can checkpoint and restore containers or processes that run in an environment with shadow stack enabled. So today we are going to talk about an overview of hardware protections against ROP attacks. Let's start.
A historical overview: there has always been stack overflow, buffer overflow. According to Wikipedia, it's one of the oldest and more reliable methods for attackers to gain unauthorized access to a computer. Back in the old days, security was less of a concern. Stacks were writable and executable, so the only thing an attacker needed to do was to inject their own code into the stack and somehow make the processor jump to that code. Then they had full control of the program and, if it was a privileged program, of the whole system.
So this is an oversimplified example. We have a function main that has some temporary variable, and it calls the function foo and passes to it a string that came in as a program argument. What foo does is copy the string it received as its parameter into another string, a buffer allocated on the stack, and then print the result. Obviously, it's not a real-life example, but it makes it much easier to demonstrate how things work. The stack in the upper corner is an example of how the stack would look for these two calls. The stack grows upwards in the picture. We have the lower frame for the main function with tmp, and it continues below the picture. Then the CPU pushes the return address of the place foo was called from, and then there are the local variables of the function foo.
Now, if the argument for main is a bit longer than 16 characters, the copy will overrun into the part of the stack where the return address and the frame of main are stored. Essentially, an attacker passes a string that contains a return address pointing to another place, for instance into the stack itself, followed by the binary code the attacker wants to execute. That's what we get in the lower picture: the return address points to 4F0, and there we'll see, for instance, a call to system. It's not a complete example, you need arguments and so on, but in general it illustrates pretty well that with a writable and executable stack you can inject your own code directly onto the stack and from there execute whatever code you like.
As time passed, there was much more awareness of security, and people started to use all kinds of protections: stack canaries, stacks that are not executable anymore, write-xor-execute protection so that writable data is not executable, and so on. There is also randomization of the virtual memory layout of processes. So injecting code like this is pretty much impossible in the modern world.
To overcome this, bad people found a really good way, which is broadly called return-oriented programming, or code chaining, or return-to-library. There are lots of names that refer to the same paradigm: you find the chunks of code you would like to execute in the existing program. For instance, libc has really a lot of code, so it's easy to find binary code that will serve the purpose of an attacker. Then all you need to do is inject a return address onto the stack so that it points to the sequence of code you would like to execute. This sequence is called a gadget, and a gadget ends in a return instruction. Then it returns back to the original frame, where the next injected address is waiting, and so it goes.
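As a rough illustration of the chaining idea, here is a toy Python model rather than a real exploit. The addresses, the gadget functions, and the `run_returns` helper are all made up for this sketch; the only point is that the attacker writes nothing but data (addresses and arguments), and every piece of code that runs already existed in the program.

```python
# Toy model of return-oriented programming. Addresses and gadgets are invented.

def gadget_load_arg(state):
    """Stands in for a chunk like 'pop an argument into a register; ret'."""
    state["rdi"] = state["stack"].pop()


def gadget_call_system(state):
    """Stands in for a chunk that ends up calling system() with that register."""
    state["log"].append(f"system({state['rdi']!r})")


# Hypothetical addresses of chunks that already exist in the program or in libc.
CODE = {
    0x401000: gadget_load_arg,
    0x402000: gadget_call_system,
}


def run_returns(stack):
    """Emulate a run of ret instructions: pop an address, run the chunk there."""
    state = {"stack": stack, "rdi": None, "log": []}
    while state["stack"]:
        target = state["stack"].pop()
        CODE[target](state)
    return state["log"]


# What the overflow leaves on the stack: gadget addresses plus their data.
# The list is used as a stack, so the last element is popped first.
injected = [0x402000, "/bin/sh", 0x401000]
print(run_returns(injected))
```

Nothing in `injected` is executable; the W^X and non-executable-stack protections are not violated, which is why they do not stop this style of attack.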
So, for instance, we have four pointers to different gadgets, and together they give the attacker root access. What happens? The first entry puts the return address of the first chunk of code on the stack. That code runs and prepares arguments, for instance. Then it returns back to the stack: there is a ret instruction, and the ret instruction fetches the address of gadget two and jumps to the second piece of code. In the end we get essentially the same thing as before: move something to RDI and call system. Supposing that the address 40, 40, 40 holds the string /bin/sh, we're kind of done. The system is ours; we have a shell ready.
Then researchers worked hard to see how this can be overcome. One of the things considered fundamental research on preventing ROP and similar attacks is control flow integrity. As the original paper stated, execution must follow a path of a control flow graph determined ahead of time, and if there is a violation of that path, we should be able to detect it and thus prevent further execution. In a sense, it means that if there is a control flow graph and you're executing a call instruction or a ret instruction, there should be no way to jump into the middle of a function. They have several assumptions, like non-executable data and non-writable code, which is W^X. With these assumptions, they suggested validating all indirect jumps and calls by labeling the sources and destinations of jumps, calls and returns, and adding instrumentation to each program that checks, whenever a jump is executed, that the target matches the label the compiler thought it should jump to. For instance, if we call bar from foo, at the call site in foo we have to check that the actual target of the call is L1. And when we return from bar to foo, we have to check at the site of the ret that the actual return address is L2. And, oops, sorry about that.
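The label check can be sketched in a few lines of Python. This is only a toy model of the instrumentation idea: the `LABELS` table, the target names, and `checked_transfer` are hypothetical, and a real implementation embeds the labels in the machine code rather than in a dictionary.

```python
class ControlFlowViolation(Exception):
    """Raised when a transfer does not match the precomputed control flow graph."""


# Labels a (hypothetical) compiler assigned ahead of time.
LABELS = {
    "bar_entry": "L1",        # legitimate target of the call in foo
    "foo_after_call": "L2",   # legitimate return site back in foo
    "middle_of_bar": None,    # unlabeled: never a valid target
}


def checked_transfer(target, expected_label):
    """Instrumentation placed before an indirect call or a ret."""
    if LABELS.get(target) != expected_label:
        raise ControlFlowViolation(f"{target} is not labeled {expected_label}")
    return target


checked_transfer("bar_entry", "L1")        # call bar from foo: allowed
checked_transfer("foo_after_call", "L2")   # return from bar to foo: allowed
try:
    checked_transfer("middle_of_bar", "L1")  # corrupted target: rejected
except ControlFlowViolation as err:
    print("blocked:", err)
```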
Since instrumentation has a certain overhead and there are more instructions to execute, hardware vendors started to add special hardware to assist in performing control flow integrity. There is coarse-grained forward-edge protection. Probably I should have mentioned that a call is usually called a forward edge and a ret is usually called a backward edge. So there are several CPU vendors that implement coarse-grained forward-edge protection: they do some kind of labeling of the call targets. Some CPUs have the ability to check backward-edge integrity; these are RISC CPUs that have special instructions, which we'll cover a bit later. And there is shadow stack, which is a fine-grained protection against backward-edge attacks.
Coarse-grained forward-edge protection is implemented in ARM64 and in x86. There is a special BTI instruction in ARM64 and a similar ENDBR instruction in x86. The compiler has to put these instructions at every target of an indirect call, and then the CPU can verify that the branch is actually meant to be taken. If there is no such instruction at the target of an indirect branch that the CPU executes, the CPU will fault and the program will be stopped. This is pretty coarse-grained because an attacker can chain several calls to function entries and still find enough gadgets to compromise the system, for instance by passing a wrong argument to a function, then returning, then going back. I have to admit I'm not a security researcher proper, so I might not know some things. But it is considered a relatively weak protection, and usually it should be combined with backward-edge protection as well.
How IBT works I have pretty much explained already. Say that, using some memory corruption technique, a call address was modified by an attacker, and the program tries to call code that would do some bad things.
The code there does not start with an ENDBR instruction, so the CPU will detect that and raise a control protection fault on x86.
Now a bit of RISC. I'll start with POWER because I understand its assembly a bit better than ARM's. This is an example of a function call on a RISC machine for a non-leaf function. For leaf functions, RISC machines usually use a register to pass the return address; it's called the link register. But a non-leaf function has to store the return address from the previous frame on the stack anyway, then do the branch-and-link call, which puts the new return address in the link register, and then restore everything and move on. So the first two instructions here take the link register from the previous frame and save it on the stack. Then there is a function call, which is a bl. After the called leaf function returns, we fetch the return address from the stack, put it into the link register, and return from the original function.
Now, the stack is apparently vulnerable. There are memory corruptions, and an attacker can possibly overwrite the return address, gain control of the system, and cause it to execute chains of unintended functions. So what the CPU can do is use special instructions that protect the return address with a cryptographically calculated hash. There is a hash key used for the cryptography, a unique key for every process in the system. And then there are hash store and hash check instructions that assist in storing the return address on the stack and then verifying that the return address is correct when the function returns. So the call gets two instructions added. We store the return address on the stack, then we add hashst, which also pushes the cryptographically computed hash onto the stack.
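The store step just described, and the check that pairs with it on return, can be modelled roughly as follows. This is a sketch under assumptions: HMAC-SHA256 stands in for whatever cipher the hardware uses, `PROCESS_KEY` stands in for the per-process key, and mixing the stack slot address into the hash is an assumption of this sketch, not something the talk states.

```python
import hashlib
import hmac
import os


class HashMismatch(Exception):
    """Stands in for the trap raised when the hash check fails."""


# Stand-in for the per-process key; user code cannot read the real one.
PROCESS_KEY = os.urandom(16)


def hash_store(return_address: int, stack_slot: int) -> bytes:
    """Function prologue: compute the hash saved next to the return address."""
    message = return_address.to_bytes(8, "little") + stack_slot.to_bytes(8, "little")
    return hmac.new(PROCESS_KEY, message, hashlib.sha256).digest()[:8]


def hash_check(return_address: int, stack_slot: int, stored_hash: bytes) -> None:
    """Function epilogue: recompute the hash and compare before returning."""
    expected = hash_store(return_address, stack_slot)
    if not hmac.compare_digest(expected, stored_hash):
        raise HashMismatch("return address was modified on the stack")


saved = hash_store(0x10000A4C, 0x7FFF0010)    # made-up address and slot
hash_check(0x10000A4C, 0x7FFF0010, saved)     # untouched: passes silently
try:
    hash_check(0x10000F00, 0x7FFF0010, saved)  # overwritten return address
except HashMismatch as err:
    print("trap:", err)
```

An attacker who can overwrite the saved return address would also have to forge a matching hash, which requires the key.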
Then, after we return from the leaf function, we load the link register value from the stack and we can verify with the hash check that it matches what we expect. So even if the link address on the stack was corrupted, the hash check guarantees that the program either will not continue or will continue the way it was intended to.
A very similar mechanism is implemented on ARM64. It's called pointer authentication. There are a few differences in how the cryptography is handled and where exactly the hash is stored. But on the bright side, ARM64 has way better support in Linux than PowerPC does these days. Aside from generating the key and adding instrumentation to programs to store and verify the hash, there is also a ptrace interface on ARM64 that allows debuggers to change the flow of the program and to jump from one frame to another without being considered an attacker. In this one, I didn't fully understand what STP and LDP do, but I presume that they also store the link address on the stack and then restore it. And there are special instructions, I have no idea how to pronounce them, PACIASP and AUTIASP, that do pretty much what the PowerPC instructions did: PACIASP creates a hash that authenticates a pointer, and AUTIASP verifies that the record for that pointer is authentic.
The last technique I'm going to cover is the shadow stack implemented by x86 vendors. The idea is to create another stack that contains only the return addresses of the functions being called. Whenever there is a call instruction, it pushes the return address to two different places: to the normal program stack and to another place, called the shadow stack, that stores only return addresses. Whenever there is a ret instruction, the hardware verifies that the return addresses on both stacks match, and if anything goes wrong, the hardware generates a control protection fault.
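The call and ret behaviour described above can be captured in a small Python model. The `Cpu` class and the addresses are invented for this sketch; real hardware also keeps the shadow stack in specially protected memory, which a plain Python list cannot express.

```python
class ControlProtectionFault(Exception):
    """Stands in for the fault raised when the two stacks disagree."""


class Cpu:
    def __init__(self):
        self.stack = []          # normal stack: the program can write to it
        self.shadow_stack = []   # return addresses only

    def call(self, return_address):
        # A call pushes the return address to both places.
        self.stack.append(return_address)
        self.shadow_stack.append(return_address)

    def ret(self):
        # A ret pops both copies and compares them.
        address = self.stack.pop()
        expected = self.shadow_stack.pop()
        if address != expected:
            raise ControlProtectionFault(
                f"stack says {address:#x}, shadow stack says {expected:#x}"
            )
        return address


cpu = Cpu()
cpu.call(0x1000)
cpu.call(0x2000)
cpu.stack[-1] = 0xBAD    # memory corruption touches only the normal stack
try:
    cpu.ret()
except ControlProtectionFault as fault:
    print("fault:", fault)
```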
For example, if we have a stack like that, with return addresses 45570 and 40 and so on, there will be another piece of memory, called the shadow stack, that contains the same return addresses. In case of an exploit attempt with an overwritten return address on the normal program stack, at the time the ret instruction executes the CPU will detect that the new address the attacker put there doesn't match the original address, and it will raise a control protection fault and essentially stop the program.
Shadow stack is not yet enabled in Linux. I think there were something like two years of discussion about it. For now, I think the last RFC was sent around February, and this pretty much summarizes what the last RFC suggested for enabling shadow stack for user space. Any program that has a special field in the ELF header, generated by GCC or LLVM, is known to support shadow stacks, so the kernel enables shadow stack for that program at load time. There are features of shadow stacks that are controlled by glibc, so it's possible to override the shadow stack enablement with glibc tunables. You can say: okay, I do know that this program can support shadow stack, but I don't want it today; there is an environment variable you can pass to glibc to ask for that. A control protection fault is supposed to generate SIGSEGV, so the program will have to act accordingly, and there is a new signal code to differentiate a segmentation fault caused by, I don't know, some erroneous access from a protection fault. longjmp and the whole family of calls that change frames in ways other than function calls are all handled by glibc patches. And there is special handling of sigreturn, with an assumption that only glibc uses sigreturn; for the glibc flow there is some code in the kernel that takes care of it, but if your application calls sigreturn directly, you are going to run into trouble. So this is the link to the last RFC that the Intel folks sent.
Now, for low-level users, like glibc for example or other libcs, there is a new system call that creates the shadow stack memory area: it allocates virtual address space and reserves the memory for it, and it also sets up the hardware to know where the shadow stack is. Then there is an arch_prctl that allows disabling or enabling shadow stack on a per-process basis, and it allows enabling or disabling writes to the shadow stack, which have to be done using a special instruction. There is also a possibility to lock the state of these features. So, for example, I've enabled shadow stack, I've disabled writes to shadow stack, and then I lock the features; from then on the program cannot change the enabled or disabled state of shadow stack writes or disable shadow stack again.
Shadow stack is nice and good if your program doesn't do weird things, or if you don't want to debug it with GDB. GDB might need to switch frames in arbitrary order, so when you go up and down several frames and then continue execution, the contents of the stack and the shadow stack will be different. So what people suggested is to use the ptrace interfaces to adjust the contents of the shadow stack like normal data, which is what debuggers usually do with PTRACE_POKEDATA, and then there is a special ptrace API that allows modifying the shadow stack pointer and the hardware registers associated with the shadow stack machinery. So if, let's say, you go down three stack frames with GDB, GDB will also rewind the shadow stack to the point where it matches the normal stack of the program.
Now, I also happen to be one of the CRIU co-maintainers, and I actually came to this when somebody on the thread about shadow stack said: okay, we are going to break CRIU. And they do. CRIU does really weird things with programs. It checkpoints and restores them, and we use sigreturn extensively.
We need a way to restore the shadow stack at exactly the same place where it was, because we have to restore the process the same as it was before we checkpointed it. After I implemented the first POC with a standalone test program, I discovered that I also need to take care of the feature locks that glibc does, because I found myself in a state where I had no control over re-enabling an already disabled shadow stack; I'll talk about it a bit later.
So, a brief sideways overview of CRIU. During checkpoint, we stop the task with ptrace, and we inject parasite code into the victim, pretty much like any virus does, to extract the essential process state that is needed. After the process state is extracted and serialized, most probably to disk, we call sigreturn with a frame crafted so that the victim returns to normal execution. On the restore side, we clone the task from the CRIU controller program. We restore the task state from the state saved on disk. Then we clone all the threads that were originally present in the program, and each thread calls sigreturn to return to normal execution, again with a crafted stack frame, which wouldn't match the shadow stack in any way.
So the first thing we had to do was to be able to do sigreturn. Otherwise CRIU would be stuck: there is a control protection fault and nothing to do about it. With the interfaces that GDB is supposed to use, we were able to achieve pretty much the same functionality. The stack on the right is the stack of the victim. The lower part is the stack of the program itself, and the upper part is the stack of the parasite code that we inject. The code that is injected is not called with call; we just jump there, and we jump there using PTRACE_SETREGS. So there is no indication on the shadow stack that we are going to call these functions. The upper part of the stack is reflected on the shadow stack, but the first function, the one we entered with PTRACE_SETREGS, will never be there.
So we used ptrace to inject this new frame into the shadow stack, and that allowed us to continue using sigreturn as before without doing too many intrusive things in CRIU itself. So essentially, when we stop the victim with ptrace, pretty much one of the first things we do is inject the shadow stack frame into the shadow stack.
The next challenge was to be able to restore the shadow stack exactly as it was before the checkpoint. What I needed was to extend the map_shadow_stack system call with functionality similar to MAP_FIXED, so that the shadow stack is allocated at an exact address. And then there were a lot of hoops I had to jump through to be able to use the WRSS instruction to write to the shadow stack, but those were pretty much technicalities. So what CRIU does for a program with shadow stack enabled: we call map_shadow_stack. We set memory aside to stash all the contents of the shadow stack that was checkpointed originally. We enable shadow stack to be able to use the special shadow stack instructions. Then we copy the shadow stack contents from the stash to the shadow stack memory. And then we update the shadow stack pointer, again using special instructions the CPU provides.
The last thing, which I found pretty late in the process, is that I also have to deal with the feature enable and disable controls. It's an old presentation I downloaded, so just ignore that FIXME. Whenever glibc sees a shadow-stack-enabled program, it locks the features in the state shadow stack enabled, shadow stack writes disabled. That still worked with the original model, where we enable shadow stack at some point. The problem is that it didn't work the other way: whenever glibc loads a program that doesn't have the shadow stack field in the ELF header, it locks shadow stack as disabled, and then there is no way CRIU could re-enable shadow stack. For that, the current discussion is that we need new ptrace APIs to get and set the state of the process's shadow stack controls.
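The locking problem can be shown with a toy state machine in Python. This models only the semantics described in the talk; `ShadowStackControls` and the feature names `"shstk"` and `"wrss"` are invented for this sketch and are not the kernel interface.

```python
class FeatureLocked(Exception):
    """Raised when a locked feature is toggled."""


class ShadowStackControls:
    """Per-process enable, disable and lock state, as described in the talk."""

    def __init__(self):
        self.enabled = set()
        self.locked = set()

    def _require_unlocked(self, feature):
        if feature in self.locked:
            raise FeatureLocked(f"{feature} is locked")

    def enable(self, feature):
        self._require_unlocked(feature)
        self.enabled.add(feature)

    def disable(self, feature):
        self._require_unlocked(feature)
        self.enabled.discard(feature)

    def lock(self, *features):
        # Locking freezes whatever state the feature is in right now.
        self.locked.update(features)


# The restorer binary has no shadow stack marker, so the loader locks the
# features while they are still disabled.
controls = ShadowStackControls()
controls.lock("shstk", "wrss")

# Later the restorer wants to turn shadow stack on for the restored task.
try:
    controls.enable("shstk")
except FeatureLocked as err:
    print("cannot restore:", err)
```

In this model the only way out would be an operation that bypasses or lifts the lock from outside the process, which is the role the proposed ptrace get and set interface would play.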
I did something for my POC. I hope that the Intel folks who work on shadow stack will come up with a better API that will be more useful, and we'll see in the near future. It remains to be seen when and how shadow stack lands in the upstream kernel eventually; I don't know whether the next patch set will be the one that just gets accepted. So there is more work to do. Afterwards we'll need to refine and complete the CRIU implementation. We also started to look into how to use the shadow stack hardware for the kernel itself. It looks like a lot of mess, and truly it won't be easy, but we keep our hopes high. And I'm sorry, I couldn't connect to the internet; I have links to the CRIU and kernel patches I have at the moment in a more up-to-date presentation, so I'll post it on the Open Source Summit website. And that's more or less all I have.
So I noticed one of the things you pointed out in the presentation was being able to disable processes from using, like, the WRSS instruction.
Can you repeat?
I noticed one thing you pointed out in the presentation was that there was a way to disable processes from using, like, the WRSS instruction. Doesn't that restrict the functionality a lot? Like, it would break the ability to use longjmp and things like that. Then would it be something you wanted to do?
I don't know how exactly glibc handles longjmp. I just trust Intel that they do. But if you allow writing to shadow stacks, you have essentially done nothing, because at the same time as you inject something into the normal stack, you can find a gadget that writes to the shadow stack, right? And you would find a way to execute that gadget. So allowing shadow stack writes significantly weakens the shadow stack mechanism in general.
Thank you so much.