Hi, I'm Joao Moreira, I work for Intel, and I'm here to talk a little bit about FineIBT, which is a compiler-based scheme that enhances the guarantees provided by Intel's Control-flow Enforcement Technology, CET, specifically when it comes to indirect forward edges. Here are some important disclaimers, and if you ever check out our source code, just be aware that some bugs may still be there. Of course, we are not really using lasers to improve CET. This was just a bad joke. Let's start our talk with a metaphorical example. Imagine you have a function foo indirectly calling a function bar. This should be your control flow. Yet, when you start executing, first fptr, which is a function pointer, is assigned the address of bar. Then a memory-unsafe function is executed, allowing an attacker to overwrite the contents of fptr with the address of bar incremented by an offset. Then fptr is invoked and the control flow goes directly into the middle of the function bar, skipping an important user ID check that should have been executed. Because of situations like this, Intel introduced Indirect Branch Tracking, IBT, which is part of CET. What IBT does is require every indirect call and indirect jump to target an endbranch instruction, just like the one highlighted in green in the function bar. Such a policy blocks attackers from freely redirecting forward branches, preventing situations like the one just described. With something like this in place, you can have your compiler emit endbranch instructions in the prologue of every function, creating an ABI-like scheme that forces indirect calls to always target the first instruction of functions. This is actually what we call coarse-grained control flow integrity. The result of this is that now any function can be reached from any indirect call or indirect jump. This guarantee is too relaxed, and it still allows functions to be used out of context to exploit the system.
In that regard, coarse-grained control flow integrity does not fully mitigate the control flow hijacking problem. Now let's talk about a not-so-metaphorical example: the sudo heap-based buffer overflow recently described in an advisory released by Qualys. If you look at the advisory, you will notice that there are three different methods for exploiting the bug. Here, I'll be focusing on the first one. In this method, the attacker corrupts the getenv function pointer that lives in the heap-based struct sudo_hook_entry, and the corrupted pointer then targets either the execv or the execve PLT entry in the sudoers.so file. If you look at the assembly dumps present in the Qualys advisory, you will notice that the PLT entries there display endbranch instructions, showing that IBT was in place and yet was not sufficient to prevent the control flow hijacking attack from taking place. Because of situations like this, researchers came up with the idea of fine-grained control flow integrity, which tightens the control flow graph of the application even further by enforcing additional rules on top of the ABI-like scheme. Ideally, an indirect branch should only be allowed to execute if its target is the supposed-to-be target. But in practice, you can't really define this rule at compilation time, as this is a statically undecidable problem. Because of that, we need to use heuristics to reach a practical implementation. The most commonly used heuristic is clustering functions and pointers by their prototypes. Functions have prototypes; function pointers also have prototypes. Thus, whenever an indirect call takes place, the prototype of the function pointer should match the prototype of the invoked function. Back to the sudo example: if you look at the getenv hook pointer's prototype, it takes a char pointer, a char pointer pointer, and a void pointer, which differs from execv's and execve's prototypes, but matches the prototype of sudoers_hook_getenv, which is the function that should be legally invoked.
Some software CFI schemes out there successfully implement prototype-matching policies. Some of them are PaX/grsecurity's RAP, Clang CFI, and Microsoft's XFG. These schemes are implemented in software and materialized in the form of pseudo-ABIs wired in different ways internally. RAP uses tags hard-coded in the binary to mark valid branch targets. Clang CFI uses intermediate jump tables with the sets of valid targets, and XFG passes a hash through a register to an intermediate dispatcher that then also checks hard-coded tags. Finally, other schemes for forward-edge CFI, like pointer authentication for example, do exist, but are beyond the scope of this talk. Given this scenario, we came up with these two hypotheses: first, can we enhance CET IBT, possibly reaching a hybrid approach that combines the benefits of a stiff hardware foundation with the flexibility of software instrumentation, in a way that makes it fine-grained? And if yes, how much of the perks implicit to its hardware nature would it retain? Now we are ready to start exploring FineIBT. FineIBT is also an ABI-like scheme, which uses endbranch instructions to anchor the control flow to the beginning of functions. Then, on top of that, it enforces additional policies in the function prologues. These checks consist of instrumentation which is emitted in the binary by the compiler. In the instrumented binary, indirect calls are augmented with hash set operations. Function prologues are augmented with hash check operations. Direct calls have their targets incremented by an offset, ensuring that whenever they are executed, they skip the prologue hash checks in the callees. The hashes used in these operations are generated based on function and pointer prototypes. This is regular assembly code as normally emitted by an ordinary compiler, and you can see a direct call and an indirect call over there. This is the assembly code generated with the basic IBT instrumentation.
You can see an endbranch instruction there. Well, this is the FineIBT instrumentation. First, this is the hash set operation. It places the hash associated with the function pointer used in the indirect call into the register R11. Then come the regular endbranch and the hash check operation. Since the endbranch is in place, the control flow will go through the hash check, which XORs the contents of R11 with the expected hash. If the result of this operation is zero, then the following jump is taken. Otherwise, it is not taken and the execution reaches the HLT instruction. In this snippet, we use a XOR operation because it consequently destroys the contents of R11, which can be important to prevent reuse attacks if you are ever mixing FineIBT-enabled code with non-FineIBT-enabled code. Finally, the direct call now targets the address right after the hash check snippet. As described, this instrumentation would work properly with statically linked binaries and possibly with kernels. Yet, in some scenarios, your applications will be interacting with libraries, and sometimes these won't be FineIBT-enabled. In situations like this, we must ensure that FineIBT dynamic shared objects do not break non-FineIBT dynamic shared objects. If you look at the x86-64 ABI, you'll notice that the basic CET design solves this problem by disabling the policy enforcement whenever a non-CET DSO is loaded. We follow the very same approach. First, we have a bit in the DSO's ELF headers that flags whether it is a FineIBT-enabled DSO or not. Then the loader is augmented to check whether the DSOs are FineIBT-enabled at load time. Finally, if all the loaded DSOs are FineIBT-enabled, then a memory flag is set, informing that the policy should be enforced. In this assembly dump, you can see in green the previously described FineIBT pieces, and in red, a new snippet which checks the FineIBT policy enforcement memory flag.
The bits verified, 0x1 and 0x10, correspond respectively to the IBT and FineIBT flags, because it doesn't make sense to enforce FineIBT without IBT. fs:0x48 is the memory address where the loader sets the FineIBT global flags. FineIBT also requires special PLT entries that have two slots. The first is a 32-byte slot, and the second is a 16-byte slot. The first slot is reached by indirect calls, and it checks for the proper hashes and then jumps to the target function. The second slot is reached by direct calls. Since an indirect branch is about to take place in the PLT, it then sets the hash and jumps to the target. This scheme requires early binding because we don't want symbols to be resolved while the control flow is going through the PLT. That would require the hashes to be saved in memory, and this would be unsafe. Here is the assembly snippet showing a PLT entry and the control flows from an indirect call and from a direct call. And here you can see how the FineIBT piece is placed into the PLT entry, with the indirect call reaching the first slot, which does the hash check, then jumps to the second slot, which goes into the target. The complex PLT wiring is needed because it references pointers from the .got.plt, which can be hijacked. Yet, the existing model already requires early binding. Considering that this is already in place, pushing the binaries into a full RELRO model isn't a huge hassle. And with it in place, the PLT indirect branches become implicitly safe. This opens space for optimizing the PLT by combining RELRO and indirect jumps with the CET notrack prefix, which relaxes the requirement of landing on endbranch instructions, allowing the branch to jump over the hash check prologue in the target function. This optimization remains as future work, yet to be explored. The previously described scheme actually requires all DSOs to be FineIBT-enabled for the policy to be enforced.
Because this can be very limiting, we started exploring different models to enable cross-DSO compatibility. The method I'm about to describe is currently under development, and it actually uses the shadow stack, also provided by CET, but with a different purpose. The shadow stack is in place with the goal of preventing control flow hijacking on the backward edges. When a shadow stack is in place, every call instruction will copy the return address pushed on the stack to the shadow stack. When a return happens, this copied address is verified for a match with the address on the regular stack. By using the instruction RDSSP, we can read the shadow stack pointer and then retrieve the caller's address from the top of the shadow stack. By doing that, we can identify the function which is calling into a FineIBT-enabled function, verify whether it belongs to a FineIBT-enabled object, and decide whether to enforce the policy based on that. Here's how the policy enforcement piece looks. After the hash check fails, we run RDSSP and get the shadow stack pointer. We dereference it and get the caller's address. Then we compare it to the object's boundaries, identifying whether the call is coming from the same DSO or not. If yes, we know that this is a FineIBT-enabled DSO, thus the policy should be enforced. This method as described enforces a self-contained intra-DSO policy. I believe it can be extended to support multiple DSOs, although with more overhead than the global bit scheme previously described. Yet it is meaningful that the overhead is only paid when indirect calls from coarse-grained DSOs into fine-grained DSOs happen, as calls between fine-grained objects should have matching hashes, and calls into coarse-grained objects are not checked at all. Another important perspective on FineIBT is observing how it behaves when it comes to transient execution attacks.
If you read Intel's SDM, you will find the following interesting snippet, which says that when the CET tracker is in the WAIT_FOR_ENDBRANCH state, instruction execution will be limited or blocked, even speculatively, if the next instruction is not an endbranch. In practice, what this means is that the speculation after indirect forward branches is confined to the coarse-grained control flow graph, limiting transient execution attacks. Yet, if you look at FineIBT's instrumentation, you will notice that it has a conditional branch right after the hash check. Whenever enforced in a complete ABI-like scheme, the decision regarding the target of this branch will depend only on the MOV and XOR instructions. Because these instructions only use immediate and register operands, they are retired with very short latency. This not only makes it hard to attack these specific branches, but also limits the speculation from proceeding further down if the hashes are not matching. We tried to use this conditional branch as a transient execution gadget, but we were not able to, and this led us to assume that in a full FineIBT process, the speculation remains confined to the refined control flow graph. As said, we assume here the statically linked scenario, where no policy enforcement checks are made. I would like to explicitly thank Ke Sun, who is also a member of the Intel STORM team, for doing this analysis. When generating PLTs, the FineIBT linker needs information regarding the hashes with which it needs to instrument the entries. Thus, to enable that without depending on LTO, the FineIBT compiler embeds the required data in a special section in each generated object. Then, when this data is consumed by the linker, it discards the sections with no impact on the final DSO size. FineIBT was implemented on top of LLVM and LLD 12.0. We implemented cross-DSO support on top of the Google Runtime Environment branch of glibc.
We also wrote very basic IBT support for musl 1.2.0, as we wanted to use it for some performance tests. The source code of FineIBT is available in these repositories: the first one is the compiler and linker implementation; the second, a Clang-buildable version of glibc with FineIBT cross-DSO support; and the third, a few scripts used for testing. Okay, now let's talk a little bit about performance. We used two different test sets to evaluate the performance of FineIBT. The first test set consists of a custom synthetic benchmark, which was written as an attempt to bring out the worst in forward-edge CFI implementations. It is composed of three different applications, which are very dense when it comes to the number of forward indirect branches. The first application is a dummy loop that indirectly invokes an empty function over and over. The second is a Fibonacci sequence calculator that does its recursion indirectly. And the third is a bubble sort implementation whose swap function is invoked indirectly. Each of these applications was compiled in two different versions. The first uses a global function pointer for its indirect calls. The second uses a local function pointer for the very same indirect calls. The reason for having the second version is that in our early tests with the first version, FineIBT displayed overheads much smaller than those introduced by Clang CFI. Under deeper analysis, we noticed that Clang CFI's instrumentation would always force the control flow through the PLT, while all the other analyzed binaries would go directly to the targeted function. While it remains a bit unclear to me why Clang CFI misses this optimization opportunity, I felt it would be good to provide numbers which are intrinsic to the CFI instrumentation only, trying to rule out overheads introduced by major side effects. The numbers presented for this test set are the average of 10 runs, and they were collected with perf.
The second test set is composed of two applications from the SPEC CPU 2017 benchmark. SPEC has in its source code some typecasts which cause false-positive policy violations when executed under CFI instrumentation. To overcome this problem, I used a modified version of FineIBT which uses the very same tag for every prototype, creating a coarse-grained-like policy enforcement, but with the same instruction sequences as the fine-grained scheme. To run these applications compiled under the Clang CFI instrumentation, I had to add a few violating functions to an ignore list, preventing them from being instrumented, thus avoiding the false positives, but also giving Clang CFI some performance advantage. The first application was 600.perlbench_s, which is suggested by the Clang CFI documentation as a good application for testing CFI. I had to add these four functions to the ignore list to make it work. The second application was 625.x264_s, which was picked at random. I had to add these two functions to the ignore list to make it work. The numbers regarding the SPEC runs were picked by the benchmark itself out of three runs. The testbed used in these experiments was this machine, running Fedora 33 with a kernel patched to support CET. IBT was verified to be enabled and enforced. Turbo Boost, ASLR, and SMT were off. Each application was linked to a setup-equivalent musl and compiled using the same compiler and arguments, except of course for the CFI scheme that we are trying to evaluate. The FineIBT global check was hard-coded as enabled, ensuring that the policy was always enforced. The compilation arguments are in the backup slides. The tests were executed under these five different setups: NoCFI, which is the regular binary without CFI; Coarse, which is the binary compiled with the regular IBT feature; Fine, which is the binary compiled with the regular IBT feature plus the FineIBT compiler instrumentation;
Clang CFI, which is the binary compiled with the Clang CFI instrumentation; and Clang CFI NC, which is the binary compiled with the Clang CFI instrumentation, but without cross-DSO support. This last one was only used for comparing space overheads. These are the numbers for the runs when using global function pointers. As you can see, we have a huge overhead of 57.42% for the dummy application when compiled with Clang CFI, while this overhead remained below 2% for the FineIBT version. Similarly, the Fibonacci application presented a 32.78% overhead for Clang CFI, with a significant but still smaller overhead of 13.77% with FineIBT. BubbleSort, which is the least indirect-branch-dense application of them all, displayed an overhead of 5.12% for Clang CFI, while remaining at 1.78% with FineIBT. When control flow through the PLT was enforced under all setups, the overhead deltas became less massive, but still evident, with FineIBT presenting better performance than Clang CFI in all tests. We believe we can improve these numbers even further after optimizing the PLT under RELRO. For the SPEC applications, 600.perlbench_s under FineIBT presented a 1.66% overhead, while Clang CFI's overhead reached 3.56%. The second application in the test set presented the same result for both FineIBT and Clang CFI. Despite the tie, it is important to mention that Clang CFI in both applications had functions added to its ignore list and thus not instrumented. The space overhead of Clang CFI with cross-DSO support is very high, especially in the tiny synthetic applications. I assume that this comes from all the machinery that Clang needs to link in to provide the cross-DSO support. Thus, again trying to provide more approachable numbers, we did the space overhead comparison using applications compiled without Clang CFI's cross-DSO support. In this comparison, Clang CFI presented smaller overheads than FineIBT, yet at the high price of not having cross-DSO support enabled.
This comparison was done using the full binary sizes of each compiled application. Now, some conclusions. If you remember, at the beginning of the talk we brought up these two hypotheses: whether we can enhance CET IBT in a way that makes it fine-grained, and if yes, how much of the hardware-related perks it would retain. I guess that the answer to the first question is yes, and regarding the hardware-related perks, FineIBT presents good performance, especially when compared to Clang CFI. It improves the transient execution mitigations which are provided by the regular IBT feature. It also presents reasonable space overheads. And then in my pipeline, here are a few tasks I'm willing to work on. The first thing is doing a deeper security validation of the FineIBT design. Most CFI implementations will check the policy in the caller before the branch executes, but in FineIBT this check happens in the callee. Is this a problem? Is this better? If there are problems, can they be fixed? Also, what about mixing FineIBT and non-FineIBT DSOs? It certainly adds flexibility, but could also introduce weak spots. How can we make this solid? Then I want to improve the benchmarking of FineIBT. I do believe that benchmarking is rocket science, and I understand that the quality of the presented numbers can and should be improved. Thus, if anyone has any suggestions on how to do it, I'm willing to hear them. Besides performance, I'm not super into the academic average indexes for measuring CFI effectiveness, but what better options do we have out there? How can we really measure the restrictiveness added by CFI schemes? I also want to move all the cross-DSO schemes into Clang's compiler runtime library, removing the glibc dependency, and, who knows, maybe making FineIBT upstreamable. Then I want to explore the cross-DSO compatibility schemes further, looking for more flexible approaches and for designs tailored to specific use cases.
CET support has not yet been merged into the upstream kernel. Thus, how can I help with this task? Once it is there, what is the landscape for running the kernel with FineIBT? Some anticipated challenges are, for example, interactions with assembly files, which should require some rewiring, but shouldn't be as complicated as the ongoing effort to make the kernel support Clang CFI. Currently, FineIBT is a C-only thing. Is the general idea useful in the context of C++ vtables? If yes, what is the best general design, and how do we support polymorphism there? The performance numbers show us that the current PLT design is expensive. As we described earlier, FineIBT PLTs may be optimized under RELRO. What would be the best design for this? Also, what if we take a step back: can we remove the dependency on early binding? And these are some very important people to this research; I would like to thank each of them. Professor Vasileios Kemerlis and Alexander Gaidis from Brown University, with whom I have been working very closely on this research. Michael LeMay and H.J. Lu from Intel; Ke Sun, Henrique Kawakami, and Alyssa Milburn from the Intel STORM team; and Jared Candelaria and Vedvyas Shanbhogue, who recently left Intel but left behind meaningful contributions to this research. Thank you all. This has been such a fun ride. Thank you for your time, and I'm happy to answer any questions about FineIBT, either here in the conference chat or later by email. Thank you so much and bye-bye.