Hello, everyone, and welcome to my talk, Demystifying Intel Houdini. My name is Brian, and I'll be discussing Intel's Houdini binary translator: what it is, where it's used, how it works, security concerns, and recommendations. But before that, a little bit about myself real quick. That's me. I'm Brian Hong. I studied electrical engineering at the Cooper Union, and I'm currently working as a security consultant with NCC Group, where lately I've been performing a lot of Android pentests, and sometimes even Android malware analysis. Besides that, I like to build random bits of hardware, and I also like to reverse engineer low-level stuff that other people build, both software and hardware. So with that, let's get started with Android. Android is one of the largest operating systems in the world, and you can write Android applications using Java and Kotlin, as well as C and C++ using the Native Development Kit. Android was originally designed and built for ARM devices, but Google later added support for x86, and just to note, there was community support before that as well, such as the Android-x86 project. Since then, there have been several Android devices running on x86, but now there are mainly two: x86 Chromebooks, and x86 hosts running commercial Android emulators. However, apps generally still lack support for x86, and that's because ARM is the primary hardware platform for Android; in fact, if you have native components in your app, the Play Store only requires the ARM builds. Because of this, many native applications don't end up containing any x86 binaries, only ARM. So how can x86 devices running Android run these apps that only contain ARM binaries? This would be a great time for me to introduce you to Houdini. Houdini, the topic of this talk, is Intel's proprietary binary translator that allows x86 devices to run ARM binaries.
And it was co-created with Google, as it was designed to be run with Android's Native Bridge — we'll get to that in a second. Houdini is a mysterious little black box. We don't know what it does, we don't know how it works, and there doesn't seem to be any documentation on it. And it's possible some vendors may be trying to hide that they're using it — we'll get to that as well. There are three variants: 32-bit x86 implementing 32-bit ARM, 64-bit x86 implementing 32-bit ARM, and 64-bit x86 implementing 64-bit ARM. Right. So Houdini can be used on physical hardware, as there once were x86-based mobile phones, and it is still used on x86 Chromebooks, which is how we actually got the binaries so we could take a look at them. It's also used in some commercial Android emulators, such as BlueStacks and Nox — though I don't believe it's enabled by default; I remember there being an option to enable it in the settings. And it's used in the Android-x86 project, which can be run on real hardware or in an emulator. OK. So how does it work? Houdini is basically an interpreting emulator for ARM instructions. What that means is there's a loop reading ARM instructions — the opcodes — and it produces the corresponding behavior in x86. I just want to make clear that it does not do just-in-time compilation: it doesn't translate nor output any x86 instructions. It reads an instruction and then performs its behavior. Houdini has two main components. The first, just houdini, runs executables. The second, libhoudini, is used to load and link ARM shared objects. So like I said, the first part, houdini, runs ARM executables, both statically and dynamically linked. When running dynamic binaries, it actually uses its own set of pre-compiled libraries for ARM Android, in addition to the x86 libraries needed by the rest of Android and Houdini itself. And there's a screenshot from the Chromebook, actually. You can see that we have an x86 machine, denoted by the i686 at the end.
And the program I'm trying to run, HelloStatic, is a 32-bit ARM statically linked ELF binary. I run ./HelloStatic, and it just prints Hello World. So some of you may have noticed that I just executed the binary directly without invoking Houdini, and you might be wondering where Houdini comes in. This is actually a kind of cool Linux kernel feature called binfmt_misc. If you aren't familiar with it, I'll give you a quick explanation. Miscellaneous Binary Format (binfmt_misc) is a Linux kernel feature that lets you register interpreters for custom binary formats, similar to how a shebang works for Bash or Python programs. In our specific case, the custom binary format is an ARM ELF binary. The two screenshots below show the registered entries for static and dynamic ARM ELF binaries, and you can see that the interpreter is set to /system/bin/houdini. So essentially, the resulting effect is that when I type ./HelloStatic in the shell, or try to exec this binary, the kernel compares the magic bytes by looking at the first n bytes of the file. If they match, the kernel instead execs /system/bin/houdini and passes my program's name as the first argument. Right. So now that we know how the houdini part is used, let's look at the more interesting second component, which is libhoudini.so. Libhoudini is itself a shared object for x86, and it's used to load other shared objects that are built for ARM. It was designed to be used with Android's Native Bridge, so let's talk about that. Android Native Bridge is the component in Android that allows this binary translation to work. It is part of the Android runtime and is the main interface between the Android side and our Houdini binary. So Native Bridge provides an interface for Android to talk with Houdini.
And I want to point out that while it might have been designed specifically with ARM and Houdini in mind, the interface it provides can be used to implement other processor architectures — for example, running MIPS code on an ARM device. So it's part of the Android runtime, and it's initialized on boot. When it initializes, it checks the system property ro.dalvik.vm.native.bridge. If it's set to 0, Native Bridge is disabled; if not, the value is used as the name of the library to load that implements the Native Bridge interface. In our case, that would be libhoudini.so. Actually, a few interesting things about this. According to a conference talk a few years back, it seems like BlueStacks renamed libhoudini to something like lib3btrans.so, for some unknown reason. It also looks like the Android-x86 project uses its own implementation, libnb.so — but when you take a look at the source tree, it's actually just a thin wrapper that loads and uses libhoudini itself. Another interesting thing is there's a script in the Android-x86 project that enables Native Bridge, and it downloads libhoudini from a couple of obfuscated .cn URL shorteners. And yeah, it also seems like they've moved that link and the corresponding code around a couple of times — not sure what's going on there. But back on topic: Native Bridge defines two main callback interfaces. And to go off-topic again, I have to talk about JNI first before I get into those callbacks. The Java Native Interface is a foreign function interface that enables JVM code to interact with native code and vice versa. This part is actually not specific to Android; it's part of Java. On the right side, we see a struct, JNINativeInterface, and a pointer to it is typedef'd as JNIEnv. This struct is basically a bag of function pointers that's provided to native code, so that the native code can use these functions to perform low-level JVM reflection.
I cut out a lot of it — there are a lot of functions in there — but some of them are for calling methods, getting a method ID, allocating an object, getting a field, finding classes, and so on. The pointer to this struct is actually passed as the first argument when your Java code calls into native code, and we'll see how that's used later. So the first callback interface from Native Bridge is the NativeBridgeRuntimeCallbacks. It's quite simple: it's passed from Native Bridge to our libhoudini binary, so that libhoudini can find and call native functions on the Android, or native bridge, side. The second, more interesting callback interface is the NativeBridgeCallbacks, and this is kind of like the opposite: it provides a way for Native Bridge to call functions in our libhoudini binary. We see some of the functions on the right. The most interesting of these are initialize, loadLibrary, and getTrampoline, the latter two of which are quite similar to how dlopen and dlsym work — I'll show that in a later slide. This struct is actually exposed via the symbol NativeBridgeItf, which can be seen here: by looking at it in a hex editor or a disassembler, you can see all the function pointers in that data structure. So now I have all the components of Native Bridge explained, kind of, and I'm going to try my best to put them all together. Here we go. Normally it would look something like this: you have an ARM device running ARM Android, and we want to load an ARM native library. When your application launches, it calls System.loadLibrary, which triggers the Android runtime to call dlopen, and that loads our libnative.so into the process's memory. Then when our app wants to call a native function, it first does a dlsym, which returns a function pointer to our code, and then it jumps to it, with the first argument being the pointer to our JNIEnv structure.
And if our native code wants to interact with the Java world, it can do so by looking up the appropriate function in the JNIEnv function pointers. Now, this gets a little more complicated when we talk about Native Bridge. Before anything happens, Native Bridge gets loaded on boot: it checks that system property, sees that it's pointing to libhoudini.so, and dlopens libhoudini. Note that our Android and our device are x86, and so is libhoudini.so — and our goal is to run the ARM code in our libnative.so. So after libhoudini is loaded, the runtime fetches the NativeBridgeCallbacks using dlsym and calls initialize, which isn't shown in the diagram. After that, Android continues to boot up. Then when you launch your app, it tries to load the library with System.loadLibrary, which triggers Native Bridge plus the Android runtime to call our NativeBridgeCallbacks' loadLibrary, which acts similarly to dlopen — it returns a handle. And with that handle, we can call getTrampoline to get a function pointer, similar to dlsym. Now, we can't actually just use dlopen and dlsym directly, because the linker will complain that it's a different architecture — and especially for dlsym, because dlsym would give you a raw function pointer. So Houdini has its own versions, loadLibrary and getTrampoline. loadLibrary just opens our native file and maps it into memory. And getTrampoline should return the function pointer, but it doesn't — it can't return the actual function pointer into our native library, because our code is written in ARM, and it contains ARM instructions that our x86 processor won't know how to handle.
So instead, libhoudini returns a pointer to a little stub inside the interpreter, inside libhoudini, so that when we call the function returned by getTrampoline, the interpreter starts running, and the interpreter in turn starts reading the native ARM code and executing it. The last part I want to bring up is the JNIEnv pointer. I mentioned that when your Java code calls native code, the pointer to JNIEnv is passed as the first argument. We can't pass that straight through to our native code because, well, our native code is running as ARM, and the JNIEnv functions are x86. So what libhoudini does is remember where the real one is, and then create its own fake version filled with ARM instructions. Specifically, the fake JNIEnv function pointers point to a bunch of trap instructions. That way, when the interpreter sees those ARM trap instructions, it knows which of the proper JNIEnv functions to call in the real x86 structure. All right — so now that I've explained how libhoudini fits together with Native Bridge, let's start digging deeper into how the interpreter part works, starting with memory. The emulated process is dual-architecture: it contains both ARM and x86 binaries, separately. They share a single virtual address space, and both sides have the same view of memory. What that means is the x86 parts of the process and the ARM parts of the process see memory the same way, in the same address space — there's no magic translation between an ARM address and an x86 address. And the last point is that there's a separate allocation for the ARM stack. Just to show this, here's a snippet from one app process's memory map. We see our native libraries loaded up there, and down there we have libhoudini loaded. You can also see a bunch of ARM libraries loaded that are used by our native code.
Next — specifically, libc and a couple of others we can see are loaded for both ARM and x86. You also see a bunch of anonymously mapped pages, which are used internally by libhoudini, and the ARM stack lives somewhere around there. So moving on, I want to talk about the actual execution loop. I mentioned earlier it's essentially a switch statement inside a while loop. This screenshot shows the portion of the interpreter that fetches — reads — the instruction, partially decodes it, and then jumps to the proper instruction handler. In this assembly I have comments on the right, but I also have equivalent C code on the next slide. Basically, I'll run through this real quick: the snippet gets the program counter from the processor state, reads the instruction from memory, and then checks the condition code to determine whether the current instruction should be executed at all. Once it determines that it should, it calculates an offset by concatenating bits 20 to 27 and bits 4 to 7 of the instruction. That offset is used as the entry offset into the instruction handler table, which is filled with function pointers to instruction handlers, and then it jumps to the handler. So for example, our mov r0, #1 instruction has the entry offset 0x181, and each entry is a 32-bit address, so multiplied by 4 bytes we get 0x6A4. The final address this function pointer lives at is 0x4bc044, and we can see that right here: looking at that address in a disassembler, we see a pointer to the instruction handler for mov. Just note that this decompilation is not entirely correct. Already, around lines 22, 23, and 27, we see some registers being moved around, and even some shifting and masking, because the mov instruction has the option to do that.
The important thing to notice is that all of the instruction handlers take two parameters. The first is the instruction itself — the instruction bytes — so the handler can pull out the operands and fully decode it. The second argument is the processor state, and it is basically what it sounds like: a data structure that keeps track of the emulated ARM processor's state. It contains the register values, such as r0 and r1, but also registers such as the program counter, stack pointer, link register, and so on. It also contains a byte that tells you whether it's in Thumb mode or not. There are a bunch of other fields, but I couldn't really figure out what all of them do. Note that this is just a data structure in memory, and memory addresses are shared between the x86 and ARM sides — so if you find it, you can technically just write values into the structure to change the register values of the emulated processor. The next thing I took a look at was syscalls, trying to figure out how they work. Syscalls are just instructions as well — special instructions, but instructions — so we can actually find them in the instruction handler table. You can see on the right that the handler takes the same parameters, the instruction and the processor state. This decompilation is also not entirely correct, but we can see that it doesn't actually issue any x86 syscalls; rather, it just sets the SVC number field in the processor state and returns. The actual switch for issuing x86 syscalls is further down the loop in the interpreter, and depending on which syscall number it is, it does different things. Most of the time it's just a simple wrapper or pass-through, with some conversion: moving the ARM register values into the actual x86 registers and calling int 0x80. But some of them are a little more complicated than others.
One such example that was interesting to us was the clone syscall. I've actually combined fork and clone here, because nowadays if you call fork, it goes through libc's fork, which actually calls clone. Clone was also very interesting because it has a parameter called child_stack: you pass in a memory region to be used as the child's stack, and on top of that stack sits the return address, so that when the child is cloned, it returns there, and that address becomes the entry point of our child process. Now, we were wondering how that gets handled by libhoudini, and it turns out the child stack we pass in is not passed to the kernel. Instead, libhoudini creates its own empty RWX page, passes that as the child stack, and handles the parent and child logic itself. So now that we have some idea of how it works internally, let's get to the fun stuff: detection. Are there ways we, as an app, can detect whether we're running inside libhoudini or not? We came up with a couple of ways. The first: we build an ARM native app, and in that app we check the host architecture, either via Java's os.arch system property or by reading /proc/cpuinfo. But as it turns out, you can't actually do that, because Houdini hides these. When you do System.getProperty("os.arch"), libhoudini makes it say armv7l on the Java side when you're running with Native Bridge. And when you try to cat /proc/cpuinfo, it actually returns ARMv8 Processor rev 1, AArch64. Actually, if you're careful, you might be able to tell you're running on Houdini from this alone, because there seems to be some inconsistency: one of them returns ARMv7, the other ARMv8. And the hardware field says placeholder, funny enough. There are some other ways as well, like checking the memory map: you can try to read the memory maps and see whether libhoudini is loaded, or whether both ARM and x86 libraries are loaded. These are OK methods.
But we think the best methods are those that are themselves undetectable — no syscalls issued, no files opened, nothing that would trigger an analysis tool. The method I came up with uses the JNIEnv function pointers. I mentioned earlier that libhoudini creates its own ARM version of the JNIEnv function-pointer structure. Now, if you're on a real device, those function pointers will point to real ARM code. But if you're running under libhoudini, the function pointers will also point to real ARM instructions — except those instructions are syscall (SVC) instructions. I'll have a quick demonstration of that later. So the next question is: once we detect that we're running inside libhoudini, can we escape to x86 with it? Of course, we could call mprotect and write code to memory. But again, this isn't very subtle: calling mprotect would probably trigger most analysis tools. Another way is x86 stack manipulation. We know approximately where the x86 stack is, so we could try to clobber the stack with ROP payloads and have it jump somewhere. This method is much more annoying, though, and one of the harder parts is figuring out where we could actually run our code: we'd need to find a page with execute permissions, or find a lot of ROP gadgets. That brings us to security concerns. It turns out libhoudini creates a bunch of RWX pages that it uses internally — we saw one of those before, used for the clone syscall. They have read, write, and execute permissions, which means we can write x86 code into them and just jump to it. And since memory is shared, we can write that code from either the x86 side or the ARM side. Just to show what some of these pages are used for: the ARM JNIEnv — the one filled with trap instructions — is in there, and the ARM stack is in that memory region too.
So back to security concerns: we have RWX pages on the x86 side. What about trying to get code execution on ARM? Well, it turns out Houdini ignores the NX bit entirely, which means you can write code anywhere and jump to it — and I don't think I need to explain why that's an issue. But yeah, the ARM libraries themselves are loaded without the execute bit on their pages. Regarding this behavior of ignoring the no-execute bit: not that this is correct, but if you think about it, it kind of makes sense. Houdini is an interpreter for ARM, and the instructions are just data input to the interpreter. That means if it can read the data — read the instructions — it will run them. To demonstrate that, I have this little program here, nxstack.c. In main, I allocate a 512-byte buffer on the stack, write two ARM instructions into it, cast the buffer into a function pointer, and jump to it. Normally, on a real ARM device, this causes a segfault. But as we see below, it doesn't — it just works. Actually, in the first iteration of this code, I accidentally had the buffer outside of the function, so it was in the data section or some other region, and it still worked perfectly fine. Well, "worked" — this runs fine on devices running with libhoudini. So the next step is a couple of quick demos. The demo is on the Chromebook, and for it I wrote this app. I've actually built two separate versions of it from the exact same source: one built with only x86 native code, and one that only contains the ARM binary. The top one is the x86 one, and the bottom one is the ARM one. The Chromebook itself is x86, so the bottom app is running through libhoudini. The first tab shows /proc/cpuinfo — well, you know, the values I mentioned before. The top app doesn't involve libhoudini at all — it's x86 on x86 — so all the values are correct: we see GenuineIntel, and everything is all nice.
Whereas on the bottom, we're running with libhoudini, and we see the output from before: the armv7/armv8 inconsistency, as well as hardware equals placeholder. The second tab demonstrates the detection method I quickly described. On the x86 app, when we dereference the GetVersion and CallStaticIntMethod function pointers, we see valid x86 instructions — I think those are a bunch of push instructions, as you often see at the beginning of a function. On the bottom, when we do the same thing — this one running with libhoudini — we see ARM instructions instead. And specifically, those instructions, 0xef000000 and so on, are SVC (supervisor call) instructions. So you can use that as a method to detect whether libhoudini is running or not. The third tab is actually not a demonstration, just a utility to show the process's memory map. On x86, there's no libhoudini, and everything looks fine, right? But when we look at the ARM version's process map, we see a bunch of anonymously mapped memory, we see libhoudini right there, and we also see our ARM libraries loaded right there. OK. So I think the most interesting tab is the last one, the exec tab, which demonstrates the NX bit — or rather the lack of an NX bit check — on the ARM side with libhoudini. The top one is running without libhoudini. Just to explain the layout: on the left side, you type in some bytes, and when you click run, they're passed to native code; those bytes are written to memory and then jumped to. However, the top one is our x86 version, so obviously we can't run ARM instructions, and there's no libhoudini loaded — it's going to crash. That's the intended behavior.
However, in the app running with libhoudini, we can just type in valid ARM instructions, click run, and it runs. I have a couple of different programs written up there so I don't have to type them in manually. Run — and multiply multiplies r1 and r2 and then adds the result to r0. That's correct. And getSP actually reads the stack pointer of the ARM processor and returns it. And just to show you that this is dynamic — that these bytes are actually being copied — I can change the actual bytes of the instructions. Reading register 15 means reading the PC. The left side is completely changeable: as long as they're executable ARM instructions, it will run them. I can change the one to a two or a three and it adds that instead; same thing for adding two integers, and the arithmetic comes out as expected. Just for completeness, I have the same app running on a real ARM device. This device happens to be ARMv8, so it says ARMv8 and AArch64 processor, and it all looks correct — there's no libhoudini running, because it's an ARM device running ARM code. We go to the exec tab — or first the detection tab — and we see just valid ARM instructions that are not SVC calls, and the maps also look completely fine: no Houdini. And of course, since this is running on actual ARM hardware rather than under libhoudini, when we try to copy these bytes into malloc'd memory, or the stack, or the heap, and then jump to them, it should crash — and it does. With the demos out of the way, let's now talk about the possibilities of malware that knows about libhoudini. To start, we know that applications are often run in sandboxed environments for analysis. This is mainly done in one of three ways. Running them on actual devices gives the most realistic behavior, but it is hard to do at large scale and also hard to instrument.
The second-best option is fully virtualized environments like QEMU, but these have some performance overhead, since they have to emulate the entire hardware and processor. And that brings us to our third option: Android emulators. Android emulators on x86 hosts can use technology like Houdini to run the application. This has the least overhead, as it only emulates parts of the application instead of the whole hardware and operating system. On another point, most of you would agree that inconsistent behaviors are harder to debug; similarly, apps that may or may not behave maliciously are harder to detect and also harder to analyze. So let's combine those points. For example, a malware author can use one of the detection methods mentioned previously to figure out whether the app is running with libhoudini. It's then possible for the malware to act benignly when it thinks it's under analysis — by seeing that libhoudini is in use — and, in other cases, show malicious behavior when libhoudini is not present. Yeah, so what about the other way around? Malware could also perform malicious actions only when Houdini is present, abusing knowledge of its inner workings to further obfuscate itself. For example, we don't know what the Play Store uses nowadays, but it seems like their automated app testing doesn't run ARM APKs on x86 with libhoudini. In a case like that, malware could detect that it's not under their analysis, and when it is running on libhoudini — for example, inside a commercial emulator — it could do tricks like running code from the stack, which you can't do on a real device. And trying to analyze that would prove difficult, because a static analysis tool would see that the app writes some code onto the stack and jumps to it, and conclude that it should crash — unless you know that under libhoudini, it works.
So we finally come to the recommendations — how not to write an emulator. We can start by talking about the RWX pages. We noticed that libhoudini maps a bunch of RWX pages to be used internally, and those should not be there. If they're really necessary, we recommend finer-grained page permission control, and one part of that would be an efficient NX implementation. We understand that checking page permissions on every instruction would incur a very significant overhead — for every instruction you want to run, you'd have to check the page permissions in software. So instead, what you can do is keep track of permissions in a data structure and only check when the instruction you're currently running is on a different page than the previous one — that is, on jumps, or when execution crosses a page boundary. This basically becomes a userland page table implementation. That said, our first recommendation is to just use virtualization — simple enough. But regarding actually implementing the userland page table in software, you can do it in a couple of ways. One: only trust the text section of the library at load time. Two: check the memory map each time, and if a new page has been added, add it to the data structure you keep. And three: hook the memory-mapping-related syscalls, so that whenever, for example, mmap or mprotect is called with execute permissions, you update your data structure accordingly. The ideal solution combines the last two, two and three: every mmap or mprotect adds an entry into the data structure that tracks page permissions, and just as a catch-all, you check the memory map for new pages that aren't already in there.
And this has some good side effects: now that we have a userland page table, we can support dynamic library loading via dlopen, and we can also support legitimate just-in-time compilation. And of course, used JIT pages should be properly cleaned up after use to prevent page-reuse attacks. Another thing: this data structure is critical, as it acts as our page table, and it should be heavily protected. Some of the measures we suggest are making it writable only while it's being updated, putting guard pages around it, making it inaccessible to the ARM side, et cetera. Another recommendation, for researchers or vendors doing analysis of Android applications: when you're running dynamic analysis, you should also run apps through libhoudini. As we mentioned, it's possible for malware, or any other application, to behave differently when it sees that libhoudini is enabled. Also, when doing static analysis, look for accesses to Houdini's RWX pages, or attempts to execute from non-executable pages — which would work if the app were running under libhoudini. And to add to that, you can scan for reads of the JNIEnv function pointers, as that was one of our detection methods. So, to summarize what I'm trying to say in this presentation: Houdini introduces a couple of security weaknesses into the processes using it — that is, ARM native applications running on x86 devices. Some of these impact the security of the emulated ARM code, such as the lack of an NX bit check, while some also impact the security of the host x86 code, such as the read-write-execute pages everywhere. And I actually think the fact that Houdini is not well documented publicly, nor easily accessible, has something to do with preventing wider security analysis and research into it, which could have caught these issues earlier. Which brings us to our few last slides.
I'd like to give big, big, big special thanks to Jeff for mentoring this project and helping develop the methodology; also Jennifer for all the support in research and the amazing feedback, and Esi for basically bootstrapping this research. And with that, thanks everyone for joining, and I believe we're at Q&A now.