First off, I have a couple slides in this talk that have a bit of small font for code samples and stuff. So if you go to this URL, thanks to the great website that this event has, you can just look at the slides on your own laptop. All right. I'm Jann Horn from Google Project Zero. I'm here today to talk about some tricks that can be used to exploit race conditions in Linux environments. All of the bugs that I'm talking about have been fixed for a long time; one of the bugs is from 2016, the other stuff is from last year, and all of the exploits are against kernel 4.4, which in Android land is still relevant for some devices, but of course in desktop Linux land is very ancient. But the focus here is on exploitation techniques, not on the individual bugs, so I think that's fine. I'm going to be using three different bugs as examples today to talk about exploitation techniques. The first bug is something that gives you a use-after-free of a physical page through a stale TLB entry, and exploiting that requires hitting a narrow timing window, and I'll talk a bit about how the behavior of the buddy allocator influences how you can exploit this, and how you can play with preemption and with the scheduler to make your race window bigger. Then there's a second bug, a kernel bug that let you decrement the reference counter in a struct file, and I'll talk a bit about how you can exploit that in a way that would normally be a race condition, but then use userfaultfd and FUSE to make it deterministic again, and how the kcmp syscall helps with exploiting use-after-free bugs. And then there's a third bug, which was an Android user space bug, where exploitation required a primitive that was somewhat similar to what you'd get with userfaultfd, but that's unavailable on Android, so I'll talk a bit about how you can build a poor man's alternative to that. 
Okay, so the first bug I'm talking about is a race between the mremap and fallocate syscalls. Here's a quick bit of background that you need to understand how the bug works. In the CPU, you have the translation lookaside buffer (TLB), which caches page table entries so that you can do address translations faster than always walking the page tables. And while the page table entries are essentially refcounted pointers to pages, the TLB just borrows references from your page tables, because the CPU isn't going to be incrementing and decrementing references on your page structures. So this means that when you're removing entries from your page tables, you need to also clear out the corresponding cached entries in the TLB, otherwise you get use-after-frees. The first syscall that participates in this bug, and the one that actually had the buggy code, is the mremap syscall. mremap can be used to move a memory mapping from one virtual address to another virtual address. This of course requires creating a new virtual memory area structure and so on, but it also requires moving the actual page table entries and allocating new page tables for the destination address range. So mremap has to go and allocate new page tables for the new mapping, then move the page table entries over from the old mapping into the new mapping, and then clear the translation lookaside buffer for the old address range. The second syscall that participates in the race is fallocate. fallocate allows you to allocate or deallocate space for a file, and if you use fallocate to deallocate space in a file, it not only deallocates space on disk, but it also tries to free up page cache pages that are currently cached in the kernel. So when you use fallocate, it has to go and iterate through all of the virtual memory areas across different processes on the system that include this file range, and then it tries to yank out the pages from them. 
So it looks for page table entries that are non-zero, deletes the page table entries, flushes the TLB for the range, and then drops the references on the pages. And the race here was that mremap didn't hold any locks between the time it moved the page table entries from the old address to the new address and the time it did the TLB flush for the old address. So this means that you could have a race where first mremap moves the page table entries from the old address to the new address, but you can still have stale TLB entries for the old virtual address. Then fallocate comes along, deletes the page table entries at the new address, does a TLB flush that removes entries for the new address range, but not the old address range, and then fallocate drops the references on the pages. And at this point, you'd have stale TLB entries pointing to pages that have already potentially been put back on the page free list. So this gives you a use-after-free of a physical page. And on Linux before 4.9, this was actually exploitable to get write access to pages. Starting with 4.9, getting write access was a very narrow timing window, and only read access to pages was easier. So I decided to write an exploit for the Pixel 2 phone, because that still runs a 4.4 kernel. The first exploit idea that I had here was: okay, we have a physical page in the free list, and we have full read and write access to this page. Let's try to reallocate this page so that it contains kernel data. The thing is that the kernel's page allocator, the buddy allocator, has vaguely the behavior I have up here. This isn't entirely accurate, and if someone here knows the allocator, please don't think too badly of me for simplifying it this way. But basically, the idea is that we have this free page that gets put onto this red per-CPU free list up there, which is a free list that is specific to the migration type, MIGRATE_MOVABLE, for pages that can arbitrarily be moved around by the kernel. 
And kernel allocations for kernel data structures normally come from the MIGRATE_UNMOVABLE lists. And the page allocator kind of tries to keep those migration types separate, and if you want a page to move from the movable type to the unmovable type, you need to create memory pressure and stuff. It gets messy, and I didn't want to deal with that. So instead, I decided to go for reallocating the page as, again, a page belonging to user space. You could reallocate the page as an anonymous page in some other process, but that would require you to interact with another user space process, and that user space process might do other things that disturb the kernel heap and so on. So that didn't seem like such a good idea. So I decided to go for reallocating the page as a page cache page. What this means is you trigger this bug to get the page freed, then you trigger a read on a specific page of some shared library that is used by a privileged process, and then this read will allocate the page that you just freed, put it in the page cache for the file, read data from disk into this page, and then hopefully you can modify this page of code before some privileged process continues to use this code, and then you can escalate privileges into the context of that process. There are two things that make this kind of hard. The first thing is that you need to detect the right point at which to interrupt the mremap and do your fallocate operation. Luckily, as an attacker, we can use procfs for this, because procfs contains statistics about the memory usage of each process, including specifically how much memory is used for page tables. So we can see exactly when mremap is allocating memory for new page tables and then use that as a signal. The other thing is that the way I chose to exploit this requires that we have this stale TLB entry for long enough. 
It requires that the mremap operation takes so long that in the middle of the mremap operation we can not just reallocate the page but also do disk I/O, because we want to be doing our use-after-free write after data has been fetched from the disk into the page. Otherwise the disk read is just going to overwrite the data we put in there. So the plan for exploiting this requires knowing a bit about how preemption works. On Linux, user space programs can always basically be preempted by sending inter-processor interrupts or something like that. But it gets more complicated when you're running in kernel context. Linux supports three different preemption models. One of them is the voluntary preemption model, where kernel code explicitly calls cond_resched() to say, hey, I can be preempted at this point. This is used by many Linux distros by default, but Android uses the full preemption model, in which you can send inter-processor interrupts to preempt kernel code that is executing in almost any context, as long as it's not holding a spinlock or something like that. And importantly, mutexes, unlike spinlocks, do not prevent preemption. And Linux also gives us a lot of control over the scheduler's behavior. You can't tell the scheduler to make your task run faster and give it higher priority, but you can tell the scheduler that your own tasks should run at a reduced priority, or should only be running on certain CPU cores, or things like that. And if you, for example, say this task should have idle priority, it should run at very low priority, and there's some other task that's running on the same CPU at normal priority, then you only get woken up something like once a second. Also, a task with idle priority never preempts other tasks. 
So this means that if you have an idle task that is waiting for some input to arrive, and then that input arrives, but some other task is currently running on the same CPU, then you will not be switching over to the idle task until the next scheduler tick arrives or something like that. And importantly, these scheduler controls do not just affect your tasks when they're executing in user space mode; they also affect the execution of kernel code in syscall context on behalf of your tasks. So this means that you can, for example, create two tasks yourself, put them on the same CPU core by telling the scheduler these tasks may only run on that CPU core, set one of them to idle scheduling priority and the other one to normal priority, and then let the idle priority task start executing some syscall that, for example, takes a lock at some point. You can have your idle task execute some syscall, and then in the middle of that syscall, wake up your normal priority task, and then you can use this to stall the execution of the syscall, potentially in the middle of some race window, for quite some time. So here's a timing diagram of how I ended up exploiting this bug. I'll go through this step by step. We have five tasks here across four CPU cores. Each task is pinned to a specific CPU core. At the start, we have tasks B and C sharing one CPU core, each other task has its own CPU core, and only task B is running at idle priority; everything else has normal priority. Task E is a task that is just in a busy loop trying to read and write the virtual address in the old mapping where we are about to get the stale TLB entry. The idea here is that by constantly accessing the address in a busy loop, as long as the page table entries still exist, whenever the TLB entry goes away we are immediately refreshing the TLB entry. 
And then later, when the page table entry has gone away and we still have the stale TLB entry, we can detect when the page contains the code we expect, so we actually know that the page was reallocated in the place we wanted, and then we can overwrite the page through the stale TLB entry. Task C starts off by reading from some empty pipe, which causes it to block, which means that at that point task B can execute, since it's the only thing that's runnable on that CPU. Task B starts executing mremap and starts allocating memory for page tables and moving page table entries. At this point, task D, which is in a busy loop polling the statistics in procfs, can notice the progress of the mremap operation and wake up task C by writing to the pipe that task C is blocking on. At this point, task C preempts task B, because task B is idle priority and task C is not. So now task B is in the middle of this mremap operation, right where we wanted it. And the first thing task C now does is move task B over to CPU zero, where task A is running, and task A is just spinning in a loop, so task B is probably not going to get woken up for quite some time. At this point, task C can use fallocate to perform the other side of the race, putting the page on the per-CPU free list, and then use the pread syscall to read a page from the library we're targeting and reallocate the page, and at that point, hopefully, task E will then see the page cache contents and overwrite them with arbitrary code. Okay. That was the first bug. As a second example, this one's a bit easier. This was a bug from 2016, a bug that let you arbitrarily decrement the refcount on a struct file. The bug itself doesn't actually have a race condition in it, but the way I chose to exploit it would normally be a race without special tricks. A bit of background for this: on Linux, you have two mechanisms, userfaultfd and FUSE, that allow user space to synchronously handle page faults. In the case of userfaultfd, that's precisely the intent. 
userfaultfd is specifically a mechanism with the intent of allowing user space to synchronously handle page faults. Whereas with FUSE, you can basically construct the same primitive by mounting a FUSE file system and then mapping a file from it. When you have a page fault on a file that is backed by a FUSE file system, the FUSE file system gets to resolve the page fault whenever it wants. And you can use these two tricks on normal desktop Linux systems to arbitrarily block kernel code execution at any point where the kernel does things like copy_from_user, copy_to_user, get_user, put_user, and so on. But on Android, userfaultfd and FUSE are not exposed to unprivileged code, so it's not really usable there. So here's the bug I want to use as an example here. You can see the code on the right-hand side. We have this fdget, which takes a reference to a struct file. Then we call this __bpf_map_get function, which has an error case where it calls fdput, then returns an error code, and then we go into the error branch in the upper part of the code, and this branch again calls fdput. So it's a straightforward bug that just over-decrements the reference count. A syscall that's very useful for exploiting use-after-frees on certain data structures, including struct file, is the kcmp syscall, which is available on kernels with CONFIG_CHECKPOINT_RESTORE enabled. kcmp, and you can basically see what it does on the right-hand side, is a syscall for making arithmetic comparisons between obfuscated kernel pointers. 
So the intent here is that if you have, for example, a process with a big file descriptor table and you want to figure out which ones of those file descriptors map to the same struct file, in other words, which file descriptors share the same open file, instead of doing pairwise comparisons between all the file descriptors somehow, you can do a normal N log N sorting algorithm, where as the comparison function you use the kcmp syscall. So this works on a bunch of different data structures in the kernel that are quite important, including struct file, mm_struct, files_struct, and so on. But this is also useful for exploiting use-after-frees, because, for example, if in your file descriptor table you have a dangling pointer to a file that has actually been freed, and then you reallocate that memory for another struct file and get a pointer to that in your file descriptor table, then you can ask kcmp: hey, did I reallocate the new struct file in exactly the same place as the old struct file, or did I hit some other memory location? So this can be used to make exploits very reliable. And I think this also might have some interesting implications in the future, unless it's specially handled, for memory tagging. You can ignore this if you don't know what memory tagging is, but basically with memory tagging, you have these tag bits as part of your pointers, and these tag bits are secret, and if you can leak them, you can defeat the mitigation. And this thing compares the complete pointers, including the tag bits, so you could use this to repeatedly query whether the tag bits are the same, and if they're not the same, you can retry until they match up. Yeah, so here's what, back in the 4.4 days, the vfs_writev function looked like. 
So at the very start of the vfs_writev path, we have a check that checks whether we have write access on this file descriptor, and if we do not, it bails out. So the way you can exploit the bug is: you create a FUSE file system, you map a file from this file system, you open /dev/null as writable, and then you start a writev operation on /dev/null that has its I/O vector stored inside the FUSE-backed mapping. So the writev syscall comes in, does the check, sees the file is writable, we can continue, and then further down you can see this import_iovec call, where we import the I/O vector from user space. So at this point, import_iovec has to do a copy_from_user on this memory region that is backed by a FUSE file system. This blocks until user space resolves the page fault by supplying some data. So at this point, we have as much time as we want to trigger our use-after-free: free this /dev/null struct file that we are operating on, and reallocate it as something else. Now, normally when people exploit use-after-frees, they do stuff like replace the struct with a struct of a completely different type, so we turn our use-after-free into a type confusion where we are interpreting numbers as pointers, or interpreting a pointer of one type as another type, or something like that. But what you can also do is just allocate another struct file. So we open /etc/crontab as read-only, and then we can use the kcmp trick to check whether we indeed placed the crontab file at the same location where we previously had our /dev/null file. And if we see that it worked, we can resolve the page fault, writev continues, and it performs the actual write operation on the /etc/crontab file. So now we can write arbitrary content into the crontab and elevate privileges to root, and this whole thing works without any type confusions or ROP or any of these things. 
Okay, and the third bug example that I have is a misuse of the getpidcon function on Android. As background, when you have some system where you're integrating with SELinux, user space daemons sometimes need to figure out: what is the SELinux context of the peer that I'm talking to? Like, if you have some daemon and it's receiving requests from some client and has to check, is the client allowed to do this? For Unix domain sockets, the situation is pretty nice. You can use things like SO_PEERSEC to ask the kernel, hey, what's the security context of my peer? But until recently, Android's binder IPC system didn't tell you that. Binder just gave you the UID and the process ID of the sender. So luckily there's a helper function that you can use in this case, which is called getpidcon. You give it a PID and it gives you the SELinux context of the process with that PID. Obviously this has problems because, for example, there's the classic PID reuse problem: if the sender of the message goes away, and then another process spawns and reuses that PID before you get around to doing this check, then you see some completely different SELinux context that has nothing to do with the actual sender of the message. And the way that getpidcon is implemented is basically that it opens, in procfs under the process directory, the attr/current file, and reads from that. So in Android, there's this hardware service manager thing. You don't really need to know what it is, but it's basically some daemon that manages names. And this is reachable from the normal application context and from other places in the system. This thing receives some IPC calls and has to figure out what the context of the sender is, and it used getpidcon for this. So to exploit this, we have to exit our sending process, and then we need to make some privileged thread spawn somewhere in the system that reuses the PID. 
And making a privileged thread spawn somewhere in the system requires user space interaction and isn't very fast. So it would be nice if we could stretch out this race window between the time the binder IPC is received and the time getpidcon actually reads the SELinux context. And luckily, on 4.4 kernels, we can make this getpidcon call, which just opens the file in procfs and reads from it, take something like 15 to 20 seconds. As background for this: up until kernel 4.7 or so, there's a mutex in the inode struct which protects a bunch of operations, including getdents, which is used for the readdir() libc function. This function takes this mutex on an inode, then iterates over the directory entries for this inode, copies the directory entries to user space while holding the lock, and then drops the lock at the end. So if you have a big directory and you're doing this operation, you're holding this mutex while accessing a lot of user space memory, which can take a lot of time. Another path that also takes the inode mutex is the lookup_slow function, which is used if you don't have a cached directory entry for the name that you're trying to look up. So for example, in procfs, if you haven't accessed a process through procfs before, there won't be a cached directory entry for it. So this means that if we can make getdents take a long time, then we can also block the open call that getpidcon does for the same amount of time. 
Most of you probably already know this stuff, but in operating systems you can have the problem of priority inversion, where you have three tasks: task A with high priority, task B with normal priority, and task C with low priority. If task C, which has the lowest priority, takes some lock and then gets preempted by task B, which runs for an extended amount of time, and then at a later point task A wakes up and tries to take the lock, task A blocks on the lock: task A can't acquire the lock because task C is holding it, and task C can't make progress because task B is hogging the CPU. So effectively, task B is running even though it has lower priority than task A. And this doesn't just apply if you have tasks A and B with different priorities. It also applies if you have tasks A and B with the same priority, because then you're still violating fairness between two processes that should be scheduled like 50-50, but actually one of them gets all of the CPU time. And this also works with normal mutexes, because they don't protect against priority inversion unless you're on like a PREEMPT_RT system. So this means that we can potentially block execution by artificially creating priority inversion problems. Okay, so here's the basic idea. Instead of using userfaultfd to block a user space access for as long as we want, we create two tasks, task A and task B, and pin them to the same CPU. Task A gets idle priority, task B gets normal priority, and task B is executing a spin loop. So now we let task A execute a getdents operation. This getdents operation takes the mutex, and then it starts doing a user space memory access. The user space memory access triggers a page fault, which triggers I/O. Now, the I/O operation itself is relatively short, but when we trigger I/O, our task stops running and yields the CPU to another task until the I/O operation is completed. 
But because idle tasks never preempt the execution of non-idle tasks, even after the I/O operation is completed, our task A doesn't get scheduled again for something like a second, actually. So this allows us to make this user copy operation, which is happening while holding the mutex, stall for an extended amount of time. And we can repeat this if we have a large user copy operation. So we can map a bunch of pages such that the readahead logic doesn't fire, for example by explicitly opting out of readahead. And then we can, by spawning a bunch of processes, make it so that the getdents operation on procfs writes to something like 21 pages. Then we get something like one second of delay for every single page fault, waiting for the scheduler to move us back onto the CPU, and this gives us something like 21 seconds of total delay. And for this duration, we can stall the getpidcon call. Well, I went through my slides way faster than I expected. Yeah, that's it. So, questions. I think it had a lot of interesting details. And the part about kcmp and memory tagging is fully new for me, so I'm going to check that out; it sounds very scary. Thanks for the great talk. First question, about the first bug: why is it easier to exploit on kernels before 4.9? Was it the scheduler changes? No, it was a specific change in the vulnerable code part. Basically, the code behaved differently depending on whether the PTEs it was flushing were writable or not, and in the case where the PTEs are writable, it would actually do a TLB flush earlier. On newer kernels? Yeah. Thank you. And second, about the second bug: I like userfaultfd as well, very much. But there are cases when you have several kfree calls before you have your code running, your spray running. What happens then? So the freed element is somewhere behind in the free list, and your next allocation doesn't reach it. Yes. So I actually kind of oversimplified this here a bit. 
And what I actually did was, I think, I opened it a bunch of times, and then I used kcmp on each of the open instances to see whether one of them managed to reuse the same allocation. So like, yeah. And what was the size of the slab element? Which slab cache did it happen in? Struct files have their own slab cache, modulo, like, slab merging stuff. So they're not on a kmalloc slab. Okay. And how many files with crontab did you have to open to reach it? Was it a lot? I don't know where I put this. Sorry, I don't remember; this was in 2016. Thank you very much. Thank you. Great talk. Questions? Everything was very clear. No more questions? This might be quite ignorant: on normal desktop Linux, all these scheduler calls, anybody can make them, right? Can normal user code do that on Android as well? Yes, normal user code on Android can use sched_setscheduler and the pinning stuff to move to another priority or to pin to specific CPUs. Next speaker.