Every Linux device is riddled with security bugs, bugs in the Linux kernel itself in particular. There's a system called syzbot, which continuously fuzzes the kernel and reports the bugs it finds. If you look at its dashboard, you will see that there are over 800 bugs that have not been addressed yet. And these are just the bugs that have been found; many more remain unknown. Now, not all of those bugs are vulnerabilities, but some of them are. One example is Bad Binder, a bug in the Android kernel that was supposedly used to attack Android users. It was ultimately found and fixed by Project Zero, but it had actually been reported by syzbot two years before that, and no one noticed it back then.

A couple of years ago, the Android security team decided to find out the main cause of vulnerabilities in Android, so they studied the bugs reported through their bug bounty program. As they discovered, 90% of those bugs were memory corruptions. Of course, bugs reported through a bug bounty are only a fraction of the bugs present on real devices, and it could also be that memory corruptions are just the low-hanging fruit. Still, this study suggests that most of the bugs that exist on Android devices today are memory corruptions.

Unfortunately, finding and fixing all security bugs is impossible; there are just not enough resources for that. Instead, it makes sense to dedicate these limited resources to implementing mitigations. A mitigation is something that prevents bugs from being exploitable. Since the majority of security bugs are memory corruptions, it makes sense to start with them. A number of mitigations for memory bugs have already been implemented: for example, PAN and PXN, which prevent the kernel from directly accessing userspace, or CFI, which doesn't allow the kernel to call wrong function pointers. However, these mitigations target exploitation techniques.
They do not try to prevent memory corruptions from happening in the first place. And this is where memory tagging comes in.

My name is Andrey, and I do stuff related to Linux kernel security. I've worked on adding memory tagging support to the Linux kernel, and this will be the focus of my talk today. We will explore a recently added ARM feature called the Memory Tagging Extension and how it's integrated into the kernel. We will also try to figure out how good memory tagging is at preventing attackers from exploiting kernel memory corruptions.

As you can see, I'm recording this talk outside instead of sitting at home in front of my computer. This idea was inspired by a BSides Berlin keynote I watched recently. That keynote was given by Fabian, who is also known as LiveOverflow. Recording outside is a bit awkward; there are sometimes people walking past looking at me. But I think this is an awesome idea, and I decided to give it a try. So I hope the result will be entertaining for you to watch. And with that, let's talk about memory tagging.

Memory tagging is a technique for detecting memory corruptions, and here's how it works. All memory is divided into small blocks called granules, each of the same size. Each memory granule is associated with a tag. You may think of a tag as a number, but I'll be using colors to represent tags. Multiple memory granules can have the same tag. Besides memory granules, each pointer into memory also has a tag attached to it. These tags take the same set of values as memory tags. When a memory block is allocated, both the allocated memory and the returned pointer get marked with the same tag. Now, whenever memory is accessed through a pointer, the memory tag is checked against the pointer tag. If the tags are the same, the access is valid and execution continues. However, if the tags are different, this means that the memory is not being accessed through the right pointer.
So whenever a mismatch between the pointer tag and the memory tag is detected, there is some kind of memory corruption.

Memory tagging is a great concept. However, to implement it, we need a mechanism for assigning and checking memory and pointer tags. There are purely software implementations; those rely on compiler instrumentation, where the compiler inserts checks into the program and these checks run during the program's execution. However, software implementations are slow. Instead, it would be cool if memory tagging support were built into the CPU itself. Unfortunately, until recently there was no such support in any widely used CPU architecture. Two years ago, ARM announced a new extension to its architecture called the Memory Tagging Extension, or MTE for short. To be clear, this is not a CPU that supports memory tagging: ARM releases a specification, and CPU manufacturers then implement it in their chips. Right now, there are no CPUs that support MTE, but hopefully we will see them in the coming years.

Previously, we looked at how memory tagging works in concept, but we had some details missing, like how pointer and memory tags are stored, or how and when these tags are compared. These details depend on the implementation, so let's take a look at MTE and try to fill them in.

Let's start with pointer tags. To store pointer tags, MTE relies on another ARM feature called Top Byte Ignore, or TBI for short. When TBI is enabled, the CPU ignores the most significant byte of a memory address on access. This means that whatever value we put into the highest byte of a pointer, the CPU will assume it's either 0xFF for kernel addresses or 0x00 for userspace addresses; the actual value of the byte does not matter. With TBI, the top byte seems like a great place to store any data associated with the pointer, like, for example, pointer tags. And that is what MTE does.
However, instead of using the whole top byte, MTE only uses 4 bits; the remaining bits are left to be used by other ARM features. Having 4 bits for a pointer tag means that MTE allows up to 16 different tag values. Storing the pointer tag in the pointer itself is also great because it's easy to read or modify its value: you only need a few bitwise operations, no special instructions or anything like that.

Now, let's talk about memory tags. This is where things get a bit more complicated. With MTE, every 16 bytes of physical memory have a corresponding memory tag. In other words, the size of a memory granule for MTE is 16 bytes. In turn, the size of each memory tag is 4 bits, which matches the size of a pointer tag. But where are these memory tags stored? Unlike with pointers, we don't have any spare bits for that. It turns out MTE uses a dedicated RAM region to store memory tags. This region is reserved during CPU boot, and reserving a part of RAM means that it will not be accessible for normal usage; basically, this RAM region becomes invisible to the system. Since we have one 4-bit memory tag for every 16 bytes of RAM, MTE effectively decreases the size of RAM by about 3%: a 4-bit tag for every 16 bytes means one byte of reserved memory for every 32 bytes of normal memory, and 1/32 is about 3%.

Now, if memory tags are stored in an invisible RAM region, how do we change their values? For this, ARM introduced new MTE-specific instructions to manipulate memory tags. There is the STG instruction, which sets the tag value for an address, and there is the LDG instruction, which reads the tag value. There are also new instructions related to generating tags. For example, the IRG instruction inserts a random tag into a pointer. Internally, it uses a hardware source of entropy, which makes the generated tag values hard to predict. These are just a few examples; there are 16 new instructions in total.
Finally, the thing that binds memory and pointer tags together is the tag checks performed by the CPU. With MTE, whenever the CPU executes a load or store instruction, it will internally check that the pointer tag matches the memory tag, and if the tags are different, it will generate an exception. Well, actually, it's a bit more complicated. MTE has three different modes that specify how the CPU checks tags and how it handles tag mismatches.

The first one is called the synchronous mode, or sync for short. In the sync mode, the CPU checks tags while executing the instruction; until the tag check is complete, the next instruction will not be executed. This mode makes it possible to tell precisely which instruction caused an exception. However, as the instruction itself might execute faster than the tag check, this mode might slow down execution.

Another mode is the asynchronous mode, or async. In this mode, instead of waiting for the tag check to complete, the CPU goes on, and the check is executed asynchronously, hence the name. In the async mode, when the CPU finds a tag mismatch, it sets an exception bit in one of the system registers. This register is then supposed to be checked manually. Compared to the sync mode, the async mode is faster: it doesn't make the CPU wait until the tag check is done. The disadvantage of this mode is that the CPU is unable to point out the exact instruction that caused the exception. Finally, there is a mixed mode that is a combination of the first two, but I'm not going to expand on it.

To summarize, MTE is an ARM feature that implements memory tagging. It relies on TBI to store pointer tags, and it uses a special RAM region for memory tags. MTE adds a few new instructions to manipulate these tags. And finally, a CPU with MTE support internally compares tags on accesses and reports mismatches. Perfect: this is exactly what we need as a hardware implementation of memory tagging. The next step is to take MTE and integrate it into the kernel.
All right, how about a change of scenery? Let's move to a different location and continue there.

Now, let's try to figure out how to integrate MTE into the kernel. There are two parts to this. The first one takes care of the things specific to the ARM architecture. This includes initializing system registers during boot to enable MTE and adding high-level wrappers for working with MTE tags: basically, wrappers for those STG and IRG instructions that I mentioned. The architecture part also provides an interface for receiving notifications about tag mismatches. This part is fairly low-level, and it's outside the scope of my talk today. Many thanks to Vincenzo, Catalin, and the other folks from ARM for implementing it.

The other part is about taking the routines provided by the architecture part and using them in the kernel, for example, making kernel allocators tag memory and pointers. This is what I worked on, and this is what we'll discuss right now. To implement this part, we need to change the implementation of kernel allocators. The simplest approach would be to call the MTE routines from the allocators' code. However, if you start looking for places from which these routines should be called, you will keep stumbling upon calls to the routines of another tool. That tool is called KASAN.

KASAN is a memory bug detector for the Linux kernel. It catches many types of memory corruptions, including out-of-bounds accesses and use-after-frees. KASAN is the go-to tool for testing and fuzzing, but it significantly slows down execution, so it is impossible to use in production. If you look at how KASAN works, you will find that it's very similar to MTE. KASAN checks each memory access for correctness, just like MTE; however, KASAN uses compiler instrumentation for that. KASAN stores metadata for each memory cell in shadow memory, which it preallocates. That metadata is how KASAN keeps track of which memory is accessible and which is not.
For MT, we also have metadata of a similar kind, and that is memory tags. Finally, Casan adds annotations to kernel allocators, and this is what we need for MT as well. Casan is so similar to MT in its structure that even has a software memory tagging mode. This mode is called Software Tag-Based Casan, and I gave a talk about it at the Android Security Symposium last year. To be honest, the existence of this mode is not a coincidence. It was added in preparation for MT. Since Casan is so similar, it makes sense to reuse parts of its implementation for MT. The main change would be to make Casan annotations use empty routines from the architecture part. And that is what I did. I added a new Casan mode called Hardware Tag-Based Casan. The word hardware means that it's based on a hardware implementation of memory tagging. For simplicity, I refer to this mode as MT-based Casan or In-Colonel-MT. This mode is intended to be used in production as a security mitigation, unlike the other Casan modes. Let me show you how MT-based Casan works with a few examples. We will start with looking at how it detects slap-out-of-bounce bugs. Let's say the kernel wants to allocate 35 bytes by Kmalik. Here's how this happens when MT-based Casan is enabled. First, the slap allocator rounds the requested size up to the granular size. Then it chooses a suitable memory location. For our allocation, the most fitting Kmalik cache is Kmalik64, so the total size of the allocated object will be 64 bytes. Before a slap object is allocated, its memory tags already have some values from previous operations. These values are known to us. After choosing the location for the object, the allocator generates a random tag. Let's say this tag got the value 2. Then the allocator marks both the pointer and the memory with this tag. For the pointer, it stores the tag in the top byte. Note that the allocator does not mark the whole object. It only marks enough memory granules to cover the requested size. 
The leftover granules are marked with a so-called invalid tag, which has the value 0xE. This tag is reserved for marking inaccessible memory; it's not one of the tags that are generated randomly. Finally, the allocator returns the tagged pointer. This is how the memory tags look once our object has been allocated: the first three granules and the pointer are marked with the generated tag, 2, and the leftover granule is marked with the invalid tag, 0xE.

Now, if the kernel accesses the object within its bounds, there will be no tag mismatch, and this makes sense, as there's no memory corruption. But what will happen if the kernel accesses the object out of bounds? Well, it depends on where exactly this access lands. If the kernel accesses offset 50, for example, the access lands in the leftover granule. This granule is marked with the invalid tag, so in this case, there will be a tag mismatch, and MTE will catch the bug.

Now, what happens if the kernel accesses some other address in memory, for example, if it does an out-of-bounds access with a large offset? In this case, two things can happen. The most likely result is that the access will land in a memory granule that has a different tag than our object. MTE has 16 different tags, so the probability of two random tags being the same is about 6%. Nevertheless, this can happen: the access might land in a memory granule with the same tag, and in this case, MTE will miss the memory corruption. So the ability of MTE to detect such accesses is probabilistic.

There is another type of out-of-bounds access that MTE misses: accesses to the last valid granule that go past the requested object size. In our example, this will happen if the kernel accesses offsets from 35 to 47. Let's take 40 as an example. This offset ends up in the last granule, which is marked with the same tag as the pointer. Therefore, this access will not produce an MTE exception. Such accesses are bugs, but MTE cannot detect them.
Now, let's take a look at what happens with use-after-frees. Let's assume that after allocating our object, the kernel freed it via kfree and then accessed it. With MTE enabled, on kfree the allocator retags the object using the invalid tag; the whole kmalloc object will now be marked as inaccessible. Now, if the kernel accesses it, MTE will detect a tag mismatch.

However, there is another scenario for use-after-free bugs. Let's say the kernel reallocated the freed object in another context. In this case, the object will get marked with another random tag, and this new tag could happen to match the tag from the first allocation. So there is a chance that in-kernel MTE will fail to detect use-after-frees on reallocated objects. The chance of this is about 6% as well.

So this is how MTE-based KASAN prevents out-of-bounds and use-after-free accesses. As you can see, in some cases MTE has a chance of missing the bug. However, it catches both of the main types of memory corruptions with a high probability.

Alright, let's take a closer look at a few implementation details. In the examples, I mentioned that KASAN has a reserved invalid tag for marking inaccessible memory. This tag has the value 0xE, and you may be wondering where this value comes from. Well, besides 0xE, there is another tag value with a special meaning, and that is 0xF. 0xF is used as the match-all pointer tag. This means that accesses through pointers with this tag are not checked by the CPU. MTE provides a way of assigning such match-all tags. Note that this is a pointer tag; there are no match-all memory tags. The reason in-kernel MTE uses a match-all tag is that it makes it easier to handle objects in the allocator, and it also simplifies some other things. The value 0xF is chosen to match the value of the top byte of native kernel pointers, which is 0xFF. And 0xE is the value that precedes it, which is why it was chosen as the invalid tag. Both 0xE and 0xF are excluded from being generated randomly.
Having two reserved tags means that there are only 14 random tags left, and this slightly affects the chance of in-kernel MTE missing a bug, bringing it up to about 7%.

Currently, MTE-based KASAN only tags memory returned from the slab and page allocators. There is no tagging in vmalloc, and there are also no tag checks for stack and global variables. However, all of these might be implemented in the future.

There are a lot more details to how MTE-based KASAN works. There are command-line parameters to disable KASAN or to choose between the sync and async modes. There are additional tag checks to detect double-frees and some other types of memory corruptions. There are tests to make sure that in-kernel MTE actually detects bugs. I don't have enough time to go through all of them right now, but I have put descriptions and links in the slides, so check them out if you're interested.

One thing that I do want to mention is the memory usage and the performance impact of in-kernel MTE. If MTE-based KASAN is to be used in production, it must not waste a lot of resources. For RAM, the only source of overhead is the region reserved for memory tags. This overhead is about 3%, as I mentioned. Aligning the sizes of all slab objects up to 16 bytes seems to make no noticeable difference in terms of RAM usage. Now, what about performance? How bad is it? Well, the thing is, no one knows. Right now, there are no CPUs that support MTE, so measuring the real performance impact is impossible. The expected overhead is less than 10% for the sync mode, and the async mode is supposed to be a few times faster. But even 10% sounds great, if you ask me.

To summarize, there is a new KASAN mode based on MTE. It detects the main types of memory corruptions. The detection is probabilistic, but the chance of missing a bug is low, about 7%. MTE-based KASAN is intended to be used as a security mitigation in production; however, it can also be used as a debugging tool.
The new KASAN mode has been available in the mainline kernel since 5.11, and you can try it yourself. For this, you will need a fresh QEMU and either Clang or GCC with MTE support. Then just build the kernel with the KASAN_HW_TAGS config enabled and boot it.

Okay, let's change locations one last time and then talk about the weaknesses of in-kernel MTE. I think I know a perfect place for that.

Okay, we have come to the final part of my talk. Let's assume we do have a system with in-kernel MTE. How secure is it? What can an attacker do to bypass the protections? To swim in forbidden waters, so to say. Let's start with the obvious part. I have already mentioned that right now there is no support for vmalloc, stack, and globals. If the exploit only corrupts these, MTE will not prevent it. Considering the number of vmalloc-based overflows we have seen recently in Android drivers, this is a realistic exploitation scenario. Adding support for these memory types is a question of engineering effort, so sooner or later they will be protected.

Now, what else can the attacker do? I also mentioned that the ability of MTE to detect memory bugs is probabilistic. If the tags happen to match during a wild memory access, MTE will not catch it. There are some improvements we can make for certain bug types: for linear buffer overflows, for example, we can make sure that the generated tags for neighboring objects are always different. Generally, if the attacker can retry the exploit for a probabilistically detected bug multiple times, at some point the tags will match. The way to deal with this is to panic the kernel on the first tag mismatch. This gives the attacker only a single attempt to run the exploit, and the probability of the tags matching on the first try is very low, especially if the exploit requires more than one pair of tags to match. As MTE has no false positives, panicking the kernel will not lead to any stability issues, unless there are bugs in the implementation, of course.
MTE-based KASAN has a command-line argument to enable panics on tag mismatches.

Panicking the kernel might not be enough, though. As I mentioned, MTE has two different operation modes, sync and async. Based on their expected performance, async might sound like the better one to enable. However, in the async mode, a tag mismatch is not detected immediately. Therefore, the attacker has a small window between the memory corruption itself and the moment this corruption is detected. Is that window large enough for the attacker to execute a malicious payload? I don't know; this is something that we need to investigate. Personally, I would use the sync mode to not have to worry about this at all.

Another in-kernel MTE weakness comes from the presence of the match-all pointer tag. If an attacker can craft pointers with 0xFF as the top byte, any memory can be accessed. Granted, crafting these pointers has to be possible without a memory corruption, so, for example, overwriting pointers via a use-after-free is out. It might be possible to get rid of the match-all tag, but I'm not sure this will be easy to implement. At the very least, this requires carefully annotating allocator code to make sure it uses proper tags when managing its memory. And there are other issues to be resolved there, so this needs to be investigated as well.

Another type of bug that MTE does not stop is intra-object overflows. This is when a buffer overflow happens within a single allocated object. For example, a missing size check on an array field within a structure might allow overwriting the fields that follow it. MTE will not detect this, as the whole structure is marked with the same tag. However, for certain types of intra-object overflows, there is something we can do. There are drivers and subsystems that allocate a chunk of memory, divide it into blocks, and use those blocks for various purposes, for example, to implement their own allocator.
An overflow from one of these blocks into another is a bug, but MTE-based KASAN cannot catch it, because it assigns the same tag to the whole allocated chunk. What we can do here is add additional annotations to such drivers. These annotations would either assign a random tag to each block or, if that is not possible, at least add inaccessible redzones between the blocks.

One more bug type that MTE cannot detect is type confusion: when the kernel casts a pointer to an incorrect structure type and then accesses its fields. This will not directly lead to a tag mismatch, unless the access goes out of bounds, of course. Finally, MTE cannot detect logical bugs, so missing privilege checks and stuff like that will remain exploitable. But the bug must be purely logical, without any MTE-detectable memory corruption being involved. So MTE does have its weaknesses, and I'm pretty sure that if it ends up being used as a mitigation, we will have a long road of bypasses and fixes for them. But the key part is that MTE does prevent the most widespread bugs: out-of-bounds and use-after-frees.

Right, we have come to the end of my presentation. Let's do a quick recap of what we talked about today. We started by looking at what memory tagging is and how MTE implements it. Then we discussed how MTE-based KASAN works and how it prevents memory corruptions. And finally, we tried to figure out what its weaknesses are and how we can deal with them. If you want to learn more details, check out the slides; there are a few that didn't make it into the video.

As a result of this work, MTE is now integrated into the kernel. Thanks to everyone who contributed, especially to the folks from ARM and to my ex-colleagues at Google. Right now, it is unknown whether in-kernel MTE will be used to protect real devices, but things are looking promising. First, in May of this year, ARM announced a few Armv9 CPU microarchitectures that integrate MTE.
This is in addition to releasing the MTE specification that I mentioned before. Then, MTE-based KASAN has been enabled by default in the Android 12 Generic Kernel Image. Whether it will be used on actual devices is now up to device manufacturers. So, will memory tagging mitigate all kernel security bugs? No, it will not; we are stuck with our systems being vulnerable forever. But when it comes to memory corruptions, memory tagging will make a difference.

All right, with that, I'm done. Thank you for listening. It was a huge effort for me to record this talk, and I hope you enjoyed it and there was not too much noise from the environment. This is probably going to be on YouTube, so like and subscribe, and also don't forget to follow me on Twitter. Thanks once again, and take care.