Okay. Good morning everyone. Thanks for coming. My name is Konstantin Serebryany. This is my co-speaker Dmitry Vyukov. We're both from Google, from Moscow. We'll be talking today about two bug detection tools, and a little bit about how to apply them to the kernel. The first bug detection tool we want to tell you about is called AddressSanitizer. It finds memory errors and currently works for user-space programs. The second tool is called ThreadSanitizer. It finds data races, also in user-space programs. After we tell you about these two tools, we'll share some of our thoughts about applying them to the kernel. And at the end, we'll share some of our requests to the kernel and to the Linux distributions that would make these tools better.

So, AddressSanitizer. The beauty of C++ comes with a lot of different kinds of bugs. Probably the nastiest are buffer overflows and use-after-free, but there are quite a few other kinds of memory bugs that you don't want to have in your program. Let me give a brief overview of the tool. It uses compile-time instrumentation, which is more or less platform independent, and it has a runtime library which currently supports Linux, OS X, Android (which is a flavor of Linux), and Windows. The Linux support is the best of all the platforms. The tool was released in May 2011. In November 2011 it became part of the LLVM distribution, and just recently, around a month ago, it became part of the GCC distribution, starting from GCC 4.8.

Let's start with some examples. How do you use this tool? Suppose you have a program; here is a four-line C program that has a global array, and the program accesses this array out of bounds. In red, you see the buggy part of the code. You compile this program as you usually do, but you add one extra command-line option, and you simply run the program as you usually do.
If the bug is found, a warning message is printed to stderr (or to some other file if you change the options). It shows the type of the bug; here it says global-buffer-overflow. It shows a stack trace of where the bad access happens, and it gives some more information: in this case it says that the memory access is four bytes to the right of the global, and it gives the name of the global and some more details.

Some more examples. The tool is also capable of finding stack buffer overflows. This is a similar case: we have a stack buffer and we access it out of bounds. Again, the tool shows the stack trace, and now it shows the complete stack frame of the function and where in it the memory access happens. In this case, we see that the array ends at offset 432 and the access was at offset 436.

The next example is a heap buffer overflow. You have a buffer allocated by new or by malloc or whatever else, and you again access it out of bounds. Very similar situation, but now the tool also shows you where the allocation happened. And this is probably one of the most frequent bug types we find: use-after-free. You allocate heap memory, you deallocate it, and then you access it. In this case, the tool shows you where the bad access happens, where the memory was deallocated, and where it was allocated before that.

So how does this work? First, let us note that if we take any aligned 8 bytes of the application, they have only 9 states with regard to addressability: the first n bytes are good, and the rest, 8 minus n bytes, are bad. This gives us just 9 different states. There are no other states, because malloc always returns 8-byte-aligned memory, and you can align your stack and global memory by 8. So, just 9 states, and since 9 is less than 256 you can encode the state in just one byte. We call this metadata byte the shadow byte, and this picture shows how we encode the shadow.

If all bytes of the 8-byte word are good, the shadow byte is 0, and there are other values that show that there are bad bytes in the 8-byte word. We store the shadow in a separate portion of the address space. This example shows a typical 32-bit address space on Linux, where the application uses a little bit at the bottom and a lot of the address space at the top. We take one eighth of the address space in the middle and say that two parts of it belong to the shadow memory. The mapping is very simple: you shift the application address right by 3 and then add a constant offset. This way you get a mapping between every 8 bytes of the application and 1 byte of shadow. This mapping leaves one portion of the address space completely unused, so we mprotect it just in case. Pretty simple.

It actually can get a little bit simpler. If you build with PIE on Linux, all of the application addresses are in the higher part of the address space, so you don't need the bottom. This gives an even simpler mapping, which is just "shift the address right by 3": one instruction.

So what do we do in the compiler? When the compiler sees a memory access (whether it's a read or a write doesn't matter), it computes the address of the shadow by shifting and maybe adding an offset. Then it reads this shadow, just one byte, and if this byte is non-zero, it means there is a bug and we report it. Then comes the original memory access code. This example is for an 8-byte access; for smaller accesses, the instrumentation is a few instructions more, but these instructions are very cheap.

Let me give you an example of the assembly generated by the compiler. This time it's x86_64, but the tool works on x86 as well. The original address is in RDI. We move it to another register, shift that register right by 3, and then we simultaneously add an offset, load the byte, and compare it with zero. This is the magic of x86: we can do it in one instruction.

If the result was not zero, we jump to the place where we call the error handler, and then we run the original instruction. Very simple.

In order to find bugs like buffer overflows, we need to create so-called red zones around the buffer, so that when you hit the red zone it has a non-zero shadow value and the report happens. So what do we do with the stack? Suppose we have a stack variable in a function. The compiler inserts red zones around this variable: a 32-byte red zone on the left, and a red zone on the right that is 32 bytes plus alignment up to the next 32-byte boundary. Then, at the very beginning of the function, it writes poison values into the shadow: one shadow value marks a whole 32-byte region as poisoned, and another marks 24 bytes as poisoned with 8 bytes left unpoisoned. Then the original code runs, and we unpoison the shadow before the function exits. As you can see, we spend just about six store instructions per function for the poisoning; of course, if there are more stack variables, it's more.

We do the same for globals, at program start-up, and the runtime library does this for the heap: when a user asks for n bytes, it actually allocates n bytes plus something and poisons the red zones. The runtime library also collects stack traces, delays the reuse of freed memory, and does some other bookkeeping.

[Audience question] Yes, the question was: why do we need to intercept the memset function? If memset touches bad memory, which it shouldn't touch, we will find it. If we don't intercept memset, we will not find it, because we use compile-time instrumentation and memset sits in libc. We actually have quite a few interceptors, maybe a hundred.

The good thing about this tool is performance. It is insanely fast.
On SPEC CPU 2006, it shows on average less than a two-times slowdown, and this is a fair comparison: -O2 versus -O2 plus AddressSanitizer. On most GUI applications, like Chrome (and I'm presenting from Chrome now), you will typically not see any slowdown, because GUI applications don't consume all of the CPU. On server-side applications, we typically see from 50% to 3x slowdown, and again this is -O2 versus -O2. The memory overhead of the tool comes from many different sources, but typically the overall overhead is between two and three times compared to a regular run.

This is one of my favorite slides: these are trophies. In the first 10 months of testing Chrome with this tool, we found about 300 bugs, 200 use-after-frees and 100 different buffer overflows. Now it's almost two years that we've been testing Chrome with AddressSanitizer, and I think we've already found more than 1,000 bugs in Chrome and in other libraries. Besides Chrome, our users found bugs in Mozilla, I guess more than 100 already, and basically everywhere else where we or someone else applied this tool. Of course, we are also testing Google server-side applications with it, and perhaps a couple of thousand bugs have been found already.

There is some future work we want to do. We want to do some static analysis to avoid instrumenting provably safe memory accesses, to make the tool faster. We want to instrument or recompile the libraries, so that we find bugs in any library code, not just memset. We want to learn how to instrument inline assembly, if your program has any; currently we don't handle inline assembly. The program will work, but the tool will not find bugs if they appear in assembly code. And finally, we want to adapt this tool to find bugs in the kernel, and this part we will discuss later in this talk.

As a summary of this tool: C++ has suddenly become much better, much safer, and you are welcome to use this tool.
One tool we will not be talking about today is MemorySanitizer, which finds uses of uninitialized memory. Roughly, if you combine AddressSanitizer and MemorySanitizer, you get the functionality of Valgrind, but 20 times faster. The tool we will be talking about now is ThreadSanitizer, and I am giving the word to my colleague.

Hi. So, ThreadSanitizer is a data race detector for C++ and for the Go language. What is a data race? A data race occurs when two threads access the same variable concurrently and at least one of the accesses is a write. Here you see a simple program with a data race. Again, you just need to add the -fsanitize=thread flag to the compiler. Below you see an example of the report: we describe the two racing memory accesses, print the stack traces for them, describe the involved threads and print their creation stack traces, and also describe the mutexes that were held during each memory access.

We started working on ThreadSanitizer back in 2009; that is what we now call ThreadSanitizer version 1. It was based on Valgrind and it was very slow: 20x was a typical slowdown, and sometimes it was more than 400x. But we still found thousands of races with it, and at that time it was faster than the alternatives. About a year ago we started working on ThreadSanitizer version 2, which is instead based on compiler instrumentation, and we completely redesigned the runtime library. Now it's really parallel: there are no mutexes or expensive atomic operations on the fast path, and it scales to really huge applications. It has a better and more predictable memory footprint, and it also prints very informative reports.

Here are some performance numbers from our server-side applications. The RPC benchmark is a highly parallel throughput benchmark, there is a typical server application test, and the last one is a very simple single-threaded test.

There you see that the slowdown of ThreadSanitizer 1 is around 25x to 40x, and on the RPC benchmark it's more than 400x. For ThreadSanitizer 2 the slowdown is about 2x to 4x, which is much more acceptable, and much faster.

The compiler instrumentation is very simple: we just insert function calls into the runtime library to intercept the interesting events. We insert calls at function entry and before function exit, and before each memory access we insert a callback which says whether it's a read or a write, the size of the memory access, and passes the address. We also intercept atomic operations in the compiler module.

ThreadSanitizer also uses shadow memory, with a simple direct mapping very similar to AddressSanitizer's. Currently it requires the -pie flag, so that all application memory is at the top of the address space, and the shadow mapping is just two arithmetic instructions. The shadow itself is more complex. It consists of so-called shadow cells. Each cell is 8 bytes and describes one previous memory access to the application memory. There are 16 bits for the thread ID, 42 bits for the clock (a scalar clock taken when the memory access was done), 5 bits that describe the size and position within the 8-byte application word, and one bit that says whether it's a read or a write. If all this information is zero, the shadow cell is empty. For each 8 bytes of application memory, which you see on the left, we have 4 such shadow cells, describing up to 4 previous memory accesses to that 8-byte word.

Let's consider a simple example. Initially all shadow cells are empty. Then thread T1 writes to the first 2 bytes. Since the shadow is empty, we just record in the first slot the thread ID, its clock, the size and position, and that it is a write. Then thread T2 reads the last 4 bytes, so we record that information and check the other cells for potential races.

In this case they do not race, because they touch different bytes. Then thread T3 reads the first 4 bytes, so we store that information and check it against the previous accesses. It does not race with the second access, because both are reads. But it can potentially race with the first access: they are from different threads, at least one of them is a write, and they access the same bytes. Now we need to answer the question of whether there is a data race or not, that is, whether the two accesses are synchronized or whether they run concurrently. For this we use the so-called happens-before relation. I won't describe in detail how exactly we compute it, but it is a constant-time operation: we just extract the thread ID and the clock from the shadow cell, make one load from thread-local storage, and do one compare.

When all shadow cells are full and we need to store a new memory access, sometimes we can replace one of the previous accesses without losing any useful information; if that's not possible, we just replace a random cell with the new access.

When we find a data race, we print a report, and it contains two stack traces: one for the current memory access and one for the previous memory access.
The current stack trace is easy: we just need to unwind the stack. But the stack trace of the previous access is problematic, because at the time of a memory access we don't know whether it will race with some future access or not, which means we would need to remember full stack traces for all memory accesses. To remember the stack traces, we use a per-thread cyclic buffer of events. The events are memory accesses, function entry and exit, and mutex lock and unlock. Each event is 64 bits: a few bits for the type and the rest for the program counter or the associated address. When we need to restore a stack trace, we replay this buffer from the beginning up to the memory access in question; while we replay, we model the state of the call stack, and when we reach the interesting memory access, we have its stack trace. Similarly, we get the set of mutexes that were held during the access. This replay is slow, but it's only done when we report a race, so its cost is not relevant. Adding an event to the trace is very fast: we just increment the position and store one word. It also has a predictable memory footprint, but since the buffer is cyclic, we lose information after some time; by default we hold 128K events per thread, which is roughly 128K previous memory accesses per thread.

We also have function interceptors for more than 100 functions: malloc, free, the pthread_mutex functions, pthread_create and destroy and so on, memcpy, and also read, write, open. With these, for example, we can find races on file descriptors: say, one thread writes to a file while another concurrently closes the file descriptor.

What are our headaches? One is timeouts: the programs are still slower, so some server applications sometimes trigger timeouts; usually you just increase the timeouts. Another is memory consumption: sometimes programs get killed due to out-of-memory. We also have problems with non-instrumented libraries, especially if we miss some synchronization on atomic variables in them. And the last thing is so-called benign data races: when you increment a counter without any synchronization and you think it's okay. ThreadSanitizer complains about such things as well.

Now I move on to AddressSanitizer for the Linux kernel. We are not kernel hackers at all, and currently we only have a very early proof of concept of the tool, so this is more about what we want to do. What is currently there? There is CONFIG_DEBUG_SLAB, which adds red zones and poisoning to kernel memory blocks. It can detect some out-of-bounds writes and use-after-free, but it doesn't detect out-of-bounds reads, and its use-after-free detection is best-effort. There is kmemcheck, which triggers a page fault on every memory access and simulates the access; it finds, I think, most of these bugs, but it is very slow due to the page faults. There is also CONFIG_DEBUG_PAGEALLOC, which unmaps freed pages from the address space, so it can find use-after-free, but only if a whole page is freed. There are also some static analysis tools, but they are complementary to the dynamic tools, and there are some experimental and academic tools, but as far as I understand, they are not in widespread use.

What we want to do is a CONFIG_ASAN, intended to be a fast and comprehensive solution for use-after-free and out-of-bounds bugs. It is based on compiler instrumentation, so it will be fast. It can find out-of-bounds accesses for both reads and writes. It provides strong use-after-free detection due to delayed reuse of freed memory, it detects bad memory accesses promptly, right when they happen, and it provides informative reports with stack traces showing where you allocated and freed the memory block and so on.

Here are our ideas on how to implement this. On top you see the virtual address space of the kernel: there is the user part and the kernel part, and in the kernel part there is the physical memory mapping region. We want to place the shadow memory inside that region at a constant offset, and the size of the shadow will be one eighth of physical memory. Then the shadow mapping calculation is very similar to user space: we need to check that the address belongs to the physical memory range, and then just divide by 8 and add an offset. There is also virtual (vmalloc'ed) memory, and it's unclear yet what is the best way to handle it, because ideally we want the shadow mapping to be just "divide by 8 and add an offset", without any additional branching.

We want to start with the slab allocator: add red zones, poison them, and add delayed reuse of memory blocks to find use-after-free bugs. The API that ASAN will provide looks like this: there is a function to poison a memory region, to unpoison a memory region, and to check the addressability of a memory region. With this API we want to instrument the memset and memcmp functions and so on, and probably some of the other memory allocators in the kernel.

Here are the problems that we know about. We need to find a way to do the fast shadow memory mapping. We need to figure out what to do with the bootstrap process. We may need to turn off the instrumentation in places due to performance considerations. There will be some text size increase. There will most likely be some problems with interrupts, especially if we want to print a report from an interrupt, and there may be some problems with modules, for example if a module is not instrumented, or is instrumented with the wrong version of AddressSanitizer. And most likely there are a lot more problems.

Potentially, we can also implement ThreadSanitizer for the kernel, but it's much more challenging. AddressSanitizer only needs to intercept the memory management functions and memory accesses: if it does not intercept some memory accesses, it just won't report bugs there. But ThreadSanitizer needs to intercept absolutely all synchronization, otherwise it will report false positives. The main issue we see there is with atomic memory operations in the kernel. In user space we rely on the C11-style atomic operations, where you have the address of the atomic access, you can see that it is an atomic access, and you also have the associated memory ordering, like acquire or release. In the kernel it's usually expressed as a plain memory access plus a write or read memory barrier, and there is no way to figure out what exactly that memory barrier is intended to synchronize.

And now I'm giving the word back to Kostya. As we told you, the tools work very well, at least for user space, but they could work even better with some help from the Linux kernel and the distributions, mostly from the kernel. The ideal address space layout for these tools is when all of the application memory is in one place, either at the very bottom or at the very top; we don't care which. The simplest case is when everything is in the highest one eighth of the address space (if this is 64-bit, of course), that is, from 0x7000 0000 0000 to 0x7fff ffff ffff. Today we actually achieve this ideal layout on Linux if it is x86_64, if the binary is PIE, if address space layout randomization is enabled, and if the stack size is limited. This is what we use for TSan and what we prefer for ASan. Ideally, we would have this layout always; as Bill Gates said in 1981, 16 terabytes of address space should be enough for anybody.

[Audience question] Say it again? Sorry. No, we didn't see any slowdown from PIE; it's within the noise, maybe 1%. This is not PIC, this is PIE, and PIE is less expensive than PIC. On the programs where we tried to measure it, we didn't see any slowdown, and some programs are already built with PIE by default; Chrome, for example, is built with PIE by default, so there is no difference at all.
Here is one more complaint about the address space, concerning address space layout randomization being off. If you take a C program that prints the address of main and you build it with PIE on a recent Ubuntu, with ASLR off it will print something like 0x5555..., while on an older Ubuntu you would see something like 0x7fff..., and we like that 0x7fff... much, much more. This actually means that ThreadSanitizer cannot work with ASLR off; for example, if you run it under GDB, you have to explicitly re-enable randomization so that TSan can run. For us this is a serious regression between Ubuntu 10 and Ubuntu 12. We actually know the guilty commit, and we would like to find someone to explain it to us: do you really need this, folks?

Another problem is unlimited stack, which is really too greedy today. If you run a program with "ulimit -s unlimited", the kernel will actually reserve 84 terabytes of address space for the stack. Do you need that much? I don't think so. This also causes us trouble with TSan, because we cannot run TSan in a setting where the stack is unlimited, and for some unknown reason GNU make sets the stack to unlimited, so if you run TSan from make it will not work unless we do something tricky.

This next one is not a complaint, it is a request. We run these tools in memory-limited settings: we want to run many, many tests on a single virtual machine, where they compete for memory, and since the tools take more memory than a regular run, they sometimes require more memory than is available, and then the process just dies due to out-of-memory. The good thing about the tools is that all three are designed so that a shadow value of zero means "okay, no bug here, this is good memory". So if we could give these shadow pages back to the kernel when it is close to out-of-memory, such that when we need a page back it comes either with its old contents or all zeros, that would be perfect. This is a little bit similar to madvise(MADV_DONTNEED), but not exactly equivalent. We've seen something very similar, fadvise volatile, somewhere in the patches, but this is not in the mainline kernel. This is something we want very much, to make the tools more stable in the presence of a strict memory limit. We actually don't want the VM to discard these pages first; we want the VM to discard these pages last.

[Audience comment] We talked about this yesterday at lunchtime. We really want a "don't discard if you don't have to, but if you must, take them away" madvise, because a zero page is fine for us. It would help malloc allocators as well: if I want these pages, I want them back, because I've got my structures set up for them and I can use them; but that doesn't exist today.

This is required for the tools to run in memory-limited environments; otherwise, the tools work well today. Related to that, another request is about how to limit the memory of an application running under these tools. ASan, AddressSanitizer, maps 20 terabytes of memory; TSan maps 97 terabytes; and MemorySanitizer maps 72 terabytes, virtual memory of course. In this setting "ulimit -v" is useless: you cannot say "I want to use 73 terabytes of memory", it will not work. So we need some way to limit the real memory of a process. Maybe containers will help us, I'm not sure, but I guess we're not the only ones who want this functionality in the kernel.

Another problem is general robustness in the presence of large mmapped regions. All three tools map the shadow memory with mmap and the MAP_NORESERVE flag, which means we allocate these many terabytes but don't actually use all of them. Unfortunately, in its current state Linux is very unstable if you actually start to use all of those 20 or 40 terabytes. There was a bug with mlockall which caused the machine to hang: if you first
allocate 20 terabytes and then you call mlockall, the machine dies. This bug was fixed recently, about a couple of months ago, in the trunk. We still see problems if you just start using that memory too much; usually it is either a bug in our tool or some very serious bug in the user code that triggers these hangs, but still, I think the kernel should not be dying under such conditions.

One of our requests to the Linux distributions: all three of these tools are very good at finding bugs in user code, but sometimes user code calls library code, maybe libc, or things like GTK or Qt or whatever, and that library accesses some freed memory, for example. This is not a bug in the library; it's a bug in the user code that calls the library with an incorrect pointer, but the tools will not find it, because the library code is not instrumented. It is worse for TSan: if the library uses atomic synchronization and we don't see it, we will get false positive reports. The best solution we can think of is to ship instrumented libraries, or maybe to have some automated way to build them from scratch when needed, and I'm afraid this has to be done on a per-distro basis, or at least one way for RPM, one way for Debian packages, one way for anything else.

Let me summarize our talk. AddressSanitizer finds buffer overflows and use-after-free bugs in C++ code. It is very fast and very robust, and we believe it is a must-have for all C++ developers; this is actually already the case in Google, where more or less all of the C++ code is tested with AddressSanitizer. ThreadSanitizer finds data races, not just in C++ but also in the Go language, so use it if you are working in one of those languages and you have threads. Both tools are possible for the kernel; TSan is a little bit more tricky because of the atomic synchronization issues. We are currently in the investigation stage, and help is more than welcome. If we get some support from the kernel and the Linux distributions, the tools will become even more awesome.

Both tools, well, actually all three tools, are open source. ASan and TSan are part of both the LLVM and GCC distributions, and MSan is a new tool which is currently only in LLVM. Thanks a lot, and we are ready to take questions.

The question was about the ARM architecture. Yes, these tools work on such devices. We have a regular process of running Chrome tests for Android on an actual Android device: no virtual machine, just a plain device. We know that someone has tried ASan on Ubuntu for ARM; we ourselves didn't do it, but it is known to work on Ubuntu ARM.

Yes, when we free memory in ASan, we need to mark the shadow of the freed memory as poisoned. We don't need to touch the actual memory, and since the shadow is 8-to-1, it is compact: we actually memset much less memory than the allocated chunk. It is a fast operation.

Yes, there are red zones: if you have a region, either a global, stack, or heap buffer, we have poisoned areas around it. They are typically something like 32 bytes on each side; for large heap allocations they are larger. So if you access 1 byte out of bounds, or 10 bytes out of bounds, you hit the red zone and it is poisoned. If you access, for example, 1 kilobyte out of bounds, you may be unlucky and hit another allocation, so the bug will not be detected. But the red zones are adaptive: if you have a large heap block, the red zone will also be large.

Yes, before we actually read a memory location, we also read the shadow of that memory location. We cannot detect all wild pointer dereferences, but the way our allocator is made, we actually do find a lot of wild references; we sort of increase the probability of finding them.

Yes, the question is about the Go ThreadSanitizer. The algorithm is essentially the same.
We have maybe a few defines in the runtime library for Go, because, for example, Go is a type-safe language, so we don't need to store the size of the memory access: the first byte identifies the variable. We also have some different constants, for example for the maximum count of threads, because goroutines are much more lightweight, so there can be many more of them. The compiler instrumentation is done in the Go compiler, but otherwise it's the same.

Thank you for the questions. If you have more, we are here for the rest of the day and we will be happy to answer them. Thank you.