Let me welcome Ted Unangst, who will talk about developing software in a hostile environment. He has made a career out of removing software, so much so that his account name has become a verb in the OpenBSD community, so I'm sure this involves lots of removals and surviving. All right, thank you, hello. So, talking about developing software in a hostile environment. Let's talk about what that means and who it's for. I'm going to begin by referencing some talks that Theo has given over the years on exploit mitigation: what it is, how it works, and why you want it. Usually his focus is on stopping the bad guys. And I noticed something a bit funny about how the material for these talks is structured. It's usually, on the one hand, very technical: how exploits work and how you stop them. But on the other hand, the people who care most about this are probably end users, who don't really have a choice in what software they run. They're given some crap and they need to use it, but they want to run it in a secure environment. So there's kind of a gap between the people developing the exploit mitigation technology and the people benefiting the most from it. I wanted to take a look from a different viewpoint, which is: how can we use this to help the good guys, us? So this is a talk about developing software. My examples are pretty much all going to come from OpenBSD, but that doesn't mean I'm only talking to OpenBSD developers. Mostly I'm talking to people who are not OpenBSD developers, but who develop software which will run on OpenBSD, or which somebody else might want to run on OpenBSD. And in this case, OpenBSD is the hostile environment. What I mean by that is, you know, if you're only just now discovering that the internet is a hostile environment, you're about 20 years too late to the party. So what makes OpenBSD a hostile environment?
It doesn't always conform to expectations, and it certainly doesn't condone many mistakes. Developers talk a lot about standards: the C standard, the POSIX standard. But then there are also the real-world, de facto standards: what you can get away with and what seems to work. So let's challenge some of the assumptions that people make. Let's go back to the standard, see what it says we can do and what it says we must do, and then diverge from what most other systems actually do; shake the tree, see what falls out. Basically, a strictly conforming program will run just fine. It conforms to the standard; OpenBSD conforms to the standard, maybe. And so a well-written program will work, but a poorly written program which makes assumptions and cuts corners is going to have a lot of trouble. And you don't want to be running those kinds of programs, because the assumptions they depend on are sometimes violated, or can be violated by an attacker. Everybody loves secure software, but as the OpenBSD project has maintained for some time, secure software is simply a subset of correct software. So my focus here today is not on how exploit mitigation makes your software more secure; it's how you can use it to develop software which is correct. So, the outline: I'm going to start off by discussing a few features that someone developing software on OpenBSD might want to use. Then I'm going to discuss the theory behind these features: why we change things the way we do, and what kinds of bugs this shakes out of a program. And along the way, I'll mix in a few examples from OpenBSD.
There are lots of bugs we've discovered over the years which lay dormant for 10, 20 years, and then you make some change to malloc, you make some change to the kernel, you randomize something, and all of a sudden programs start crashing, they start misbehaving; and these bugs had always been there. We just didn't know about them because they were kind of skating by. So my philosophy is: software is bugs. Everybody knows this. And lots of bugs go unnoticed. But that doesn't mean they're harmless. It just means you haven't noticed them yet. Sooner or later, somebody is going to notice them. So what we want to do is provide an environment where bugs are noticed as quickly as possible. Whenever we've added new exploit mitigation techniques to OpenBSD, something almost always breaks. Just always. This makes the development of such features pretty exciting, because you make a change and, not to name names, certain programs were pretty much guaranteed not to start after certain changes. And then you wonder: hey, was the change I made wrong, or has that program always been broken, and am I just exercising some latent mistake? So from a high level, my philosophy is: instability today leads to stability tomorrow. The sooner we can break it, the better. Everybody will tell you the same thing: you want to fix bugs in development; you don't want to wait until you push things out to production to fix them. But there's this mindset where people, once they deploy, want to play it safe. They get super conservative, and they think they can postpone or eliminate bugs in production by keeping things real steady. I think that's a mistake, because two years down the line, when you get a bug in a production system, that code is two years old. You don't know what it's doing. And you're so far past it on your current branches that it becomes very difficult to debug.
And then you're also in a bit of a panic. I think it's much better to have a bug in production one month after you deploy. You can't avoid the bug; it's going to happen. So it's always better to pull it forward in time. The big thing here is malloc.conf. This is a symlink that you can use to control the behavior of malloc. It's on every BSD platform; I'm going to talk specifically about OpenBSD, although a good chunk of this applies to other platforms as well. Probably most people are familiar with this; I think it's discussed fairly frequently in the BSD community. So it's a great place to start. After I discuss some of the malloc features, we'll move on to some of the theory behind this and dive a little deeper into how allocators work, how this affects program behavior, and how your program can have hidden dependencies on specific allocator behaviors. So the BSD mallocs all support malloc.conf. I think it made its appearance in phkmalloc way, way back when, and it was subsequently retained in both jemalloc and ottomalloc. There's been some divergence in the available options since then. But basically, you create a symlink in /etc and point it at some string of letters, and every letter turns a particular option on or off. Some of them are only useful during development; some of them are pretty useful all the time. My favorite option is J, for junk. What this does is pre-fill memory. When you allocate memory, it overwrites it, so it's non-zero when it's returned from malloc. And when you free memory, it also immediately overwrites it, so whatever the previous contents of the memory were, they're gone. This catches two different classes of bugs. First, many programs fail to completely initialize heap objects. This is just like the classic uninitialized stack variable, but it's a lot harder to detect. Your compiler gives you very little support for detecting these bugs.
And so they also go unnoticed, because initially, when a program starts, malloc is returning new memory. It gets this memory from the kernel, and the kernel is almost certainly zero-filling it for you. So all the memory coming back from malloc when you run your test cases is zero-filled, and it looks like: hey, great, I don't have to initialize this. But as a program runs over time, it starts freeing memory, and then malloc starts reusing that memory, and it has the contents of whatever the previous usage was. You end up not getting so lucky, and the bug manifests. Much better, in my mind, to catch it on the first allocation. The second thing is, many use-after-free bugs rely on the memory remaining untouched after you free it. You free an object, and until the next malloc call, you can almost certainly continue reading that object and writing to it without ill effect, because the memory is still there and the contents are still there. So if you free something and then dereference a pointer in a struct that you just freed: hey, no harm, no foul. But we don't want things to work like that. If we overwrite it with a junk pattern, then that pointer, all the pointers in the struct, become invalid, and you will crash. Great. Now, there's no guarantee that junking memory will flush out a bug. Depending on how you initialize or don't initialize the memory, or what kind of use pattern you have, you're not going to trigger every bug. But the bugs that it triggers are a good subset of real-world bugs. We've seen a lot of programs where you crash, and then you look back at the instruction pointer, and when you see something like 0xdfdfdfdf as the pointer you tried to jump to, it's pretty obvious what happened. Now, the man page for malloc, unfortunately, in my mind, kind of downplays the J option. It recommends it for debugging and testing. I think it's a great option to use all the time.
And it's not just for debugging or testing. Despite whatever warnings are in the man page, and I think the FreeBSD man page actually goes over the top and specifically advises against using J, I think it's fine to run J in production. I mean, you have to measure it; there is a performance cost, obviously, because you're overwriting memory. But the amount of time you spend overwriting memory compared to running the rest of your program, hopefully the ratio is not going to be too bad. So basically, there are two scenarios. You run with J and your program doesn't work; well, that's not good. Or you run with J and the program works fine. Now, the problem is when you say: oh, the program doesn't work with J, I'm going to turn J off. That program is still buggy. Disabling the malloc option didn't make the bug go away. It just means you're going to hit the bug later. So you should think carefully about whether you actually want to be running a program which can't be run with these kinds of options turned on. Lots of OpenBSD users, I think, already run with malloc J, and so a lot of the bugs it catches have already been caught. But there are always some that slip through, because nobody's testing all the time. One of the changes I recently made is to start junking small chunks by default. This immediately triggered a bug in the Postgres Ruby gem, where it was using memory after it had freed it, and nobody had happened to run that particular code with malloc.conf J turned on before. As soon as we switched to junk by default, bam, the bug came out. Now, the change I made to the default, I'll describe it in a little more depth, is not the full J, and you can disable it by setting lowercase j. So it's kind of a midway point, half of uppercase J. The reason we did it that way is we had a couple of choices. The full J, as I mentioned, can be expensive.
We don't necessarily want to impose that penalty on everybody. But we looked at the class of bugs that exist and what kinds of bugs we hoped to catch, and so we kept it to only junking small chunks. We don't necessarily want to go and junk hundreds of megabytes of memory that you freed. In particular, malloc, because it uses mmap, gets zero-fill-on-demand memory, and so you might not even have paged in all of those pages. It's silly for malloc to fault a page in just to zero it or junk it when it never would have been touched in the first place. Also, I think use-after-free bugs are more pervasive and more dangerous than uninitialized-memory bugs. That's why we don't do this on the front side, coming out of malloc; we only do it on the chunks that go back into free. And that works out pretty well, because then, as a side effect, if the object is recycled, when you malloc it again, it's already been junked by the previous free. So it's a pretty efficient way to flush out bugs. We turned this on by default because, however much of the user base runs with J turned on, you can always find more bugs by conscripting more testers. So if you don't want to volunteer, you're going to be volunteered. Now, there are some other options that can be interesting. There are the F and U options, freeguard and free unmap, and what they do is unmap pages on the free list. This is a bit more of an expensive operation, and it only works at page-size granularity. But it can detect a couple of other classes of bugs, where instead of just overwriting the data, by actually unmapping the page entirely, we use the kernel's memory protection to trigger segfaults whenever the page is accessed. And there are some other options, the P option as well.
Ongoing work I have is to try to keep distinct objects on distinct pages, so that you end up being able to unmap pages more frequently. Because right now there's a situation where you allocate one chunk, then you allocate another chunk, then you free the first chunk; as long as the second chunk is still allocated, the page can't be unmapped. But if we can keep separate chunks on separate pages, and fragment the address space on purpose, then we get the ability to unmap more aggressively. One option I'll mention, which I think used to be more popular and which I don't necessarily recommend as much anymore, is the G option, which turns on guard pages. Currently the kernel does a pretty good job of handing back randomized addresses for each mmap allocation, and so you end up with implicit guard pages between a lot of the allocations. That, I would say, is sufficient for running all the time. If you want to turn on G, you can. It comes with a heavier performance hit, and it affects some of the important code paths: every allocation has to allocate a guard page and then mprotect it, and when you free, it has to mprotect it again, and it does all these crazy dances. So if you're looking to conserve performance, I'd skip that one. It works, and obviously more options is better; it catches more than running without the option. But I think in the long term, the number of bugs it's going to flush out is not as great as some of the other things. So, another term for junking memory is poisoning. This is the term we use in the OpenBSD kernel, for instance. And I'm going to talk a little bit more about the theory behind junking and discuss some other ways we can use this to our advantage.
So malloc, which I've been discussing, is a general-purpose allocator. You give it a number of bytes; it gives you back bytes. It operates on memory. But I want to switch the discussion now to objects, because I think that helps us better understand the kinds of bugs we're going to have. Every object has a lifetime: it's allocated, then you use it, and then you free it or destroy it. Bugs result when the code using an object doesn't respect that lifetime. So what we're looking for is an enforcement mechanism, and that's where poisoning comes in. Poisoning an object can be as simple as overwriting the memory with a simple pattern, but it can also be considerably more complex. For instance, you can try to pick a fill pattern that is deliberately designed so that pointers are invalid. In the kernel, the popular choice for this is something like 0xdeadbeef. But actually, on the i386 architecture, a 0xdeadbeef pointer lands in userland address space, and so you don't necessarily want to be dereferencing that, because it actually might work. So what we did instead is change it to a 0xee-something address, and that page is guaranteed to be unmapped, so dereferencing that pointer is guaranteed to crash. You can also use a few different fill patterns, because bugs, as I mentioned, have a tendency to adapt to whatever you do. And this is a case where, despite all the things we were doing, there were still a bunch of bugs. For a while, Theo and I had been discussing the possibility that the 0xdeadbeef value we had been using in the kernel might have accidentally, conveniently aligned with some flag values that were actually valid. So you would free an object, and the poison would clear or set particular flag fields in the struct in a way that rendered the struct still valid. So as an experiment earlier this summer, I inverted the bit patterns used.
And it wasn't long before the smoke came pouring out. There are actually two poison values used in the OpenBSD kernel, but they're very similar; one was originally used for pool and one for malloc, and they only vary in a couple of bits, but most of the bits are the same. By inverting them, we ended up with an entirely different bit pattern. And there was a bug where the function which establishes interrupt handlers on i386 failed to initialize the flags field of a struct. So it would just get something like 0xdeadbeef set in it. And that was fine, because the flags that were being set or not set were meaningless, and so the code continued to work. However, when we inverted the bit pattern, it set the mpsafe flag. And when you mark an interrupt handler as mpsafe and it's not mpsafe, that's when bad things happen. Now, unfortunately, this is actually kind of difficult to track down, because you don't crash right away. Instead, the interrupt handler is running without the kernel lock, and all of a sudden it maybe corrupts some memory, maybe triggers some other assertions. And only some drivers were affected, and only on some machines, because of the way interrupt handlers are run; sometimes you're holding the big lock anyway. It was a very complicated thing. But certain machines would just go boom. So we start backing out diffs, and finally somebody backed out the poison diff I had made and realized: hey, now the kernel works again. And then we realized: OK, so this is uninitialized memory. So let's start taking a look at all of the fields that get touched. We inspected the interrupt handler structure, looked at its flags field, and realized that the value there didn't make any sense, because it had a lot of bits set that should not have been set. Fixed that bug. I think there were still two or three other bugs suspected of being caused by the bit inversion change.
So the change actually got backed out, because we were heading into release and it was causing too much trouble. It's kind of a cautionary tale: you add these options because you want to flush out bugs, but then you end up with bugs that have adapted to the bug-detecting code. So you want to change things up from time to time and keep the patterns you're using variable. One last thing the kernel does, which unfortunately userland doesn't do at present, although I have some work going along these lines, is to check the poison value. So, don't depend on the poison, but do check that it remains intact. This can detect writes after free: you fill an object with 0xdeadbeef, and the pool code is actually pretty good about this, where you can set an option so that on every allocation and every free, it walks its entire free list to make sure that all of the objects which have previously been freed are still in their pristine poisoned state, and nobody has muddled with them in any way. If anybody changes an object after you've freed it, that's a good time to panic. Now, while I'm on that subject, this brings us to the topic of recycling. Recycling policy is how an allocator decides when to reuse memory. You're going to have a free list. It doesn't necessarily have to be a list; userland malloc actually uses a bit array, but the kernel malloc literally is just a list with pointers to the next free chunk. The way recycling relates to poisoning is: if you poison an object after you free it, but then you allocate it again, it's going to be reinitialized. Then you can't detect a use-after-free on that object anymore, because the poison has been washed off, and the previously dangling pointer is pointing to a valid object again.
It's not the object it thinks it's pointing to, but it's a valid object. And so this just causes more and more corruption and delays detection of the bug. Now, there are a couple of ways an allocator can decide which pointers to recycle and when. Probably the most common policy on a lot of systems is fast recycle, which is last in, first out: you free something, and the very next call to malloc returns the same object that was just freed. This is great for performance, because it keeps things hot in the cache. Unfortunately, it's not so great for detecting bugs. As I was explaining: you have an object, and you free it. So you've got this thing that's marked free, and you want to watch it, make sure it doesn't change, detect bugs in it, keep it filled with these crazy poison values. But if you allocate it again and turn it back into an object, all the buggy code that uses that object is going to see a valid object. And so, yeah, this has caused a lot of problems. Also, from a security standpoint, this is pretty bad, because it's probably the most predictable, deterministic behavior an allocator can have. We've been addressing this in part from the direction of exploit mitigation: if you read about how heap exploits work, they generally require allocating and freeing objects in a particular pattern. That pattern is easiest to manipulate with a fast-recycling allocator, and that gives the attacker control over the heap and allows them to guarantee that an old and a new object will overlap in the same region of memory. Now, the opposite policy would be slow recycle, which is first in, first out, or last in, last out. This is where you free something and it stays on the free list for as long as possible.
This is LRU, and the buffer cache uses slow recycle, because you want to keep buffers valid for as long as possible, although the buffer cache isn't quite an allocator. Now, this is difficult to implement in some cases, and it's also probably the least performance-friendly: you have an object in cache, you put it at the end of the free list, and then you take something from the head of the free list, which hasn't been in cache for as long as possible. But what you see in a number of allocators is what I'll call indeterminate recycling. I'm not sure that's a great name; I don't really have a better one. What I mean is that most allocators have a hybrid approach, where you have a number of free lists. In malloc, you'll have a free list for each page, and you have a current working page that you return objects from. That page will be fast recycle: the last object put into that page is going to be the first one out. But if you free an object from a different page, it gets stashed away somewhere else and won't get recycled right away. Now, the problem is this can often decay to fast recycling: you allocate something, you put it back, you allocate something, you put it back, and those objects are always going to come from and go back to the current page, back and forth. So in both userland malloc and the kernel pool code, we try to avoid this by occasionally reselecting a random current page. You'll run for a while with the current page, but after every so many allocations, the code will say: OK, that's enough for this page, we need to swap to a different page to make sure we get a little variety in where our objects are coming from. And then there's random recycling.
Userland free in OpenBSD does this, where you deliberately try to avoid a deterministic order for allocation patterns. What we do is, when you free an object, it's not actually freed; it goes into a queue of objects to be freed, and then we randomly select something from that queue to actually free. Because it's random, you shuffle the order. This was added as a security feature, because it thwarts efforts to create a deterministic pattern, but it's also great at mixing things up in everyday programs. On this subject, I wanted to point you to the Google Project Zero blog. This is a very technical blog where they write up how they exploit some current vulnerabilities. Two in particular: an earlier post on the pwn4fun bug exploiting Safari, and a more recent post exploiting Flash. Both of these were heap exploits, and the blog posts do a great job of explaining how they arranged all the structures in memory so that their exploit would work. Even if you're not an exploit developer, understanding how allocators work is key to understanding how programs work, so there's a lot of useful information in those posts for everyday developers. OK, so I wanted to address the topic of mostly harmless bugs. As I mentioned, we add some exploit mitigation technology and it shakes out a bunch of new bugs. One of the more difficult things we integrated was the stack protector, ProPolice, because lots of latent bugs turned up when we added it. In large part, that's because it uses not only a stack cookie to protect the stack; it also rearranges stack buffers so that even small one-byte overflows are going to hit the cookie. In practice, this means lots of tiny one-byte overflows are detected. And then we did a similar thing in malloc, where we rearrange things and shuffle them around so that allocations are going to end on an unmapped page.
So as soon as you overflow even by one byte, your program segfaults. Unfortunately, there's this kind of mentality whenever a one-byte overflow is found: people say, oh, that's harmless. Then they revise that to, oh, it's mostly harmless. Then, well, possibly harmless. Then: OK, so it's not so harmless. I think finding these bugs in everyday running is very important, because you don't want somebody else to find them later and be able to exploit them. So we're trying to kind of self-exploit here and drive out more bugs. Actually, on a random digression: there is an option in OpenBSD's GCC called -fstack-shuffle. As I mentioned, ProPolice rearranges the buffers so that a buffer that might overflow is right next to the cookie. The problem is, if you have a function with two buffers, only one of them can be next to the cookie; the other buffer is going to be next to the first buffer, and so you might not detect the overflow. Those buffer overflows are perhaps not exploitable, because you haven't overflowed all the way to the return address on the stack, but it's still a bug, and we still want to catch these. So -fstack-shuffle randomly sorts all the buffers on the stack every time you compile the program, so that they're different. This is a compile-time option, not runtime; if you want to test a different ordering, you have to recompile. Unfortunately, it's kind of hard to generate random stack frames at runtime. But as soon as we added this to the tree, within, I believe, one day, two bugs were found. It's not on by default, though. I think Miod did a build with this option turned on, and fdisk broke and ld.so broke, I think. But he didn't even attempt a ports build, so who knows what else is out there? Now, in practice, there are other approaches to bug detection: static analysis, of course; there's being careful; there's code review.
What I've been talking about is in the category of dynamic tooling, and there are lots of options there. I always like to reference Electric Fence; Electric Fence was the original inspiration for a lot of the work we've done in malloc. There's Valgrind. The problem with a lot of these tools, I think, is that you don't use them enough. Back in school, we had Purify, which was phenomenal at what it does. But the thing is, you'd debug and test your program, and then right before you submitted it, you'd run make purify. Then you'd run the Purify build and make sure it didn't print anything out, because the TA would take points off for that. But you didn't run it normally. And then if Purify did find a bug, you'd go: oh, crap. You went back, but you didn't know what change triggered the Purify bug, because you weren't running it after every change, every compile. Yeah, that's a silly way to do it. We should have been doing it all the time, but it's human nature to take shortcuts. And so what we've been trying to build here with OpenBSD is a system that's on all the time, and we catch more bugs this way. One final point on that: no matter how good your test coverage is, it's never going to account for all real-world inputs. It's really only by running code in the real world, with these kinds of mechanisms in place, that you detect all the bugs. So, putting this to use: if you don't run OpenBSD, I think you should. And if you are a software developer, you should consider adding OpenBSD to your test farm. I think there are a lot of reasons to pick OpenBSD, but hopefully I've given you one more here. Software that's developed on OpenBSD tends to work on other platforms; the reverse isn't always true. And I'm not talking about APIs and portability; I'm simply talking about correctness here.
Unfortunately, I think this actually affects OpenBSD's reputation somewhat negatively, because people say: oh, hey, this program crashes when I run it on OpenBSD; meh, meh, meh, OpenBSD sucks. Sorry, I beg to differ. I think it's the program that sucks. And just because a program doesn't crash when you run it on some other platform doesn't mean it can't be induced to crash. You should actually try to run your programs on the platform that crashes the most. If you're developing a library, don't fight the operating system. If you're developing an application, you should be aware of how your libraries are allocating memory and how they're treating it, so that you can be on guard for what kind of latent bugs you might be introducing and what kinds of assumptions you're implicitly making based on your libraries' behavior. Fast recycling is very common in custom allocators and caches, and it hides lots of bugs. To pick on one example: another OpenBSD developer told me they patched the Apache Portable Runtime, which has a pool allocator, to just pass through directly to OpenBSD malloc and free, and Subversion stopped working. So there was a bug. Next: assertions. Usually, and this is a more general development technique, we add assertions to say: hey, this has to happen, or: make sure this didn't happen. But there's another thing you can do with assertions, which is to say: well, this can happen, so let's make it happen. We did this in the OpenBSD kernel, where you can call pool or malloc with a flag that says WAITOK. When you indicate this flag, it means the call can go to sleep waiting for resources and wake up sometime later. So I made a change where, when you pass that flag, instead of merely being able to sleep, you always sleep. And boom: NFS broke, ptrace broke, the uvm_km thread allocator broke.
There were race conditions everywhere, where people were assuming that these calls to malloc were going to be atomic. Even though they were saying, hey, this is safe, you can sleep, I'm aware of my state, it turns out none of them were. And so this is just an example of a general principle: if you can do something, make it happen. Or if you don't have to do something, don't ever do it. And then see what happens. So, randomization, along those lines: if something can be random, make it random. This is good for security, but also good for development. I have a funny story from a long time ago. There was a bug in librthread: shortly after we switched to librthread and released it, somebody noticed a crash. And this is kind of technical, but the highlights are: there is a reaper function, which would garbage collect stacks that were no longer in use. But it did this by determining which threads had exited based on their PID. Unfortunately, there is a race where a thread could exit, and then a new thread could be created with the same PID as the exited thread. And then the reaper would delete the stack of the new thread, and that caused the program to crash. This was noticed by a user. He sent in a bug report with a stack trace, and I was like, hey, that looks like you're deleting the stack of a newly created thread; I wonder what happened. I wrote a short test case, which just created and deleted a whole bunch of threads and interleaved them a little bit, and was able to have a reproducible test case in a matter of minutes. This is the kind of bug that would have gone undetected probably for years if we had used sequential PIDs, because the probability of reusing a PID if you're strictly cycling around is actually very, very low. However, that doesn't mean it can't happen.
So what would have happened is somebody would have some box with four years of uptime and some massive Java process that had been running and running for years and years, and they would hit this race condition right at the instant when PIDs happened to roll over. Then their $1 billion Java process is going to crash, and they're going to be very sad, and it's going to be impossible to debug, because we won't be able to reproduce it for another four years, and nobody's going to care. And so, ironically, a feature that was designed to prevent race conditions by reducing the predictability of PIDs actually made this race condition much easier to trigger. And so then, by exploiting this race condition in librthread, we were able to fix the bug. There was another bug, this is hysterical, in hibernate, that Mike found, where for a long time the stack protector cookie in the kernel was a fixed value, because there was nowhere to get a random value for the stack cookie from. By the time the random subsystem was up and running, you had already been running for too long to use a random value for the stack cookie. And so it was a fixed value of a NUL byte and a carriage return and some other stuff to make it hard to stuff into a string. But we made a change to the bootloader, where now the bootloader can inject a random stack cookie. And the way hibernate works is you boot one kernel, and then it loads the old running image from swap and copies it over itself and then keeps running. But the problem is, it changed the stack cookie from the one in the newly booted kernel to the stack cookie from the previously running kernel, but it didn't change the cookie in its own stack. And so then, when it returned, the stack cookie didn't match what it should have been, and the kernel panicked. And so the bug, I think, was that we were running on the wrong stack for too long. We should have switched to a stack belonging to the new kernel or the old kernel.
Hibernate's kind of confusing. But the moral of the story is, we changed some of the conditions that the kernel was operating in, and this challenged some of the assumptions that the running code had been making, and a bug was found. And again, it was a case where the bug was harmless, but probably would have been flushed out sooner or later by some other change. But this particular change brought it to light sooner. And so that's what made it a good change. I think the next slide says Q&A, doesn't it? Yes, I'm done. Any questions? [Audience] Have you considered making read and write system calls randomly return EINTR? Randomly returning EINTR. We were talking about this at dinner last night for some other system calls, actually. But yes, I think there is some work there. And probably some of you have heard about the Netflix Chaos Monkey thing, where it just goes around and kills random processes in their distributed system. And that proves that their load balancing and redundancy are working correctly. And the only way to test that is to actually pull the plug and kill random processes. I think that's a little aggressive to actually be running on your desktop. And so we're trying to stay within the bounds of making things a little challenging. But yeah, I think the rate of progress here is limited by maintaining some degree of system stability and being able to run the software that we want to run. Any other questions? All right, let's thank our speaker.