Welcome everybody. My name is Otto Moerbeek, and I'm going to talk about my, or rather OpenBSD's, implementation of malloc, especially from a developer's point of view. So who am I? I've been an OpenBSD developer for more than 20 years now, and as a day job I work for a company called PowerDNS, where I work on DNS software. For OpenBSD I've worked on many things, mostly in userland: libc stuff, but also things like patch, bc, and dc, all around userland, and some kernel work in the file system area. I would say one of my major contributions over the years has been my malloc implementation, which was incorporated by the commit I'm showing here in 2008, about 15 years ago. Before that, OpenBSD used the same malloc implementation as FreeBSD did, phkmalloc. In the meantime, FreeBSD has also switched malloc implementations. But let's talk a bit about what malloc, the API, actually is. It's quite a simple API, and that has big consequences for your degrees of freedom in implementing it. I assume you all know a bit about C programming: if you want to store something in memory, you basically have two ways to do that. One is on the stack, for things that are transient and will disappear when your function ends; the other is more long-term, and the basic API to do that from userland is malloc. There are other ways, Unix-specific ways like mmap, but we're talking about the POSIX API for malloc here. Two calls, malloc and free, basically saying "I want a piece of memory" and "I no longer need it". There are a few related functions in the API definition, but they can be expressed in terms of malloc and free themselves.
Of course, any implementation doing serious work will have specialized implementations of the other API calls, like realloc and calloc, because once you know a bit about the actual implementation of malloc, you can make much faster versions of them than you get by expressing them in terms of basic malloc and free. There are a few extra rules in the API, like how the returned memory is aligned, but the most important thing is that since the API is so simple in its definition, there's a lot of freedom in how to implement it. One of the things you have to consider is that malloc needs to store some information somewhere to be able to do its job. You can imagine a very simple implementation of malloc, given a way of getting memory from the kernel, for example mmap. That implementation conforms to the API but will perform very badly. On each malloc call you say: round the request up to the next multiple of the page size, call mmap for that size, do a little error handling, and return what you get from the kernel. The free call you just don't implement; it's an empty body. That's actually a conforming implementation of malloc. Of course it will perform quite badly, because it does a system call on each allocation. It will also greatly overuse memory, both because you never free anything and because you allocate at least a page for each allocation. But it works. You can do it, and it's nice to play around with that type of implementation. For real, production work, of course, there are a lot of other things to decide. Originally, old Unix had a single way of getting memory from the kernel for a process: at the bottom of the address space you have your code, and at the top end you have your stack.
Anything in the middle is either data for the application or inaccessible, unmapped memory. With the sbrk system call, you could extend the size of the data accessible to the application. That means you have a contiguous part of virtual memory which can grow in one direction. It can also shrink, but not a lot of implementations actually do that. But for quite some time now there has been another way of getting memory from the kernel, and that is called mmap, which gets you page-sized (or larger) pieces of memory from the kernel. So that's one design decision: how do you get memory from the kernel? Because malloc has to get its memory from somewhere. Related to that is: where do I store my metadata? If you want to be able to implement free, you have to know, given just a pointer previously returned by malloc, what size the allocation is, so you can do the correct administrative work to free that memory, mark it unused, and maybe even return it to the kernel, saying: this memory I no longer need for the application. Historically, that metadata was mostly stored pretty close to, or maybe even as part of, the application data itself. If you asked for a piece of memory, malloc would store, just in front of the returned pointer at a slightly lower address, things like a link to another piece of malloc'd memory, maybe a size, and a few status bits. That is one way of doing it. We'll see that my malloc, the OpenBSD malloc, takes another approach, where the metadata needed for the administrative work is not stored anywhere close to the data given to the application. Like I said, there's also the decision of what to do with freed memory. If you have a page that is totally unused, you could return it to the kernel, because you don't need it anymore; the kernel is then free to do other things with it. You could also say: well, I'm saving it for later.
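The classic in-band metadata layout mentioned here, with a small header just below the returned pointer, might be sketched like this. The field names are purely illustrative, not any real allocator's:

```c
#include <stddef.h>
#include <stdlib.h>

/* Classic in-band layout: a small header sits just below the pointer
 * handed to the application.  Field names are illustrative only. */
struct chunk_hdr {
    size_t size;              /* size of the allocation */
    struct chunk_hdr *next;   /* link to a neighbouring chunk */
    unsigned flags;           /* e.g. an in-use bit */
};

/* Given an application pointer, the allocator finds its metadata by
 * stepping back over the header -- which is exactly why an application
 * buffer overflow or underflow can corrupt the allocator's own state. */
struct chunk_hdr *hdr_of(void *app_ptr)
{
    return (struct chunk_hdr *)app_ptr - 1;
}
```

The out-of-band approach discussed next avoids exactly this coupling between application data and allocator state.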
Maybe the application will need that page sometime later on. Or you could maintain it for a while in some caching manner, and only return it to the kernel once you see that the application's memory requirements are not as high anymore. So these policies are important. Now, on OpenBSD, mmap is a bit different than on many other systems, and that has consequences for malloc as well. One of the design choices I made is that any memory used by malloc, or needed by malloc to give to an application, originally comes from an mmap call. That mmap call is randomized on OpenBSD. That means any call where you do not ask for a specific address, but just say "give me some pages of memory", will return a randomized address within the address space of the application. That has consequences, because in general it's a bit harder to keep track of memory when it is not contiguous but scattered around the address space. That introduces some overhead here and there, not only on the application side but also on the kernel side. But we really do think it's a worthwhile approach. Why? If you have a memory page at a random spot in virtual memory, it is surrounded by unmapped memory most of the time. That means any out-of-bounds access to that page will be a segmentation violation, without any extra cost to the application. Of course, there is some page management overhead on the kernel side, because a fragmented memory map introduces more overhead than a contiguous one. We are willing to pay that price, because in the end it helps us find bugs. We'll see this theme of randomizing allocations much more, in different areas. Of course, since you are requesting pages from the kernel, there's a minimum size, which is typically 4k.
There are other, less common architectures, like sparc64 or some MIPS implementations, that use larger page sizes. But the main point to remember is that ASLR on OpenBSD not only covers where libraries, executables, and the stack end up in memory, but is extended to the heap this way. Each run of a program will use a different memory layout. For smaller allocations, what malloc does is ask for a page from the kernel, divide it up into smaller pieces, and maintain a bitmap to track which of those pieces, called chunks, are free and can be given to the application, and which are marked in use. So, some design goals. The main point, as we'll see, is that since the API is simple, you have a lot of degrees of freedom and can choose many different approaches. We do strict internal consistency checking, which we can do because we maintain metadata out of band, meaning there's a separate data structure that knows the sizes and other properties of the various pieces. We use randomization in many places: not only do we get a random address when asking the kernel for a page, but when we allocate a chunk for a small-sized piece, we also make sure that is randomized. So you don't always get the same chunks in the same order when you're requesting and freeing memory. The caching of those chunks is randomized as well, so reuse is also randomized. We'll talk about that a bit more later. So: store metadata out of band; that is a fundamental design decision. Spend effort on trying to detect API misuse, for example double frees, and also implement facilities to detect not only API misuse but misuse of the data itself. A typical example is an out-of-bounds write, which in C is perfectly possible; there's no guarantee from the compiler that it won't happen.
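The idea of dividing a page into chunks tracked by a bitmap, with randomized chunk selection, could be sketched like this. This is a toy model; the chunk size, bitmap layout, and names are my assumptions, not OpenBSD's actual data structures:

```c
#include <stdint.h>

#define CHUNKS_PER_PAGE 64   /* e.g. a 4096-byte page split into 64-byte chunks */

/* Out-of-band bookkeeping for one page of small chunks:
 * one bit per chunk, 1 = free.  Illustrative only. */
struct chunk_page {
    uint64_t free_bits;
};

/* Pick a free chunk, starting the scan at a random offset, so repeated
 * identical requests do not hand back chunks in a predictable order. */
int pick_random_chunk(struct chunk_page *cp, unsigned rnd)
{
    for (int i = 0; i < CHUNKS_PER_PAGE; i++) {
        int idx = (rnd + i) % CHUNKS_PER_PAGE;
        if (cp->free_bits & ((uint64_t)1 << idx)) {
            cp->free_bits &= ~((uint64_t)1 << idx);  /* mark in use */
            return idx;
        }
    }
    return -1;  /* page full */
}
```

With a fresh random offset per request, two identical allocation sequences land in different chunks, which is exactly the property that makes layout-dependent bugs visible across runs.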
But malloc can help you detect those, at least in some cases. Use-after-free is a different type of error: I've freed a piece of memory, but I'm still writing to it. Reading is one thing, but writing is even more devastating. So we try to detect that too. And it turns out that reaching those goals is not as hard as you would imagine, in the sense that they overlap: something which enhances security often also helps the developer find bugs. So let's look at some of the design choices. I'm comparing against a typical malloc implementation, which could be glibc's malloc or jemalloc or any other; there are of course specialized implementations, and this is not meant to bash any specific one. What I'm trying to show is that you have a lot of design choices, and we made different ones. Other implementations have their own approach and reach their goals in the way they prefer, and that has consequences too. For example, if you look at the bottom row, speed: there are quite a lot of malloc implementations that are very, very fast, and OpenBSD's is not one of them. It takes time and effort to do some of the checks; the randomization, for example, is certainly not free. A typical other malloc says: I have a compact memory model, so allocations are close to each other. That helps speed, and it also makes your data structures a bit simpler in a lot of cases. OpenBSD instead deliberately chose a scattered layout, with memory spread around the address space. Returning memory: there are some very fast malloc implementations, but one of the reasons they achieve that speed is that they never return memory to the kernel.
That simplifies things a lot, not only because you never actually have to call munmap, but also because you don't have to decide whether to do so, or keep track of the data needed to do it. Storing metadata near allocations: other malloc implementations typically do that, and it has security implications. The consequences of a bug in your application program will potentially be much bigger than in our setup, where a typical out-of-bounds write does not end up destroying malloc's metadata. If you store your metadata close to the application data, it will get overwritten one day or another, for some buggy reason. Randomization and internal consistency checks: we do a lot. Other malloc implementations do some of this, but not as extensively. We also have some additional optional checks, which I will address later. And when something bad happens, what do we do? Some malloc implementations just ignore it. Some give feedback but still continue. Some have bits you can tweak to choose what happens on an error. What we do is: we never continue. If we see inconsistent use of the API or another error, we just abort, with a message; we try to explain why we abort, to help the developer find the bug. And lastly, speed, which I already mentioned: there's no such thing as a free lunch, so we pay a price. I think the speed loss is often not that big a problem, and in some cases it is not even that big. Our malloc is fast enough for many, many use cases, but I completely admit it is not as fast as other implementations. OK, let's look at some typical errors a developer could make and what our malloc does with them. So, here's a little program.
As I said, it has at least one bug, probably more; I think I could spot at least three. But let's concentrate on the specific use of malloc, and not only the use of malloc itself but also the use of the memory returned by malloc. I hope at least some of you spot the bug, and I think any C programmer will probably have to admit they've made this kind of error at least once, and probably more often. So we run the program, and there's no indication in the output that there was actually a bug. It prints a pointer value, which I assume is correct, the pointer p we received, and nothing else. What I'm doing here is allocating 10 pages of memory, on a typical system with 4K pages. At least some of us know where the bug is by now, I hope, but there's no indication that any memory corruption occurred. Let's try it on OpenBSD, because that first system was not OpenBSD; I'll explain that later. And there we see a segmentation fault. This is an example of the scattered memory layout, where you have 10 pages somewhere in memory. Virtual memory is much, much bigger than the gigabytes of physical memory you might have, so the chance that this allocation sits right next to another mapped page is virtually zero. The assignment in the middle of the program is an out-of-bounds write, because we should have written at index size minus one. Of course, this is not a bug detected by malloc itself, literally speaking, but it is detected because we prefer a memory layout scattered around the virtual address space, which lets the kernel do its job of catching out-of-bounds writes. The first system was a FreeBSD system, which uses jemalloc. Debian, using glibc, shows exactly the same behavior, by the way.
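The demo program itself isn't visible in this transcript; a minimal reconstruction of the kind of off-by-one it demonstrates might look like this. The function name, parameters, and the fixed 4096-byte page size are my assumptions:

```c
#include <stddef.h>
#include <stdlib.h>

#define PAGESIZE 4096   /* assumed page size */

/* Allocate npages pages and optionally write one byte past the end --
 * the off-by-one described in the talk, reconstructed from context.
 * Returns the allocation so the caller can inspect or free it. */
char *alloc_pages_with_bug(size_t npages, int trigger_bug)
{
    char *p = malloc(npages * PAGESIZE);
    if (p == NULL)
        return NULL;
    p[npages * PAGESIZE - 1] = 0;    /* last valid byte: fine */
    if (trigger_bug)
        p[npages * PAGESIZE] = 0;    /* one past the end: SIGSEGV on
                                        OpenBSD's scattered layout,
                                        typically silent elsewhere */
    return p;
}
```

With `trigger_bug` set, OpenBSD's randomized, guard-surrounded pages turn the stray write into an immediate crash, while a contiguous heap usually absorbs it silently.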
What FreeBSD does by default is extend memory contiguously, so your allocations are typically close to each other. The overwrite would probably have ended up in some other allocation. Finding that type of bug without tooling is often pretty hard, because there's a little corruption in some other piece of memory related to that assignment, and it might take some time to actually find the issue. So that was a big allocation, 10 pages. For smaller allocations things are a bit different, because even on OpenBSD small allocations share a single page. That has the consequence that an out-of-bounds write from a small allocation will typically end up in another small allocation. So take a smaller example of the same problem we discussed in the first example. What we would expect is that on OpenBSD, too, that type of bug will not be caught immediately. There are a few cases where it will: if the small allocation is the last one on its page and your out-of-bounds write reaches beyond that page, then you end up with a segmentation fault. But typically that will not be the case. We look first at a FreeBSD system, the same order as with the first example. No problem: you see in the last command line that we allocate 1,000 bytes, and we get no indication of any problem at all. On the OpenBSD system it's the same. The first command line is the earlier run with the 10-page allocation; the second, the last one, is the 1,000-byte allocation. This is what I predicted: the overwrite ends up in a mapped piece of memory, so no segmentation violation. But we do have a means to actually detect this bug, and that is a flag called C, for canary.
I'll explain a bit more about that later; it enables the application programmer to detect this bug with the help of malloc. How does that work? The allocation actually done by malloc is bigger than what the application asked for, and we use that space to write a pattern of bytes beyond the application's part of the allocation, which I will show later. If we look at the Debian system, there's also no indication of a problem, so it has the same issue. There is an environment variable called MALLOC_CHECK_ which you can set, but if you read the manual page, it does not actually implement more checks; it just varies the information printed when a problem is detected. So that's a bit of a bummer: you have no way, at least not with the base system, to detect this bug on Debian either. The canary check that OpenBSD does makes use of the fact that the 1,000-byte allocation actually ends up in a chunk of 1,024 bytes. The last 24 bytes are what I call malloc-owned, but not application-owned; the application only owns 1,000 bytes. So the excess bytes are filled on allocation with a pattern by malloc, and when the allocation is freed, malloc checks whether those bytes have been overwritten, or still have the value originally written. This is an optional check, because it does cost some performance, but it's very handy to enable while developing or bug hunting. It is always available at runtime, on any system: you don't have to compile a special version of malloc; it's just a runtime decision to enable it or not. OK, let's take a look at another typical heap management problem: the double free.
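The canary mechanism just described might be sketched like this. The pattern byte and function names are illustrative, not OpenBSD's actual implementation:

```c
#include <stddef.h>

#define CANARY_BYTE 0xDB   /* illustrative pattern, not OpenBSD's actual value */

/* On allocation: fill the slack between the requested size and the
 * chunk size with the pattern. */
void canary_write(unsigned char *chunk, size_t requested, size_t chunk_size)
{
    for (size_t i = requested; i < chunk_size; i++)
        chunk[i] = CANARY_BYTE;
}

/* On free: verify the slack is untouched; returns 0 on corruption.
 * The real malloc would abort with a diagnostic message instead. */
int canary_check(const unsigned char *chunk, size_t requested, size_t chunk_size)
{
    for (size_t i = requested; i < chunk_size; i++)
        if (chunk[i] != CANARY_BYTE)
            return 0;
    return 1;
}
```

For the 1,000-byte allocation in a 1,024-byte chunk from the example, the 24 slack bytes carry the pattern, so even a one-byte overflow is caught at free time.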
Here we have a different program, which also has a few bugs, but let's concentrate on the memory management bug. It's a bit strange, because it's not actually a bug if you only look at the malloc API: what I'm freeing is something which was previously allocated, and there was no extra free in between. Still, it's potentially a problem, not for the heap management itself, but for the way those pointers could be used in the rest of your program. There's no rest of the program here, but you can imagine that if I pass q to some other piece of the program, that is actually freed memory. Whether you actually see the problem depends very much on what that other code does with it. What's happening here is that the second call to malloc returns the exact same pointer that was freed before. So typically, even if you are closely watching what happens to the pointer, you don't see a bug. On OpenBSD, there are cases where, given the randomization, this code runs fine, but in some cases it will be caught as a double free. Why is that? First, randomization means there is no guaranteed reuse. If you get an allocation, then another one, and you free the first, the next call for the same size might return the same pointer, but in most cases it will return a different one. There's also another mechanism in the OpenBSD malloc: the delayed free list. The delayed free list is good for security because it avoids some aliasing problems, where a pointer is used for a different purpose while it was actually freed. It also helps you detect this double-free bug: the moment free is called, OpenBSD stashes that pointer into a little queue, picks some other pointer from that queue, and actually frees that one instead. The next time free is called, the first pointer might be chosen, or some other pointer in the queue.
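The delayed free list just described can be sketched as follows. This is a toy model; the slot count, the randomization source, and all names are my assumptions, not OpenBSD's actual code:

```c
#include <stddef.h>

#define DELAY_SLOTS 16

/* Small queue of recently freed pointers.  On each free we swap the
 * incoming pointer into a random slot and only actually release the
 * evicted one.  A pointer already sitting in the queue means it was
 * freed twice in a short window. */
static void *delay_slots[DELAY_SLOTS];

/* Returns the pointer that should really be freed now (possibly NULL
 * early on).  Sets *double_free if p is already queued; the real
 * malloc would abort with a message at that point. */
void *delayed_free(void *p, unsigned rnd, int *double_free)
{
    *double_free = 0;
    for (int i = 0; i < DELAY_SLOTS; i++)
        if (p != NULL && delay_slots[i] == p) {
            *double_free = 1;    /* freed again while still queued */
            return NULL;
        }
    int idx = rnd % DELAY_SLOTS;
    void *evicted = delay_slots[idx];
    delay_slots[idx] = p;
    return evicted;
}
```

Because the slot is chosen randomly, the order in which queued pointers are actually released varies from run to run, which is the property the talk relies on for both security and double-free detection.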
So the order of frees that are actually performed varies randomly, and any pointer which was freed recently and is freed again will be spotted. That check is done on a sampling basis; you can force it to check all slots in the delayed free list by specifying the malloc option F. Reusing allocations is vital for performance; think of the very simple malloc implementation I sketched, which always allocates new memory. What I'm saying is that always reusing immediately is potentially dangerous, and it also does not allow you to catch double frees that happen within a short period of time. If you look at some recent bugs in various software, reuse of pointers can indeed lead to nasty bugs: filling a buffer with, say, secret data, freeing it, but still having a pointer aliasing problem and passing that pointer to another piece of the software, which then sees the buffer that contains the secrets. Not very nice. So not immediately reusing allocations helps in finding double-free bugs, but also with security in general. That is standard behaviour in the OpenBSD malloc. For page-sized and larger allocations, we also have a cache, and reuse from it is randomized too. The cache maintenance is randomized: if the cache is full, we randomly pick regions to give back to the OS, and when we need pages, we pick a random entry from the cache if it is not empty. The delayed free list has a drawback, in the sense that the errors it reports may concern a different piece of memory than the one you are freeing at that point in time: you call free on allocation x and get a complaint about allocation y. That can be a bit confusing, but I think it's worth the price. The last thing I'd like to discuss is leak detection.
Since we have a separate data structure that keeps track of allocations, and the metadata is not part of the allocations given to the application, we can also use it to list memory leaks after a program has run. For malloc implementations that use inline metadata instead, that's a bit harder, because in some cases malloc doesn't even know what it allocated unless you pass a pointer back to free. Malloc leak detection has been available, at least in some form, for a very long time, but it was never much used, for several reasons. The original solution was not compiled in by default, because it was not really polished yet; I was not very happy with it, actually. What it did, when you specified malloc option D, was dump leak information at the end of the program, along with a lot of extra information which is not very interesting to the general developer. But you needed to create a file for it in the current working directory, and the application had to be able to write to it. With the modern mitigations we have in OpenBSD, that's not always possible, because many programs do not actually have the right, the pledge, to write files. So the old mechanism didn't work very well and was not easy to access. The new solution is always compiled in, so it's always available; it's not active by default at runtime, but it is there. It exports data using utrace, and there's a small modification to kdump to display the information in a nice way. We also have some flexibility to record not only the immediate caller of a function, but a deeper caller. That is very handy, for example, for C++ programs, which would otherwise report all callers as being operator new, which is not really interesting; you want to know which code called new. An example: a very simple program with, as we can see, two leaks.
You see a program run which specifies malloc option D, and also a ktrace-enabled run. With -tu, I'm only tracing the utrace records coming out of this, and not other things like system calls or I/O. After this ktrace run I have a ktrace.out file, and I can use kdump -u malloc to get the information. In the output below we see an F column, which has the addresses of the actual call sites where the leaks happened, the total number of bytes leaked, the number of calls involved in each leak, and the average size; in this case, since the number of calls in both rows is one, the average is the same as the sum. We can then use addr2line to print where each call happened. It uses a different address than the one in the first line, because addr2line has to compensate for the load address of the library or executable. So this shows the location of the malloc calls that caused the leaks, on line nine and line eleven. If you look at the code, that is indeed the malloc call: the loop boundary is wrong in the corresponding free call. I cannot point to the free call, only to the malloc call, so you still have to figure out for yourself why that piece of memory was not freed, but at least this shows you where the allocation happened. This is, of course, basic, simple information to detect leaks. I'm not going to discuss in depth how it actually works, because time is running out; yes, five minutes, okay. There are built-ins in GCC as well as the Clang compilers to get the caller of a function. I record that in one of the metadata pieces I'm storing for pages allocated by malloc, and when the program finishes, I'm calling an atexit handler, if enabled, of course, because normally we do not do that.
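The caller-recording just described can be sketched roughly like this. It is a toy model assuming GCC/Clang's `__builtin_return_address`; the struct, array bound, and function names are mine, not OpenBSD's:

```c
#include <stddef.h>
#include <stdlib.h>

/* Per-allocation record kept out of band; 'caller' is the malloc call
 * site captured with the compiler built-in.  Illustrative only. */
struct alloc_rec {
    void *ptr;
    size_t size;
    void *caller;
};

#define MAX_RECS 1024
static struct alloc_rec recs[MAX_RECS];
static int nrecs;

void *traced_malloc(size_t size)
{
    void *p = malloc(size);
    if (p != NULL && nrecs < MAX_RECS) {
        recs[nrecs].ptr = p;
        recs[nrecs].size = size;
        /* depth 0 = our immediate caller; recording a deeper frame
         * helps with C++, where the immediate caller is always new */
        recs[nrecs].caller = __builtin_return_address(0);
        nrecs++;
    }
    return p;
}

/* At exit, sum the bytes still allocated from one call site --
 * the aggregation behind the kdump leak report's F column. */
size_t leaked_from(void *caller)
{
    size_t total = 0;
    for (int i = 0; i < nrecs; i++)
        if (recs[i].caller == caller && recs[i].ptr != NULL)
            total += recs[i].size;
    return total;
}
```

A real implementation would clear the record on free, so only genuinely unfreed allocations survive to the report.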
It aggregates all non-freed allocations on their caller value and makes a little report out of that. For chunks, the handling is a bit different, because I do not want to store an extra pointer for each chunk; I'm storing a single pointer for the allocation that ends up in slot zero, so effectively you get a sample of the leaks there. If a run gives you zero F values for your particular leak, just try running it again, and hope your allocation ends up in slot zero that time. Now, the next question is: this is basic leak detection, why not do more, like storing complete stack traces? And I can think of other extensions, because you might want to store more information for chunks too. But I tried to find a middle ground: things should not be too complex to implement, and not too complex to review, and some solutions would incur extra runtime overhead even when not enabled, which I do not like. So my advice would be: use the built-in facilities to detect that there are leaks, and if there's not enough information in the dump to see where a leak actually occurs, use some other tool to do more digging. There's a very good tool, valgrind. Sadly, the port to OpenBSD has not been finished yet, for a long time already. But an example run, in this case on Debian, though FreeBSD will give you the exact same information, shows much more about where the leak actually occurred; and if you have something like an overwrite, it will also tell you where the overwrite happens and where that buffer was allocated, and much more. So I'm hoping that the developer working on the valgrind port to OpenBSD will continue, and make it usable in more cases. To summarize: we have a lot of nice options that help you detect bugs, and if you have more exotic bugs, you can enable certain options to help you there.
And I would advise, if you are developing or bug hunting, to just use the malloc option S, because it's very thorough. It will cost you some performance, among other things because it disables the page cache; that means any page freed is returned to the kernel, so accessing it afterwards will cause a segmentation fault. On OpenBSD, ssh and sshd actually enable S all the time. The U option, unmap, is also really nice because it unmaps; actually, that's not quite the right term: it sets the protection of pages in the cache to PROT_NONE, which has basically the same effect as accessing a non-existent page, so you get a segmentation fault. But S enables all of these nice things at once. So, to summarize: the strictness that the OpenBSD malloc has not only helps with security, it also helps you find bugs. Randomization is important not only for security reasons, but also for finding memory-layout-specific bugs, because each run of your program will use a different memory layout. If you have a bug that depends on a particular memory layout, you can be certain that not all your runs will have that layout, so you are able to find it. The idea is: strictness is good, even if it costs performance, within reasonable limits, of course. Use malloc option S while hunting for bugs, or just during development; the earlier you find the bug, the better, that's common knowledge. And check your program with malloc option D for leaks. If D does not give you enough information to find the actual leak, use other tools, maybe even on another OS if need be, to find the root cause. And to end, above is a picture of Dijkstra, a famous Dutch computer scientist, famous for various things, but when asked what he would like his students to remember, he mentioned one thing.
He said: if one of my students later on is writing a dirty hack or using a non-rigorous approach to solve a problem, I would really like them to think, "if Dijkstra were looking over my shoulder, he would not have liked this." Of course, I cannot look over everybody's shoulder, but I can use my malloc as a kind of proxy for that. So if you are doing a quick hack and not doing your memory management properly, my malloc will likely act as that proxy and give you a bad feeling. But if it helps you fix the bug or find a better approach, I think the bad feeling is worth it; in the end, you will have a better program. Thank you. Any questions? Yes, the question is: the bitmap used to keep track of which chunks are free and which are not, is that also outside the data? Yes, that's part of the metadata. It's stored out of sight of the regular program, so there's only a very small chance that a random write will actually hit it. The second question is about alignment: it is 16, and that is implied by the minimum chunk size, which is 16 as well. When I get a page from the kernel, it's always page-aligned; I divide it up into chunks of at least 16 bytes, so each chunk has that minimum alignment, and for larger chunks that holds as well, because sizes are always a multiple of 16. That's it. Thank you again. Bye.