Okay, please welcome your next speaker, Brooks Davis, telling us everything you ever wanted to know about Hello World.

Okay, there we are. Welcome, and thanks for coming. I'm here today to talk to you about Hello World, the simplest little C program, the first example in K&R, and to explain a bit about how we got to the point where the basic Hello World is over a megabyte stripped on MIPS. There are actually good reasons for this, but they're a little unobvious. But before I dive into that, I'm going to give you a little introduction as to how I started digging into all this stuff.

It turned out I wanted to understand the process ABI, how processes start, and the basics of the C runtime, because I want to change all of it. I'm not really going to talk very much today about how I'm going to change it. I'll allude to some bits. The reason I want to change it is that I'm part of a research project at SRI International and the University of Cambridge, where we're developing extensions to CPU ISAs which give us the ability to make C-compatible memory safety and enforce it in hardware. So we can recompile C code and eliminate most buffer overflows pretty much out of the box, with only some minor changes to runtimes. And I had to dig into all these bits of how a simple program like Hello World works in order to change them in the right way to make this all work.

Just a few more things about our architecture. We replace integer pointers with unforgeable capabilities. This means that you can't materialize pointers out of thin air. If, say, you're the author of libcurl, you can't download them off the internet by accident and run them, things like that. As I said before, we prevent buffer overflows, and we also make compartmentalization extremely cheap. We'd like to be able to go from current compartmentalization, where something like Chrome or Firefox has a process per tab, to having a million compartments in your web browser.
So that's just an aside. When I mention CHERI throughout the talk, that's what I'm talking about. But now we'll dive into the way it works today on 64-bit MIPS.

So here's the first example in Kernighan and Ritchie's The C Programming Language. It's quite simple: all it does is print the string Hello World. Of course, you'll notice that this particular code isn't actually valid C anymore, so we need to make a few tweaks before we talk about it. We need to declare the arguments to main, or rather ignore them in this case. And then, that's a little boring, so we're going to take this one step farther and make printf take a couple of arguments. One thing that's interesting here is that printf in C is pretty darn close to a complete test of the C programming language and runtime. One person I talked to who wrote a different C-like language said that they got to about 90% in their conformance suite before it was possible to run Hello World. So there's lots of complicated stuff behind the scenes, and we'll start to dive into that a bit.

Naively, you might expect that this program is actually something like this program, where you simply have a string, you smush the string bits together, and you write them out. You might further expect that it looks like this bit of MIPS assembly. Now, don't worry, you don't really need to understand this to understand the talk. What happens here is we're putting the first argument, this one here, into the first argument register. We're loading the address of the string down here into the second argument register, the length here, and then the system call number here. And we're triggering a system call exception. Okay, I will try to turn my body instead. So at this point we make the system call, then we return from that system call, and we exit. You might think that's what's going on in the previous program, but it's actually quite a bit more complicated. The assembly program compiles to nine instructions.
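As a rough C equivalent of the naive version described above, the greeting can be emitted with a single write call and no stdio at all. This is a minimal sketch, not the code from the slides; the function name hello_raw is made up for illustration, and it assumes a POSIX system.

```c
#include <assert.h>
#include <string.h>
#include <unistd.h>

/*
 * Write the greeting with one write(2) call, bypassing printf and
 * the stdio buffering layer entirely.
 * Returns the number of bytes written, or -1 on error.
 */
static ssize_t hello_raw(int fd)
{
    static const char msg[] = "Hello World\n";

    assert(fd >= 0);
    return write(fd, msg, strlen(msg));
}
```

Calling hello_raw(STDOUT_FILENO) from main is all such a program would need to do, which is roughly what the nine-instruction assembly version amounts to.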
It's less than a kilobyte, mostly random ELF junk. However, the minimal C program there, and this is without printf, or seemingly without printf, is over 550K. Most of that, it turns out, is malloc, and a bit of it is localization. We'll get into that more.

So where does all this stuff come from? This program didn't do anything requiring localization. It basically has to make a simple system call. So let's take a look at program linkage. You might be linking Hello World with a simple command line like this one. If you pass the rather strange argument -### to your C compiler, and this works on both GCC and Clang, you'll see what things it's invoking. So here's the actual linkage command. You'll notice it's a bit of a mess. We have some simple stuff about the ABI, that we're linking it statically, and where we're outputting it. And then we have a bunch of extra files, so where did those come from? And then we have some extra libraries, and we link in libc implicitly. And then finally, some more files. So what are those?

Here's a quick table of the various bits. First of all, crt1.o. This contains the actual function that gets started at the beginning of the program. By convention, on FreeBSD at least, it's __start, and its job is to call main. Next up is crti.o. crti.o is a weird bit of code. It contains the prologues of two functions, _init and _fini. All it does is set up a stack frame. Later code, including any file in your program, might include bits of that function as raw machine code. It's a bit terrifying, and it is deprecated, but it's nonetheless still there. Next up, some variant of crtbegin is included. There are different ones for relocatable code, shared code, and static code. This includes the constructor and destructor sections, and declares function calls to these. Finally, there's crtend, which terminates those arrays.
And then there's crtn.o, which contains the bits of the _init and _fini functions that actually return. On FreeBSD, the paths they're included from are here.

Before I go on with the rest of the talk: I'm going to be showing you a lot of flame charts generated from traces that I've made with a customized QEMU, and the charts are generated with Brendan Gregg's FlameGraph tool. Those charts are online as clickable SVGs, so if you want to zoom in and out, it's kind of neat. Otherwise, you'll see them on the screen.

So we're going to start before the program starts. We're going to look at the execve system call. execve's job is to take a process and replace its entire contents with a new one. Typically, if, say, you're starting Hello World from a shell, you will have forked the shell, making an extra process which is a copy of the shell, and then you replace its contents using execve. On some operating systems, you will more commonly use something like posix_spawn, which does both of these things at once. That's convenient, because you don't have to set up a process that's a complicated copy of your bash or tcsh or whatever just to throw it all away. But in this example, we're going to look at execve.

So now we're going to zoom in a bit. The first bit of execve is exec_copyin_args. It does a bit of memory allocation in the kernel, and then it copies in the program path, as well as all the command line arguments for the program and all the environment that was passed to the program. It has to get this out of the parent process and into the kernel so that it can then be copied into the new process. After that, we move on to the next part, which is the body of the sys_execve function. Inside there, kern_execve is the general implementation. In FreeBSD, we separate the system call, which is the sys_ variant, from the actual implementation in most cases.
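As an aside on the fork-then-execve pattern a shell uses, the sequence can be sketched in a few lines of C. This is a minimal illustration assuming a POSIX system; run_program is a hypothetical helper name, and a real shell does considerably more (path search, job control, signal setup).

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

extern char **environ;

/*
 * Fork a child, replace the child's image with `path` via execve(2),
 * and wait for it. Returns the child's exit status, or -1 if the
 * fork or wait failed or the child did not exit normally.
 */
static int run_program(const char *path, char *const argv[])
{
    pid_t pid;
    int status;

    assert(path != NULL && argv != NULL);

    pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        /* Child: on success execve never returns. */
        execve(path, argv, environ);
        _exit(127);             /* only reached if execve failed */
    }
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

The argument vector and the environ pointer passed here are the data that the kernel has to copy in from the calling process before it tears the old image down.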
The reason is, for instance, that we use the kern_ version for both FreeBSD and Linux emulation. We may have to make some adjustments, say, to make Linux work, but the underlying bits of replacing the process are all the same. Similarly, we use this so we can run 32-bit programs on 64-bit machines.

So the first bit of kern_execve is namei, which is horrible and complicated and involved in the file system, so we're not going to talk about it any further, except to say that its job is to resolve the path that was passed in and find it in the file system. Next up, exec_check_permissions verifies that the file has the right permissions, that we're allowed to open it and execute it, and then it opens it, which is a bit mashed together, but that's how it is. Then exec_map_first_page does about what you'd expect. It maps the first page worth of data from the program, and it does this so that it can examine the headers in the program and determine whether, for instance, it is an actual ELF executable, as we're talking about, or perhaps a shell script, where it needs to find a program to run.

Now, the main part here is this exec ELF imgact function. It's an image activator, which is what we call the thing that takes a program and instantiates it into memory. If you want to dig into this part a bit more, one hint: all the ELF things in FreeBSD are compiled and named in a very strange way, so that the code can be shared between 32-bit and 64-bit. This bit down here is what the function declaration looks like, so if you're trying to find this particular one, it's not a whole lot of fun unless you know where to look. I include this largely because I was writing up a paper version of this talk recently, and it took me about 10 minutes to figure out where the darn thing was, so I thought a little hint might be appropriate.
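The header check performed on the first page, described earlier in this passage, can be sketched as a comparison against well-known magic bytes. This is an illustrative simplification, not the kernel's code: classify_image and the image_kind enum are hypothetical names, and a real image activator validates far more of the header than the first few bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum image_kind { IMAGE_UNKNOWN, IMAGE_ELF, IMAGE_SCRIPT };

/*
 * Inspect the first bytes of an executable image and guess its type.
 * ELF files start with 0x7f 'E' 'L' 'F'; interpreter scripts start
 * with "#!".
 */
static enum image_kind classify_image(const unsigned char *hdr, size_t len)
{
    static const unsigned char elf_magic[4] = { 0x7f, 'E', 'L', 'F' };

    assert(hdr != NULL);
    if (len >= sizeof(elf_magic) &&
        memcmp(hdr, elf_magic, sizeof(elf_magic)) == 0)
        return IMAGE_ELF;
    if (len >= 2 && hdr[0] == '#' && hdr[1] == '!')
        return IMAGE_SCRIPT;
    return IMAGE_UNKNOWN;
}
```

In the script case the kernel would go on to read the interpreter path that follows the "#!" and execute that program instead.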
So the first bit of the image activator is a call to exec_new_vmspace, which does kind of what you might think. That is to say, it takes the process and rips all the VM space out, so it rips all the page mappings out, and you end up with an empty process which has no memory behind it. So internally it goes in, rips out all the pages, and then it maps an initial stack into the process address space, as you can see down here at the bottom.

From there, we get back into the image activator code, whose job it is to actually load the program. First, elf_load_section maps the text segment into memory. For those who aren't familiar with it, the text segment is the part of the program that contains the executable code, as well as typically some constants related to execution, for instance offsets for jump tables and the like. Next up, another call to elf_load_section maps data and the BSS region, which is the part of data that is initialized to zero. So, for instance, some random global variable that you've declared will default to being zero, and that's part of the application binary interface for basic UNIX processes.

So, back in kern_execve, we now have much of the program mapped, and now we're going to copy out the various arguments and environment bits, and a few other things that I'll get to in a minute, to the end of the stack. Finally, this exec_setregs function sets up the registers of the processor so that when the execve system call returns, it returns into a call to the __start function. So we're about to get ready to actually run code. So, back in execve, we have set up all of our memory, we have mapped all of our bits, and now we're about to return to the address space.
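The relationship between data and BSS mentioned above can be sketched with plain memory operations: the bytes that exist in the file are copied, and the rest of the segment is zero-filled. This is a simplification that assumes the segment is copied with memcpy rather than mapped page by page as a kernel would do; load_segment and its parameter names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Place one loadable segment in memory.
 *   filesz: bytes of initialized data present in the file
 *   memsz:  total size the segment occupies in memory
 * Everything between filesz and memsz is the zero-initialized part
 * (BSS), which takes no space in the file.
 */
static void load_segment(unsigned char *dst, size_t memsz,
                         const unsigned char *src, size_t filesz)
{
    assert(dst != NULL && src != NULL);
    assert(filesz <= memsz);

    memcpy(dst, src, filesz);                 /* initialized data */
    memset(dst + filesz, 0, memsz - filesz);  /* BSS: all zeroes   */
}
```

This is why an uninitialized global costs nothing in the executable file but still reads as zero when the program starts.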
So as a recap, the stack has been mapped into the address space, the program has been mapped into the address space, and various strings and environment variables have been copied in, as well as the signal handling code and some other bits.

So, a bit of an aside on those various bits that are copied in. Despite the fact that we're on MIPS, we're using the SCO i386 ABI, and in fact on basically every processor we follow this ABI. This is how memory is laid out so that when we enter the program, we can find the things we need to find. So there's a bunch of bits here. Most of the bits to the far right are FreeBSD specific, such as ps_strings. The sigcode is the signal return trampoline. When you have a signal handler and it returns, the sigcode is responsible for actually getting you back into the kernel. This is in fact a terrible idea, and we need to take it off the stack because it currently requires the stack to be executable. There's some other bits here. And then, currently shown on the far left of that orange section, is where all of the strings for the environment and all your argument vectors are stored.

Next up, there's the ELF auxiliary arguments array. This contains a bunch of information about your process, or rather about your program: where it's mapped into memory, where the start address is. Basically it allows the program, and the runtime linker if you're using a dynamic program, to find a bunch of things about the program without having to parse the entire executable, which is rather expensive. Additionally, it gets used for things like a pointer to the stack canary value, which is a randomly generated string. It's just convenient to use, so it keeps getting extended and extended as people add new features. Next up is the environment array. This is the array of pointers into this set of strings for the name-value pairs for all the things in your environment. Likewise the argv array, and then the argument count.
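Given the layout just described (argument count, then the argv pointers, a NULL, then the environment pointers, a NULL), startup code can recover argc, argv and the environment pointer with a little pointer arithmetic. The sketch below assumes a platform where long and pointers have the same size; parse_initial_stack and struct start_args are hypothetical names, not the FreeBSD implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* The sketch only makes sense where argc occupies one pointer-sized slot. */
_Static_assert(sizeof(long) == sizeof(char *),
               "sketch assumes long and pointers have the same size");

struct start_args {
    long   argc;
    char **argv;
    char **envp;
};

/*
 * `sp` points at the slot holding argc. The argv pointers follow
 * immediately, terminated by NULL, and the environment pointers
 * come right after that terminator.
 */
static struct start_args parse_initial_stack(char **sp)
{
    struct start_args a;

    assert(sp != NULL);
    memcpy(&a.argc, sp, sizeof(a.argc));  /* first slot holds argc as a long */
    a.argv = sp + 1;
    a.envp = a.argv + a.argc + 1;         /* skip argv[] and its NULL */
    return a;
}
```

The auxiliary arguments array sits after the environment array's NULL terminator, which is why finding it involves walking past the end of the environment.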
All of this looks a bit like how you would call a function on an i386 system, and that's because that's where this convention came from. One oddity here is that, while the argc argument of main is an integer, here it's always a long, because that allows the stack to remain properly aligned on 64-bit machines. So at this point, the stack pointer on return is set to point at argc, and it's also passed as the first argument to the start function, and that's used to derive the values and get the arguments required for main ready to go.

So now we're ready to jump into user space, starting in __start, which is in execution for the entire life of the program. One thing you'd notice here, if you look at it carefully, is that there are a lot of je_ functions in here, and in fact this program spends almost all of its time in malloc. jemalloc is in fact about 80,000 lines of code. Part of that is portability, but it is a beast, and for a program this trivial it's way overkill. It's a super high-performance threaded allocator, but in this case we pay a lot of overhead. I'm not going to dive a whole lot into how malloc works, but I just want to point out that an awful lot of the cycles are spent there.

So, __start. The first half of __start, which is a very simple function, sets up argc, argv, and an environment pointer. These are then passed to this function handle_argv, which processes argc and argv a little bit, sets up the environment pointer and the environment array, and also sets the program name variable, which programs use to query how they were run.

Next up, _init_tls. This is actually why this program is so big. _init_tls sets up thread-local storage. Thread-local variables are much like global variables, except that they're per thread. You might think: it's a simple Hello World, why do I have thread-local variables?
But it turns out that printf is localized, which means, for instance, that the decimal character could be different. Therefore you need to have a per-thread locale available. There are a number of other things related to malloc, but that's the main thing.

As with the rest of the program, most of the time is spent in malloc. The first thing that _init_tls does is find the ELF auxiliary arguments vector. As you might recall, that vector lives after the environment array. So this lovely bit of code finds the auxiliary arguments vector. We take the environment pointer that was previously set up, we walk to the end of the array, and then we walk off of it. This is, of course, not defined behavior in C, but nonetheless, this is what we do. And in fact, on the CHERI processor, where we have memory safety, we've had to take a different approach here, because if you have buffer overflow protections, defeating them just so that you can do silly things like walk off the end of one array to find the next one is not a good idea.

So we find this ELF auxiliary arguments array, and we use that to find the program headers. There's a field in there that says where the program headers are. We use the program headers to find the TLS section. This contains the initial values of all thread-local variables which aren't initialized to zero. We do a bit of allocation to get some space for TLS, which results in all those cycles spent calling malloc, and we copy over the initial values for the non-zero bits.

Yes, yeah, so the funny names here are in part because jemalloc uses TLS extensively, and yet we need to use malloc to allocate our TLS storage, because if we were a dynamic program, we might later load a library that also needed to allocate TLS, and at that point we might need to free the current allocation. There are other ways around it, but the current solution is in fact to just allocate some space.
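The walk past the end of the environment array, followed by a search of the auxiliary arguments for the program headers entry, can be sketched as follows. To keep the example well defined, it treats the startup block as one flat array of pointer-sized words rather than as separate C arrays. The function names are hypothetical; the tag values 0 (end of vector) and 3 (program headers) are the standard ELF AT_NULL and AT_PHDR numbers, given local names here so the sketch does not depend on any system header.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local names for the standard ELF auxiliary vector tags used below. */
enum { SK_AT_NULL = 0, SK_AT_PHDR = 3 };

/*
 * `envp` points at the first environment slot. Skip to the NULL
 * terminator, then step one word further: that is where the
 * auxiliary (type, value) pairs begin.
 */
static const uintptr_t *find_auxv(const uintptr_t *envp)
{
    assert(envp != NULL);
    while (*envp != 0)
        envp++;
    return envp + 1;
}

/*
 * Scan the (type, value) pairs until the AT_NULL terminator.
 * Returns 1 and stores the value in *out if `type` is present,
 * otherwise returns 0.
 */
static int aux_lookup(const uintptr_t *aux, uintptr_t type, uintptr_t *out)
{
    assert(aux != NULL && out != NULL);
    for (; aux[0] != SK_AT_NULL; aux += 2) {
        if (aux[0] == type) {
            *out = aux[1];
            return 1;
        }
    }
    return 0;
}
```

With the program headers located this way, startup code can find the TLS segment without opening or parsing the executable file again.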
Finally, we set the TLS pointer. One interesting thing here on MIPS is that historically that was done through a system call, which was fine when no one was using thread-local storage. It was also retrieved through a system call. It's probably okay to set it through a system call, since we don't do that very often, but retrieving it through a system call meant that every call to malloc required a system call. We fixed that recently. It was apparent that people were not paying enough attention to our MIPS platform.

So out of _init_tls, we return to __start, and this handle_static_init function is responsible for calling all the initializers at program startup. Through a variety of historical accidents and an inability to deprecate functionality, there are four different ways to call a bit of code to initialize something in your program. There's an array of function pointers called the preinit array, which presumably exists because of the init array that you see at the bottom: people realized they needed to do things earlier. There's the _init function I alluded to before, which is this terrifying thing where you smash a bunch of assembly code together, hope that it goes in the right order and doesn't have any side effects, and then call it as a function. And then there's the ctors array. In at least the variant of the GNU startup code in FreeBSD, that's actually called via _init. There's not really a good reason why it couldn't be called by some other mechanism, but it is currently the one thing that uses _init in the system.

So now we get to the point where __start is about to call main, so we're actually going to run our program. It's very exciting. We're 22 minutes into my talk and we can start to actually run some code. Or at least the code we thought we were running. So what happens in main is it calls printf, and printf calls vfprintf, which is the underlying implementation.
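Returning to the initializer mechanisms described above, the core of a static-init routine is just a loop over arrays of function pointers, with the legacy _init hook called in between. The sketch below is a simplification under several assumptions: initializers take no arguments, the ordering (preinit array, then the _init-style hook, then the init array) reflects the conventional sequence rather than anything stated in the talk, and all names are hypothetical. The ctors array is folded into the legacy hook, since the talk notes it is reached via _init.

```c
#include <assert.h>
#include <stddef.h>

typedef void (*init_fn)(void);  /* simplified: real initializers may take arguments */

/* Call every non-NULL function pointer in [start, end), in order. */
static void run_array(init_fn *start, init_fn *end)
{
    assert(start != NULL && start <= end);
    for (init_fn *fn = start; fn < end; fn++) {
        if (*fn != NULL)
            (*fn)();
    }
}

/* Tiny trace log so the call order can be inspected afterwards. */
static char trace[8];
static size_t trace_len;

static void note(char c)
{
    if (trace_len < sizeof(trace) - 1)
        trace[trace_len++] = c;
}

static void early_hook(void)  { note('p'); }  /* would live in the preinit array */
static void legacy_init(void) { note('I'); }  /* stands in for _init (and ctors)  */
static void late_hook(void)   { note('i'); }  /* would live in the init array     */

/* Run the mechanisms in sequence and return the recorded order. */
static const char *demo_init_order(void)
{
    init_fn preinit[] = { early_hook };
    init_fn init[]    = { late_hook };

    trace_len = 0;
    run_array(preinit, preinit + 1);
    legacy_init();
    run_array(init, init + 1);
    trace[trace_len] = '\0';
    return trace;
}
```

In a real program the linker, together with crtbegin and crtend, provides the start and end of each array; here they are ordinary local arrays so the sketch can be run on its own.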
In FreeBSD, no matter what you're printing to, you're always printing through a FILE struct. It's either one that prints to standard output, one that prints to an output file, or one that appends to a string. In this case, we're printing to standard out. So, as I said before, here's why we need TLS. We need to call __get_locale, so that we have our current locale and can pass it on to the localized version of printf inside. It does a bit of work. It's not very complicated, but it is something that has to be done.

So, back in vfprintf: vfprintf calls __vfprintf. Yay, abstraction. Here's the comment from the top of the file: this code is large and complicated. It's not that bad, the file's only 1,000 lines. It could be worse. It does make extensive use of macros, so the code is, in fact, a bit bigger than that. Now, as usual, all this bit is allocating a buffer. This is done the first time and not done again, but nonetheless we spend a lot of time allocating memory.

So let's get into the part that's actually interesting. Here, we're getting ready to print our string. The first thing we do in printf is, for reasons of historical laziness, always look up the decimal point separator when we get into the function. This involves taking that locale we were passed and querying it, asking: if I did happen to have to print a floating point number, how would I separate the whole part from the fractional part? After that, the function allocates a structure containing up to eight strings, which will be concatenated together for each section of the string that's being printed. The reason for eight is due to the details of floating point, and I probably should have put a slide in here on that.
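The decimal point lookup mentioned above has a direct counterpart in standard C: localeconv returns a structure whose decimal_point field holds the separator for the current locale. The sketch below uses that portable call, which is an assumption of this example and not the internal lookup that FreeBSD's printf performs; note also that localeconv consults the process-wide locale rather than the per-thread one discussed in the talk. current_decimal_point is a hypothetical helper name.

```c
#include <assert.h>
#include <locale.h>
#include <stddef.h>

/*
 * Return the string that separates the whole part of a number from
 * its fractional part in the current locale, "." in the default
 * "C" locale.
 */
static const char *current_decimal_point(void)
{
    const struct lconv *lc = localeconv();

    assert(lc != NULL && lc->decimal_point != NULL);
    return lc->decimal_point;
}
```

A program that never calls setlocale runs in the "C" locale, so the lookup returns "." even though no floating point value is ever printed.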
But it's the various sections: the bit preceding the first format you find, whether or not it has a sign, whether or not it's a hexadecimal thing, a number, a decimal point, things after the decimal point, and somehow that adds up to eight.

The way this code works is that it loops through the string looking for formats. At the beginning we find, oh, we have a %s. So it finds a percent sign, then parses the format and says, oh, simple, I've got a type, I've got a string here. That's quite straightforward. It takes that and sets up this I/O buffer entry, which just contains a pointer to this hello string. That gets accumulated and doesn't get printed just yet, because the standard behavior of printf is that I/O is buffered. Otherwise we'd spend forever making write system calls to write out a simple string.

Next up, having parsed that one, we get into the next stage, where we repeat the process. This time we parse a bit, we find a space, and then we find a percent. So we have a single-character string that we've appended to our buffer, or our vector. Then we have our number. There's an internally allocated buffer that's used to render this integer into a string, and that's buffered too, so that's added to the file structure. Next up, we find the \n. The \n, because of the buffering behavior, will trigger a write. So we'll get to that now.

We're going to look at the __sprint function a bit. __sprint walks the set of vectors and looks over each one of them to see if there's a newline character. That newline character will only exist in the first buffer in normal circumstances, so it does a search on every output buffer. This is a bit suboptimal, but it does do it that way. So it finds a newline character. Because of that, we will later call fflush, whose responsibility it is to actually call write. And at this point, we get our output. Now, as you've probably noticed, nothing is ever simple. So now we're going to return out of printf into main.
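The loop just described (copy literal text up to the next percent sign, handle one conversion, repeat, then check for a newline) can be sketched as a toy formatter. This is a deliberately tiny illustration, not FreeBSD's __vfprintf: it supports only %s, %d and %%, writes into a caller-supplied buffer instead of an I/O vector, and the names mini_format, append and needs_flush are hypothetical.

```c
#include <assert.h>
#include <stdarg.h>
#include <stddef.h>
#include <string.h>

/* Append up to `len` bytes, always leaving room for a trailing NUL. */
static size_t append(char *out, size_t cap, size_t used,
                     const char *src, size_t len)
{
    if (used + len >= cap)
        len = cap - used - 1;
    memcpy(out + used, src, len);
    return used + len;
}

/* Format into `out`; returns the number of characters stored. */
static size_t mini_format(char *out, size_t cap, const char *fmt, ...)
{
    va_list ap;
    size_t used = 0;

    assert(out != NULL && cap > 0 && fmt != NULL);
    va_start(ap, fmt);
    while (*fmt != '\0') {
        size_t run = strcspn(fmt, "%");      /* literal text before '%' */
        used = append(out, cap, used, fmt, run);
        fmt += run;
        if (*fmt == '\0')
            break;
        fmt++;                               /* skip the '%' itself */
        if (*fmt == 's') {
            const char *s = va_arg(ap, const char *);
            used = append(out, cap, used, s, strlen(s));
        } else if (*fmt == 'd') {
            /* Render the integer right to left into a scratch buffer. */
            char num[16], *p = num + sizeof(num);
            int v = va_arg(ap, int);
            unsigned int u = v < 0 ? 0u - (unsigned int)v : (unsigned int)v;
            do { *--p = (char)('0' + u % 10); u /= 10; } while (u != 0);
            if (v < 0)
                *--p = '-';
            used = append(out, cap, used, p, (size_t)(num + sizeof(num) - p));
        } else if (*fmt == '%') {
            used = append(out, cap, used, "%", 1);
        } else {
            break;                           /* unsupported: stop formatting */
        }
        fmt++;
    }
    va_end(ap);
    out[used] = '\0';
    return used;
}

/* Line-buffered streams flush when the pending output holds a newline. */
static int needs_flush(const char *buf, size_t len)
{
    return memchr(buf, '\n', len) != NULL;
}
```

For example, formatting "%s %d\n" with a string and an integer (the argument values in the test are illustrative, not taken from the slides) produces the three pieces the talk walks through: the string, a single space, and the rendered number, followed by the newline that triggers the write.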
main will return to __start, returning a value. __start now has to exit. It calls the exit function, which is not in fact a system call. It's a wrapper around a system call which does a bunch of stuff. When I was talking about initialization earlier: all of those initialization methods have a counterpart for destructors or finalizers. So for the _init function, there's a _fini function. For ctors, there's dtors, destructors to go with the constructors. And there's also a fini array of function pointers. In this program, none of them actually get called, to speak of. But one function does get called, which is this _cleanup function. Its job is to ensure that any buffered file descriptors which still have output get flushed. In our case, there aren't any of those, because we have successfully flushed the buffer. But it would take care of that if we had not included that \n. Finally, it calls _exit and the program exits.

So that was all a static binary. I'm going to give you a very brief view of how a dynamic binary differs. As you recall, in our previous program we had memory with the program, data, text, and the stack. In a dynamic binary, the program needs to load libraries in order to do its work. For that to happen, the kernel, in addition to loading the program, loads the runtime linker, and enters the runtime linker initially. The runtime linker relocates itself, because it might end up anywhere, depending on how big the program is. And then it loads and relocates libc. This is, in fact, almost all the work that this program does. So it loads libc here, and it does a bit of linking. It resolves global variables as required and does a bit of setup. And then it calls __start, which is this tiny little box on the right. And that was the whole previous talk, except now it's more complicated and does more work. So here we are.
In __start, the first thing it does is call printf, but printf is in libc. So it has to look up the address in libc. It does this lazily, because if it were to look up every function in libc and do something with it, that would be tremendously wasteful. Even just the ones that are potentially called by this program and are known to the linker would still be fairly wasteful, because in fact most functions in libc that are linked to by a small program like this are probably things related to handling errors that aren't going to occur. So what happens here is there's a table called the PLT, which contains a little stub that we use to jump through the table, and normally we'll jump to the printf function. Initially, though, we jump to a magic function in the runtime linker, whose job it is to find the printf function and then replace the stub with something that calls the right bit of code. That bit is _mips_rtld_bind. I am short on time, so that's about that. So it's mostly the same, except that every call into a libc function has this extra step where it has to go off and find the function and call it.

So that is my talk. I'd be happy to take questions at this point. And one request: I would really like to get feedback. So use the FOSDEM feedback form, send me an email, talk to me later. I'd like to know what you liked about this, and if I dwelled too much on something boring, tell me that too.

You showed us that __start calls main with three arguments, but we can declare main with zero arguments, or one, two, or three arguments. How does it work?

So in practice, it calls it with three arguments, and under ordinary calling conventions, if you don't acknowledge that you're being called with those arguments, it makes no difference. You just ignore them.
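Returning to the lazy binding described in the talk, the idea can be modelled in ordinary C with a function pointer that initially points at a resolver and is overwritten on first use. This is an analogy only: the real mechanism patches a linker-managed table from assembly stubs, and every name below (slot, lazy_stub, lookup_symbol, real_double) is made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef int (*target_fn)(int);

/* Stands in for a function that lives in a shared library. */
static int real_double(int x) { return 2 * x; }

static int resolve_count;           /* how many symbol lookups happened */

static int lazy_stub(int x);
static target_fn slot = lazy_stub;  /* plays the role of the table entry */

/* Stands in for the runtime linker's (expensive) symbol search. */
static target_fn lookup_symbol(const char *name)
{
    resolve_count++;
    return strcmp(name, "real_double") == 0 ? real_double : NULL;
}

/*
 * First call lands here: find the real function, patch the slot so
 * later calls go straight to it, then complete the original call.
 */
static int lazy_stub(int x)
{
    target_fn fn = lookup_symbol("real_double");

    assert(fn != NULL);
    slot = fn;
    return fn(x);
}

/* What the program's call site does: always jump through the slot. */
static int call_through_slot(int x)
{
    return slot(x);
}
```

Only the first call through the slot pays for the lookup; functions that are never called, such as error handling paths, are never resolved at all.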
So my question is: knowing what you know now, if we were going to start again, what would we do better? Or is this actually the best way?

So, there's a bunch of things that are coupled to the stack that probably would be better if they were stored somewhere else. That's probably the most obvious thing. I would say that the things that are coupled to the stack are definitely a historical accident with no good reason. I mean, it's there because that's how you call functions on x86, and it is in some ways useful, and it is lazy, but I wouldn't do that. Most of the rest of it is relatively sane. I mean, the MIPS bits are all insane because that's MIPS, but otherwise it's relatively sane.

Well, thank you.