I hope you had a satisfying lunch and you're ready for another fine afternoon. I would like to introduce my colleague, who will talk about fighting I/O as part of cold startup. So this is a big focus, you know: Firefox 4, performance and startup.

Can you hear me? Yes. Before I start, I would like to know how many people in the audience are not Mozilla developers? Awesome. How many of you are developers? Awesome. So, I'll be talking to you about I/O, which is an unexpected problem on most systems nowadays. I'll explain why we need to address I/O somehow and how it has an impact on software. Then I will show you how to actually see what's happening, hopefully, and what can be done against it.

20 years ago, I had my first real PC, one with a memory management unit, and it was fast at the time. But, well, PCs nowadays are really, really, really faster. Processor speed back then was like tens of millions of instructions per second. Now you can count in tens of thousands of millions of instructions per second. Memory capacity: 20 years ago, you could count in megabytes. Now you count in gigabytes. Memory bandwidth was maybe one gigabit per second. Now it's more like 200, or 300, 400, 500 gigabits per second. Hard drive capacity has exploded. My hard drive back then was 200 megabytes. Now you see terabytes, and throughput is good as well, because back then you had one megabyte per second, and now you have hundreds. Hard drive access time also improved. 20 years ago, you had a drive that had an access time of 20 milliseconds. Now with SSDs, we have 0.1 milliseconds. But that's SSDs. In practice, this is not true. Most people don't have an SSD. So we're still stuck with very, very slow access times, in the order of 5 to 10 milliseconds.

So what's the problem? That's the problem. On the horizontal axis you have time, and this is Firefox startup on Linux. Vertically you have the disk: it's the offset on the disk.
And what you can see is that, well, you read stuff, and you go elsewhere, and you go back, and you're going back and forth on the disk. And with really slow access times, it means that every time you have a vertical bar, it's really slow. And even if you zoom in on the other accesses, it's not really good either. And even these, or these, are hurting a lot.

I did a little experiment with the data I gathered just before. I took both the I/O as it was experienced in reality, and I also reordered it to see the difference. On the slowish disk, which has only 30 megabytes per second of throughput, the normal I/O takes around 2.7 seconds. With the ordered I/O it takes half the time. On a faster disk, around 85 megabytes per second, the ordered I/O takes three times less. So it's really critical to avoid, however possible, any seek on the disk.

Another point is that we don't really have a problem with warm startup. Anything that is CPU bound is not a problem. Why? Because you see, that's the Firefox startup. On a Core 2 Duo system, a cold startup takes almost four seconds. A warm startup is much faster, well under one second. On an i7 system, the cold startup is not really much different, but the warm startup is twice as fast. It's not really a big difference, though, because it's under a second. This is wall-clock time. So that's the time it takes to start on a Core 2 Duo with a slow disk.

So you have to know what the problems with I/O are. And the problem is that at the moment we have a lack of tools. There are ways to track some kinds of I/O, but it's really hard to get an actual grasp on what's happening. Linux has some tools that give you some idea of what's happening, but it's really cumbersome, and I will show you some of the tools. And you can't have widespread knowledge of what's happening. And getting relevant startup times is hard.
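The reordering experiment described above can be approximated with a toy model. The sketch below is not the tool used in the talk. It assumes a fixed 8 ms seek (a value picked from the 5 to 10 ms range quoted earlier), the 30 MB/s throughput of the slower disk, and a made-up access pattern that alternates between two distant regions. The names `replay_ms`, `recorded` and `ordered` are hypothetical.

```python
"""Toy model of seek cost: the same reads, in recorded order vs. sorted by offset."""

SEEK_MS = 8.0            # assumed access time, within the 5-10 ms range quoted in the talk
THROUGHPUT_MB_S = 30.0   # assumed sequential throughput (the "slowish" disk)


def replay_ms(reads, seek_ms=SEEK_MS, throughput_mb_s=THROUGHPUT_MB_S):
    """Estimate the time in ms to replay `reads`, a list of (offset, length) in bytes.

    A read that does not start where the previous one ended pays one seek;
    every read also pays its transfer time.
    """
    total = 0.0
    position = None  # byte offset just past the previous read
    for offset, length in reads:
        if offset != position:
            total += seek_ms
        total += length / (throughput_mb_s * 1_000_000) * 1000.0
        position = offset + length
    return total


# Hypothetical pattern: 200 reads of 128 KiB alternating between two distant
# regions of the disk, so every read in recorded order needs a seek.
CHUNK = 128 * 1024
FAR_AWAY = 10_000_000_000
recorded = []
for i in range(100):
    recorded.append((i * CHUNK, CHUNK))             # region A
    recorded.append((FAR_AWAY + i * CHUNK, CHUNK))  # region B

# Same reads, sorted by offset: only two seeks remain.
ordered = sorted(recorded)

if __name__ == "__main__":
    print(f"recorded order: {replay_ms(recorded):.0f} ms")
    print(f"ordered:        {replay_ms(ordered):.0f} ms")
```

The model ignores caching, readahead and rotational effects, so its absolute numbers mean little. It only illustrates why the same bytes cost far less when they are read in offset order.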
The system I used was a virtual machine, which I rebooted 50 times to get a time within some kind of bounds. This is not something you want to do every day. I wrote some automation tools to do that, but it's really cumbersome.

And tracking I/O is also not as simple as tracking reads and writes. You will find a lot of scripts on the net doing exactly that, and it's wrong. Very wrong. The reason is really simple. For example, say you open a file, read from it, close the file, open the file again, read again, and close it again. What do you think will happen? You have one disk access, not two, only one, because, well, the system is quite intelligent. It's caching. You hope it does.

Another interesting point I discovered is that CPU scaling actually has an influence, because of I/O. Nowadays, CPUs are not running at full speed all the time. And what's happening is that, if you go back to the graph, whenever that kind of thing happens, the CPU is waiting for the disk, which means the CPU is sleeping at its slowest speed. When you need to go back to full CPU speed, there's a latency, because, well, the CPU can't really switch from a slower speed to a faster speed in an instant. So what happens is that if you somehow find a way to have your CPU maxed out during the startup of Firefox, it's faster by 10 to 20%, which is quite impressive and unexpected.

To gather the data I showed you before, the graphs of the disk, I used ftrace, which is a kernel tracing facility in Linux. It has the advantage of not needing anything other than a kernel compiled with the right flags, and distros usually come with all you need for that. All you need is to mount debugfs, which might or might not be already mounted, depending on the distro, and do some fiddling with... So here we just enable tracing on the disk, where everything will be traced. We say that we want block tracing. Here we say that we want the block I/O complete events. We enable, and then here we get the trace.
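The slide with the actual commands is not part of this transcript, so the steps just described can only be sketched under assumptions. The file names below (`/sys/block/<device>/trace/enable`, `current_tracer`, `events/block/.../enable`, `tracing_on`) come from the kernel's tracing interface as commonly documented, and the event name `block_rq_complete` is a guess at what "block I/O complete events" refers to. The helper names `block_trace_steps` and `apply_steps` are hypothetical. Applying the steps needs root and a mounted debugfs, so the example only prints them.

```python
"""Sketch of the ftrace steps described in the talk, as a list of file writes."""
import os


def block_trace_steps(device="sda",
                      tracing_dir="/sys/kernel/debug/tracing",
                      event="block_rq_complete"):
    """Return the (path, value) writes, in order, that would enable block tracing.

    All paths and the event name are assumptions; check them on your kernel.
    """
    return [
        # Enable tracing on the disk itself.
        (f"/sys/block/{device}/trace/enable", "1"),
        # Say that we want block tracing.
        (os.path.join(tracing_dir, "current_tracer"), "blk"),
        # Ask for the block I/O completion events.
        (os.path.join(tracing_dir, "events", "block", event, "enable"), "1"),
        # Turn tracing on.
        (os.path.join(tracing_dir, "tracing_on"), "1"),
    ]


def apply_steps(steps):
    """Perform the writes. On a real system this needs root and a mounted debugfs."""
    for path, value in steps:
        with open(path, "w") as f:
            f.write(value)


if __name__ == "__main__":
    # Dry run: print shell equivalents instead of touching the system.
    for path, value in block_trace_steps():
        print(f"echo {value} > {path}")
    print("cat /sys/kernel/debug/tracing/trace")
```

Keeping the steps as data makes it easy to review them, or to adapt the device and event names, before running anything as root.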
The output is not really readable. You get a lot of output, and you don't really know what it means exactly, because there's not much documentation about the block I/O tracing facility. So you have to guess. I used what I could. I just took what looked like what was happening. There are some files in the events directory. You have many more events than that, and you have a format file in each of these event directories, which is supposed to contain the format used by the output, and it doesn't really match.

Another tool that I used, and I will show you graphs just after that, is SystemTap. SystemTap is a kernel tracing Swiss Army knife. You can do anything with it. You can insert code in the kernel while it's doing its job. You can do almost anything with it. You can crash the system with it if you want. I wouldn't advise it. The big downside is that you really need to know the kernel internals to actually do something. So the graphs I did after that required a lot of fiddling in the kernel code. Obviously it's hard to get the right data out of it, because it also depends on how the kernel is optimized, and because the kernel source is not exactly designed for that. There are many calls from static functions to other static functions, and sometimes they are inlined, sometimes not. So you can put probes on some, and you can't put probes on some others. It's really a pain. The URL I wrote there is a blog post I posted a few months ago, maybe a month ago, about the SystemTap setup I used, and also the script to get this.

There's more than the SystemTap output here. But here is a summary of all the I/O happening on the libxul file, which is the main file containing most of the code in Firefox, during startup. The horizontal stripes correspond to some big sections in the file. So the pinkish one is relocations; I'll explain those later. That's code, that's read-only data, that's relocated read-only data, and this is data.
Here, it's something you have to endure on 64-bit systems. It's .eh_frame, which is used to unwind exceptions, which are actually not used in Firefox, but it's in the library.

So what's happening? The process starts, then you have some reads here, at the beginning and at the end. Why? Well, that's kind of an unfortunate state of the binaries: to know what to read in the binary, you have to read at the beginning of it and at the end of it, which is pretty weird, but it's the way it is. It could be changed by way of the linker, but at the moment you can't do anything about it. Actually, for most of what is there, you cannot do much about it, because main is run around here. So after initializing libxul, here it does nothing in the file. That's because it's doing things on other libraries, similar things.

And then you see reads here and here, and what's happening is that it's doing relocations. Relocations are necessary when you have randomized address spaces, because the library is not necessarily loaded at the same address in the address space. So this means that you have to change a bunch of offsets to make it work with the address at which the code is loaded. Fortunately, you don't have to do that in the code, because the compiler and the linker do a great job at it, but you read a lot of data, and you also update a lot of data, and you do it a bit at a time, so you go back and forth. Here it's forward.

That's another thing. Those are static initializers. So I think it's the next slide. Yeah, these are static initializers. These are only examples. For each of those, the compiler will actually create a function. This function will be in the corresponding object file, and the result is that when you link a lot of object files, each of these functions is in each of these little object files.
So each static initializer is called for each object file, and it's done backwards, because the GCC developers decided it was going to be backwards. The reason is that there are constructors and there are destructors. So you have static initializers on the one hand and destructors on the other. To be safe across object file boundaries, they have to be run in reverse order from each other. And it was decided, unfortunately, that the ones going backwards are the static initializers.

The main problem with that is that it's really easy to get a static initializer without knowing, because who would know from that, for example? You could guess here, maybe, but really it's something stupid from the compiler, because this is a constant. Doing this will actually create a function that just sets this value. Not a function call or anything, just this value in this spot. A function just for that. Here it actually calls something, so you have to know that it will create a static initializer.

Icegrind is another tool. It's actually two tools. One was developed by Taras Glek and one was developed by myself. I took Taras's one and changed it to do what I wanted it to do. They are both Valgrind plugins. My version tracks all the bytes, the single bytes, that are accessed during the execution of a process. Only once. So at the end of the run it will tell you what byte ranges have been touched. With the one from Taras, you give it a list of sections, whatever you want. What we use it for is taking, for example, the output of ld, which gives you a map of all your object files and functions. We list all the functions, and that way we can know which functions are called, and in what order, but only once.

So what can be seen with my version of icegrind, the byte-by-byte one, is that the kernel actually reads a lot of data. For example, .text: the text section is the code section. The red bar is the size of the section in the file.
The green one is the readahead, what the kernel actually reads, which is a lot. Most sections are read almost entirely. And the blue bars are what's actually needed. And you see the code? Almost nothing is needed. But you still read all that, which is a waste of time.

about:startup is something new that's coming in Firefox 4. The blog post actually has the extension. It's a small extension, quite a stupid one actually, only displaying the three values. It's tracking when main is called. It's not exactly main, but it's the main function in libxul. Then when the session is restored, which is when all the tabs have been initialized, but not necessarily loaded from the net. And when the first paint occurs, whatever that means. We also gather data, actual data from users, through the addons.mozilla.org pings that are sent when you want new add-ons, or when it wants to know if they are up to date. And a real extension, with graphs and stuff like that, is a work in progress in some bug somewhere.

So we have a lot of unexpected enemies. The file systems, for example: during the course of all this testing, I copied Firefox a lot, and it turns out that files were mixed. For example, with the libxul file, which is 20-something megabytes, you had a bunch of it, a bunch of another file, another bunch of libxul, another bunch of another file, and so on and so forth. The toolchain doesn't really help. As we saw, the compiler doesn't help here. The static and dynamic linkers don't help. And CPU scaling doesn't help either.

So what can you do about it? We can, for example, do something about that. We can try to do something about that. We can also try to do something about that, but that's something the linker should do. And we should definitely try to do something about all these, which are basically most of the time due to calls to system libraries. So what we have to do is, well, avoid fragmentation. We had something for Firefox 4.
The SQLite files for Places, for example, were very, very fragmented. We improved that by allocating in bigger chunks. Reduce the number of files: that was done in Firefox 4. Before, we had a lot of different files in components and chrome. They are all grouped into one file now, the Omnijar. Improve the binary layout: we tried some things about that. Reordering the object files, for example, is the easiest way to do that without needing a new toolchain. Reduce the size: anything you can do to reduce the size will obviously save some I/O. And avoid going back and forth between files, because, well, that's seeking.

So, for example, and it's actually sad, this graph is kind of sad: this is the 3.6 startup time, and it's actually faster to start than 4. This is 4 beta 8; my data gathering takes a lot of time at home. Without Omnijar... we see that Omnijar actually gives a good improvement here, but we are still slower than 3.6. That's unfortunate. But we also have extension packing now. Instead of unpacking all the extensions when we install them, we keep them packed when we can. These are profiles I used with the six biggest extensions by number of users. Well, only those that work on 4.0 as well, because not all of them do. So version 4 is actually faster with extensions packed.

And something stupid I did is trying to reverse the static initializers, those that go backwards. I just hacked the file so that the pointers would go forward. And the result is actually surprising. I did expect some improvement, but not that much: in the order of 10 to 20%, just by going forward instead of backwards.

These are the various changes I tried. Unfortunately, except for one, they won't make it into 4.0. So here we have the normal build. Here we reduced the static initializers, some of them, but not all. And here we reordered the binaries, packed relocations, and reduced the static initializers. So these are file sizes.
So it's the libxul size. It's quite reduced. And the result in startup time is actually disappointing for fewer static initializers, but it's actually good when you put all of those together. You have two blog posts with more data about that. The reason why fewer static initializers are actually slower is that when you stop reading some of the stuff here, you have to read it there. Well, sometimes you will probably have to read it afterwards, because here it's cached from the first time, and here you won't see anything for these offsets, because obviously it's already in the cache. But if you read less here, you have to read it there, which sometimes kills. I'll skip this one.

So what's next? We also try to avoid fsync, which is also a killer, because basically it's telling the system to flush anything it has in cache that isn't written yet. We also try to separate hot and cold functions at startup. What will be tried is to actually separate them into two libraries: one for the hot functions, those that are actually used at startup, and one for those that are not. Removing dead code, because we have some dead code, and it's taking space, and it might actually be read by the kernel by mistake. And preloading. This is a small experiment I did. I just preloaded all the library files from the Firefox directory. I just did a cat on the whole files, and it's actually faster to start.

And the faster times include the amount of time it took to cat them to /dev/null? Yes, they do. So this is a three-line change to our startup script? Exactly. And the improvement is what percent? A lot.

I can sense a lot of questions out there, but we're only going to take two, because of time, basically. Is the Firefox and Mozilla organization supporting you in this, and how can other open source projects benefit from it? Anything that can help, really help, I'm actually... Can you repeat the question someone asked in the back? He was asking what people can do to help, so yes.
That's not what he asked. He asked, are you getting paid to do this? And how can other projects benefit from that? Ah, okay. So, yes, I'm being paid for this. I'm actually contracted by Mozilla. And how do other people... Well, you can start to use the code we wrote, like icegrind or something like that. You can contact me. There's an address; I can probably give a hand. Feedback from your experience will be helpful, because you probably have problems as well. If you have scripts, better scripts for SystemTap or whatever, that'd be helpful. For the moment, it's... it's a work in progress, so if you want to give a hand, you can.

Any more questions over here? For example, the static initializer problem is kind of solved in GCC 4.6. It's not really solved, in the sense that it still generates functions for stupid things. But it groups them, which saves the day. There are other things happening in GCC, and actually Mozilla is trying to get people to do some things on the GCC side. Thank you.