So, welcome to this talk on boot time reduction. Standard license stuff, we can kind of skip that. Quick mention about myself: I'm Chris Simmonds, I'm an author and trainer. And the only important thing on this slide really is the fact that I've written a book. In fact, you'll notice Mastering Embedded Linux Programming, second edition. So, this actually means I've written the same book twice. So, that's kind of an advantage, I guess? Oh, absolutely, yes.
So, this is the traditional boot time reduction talk that we always have at Embedded Linux Conference. And so, what's it about? We all want shorter boot time. That's always a requirement. Nobody ever comes along to your desk and says, hey, your system is working too fast, make it boot more slowly. Okay, it never happens.
So, you may ask why is this? I gave my first boot time reduction presentation in 2008, and we're still talking about it. Why is that? It's really because there is a conflict between the requirements of having software that is general purpose and will probe hardware and work in a wide variety of environments, versus software that does exactly the right thing at exactly the right time. So, mostly, this is about how to customize the software in order to make it more specialized to your particular requirements.
The other aspect of it is that we're software engineers. So, if we're given the task of optimizing something, we'll keep on optimizing it until, well, forever. So, we need somebody to tell us when to stop. And that's kind of a topic of this talk. So, the issue comes down to: what is your requirement? How fast do you want it to boot? And how much effort are you prepared to put into getting to that requirement? And how big a mess are you going to leave once you've done all of that?
So, let me illustrate that last point. This is a British Reliant Robin three-wheeler car, very popular.
If you've ever watched Top Gear, you will certainly have seen some of these being destroyed. So, it had an 850cc engine. It was never a fast machine. But, hey, give it to an engineer and we can improve it. We can make it go faster with enough effort and customization. So, the issue is how much time do you want to put in? And when you've finished, how maintainable is this? How safe is this? Is this really what you want to do?
So, this is a pragmatic view of reducing boot time. And I want to emphasize the measurement and then the evaluation of the result, and then how to go about modifying the system to take that measurement into account and to improve that particular metric. And I have to give attribution here to Andrew Murray. His presentation at ELC Düsseldorf, I think it was a couple of years back, kind of inspired me to do it this way. I'm going to attempt a live demo, but if it doesn't work, I'll just skip over that bit.
And the measurements: I wanted to make it a little bit concrete. So, I used a BeagleBone as a demo system. It's not a particularly fast system, as you probably know. And I'm starting off with a pretty much vanilla Yocto Project build. I actually used the qt4e-demo-image as my example because that booted up quite nicely on my BeagleBone. The one thing I have changed from the default is I switched to using systemd, because that seems to be the more common use case these days.
And in terms of measuring the boot time, I'm going to be emphasizing the three common tools that you would use for this. So, we'll be using grabserial to measure the entire boot process. And then we'll be using bootchart to look at the user space time usage, and bootgraph to look at the kernel time usage. So, these are all nice off-the-shelf things. These just kind of work. In a more serious environment you may want to have more specialized ways of measuring these things.
Ultimately, the only real way to measure the exact time from when the power is applied to when the application starts running is to use some external hardware like an oscilloscope, and then instrument your code by putting code in to change GPIOs, and then just monitor those GPIO wiggles. Okay, okay, okay. This is too much detail. But yeah, I have some time for questions and such at the end. So, when I say oscilloscope, I'm taking that in the software definition of oscilloscope, which includes logic analyzers. That's just an oscilloscope, as far as I'm concerned, for the purposes of this presentation. Some kind of external device is all I mean here.
Okay, so, starting off then with grabserial. This is typically where you will start for measuring your boot performance. So, grabserial is written by Tim Bird, who I'm sure you all know, and who is one of the instigators of this whole conference, in fact. And so, when he's not organizing conferences, he occasionally writes some software as well, it seems. And one of those things was grabserial. A nice little Python script that simply grabs whatever is on a particular serial port and adds in a timestamp and a few other things. Really handy. And it's particularly handy because it's non-intrusive. You don't have to change the software on the target at all. You are just capturing the serial port.
So, here's an example. We need to tell it which device to capture from. We need to add the -t switch to actually add in the timestamps. And then, the -m thing, this is a string that will tell grabserial to start making measurements. So, I can set this up and then I can hit the reboot button on my BeagleBone. And the very first string that appears on the console is actually "U-Boot SPL". So, at that point, grabserial will begin its measurement.
And then the other thing is that if you're saving to a file, it makes sense to put a timeout on there so that grabserial will correctly shut itself down and save all output to the file. This is what you will see from grabserial. Yeah, it's not hugely interesting. What you really want to look at, of course, are these timestamps down the left side.
And, okay, time for a demo. So, what I want to do is just to boot the system up using a more or less default configuration, which actually is the one that's in there. So, this is my BeagleBone. I'm using one of the LCD capes. You're probably going to have a little bit of difficulty seeing this, but I don't have an easy way of putting it on the screen. So, this is running the vanilla code. If I hit the reset button here, we should see, yeah, it starts booting. There's U-Boot. There's the kernel. There's systemd. And then at some point, it runs the Qt demo, and it boots up and does the animation thing. So, that's the starting point. And it's not particularly fast.
So, analyzing that using grabserial, we find that the time spent is roughly even between U-Boot, the kernel, and init, with the kernel being slightly longer. And I measured it to be 11.76 seconds. I should say I'm actually measuring to the point at which I start the Qt demo app. I didn't have any means of registering when the image actually appears on the screen. Ideally, I should have a camera or something to detect that, but I don't.
Okay. So, we need to start improving on that. 11 seconds is way too much. So, I'm going to start off with optimizing user space, which was 3-point-something seconds. You could argue maybe I should have started with the longest delay, which is in the kernel, but no, it's better to start with the user space first. It's easier to optimize user space and then turn attention to the kernel afterwards. So, the kinds of things we can do here: I've graded them from easy to, in this case, difficult.
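The split between U-Boot, the kernel, and init can be read off the grabserial timestamps by noting when a few marker strings first appear. The following is a minimal Python sketch of that idea, not a tool from the talk. It assumes each captured line starts with `[elapsed delta]` in seconds; the marker strings, the function name `stage_durations`, and the sample log are all hypothetical and would need adjusting to a real capture.

```python
import re

# Assumed timestamped line format: "[elapsed delta] text", times in seconds.
LINE_RE = re.compile(r"^\[\s*(\d+\.\d+)\s+\d+\.\d+\]\s?(.*)$")


def stage_durations(log_lines, markers):
    """Return {stage: seconds} given ordered (stage, marker_substring) pairs.

    A stage starts at the first line containing its marker and ends where
    the next stage starts; the last marker only closes the previous stage.
    """
    starts = []
    pending = list(markers)
    for line in log_lines:
        m = LINE_RE.match(line)
        if not m or not pending:
            continue
        elapsed, text = float(m.group(1)), m.group(2)
        if pending[0][1] in text:
            starts.append((pending[0][0], elapsed))
            pending.pop(0)
    return {
        name: round(next_start - start, 6)
        for (name, start), (_, next_start) in zip(starts, starts[1:])
    }


# Hypothetical log and markers, for illustration only (not measured data).
sample_log = [
    "[0.000000 0.000000] U-Boot SPL",
    "[3.500000 0.010000] Starting kernel ...",
    "[8.000000 0.020000] systemd[1]: starting",
    "[11.000000 0.005000] app: launched",
]
markers = [
    ("u-boot", "U-Boot SPL"),
    ("kernel", "Starting kernel"),
    ("init", "systemd"),
    ("end", "app: launched"),
]
print(stage_durations(sample_log, markers))
```

Running the same function over each new capture gives a per-stage number to compare before and after every change.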
So, the easiest thing is just to reorder the sequence in which applications are run. So, we can do that in a couple of ways. Oh, okay, that's on the next slide. So, we can actually start the application earlier in the boot sequence, which is great. Other things we could do: we could tweak the compiler optimization flags, which may or may not have an advantage. Chances are, actually, they're already set to fairly optimal values. And then, if you still want to dig into it deeper, you have more complicated things. You can change the way the linker works. You can pre-link things. You can merge sections together and so on. But that gets more and more messy as you get into that stuff.
To help me with this, I'm going to introduce the second tool, which is bootchart. So, bootchart is a little daemon process called bootchartd. You arrange so that that daemon process runs instead of init. So, when you boot up, it runs the bootchart daemon. The bootchart daemon then starts the real init process and then has the opportunity to log everything that happens from that point onwards. So, we'll need to modify the kernel command line and point init at /sbin/bootchartd. And after it's booted up, that generates a file, bootchart.tgz. And then you copy that to your PC, and you can then use the bootchart command to actually analyze that and to produce a nice graph.
So, I have an example of the graph on the next slide. You're not expected to be able to read it because I had to shrink it down quite a lot to get it onto the screen. So, this is purely eye candy. But you can see what a bootchart screen looks like. And so, in this part here, we can see the various things that get started up and the relative time they get started. So, somewhere down here should be... Actually, I can't even see it just now. Somewhere down here is... Oh, here it is. Here's the Qt demo example. So, that gets started at this point here.
And then these little lines here, they give you some idea of CPU resource usage. So, there's a blue line if it's actually executing on the CPU at that point. So, you can look at this and you can analyze it and you can say, hey, we need to optimize this piece. We need to start this piece before that piece. And so on and so forth.
So, looking at that bootchart, I discovered that the application wasn't starting up until three and a half seconds after init starts. And that's way too long, given that the whole point of this system is to run my, in this case, Qt application. Really, we need to start this app sooner. So, there's a couple of things I could start doing here then. If I were using the default System V init on Yocto, then I could play around with the start sequence. As you know, with System V init, the sequence in which daemons are started is determined by the start script, which begins with an S and then a two-digit number: the lower the number, the sooner it starts. So, I could, for example, set it to be S01, and that will be one of the early things started. If I'm using systemd, you can do a similar kind of thing. You have to change the dependency in the unit. And there's an example of that at the end of this presentation.
Or you can just skip all that stuff completely, which is what I'm going to do in my example. You can write a little script which actually runs your program and then runs init, very much in the same way as the bootchart daemon I just described. When doing all of this, of course, you need to validate that at the point at which your application starts, the resources that it's going to be using are ready and available. So, you should be aware that, if you do as I'm going to do and start the application immediately after the kernel has booted, most likely at this point the root file system is read-only.
Other file systems will not be mounted at this point. Most likely the network won't be configured and so on. So I would have to modify the behavior of my app to handle those cases. Nevertheless, this is the most important thing in the world, so I need to have it running early.
So here's a little script that will do just that. So we can see that, well, first of all, there's an echo at the start there. That's mostly so that I have a string that I can recognize in grabserial, so that I can see in my grabserial log exactly when it started. Then I launch the Qt demo with appropriate strings, and then I put out another little string which I can log to see that it has been launched. And then I discovered that even when I did that, it was still a little bit slow to start up, because actually when you launch systemd, that takes up a lot of CPU cycles. So I hacked that just by putting a sleep in there to give the Qt demo a bit of a chance to get going before systemd jumps in and gobbles everything up.
And so, first pass optimization then: basically, by using this technique, we've cut out the majority of the user space, or the init, time delay. So now we've got it down to eight and a bit seconds, and user space is no longer the issue. So it's time to look at the kernel.
So, hacking the kernel. There are a number of things you can do here, again graded from easy to not so easy. The simplest thing to do, and it gains you quite a lot, is just to reduce the number of messages you get on the serial console. Because every time you have a printk in the kernel that goes to the serial console, it has to then wait for that string to be sent at, in this case, 115 kbaud. So you just need to add quiet to the kernel command line. And that typically is going to save you something of the order of a second. So that's nice and easy to do. What else can you do?
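The "order of a second" figure for console output is easy to sanity-check with a little arithmetic. The sketch below assumes the usual 115200 baud rate with 8N1 framing (10 bits on the wire per character); the 12,000-character message volume is a made-up number for illustration, not a measurement from this system.

```python
def console_seconds(num_chars, baud=115200, bits_per_char=10):
    """Time to push num_chars through a UART.

    bits_per_char=10 assumes 8N1 framing: 1 start + 8 data + 1 stop bit.
    """
    return num_chars * bits_per_char / baud


# Hypothetical volume of kernel messages: 12,000 characters.
print(f"{console_seconds(12_000):.2f} s spent waiting on the serial port")
```

At that rate the port moves 11,520 characters per second, so a boot log of roughly that size costs about one second if every message has to go out over the wire.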
Well, typically, especially in the case of the BeagleBone, the kernel has been configured with a whole bunch of things I'm not actually going to be using. So I can start slimming that down. I can do a similar kind of thing to the device tree. So if there is hardware defined in the device tree that I'm not actually using, I can just go through the device tree and mark it as being disabled. That then tells the kernel, the device driver, not to probe that piece of hardware. And that can save me some milliseconds or tens of milliseconds. If after you've done all that you find you are still not hitting your benchmark, then you can consider doing more complicated things. Ultimately, you need to look more deeply into why the kernel is taking time, maybe debug a buggy device driver, maybe do some other stuff. I'm not going to go into any more detail on that.
At this point, we need to look at another tool, the third of the three tools I mentioned. And this is bootgraph. So bootgraph is basically a kernel feature, and it allows us to graph what happens in the kernel during initialization. So to do this, you need to turn a couple of configuration options on in the kernel. And you need to tweak the kernel command line yet again: you need to set initcall_debug. And that will then cause a whole bunch of stuff to be printed out, which you can then capture. And I'm using dmesg here and redirecting to boot.log. So that gives you the init log. And then you use a fancy Perl script which is part of the kernel source code, bootgraph.pl. And that takes that log and generates a nice little graph for you.
So I did that on my BeagleBone. And there's this big green section there, which I wasn't expecting. And it's marked raid6_select_algo. And that puzzled me when I first saw this. I thought, RAID? BeagleBone? How does this work?
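The log that `initcall_debug` produces can also be ranked directly as text, which is handy when the graph is too dense to read. This is a rough Python sketch, assuming the kernel prints lines of the form `initcall NAME+0x.../0x... returned N after M usecs`; the helper name `slowest_initcalls` and the sample timings are hypothetical and are not the numbers from the BeagleBone run.

```python
import re

# Assumed initcall_debug line: "initcall NAME+0x0/0x10 returned 0 after 123 usecs"
INITCALL_RE = re.compile(
    r"initcall\s+([^\s+]+)\S*\s+returned\s+-?\d+\s+after\s+(\d+)\s+usecs"
)


def slowest_initcalls(dmesg_lines, top=5):
    """Return [(function, milliseconds)] for the slowest initcalls."""
    timings = []
    for line in dmesg_lines:
        m = INITCALL_RE.search(line)
        if m:
            timings.append((m.group(1), int(m.group(2)) / 1000.0))
    return sorted(timings, key=lambda t: t[1], reverse=True)[:top]


# Hypothetical excerpt of a captured boot.log, for illustration only.
sample = [
    "[    0.512345] calling  raid6_select_algo+0x0/0x2d0 @ 1",
    "[    2.512345] initcall raid6_select_algo+0x0/0x2d0 returned 0 after 2000000 usecs",
    "[    2.600000] initcall example_wifi_init+0x0/0x40 returned 0 after 45000 usecs",
    "[    2.610000] initcall example_fast_init+0x0/0x20 returned 0 after 12 usecs",
]
for name, ms in slowest_initcalls(sample, top=2):
    print(f"{ms:10.1f} ms  {name}")
```

A single entry that dwarfs everything else in this list is the textual equivalent of the unexpected big block in the graph.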
So, with a little bit of exploration of the kernel, I discovered that RAID 6 is a dependency of the Btrfs file system. So by default, in fact it even might be necessary, I'm not quite sure, if you're using Btrfs, then it has to enable RAID 6. Okay, don't ask me why, but that's just the way it is. But the good news is, I'm just using ext4. I'm not using Btrfs. So I can just whip all this out. So in this particular case, just by removing Btrfs from the kernel configuration, I save myself two seconds. So that's quite a big win.
So I removed Btrfs and added quiet to the kernel command line. And I also experimented with reducing the kernel size. So I can also take out of the configuration stuff that I'm not using. So for example, the Wi-Fi driver is enabled by default, but I'm not using Wi-Fi in this case. So get rid of it. With that and a few other things, I managed to cut the kernel size down. That's the zImage, the compressed kernel. I cut it down from seven, no, 5.6 megs to 3.2 megs. So not quite half, but nearly. Putting that all together, I managed to save three seconds from my boot time. So we're doing quite well.
So I've shrunk the user space time. I've shrunk down the kernel's boot time. It's now time to look at the U-Boot boot time. So, the kinds of things we can do here. Well, there's one very obvious thing we can do, which is to remove the boot delay, which I'll come to on the next slide, and which is used by default, at least in development when you're working with U-Boot. The other thing I can do is to look at the default boot scripts, which in the case of the BeagleBone are quite complex. I could then do the same with U-Boot as I did with the kernel: I could start compiling out U-Boot functionality that I don't need, and prevent it from probing devices that are never going to be used. And I could do a whole bunch of other optimizations by hacking the code if I wanted.
And there is one thing which I've marked as hard here, although it's not that hard. I could switch to using Falcon mode, which I'll describe in a couple of slides' time.
Okay, so boot delay. So this is a bit of a cheat, because this is so simple it's hardly worth mentioning. But by default, you will see that when you boot up U-Boot, you get this line that says something like "press space to abort autoboot in two seconds". So that automatically adds two seconds to your boot time. So at the touch of a mouse, or touch of a keyboard, I can get rid of that and set bootdelay to zero, and I save two seconds. Fantastic. The only slight snag with this is it makes it difficult to get back to the U-Boot prompt. It is still possible, because it doesn't really reduce the time to zero seconds. There's a bit of time: if you hit the key after U-Boot has initialized the UART, but before it actually checks for input on the UART, you have like half a second or something, maybe a little bit less, a few hundred milliseconds. If you're really quick, you can hit the space bar at that point and you can still get back to the U-Boot prompt. It may take you three or four attempts, but you will do it.
The boot scripts. If you look at the boot script for the BeagleBone, every time I look at it, it gets bigger. So it's about a hundred lines of fairly dense U-Boot script. And it basically probes every possible device that you could find boot files on. So it could be MMC, it could be USB, it could be flash. It spends a whole lot of time probing all these things, and then it finally decides what to do and then does it. In practice, you only really need like two lines. You need to set a boot command and you need to set the boot args. Nothing else really matters. So again, we can make this more specialized. We can just put in our own customized boot script, which does just the things that we want to do on our particular platform.
And that can save you 100 milliseconds, something of that sort, maybe a couple of milliseconds.
On the tricky side, I mentioned Falcon mode. And this is quite neat. It requires a little bit of jiggery-pokery, which again I describe in more detail at the end of the presentation in the extras section. But it's kind of interesting. So to understand how Falcon mode helps, you have to know a tiny bit about how SoCs bootstrap. So typically there are three stages. When you power on, you start running the ROM code on chip. And the ROM code will load the first stage boot loader from some storage medium, usually either MMC or flash memory or SPI NOR or something. It loads the first stage boot loader into some on-chip static RAM. At this point, we don't actually have working DRAM, so we can't put it into DRAM. It has to fit into the SRAM. SRAM on most chips is fairly limited. In the case of the BeagleBone, it's 128K, less a few bits, which leaves you, I think, about 120K of useful memory. And so typically there isn't enough space, certainly in 128K, to have a full U-Boot binary. So hence we have to go from the second stage loader, the SPL, to the tertiary program loader, the TPL, which is U-Boot itself. And then you run the U-Boot script, and then U-Boot loads the kernel, which is kind of a fourth stage. So that's what normally happens.
But we can, in some cases at least, miss out the TPL stage and go straight from SPL to kernel. And this is Falcon mode. So essentially, the way Falcon mode works its magic is by taking the functionality of U-Boot, shrinking it down and putting it into the SPL, which, remember, has to be less than 128K. Doing this means you have to drop most of the user interface stuff. In fact, all of the user interface stuff. You have to do some special things to create a fixed set of environment variables in a particular area and put that into the U-Boot environment and so on.
So it's a bunch of steps. It takes like half an hour or something to do it the first time.
So now, taking the measurement: the measurement on the slide shows the effect of simplifying the boot script and just removing the two-second delay. This does not include the Falcon mode stuff. And so doing this, I get it down to 2.8 seconds. If I were to add in Falcon mode, that would get it down to nearer two seconds, which is kind of my internal target. I didn't actually specify this, but my aim when doing this was to get the boot down to two seconds or thereabouts. So I haven't quite achieved that, but I very nearly have. The other thing I should say is that all these measurements, really because I was thinking of doing this as a demo, I actually did on an SD card. If I were to put this into the on-board eMMC, that also would save me half a second, I'm guessing, because eMMC is faster than SD cards.
Okay, so just for the fun of it then, let's have a quick demo, see how that looks. And for this, I need to do a bit of a flip. So for the purposes of the demo, like I said, I kept everything on SD cards, so I can, in theory at least, flip the right one in, switch to minicom, yeah. It's definitely better. There's still more things that could be done here. So let me just show that live. Again, I'm not quite sure if you can see this at the back. If you've brought opera glasses with you, you may need them. But if I hit the reset button now, the display goes blank, the penguin appears, and Qt appears. Yeah, it's not snappy, I'll admit. But it's definitely an improvement on what we had to start with.
Okay, so that's kind of the end of the stuff I want to mention at this point. The key thing I want to keep on pushing home is the idea that when you're doing this kind of thing, do things in a reproducible kind of way.
So begin with the low-hanging fruit, go through, see how far that gets you towards your target, and iterate until you get close enough. And to keep all this happening: it quite often happens that a particular use case is optimized for, and yeah, we reach the target, and then people kind of forget about it. So it becomes important to include boot time as a criterion in your continuous integration, and also as a criterion when you are designing and adding in new features. Somebody in the room should be saying, ah, but that feature is going to take five seconds to initialize, how are we going to fix that? Okay, so it needs to be part of the process. You need to keep this in mind.
So yeah, that's basically the end of that. I have five minutes or so for questions, and then I'll quickly go through the other slides. So first of all then, who has a question? Okay, you're closest, so you get to ask the first question.
Thanks for that, Chris, that's very interesting. One thought came to mind. Do you have any comment to make on whether it's useful to look at loading uncompressed kernel images if you've got a decent throughput from your storage medium? Can that help?
Ah yeah, okay, so the question is then, should you compress the kernel or not? I didn't put that on the slides because, well, I didn't have time really. So the idea is that normally we have a compressed kernel stored in a storage medium, in flash memory, and then you load that into memory and decompress it. So there are two stages there. One is loading the compressed image, and the second stage is uncompressing it, and both of those take some time. So it depends on the speed of your processor, how fast it can decompress, versus the speed of the storage medium. If you have a fast processor and slow storage media, then a compressed kernel is the best answer. If you have fast media but a slow processor, then uncompressed is the best answer.
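The trade-off in that answer can be written down as a two-term cost: time to read the stored image plus, for a compressed image, time to decompress it. The sketch below is a simplified model under that assumption, with hypothetical function names; all sizes and throughputs are invented for illustration and would have to be replaced by figures measured on the actual board.

```python
def load_time(stored_mb, read_mb_per_s, uncompressed_mb=0.0,
              decompress_mb_per_s=None):
    """Seconds to get a runnable kernel into RAM.

    stored_mb is the size read from storage. When decompress_mb_per_s is
    given, the image is treated as compressed and the time to produce
    uncompressed_mb of output is added.
    """
    seconds = stored_mb / read_mb_per_s
    if decompress_mb_per_s is not None:
        seconds += uncompressed_mb / decompress_mb_per_s
    return seconds


def better_choice(compressed_mb, uncompressed_mb, read_mb_per_s,
                  decompress_mb_per_s):
    """Return which image format loads faster under this simple model."""
    compressed = load_time(compressed_mb, read_mb_per_s,
                           uncompressed_mb, decompress_mb_per_s)
    uncompressed = load_time(uncompressed_mb, read_mb_per_s)
    return "compressed" if compressed < uncompressed else "uncompressed"


# Hypothetical figures: a 4 MB compressed image that expands to 8 MB.
print(better_choice(4, 8, read_mb_per_s=2, decompress_mb_per_s=40))
print(better_choice(4, 8, read_mb_per_s=40, decompress_mb_per_s=4))
```

The first call models slow storage with a fast processor and the second models fast storage with a slow processor, which are the two cases described in the answer.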
So the simple answer is it depends on the ratio of the speed of your storage medium to the speed of your processor.
Okay, a person there at the back in the white. Can you sort of negotiate between yourselves to exchange the microphone? We need the microphone so that the question gets recorded on the video, otherwise people watching this at home won't know what we're talking about.
So first, many thanks for your talk. I just want to add something on userland boot time optimization. So actually for our use case, for instance, it turned out that we lost a lot of time, like two seconds, on bit shifting. So I just want to add some tools to make the thing round. So first, gprof may come to mind, but be aware that when using gprof, it has a resolution of only 10 milliseconds. So then you may go for OProfile, but be aware that you need to enable high resolution timers in the kernel. Well, everything is contained in the LTTng suite, I think provided by Eclipse. So it's a bunch of tools. And it really gives you a nice 80/20 optimization guide: which functions are called most of the time, and also the time which is spent in the functions. So I just wanted to add it because it really saved us some seconds, because it's not only about shifting priorities of the processes, but also about optimizing, and also leaving the optimization to the compiler. So compile with -O2; this was also some other seconds.
Yeah, sure.
So just to give some pinpoints for this, because it really saved us a lot of boot time.
Cool, yeah. So profiling your application is definitely a great thing to do. The one thing I would say is never use gprof. There are always better tools to use. Personally, I would either use perf or use OProfile, or if I really want to delve in, LTTng is great, but never ever use gprof. It is broken in many ways. Next. Could you bring the microphone down the front, please?
I was going to say something along the way.
How about that? Go on then.
Just in terms of U-Boot, I wanted to put in a plug for the tracing feature.
Okay.
Trying to really get your boot time down very, very, very low, and obviously turning off the console.
Yeah, yeah. And similarly to putting quiet on the kernel command line, there is an equivalent U-Boot command which will make U-Boot much less chatty and therefore take less time. Okay, yes, sir.
So I've got a couple of comments. Firstly, if you don't need drivers at boot time, compile them as modules, because then you can load them once your application needs them.
Oh, yeah, yeah. Again, that's another thing that I could have put in, but I didn't want to overemphasize these things. But yeah, one thing you can do is basically delayed execution of certain device drivers. If you need, for example, Wi-Fi but not at boot time, then make the Wi-Fi drivers modules, and then those modules will get loaded up after your application has started. So you're kind of doing a bit of time shifting there.
The other thing, I don't know if Linux has got it yet, but being able to thread the later initcalls: if you have a number of processor cores, you can parallelize your init inside the kernel, which can save you a bit of time if you have a lot of devices.
Indeed you can. Again, as you delve more and more into these things, you get to more and more customizations which give you a maintenance burden, but yeah, you can parallelize the init in that way in particular cases. Okay, anyone else? We have exactly one minute left. No? Okay, well, thank you all very much and happy lunchtime.