All right, time to start, sorry. So I realized that I just have 35 minutes for this talk. It's a bit short, but I'm supposed to boot fast anyway, so let's do this efficiently; I hope you have a fast brain as well, at least at this time of the day. I'm Michael Opdenacker, you can read the rest on the slide.

The goal of this presentation is to give you measurements about boot time techniques, so that you know in advance whether a given technique is worth exploring or not. So I'm going to go quickly over the technical explanations, how things work and so on, and focus more on the savings I achieved on this kind of system. It's basically a BeagleBone Black with an LCD cape, and a standard USB webcam connected to it. It boots from a standard Kingston SD card, a class 4, a relatively normal one, not especially fast. On the software side, it runs ffmpeg just to show the video stream from the USB camera. The initial boot time for the system was 9.45 seconds, including the autoboot timeout in U-Boot, so it's easy to optimize, of course; I'm not being entirely fair here.

So again, the goal here is to give you results you can use. In particular, for U-Boot's Falcon mode, I had trouble finding details; it seems that not so many people are using it yet for boot time optimization, and it's really worth it, actually. So I'll share some technical details and some results with you.

There are great presentations about boot time reduction that have already been archived from previous ELC conferences, and I'll give links to them. Here I just want to give you simple principles. The idea is to first focus on optimizing the things that won't hurt your ability to further reduce and measure in the next steps of the boot process. So you usually start from the end, user space, toolchain, things like that, and progressively move to the early boot steps: the init scripts, then the kernel and its compression, and eventually the bootloader. Once you optimize the bootloader, you won't touch the kernel again, so that's fine. If you do it the opposite way, you lock yourself out: you make the early stages much faster, but you lose the ability to optimize the kernel, because you have no tracing and no debugging anymore.

Another technique I use is keeping relatively slow storage for as long as I can. It's like a magnifying glass: it makes the times longer, so deltas in behavior are easier to observe. In the same spirit, I don't use the fastest kernel compression at first; I keep the default one, which makes a bigger kernel that takes more time to load. And of course, you can't have a boot time presentation without this quote from Donald Knuth, that premature optimization is the root of all evil.

First, toolchain optimizations. I compared an ARM toolchain with a Thumb-2 toolchain, and the result was surprisingly good: with Thumb-2, the full system size is 18% smaller than with a regular ARM instruction set, so it's really worth using. There's no real performance difference, it's about the same, so I'll take that: smaller with the same performance, I keep it. I also tried to replace uClibc, and obviously glibc as well, with musl, and there is a win in terms of library size, measured on a filesystem that contains the libraries: I'm saving about 16% of the library size, so it's worth considering.
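To give an idea of how to compare such configurations, a rough size check on the Buildroot output looks like this; the paths are the Buildroot defaults and output/target is only an approximation of the final root filesystem, so treat the numbers as relative, not absolute:

    # Rebuild with each toolchain / C library choice, then compare:
    ls -l output/images/rootfs.tar                        # size of the packed root filesystem
    du -a output/target | sort -n | tail -20              # the 20 biggest files in the target
    du -ch output/target/lib/*.so* 2>/dev/null | tail -1  # rough total shared library size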
The goal eventually is to have a small system that fits as an initramfs inside the kernel binary, to have just one read from the MMC instead of multiple ones. So one of the goals is to reduce the system size as well, to make it faster to load. In the end we stuck with uClibc.

Then, optimizing the applications. The general idea is to compile your application with fewer features and fewer dependencies; at the end the program is smaller and you have fewer libraries to build. With ffmpeg, what I got was a reduction of the total filesystem size, as generated by Buildroot, from 16 megabytes to three point something megabytes, almost minus 80%, which saves about 150 milliseconds of application loading and execution time, probably because the code probes and loads fewer things. Normally loading is on demand, but the code probably tries to load and test a few things at startup. In the end the total reduction was about 350 milliseconds, maybe because there is some faster mount time as well. I expected less, but let's take it.

Now let's talk about init script optimization, and also about reducing the size of the root filesystem as a whole. There are techniques that are well documented: analyze the boot process with bootchartd, for example, or with the tools associated with systemd; don't mount /proc and /sys if you don't need them; simplify the BusyBox configuration; and a few things like that. They are also well documented on the eLinux.org wiki, which has a nice page about boot time reduction. And switching to static executables. I'll go through each of them.

So yes, effectively a smaller filesystem is faster to mount, and especially, as our goal at the end is to embed the root filesystem inside the kernel image, the kernel will be smaller to load from storage if the initramfs is smaller. Kernel decompression will be faster too, normally; I'll come back to that later. And the kernel and the filesystem are loaded in a single read operation from storage, which also makes sense, rather than multiple small ones which each take time to start and add overhead from the filesystem layers and so on. The initramfs is a very efficient way of accessing files.

A technique to detect unnecessary files is to take advantage of the fact that Linux stores the last access time of each file. So don't hesitate to boot your system once, then take your SD card out and run find with an access-time test on it: you'll find the files which were actually accessed during the boot process. That's a way of finding the ones that were not accessed, and then you can decide whether to eliminate them; there's a small example below.

What else? I also simplified the BusyBox configuration, which reduced its size from 600 or 700 kilobytes down to 86 kilobytes, dynamically compiled against uClibc, so it's quite nice in terms of size. At the end, the total filesystem size was reduced by 34%, down to about 3.3 megabytes. In terms of boot time the difference is hardly noticeable, because of on-demand loading, I guess, and also because the init scripts didn't take much time anyway.

In this system I just had two executables, actually: BusyBox and ffmpeg. So it really made sense to eliminate the shared libraries, because they contain a lot of code that you eventually don't use. Effectively you may duplicate some code inside the two static executables, ffmpeg and BusyBox, but the overhead is still not that much.
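By the way, the access-time trick mentioned a moment ago can be as simple as this; the device and mount point are examples, it assumes the target filesystem was not mounted with noatime, and that the target's clock is roughly in sync with your PC's:

    # Boot the target once, power off, then mount its SD card on your PC:
    mount /dev/mmcblk0p2 /mnt/rootfs
    # List the files whose access time is within the last 10 minutes,
    # i.e. the files actually read during that test boot:
    find /mnt/rootfs -type f -amin -10
    # Anything not listed is a candidate for removal from the root filesystem.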
So here you can see the content of the filesystem at the end; that's the only thing that's left after my optimizations. Eventually, yes, I'm down to 1.58 megabytes of total storage space, as measured from the tar archive that Buildroot generates. That's all I need, actually, at the end.

Some filesystem optimizations. Well, we could test various filesystems here, but the goal is simple: we're just switching to an initramfs. We did compare with other filesystems, but it really made sense to switch to an initramfs in this case. So the root filesystem is embedded in the kernel image, and there is just one access to storage, as I told you; you don't need block storage and filesystem drivers. The kernel is going to be bigger because it includes the filesystem, but, and that's the next slide actually, once you remove block support and MMC support from the kernel, you get back the boot time that you lost because the kernel was bigger. The kernel is still bigger with an initramfs than it was before, but you recover the extra time spent loading, decompressing and copying it.

Something you have to remember, that people may not realize, is that it's very important not to compress the initramfs, which is not the default, at least with the BeagleBone Black defconfig. So make sure you set the initramfs compression to none, CONFIG_INITRAMFS_COMPRESSION_NONE; otherwise the initramfs is compressed twice, once before being embedded in the kernel, and then again when the kernel image itself is compressed. This achieved a size reduction of 200 kilobytes and saved about 170 milliseconds of boot time. So at the end, effectively, I got back the same boot time as before, even a little less, although the kernel was bigger, thanks to having fewer things to initialize at boot time: no block support, no MMC support.

Now a few words about kernel optimizations. You know there is this initcall_debug feature in the kernel that dumps information about the start and end times of the initcalls during the boot process. You can use it to put more information in the kernel log, then you run dmesg, or you copy the console output to a file, and you process it with scripts/bootgraph.pl, which generates a graph like this with the biggest consumers of boot time.

So, first the technique: you get the names of the functions here, and you can look them up in Elixir, typically elixir.bootlin.com, which indexes the kernel source code, and then figure out what each one means and whether you need it or not. Just be aware that some entries reported by initcall_debug are actually the name of a module followed by an underscore and "init"; they don't correspond to an actual function name in the source code. Then you can use some techniques to try to optimize the existing drivers, like looking for parameters that would change the behavior of a module and account for the increased delay.

Just an example: using this, I realized that tracer_init_tracefs, the whole init tracing infrastructure, was taking about 550 milliseconds to initialize. It was enabled by default, so I removed it and saved a lot of space and time.
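To make the initcall_debug workflow concrete, this is roughly how it goes; the way you set the arguments and capture the log is just one possible setup:

    # Ask the kernel to log initcall start/end times with timestamps
    # (here from the U-Boot prompt, appending to the existing arguments):
    setenv bootargs "${bootargs} initcall_debug printk.time=1 loglevel=8"
    # After booting, save the kernel messages and build the graph,
    # from the kernel source tree on your PC:
    dmesg > boot.log
    perl scripts/bootgraph.pl boot.log > bootgraph.svg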
There's also the TTY layer, the serial interface for the OMAP, which was taking a huge amount of time. I didn't find an obvious reason in the code; I found some parameters corresponding to the number of TTYs that are initialized, but I haven't managed to change that yet. Other things: a network driver that would be disabled anyway, so I took care of it. Some things related to the USB initialization for the camera cannot be skipped, so I have to keep them. And all the other ones were not low-hanging fruit, they were quite small, sorry. So I expected to save the corresponding time by eliminating features from the kernel, by working on the kernel configuration and reducing the number of features.

Presetting loops_per_jiffy is a well-documented technique for reducing boot time. When you boot for the first time, the kernel estimates how fast it goes through the calibration loop that is used for udelay(). It's not necessary to run this calibration at every boot: you just do it once, get the loops_per_jiffy value, and feed it to the kernel command line with the lpj parameter. In the past it was saving more time; now it saves 82 milliseconds. On the eLinux wiki the figure may still be around 250 milliseconds, if I recall correctly.

Another thing is SMP. People were saying to try disabling it, and I had never tried. Effectively, on a single-core CPU, and it could even make sense on a multi-core CPU, disabling SMP support makes your system faster: 126 milliseconds of boot time savings here, and a noticeable size reduction as well. By the way, every time I mention a kernel size, it's the compressed size, so it's even bigger once decompressed, and it also contains the initramfs, so it's not completely fair: if you subtract the initramfs embedded in the kernel, the savings are even better. So if you only have one CPU core, this is a clear winner. Otherwise, if you have multiple cores, you could try to start with one core and plug in the other ones later; you need SMP for that, of course, and I don't know what it would give you.

Removing kernel module support saved 82 kilobytes of compressed kernel size and 20 milliseconds of boot time. I had a bigger number in mind; I remembered something like 300 kilobytes from past experiments, but now it's more efficient, I guess. Again, that's on the compressed kernel. Be careful when you do this, be cautious, take your time; otherwise you end up feeling lucky, removing lots of things, and you have no clue why your kernel doesn't boot anymore. Oh yes, and note what I had to do first: remove all modules. Otherwise, if you remove module support, all the modules, which I don't use in my system since I never load modules, are turned into built-in features, and you end up with features you don't know whether you can remove or not. So I'm looking for a way to turn all modules to "no" automatically. Maybe just doing a sed on the config file, but that's going to break some dependencies, so I'm not sure exactly what the best way is; some make config target that turns all modules off would be nice. So remove all modules first, then you can remove module support, and then remove the built-in features you don't need.

Another technique is to silence the console with the quiet command line parameter. It saves 577 milliseconds, which is very nice and easy to understand: the console is a slow device, so that's why it saves that much time.
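Putting the command-line techniques together, the kernel arguments can look roughly like this; the console device and the lpj value are illustrative, you have to read your own value from a verbose boot:

    # A verbose boot prints something like:
    #   Calibrating delay loop... 996.14 BogoMIPS (lpj=4980736)
    # Preset that value and silence the console on the next boots:
    setenv bootargs "console=ttyO0,115200n8 quiet lpj=4980736"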
Then, of course, you can go further and remove support for kernel messages completely, so that the messages don't even get compiled in, which saves a lot of kernel size. So disable CONFIG_PRINTK and CONFIG_BUG, which saves about 5% of the size. Also remove CONFIG_KALLSYMS, which surprised me because it saved 107 kilobytes on the compressed kernel, so it takes a lot of space. The total savings for both, silencing the console and turning off kernel messages, were about 700 milliseconds and more than 200 kilobytes on the compressed kernel, so before compression it's probably much more.

Using CONFIG_EMBEDDED and CONFIG_EXPERT allows you to turn your generic kernel, which can run any application, into a dedicated one which only supports the system calls you actually need. So you can compile out some system calls that you are sure you never use in your particular system. This reduced the size by 51 kilobytes compressed and the boot time by 34 milliseconds, essentially because there is less code to initialize and the kernel is smaller.

I also tried Thumb-2 for the kernel; it didn't work this time. In user space it gave a good result, but in kernel space, surprisingly, with the same toolchain, I got 40 kilobytes of extra space and a total boot time that also increased by five milliseconds, so I didn't select it. Any idea why it is this way? Ah, okay. So I did try to disable the ARM unwinder separately, but you advise doing both together. I'll try, yeah, thanks; I'll do it the other way around, first disable the ARM unwinder and then switch to Thumb-2. Okay, sounds good.

I made some tests with the slab allocators. No change: SLAB is still the best. SLOB, which is supposed to be much simpler, is actually not so good in terms of performance; it's a disaster, even... well, I have 512 megabytes of RAM, so maybe that's too much for SLOB. So I stuck with SLAB. SLOB apparently makes sense for very small systems with a very small amount of memory. If I had reduced the amount of RAM... ah, yes, that's an idea: boot my BeagleBone Black with just 16 megabytes of RAM and see if it gets better. I had forgotten about that idea. But effectively SLOB is much, much slower here, like almost 1.5 seconds of increased boot time.

Kernel compression is interesting too. Various compression schemes are available; by default it's LZMA, at least on the BeagleBone Black. I guess it's board-dependent, or maybe it's the generic default, which makes sense, I guess, for x86. So I made some tests, updating tests I had done before on a 3.1 kernel. As you can see, LZO wins over LZMA, and even over gzip, which is the closest one. So the best contenders are gzip and LZO, but LZO wins. LZO is a very fast decompressor; it doesn't compress as well as gzip, maybe 15% less, something like that, but it's very fast, so you get a slightly bigger kernel that is much faster to decompress.

At that point it was time to switch to faster storage, because gzip and LZO were close. So I selected some SD cards with better performance, and eventually I found a model that was not the highest-end SD card, but gave the best read performance I could get. Even if you take a SanDisk Extreme-something, the limitation is eventually what the hardware can do in terms of read performance, I guess.
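For choosing between cards, a rough sequential read test is usually enough; the device name is an example, and this has to run on a setup that still has block device support, for instance your development system rather than the stripped-down target:

    # Drop the page cache so the result reflects the card, not RAM:
    echo 3 > /proc/sys/vm/drop_caches
    # Read 64 MiB sequentially from the raw device; dd reports the throughput:
    dd if=/dev/mmcblk0 of=/dev/null bs=1M count=64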
So here, LZO still wins, and the differences are about the same. From now on I'm sticking with the fastest storage, but I didn't want that change to disturb the earlier tests.

I tried CONFIG_CC_OPTIMIZE_FOR_SIZE, so compiling the kernel with -Os instead of -O2. Actually, the result depends on how fast your CPU is. On the BeagleBone Black, the winner is... no, -Os is a little faster, a little faster, but hardly by much. But if you have a slower CPU, -Os can really slow your machine down significantly. And keep in mind that the system calls you make will run kernel routines that are slower to execute, so -Os might not always be a good solution in terms of long-term performance. Here I chose -Os, but it really depends on your platform, on how fast or how slow your CPU is.

Other things: removing the proc filesystem removes about 50 kilobytes of space, but in my case, even though proc was not mounted, I believe, ffmpeg stopped working when I removed support for proc. So maybe ffmpeg mounts /proc by itself if it doesn't find it; that's funny. At least I could remove some support like /proc/sys, which saved a little bit, not much. Removing sysfs saved 22 kilobytes and 35 milliseconds of boot time, so that's good. You can do that if it's compatible with what your applications expect, of course: if you're lucky and you don't use /proc and /sys, you're fine, otherwise you can't.

Removing all the compile-time checks, the compiler options in the kernel hacking section, saves about 40 kilobytes, especially CONFIG_DEBUG_INFO, and I expected it to be bigger. Also, changing the ARM unwinder used for nicer stack traces, replacing the EABI stack unwinder with the default one, effectively saves 24 kilobytes of space; I'll try that together with Thumb-2 and see how they interact. There's a small increase in boot time here, but it's almost negligible, so I kept the default unwinder, the original unwinder for ARM.

Another technique you can use is appending the DTB to the kernel. I observed that during boot the first big read, loading the zImage, is fast because the file is big enough, but the DTB is a small file, so the read performance for it is not the best. So the idea is to use the old technique from the days when U-Boot didn't support device trees: you just append the DTB at the end of the kernel binary, and it just works. So I'm just concatenating them, and I have one image to load instead of two, and the performance is effectively better: I saved 20 milliseconds of boot time, measured from the "Starting kernel" time to the time user space starts. It's a small gain, but it can give you the extra milliseconds you need to reach your objectives.
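To make that concrete, the append-the-DTB trick looks roughly like this; the file names, load address and MMC partition are examples, and the kernel has to be built with CONFIG_ARM_APPENDED_DTB so that it finds the device tree by itself:

    # On the PC, concatenate the kernel and the device tree into a single file:
    cat zImage am335x-boneblack.dtb > zImage-dtb
    # In U-Boot, one single read then boots both:
    fatload mmc 0:1 0x82000000 zImage-dtb
    bootz 0x82000000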
And to finish, the bootloader optimizations. There are lots of well-documented techniques for speeding up U-Boot, which is a slow program by default. Here I just decided to skip U-Boot almost entirely, using the Falcon mode: instead of loading the second stage of U-Boot, you load the Linux kernel right away, taking advantage of the Falcon mode infrastructure in U-Boot, which is actually easy to use. The nice thing about it is that the technique is the same whatever board you have, provided it supports the U-Boot SPL. You need a U-Boot legacy image, a uImage instead of a zImage, because the load address of the kernel is stored inside the uImage container, and you copy that to the SD card. One optimization at U-Boot compile time was to disable support for the environment in the SPL, which is apparently possible to keep but takes a lot of time: that saved 250 milliseconds. The environment was huge, 128-something, I don't remember the exact size, but it was unnecessarily big.

In case you want to use Falcon mode, the full details are in our training labs, but here are the most important steps. First, you load the uImage in U-Boot and set the bootargs if you don't have them yet; actually, you already have them when you boot in normal mode. Then you simulate booting the Linux kernel: it's just like bootz, or actually bootm since this is a uImage, except that it only loads the kernel headers and prepares the ATAGS for passing that information to the kernel; it doesn't actually boot, it just prepares that information from reading the kernel binary. Eventually you write that to a file called args on the MMC card, just storing the information so that U-Boot doesn't have to compute it every time you boot. Here the size is a bit arbitrary, it could probably be smaller; that's 16 KiB for that information, and I didn't have time to figure out its exact size.

Once you have done these four steps, you can just reset, and your board will boot through the SPL, the first part of U-Boot, straight to the Linux kernel. And in case you need to get back to the normal U-Boot, you just take the SD card out, remove the args file, and you're good to go. The case with a DTB is the same thing, except that instead of exporting ATAGS you export FDT information, the flattened device tree. Otherwise, when you run spl export, you get the address in RAM where the data are, and the end address, so you compute the size from those two numbers. The Falcon mode is nice because it saves almost 500 milliseconds of boot time by not going through the second stage of U-Boot, so I didn't even have to optimize it.
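The preparation sequence, run once from the U-Boot prompt, is roughly the following; the addresses, MMC device and partition numbers, and the 16 KiB args size are assumptions in line with what I described, not universal values:

    # 1. Load the kernel as a uImage and set the final kernel command line:
    fatload mmc 0:1 0x82000000 uImage
    setenv bootargs console=ttyO0,115200n8 quiet
    # 2. Prepare, but do not perform, the boot: this builds the ATAGS in RAM
    #    and prints the address where they now live:
    spl export atags 0x82000000
    # 3. Save that area to the "args" file the SPL will read at every boot
    #    (replace <atags_addr> with the address printed by the previous command):
    fatwrite mmc 0:1 <atags_addr> args 0x4000
    # 4. reset: the SPL now jumps straight to the kernel, skipping the second stage.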
The total boot time at the end is now 2.5 seconds. Here I have something to explain: I'm in user space after roughly 750 milliseconds, but, I think because I'm connected to a USB camera and USB enumeration is asynchronous, I have to wait for the camera to be detected, and in my case that takes 1.2 seconds. I haven't managed to address that yet. It's a bit frustrating, because I'm doing nothing for about a second; I could display a splash screen and do things in the meantime, or, instead of using a camera, I'll probably switch to just displaying a video file, which could start immediately. So, once the video device is ready, I run ffmpeg, which takes about 500 milliseconds to start and show the first frame on the screen.

I modified ffmpeg to write a message to its log after decoding the first frame, and also to toggle a GPIO, so that I could measure the time with an Arduino connected to the board. So I didn't manage to reduce that USB time; any ideas for making the USB enumeration synchronous? Like in the driver? Okay, cool, I'll try. Like the USB core? Ah, okay, thanks. I also didn't manage to boot without the TTY layer yet; it probably needs extra work, but I expect to save a few hundred milliseconds there. So if I eliminate the roughly 600 milliseconds of initcalls that I found with the boot tracing, plus this 1.2 seconds for the USB camera to be enumerated, I should be below one second, which would be quite nice.

The time is limited, I just have two minutes left. So don't miss Chris Simmonds' presentation tomorrow on optimizations related to systemd. There are other presentations from past ELCs which are brilliant; I listed the ones I like best. There are also our boot time training materials and labs, which have more extensive slides on the various techniques and advice; that's a training course we teach as well. So, questions, suggestions or comments, if we have time. And I'd like to finish with this, a picture of the various techniques that I used, highlighting the most important ones; hopefully it helps you explore the techniques which were really worth it, in our case at least. Any questions? Yes.

One suggestion for you: you take your kernel image from the FAT partition; if you make that partition very small, you can reduce boot time, or if you use a raw partition instead of FAT, it would be even faster. Also, you can remove FAT support from U-Boot. Which is quite true; I haven't had time to do that, but it's the next step, yes, effectively: directly read raw data at fixed offsets rather than going through FAT. Thanks.

I have a question about the other side of this optimization: how do you debug such a system, without printk, without tracing? It's getting worse and worse. Do you enable everything back and debug it, or...? No, well, I really have to do things in the right order: debug first, and then optimize. So analyze the performance with things like strace and so on, and do as much tracing as I need ahead of time, or not too late at least. Otherwise, with this approach, at the last stages I'm stuck, effectively. Actually, one of the things I used to do was have some macros to write directly to the serial port, and then I could use a serial timing program; or, if you have a logic analyzer, use GPIOs. Ah, yes, so it's not like having the full printk, it's just outputting a character, or a number maybe. All right, good idea. And by the way, I was using grabserial from you, so thanks, it's worth mentioning: everything here was measured with grabserial.

Maybe you answered this already and I didn't understand it correctly, but did you include init scripts in your example at all, or did you just do init=ffmpeg? Yes, right, exactly: init= points to my play-video script, which waits for the video device to be ready and then starts ffmpeg. So I replaced the init program with just my own script, which is not a real init anymore; effectively, if it exits, there's a kernel panic immediately.
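For reference, the kind of init replacement I'm describing is just a few lines of shell, something like this sketch; the device nodes, the ffmpeg options and the BusyBox applets used are assumptions, not my exact script:

    #!/bin/sh
    # Started by the kernel with init=/playvideo; if this script ever exits,
    # the kernel panics, so it must end with an exec.
    mount -t devtmpfs devtmpfs /dev    # /dev may not be pre-populated in an initramfs
    while [ ! -e /dev/video0 ]; do     # wait for the USB camera to enumerate
        usleep 10000
    done
    exec ffmpeg -f v4l2 -i /dev/video0 -pix_fmt rgb565le -f fbdev /dev/fb0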
Have you tried bypassing the init script altogether, just pointing directly to ffmpeg with a symlink? Ah, right, but I still need to pass the command line arguments to ffmpeg. Well, I have a script that calls ffmpeg almost right away, after waiting for the video device to be ready, but it should work. Effectively, all the mounting and things like that are easy to do from C code: waiting for the device file to be present, and mounting something if you need to mount something, can be done in C. So effectively I could tweak ffmpeg to do that directly, yes, true, and it would save the time of executing the BusyBox shell and things like that; it would save some time at least.

About the tracing you disabled, which takes 550 milliseconds: it sounds like a very big part, I think; why does the tracing take so much time? Which one, sorry? The tracing you disabled. Oh yes, effectively, that's quite surprising; I guess there's a lot of overhead in that case. I should probably investigate that in deeper detail, but that's what I observed. Yeah, it's surprising that in a production defconfig you have this option that slows down boot time so significantly, so that's something worth investigating, and maybe changing the defconfig for that CPU family. Unfortunately, I think we're out of time. Yes. Thank you.