Okay, hello everybody. My name is Michael and I'm talking about boot time optimization for the real world.

So perhaps first a few words about myself. I've been working for Pengutronix now for over a decade. I'm doing mostly embedded Linux consulting and development, these kinds of things, and this also includes quite a few boot time optimization projects over the years. So let's get started.

First, my motivation for this talk in general. I was at the ELCE last year, and there were actually two talks about boot time optimization that year. One of them was "We need to talk about systemd: boot time optimization for the new init daemon". This was basically an introductory presentation, mostly for beginners, so I wasn't exactly the target audience, but it was a nice start. The other one was "Timing boot time reduction techniques", and this was actually quite interesting. It had a few nice techniques and a rather impressive result, but at the end of the talk I was thinking: well, this is not actually something that I can use this way in my projects, because many compromises were made, many choices to reduce the boot time, but in many cases at the cost of other, let's say, secondary functionality. And it's often the case, for me at least, that boot time is important, but there are many other topics that are even more important, or at least as important. So focusing only on boot time was a bit limited for me.

So I want to talk a bit more about the overall situation when optimizing, and look at ways to not just optimize, optimize, optimize for boot time, but to think: well, I can do this, but maybe I shouldn't, because I sacrifice this feature. Or: how can I improve the boot time without sacrificing other features in my system?

Okay. So before we start with the actual topics, I want to think a bit about why we do boot time optimizations,
because this usually brings us to: what are we optimizing for? What are the actual requirements we have to boot fast?

So the first part is that when we do optimizations, there are what I call hard requirements. This basically means there are situations where, from a specification or things like that, there is a hard requirement that says: okay, when the device is powered on, then after half a second or one second the device must respond to an outside request or must send some message. This is typical in the automotive industry. There are cases like that, and I'll get to that a bit later in an example. But these are hard requirements. This means we must hit the deadline. There is no flexibility there.

And then there are what I call soft requirements, where it's mostly about user experience. You switch on the device, and if you start thinking that maybe switching on didn't work because there is no reaction to your action, then there is some optimization to be done here. Or, later on, and there is usually a bit more time available there, you switch the device on and then you get annoyed because it takes too long to start before you can actually use the device. These are the more flexible requirements, because there's no fixed deadline that says, okay, after 500 milliseconds we have to show something on the screen or something like that. It's more that there is a reaction to what I do, and we have more flexibility to solve the issue. Maybe we do something specific, show something on the screen or whatever, to keep the user entertained, for example.

Okay, so whenever you do optimizations, when we get started here, it's important to choose a target, what you want to optimize, because that's what you need to measure. So you need to decide. There's no "I want to boot in five seconds", because what does it mean to boot in five seconds?
There is no meaning to that if you don't specify what has to happen before the five seconds are over. So the first step is always: pick what we are optimizing for. Especially for a hard requirement, this can be that we have to send a CAN message after 500 milliseconds or after a second. That can be one requirement. Or, I talked a bit about the reaction to power on: we want to show something on the screen, anything, just to show the user that powering on the device actually did something, that pressing a button actually caused something to happen. And then, if we're going further along the timeline, there may be the point where the first user interaction is possible. So for example we can log in if there is a login screen, or things like that, or I have the first view that shows me initial information. And then, usually later in many cases, there is the point where the full interaction is possible, so the device is fully working and all features are accessible.

Okay. And what I started with was that I want to not just look at boot time, but also at what other requirements I have, other things in the system. Usually it's not directly the primary features of the device, but more things like: well, we still need to debug the device in the field, because this is complex machinery and bugs are often not reproducible in the lab. So we want to be able to do minimal debugging in the field. That's also something that might be interesting or necessary.
Well, let's talk about robustness. Maybe if an application crashes once, it should be restarted. And here, that's why I've noted that on the side, I'm only talking about systems that are starting with systemd, because systemd provides me with a lot of features just for that: monitoring if an application is actually still working properly, restarting it when it crashes, things like that.

Or let's talk about the next topic, security. That's always important and often conflicts with boot time optimization because it can slow things down, and systemd provides me with features that I can easily enable to isolate individual applications, these kinds of things.

And then there's development and testing. In many cases there are recommendations to disable all kinds of things for boot time optimization, but in the end the device is no longer debuggable. So what happens is that the developers use a different setup than the final release, because they can no longer debug the final release. And at that point, testing with the final release is reduced, because the developers no longer test, or test less, with the final version. This means we have to do more testing if we have a different setup for development than we have for release. This is something to take into account, because that's always a money question as well: we have to spend more time on extra testing to keep up the quality.

And then there is maintenance. If we add a lot of ugly hacks to boot faster and then we need to do a system update, we have to port all those hacks. This can be a lot of effort, because we spent a lot of time in the first iteration to do the boot time optimization, and then we have to do it again, basically, because it was just hacks that were piled on top of the old version.

So what are we looking at? Well, there's the typical thing: just disable everything.
Well, most things. I'm not looking at that today, because there are a lot of presentations out there that do that for us.

So what can we do? We can delay things. Instead of just disabling all the features, let's look at ways to do things later, after our initial target for the optimization is reached. Or, the thing I prefer in most cases, is to actually optimize the code: look at things and say, hey, this is slow, and instead of disabling it, think: why is this slow at startup? Can we improve it here and still keep the feature? It may not give us as much optimization as disabling the feature, but it will probably give us something. And the best part about this is that it's not just for one project. Especially if it's contributed upstream, it's something that we and others can profit from when the next project comes around.

And then there's cheating. I mean, especially in the case with the user interaction: presenting something to the user often distracts them from realizing that the device has actually not finished booting. For example, I have this old phone.
If I switch it on, I get the screen to enter my PIN, and then I'm in the regular menu and I can call a number. But what I cannot do for quite some time, actually, is open [inaudible]. And I actually didn't notice that for over a year after I got the phone, until I realized what was happening. And I think that's actually the most important point, because we can put something in somewhere and leave things out and delay them, because they're not actually needed to get these, let's say, soft requirements resolved.

Let's get to some real things. The serial console. This is something where most people, when they start with boot time optimization, say: let's add "quiet". The background here is of course that output on the serial console, especially from the kernel, is fairly slow, so it can easily add a few hundred milliseconds. User space is not quite as bad, but it's still there. But what I'm proposing is: don't use quiet, because quiet disables all the errors as well. There is loglevel=5. This means we only show warnings and worse, and quite frankly, in the final product there should be no warnings, so we have no output until an error occurs. And this is a setting one can keep during development, because the error is still visible when something happens during development.

And we can do the same thing for user space, because systemd has options for that as well: systemd.log_level=warning, same thing, only print warnings or higher. And then there's systemd.show_status. If we set it to auto, it will not print the typical systemd messages, you know, the ones with the green OK that come scrolling by at the beginning. They're not visible unless an error occurs, and at that moment systemd switches modes and prints all the following messages that come from that point on. So again, if no error occurs, we have no output.
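Put together, the three settings mentioned here are all kernel command line parameters. A minimal sketch of the relevant part of the command line follows; how it is set depends on the bootloader, and the rest of the command line is left out:

```
loglevel=5 systemd.log_level=warning systemd.show_status=auto
```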
So basically you have the same effect as quiet, but only in the good case, only when there is no error. And that means we can keep this active during development, so we don't have a different setup for development.

Well, and then when we're in user space, there is udev. Udev coldplug, for those who don't know what it means: it looks at all the devices in the system that are already there and basically announces to user space that this device is available for use. This is necessary in general because we don't know if a piece of hardware is actually accessible right from the beginning. For example, USB devices take some time to be initialized, which means that if you're trying to access, for example, a USB stick, a USB mass storage device, too early, it's not actually there yet. Udev handles that for you, but the problem is that this takes a long time: iterating over all existing devices in the system, doing stuff with them, enumerating data about them. It takes time. So in general we want to avoid dependencies like that, which is sometimes a bit tricky.

Let's take an example. The root file system is mounted read-only for robustness, so we need a data partition to save, for example, our configuration or our error data. This is another device that needs to be mounted, and it has, in general, a dependency on udev. Well, what can we do to avoid that?
Well, the first version is to use automounts. Automount means the file system is mounted automatically in the background, either because mounting has already been started or triggered by an access to an actual file on the file system. This means we can start the application right at the beginning, and as soon as the application tries to access a file on the file system, this will block until the file system is mounted. So we can do the application startup, the udev coldplug and the mounting of the file system in parallel. That saves us boot time.

The other version is to trick systemd. This has a few requirements in general. Let's say we have an eMMC, and there's a partition for the root file system and a partition for the data, and they're on the same device, which means we know that if the root file system is mounted, then the data partition is always available. So in this specific case we do not need to wait for the device to appear, because it's guaranteed to be there already. Systemd in general tries to do the generic thing, so it will always add the dependency. But we can trick systemd into believing that there is no actual device behind the file system. For example, there are some file systems that don't have a device, like bind mounts, mounts where we don't need a specific device but just the source directory, and there are other things as well. And if systemd doesn't know that there's a device, it will not add a dependency for the device.

You have two ways to trick it. One is to write a manual systemd mount unit and write What=UUID= with the file system's UUID in there. It's not a path anymore, not a device in the /dev file system anymore, and now systemd doesn't know that there's a device behind it.
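As a sketch of this first trick, a hand-written mount unit could look roughly like the following. The mount point `/data`, the file system type and the UUID are placeholders for illustration, not values from the talk. The unit file name has to match the mount point (`data.mount` for `/data`). This is only safe under the condition described above, that the partition is guaranteed to exist once the root file system is mounted.

```ini
# /etc/systemd/system/data.mount  (hypothetical example)
[Unit]
Description=Data partition, mounted without waiting for udev

[Mount]
# Not a path below /dev, so systemd does not see a device here
# and adds no device dependency. The UUID is a placeholder.
What=UUID=00000000-0000-0000-0000-000000000000
Where=/data
Type=ext4

[Install]
WantedBy=local-fs.target
```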
So it will not add a dependency. The other way is that we actually use a symlink outside of /dev and use that as the device. Systemd will check: is my source for mounting a device? And if yes, it will add a dependency. If it's outside /dev, systemd thinks it's not a device, so it will not add a dependency. The tricky thing is that we need to do things like fsck manually. Systemd doesn't know it's a device, so it will not add a file system check for it, so we need to do that manually.

So let's look at a small example to get a feeling for how that works and what it helps. I did a small example. It's a small Qt QML application. It does some basic setup and then reads a file from the file system, basically faking "I'm reading my configuration". Then it loads the QML, basically loads the UI, shows the window and says: hey, I'm ready.

Let's compare. I've done this on the STM32MP1. This is a rather slow CPU. I used it because it gives me relatively large numbers to work with, so I can show the effect quite nicely. It's a dual-core Cortex-A7 at 800 MHz, really nothing very fast, and I've used the eMMC as mass storage. So in the beginning, with basically very little other boot time optimization, we're starting at eight seconds from kernel startup. That's my baseline. With the automount I got 7.4 seconds. With the fake device,
when I trick systemd, that's 6.7 seconds. And I can of course mix the two, right? I can do the fake device and still do, for example, the fsck in parallel to the application startup by using automount. But in this case the time was actually the same, probably because on the dual-core device it didn't actually help anything. I don't know, I didn't look into the details, but conceptually the scheduling overhead there was probably worse than what I was gaining, things like that.

And it really depends on the use case. For example, if the application is loading the file later, reading data later and doing more initialization at the beginning, this can have an impact. Or if there are more than two cores: the coldplug is faster with more cores because it parallelizes quite well in many cases, while an application startup is typically single-threaded, so there's just one core used for that. So it really depends on the use case whether one or the other or both of the techniques help here.

Automount is easier to set up. Well, there's not a lot to do.
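For comparison, the automount variant is a single extra mount option in `/etc/fstab`. A sketch, assuming the data partition is `/dev/mmcblk0p3` and is mounted at `/data` (both are placeholders, not values from the talk):

```
# <device>       <mount point>  <type>  <options>                     <dump>  <pass>
/dev/mmcblk0p3   /data          ext4    defaults,x-systemd.automount  0       2
```

With `x-systemd.automount`, systemd sets up an automount point for `/data` and the first access triggers the real mount, so the application can be started before udev has announced the device.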
It's simply setting an option, basically, in the fstab, and that's it. For the other case we either just need to add the symlink, or we need to write a full unit file, and we need to handle the fsck manually, these kinds of things.

Okay, what's next. Coldplug is used for more than just saying there's a device. It also does some initialization. For example, it can change the ownership, which is important when we look at security, so that for example a device is actually accessible by the application. So if the application is not running as root, which it hopefully isn't, then we need to wait for udev to change the ownership and the group of the device. We could do this manually, of course, but then we're adding extra stuff in there and need to implement things ourselves, so it's something that I'd like to avoid. And of course, changing the ownership is only a very small part, something simple. There are more complex things that you can do with udev.

So what you can do here is split the udev coldplug. The coldplug is this udevadm trigger, which enumerates the existing devices, and what you can do here is say: well, I don't want to trigger all devices, but maybe in my case, for example, only the DRM devices, the graphics devices. So we're reducing the load for the coldplug to make it run faster, because we don't have a lot of ways to actually order which device comes first. If we reduce the number of devices to only those that we actually need at boot time, we can make this part faster. And then, when the application is starting, we can do the other half, all the other devices. We still do the coldplug, we still have a full setup of our system, but we've just split it into parts: one is run when we need it, during the hot path, and the rest is run after the application. It doesn't really matter when it's running here.
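One way to sketch this split is a drop-in that restricts the stock `systemd-udev-trigger.service` to DRM devices, plus a second unit that triggers the rest later. The file names, the `myapp.service` name and the `/usr/bin/udevadm` path are assumptions for illustration; the `udevadm trigger` options used are standard ones. The early part could look like this:

```ini
# /etc/systemd/system/systemd-udev-trigger.service.d/drm-only.conf  (hypothetical drop-in)
[Service]
# Clear the ExecStart= lines of the stock unit, then trigger only DRM devices.
ExecStart=
ExecStart=/usr/bin/udevadm trigger --type=devices --action=add --subsystem-match=drm
```

And the late part, ordered after the application:

```ini
# /etc/systemd/system/udev-trigger-rest.service  (hypothetical unit)
[Unit]
Description=Coldplug the remaining devices after the application has started
After=myapp.service

[Service]
Type=oneshot
ExecStart=/usr/bin/udevadm trigger --type=subsystems --action=add
ExecStart=/usr/bin/udevadm trigger --type=devices --action=add --subsystem-nomatch=drm

[Install]
WantedBy=multi-user.target
```

Which subsystems belong in the early part depends on what the application really needs at startup, so the `drm` match here is only the example from the talk.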
So this gives us a way to, yeah, delay a lot of device initializations. I've not written that down here, but if you don't do the udev trigger for a certain group of devices, this may mean, for example, that the modules for them are not loaded, possibly. I'm not sure if that's right, maybe not. But you're not initializing the devices for user space, so you can delay things until they're actually needed. Maybe not at all, or maybe only for development or for maintenance, these kinds of things.

Okay, so let's say we don't need any devices here, or only devices that are already there, and waiting for the initial setup with systemd is too slow before we actually want to show something on the screen. Then we can do a boot splash with a very simple application. It runs as PID 1, so we're executing it instead of systemd, and it just shows a static image on the screen. Then it forks, and the child just stays there in the background, because if you use DRM and the application exits, then the device is closed, the content is lost and the display goes off again. We don't want that, so it stays around. But it does release DRM master. This basically means that if another application comes later and opens the DRM device as well, it is actually allowed to do that and to provide new content for the screen. At that point we can kill the splash application, because a new application has taken over the screen.
And PID 1, after forking the child, will simply execute systemd. This is a relatively easy way to provide something on the screen, and we can usually get to the order of magnitude of about one second after power on. This is manageable in many cases. It really depends on the hardware, but that's about the order of magnitude you can reach with that. Maybe it's a bit faster, maybe it's slower. I've often done boot time optimization where we say: okay, here we do this, and we have ideas, and well, why don't we put a boot screen here to get something on the screen, and then we can optimize the rest, but we have a bit more time. And then they say: hey, boot screen, something on the screen after one second already? Great, we're done, we don't need anything more. Because in many cases that's actually the biggest concern: you power on the device and the device does nothing. And there a boot screen really helps.

Right. But sometimes it doesn't, because we actually need to do something already. And now it gets complicated. I try to avoid this, because writing applications to run early requires a lot more attention to the details. You need to know what's available and what's not. There's no temporary file system available at that point. You may need to mount some file systems, procfs or sysfs or things like that, explicitly, manually, because they're not available at that point. So this makes things a lot more complex, and I'd like to avoid that, to allow application developers to just write normal applications. But sometimes it's necessary, and then we don't have a boot splash application but a regular application that's actually doing more than just dumping something on the screen and then waiting. And again, we fork, PID 1 starts systemd and the other one runs in parallel. The problem with that is that when we do that, we can no longer use a lot of the features from systemd, because this application runs outside of the scope of systemd, where systemd tracks the applications.
So no restarts, no resource limits, all these kinds of things are missing. We can improve things a bit, though. For services and things like that, systemd uses cgroups, and moving applications or processes from one cgroup to another is possible. You need the correct permissions, so we need to take care of that, but in general you can import this running application into a service later on. That's possible. However, it's only part of the deal, right? We can now track if the application exits, if we do things right and tell systemd: this is the actual PID of this app, this already running application is actually the main PID, so this is the main process. Then we can track: yes, this service has failed because the application exited. We can do that. We can actually do things like the watchdog, because there is the notify socket that's used to do this watchdog pinging to systemd, and we can pass that socket to another application. But it requires work, and it's only half the deal, because things like resource limits, restricted access, the security features basically, are not available, because the setup for that happens when the application is forked by systemd, before it execs. So systemd forks, sets up the limits and then execs the actual process. That's not possible when we import the process.

What we can also do, instead of importing, is basically restart it. But that means that while there's a running application a new one starts, and we need to transfer the state from one application to the other. It's possible, but it's a lot of work. So I try to avoid that when possible, but it's a way to get the best of both worlds. I know there are proponents that say: well, if you really want to boot fast, you need to run your application as PID 1.
Well, here's a way to mix that, with some work, to get most or almost all features from systemd as well, all the additional stuff we have. And if you do it right, you can actually test both ways: you can restart the application normally as a service, and you can do the importing and these kinds of things. There are ways to test that a bit, and it gives you a lot more features, basically mixing the two ways, and you still have fast booting and all the features.

And actually, if you put a lot of work into it, for both of the cases, both the splash and the application, you can do that in an initrd already, in an initramfs. It possibly saves a little more time because you don't need to mount the root file system. It really depends on your device. If you have a root file system where the device is detected very fast and mounting is very fast, then you don't save a lot. But if mounting the root file system takes a lot of time, comparatively, then an initrd can be used to start the application. But this means even more work. We don't just have to exec systemd and do more things later on, but we first have to mount the root file system, do the change-root kind of things, and then exec systemd. So there are ways to mix all these kinds of features, but every step is a bit more work.

Right, debugging. This is actually interesting, because a while ago a colleague came and said: here's an issue, I want to debug it, what should I do? And another colleague said: hey, use the function tracer to see what's actually happening in the kernel. "How do I use the function tracer? There's nothing here. There's supposed to be this file in debugfs where I should write something, and it's not there." Then she realized: well, yeah, it wasn't there because the kernel was boot time optimized and the tracing was disabled. In this specific case, enabling the tracing infrastructure was not a lot of work.
I mean, that's just a board on our development desks. But if we are talking about debugging in the field, as I mentioned a bit at the beginning, then that's a whole different story. So I'd like to keep debugging features, but the tracing is, at this point, well... These are the numbers from the same example, from the same hardware I had before for the other examples. Out of the eight seconds originally for boot, 1.4 seconds were from kernel start until the root file system is mounted. And out of those 1.4 seconds, 0.9 seconds, so two-thirds basically, were just some initialization for core tracing. It's not just the function tracer but the other tracers as well. Basically it's some core tracing infrastructure that is needed by multiple features in the kernel, that's enabled by multiple features if you switch them on in the kernel config. So, 0.9 seconds is a lot. Should we disable it? Probably. Right now it's the only way if you really need that time.

But I did spend a little time looking at what's actually going on there. I'm not really a kernel developer. I know my way around it a bit and I've done a bit of kernel development, but mostly I'm a user space guy. But from what it looks like to me, there is some function that's called initially for the kernel, basically for all the code that's already there, and later on the same function is called for each module that is loaded. To me that sounds like it's probably something we can delay as well for the kernel part. We don't need to do it immediately, because we're doing the same stuff later for the modules as well. So it's maybe something we can delay until we actually do some kind of tracing. Now, I know we can enable tracing via the kernel command line, so in that case it has to be started immediately. But in the production system, on real hardware in the field, we don't enable tracing on the kernel command line.
So my hope is that we can actually remove those 900 milliseconds, or a lot of those 900 milliseconds, and do that sometime later. I've not done that yet, or rather, I've not had the chance to ask a colleague to do it for me, because those are the kernel hackers, because, well, I was preparing a presentation for a conference and this is not a customer project where we have a bit more of a budget to spend on development. But my hope is to eventually get rid of those 900 milliseconds on this hardware (it's faster than that on other hardware), to be able to keep the tracing infrastructure enabled without the big penalty.

Well, yeah, let's see. Patch opportunities. Maybe someone else finds some time to look at that now that I've pointed it out, and we'll get that fixed. And then, next week at work, let's see, maybe the next time I get a boot time optimization project on my desk I can say: hey, I know there are some optimization possibilities here, and I get someone to do it.

And then another interesting topic: security. My perspective for boot time optimization is usually that I provide a platform, right? Providing the kernel, the basic user space, libraries, libc, systemd. And customers are writing the actual applications. Those are the real things, the important things. So when I do boot time optimizations, one of the biggest problems I have is that in many cases there is one application, and it is basically a black box for me. And when I ask, the customer often doesn't actually know what the requirements are: which devices are used, what other dependencies in the system do I have, which file systems will be accessed?
So it's one big black box for me, and that makes it really hard to optimize, because exactly the kind of things I proposed, to move things to later, I can only do if I know they're not used, if I know they're not needed to start this application.

And that's where security comes in, because with a really good security concept you split your application. There's not one single monolithic application. You have multiple processes to make privilege separation possible. So you have one process that is only for the UI, and then maybe a control process in the back and another process that handles communication to the outside, these kinds of things. And the interfaces are clearly defined, because if it's not specified, it's not allowed to access something. So if there is some hardware that it tries to access, it will get a permission denied because we didn't specify that it's allowed to access this hardware. So we have a split into multiple applications, and we have clear definitions of what's actually needed. That actually helps a lot with boot time optimization, because I can use all this information to do my ordering.

And that's something I can communicate with the customer, because in many cases it's: well, we don't know what we need and what we use, but hey, does it matter? The application is running, it starts. We've already been working on this project for half a year and it was no problem. Why don't you do your boot time optimizations and let us work on our application? With security, all this information is available, because someone needs to say: okay, access to this hardware is allowed. Which means this hardware is needed.
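To illustrate how such a security specification doubles as dependency information, here is a sketch of a service unit for a UI process. All names (`ui.service`, the user `ui`, `/usr/bin/my-ui`) are hypothetical and the device list is only an example; the directives themselves are standard systemd options.

```ini
# /etc/systemd/system/ui.service  (hypothetical example)
[Unit]
Description=UI process

[Service]
ExecStart=/usr/bin/my-ui
User=ui
# Only explicitly listed devices are accessible. The same list tells the
# platform integrator which devices must be ready before the UI starts.
DevicePolicy=closed
DeviceAllow=char-drm rw
DeviceAllow=char-input r
# System call filtering via seccomp.
SystemCallFilter=@system-service
Restart=on-failure

[Install]
WantedBy=multi-user.target
```

In this sketch, the `DeviceAllow=` lines are the part that feeds back into boot time work: anything not listed can be initialized after the application is up.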
So there are opportunities here, but there are of course also downsides. Security always has a penalty. There is always some overhead when you enable some feature. We've seen with all the hardware vulnerabilities how much overhead security can be in these cases. And if you enable features like seccomp, which systemd provides, which basically means that for every system call you make, we check if it's allowed to be called, that check costs, and it slows down startup time as well. So there is an overhead, but there are also opportunities, because we can mix the effort to do boot time optimization with the effort to do security.

And one last thing I want to talk about is hardware. In general you say "premature optimization" and these kinds of things, right? Well, I'm saying: if you want to boot fast, you'd better think about that when you design your hardware, because wrongly designed hardware is always a problem. That's not something you can fix in software. The most important thing is: use fast storage. Really, really important, because at the end of the day you're loading a lot of code, you're loading a lot of data from the device. If that device runs twice as fast, it takes half the time. Of course you can try to load less, but still, if it's faster to load, it's faster to run.

And then there is USB. USB has certain timeouts from the specification, so it takes some time before a device is available. If we need that device at boot time, that's not something we can reduce, right? We can't get rid of that time. There's actually a good example from the one boot time optimization talk from last year, because it was basically a USB camera and the content was sent to the screen, and I think at the end of all the optimizations half the time was spent waiting for the USB camera to be available. I mean, that's frustrating, right?
You're doing all this good work to get the software fast, and then you have this hardware limitation that basically says: here's my hard limit, it doesn't get better. And there are other ways. For example, with cameras, there are camera interfaces on the system on chip where you can use a camera that's actually fast to start, and all these kinds of things. So think about what you want to do when you design your hardware. When you say, I need to boot fast and need to boot within a certain limit, then I need to think about that.

FPGAs are another issue here. If you load an FPGA bitstream, sometimes you have a very, very slow interface, and if it takes half a second or a second to load the bitstream from the bootloader, that's a second that you cannot optimize away if your design is that way. So think about these kinds of things when designing your hardware. You cannot do everything perfectly there, right? But you can avoid the biggest issues.

Well, that's it from my side, so I'm open for questions now.