Okay, can anybody hear me? Wonderful, then we can start. Welcome, ladies and gentlemen, to my talk "How Arm Systems Are Booted: An Introduction to the Arm Boot Flow".

A short slide about me. My name is Rouven Czerwinski and I work for Pengutronix e.K. You can find me on GitHub under Emantor, and the email address is on there as well. So if you have any questions afterwards and you don't know how to contact me, just download the slides and write me an email. I'll answer eventually, if the mailbox isn't too full, obviously. I started at Pengutronix doing testing for embedded systems and developed a testing framework called labgrid. Then I worked a bit on OP-TEE, the Open Portable Trusted Execution Environment, and did a bit of system integration, so bundling libraries into complete Linux systems for customers. Now I'm working with media systems, so cameras and video interfaces. But at Pengutronix we also do a lot of low-level work. We have a fork of U-Boot which is called barebox, and we deploy barebox in production for customers. That's why I also work on bootloaders actively.

I'll start with a short disclaimer. This talk is not how Arm systems are booted. This talk is how some Arm systems are booted, mostly looking at the i.MX8M variants. You absolutely have to consult your vendor documentation, because your vendor might have decided to do everything totally differently from what I am presenting here today. For embedded Linux systems there is no standardized boot flow, unfortunately. For server systems we have standardized on UEFI and ACPI, but for embedded systems there is no such standard yet. And I don't think a standard is going to come up, because embedded systems can be very different and come in various shapes and sizes, and it's hard to standardize something like this.

We are going to look at implementations, and implementations are always vendor specific. You have NXP with the i.MX series, which does other things than Nvidia does, or Qualcomm, or Samsung. There is also an approach with which Arm tried to standardize things initially: the whole Arm Trusted Firmware, which I'm going to talk about later, was an initiative by Arm to get a somewhat standardized system for Armv8, or Arm 64-bit, systems up and going. I'm not going to look at Armv7, so 32-bit systems, today, because there are a lot more different implementations in that area. There are vendor-specific kernel drivers for things like turning a CPU on or getting a CPU to idle correctly, and this is a lot different from what we see in the Armv8 world. Armv8 has de facto standardized on Arm Trusted Firmware, and this is what I'm going to present today.

So let's start with a short table of contents. We just went through the introduction. I'm going to talk about the exception levels on Arm processors first. Then I'm going to talk a bit about the requirements we have for booting a system, so what is necessary to get a system up and running. Then I'm going to talk about TF-A and the TF-A services, which are present on almost every Armv8 or Arm 64-bit system. And then we go through the kernel start sequence, so we can finally end up in the kernel and eventually start user space to run our applications on the processor.

Let's take a short look at exception levels.
For most Arm 64-bit implementations there are four different exception levels, and there is also a separation between the normal world and the secure world. I'm not going to talk about the secure world today, because secure world implementations are not necessary to boot up a system and run your own applications, but they might be interesting for use cases like trusted storage. We do use exception levels EL0 to EL2 in the normal world, and then we have exception level three, which is the highest exception level with the most privileges. It is labeled here as the secure monitor, but for arm64 I think the name TF-A runtime services is more common. In the normal world, exception level two is occupied by something like a hypervisor, exception level one is where your kernel runs, and exception level zero is what your application uses to run.

We can also put this on a timeline during boot. We always start out in the highest exception level, which is exception level three. First the ROM code starts; the ROM code is usually part of your vendor's implementation and is fused into your SoC. Then an SPL, or second-stage boot loader, starts, and we start up the exception level three services. The exception level three services then start the full-strength boot loader, and it hands over to the kernel, which provides hypervisor facilities if enabled and puts itself into exception level one. Finally, when the kernel has started up and user space is available, we run our application at exception level zero. There is also a slightly different naming scheme from the Arm Trusted Firmware world, where the ROM is always equivalent to BL1.
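
The timeline above can be written down as a small table. This is a minimal Python sketch, not something taken from the slides: the stage labels are illustrative, and it assumes the full boot loader runs at EL2, which is only one possible setup.

```python
from enum import IntEnum


class EL(IntEnum):
    """Arm exception levels; a higher number means more privilege."""
    EL0 = 0  # applications
    EL1 = 1  # OS kernel
    EL2 = 2  # hypervisor
    EL3 = 3  # secure monitor / TF-A runtime services


# Boot stages in the order described in the talk, each with the exception
# level it runs at. Labels are illustrative; EL2 for the boot loader is an
# assumption, since a system may also run it at EL1.
BOOT_TIMELINE = [
    ("ROM code", EL.EL3),
    ("SPL / second-stage boot loader", EL.EL3),
    ("EL3 runtime services", EL.EL3),
    ("full boot loader", EL.EL2),
    ("kernel", EL.EL1),
    ("application", EL.EL0),
]


def privilege_never_increases(timeline):
    """Return True if no stage runs at a higher level than the one before it."""
    levels = [level for _, level in timeline]
    return all(later <= earlier for earlier, later in zip(levels, levels[1:]))


if __name__ == "__main__":
    for name, level in BOOT_TIMELINE:
        print(f"{level.name}: {name}")
    print("privilege only drops:", privilege_never_increases(BOOT_TIMELINE))
```

The check at the end captures the one property the timeline relies on: each stage hands over to a stage with the same or fewer privileges, never more.
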
In that scheme the ROM is BL1, so boot loader stage one. The SPL is usually named BL2, and the EL3 services are BL31: the third stage, but the first part of it, which runs in exception level three permanently while your kernel is also running on the system. It then starts what we know from the Arm 32-bit world as the boot loader, which is called BL33 here. Everything else is taken up by the Linux kernel, in the form of hypervisor and OS kernel, and finally we have an application layer, which is named US here, for user space.

Let's talk a bit about the first stage of our boot process. The first stage, BL1 in Arm Trusted Firmware speak, is the ROM code, and this is usually a mask ROM fused into your SoC. If it requires memory, it uses some small SRAM which is always available on your SoC. It implements vendor-specific storage access and next-stage loading. All the implementations I've seen so far have a specific firmware header which declares: I'm going to load from this address on the SD card. The firmware header contains further information about where the next stage is going to be, either on the same SD card or maybe on an SPI flash or somewhere else. This is very vendor specific and you need to consult the documentation about it.

ROM code usually also implements something like a USB upload or serial upload, so you can do development on devices. A very popular example of this is maybe the Nintendo Switch, which initially didn't have this mode disabled. That's why you can upload your own firmware to the very early release systems they sold on the market; later systems had this USB loading disabled. But for development this is really, really nice, because you can hack on the bootloader, and if it doesn't start you can reset your SoC and upload a working bootloader again.

For the ROM code, smaller is almost always better, because smaller means less size, and less size in the end means less cost for the vendor. We don't want to put unnecessary costs onto the SoC, so we try to make this as small as possible. The ROM code also implements sane default settings. It's going to enable some of the clocks, and it's going to enable some or all of the power, depending on the implementation, but it's going to be in a safe state. So it's probably going to use the minimum clock frequency, for example, to be stable regardless of what power is provided externally.

This stage then goes on to load the next part of your bootloader system, and that is the second stage, or BL2 in TF-A speak. This is either a part of the Arm Trusted Firmware, or it might be a U-Boot SPL, or it may be a barebox pre-bootloader. We don't know; it's totally up to the implementation. But they all have in common that they are loaded by the first stage using the vendor-specific header that the boot ROM reads out, and they also need to set up the RAM in some form or another. I know that older i.MX implementations, for example, contained the DDR setup within the firmware header, so there was no additional RAM training code that needed to be provided. But modern i.MX8 implementations use DDR4 and LPDDR4, and that requires vendor-specific code: they run training patterns live on the SoC to find the best values. In this case the training code needs to be embedded into your second-stage bootloader.

This stage then goes on to load the next stage, again from a storage medium, or it may be compiled into the binary already. That's up to you and your implementation. As I said before, the boot flows can differ from here. One option: we are a U-Boot SPL, we just start TF-A for the runtime services, and then we end up in U-Boot.
The same is also true for barebox on i.MX8: barebox starts the pre-bootloader, then starts up the runtime services, ends up in exception level two, so one exception level lower, and then starts the real barebox bootloader. Or you might be using TF-A as the second-stage bootloader and then go into U-Boot or barebox. I know that the Arm Juno reference implementation uses this bootloader scheme, but I don't know any other system which implements TF-A as BL2 at the moment, at least not on arm64.

Let's talk a bit about Arm Trusted Firmware. Arm Trusted Firmware is a framework to implement standard firmware services. We have PSCI and SCMI, which I'm going to talk about later, and it is also an exception level three secure monitor and silicon provider service router, which is going to be explained on the next slide. It can be used as BL2 and/or as a BL3 dispatcher, so it starts up all the different binaries which are required to run on the SoC. It starts the exception level three runtime services.
It starts up your secure world operating system or trusted execution environment, whatever needs to be started up there, and then it pivots into a normal world bootloader to start your Linux kernel. It's permissively licensed, which is very nice for vendors, because the source code of these SoC binaries doesn't need to be made available. NXP, at least on their i.MX8 series, also releases the source code, so you can do upstream work or add features to an upstream TF-A. But on Rockchip, as far as I know, we currently don't have the Arm Trusted Firmware source code, so we can't modify it. All we get is a pre-compiled binary, which obviously works, because we can start the system, but which makes debugging harder if we find out at some point that the bug we are chasing may be in the exception level three runtime services. So this license is kind of a double-edged sword.

TF-A provides silicon provider services. These are implementation- and SoC-specific services. As an example, on i.MX8M they expose their secure boot API this way, and they also expose DDR frequency scaling. So if you want your DDR to scale down while the system is not in heavy use and you want to save power, that's done by calling into the Arm Trusted Firmware using the Arm SMC calling convention. NXP also provides a way to sign your own binaries using their secure boot scheme. Telling the on-chip ROM code that you want to verify a specific binary blob you've signed before is also done using a silicon provider service, because only the highest exception level on the system is allowed to call into the ROM code; otherwise you get a fault and it won't work. This is the reason why we require these silicon provider services: the vendor has locked down the ROM code so that it can only be called from exception level three.

So how does this communication work?
There are two instructions which are used here: the SMC and the exception return instruction. We can see in the right-hand diagram that the normal world does an SMC, and it ends up either in the next higher exception level or, if the SMC isn't trapped via a register or via configuration, in the exception level three firmware. The firmware can then decide where the call is going: am I the one who provides the service, is this a call into the ROM code, or is it a call for the secure world operating system? It decides this based on the arguments supplied in the registers when doing the SMC. All the communication in the Arm SMC calling convention is done through registers. You put the call you want to make in the first register and extra arguments in the other registers, and then you do the SMC. Either the call works, and after the exception return from the higher exception level you get a result saying the call worked fine, or your call went wrong, and then it indicates an error. But there shouldn't be a case where your system hangs afterwards.

One of the next services is PSCI, the Power State Coordination Interface. PSCI was initially invented because every implementation has a different way of starting up a CPU. Some implementations need more register accesses, some need specific wait times, and implementing all of this in the kernel turned out to be rather tedious, because you had to go through the whole kernel review process. With TF-A you can now put your CPU start-up sequence into TF-A, and you don't have to worry about a review process in this case, because TF-A is permissively licensed and you can just ship the binary blob to your customers. In my opinion the review process of the Linux kernel is very justified and very much worth it.
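
To make the register-based convention a bit more concrete, here is a hedged Python sketch of how the 32-bit function ID placed in the first register is split into fields under the Arm SMC Calling Convention, and how a dispatcher could pick a destination from it. The `route` helper and `HYPOTHETICAL_SIP_CALL` are made up for illustration; real firmware does this dispatch in C inside TF-A.

```python
# Owning Entity Numbers (bits 29:24 of the function ID) as defined by the
# Arm SMC Calling Convention. Only a subset of the ranges is listed here.
OEN_NAMES = {
    0: "Arm architecture service",
    1: "CPU service",
    2: "SiP (silicon provider) service",
    3: "OEM service",
    4: "standard secure service (e.g. PSCI)",
}

PSCI_CPU_ON_64 = 0xC4000003         # PSCI CPU_ON, SMC64 variant
HYPOTHETICAL_SIP_CALL = 0xC2000001  # made-up SiP function ID, illustration only


def decode_function_id(fid):
    """Split a 32-bit SMCCC function ID into its fields."""
    return {
        "fast_call": bool((fid >> 31) & 1),  # bit 31: fast vs. yielding call
        "smc64": bool((fid >> 30) & 1),      # bit 30: SMC64 vs. SMC32
        "oen": (fid >> 24) & 0x3F,           # bits 29:24: owning entity
        "function": fid & 0xFFFF,            # bits 15:0: function number
    }


def route(fid):
    """Return a label for the entity that would handle this call."""
    oen = decode_function_id(fid)["oen"]
    if 50 <= oen <= 63:
        return "trusted OS"
    return OEN_NAMES.get(oen, "other / reserved")


if __name__ == "__main__":
    for fid in (PSCI_CPU_ON_64, HYPOTHETICAL_SIP_CALL):
        print(hex(fid), decode_function_id(fid), "->", route(fid))
```

The point of the encoding is that the EL3 firmware only needs to look at one register to know whether the call belongs to PSCI, to a vendor service, or to the secure world operating system.
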
For vendors, though, it's far easier to put this into the Arm Trusted Firmware. It is also a defined interface which is always available to the kernel. Almost all arm64 SoC implementations out there use PSCI for CPU on, CPU off, system sleep, or CPU idle of the individual cores. So it lifts some of the complexity out of the kernel into the Arm Trusted Firmware, and it's somewhat standardized on Armv8 systems. I know there are some Armv8 systems out there which don't implement PSCI or SCMI; the Apple M1 SoCs come to mind, they don't implement exception level three at all. But all embedded Linux arm64 systems I know of implement these via PSCI and the Arm Trusted Firmware. Usage on Armv7 is also possible: there are the STM32 processors, which use PSCI even on 32-bit Arm systems.

The next evolution in this series of lifting complexity out of the kernel into the Arm Trusted Firmware is SCMI, the System Control and Management Interface. It provides a discoverable API for the question: how do I turn on my CPU cores?
Beyond that, it also provides a discoverable API for clocks and power. The simple reason why that is required, and why we can't manage it on the kernel side alone, is that the normal world and the secure world may require the same clocks. As an example, say I have some kind of crypto accelerator on my system which is used from both the secure world and the normal world, and the Linux kernel decides: I'm not using the crypto accelerator at the moment, I might as well turn off its clocks to save power. The secure world may still want to use the accelerator. In this case the SCMI service within the Arm Trusted Firmware knows that the normal world doesn't need the crypto accelerator but the secure world does. So even if the normal world wants to turn off the clocks, it will not turn them off, because they are currently in use on the other side of the system. This again provides a simplified control interface for the Linux kernel. We only need to implement SCMI and the discoverable API, and we don't need to implement clock control and power control for every individual SoC. We just route everything through TF-A.

I have a short excursion here, because if we want to boot a Linux kernel, we need some way to describe what hardware is available on the system. And there are many, many slightly different SoC versions out there. You might have an i.MX8 processor which has four cores and all the usual peripheral hardware like media encoders, camera interfaces, accelerators and so on, but there's also a slightly different variant which has only two cores, so a dual core. The approach in the Linux kernel is not to hard-code this information into the kernel; it's done by writing device trees. Device trees are files which describe the individual hardware.

On the right-hand side we have an example. It has an include line at the very top, which we also know from the C world. In this case we include the whole SoC hierarchy of an i.MX8MQ SoC, and then we add our custom implementation details to it. The example was taken from the i.MX8MQ evaluation kit. It sets up a model name, so the board is easily identifiable if the user wants to look at the model name. It adds a compatible, so the kernel can do specific driver probing or apply specific patches for this evaluation kit hardware. Lower down we can see that we provide additional information for the FEC, which is the Ethernet controller. We set up some pin control handling, so which pins on the SoC are routed to which inputs of the FEC hardware IP. We also say that the PHY, which does the communication to the outer world, is connected via RGMII, and we provide a PHY handle. Lower down, under the MDIO bus connected to our Ethernet controller, we have an Ethernet PHY, and this is also described within the system. This is very useful, given how many slightly different SoC versions there are.
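
The role of the compatible property can be illustrated with a small Python sketch. It is a simplification under stated assumptions: the driver table and the compatible strings are illustrative only, and the real kernel matching logic lives in C and scores candidates rather than walking a dictionary.

```python
# Hypothetical driver table: compatible string -> driver name.
# The strings follow the usual "vendor,device" style but are illustrative.
DRIVERS = {
    "fsl,imx6sx-fec": "fec-driver",
    "vendor,some-uart": "uart-driver",
}


def pick_driver(node_compatibles, drivers):
    """Return the driver for the first compatible string that has a match.

    A device tree node lists its compatibles from most specific to most
    generic, so the first hit is the most specific driver available.
    """
    for compatible in node_compatibles:
        if compatible in drivers:
            return drivers[compatible]
    return None


if __name__ == "__main__":
    # A node for a newer SoC: no driver knows the specific string, but the
    # generic fallback lets the existing driver bind.
    fec_node = ["vendor,new-soc-fec", "fsl,imx6sx-fec"]
    print(pick_driver(fec_node, DRIVERS))
    print(pick_driver(["vendor,unknown-block"], DRIVERS))
```

Listing a generic fallback after the specific string is what allows a new board or SoC to work with a driver that was written before that hardware existed.
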
We also have shared components across SoC generations. As an example, the i.MX6 and i.MX8 UARTs, so the serial interfaces, are functionally totally equivalent. There's no need to write an additional driver for this, because we can reuse the existing one. We just add a new compatible. Maybe the block is slightly broken in a different way; then we can fix that in the driver via a tweak and still use the existing driver. That's one of the reasons why bring-up of newer SoC generations is often a lot faster than on older generations, where no previous driver was available. This is also the case for some of the media encoding hardware. As an example, on the i.MX8 there's a hardware encoder for JPEG which can also be found on some Rockchip SoC variants. This is because vendors are not building every block on an SoC themselves. A lot of the more complex parts of an SoC are bought from another vendor, so you end up with SoCs which implement the same hardware blocks, and on the Linux kernel side we can reuse the same drivers.

So finally, after going through the exception level three runtime services, we end up in exception level two. Here our boot loader, in this case barebox proper, provides additional services. None of the previous stages, as far as I know, implement anything like networking or NFS boot, which is very convenient when I'm developing on a system. The boot loaders also implement things like bootspec parsing, where we have a file on our file system or on NFS which describes which system a kernel is compatible with. The system can then just probe through all the entries in the bootspec file and decide: okay, I'm that system.
I have that compatible, so I'm just going to use that device tree, and everything works and boots up fine. The boot loaders also provide some kind of USB gadget or serial support. In this case barebox provides USB gadget support for serial, so you can start barebox, plug in a micro USB cable via USB OTG, and then just connect from your normal computer without needing any additional serial adapters. It provides mass storage, so to the outer world it just looks like a USB disk where you can drop files. Maybe you have a development kernel you want to upload to the system; then you can drop it onto the mass storage device and it shows up within barebox on your embedded system. We also implement things like fastboot, which is very convenient for uploading a kernel or a complete file system using sparse image support, which is then really nice and fast.

Let's talk a bit about kernel start. We are now in BL33, or in barebox. What do we need to do to start our kernel? The first thing is that we need to decompress the kernel. On older Arm 32-bit systems the kernel had an internal decompressor which was called before the kernel was started, but for arm64 it was decided, for simplicity, to move the decompression into the bootloader. So all modern bootloaders implement the usual decompression algorithms to decompress the kernel and then put it into a memory location which is defined in the kernel header. We also need to copy the device tree, in this case not the device tree description I showed before, which was the source description; we copy a device tree binary into memory. Then we need to mask interrupts, so our boot is not interrupted by a device receiving Ethernet data or a timer interrupt firing. And we need to initialize the standard Arm timer to a default value, but keep its interrupts off. We then put our kernel at an offset specified in the header.
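
As a sketch of that last step: an uncompressed arm64 kernel Image starts with a 64-byte header, described in the kernel's arm64 booting documentation, and its `text_offset` field is the offset in question. The Python snippet below parses such a header. The function names and the RAM base in the example are assumptions for illustration; a real bootloader does this in C.

```python
import struct

ARM64_MAGIC = 0x644D5241        # the bytes "ARM\x64" read as little-endian u32
HEADER_FORMAT = "<IIQQQQQQII"   # 64 bytes: code0, code1, text_offset,
                                # image_size, flags, res2, res3, res4,
                                # magic, res5


def parse_image_header(blob):
    """Extract the load-relevant fields from an arm64 Image header."""
    if len(blob) < struct.calcsize(HEADER_FORMAT):
        raise ValueError("blob is too short to hold an arm64 Image header")
    (_code0, _code1, text_offset, image_size, flags,
     _res2, _res3, _res4, magic, _res5) = struct.unpack_from(HEADER_FORMAT, blob)
    if magic != ARM64_MAGIC:
        raise ValueError("magic mismatch, not an arm64 Image")
    return {"text_offset": text_offset, "image_size": image_size, "flags": flags}


def load_address(ram_base, text_offset, align=2 * 1024 * 1024):
    """Place the Image text_offset bytes above a 2 MiB aligned base in RAM."""
    aligned_base = (ram_base + align - 1) & ~(align - 1)
    return aligned_base + text_offset


if __name__ == "__main__":
    # Synthetic header for demonstration; the values are made up.
    header = struct.pack(HEADER_FORMAT, 0, 0, 0x80000, 0x1000000, 0,
                         0, 0, 0, ARM64_MAGIC, 0)
    fields = parse_image_header(header)
    print(hex(load_address(0x40000000, fields["text_offset"])))
```

Checking the magic value first is a cheap way for a bootloader to refuse a file that is not an arm64 Image before it copies anything into RAM.
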
This offset is somewhat specific to the system, but it's usually very close to the lower bound of the RAM. We also need to disable the MMU if the bootloader enabled it before. Bootloaders usually do enable the MMU, because it's a lot faster to run with the MMU enabled, but it has to be disabled before kernel start. We also need to disable and flush the caches, so there are no stale cache entries before starting up the system. Then we need to initialize our CPU registers, either for EL2 or EL1, depending on which exception level we are actually in. At the moment, for Arm 64-bit, that's really easy: there's only one argument we need to put in register zero, and that's the address of our device tree binary, or device tree blob. The very final thing we do is jump into the kernel. Then our kernel starts up, probes the individual drivers, and goes through the usual start-up sequence.

And now we can pray that the Wi-Fi is kind to me and I can show you a live demo of what this looks like. But it doesn't look like it. Unfortunately, due to spotty Wi-Fi reception at the moment, it's not really possible to show the demo, because I'm connecting to a system at home which has the relevant information on it. So I'll stay with the slides. That's it as far as a general overview of the boot process goes, and I'll be very happy to answer any open questions now. Otherwise I have two other slides which go a bit into the secure world side of things. So if you do have any questions, please ask them now. I see a question over there. Do we have any microphones?

Can you hear me? Can you hear me now?

Yes.

Okay. How is the access to the CPU registers limited according to the exception levels? Is there some kind of limitation? Can you define which CPU registers can be accessed? I mean, which memory-mapped registers can be accessed from which exception level?

Yes, this is also vendor specific. For i.MX8 systems, or i.MX8M systems, there's something called the Central Security Unit, and it defines policies based on the exception level, or on whether your call is coming from the normal world or the secure world. This way a system can decide that only exception level three can access this memory-mapped I/O device. But it's SoC specific, so you need to consult your vendor's documentation, and if that's not enough, you have to talk to your FAE.

I was wondering why you need to run barebox at BL2 and not BL1?

BL1 is always... do you have the headphones on? Or you can email. Okay. BL1 is always reserved for the ROM code, so in theory we could put our...

Excuse me, sorry, I meant EL2.

Ah, I see, my mistake. So why do we need to run barebox at EL2? We don't need to run barebox at EL2. Our system may not implement EL2 at all; that's also possible. The Arm spec leaves a lot of freedom to the systems, and we can totally run barebox at EL1 as well. We are not limited to EL2 in this case.

Any more questions? If there aren't any more questions, thank you very much for coming to the talk.