I will try really hard not to ramble for the next 40 minutes about all the chaos that happens before Linux even gets involved. When I joined TI, one of the things that I really struggled with was just understanding the boot flow of the chips that we make. To be fair to myself, a lot of the material that you see online is focused on the x86 world. When you search for ARM-based boot flows or how these chips start, you really get, for lack of a better word, unconfidence-inspiring results. So the goal behind this talk is just to give a few search terms and links that might help you if you're planning on booting an ARM-based SoC.
With all that being said, the very first question you can ask is: what is a bootloader even doing? The simple TL;DR answer is that it does everything the name says it does. Its sole purpose is to go out, find a binary, and load it into the SoC so that we can run the chip. We can, and often do, split this up into multiple phases and files. That lets each phase focus on a specific task in booting the SoC while still staying within the restrictions the chip has while it's in the process of starting.
When you search online, one of the first things that you'll see is the x86 boot flow that's been with us for a very, very long time. The CPU resets to a known address, which is where the BIOS lives. The BIOS runs the power-on self-test. If you're old enough, you probably still have in your mind's ear the beep that all the computers made back then. This is just to ensure the chip has everything it needs to start, RAM and so on. After that, the BIOS walks through the boot order that you set up with F10, Escape, or Delete (probably the most chaotic thing here is trying to figure out which key to press), going through your boot media looking for a master boot record.
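As a rough illustration of that last step, here is a minimal sketch of how a BIOS-style loader might walk a boot order and pick the first device whose first sector carries the classic 0x55 0xAA boot signature in its last two bytes. The device names and the in-memory "disks" are made up for the example; a real BIOS reads sector 0 from hardware and does far more than this.

```python
from typing import Dict, List, Optional

SECTOR_SIZE = 512
BOOT_SIGNATURE = b"\x55\xaa"  # last two bytes of a valid MBR sector


def has_mbr_signature(sector: bytes) -> bool:
    """Return True if the 512-byte sector ends with the 0x55 0xAA signature."""
    return len(sector) == SECTOR_SIZE and sector[510:512] == BOOT_SIGNATURE


def first_bootable(devices: Dict[str, bytes], boot_order: List[str]) -> Optional[str]:
    """Walk the boot order and return the first device whose sector 0 looks like an MBR."""
    for name in boot_order:
        if has_mbr_signature(devices.get(name, b"")):
            return name
    return None


# Two hypothetical devices: a blank USB stick and a disk with a valid signature.
devices = {
    "usb0": bytes(SECTOR_SIZE),
    "sda": bytes(510) + BOOT_SIGNATURE,
}
print(first_bootable(devices, ["usb0", "sda"]))
```

The blank stick is skipped because its last two bytes are zero, so the walk settles on the second entry in the boot order.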
This master boot record points to GRUB, which holds our configuration for Linux. Essentially it has all the information to load the correct kernel and everything else. This flow has been with us forever, and that's probably why you see it online. There are new innovations happening with this flow, things like UEFI, but for a management-level overview, this is really the typical boot flow that we all know and love.
When we get to the ARM-based world, things are different, and things are changing. Chips that we work on routinely have gigabytes of DDR space now, and they run full distributions. Think Raspberry Pi, think BeagleBone. All of these mini computers basically have full-fledged Linux distributions running on them. These distributions want a way to standardize the handoff from the bootloaders to the distribution. They don't want to build a separate distribution for every single board, say one for Raspberry Pi 2 and others for Raspberry Pi 3 and 4. They just want one distribution that fits every board that implements the UEFI standard. So things like EBBR from ARM, which reuses a lot of the UEFI standard that we see in the x86 world, have started to make their way into chips. We see popular bootloaders like U-Boot actually starting to implement UEFI. Things are moving in the right direction.
These chips are also wildly expensive to develop. They cost many, many millions of dollars to design and manufacture, and they require an enormous amount of focused thought and attention from highly specialized people. So even though the implementation details of these chips may differ, they largely copy one another just so that they don't deviate from a known working formula.
Just imagine what would happen if we developed a chip only to find out it has a fundamental flaw that immediately eliminates any ability to sell it. ARM even has what they call the reference boot flow. This is basically maintained, open source code for each stage of the boot process, developed by ARM for all of their CPUs. We have BL1, which is essentially ROM code, and then BL2, which is the trusted boot firmware. Think of an SPL or GRUB at this point. Then, depending on the chip, you have BL31, which is the EL3 runtime services, and then BL32, BL33, and so on for your secure OS, your application OS, and stuff like that.
It's this creative freedom that ARM gives manufacturers that I consider to be the chaos, and it's also why it's a blessing. It's really up to manufacturers like us at TI to implement whatever we want. We can pick and choose any one of these things, and that's usually where most of the differences between all these chips come from. All this to say: even though no one is making the ARM manufacturers follow a standard flow, we end up loosely following the same thing, maybe out of habit or just as a savvy business decision.
So now, what does an ARM-based boot flow look like? You'll have to search around Google for quite a while, but the high-level management overview will look something like this. This is the kind of standard implementation that you'll see. The chip starts running ROM code that was etched into it at the fab when it was made. This code has the sole purpose of finding and loading the secondary program loader, or SPL. Typically this will be the SPL from U-Boot, which is a very popular open source bootloader, but you can use anything you want here: coreboot, whatever. This SPL is just a tiny binary designed to fit into the internal SRAM of the chip, and basically its sole goal is to configure DDR.
So essentially you get thousands of times the address space to load these binaries as they get bigger and as the chip gets more capable. Once we have DDR started, we can start loading in the bigger EL3 runtime services and U-Boot. I say they're massive, but they're megabytes in size; even so, at this level we're talking four or five times what we have in internal SRAM. Once we get to U-Boot, think of GRUB: it has all the configuration information to load everything we need to get Linux running.
Really, once DDR has been initialized and configured, the rest of the boot flow largely comes down to design decisions made to get the application that we want running on these embedded chips. For this example, we've added OP-TEE, which is a trusted execution environment. Think software TPM kind of thing. Once we've started the EL3 runtime services, they let us start up the secure OS, which jumps back to EL3, and then we can start initializing our application OS.
Here's another example, just to show how flexible we are. On a lot of the K3-based chips that we make at TI, we actually have another U-Boot SPL. After we've set up OP-TEE and jumped back to the EL3 runtime services to start initializing our application OS, we load in a U-Boot SPL. This keeps our binaries smaller. These are very heterogeneous chips: we like to put microcontrollers all over them. Keeping this stage small gives us the ability to load microcontroller firmware into the remote cores early, which keeps the boot time pretty low. Then the U-Boot SPL goes and loads U-Boot into DDR, which again has the configuration information to load Linux. It's the same flow, just with another little binary that we like to throw in there.
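The two flows described so far can be written down as plain data, which makes the one hard ordering rule easy to check: nothing may run from DDR until some earlier stage has initialized it. This is only a sketch of the talk's description. The `Stage` class, the `runs_from` labels, and the stage lists are illustrative assumptions, not anything taken from TI or ARM code.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Stage:
    name: str
    runs_from: str               # "ROM", "SRAM" or "DDR" (simplified labels)
    initializes_ddr: bool = False


def check_flow(stages: List[Stage]) -> List[str]:
    """Return the stage names, or raise if a stage needs DDR before it is ready."""
    ddr_ready = False
    for stage in stages:
        if stage.runs_from == "DDR" and not ddr_ready:
            raise ValueError(f"{stage.name} needs DDR before it is initialized")
        ddr_ready = ddr_ready or stage.initializes_ddr
    return [stage.name for stage in stages]


# The generic flow from the talk: ROM loads a small SPL, which brings up DDR.
GENERIC_FLOW = [
    Stage("ROM code", "ROM"),
    Stage("SPL", "SRAM", initializes_ddr=True),
    Stage("EL3 runtime services", "DDR"),
    Stage("OP-TEE", "DDR"),
    Stage("U-Boot", "DDR"),
    Stage("Linux", "DDR"),
]

# The K3 variant adds one more U-Boot SPL before U-Boot proper.
K3_FLOW = GENERIC_FLOW[:4] + [Stage("U-Boot SPL", "DDR")] + GENERIC_FLOW[4:]

print(" -> ".join(check_flow(K3_FLOW)))
```

Everything after the first SPL is interchangeable as far as this check is concerned, which matches the point that the later stages are design decisions rather than hardware constraints.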
Going back, again just to show how creative we can be: in simulation, which can be pretty slow when you're designing these chips, we can turn on a big application core, load the EL3 runtime services, and then just jump straight to Linux. We don't need to do all these shenanigans. So again, it's up to us how we want to do this.
Most of the interesting bits that actually come from design considerations or limitations of the chip happen in the first two stages. How smart the ROM code is, how much internal RAM the chip has, how capable the CPU running these phases is: these are all decisions made during the design process, and they're one of the real reasons why you see weirdness. If you think of Raspberry Pis, they load in a lot of binaries, whereas on K3 we don't do all that, but it's all trade-offs, all design decisions.
So far, this talk has been pretty abstract, and I don't do well with abstractions. I didn't feel like I truly began to understand what this chaos was and what it looks like until I started digging into our own code and tinkering with things. For me, this was the AM62 family of SoCs that we make. I call it low-end, but really it's our more power-efficient family of chips. So I thought it would be a great exercise to walk through its boot flow to get an idea of how this works in a real-life chip.
For a little background information, I'm trying to add a lot of links at the bottom of the slides that might help if you're getting into ARM. One of them is the TRM, the technical reference manual. It has everything you could possibly want to know about this chip.
The AM625 has four application cores, big Cortex-A cores that run our main operating system. It also has a lot of microcontrollers. There are a few in what we call the wake-up and MCU domains. Essentially, they're outside of the big application cores, but they're not inside something that we call the security subsystem. We have a few more cores in the secure subsystem; think of it like a secure enclave. As a little aside, it always makes me smile that we've managed to fit Cortex-A, Cortex-R, and Cortex-M cores all into one ARM-based processor. I don't know if they did that on purpose. It just made me laugh.
Back on track: we also have to work within the limits of the chip, which means we have 256K of internal SRAM for our SPL to load into. If you think about it, this is the average size of an NES game in the 80s. So this isn't a feature-rich SPL that we're talking about.
I glossed over this a few slides ago, but assuming we're running Linux, we should probably nail down what we're trying to do here. Again, we can do whatever we want, since we're tailoring this to an application, but there's a great document in the kernel describing exactly what Linux needs before we jump into it. Essentially, it boils down to this. We need a device tree. ARM lets us put anything anywhere on our register map, so we need a way to describe where all these devices are in that map so that the kernel knows how to go and grab a UART console, or how to start up PCIe, or any of the devices that we put everywhere on these chips. So we need a device tree that is loaded into DDR and given to Linux when it starts. This also means DDR needs to be started. And then we also need the uncompressed kernel inside of DDR.
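For a feel of what "a device tree loaded into DDR" means at the byte level, here is a small sketch that parses the fixed header at the start of a flattened device tree blob. The field order and the 0xd00dfeed magic follow the Devicetree Specification; the `parse_fdt_header` helper and the fake 40-byte blob are made up for the example, and a header-only blob like this is of course not a usable device tree.

```python
import struct

FDT_MAGIC = 0xD00DFEED
# The header is ten big-endian 32-bit fields, in this order.
FDT_HEADER = struct.Struct(">10I")
FIELDS = (
    "magic", "totalsize", "off_dt_struct", "off_dt_strings", "off_mem_rsvmap",
    "version", "last_comp_version", "boot_cpuid_phys",
    "size_dt_strings", "size_dt_struct",
)


def parse_fdt_header(blob: bytes) -> dict:
    """Decode the flattened device tree header, rejecting anything without the magic."""
    if len(blob) < FDT_HEADER.size:
        raise ValueError("blob is too short to hold a device tree header")
    header = dict(zip(FIELDS, FDT_HEADER.unpack_from(blob)))
    if header["magic"] != FDT_MAGIC:
        raise ValueError("not a flattened device tree")
    return header


# A fake, header-only blob purely to exercise the parser.
fake_dtb = FDT_HEADER.pack(FDT_MAGIC, 40, 40, 40, 40, 17, 16, 0, 0, 0)
header = parse_fdt_header(fake_dtb)
print(hex(header["magic"]), header["version"])
```

A bootloader does essentially this sanity check before handing the blob's address to the kernel, so a corrupted or missing device tree fails early instead of inside Linux.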
The 64-bit ARM kernel doesn't have the ability to decompress itself in memory on the fly. So if we're using compressed kernels, the bootloaders are going to have to do it for us. There are also a lot more implementation details depending on which architecture you've chosen. For us, we have Cortex-A cores, which have TrustZone, which also means we need the EL3 runtime services running to handle all the... ooh, I can't think of the acronym, but there are secure messages and system calls that we can send to the EL3 runtime services to route things properly. Again, there's a lot more on this list, but that's basically up to the individual chips to implement.
So, okay, we're finally here. First step, power on. What happens? Right after we flip the switch to apply power, before any CPU can be released to compute anything, we first need to ensure the chip is physically stable and in a secure state. This is essentially our POST check. These chips are designed to be everywhere, and they absolutely are: they're in toasters, they're in rocket ships. We have to make sure that they're in a stable state for that environment, and we can't assume that they will function properly without some type of checking. So this is where we make sure that the power rails are all powered and the incoming clocks are set to sane frequencies. We also set the boot mode pins. These pins inform the chip what kind of crystal we've connected to it and what kind of storage media we have, and that basically describes everything that the bootloaders are going to need to do.
This stage is really an electrical engineer's problem. The whole goal is just to make sure the chip is at a physically stable point. But it goes to show how tightly coupled each phase of this boot process is.
Even though we have multiple binaries and phases, each one relies on what the previous phase did and has to do what's needed to enable the next one. So when you're thinking about this, I've separated them all out, but they are highly coupled together.
After we've made it through this check, and assuming everything went well, we can finally bring up a CPU. However, we quickly run into our very first problem. Our bootloaders, with all the code to start this chip, are sitting out on remote media, and we have no way of getting to them. So we need some code to start the chip to load the code to start the chip. This is solved by our ROM loaders. A ROM loader is basically just a tiny piece of code that was etched into the SoC when it was made at the fab. It's designed by people in Dallas before the chip even physically exists. The whole goal of these ROM loaders is to look for a very specific binary in as many ways as possible. Think OSPI, UART, USB DFU. ROM has to have all these different boot modes compiled in and ready to go so that we can enable these chips on as many devices as humanly possible.
But it's a simple inverse relationship. The more space we use for these cool features in ROM, the fewer cool things we get to put in the chip. If ROM is megabytes in size, we lose some accelerators, some crypto accelerators, some UART instances. So we have to keep it in balance, and that balance is why you see so many different chips doing so many different things.
However, ROM is the absolute worst place to find a bug, because we can't make any changes to it. And, well, it's going to have a bug. For example, think about all the ways that booting over an Ethernet interface could go wrong. These bugs are instant free rides to large stages at DEF CON, because the only way to fix them is to actually remake the chip. This is not an option we at TI want to take.
So this is why the first CPUs that we actually start are in the secure subsystem. Right after pre-initialization, the first CPU to be unlocked is in the secure subsystem, and it runs ROM just like our ROM loader will. It's technically ROM, etched into the chip at the fab just like everything else, but it has a completely different purpose. Because this is another bit of ROM, we can't change it. So we just go with the assumption that ROM has a bug in it and that we need to protect the chip from this faulty ROM code.
It does this by setting up a few systems. We set up a watchdog timer to reset the chip if our boot loading is going too slowly or gets stuck. Think about a crypto miner hacking this so that our bootloaders are just mining Bitcoin instead of actually booting the processor. We start up the crypto accelerators, so that we have the ability to authenticate the binaries that we load in. We want to make sure that we're actually running the right stuff. Then, to pass these binaries into the secure subsystem, we need something like the secure proxy and ring accelerators, just so we have a way to pass information in. And finally we set up firewalls. They ensure that we load the binaries into the places that we expect on the chip.
They're called firewalls officially, but in my head I call them bear traps. Essentially, they kill the bus instead of just blocking requests. When you think firewall, everyone thinks network appliance, where this write will just not happen. But in reality what happens is you go out and touch a piece of the register map that you're not allowed to, and your bus dies. The only way to get it back is to reset the device.
Finally, once everything is in place, we can start up another microcontroller outside of the secure subsystem to run our official ROM loader.
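The "bear trap" behaviour is easy to model in a few lines. This is a toy sketch of the semantics described above, not TI's firewall hardware or any real API: the `Firewall` class, the `BusDead` exception, and the address ranges are all hypothetical.

```python
from typing import List, Tuple


class BusDead(Exception):
    """Raised when the bus has been killed by an illegal access."""


class Firewall:
    def __init__(self, allowed: List[Tuple[int, int]]):
        self.allowed = allowed      # (start, end) address ranges, end exclusive
        self.bus_alive = True

    def access(self, address: int) -> bool:
        """Allow the access, or kill the bus if the address is off limits."""
        if not self.bus_alive:
            raise BusDead("bus is down until the device is reset")
        if not any(start <= address < end for start, end in self.allowed):
            self.bus_alive = False  # the trap snaps shut
            raise BusDead(f"illegal access at {address:#x}")
        return True

    def reset(self) -> None:
        """Only a device reset brings the bus back."""
        self.bus_alive = True


# Hypothetical window where the loader is allowed to place binaries.
fw = Firewall(allowed=[(0x4000_0000, 0x4004_0000)])
fw.access(0x4000_1000)              # fine
try:
    fw.access(0x9000_0000)          # outside the window: bus dies
except BusDead as err:
    print(err)
```

The difference from a network-style firewall is the second check in `access`: after one violation, even addresses that were previously legal fail until `reset` is called.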
For all of TI's K3-based chips, our ROM code is designed to run on a Cortex-R core. This was a design decision made back when the K3 architecture was first implemented. R cores are really rock stable. There aren't going to be any Spectre- or Meltdown-style problems with these cores, so it's a really firm foundation to start the chip from so that we can start building up.
Depending on the boot mode pins that we locked in during the pre-initialization phase, ROM will set up and configure all the clocks, power domains, and devices that we need to load our SPL from the selected storage media. This is the hard part, because when you think of SoCs, we have clock trees and power trees and all these things, and we don't get the luxury that the kernel has of firmware being loaded already. We have to keep all of this in scope, so it's kind of like working on bare metal. There are a lot of things going on here, but much of it is just like what the BIOS is doing when it searches for a master boot record on our storage device. Once ROM finds and loads our SPL into RAM, it uses the secure proxy and ring accelerators set up by the secure subsystem to ask for the binary to be authenticated.
However, what is this binary that we're looking for? The binary that ROM is looking for is called tiboot3.bin. For the security subsystem to be able to authenticate our SPL, it needs a few more things than just an SPL. In reality, ROM has loaded something that looks like this: a variable list of binaries, all appended after an X.509 certificate. You can build this using the k3-image-gen repository. Essentially, it's just a helpful script to get everything packaged up so ROM will be happy with you.
Once the binary has been loaded in, we have our first bit of firmware. It's called TI Foundational Security, or TIFS. Essentially, when the secure subsystem starts, we don't want a lot of ROM code.
So ROM just starts the accelerators that we need to authenticate this tiboot3.bin. TIFS has everything else needed to enable all of the secure subsystem services, so that we can authenticate the next binaries, a few slides from now, in more flexible ways.
The X.509 certificate at the start of tiboot3.bin acts much like the table of contents of this file. It has all the offsets and links for all the binaries. It also has the hashes of these binaries, just to make sure that we're loading the right thing. And then it has configuration information. The secure subsystem uses this X.509 certificate to compare against the keys etched into the chip. We can etch keys into these chips so that the only binaries they will run actually came from our desk. It's a way to keep things sane.
Finally, there's a standard U-Boot SPL. It's inside tiboot3.bin, but it runs on the microcontroller outside of the secure subsystem. It doesn't have to be a U-Boot image here. I probably put U-Boot everywhere, but it's just the convention that we on the Linux team in Dallas use. The job of this SPL is to set up everything we need to start our big application cores. It has to stay within our 256K limit, so there's not a whole lot we can do here, but it does need to start DDR so that we have the address space to load in our big binaries. I mean megabytes big, but whatever. This is also the first bit of code we run that we actually compiled at our desk, so it's a great opportunity to start up serial consoles. Think of "I got here" printf statements. Once you see prints, you know that you at least got out of ROM, and so you've made ROM happy, which is always a great feeling. I don't want to go into too much detail here, but if you're really interested in this stuff, there is the board_init_f function inside of U-Boot.
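The "table of contents plus hashes" idea can be sketched without any real certificate handling. In the example below, the table is a plain Python list instead of an X.509 certificate, there is no signature check against fused keys, and the `Entry`, `build_image`, and `find_mismatches` names are hypothetical. It only shows the hash-per-component check, using SHA-512 as an assumed digest.

```python
import hashlib
from typing import List, NamedTuple, Tuple


class Entry(NamedTuple):
    name: str
    offset: int
    size: int
    sha512: bytes


def build_image(components: List[Tuple[str, bytes]]) -> Tuple[List[Entry], bytes]:
    """Concatenate components and record where each one lives and what it hashes to."""
    table, payload = [], b""
    for name, data in components:
        table.append(Entry(name, len(payload), len(data), hashlib.sha512(data).digest()))
        payload += data
    return table, payload


def find_mismatches(table: List[Entry], payload: bytes) -> List[str]:
    """Return the names of components whose bytes do not match the recorded hash."""
    bad = []
    for entry in table:
        chunk = payload[entry.offset:entry.offset + entry.size]
        if len(chunk) != entry.size or hashlib.sha512(chunk).digest() != entry.sha512:
            bad.append(entry.name)
    return bad


# Placeholder contents standing in for the SPL and the TIFS firmware.
table, payload = build_image([("spl", b"spl code"), ("tifs", b"tifs firmware")])
print(find_mismatches(table, payload))                  # nothing tampered with
print(find_mismatches(table, b"X" + payload[1:]))       # first byte of the SPL changed
```

In the real flow, the table itself is what gets signed, so an attacker cannot simply recompute the hashes after modifying a component.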
That function lives in the arch/arm/mach-k3 folder in U-Boot proper, and it describes everything the SPL does to set things up here.
After our microcontroller has found tiboot3.bin and used the secure subsystem to verify the binaries that it just loaded, the secure subsystem is going to reset our microcontroller again. If the authentication went poorly, say the binaries didn't match their hashes or the certificate didn't match the keys inside the chip, the secure subsystem resets this microcontroller back into ROM and has it look on the backup media for another tiboot3.bin. Essentially, this is going down the boot order, just like in an x86 flow. However, if things went well, the secure subsystem reads the configuration information in the X.509 certificate and resets us into the SPL that we just loaded.
After the SPL has configured everything, the next binary that we need to start our big application cores has to be loaded in. This means that we have to go out and find another file. So what is this file? It's our tispl.bin. Basically, it has everything that we need to start our application cores, or at least a few things. One of the interesting ones is DM firmware. This is device management firmware; sometimes you'll also hear it called resource and power management firmware. When we make our jump from our microcontroller to start up our application cores, we're leaving an otherwise dead microcontroller behind. So this is a great opportunity for us to load this firmware onto it to provide some abstraction services.
So remember the clock trees and power trees and all those implementation details of these chips? We can put all of them in this DM firmware, which allows us, in Linux at least, to just make calls to this remote core saying I want this clock at this speed, or I want this device turned on, without having to worry about the clock trees or power trees or anything. We just ask for it and it's done. The DM firmware abstracts a lot of these implementation problems away for us.
All right, a quick aside. This is a popular view that you'll see on the internet. It's basically the exception levels that these big application A-core CPUs can be in. I thought it would be better to show it in this view rather than just showing where all these binaries are in DDR space. When the SPL first turns on our big application core, the CPU will be at exception level three, which is on the bottom. It's the most privileged level, and it's reserved for low-level firmware and security code. EL2 is for hypervisors, if you like hypervisors; we don't use them. EL1 is for kernels, and EL0 is for our applications. So when we first start this core, it will be at EL3, which is the perfect time to start our EL3 runtime services.
This entire talk I've been saying EL3 runtime services, but they do a little bit more than just provide runtime services at that exception level. During a cold boot flow like this one, they also help us initialize the architectural details of these chips. Things like register settings, reset and power operations, platform data, exception vectors, a lot of workarounds: all of this happens here.
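The exception-level picture can be captured in a few lines. The sketch below encodes the four levels and the architectural rule that an exception return can only stay at the same level or go to a less privileged one, while getting back up requires one of the call instructions (SVC to EL1, HVC to EL2, SMC to EL3). The `exception_return` helper and the table of "typical software" are illustrative, not taken from any firmware.

```python
from enum import IntEnum


class EL(IntEnum):
    EL0 = 0
    EL1 = 1
    EL2 = 2
    EL3 = 3


# What typically runs at each level, following the talk's slide.
TYPICAL_SOFTWARE = {
    EL.EL0: "applications",
    EL.EL1: "OS kernel",
    EL.EL2: "hypervisor",
    EL.EL3: "EL3 runtime services / secure monitor",
}

# Instruction used to call up into each higher level.
CALL_INTO = {EL.EL1: "SVC", EL.EL2: "HVC", EL.EL3: "SMC"}


def exception_return(current: EL, target: EL) -> EL:
    """Model an exception return: it may keep or lower the level, never raise it."""
    if target > current:
        raise ValueError(
            f"cannot return from {current.name} up to {target.name}; "
            f"use {CALL_INTO[target]} instead"
        )
    return target


# A core released at EL3 can be handed down to a kernel-level stage...
level = exception_return(EL.EL3, EL.EL1)
print(level.name, "->", TYPICAL_SOFTWARE[level])

# ...but an application cannot simply return its way into the kernel.
try:
    exception_return(EL.EL0, EL.EL1)
except ValueError as err:
    print(err)
```

This is why starting at EL3 is convenient: it is the one point where firmware can configure every lower level before handing control down.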
For the TI chips that we're walking through, we've used a microcontroller all the way to this point, which means we have to set up everything here. In certain boot flows where you start with a bigger ARM core, BL1 will set up a few of these things that we can then override in the EL3 runtime services, but that's just implementation details.
Once the core has been properly initialized, if we're using a secure world OS like we are, EL3 will then worry about setting up the secure world. OP-TEE, the open portable trusted execution environment, is a TEE implemented in software on top of ARM TrustZone technology. A rough idea is a software TPM with a little bit of enforced isolation provided by the CPU. It gives us a sort of root of trust for our applications, which we can reach through secure monitor call instructions. Once OP-TEE has set itself up, it jumps back into the EL3 runtime services to start working on our application OS. And since OP-TEE is now set up, we can jump back into it later.
In our K3 example, we're then going to have another SPL. That SPL will load up the microcontrollers if we want it to, and then it loads U-Boot. U-Boot has all the configuration information that we need to load the device tree and kernel, and we can manipulate the device tree all we want before we jump into Linux. And with that, we have made it into Linux.
So thank you, everyone. I hope this was a little bit helpful. I know some veterans here might have been a little bored. I published all the slides on SourceHut. I'm the type of person that works on this at the very last moment, so there's probably a massive number of spelling mistakes.
If you see any, please send them to me. Does anyone have any questions?
We have one online question. For the K3 processor, you said tiboot3.bin needs to be signed with an X.509 cert. Can we change the keys that the processor accepts, or does this binary always need to be signed by TI?
That's a good question, and it comes down to implementation details. When you etch keys into these chips, they become high-security chips, so there are certain things that they require in the boot flow. We also have general-purpose chips that don't have keys etched in and have a different boot flow. They don't technically need these X.509 certificates, but we're moving towards high-security flows. So it's essentially more of a yes and no, I guess. In high-security flows, we do use these keys. You can use a degenerate key, which is like a TI key that everyone knows, and use that for the X.509 certificate. Or you can add in customer keys, so that the chip will only run the binaries that you build at your office. I hope that answers the question.
Probably adding to this question, how do you prevent the key from being read out of the system?
Ooh, that's a good question. These are etched into the secure subsystem, and essentially the TIFS services take care of all that. But I don't know the implementation details of how reading them out is prevented. I'm assuming TIFS has set up a firewall, but that's me speculating at this point.
Yes, as part of this configuration information, you can actually have encrypted binaries. You can pass in how they're encrypted and how to decrypt them, and then the secure subsystem will know what to do with that. And then you can just keep following that chain.
I guess that's everything. Thank you, everyone. It's kind of cool.