Hello everybody. This is Ajay Kumar. I welcome you all to this session delving into the Linux boot process for an ARM SoC. Today, my colleague Tiago Ramalingam and I, from FDF Software Solutions, Samsung Semiconductor India Research Bangalore, will take you through the Linux boot process on an ARM SoC. Regarding the topics covered: we will start by describing the basic ARMv8 SoC architecture and a few of the SoC components, including the internal memories. Then we will see the role of the bootloader in booting a kernel: setting up and initializing the RAM, fetching and copying the kernel images to main memory, decompressing the kernel image, and how a kernel image looks after decompression. We will see how the bootloader prepares the SoC environment for jumping into the kernel, how secondary CPUs are specified, how the actual jump from bootloader to kernel happens, and the kernel routine which runs after entering the kernel. Later, my colleague Tiago will take you through the architecture-independent kernel starting point, which is start_kernel, and discuss process creation, initialization of the memory management data structures, scheduler initialization, IRQ initialization, and the rest of the boot-up steps. Before going into the actual session, we will make a few assumptions. ARM has both 32-bit and 64-bit architectures, but in the interest of time we will stick to the 64-bit architecture, mainly ARMv8. We will assume we have only a single guest OS, which is Linux, so we will not be using the hypervisor feature. The bootloader nomenclature like BL0, BL1, etc. is for explanation only; it is not meant to be strict and can vary based on the type of bootloader you are using. Also, boot on any complex SoC involves not only the ARMv8 blocks; there may be other microcontrollers supporting the SoC boot, but those are beyond the scope of this talk.
Let's start by describing a simple ARMv8 SoC. The figure depicts a typical big.LITTLE ARM SoC. The main components of this SoC are the CPU cluster, the main memory and the memory controller, the system buses, and other DMA-capable and non-DMA-capable subsystems which are necessary for the functioning of the SoC. Not depicted in this picture are a few of the critical power management and clock management blocks. In this session, we will mainly concentrate on two blocks: the CPU cluster and the main memory. Apart from the main memory, the SoC will have other types of memory, namely the ROM, which is read-only memory, and the SRAM, which is volatile memory. The ROM usually contains the reset vectors, which are executed upon reset release of the Cortex-A block. This ROM code prepares the SRAM block by initializing it, after which the SRAM can support execution of C routines. The ROM code, as we said, is executed upon reset release of the Cortex-A block, and since it is the first thing to be executed, it executes in the secure EL3 mode. This ROM code initializes the storage or flash memory and also the SRAM, and copies the bootloader image BL0 from storage to the SRAM. After copying, it also sets up a stack in the SRAM to support the execution of C routines in BL0. The bootloader BL0 starts executing from the SRAM and initializes a few of the core clocks, the power domain for the DRAM block, and the clock for the memory controller. The clock for the memory controller will usually be based on the memory speed, and BL0 is expected to initialize all of the main memory available in the system. It can do so using static information present in the code, or through some handshaking mechanism which detects the memory on the fly. After BL0 has initialized the main memory, we can load bigger images like the bootloader BL1, which can execute from the main memory.
BL1 actually initializes most of the system components, and also copies the binaries needed for the Linux boot, like the DTB, the kernel Image, and the ramdisk, to the main memory. In BL1, you can have interrupts enabled, and you can also initialize a few subsystems, like the display for showing a boot-up logo, or provide a sound acknowledgement using the sound subsystem. Also, either BL0 or BL1 should keep a secure monitor code in place for handling secure accesses from the kernel in the future. The bootloader image BL1, executing from the main memory, copies the DTB, Image, and ramdisk to main memory. Now what are these images? The DTB is the device tree blob, which is a compiled version of the device tree source, which in turn is a description of the hardware on the SoC or the board. The Image is the actual kernel binary. The ramdisk image is a minimal root filesystem loaded before mounting the actual Linux root filesystem, and it might be required for executing a few initial Linux startup scripts. Let's discuss more about the device tree blob. A device tree is a description of your device hardware in a format Linux can understand. A special compiler called the device tree compiler takes the device tree source as input and converts it into a device tree blob. Your device tree represents your hardware by providing the memory-mapped addresses, IRQs, GPIOs, clocks, regulators, etc. which are needed by each device. It can also go on to describe the SoC by defining the CPU and memory nodes and other things. Before we jump into the kernel, the bootloader must select the appropriate device tree blob for the board being booted and pass it as an argument to the kernel. We will see how the address of the DTB is passed to the kernel in the next few slides. But what we should remember here is that the DTB should be within 2MB in size and kept at an address which is 2MB aligned.
This is because the kernel might map it as a single 2MB block with cacheable attributes. Unlike 32-bit ARM Linux, 64-bit ARM Linux doesn't have a kernel decompressor for decompressing the kernel image. So whenever you compile the kernel for 64-bit ARM, you will get two kinds of images: the uncompressed Image and the compressed Image.gz. In case you are storing the compressed version on the storage, the bootloader has to take on the responsibility of decompressing it and placing it into the main memory; it really is the bootloader which is supposed to do this. In case you don't want to spend time decompressing, you can always place the uncompressed Image into memory directly. The decompressed kernel image for a 64-bit ARM kernel has the following 64-byte header. The code fields hold the startup text of the compiled binary. The field text_offset, which has become obsolete now, once represented the image load offset from the DRAM base. The image_size field represents the effective kernel image size, stored little-endian. There is also a flags field, in which four bits are significant as of now. Bit 0 indicates whether the kernel endianness is big-endian or little-endian. The next two bits represent the page size or page granularity which will be used inside the kernel, namely 4K, 16K, or 64K. Bit 3 is a hint for the physical placement of the kernel: if it is zero, the 2MB-aligned kernel base address should be near the base of DRAM; if it is one, the 2MB-aligned base of the kernel image can be anywhere inside the main memory. This picture shows a sample dissection of a kernel image header dump. As you can see, this represents the code fields, and this represents the image size and the obsolete text_offset.
You can see bit 0 is zero here, and this 0x0a is 1010 in binary, which means a little-endian kernel with 4K pages, and bit 3 is one, which means the 2MB-aligned base of this kernel image can be anywhere in memory. Also, the magic number is nothing but the ASCII value for "ARM" followed by 0x64 for 64-bit. Now that we have placed the kernel image and other necessary binaries in the main memory, what remains is setting up the jump environment. ARM64 Linux mandates that a certain boot protocol be followed before jumping into the kernel. Mainly, any DMA-capable device has to be disabled so that active DMA does not corrupt the main memory. Coming to the primary CPU, which is executing the bootloader code: before jumping into the kernel, the x0 register of the primary CPU should contain the physical address of the device tree blob, and x1, x2, x3 should be zero. Likewise, for the secondary CPUs, all of x0, x1, x2, x3 should be zero. All forms of interrupts on the CPUs must be masked using the DAIF flags. For all the CPUs, the MMU must be off and the data cache has to be disabled, while the instruction cache may be kept on or off; but the instruction cache must not hold any stale entries corresponding to the address range of the loaded kernel image. Coming to the architected timers: timers at the different exception levels have to be initialized before jumping into the kernel. And all the CPUs which are booted into the kernel must fall into the same coherency domain, which is the inner shareable domain, as mandated by ARM64 Linux. Any platform-related configuration needed to enable this has to be done by the bootloader before jumping into the kernel. Also, all writable architectural system registers inside the CPU have to be initialized by the software running at a higher exception level than the kernel, for example EL3.
And all the requirements which we spoke of till now, like the CPU register contents, the state of the caches, the condition of the MMU, the architected timers, the coherency conditions, and the system registers, apply to all the CPUs. Remember, all CPUs should enter the kernel at the same exception level. The primary CPU, which is executing the bootloader, anyhow jumps directly to the first instruction of the kernel image. How the secondary CPUs are booted during the kernel boot is specified in the enable-method DT property inside the CPU nodes. The most commonly used method is the PSCI enable method, the Power State Coordination Interface, where the kernel during the course of boot-up will issue CPU_ON calls for the respective CPUs as described in the PSCI specification. On the ATF side, the secure monitor will take care of powering on those CPUs internally. There is also another method called the spin-table method, where we also need to specify a cpu-release-addr property, and the secondary CPUs spin outside of the kernel in a reserved memory area specified by the cpu-release-addr property. Once the bootloader BL1 has performed all the necessary SoC initialization, like initializing the clocks, the power domains, and the other subsystems needed for the kernel boot, and is done preparing the jump environment, it will eventually jump to the kernel. Let's take an example of bootloader jump code to see how the jump instructions are performed. As you can see in the preconditions before the jump, it is disabling the CPU interrupts in this line, and it is also disabling the MMU and flushing the caches over here. It prepares the kernel entry point, where it embeds the starting address of the kernel. And at the end of these instructions, we see the bootloader jumping into the kernel and the first kernel instruction getting executed.
At this moment, if you look at a snapshot of the hardware, you can see that only the primary CPU 0 is active, and it is in EL1 mode; the CMU and PMU for the CPU clusters have been configured for all the CPUs; the MMU is off and the data cache is off, while the instruction cache may be kept on; and all the necessary binaries are placed at their respective addresses, adhering to the respective alignment and other placement constraints. You can also see we have a secure monitor in place in order to respond to any of the PSCI calls from the kernel. primary_entry is the name of the assembly routine which is entered for the ARM64 kernel upon exit from the bootloader code. This routine is defined in the file arch/arm64/kernel/head.S. Let's briefly discuss what primary_entry does before calling the start_kernel function. First, primary_entry calls the subroutine preserve_boot_args, which preserves the arguments passed by the bootloader in the registers x0 to x3. If you remember, we had passed the address of the device tree blob in x0, which it copies into the x21 register. Then it calls the subroutine init_kernel_el, which does some setup based on the exception level we arrived at, whether EL1 or EL2, and stores that level in the register w0. After this, if the kernel has KASLR enabled, some setup is done for kernel address space layout randomization, and then the CPU boot mode we stored in w0 is recorded as the boot CPU mode for later usage. After this, the page table creation subroutine is called, where we set up the initial page tables required for kernel initialization. As you can see, we create an identity mapping for the MMU enablement code, which usually sits at lower addresses and needs to be set in TTBR0; the page table responsible for handling those translations is created at idmap_pg_dir. A linear mapping for the first few MB of the kernel is also created in this subroutine, at init_pg_dir.
After this, it calls the CPU setup subroutine, which further initializes the processor for turning the MMU on. It starts off by clearing the TLB, then sets the sizes for the virtual and physical addresses to be used in translation, and also checks whether certain memory management features can be enabled. It does so by setting the translation control register and the system control register. The next subroutine to be called is __primary_switch, which sets the page table addresses prepared in the previous slide, idmap_pg_dir and init_pg_dir, for the lower and higher addresses respectively. Since the page tables are assigned already, we can enable the MMU, so we enable the MMU with the required page granularity configured. If KASLR is enabled, we perform the KASLR processing here. The last subroutine called is __primary_switched, which assigns the vector table for the EL1 level, clears the BSS segment of the kernel, sets up a stack for the kernel, creates a mapping for the FDT address, and eventually calls the C routine start_kernel. From here on, my colleague Tiago will be taking over. Thank you. Thank you, Ajay. The kernel is always entered by architecture-specific code, but then execution is passed to the start_kernel function, which is responsible for common kernel initialization and is the architecture-independent kernel starting point. start_kernel executes a wide range of initialization functions. It sets up interrupt handling, further configures memory, starts the init process, which is the first user-space process, and then starts the idle task via the cpu_idle function. Notably, the kernel startup process also mounts the initial RAM disk, that is, the initrd. init_task represents the initial task structure, which stores all the information about process 0. Process 0 is statically defined; it is the only process that is created neither by kernel_thread nor by fork.
The set_task_stack_end_magic function finds the end address of the stack and inserts a magic value there for overflow detection. The stack can either grow down or grow up; either way, the end_of_stack function returns the end address. Next comes the first processor activation. The boot_cpu_init function initializes the various CPU masks for the bootstrap processor. The kernel uses cpumasks to record the state of the CPUs: a cpumask provides a bitmap representing a set of CPUs in the system, where a bit value of zero or one represents the state of the corresponding CPU under that mask. The status of a CPU can basically be divided into the following four types: CPU possible, CPU present, CPU online, and CPU active. The setup_arch function is responsible for the initial machine-specific initialization procedures. This includes setting up the machine vector for the host and determining the location and sizes of available memory. The setup_arch function also initializes a basic memory allocator called bootmem for use during the boot process, and calls paging_init to enable the host's memory management unit. Many devices have buffers which a driver can write to or read from; this is achieved by mapping the control registers and the memory of a device into the memory address space. None of the memory-mapped I/O addresses are usable by the kernel directly. There is a special ioremap function which allows us to convert a physical address on a bus into a kernel virtual address; in other words, ioremap maps I/O physical memory regions to make them accessible from the kernel. We also need to initialize early ioremap support for early initialization code which needs to temporarily map I/O or memory regions before the normal mapping functions like ioremap are available. The parse_early_param function parses the kernel command line and sets up different services depending on the given parameters. The kernel initializes its paging mechanism through paging_init so that the kernel's virtual address space is initially established.
And the mapping of physical addresses into the virtual address space can be completed. In the ARM64 architecture, the kernel completes the initialization of memblock through the arm64_memblock_init function and then initializes the paging mechanism through the paging_init function. paging_init is responsible for establishing page tables that can only be used by the kernel; the user space is inaccessible through them. PSCI is a firmware interface implementing CPU power-related operations specified by the ARM PSCI spec, which includes CPU on, off, suspend, etc. The PID hash initialization sets up the kernel's PID hash table, a lookup table used by the kernel for quickly mapping process IDs to process descriptors. The kernel then also initializes the softirq vectors and bottom-half handlers used by its various internal timers. Next, SMP on an ARM SoC. A symmetric multiprocessing system is a multiprocessor system with centralized shared memory, called main memory, operating under a single operating system with two or more homogeneous processors. Most of the SMP code in the kernel tree is not architecture dependent. A few SMP functions related to SoCs are the following. The smp_init_cpus function sets up the set of possible CPUs via set_cpu_possible; it is called very early during the boot-up process, from setup_arch. The smp_prepare_cpus function enables coherency, initializes the CPU possible map, and acquires resources like power, RAM, clocks, etc.; this is also called very early during the boot-up process, before the initcalls but after setup_arch. The smp_secondary_init function performs platform-specific initialization of a specific CPU; it is called from secondary_start_kernel on the CPU which has just been started. The smp_boot_secondary function actually boots a secondary CPU identified by the CPU number given as a parameter; it is called from __cpu_up on the booting CPU.
The init_IRQ function initializes the GIC controllers; it scans the device tree for matching interrupt controller nodes and calls their respective initialization functions. The time_init function initializes the host's system tick timer hardware; it installs the timer's interrupt handlers and configures the timer to produce a periodic tick. Now, the rest_init function. The start_kernel function initializes dozens of kernel subsystems and ends by calling the rest_init function. rest_init, in its turn, spawns the very first user-space process through the kernel_init function; its process ID is 1, and it will become the direct or indirect ancestor of all user-space processes. It also spawns the kthreadd process, normally with process ID 2, which is the parent of all kernel threads. Finally, it runs the cpu_idle function, a process that takes over the CPU whenever there is no other process using it. The kernel_init function will start any additional CPUs. If an initial RAM disk is defined, it will decompress and mount it. Then it loads the device drivers, mounts the root filesystem in read-only mode, and finally calls the init process. Thank you.