Hi everyone, thanks for joining. This is Keerthi from Texas Instruments, and I will be presenting today's session on tweaking the Linux boot flow for an accelerated ADAS experience.

A little bit about us: both Brijesh and I joined Texas Instruments back in 2007-2008, so we are about 15 years in the same company. Brijesh has mostly worked on real-time operating systems and camera vision use cases; he is the expert in that. I have predominantly worked on Linux power management, thermal management and boot time optimizations.

I'll start off with the agenda for today. First I'll explain the actual problem statement that we get from our ADAS customers. Then, before diving deep into the software layers, I want to touch upon the SOC hardware architecture, which is typical for any ADAS SOC. After that I'll dive into the software that runs on each core, starting with vision apps, which I'll explain in some time; the default boot flow; the time it takes to boot to an ADAS use case with the default behavior; and how we tweak the actual boot flow. I'll cover the impact of the boot media on each phase of the boot, followed by the optimizations at the various boot stages, including the bootloader, Linux and the file system. Then I'll touch upon how we had to redesign the ADAS use case to maximize parallel processing across the various cores. I'll wrap up with results and a generic summary of how to accelerate ADAS use cases.

On the problem statement: basically, it is to get the ADAS use case up and running as soon as the car starts. The target time is as short as possible, of the order of two to three seconds. Linux is expected to be the one that drives that and controls all the hardware accelerators and the processing cores. Of course, we have heterogeneous cores to execute in parallel. 
Some of the most commonly used ADAS use cases that need early boot are: 360-degree surround view, to help you get out of a parking spot; auto valet parking, to actually find a parking slot and get into it; the early camera use case, for the backup/reverse camera; early display; and camera mirror systems, the CMS systems. Typically we get boot time optimization requests on these use cases.

Okay, before I jump into the software side of things, I just wanted to touch upon a typical ADAS SOC. This is the TDA4VM from the Jacinto family of SOCs. At the top level there are green boxes: a main domain, a wakeup domain and an MCU domain. These are called voltage domains, each fed by an individual supply. The main domain hosts all the compute cores: the big ARM cores (a dual-core Cortex-A72), the C7x DSP, the C6x DSP and some R5 control cores. The MCU domain hosts the dual-core R5 subsystem, which is actually the boot master. And there is the secure wakeup domain, which hosts the security controller, basically a Cortex-M3, and a bunch of peripherals. You can see that there are a lot more peripherals that get controlled and used by these cores.

So with this introduction, let me quickly jump into the deeper details. For an ADAS use case we definitely need some sort of real-time computing, so we use real-time firmware based on FreeRTOS. Linux mainly runs on the application ARM core, and there is a secure core that runs the foundational security firmware. Mapping this to the exact SOC view: I'll start off with the MCU domain, which hosts the dual-core MCU R5F, typically run in lockstep mode. This is our boot master; it is the one that is brought out of reset by the ROM code. And then there is the Cortex-M3, which is our security controller; it runs the TI foundational security firmware. 
And there are a bunch of cores in the main domain: the R5F control cores, the ARM A72 cores, the C7x DSP and the C6x DSP. So the main domain typically hosts all the compute cores. The MCU R5F is the startup core and is there for safety; it runs in lockstep mode. The M3 is the secure monitor, and all the rest are compute cores. We can see that apart from the A72, which runs Linux, and the M3, which runs the security firmware, all the others mostly run an RTOS, which is FreeRTOS, for vision acceleration.

Before I get to vision apps, a quick run-through of the primary responsibilities of Linux, which runs on the A72. All the storage is handled by Linux: the MMC/eMMC, which hosts the root FS, is controlled and initialized by Linux. The CPSW IP, the networking IP we need for the real-time Ethernet use case, is also controlled by Linux. And the most important part is that Linux has the remoteproc drivers to accomplish IPC communication with the other cores I mentioned, the C6x and C7x DSPs and the R5F. Linux also controls the graphics accelerator, so the graphics driver is a module in the Linux system. The console, which gets all the application logs, is also controlled by Linux. And most important, Linux has the trigger for the ADAS application: the start of the application is triggered by Linux. So that is a quick run-through of the Linux role.

The Cortex-M3 pretty much controls anything and everything related to foundational security on the TI SOC. The M3, being a relatively small core, does not have the compute power to do complex cryptographic operations like AES, SHA-512 and so on, so we have a dedicated security accelerator IP, a hardware accelerator for all the cryptographic functions. So one of the key roles of the M3 is foundational device security: secure boot. 
For any high-security (HS) devices, we use authentication to boot the boot binaries, and that authentication is done by the M3. It also plays a pivotal role in anti-rollback protection, wherein we can make some of the older bootloader versions obsolete and program the eFuse with the latest version, so that no security attack can be mounted from previous versions of the boot binaries. And it also supports derived key generation for third-party stacks. So that is the TI foundational security.

Now let's come to the actual vision application. So what is vision apps? Vision apps is a mid-level framework which sits between the drivers and the applications, and which can realize pretty much any vision-related application. I'll start off with the bottom-most row, which has all the cores that I spoke about before: the control R5F, the C6x DSP, the C7x DSP and the A72, which runs Linux. Right above that are the operating systems that run on each of these cores. Since these are mostly for the vision acceleration use case, three of them run FreeRTOS; the A72 is like the master, so it runs Linux. On top of the OS, each of them has a flavor of IPC driver so that the cores can talk to each other. On the R5F, since it is a real-time control core, all the time-sensitive peripherals like the camera capture and display drivers are owned by the R5F. The C6x and C7x are known for their parallel processing capability, so all the deep learning algorithms run on the C7x and C6x. TIDL is the acronym for TI Deep Learning; that is where all the deep learning algorithms run. And all of this is wrapped up by a common mid-level OpenVX layer, which is vision apps for TI. 
And on top of that, any customer can develop their own applications, and the deep learning models can run using vision apps. On the Linux side, any applications and fusion algorithms can run.

Now a bit about why vision apps. As I explained before, vision apps is a mid-level framework, based on TIOVX, TI's implementation of the standard OpenVX framework for TI platforms. This framework supports heterogeneous architectures: it integrates different components in the form of a directed acyclic graph, a DAG, and realizes the system use case of our vision processing. The biggest advantage of vision apps is that it supports pipelining: multiple cores can run multiple functions in parallel, so that we get maximum utilization of the hardware. It also supports and manages real-time peripherals like the cameras and the displays; there is very good support for this. And between different layers we need buffer management, since we keep passing buffers from one function to the other; vision apps gives a very simple provision for passing buffers from one layer to the next. It also allows running multiple dependent and independent components on the same core.

I'll take a simple example of one directed acyclic graph. Here we can see that the camera is feeding in a frame in RGB format, which is taken as input by the color conversion function. In OpenVX terminology, any function that runs on a core is called a kernel, not to be confused with the Linux kernel, so I'll call it a function. The color conversion function takes the RGB frame as input; this runs on the C7x DSP. The converted YUV frame is passed on to the channel extract function, and the extracted grayscale frames are then fed to the image pyramid, which runs on a specific TI hardware accelerator called the vision pre-processing accelerator, or VPAC. 
The output then gets passed on to optical flow, and finally the Harris tracking algorithm runs to track the key points in the image, with the output rendered on the display. So typically a vision application like 360-degree surround view or auto valet parking will be a composition of multiple such graphs running in parallel and seamlessly on multiple cores.

The other advantages I've mentioned on the next slide. We have already integrated a basic surround view application, plus other deep learning applications like auto valet parking, camera mirror systems and early camera use cases. It already has a sample implementation of auto exposure and white balancing for ISP processing. There is very good support for vision apps on FreeRTOS and SafeRTOS for all the remote cores, the C6x, the C7x and the R5F cores. It also supports HLOSes like Linux and QNX as the host on the A72 or A53.

Some more highlights on vision apps: it is very good at scheduling the directed acyclic graphs. Based on the availability of the hardware accelerators and the different cores, vision apps can split the various functions in a graph and execute them in parallel on different hardware cores, taking advantage of parallel execution. As mentioned earlier, it gives a very good abstraction of memory management functions; it abstracts out cache maintenance and the reuse of scratch memory. One thing I mentioned was that it can split multiple functions across multiple cores based on the availability of hardware accelerators; the converse is also true, wherein if a graph is big and needs to run on one core or one hardware accelerator, the entire operation can be merged into one kernel or function. This improves memory locality and removes the kernel launch overhead. And the other advantage is that it enables data tiling. 
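The graph-scheduling idea can be pictured with a toy model in plain Python. This is not the real OpenVX/TIOVX API, and the core assignments below are illustrative assumptions loosely following the example graph, not the actual TIOVX placement:

```python
# Toy model of a vision-apps style DAG: each "kernel" (an OpenVX function,
# not a Linux kernel) declares the core it runs on and its upstream
# dependencies; the scheduler walks the graph in dependency order.
graph = {
    # kernel:          (core,    depends_on)
    "capture":         ("R5F",   []),
    "color_convert":   ("C7x",   ["capture"]),
    "channel_extract": ("C7x",   ["color_convert"]),
    "image_pyramid":   ("VPAC",  ["channel_extract"]),
    "optical_flow":    ("C7x",   ["image_pyramid"]),
    "harris_tracking": ("C6x",   ["optical_flow"]),
    "display":         ("R5F",   ["harris_tracking"]),
}

def schedule(graph):
    """Topologically sort the DAG: a kernel is ready once all inputs are."""
    done, order = set(), []
    while len(order) < len(graph):
        for name, (core, deps) in graph.items():
            if name not in done and all(d in done for d in deps):
                done.add(name)
                order.append((name, core))
    return order

for name, core in schedule(graph):
    print(f"{core:>5}: {name}")
```

With pipelining, successive frames occupy different stages of this chain at the same time, which is how the framework keeps several cores and accelerators busy simultaneously.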
Instead of acting upon one big image, we can take advantage of the data cache size and the local memory by splitting the image into small tiles; the functions then execute on the small tiles, making better use of the cache.

With that, let's look at the boot flow. Fresh out of the SDK, the boot flow is somewhat like this. As I mentioned, the ROM code brings the MCU-domain R5F out of reset first. It loads the secondary program loader, or SPL, of the U-Boot world; it could also be an RTOS-based bootloader, which can be more lean and mean. The MCU R5F is the boot core, which brings all the other cores out of reset; for convenience I have only shown the main-domain R5F and the A72 here. The R5 SPL first brings the A72 out of reset, loading the ARM Trusted Firmware and then the A72 SPL, and then it loads onto itself the firmware it needs to run for vision applications; this is the device manager firmware. The A72 SPL then loads U-Boot. U-Boot, the standard bootloader for Linux, then loads all the other core binaries, for the C6x, the C7x and the main-domain R5F. It typically takes about two seconds to get to U-Boot and load all the other cores. Once the other cores are loaded, U-Boot loads the Linux kernel, and we get a full-fledged file system in about 20 seconds. Just to remind you, all of this is being done from MMC/SD, which is probably the most common storage medium for both the boot media and the root FS. Once the final file system is up and running, we run the surround view application, which starts initializing the cameras, the display and so on, and that takes about another five seconds. So typically it takes about 20 to 25 seconds to get to the use case. This default flow is of course not optimized and takes a long time, but it is designed to showcase all the hardware: here more or less all of the peripherals are initialized in series. 
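As a back-of-the-envelope summary of that default flow, here is a small sketch in Python; the per-phase numbers are the approximate figures just quoted, not measurements:

```python
# Approximate default (SDK out-of-the-box, MMC/SD) boot budget, in seconds.
# The split between phases is an illustrative assumption within the
# 20-25 s total quoted in the talk.
default_phases = {
    "ROM + R5 SPL + ATF + A72 SPL + U-Boot":        2.0,
    "U-Boot loads remote cores + Linux + full root FS": 18.0,
    "surround view app start + camera/display init":    5.0,
}

total = sum(default_phases.values())
print(f"time to running ADAS use case: ~{total:.0f} s")
```

Almost all of the optimizations described next attack the second and third rows of this budget.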
And U-Boot is taken as the choice of bootloader because it is very flexible and has a command-line interface; we can do a bunch of hacking around U-Boot to accomplish whatever use case we want. So this is the typical boot flow. It is of course not the best optimized time; this is just the state of the art before we start optimizing.

Once we start optimizing the boot time for any ADAS use case, the choice of boot media is very important. To begin with, there is the MMC/SD card. This is cost effective and very flexible: you can remove it any time and just replace the binaries. But it is probably the slowest, so it is definitely not recommended for early-boot use cases. The next option is eMMC. This is pretty fast when it comes to boot time, and it is typically a large memory, of the order of 8, 16 or 32 GB. So if there is provision for only one storage device that is both the boot media and the root FS host, this is the preferred choice; it is pretty fast at booting as well. The final one is the OSPI or xSPI flash. Among the three, this is the fastest for booting any binary on the SOC. Its memory footprint is pretty small, around 64 MB, sometimes 128 MB, so the root FS definitely cannot be hosted on it. A customer who can afford two flash devices can choose OSPI as the boot media and eMMC as the root FS host. So eMMC or OSPI will do well under boot time constraints.

There were multiple layers in the default boot flow that I explained: the R5 SPL, the ARM Trusted Firmware, the SPL running on the A72, U-Boot and then Linux. All of these take time and come with some cost. So instead of going through the full stack, we can optimize wherever necessary. Instead of the R5 SPL loading the ATF and then the A72 SPL, we can directly load the Linux kernel from the R5 SPL, so that we bypass the U-Boot and A72 SPL phases. This saves us a lot of time. 
And instead of U-Boot loading all the remote core binaries, we can make the R5 SPL, the secondary bootloader, directly load the remote core firmware. And once Linux loads, you can directly start your application. This is what we call Falcon mode, wherein we bypass U-Boot and load Linux directly from the SPL stage.

When it comes to optimizing the bootloader, we should be well aware of the UART speed. If you have any console prints, the first thing to do is disable all of them, because the UART is a very slow peripheral and adds time to the boot. Going back: here we are loading the Linux kernel and also the remote cores. The first thing to do in the R5 SPL is to load all the remote core firmware, the DSPs and the R5s, as soon as DDR is initialized; DDR initialization is mandatory for Linux to come up anyway. So once DDR initialization is done, you can start all the remote core firmware. And typically, if the bootloader is well optimized and small in size, we can start loading the DSPs and the R5s in as early as a 100-millisecond ballpark. That is using OSPI as the boot media to speed up booting versus MMC/SD. In the overall scheme of things, these bootloader optimizations save about 1.5 to 1.7 seconds in the bootloader phase.

I'll jump on to the optimizations we did on the Linux side. In case there are peripherals that are never used in the system for this use case, we should go ahead and disable those drivers and configs, so that we get a very lean and mean kernel, which saves time in actually loading the kernel. And again, similar to the bootloader, there is a kernel command-line parameter called loglevel; we can set it to zero by default so that none of the logs come out on the UART. So all of the UART logs are suppressed. 
This saves a pretty significant amount of time: for a normal-sized kernel that prints about 200 to 300 lines before we get to the command prompt, we save anywhere between 2 and 3 seconds. And, as I mentioned on the boot media slide, we can host the entire file system on eMMC instead of MMC/SD, eMMC being the faster peripheral.

Now let's move to the device tree side of things. There are, again, a lot of devices that are not really used in this use case, so we can disable all of those device tree nodes. And we should give priority to the loading of the C6x, C7x and R5F cores: basically, load the highest-priority cores, the ones with the boot time demand, first. If MMC/SD is enabled, please disable it; it adds a lot of time. And finally, switch the file system from SD to eMMC. These optimizations together can save us about six to seven seconds in getting to the file system.

Moving on to the file system related optimizations: typically, Yocto-based full-fledged file systems are of the order of 2 to 3 GB, so loading the file system itself takes a lot of time, and we should optimize the size of the file system. But if we go all the way down to a very bare minimal BusyBox kind of file system, we really don't have the library functionality needed to run something as complex as the vision apps use case. So we take the best of both worlds: we copy the needed libraries from the full-fledged file system into the tiny root FS and come up with something like a hybrid file system, which is the bare minimum for executing that particular ADAS use case. The hybrid file system ends up around 200 to 300 megabytes, versus the 2 to 3 GB of the original file system.

The other portion of the file system optimization is to not go through the entire SysV-init style default initialization of file system services. 
Instead, we can write our own simple init script, which does exactly what is needed for this ADAS use case. Rather than booting all the way through the standard init to the command prompt, we use an init script that does only the minimal tasks: mount the necessary file system folders, and load the modules needed to run that particular use case, which is typically the R5F remoteproc, the C7x remoteproc and the graphics driver, which Linux uses to realize the use case. Since a lot of file system services run by default, taking about eight to ten seconds to get to the Linux prompt, this is one of the bigger pieces of time we saved in the file system.

As I explained earlier, vision apps is what actually does all the compute on the C7x and C6x and the camera processing. By default, the flow was to wait for Linux to come up before initializing the camera, so the camera was getting delayed, and camera initialization is a very time-consuming operation, as the camera is connected over I2C. So what we did was tweak the use case a bit. In the default flow on the R5F control core, which starts the camera, the UART was initialized first, then the IPC, and then the I2C that is needed by the camera. What we did was prioritize the camera initialization, because it takes a lot of time: we directly go ahead and initialize I2C and then start initializing the camera, which is about 1,000 to 2,500 I2C writes; that is why it has to be prioritized first. Then we can go ahead and do the rest of the things. It is just a matter of recognizing what is on the critical path and prioritizing it early in the boot. 
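To give a flavor of those camera-register I2C writes and why batching them matters, here is a sketch in plain Python that just builds the raw message payloads; 16-bit register addresses are assumed, as is common for camera sensors, and the register values are illustrative:

```python
# Build I2C write payloads for a camera sensor with 16-bit register
# addresses. Writing one register per transaction costs a full
# start/stop sequence (plus any guard delay) for every 3-byte payload.
def reg_write_payload(reg, value):
    """Single-register write: [reg_hi, reg_lo, value]."""
    return bytes([(reg >> 8) & 0xFF, reg & 0xFF, value])

def burst_write_payload(start_reg, values):
    """Multi-byte write: one transaction covering contiguous registers."""
    return bytes([(start_reg >> 8) & 0xFF, start_reg & 0xFF]) + bytes(values)

# 1000 contiguous registers: ~1000 transactions versus one burst.
values = [0x5A] * 1000
one_by_one = [reg_write_payload(0x0100 + i, v) for i, v in enumerate(values)]
burst = burst_write_payload(0x0100, values)
print(len(one_by_one), "transactions vs 1 burst of", len(burst), "bytes")
```

Guard delays between transactions multiply the per-transaction cost, which is why cutting the transaction count dominates the camera-init savings.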
On the imaging front, since surround view needs about four cameras, and the four cameras use I2C to get their registers programmed, there is something called I2C broadcast that can be used, wherein all four cameras get the same commands and get initialized in parallel. Typically there is a serializer sitting in between, which is also connected over I2C, and all of these devices have guard delays just to avoid register values being written incorrectly; we could optimize away a lot of the delay in the I2C register programming. And there is something called multi-byte transfer on I2C: when the register addresses are contiguous, instead of sending one byte at a time, we can use multi-byte transfers and program on the order of a thousand registers at a time, since the register addresses are very much contiguous. These optimizations reduce the camera initialization time from anywhere between 3 and 4 seconds to about 600 to 700 milliseconds. So at pretty much every possible stage of executing the use case, we are trying to cut down on the time.

So this was the final timing we could achieve. The bootloader comes up in about 300 milliseconds; after that, about another 400 milliseconds, so by 0.7 to 0.8 seconds we are able to get Linux starting, and then the init script loads the different modules and mounts the file system. Camera initialization, as I said, takes more time, so it took about another 500 milliseconds, and then bringing up the surround view application was about 550. So overall, we could get that use case from 20 seconds down to about 2 seconds. We were using the 5.10 LTS kernel, and we have also migrated to the 6.1 LTS now; the time is anywhere between 2 and 3 seconds to realize a use case as complex as surround view.

So in the optimized boot flow, the R5F directly loads the foundational security firmware; it also loads the ATF, the device manager binary onto itself, and Linux. 
So directly from the ATF we jump to Linux, bypassing the U-Boot and A72 SPL phases. Linux is also very minimal, with the reduced drivers and reduced device tree blobs, so only the required drivers, the remote core drivers and the GPU driver, probe. Then the BusyBox root FS, the hybrid root FS, is mounted, and the surround view application starts running. So the time that was about 20 seconds was reduced to almost 2 seconds with the optimizations we did.

I want to conclude with generic guidelines on how to optimize an ADAS use case. Before getting to the software optimizations, as I mentioned on the boot media slide, it is very important to pick the right boot media, one that is fast enough; the typical recommendation is OSPI/xSPI or eMMC flash. Always use a multi-OS strategy instead of one core running everything in serial: use as many cores as there are, and run a real-time OS or some flavor of OS on them to get execution in parallel. Optimize the bootloader to load the critical firmware and software first, before the others get loaded. If you have vision apps or a similar framework, exploit its ability to run pipelines as much as possible. For vision use cases where a camera is mandatory, please prioritize the camera initialization; it takes a lot of time to initialize cameras because there are a lot of I2C or SPI writes, and these are slow peripherals. And wherever possible, don't use software algorithms: there are multiple hardware accelerators on each ADAS SOC, so offload the compute to these hardware accelerators; that will also reduce the boot time. Optimize any guard delays added around the I2C peripheral writes. And design your use case in such a way that things are always running in parallel and nobody is waiting for anyone else; that avoids any delays. Okay, I think I'm done with my slides. Thank you. 
Open for questions.