Good afternoon, everybody. I'm glad to see you all here. One observation I was able to make: not a lot of young people here; probably this is a little bit too low-level stuff for youngsters. But anyway, it's good that somebody joined me for this presentation. I'll be talking about U-Boot on a quite memory-restricted platform, and I call this "U-Boot bootloader for IoT platform?", with the question mark, so we'll see if it is feasible. So let me start then.
First, an introduction. My name is Alexey Brodkin. I work for Synopsys as an engineering manager, leading a quite small group of engineers doing development of open source projects for the ARC architecture. And even if we are not talking about Synopsys, personally I use open source software a lot, in my spare time and doing my day job. Whenever I face a problem I try to debug it, fix it, and then contribute my fixes upstream. That's how I got involved in a couple of open source projects, and some of them are listed here. The most notable of them are the Linux kernel and Buildroot, and recently I started to contribute to OpenEmbedded. But regarding this presentation, what is important is that I made a port of the U-Boot bootloader for the ARC architecture a couple of years ago, and since then I've been maintaining and enhancing it.
So that's our agenda for today. I'll start with an introduction of the target platform I'm going to talk about. Then I'll explain what I had to do to strip down the memory footprint so that my U-Boot build is usable on that platform. Then I'll talk a little bit about how to execute U-Boot from ROM, because by default this is not what usually happens. And in the end I'll go through two issues I faced during the first execution of my U-Boot port on that board. So let's start then. U-Boot is basically a bootloader, and its main reason to exist is to load some real payload and start it on the board.
Typically that payload would be some kind of OS. It might be an RTOS or, what at least here we may think of, a Linux kernel plus rootfs and all that. Initially I added support for the AXS101 board, which was quite an interesting development board; basically that's it, but with a different CPU. What was important is that it had a 600 MHz CPU with a lot of memory; you see it mentions 2 GB of DDR. Then I added support for that board with a slightly different CPU, which runs a little bit slower, but still I saw that it is more than enough for running the U-Boot bootloader. And then last year we got the HSDK, which is even more powerful: it is a quad-core ARC HS38 CPU running at 1 GHz, and it has even more memory. That is a very typical situation where we use U-Boot. It is a so-called single-board computer, and we use U-Boot there every day.
But there is such a thing as the law of the instrument, also known as the law of the hammer, which says that if all you have is a hammer, everything looks like a nail. And that happens to me, in the sense that whatever new board I have on my desk, I try to port U-Boot to it, execute it and see if there is some use for that. So that's what happened: I was given a new board with the idea to use a bootloader and package that into a real product that we are going to ship to our users.
That's a tiny board. As you may see from the connector, it is about 3 by 5 cm or something like that. It is called the IoT Development Kit, and it is meant to be used as a platform for software development and debugging in areas like sensor fusion, voice recognition, face detection and so on. The idea of using a bootloader was to allow users to run their own applications without additional debugging tools; you don't need to use JTAG whatsoever.
You just build your application, put it on an SD card or on USB storage, and plug that into the board. You may see that there is a USB OTG connector, and there is an SD card slot as well, but you probably don't see it because it is on the other side. And you may just run your application without additional tools, which at least for me looks quite nice, because you don't need anything extra. That basically meant we needed a bootloader which supports these peripherals and supports the FAT partitions usually used by developers on Windows machines, so here I think U-Boot is quite a good match.
Now, speaking about things which are important for U-Boot: we see here that the CPU runs at 150 MHz, and from my experience with my previous boards I know that 150 MHz is more than enough to start U-Boot so fast that you don't even notice. What differs a lot from the previous boards is the amount of memory which is available on this board.
So what do we have here? First is eFlash. This is quite an interesting flash memory which we may use for direct code execution, right from it. That's good, and it is relatively big compared to all the others, but the problem is that you cannot write to a random address in this flash. To write there, first you need to erase a page, which means you cannot use it as random access memory. So we may use it for code storage, but we cannot use it for runtime variables and things like that.
Then we have ICCM, which stands for Instruction Closely Coupled Memory. This is an interesting on-chip memory which is mostly supposed to be used for code storage, and we may execute code from there. But nobody actually stops us from putting data there as well, so for us it is pretty much normal random access memory, which is again relatively large here. Then there comes SRAM. SRAM is SRAM: it is on-chip random access memory, but it is quite a bit smaller. And then here comes DCCM, which is pretty much the same as ICCM, but this is Data Closely Coupled Memory.
And there is one but very important difference between them. DCCM is not connected to instruction fetch, which means we cannot execute code from that memory. So this is random access memory, but it is not very flexible. What I'm going to do on that board is this: I will use the flash memory, which is not usable for writing data, as storage for our code, constants and initially even data, before it gets copied, and I will use DCCM for the data which is used by U-Boot. That gives us the possibility to let the user use ICCM for loading their applications and starting them from there, and if they want they may use SRAM as well. But anyway, once you load some application with U-Boot you don't need U-Boot any longer, and you may reuse DCCM as well if you want. Also we have SPI flash. It is good for storing something, but again this is not RAM, so it is not usable for that.
As for peripherals, on this board we have plenty of them, but in terms of U-Boot what is important is the SD card controller, which is the DesignWare Mobile Storage controller; USB OTG, which is the DesignWare USB OTG controller; and a standard UART. So that's nice.
So why use U-Boot here? Again, we have support for pretty much everything here. You may see a lot of architectures, including ARC. We have support for different peripherals. We have support for the DesignWare UART. We have support for the DesignWare USB OTG controller, and we have support for the SD card as well, so nothing to do here. We have support for different file systems, and the FAT file system is there as well, so again nothing to be done. What else? It supports networking protocols. We don't have networking on this board, but anyway, that's an interesting benefit. Keeping all that in mind, it allowed me to add support for this board, to create a U-Boot port for it, literally in a couple of minutes. You may see here the changelog of what I had to do, and I tried to highlight a couple of things. So what is important?
You add the board here, you add the device tree description here, and you add the configuration here. That is the minimal stuff you need to do anyway for any board. Everything else here is only required if you need to implement some per-board stuff, which I had to do, but anyway this was quite simple.
Now, I said that it was a working port. Well, it could have been working, with one important difference: we are not actually fitting in the memories that we have. As you may see, the resulting binary would be of size 400k, and we cannot squeeze it into the 256 kilobytes of ROM that we have. That basically means we need to do something about it, and we are going to work towards shrinking the memory footprint of that U-Boot build.
So what may we do? The simplest thing is to just go and tweak our configuration. Thanks to the fact that for quite some time already U-Boot has used Kconfig, the same configuration utility which is used in the Linux kernel, it is a matter of firing up make menuconfig, going through the options and disabling stuff that we don't need. For example, we don't have an Ethernet controller, so we remove networking. We are not going to start any operating system, so we get rid of that as well. We are not going to load stuff through the serial console, because we have peripherals which are much more convenient. And we are not going to load ELF files, because to load an ELF file you first need to load the ELF file into one memory and then extract sections from it into another memory. We don't have enough memory to play with all that. We expect that the user will load a binary and just execute it. So we've got all that removed, and already you may see we have a somewhat different number here. The resulting binary is smaller, but still we are not there yet.
So what else may we do? We may try to squeeze out a little bit of size with the help of the toolchain. I realized that something was missing in our port of U-Boot, and what it was.
It was that interesting feature where we ask the compiler to put each and every variable and function in a separate section, and then we ask the linker to throw away those sections which are not referred to from any other section. It is a very simple thing, but for some reason I didn't do that. Well, I know why: previously I had way too much memory, so I didn't think about it. So now I added the corresponding options in CPPFLAGS and LDFLAGS, and if you want, you may click on that commit ID if you download the slides, and you'll see how it was done. Then I realized that most of the other architectures already do that, except ARC. For some reason another architecture where it is not done is MicroBlaze. I'm not sure why; probably the toolchain doesn't support that, because support is required, or probably it was never needed because they always have a lot of memory. With that nice thing I was able to shrink the size by another 5%, and we are getting even closer, but still we are not there yet.
The interesting thing for me was immediately this: you see the BSS, which is more than 100 kilobytes. BSS is basically uninitialized, statically allocated buffers. So what is there of that size? Okay, let's do some analysis and see what that could be. With the very simple nm utility I was able to figure out that there are two buffers of 64 kilobytes each. That was not very good, because I was very limited in memory. When I started to look at them, I realized that they are of the size of this variable, which is defined here. Fortunately for me it was another configuration option, so I went back to the configuration utility, just set another size, and boom, I was able to save 100 kilobytes. Already that allowed us to put the resulting image in our ROM, and we no longer need a lot of memory for data either, because it is 7k here and 80k here.
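The nm analysis described above can be sketched as a short script. This is only an illustration and not what was shown in the talk: it assumes text in the style of GNU `nm -S --size-sort` (address, size, type, name per line), and the symbol names, addresses and the `largest_bss_symbols` helper are made up for the example.

```python
def largest_bss_symbols(nm_output, min_size=4096):
    """Return (name, size) pairs for BSS symbols of at least min_size bytes,
    largest first. Expects lines of the form 'address size type name'."""
    found = []
    for line in nm_output.splitlines():
        fields = line.split()
        if len(fields) != 4:
            continue  # skip blank lines and symbols listed without a size
        _addr, size_hex, sym_type, name = fields
        if sym_type in ("b", "B"):  # nm marks BSS symbols with 'b' or 'B'
            size = int(size_hex, 16)
            if size >= min_size:
                found.append((name, size))
    return sorted(found, key=lambda item: item[1], reverse=True)


def saving_if_shrunk(symbols, new_size):
    """Bytes saved if every listed buffer were reduced to new_size bytes."""
    return sum(size - new_size for _name, size in symbols if size > new_size)


# Hypothetical sample, not real U-Boot output.
SAMPLE = """\
00001000 00000400 T some_function
20000000 00010000 b cluster_buf_a
20010000 00010000 b cluster_buf_b
20020000 00000100 b small_state
20020100 00000040 d init_table
"""

big = largest_bss_symbols(SAMPLE)
print(big)
print(saving_if_shrunk(big, 4096) // 1024, "KiB saved with 4 KiB buffers")
```

With two 64 KiB buffers reduced to 4 KiB each, the sketch reports a saving of 120 KiB, which is in the same range as the roughly 100 kilobytes mentioned in the talk.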
A little bit of background. That initial value of 64k is on the safe side, because in theory a FAT file system may have such a big cluster, but in reality I don't think there are many cases where it is larger than 4k. So 4k is actually quite a safe setting; at least I was quite happy with it. But still, by default it is 64k, and if you have a gigabyte of memory you don't really care.
So it looks like we were able to significantly reduce our memory footprint: you see, from 422 kilobytes to less than 200 kilobytes, which is quite impressive. What helped us? The most significant result we got came after analysis of our resulting binary. I used just size and nm here, that's what I showed, but during my experiments I also used bloat-o-meter quite a lot. That's a very nice utility which allows you to compare two different ELF files, and it will show you which symbols became bigger or smaller and how much memory you were able to save; for example, it will show you that now you are plus 15% compared to what you were before. Another thing which helped was being practical: I just removed all the crap that I didn't want to use. And with some tweaks of the toolchain I was able to save some space as well.
What could also be tried is LTO, link-time optimization. I tried to play with it a little bit, but for that I needed to make a few more changes to the U-Boot build system, because link-time optimization requires using the GCC driver all the time, even when we invoke the linker or the archiver or anything else. That requires more changes, I didn't expect to gain a lot here, and I was already achieving what I wanted, so I just decided that it will probably be the next exercise.
So we are pretty much done, but there is another thing we want: we want to execute U-Boot solely from ROM, and this is not what usually happens. Usually U-Boot relocates itself. Probably not all of you know about that, so I will try to show you what happens there.
First we load U-Boot into some initial location. It might be ROM, it might even be the start of RAM, or some other location, and we load the entire image there. Then U-Boot gets copied to the very end of available memory and starts to allocate the memory needed for heap, stack, environment and so on. This is required because on some platforms, for example on x86 and on many others, you don't have DDR prepared for usage: you have to set its clocks, you have to train it and do something else, which means you need to execute from some safe location which is not DDR. For example, on x86 they lock a couple of cache lines and execute from there. It is not something people may expect to happen, but these hacks are really happening.
And then, even if you start from the very beginning of DDR because there was already a pre-bootloader, which we have for example on the board that I mentioned before, you still want to move U-Boot here, because at the beginning of memory you'd like to put the real payload, which will occupy the entire memory when U-Boot execution is done. So there are a couple of reasons for U-Boot to relocate. And actually there is another reason: our data is still somewhere here, and we cannot write to this area when we are in ROM, which means we still need to relocate at least the data, and that will still be a requirement for us.
Anyway, that kind of relocation works quite fine when you have a lot of memory, but in our case it looks like this. You see the difference: that's u-boot.bin, and that's our available random access memory, which means most probably we won't be able to copy ourselves entirely to that area and do something else. So we need to keep everything that we can in ROM and only put the required stuff in the RAM area. So what do we need there?
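Before moving on, the size mismatch can be made concrete with a rough sketch. The numbers below are illustrative assumptions only (the talk gives an image of a bit under 200 kilobytes, but the RAM, data, BSS and reserve sizes used here are placeholders), and both functions are hypothetical helpers, not U-Boot code.

```python
KIB = 1024


def relocation_fits(image_size, ram_size, runtime_reserve):
    """True if a full copy of the image plus the runtime areas
    (heap, stack, environment) fits into RAM."""
    return image_size + runtime_reserve <= ram_size


def ram_needed_without_relocation(data_size, bss_size, runtime_reserve):
    """RAM needed when code and constants stay in ROM and only the
    writable data, BSS and runtime areas live in RAM."""
    return data_size + bss_size + runtime_reserve


# Assumed sizes for illustration: 200 KiB image, 128 KiB RAM, 32 KiB reserve.
print(relocation_fits(200 * KIB, 128 * KIB, 32 * KIB))
# Assumed 8 KiB of data and 16 KiB of BSS with the same reserve.
print(ram_needed_without_relocation(8 * KIB, 16 * KIB, 32 * KIB) // KIB, "KiB")
```

Under these assumptions the full copy does not fit, while keeping code in ROM leaves the RAM demand at a small fraction of the image size, which is the motivation for skipping relocation.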
We essentially need to have the heap there, the stack, data and environment. And we will load a payload, but we are not going to squeeze the payload in here, because we have a separate location for that; that's just for the generic case. For that to happen, we need to skip relocation. Even though it sounds quite simple, it requires quite a few fixes in architecture-specific and generic code. It doesn't work right away, because different code paths have to be used, and even though there is a generic flag which we may use to say that we want to skip relocation, we still need to have a couple of defines here and there. Again, if you download these slides, you have a link where you may find all the changes I had to make. Actually not that many, but anyway, that had to be done. Once we have that modification, we signal in our platform code that we don't want to do that relocation and that we will only do some data copy later on. Then we may execute from real ROM. Let's see what we need to do. Initially we have our entire u-boot.bin. Where is the pointer?
Initially this is our ROM, and we have everything here: we have the interrupt vector table, we have the text section, we have rodata and even data, because this is our initial u-boot.bin. Since data is initialized data, we cannot just get rid of it; we still need to have it even before we start execution. And then we need to copy data somewhere into our RAM. It sounds like a simple thing, but there are two questions we need to answer. The first question: okay, we need to copy data, but where are we going to copy it? We have a relatively large RAM, so at which location: at the beginning, at the end, or where? The second question: how can we have such a strange situation where the linker creates u-boot.bin with data here, but then we move it somewhere else, and from this code we may still refer to variables in that area, in a completely different memory region? So let's try to answer that.
To answer the first question, we need to understand what U-Boot does during initialization in terms of memory allocation. The first thing it does is essentially allocate space for itself if we are going to relocate, so somewhere here there would be a place for the entire U-Boot binary. In fact this was happening even if we signaled that we did not want to relocate, so I had to change that so that it is not allocated. Again, for bigger platforms with more memory we don't care where we allocate: 200 kilobytes here, 200 kilobytes there, we don't care. But here I have to be very careful. Then, once that allocation is done or not done, we allocate space for the environment, and we know the size of that allocation; we specify it in our configuration header. You see CONFIG_ something: it is either a Kconfig variable or something set in our configuration header. Then we allocate space for malloc, which we sometimes call the heap, and again we know its length already. What is also important is that we need to allocate space for the stack somewhere, and this is done by simply pointing to the initial stack pointer location. In my implementation I decided to put it here, at the very beginning of the RAM. There is a good reason for that, and on one of the next slides I will explain the beauty of putting the stack here. So from that picture I think it's clear: if we use that value, which we set in the configuration header anyway, as I mentioned, we may use it as the point where we load data. So that problem is solved.
Answering the second question, how we deal with data which is linked here but then copied there: it turns out that GCC and GNU ld can already do that. They use the concept of a virtual address that is different from the load address. If you are interested, again you may follow that link and get some more information. But what do we need to do for that? Unfortunately, this is not a very good position for me. First we define two different memory regions: this is ROM, and we set its origin and size, and this is RAM, and again we set its origin and size. Then for each and every section we tell the linker where to put it. For example, IVT, text and rodata we put in ROM. And here is the interesting part: we ask the linker to place data in RAM but then put it in ROM. What happens is that we use addresses which match the RAM addresses, so the compiler uses those addresses to access stuff in that section, but the linker will put the section in a different area, in the expectation that it will be copied to the virtual addresses that we used during compilation later on, during execution startup. Then there is the BSS section, which we don't put anywhere in ROM, because we don't need it there, and it won't even appear in our u-boot.bin. BSS is nothing: we just allocate some memory region at runtime and zero it, so we don't need to mess with that. What is also important is that here we put a couple of link-time defined symbols, for example RAM start, RAM end, BSS start, BSS end. These will be required. And this one is also important, by the way: ROM end. It will be used for the runtime tricks we are going to do early at startup.
Here you may see how I calculated a couple of different values. It requires some time to understand what's going on there. I wanted to have this thing, RAM data size: this is not the entire size of the RAM, it is the size which will be used by exactly data, BSS, heap and environment. So it is this area, but not the entire RAM. And there are some other variables. That's what we have in the memory layout if we use the variables or constants defined by the linker. You see, this is ROM base, which is quite clear; it was the text base before. This is ROM end, and it is important because that is the address where data exists before the copy, so that's the place from which we start copying data. And here you see RAM start, which is the location to which we are going to copy. So basically we copy in a loop everything from here to here, and we know the size as well; we may calculate it even during compilation. So that is very simple. We also know the location of BSS, so we may easily zero it later on.
And that's what we need to do. First we set that flag, GD_FLG_SKIP_RELOC, and I'm doing that here. Note the name of the function, which is executed very early on startup. So we signal that we are not going to relocate, and then we copy data using those symbols that we defined previously. Finally we zero BSS. Actually that is not entirely correct: it happens in a different function, so I put it here as a matter of demonstration, but this step happens anyway, even if we do relocation.
So it looks like we are ready for the first execution of U-Boot on our board. Now we are going to execute it and see what's going on. In fact the console works, but not everything is working as expected. We want to start using USB, and the first thing you need to do is say usb start. And what happens? Well, it says error minus 12. Like, what?
OK, error minus 12 is essentially ENOMEM, which means we didn't have enough memory during an allocation in our heap area. What helps here is a look at our backtrace, where we may see a couple of interesting things. We start with USB initialization, then we do device probe, and then we do alloc_priv, which at least for me hinted that each device has its own private structure which is used for storage of device-specific information. I thought, OK, if I look at the corresponding driver, at its private structure, probably there is something interesting. And indeed I see there is some buffer of size DWC2_DATA_BUF_SIZE, and if we look at its definition we see that indeed this is another 64k. Then, after quite a short discussion with Marek, he hinted to me that this is actually not a strong requirement. This is a buffer which is used for data exchange via USB, and we may make it much smaller; even if we want to transfer a larger amount of data, it will just be split into smaller buffers, and that's fine. So what I did then was add another Kconfig option which allows you to fine-tune that value to whatever you want. 64k was way too much for me, so I set 16k, and since then it has been working perfectly fine. What helped here was that I was able to look at the backtrace, and also some knowledge of how drivers work in modern U-Boot. This driver is DM, so what is it?
It is the driver model implementation; if you use the legacy implementation, it works a little bit differently.
Now that we got that thing fixed, I faced another issue when I finally tried to access the FAT file system. What I saw was quite interesting, because nothing happened. Fortunately I have access to that board through JTAG, and I saw: OK, I'm in a memory exception. So how come that memory exception happened? Then I took a look at the registers and noticed that my stack pointer register, which is a separate register at least on our CPUs, was pointing way outside of our RAM area. If you return to that slide, you see it was pointing somewhere here. And that is the beauty of the stack being here, because there is no memory here, and my memory subsystem just returned an exception. If I had put that stack area somewhere here, for example after data, you may imagine what would have happened: I would silently start to corrupt my data, and somewhere later down the line, during execution of something else, I would face a significantly worse situation where I need to debug something which might not even be easily reproducible. So that's why you either put it here and expect to get an exception when you access this area, or, if you have something like a memory protection unit, you may use that as well; it will definitely help.
So we understood what happened, but what was the reason? Again, if we look at the backtrace, we see that the last function which was executing is part_test_dos, and if we look at that function, we see at the very beginning an allocation on the stack of something. This is a couple of macros wrapped around each other, and in the end that thing, if simplified, looks like this. What it does is allocate memory of the size of that structure multiplied by the amount provided here, and here the sector size is 512 bytes. And that structure is, what is it, 78, and if we multiply that we get 40 kilobytes. But if we look a little bit further down in that function, we see what we actually want to do: we just want to see the file system magic number, which occupies a couple of bytes. We don't actually need 40 kilobytes here. When I took a look at the Git history, I understood that there was a previous fix that changed the allocation function from something else, and the semantics of that function were different. What that guy actually wanted to do was put a one here, because we want to allocate only the size of that structure. So with that fix, which is mentioned here as well, I got this problem resolved too. From that exercise I concluded that you have to be very careful when you see something like that, because if your memory is restricted a lot, you may allocate a little bit more than you expected.
So in the end I got a working system, and the goals that were mentioned in the beginning were met: you may now load an image, start it and play with it. Basically the main conclusion of this whole exercise is that U-Boot is perfectly usable on a platform with not that much memory. In this particular case I was able to squeeze it into 200 kilobytes of ROM and 100 kilobytes of RAM, and still with that I got drivers for serial port, USB and MMC, and support for the FAT file system, not only for reading but for writing as well, because we expect the user to keep the environment on an SD card or on USB storage. If you have tools, that really helps a lot when you are trying to analyze where you may strip down memory a little bit more. Still, sometimes you need to enhance existing generic U-Boot code, or sometimes even fix stuff, because things are done wrong in some places, but anyway, that helps. What is also good to remember when you are fighting with a limited memory situation is that allocation happens not only statically during compilation but at runtime as well, and it happens not only on the heap but on the stack as well, and it is always a very complicated problem when you have a stack overflow.
Another thing that I want to mention: in that particular situation you need to bypass U-Boot relocation, and unfortunately that requires some work if you are not using the ARC architecture, which most probably you are not, because there are not that many boards with the ARC architecture, but hopefully at some point people will start using it more. You need to patch code and hopefully send patches upstream as well. And the last conclusion is that most runtime problems are due to problems with memory. So I think that's pretty much all I wanted to tell you about my excursion with U-Boot on that platform. Thank you all for being here and for not leaving me here alone in the middle of the session. Any questions so far?
At the very beginning you started removing some unnecessary features, and you removed the ability to boot any operating system. So what will this bootloader do after it boots? What's the purpose?
The world doesn't end on the operating system, right?
The point is, you build your bare-metal application, which does, I don't know, sensor fusion, whatever, and it is 100 or 200k in size. I mean, you may produce an entire binary with the compiler, not only an ELF file or uImage; you may build a binary image of your application. Then you load it into memory. So we return even further back. Where is my pointer? You load your application, say, here, and execute it and do whatever you want. In U-Boot parlance, you say fatload mmc 0 michaelab.bin and provide an address, and then you say go with the same address. What happens is that U-Boot just jumps to that location and starts to execute your application. From this point there is no U-Boot any further. It might still exist somewhere, but you pass control to a different application. So that's the reason for the bootloader to exist.
So this option doesn't prevent U-Boot from running bare-metal raw binaries, from running bare executables, right?
Again, what is the question?
This option doesn't prevent U-Boot from running them?
No. That is a very specific option which does quite a lot of things in preparation for starting the Linux kernel: loading stuff, changing things like the device tree blob, doing a lot of checks, preparing registers and caches, and then passing control. The go command is very simple; it is implemented as a jump.
Okay, thank you.
I have a really simple question. How small can you actually make the U-Boot binary? Can you strip it down to, say, 64K? Is that possible? Are there some mechanics for that?
Well, you see what I had to do here. I had those limitations, and actually in the beginning I was not 100% sure, but I said okay, I will do that, and fortunately I was able to squeeze what I wanted into what I have. But can we go even further? If you take a look at...
Yeah, so if you remove the USB, the file system and so on, how small can you get?
Well, I need to try. Again, I had to implement something which supports both SD and USB, and essentially USB and the file system especially take quite a lot. But then the question is, if we remove all the peripherals and leave only the U-Boot core, why do we need that?
It's just a thought experiment, really. How small can you get?
Well, actually there are systems that have SRAM limitations, and you need to load U-Boot into that and execute it. But for that case we have SPL. We have all those tricks which strip out all the printks, printfs and all that. So that would be the thing I would use in that situation.
Can you speak a little bit about SPL?
Well, it was not implemented here, because it is not what we wanted; SPL is not meant to be used as the full-blown bootloader that we wanted. It does one simple thing. But I think you know that better than I do.
You're kidding. I actually want you to tell the audience about SPL.
SPL is an even more stripped-down U-Boot. It does only the one thing that we want it to do: it typically loads from only one medium, and it doesn't print debug messages, but it is very tiny, so it can fit into something like one page. For example, I think on the i.MX platform they have an SPL which fits into one page of memory, like 4k or something.
No, it's 32.
32?
Well, anyway, quite small.
I do believe you can make a boot as small as one jump instruction.
No, no. And what Marek was saying is a practical thing, because what happens is that you have a multi-stage boot. First you have the boot ROM, which actually does pretty much nothing; it might be a couple of lines of code which copy stuff, for example from SPI, word by word. Then you start execution from the first page. For example, on x86 you have cache lines which you lock, you load into them, and you have 32K, no more. Then you start execution from there, and with SPL you initialize your DDR, and then you may load the real, full-scale U-Boot from the SD card into that DDR, because you already have a driver for the SD card and the file system, which you are not able to have in a pre-bootloader because you only had a couple of lines of code there. So you see, on each stage we are becoming more and more powerful, and in the end we may do whatever we want. That's the beauty.
So what's your long-term plan here for maintenance? Is it all going to be upstream?
It's already upstream.
And are you going to maintain these options?
Which options?
To be able to strip off different features.
Well, you can strip features yourself; that's the configuration for your board. This is my board, so that's my configuration. If somebody wants me to add another feature, OK, I may add it. If somebody says that we don't need USB there any longer, I will just strip it out. Essentially I will support that, because I'm the maintainer of the architecture and that's my board, so it's my responsibility. That's my child.
There is framework-related stuff as well, where each symbol goes into a different section and then at link time you strip off the symbols which are not referenced anywhere. I would guess that belongs at more of a framework level rather than in your platform or board.
Yeah, as I said, it should be a generic feature, but probably it was not supported back in the day, a couple of years ago, and then it became supported. I only added that feature now because, again, for previous solutions it was not needed; I had enough memory. As for MicroBlaze, for example, we may ask Michal Simek, I think I saw him. If he just didn't think about that, probably that's the reason, and if it is possible to do it on MicroBlaze, then it is a no-brainer: we should get it into generic code, and actually that's a good idea. But first we need to figure out what's wrong with MicroBlaze. So yeah, that's the next step indeed.
OK, thank you very much.