In this hands-on we will look at code execution from the internal flash memory. The first goal is to test the performance of execution from the internal flash in different scenarios, with a configuration you can change yourself. In the later hands-on we will compare this with the execution from external flash, and this might also help you to make benchmarks of your own, so you can compare situations, scenarios and configurations that are related to your application. The benchmark calls a configurable number of small functions that are basically the same, just placed in memory one after another, and the measured cycle count is divided by the number of calls, so we have a number related to a single function. So at the end, the raw number we get means that the lower the value, the better the performance. The main loop that calls these functions is placed in the ITCM RAM, so it is not placed in the cache, and there is a direct relation between the number of functions and the size of the cache we use. We also enable the optimization flags for the compiler. And as already mentioned, when we exceed the cache size, we will see some performance drop. But this is, let's say, the worst-case scenario; what we prepared is really the corner case, so in a real application the performance drop might not be that significant. That's an important note. So if we have a look at the internal flash: here we have some, let's say, logic interface, and then we have the flash memory itself. You can see that the flash memory itself is connected by a 256-bit wide bus, so this is basically the size of the flash word, the minimum size we can read in the hardware. And then we have some additional logic with some buffers.
Then we have the AXI bus, which is 64 bits wide. At the maximum frequency we have three wait states, so four cycles in total to access a single word in the flash. If we take all these numbers together, it basically means that if we are reading some data sequentially from the flash, we are able to fully utilize the AXI bus, so the flash itself shouldn't cause any bottleneck in the system. With that, we get a theoretical bandwidth of about two gigabytes per second if we really read all the data sequentially; of course, in a real application we sometimes need to skip to different parts of memory, and then there is some delay. This also means we can feed one 32-bit instruction to the core each clock cycle, because the core usually runs at twice the frequency of the AXI bus; so that's one 32-bit instruction or two 16-bit instructions being fetched by the core per core clock. In the end, if you look at the core, it might still not be enough, because as this slide shows, the Cortex-M7 uses a so-called superscalar dual-issue architecture, and in most cases it can execute two instructions at the same time. So if we use only 32-bit instructions, we are able to provide just one instruction each clock cycle, and then we lose some performance; but usually we have some mixture of 16- and 32-bit instructions. OK, so even with the internal flash, the cache can provide some performance boost. If we have a look at the, let's say, memory architecture of the core, here we have the connection to the bus matrix and we have the two caches: one is for the instructions and the second one is for the data. On the STM32H72x and H73x we have 32 kilobytes of each; the size depends on the device, so on some other Cortex-M7 device it might be more or less. You can enable or disable the cache globally, or, by using the memory protection unit, you can decide to enable or disable it for some parts of the memories.
So typically, for the data cache, we want to disable it for DMA buffers, to make sure both the DMA and the core see the same data. OK, we also have the so-called tightly coupled memories. Those are, let's say, part of the core; they run at the same frequency and can be accessed fast by the core, so they can be useful for placing some critical code or data we want to access quickly. On the STM32H7 these memories can be accessed only by the core or by the MDMA; the regular DMAs cannot access them. The data tightly coupled memory (DTCM) has 128 kilobytes, and for the ITCM it's 64 kilobytes, plus there is some additional memory space we can split between the ITCM and the AXI SRAM. For instance, if you use a value-line device, which has limited internal flash, and you execute some code from an external one, you might decide to increase the ITCM so you can load some functions into it; in other cases you maybe want bigger buffers in the AXI SRAM, and then you choose a different option. Here is a demonstration of the instruction cache and how it behaves in our example. In this first case, the number of functions can fit into the instruction cache. If we start at the beginning, when the cache is empty, the first function needs to be read from the flash and executed, but at the same time it's stored into the instruction cache. Then we go to the next one, and for the last one it's still the same. But if we start from function number one again, and we can imagine this as some loop of instructions, then on the next pass we read the functions basically from the cache, because they are already there, and we don't need to access the flash at all; the flash here can even be the external one. In the next case we increase the number of functions even further. Maybe it's also good to mention the limiting number we will use in the practical example: 1024.
So if we increase it even further, then when we get to function 1025 the cache is already full, and we need to remove some function to be able to store the new one. Usually the core removes the function that was not used recently; in this example, let's say we remove function number one. In real life it might be something different, so the core might decide to remove function number two, for instance, because there are some alignment requirements and so on. But let's assume it works this way. Then, if we start the loop again, function number one is no longer there and we need to load it from the flash, and it might be similar for the next functions. So even if we exceed the cache size by a small amount, the performance drop might be significant; but again, this is a corner-case example, and in a real application you will probably have some smaller loops, so the performance drop will not be that big. As mentioned before, we also want to add some additional bandwidth load on the flash. One of the things we will do in the example is to show an image on the screen of the board: we will read this image from the internal flash using the LTDC peripheral, and since we need to refresh the display 60 times per second, we will read the image directly from the flash 60 times each second. If we take into account the resolution and the bytes per pixel, we get a required bandwidth of roughly 23 megabytes per second. If you remember, we calculated a theoretical bandwidth of 2 gigabytes per second for the internal flash, so even this load is quite small compared to the theoretical maximum. Also, during the refresh there are some blanking periods of the display; basically it means that at some moments the LTDC will not read any data from the flash, so it reads a line, then there is some delay, and then it reads another line.
So this will also cause some spikes in the performance, because sometimes you are measuring while the LTDC is fetching data and sometimes not. We should see some picture on the screen; maybe don't be surprised by the fixed-color area, it's just some part of the graphics we took for the display. In the example we will basically switch this display on and off; when it's switched off, it doesn't read any data. Then we have something similar for the audio: we will read some short audio loop from the flash memory, fetch it through the DMA, and through the serial audio interface we will send the data to the audio codec that is on the board; finally the audio is available on the jack connector of the board. Again, based on the parameters, we calculate some required bandwidth, which is even lower than for the picture, so basically we will see that this audio playback has minimal effect on the performance. Here is the internal bus matrix, showing the different bus masters competing for access to the internal flash memory: we have the core and the LTDC directly connected to the AXI bus matrix, and then we have the DMA for the audio, which is on a different bus matrix, but there is a bridge between the bus matrices. It also shows the SRAM that is configurable between the AXI and the ITCM. For measuring the performance we will use the Data Watchpoint and Trace (DWT) unit. This unit contains a 32-bit free-running counter which is incremented each CPU clock cycle, and by reading this counter we can measure the time it takes for some code to execute. We just need to be careful that the time is below 7.8 seconds, because otherwise the counter will overflow more than once, and then we cannot measure the time. The good thing is that this unit is part of the Cortex-M core. I think in the official documentation it's marked as an optional part, but it's quite widely available, at least on all STM32s, with the exception of the M0 and M0+ cores, where I think it cannot be included in the core. Before we measure the loop execution time, we will disable the interrupts, so we make sure we really measure only the loop and not some additional code. Also, we will do several measurements and take the average, because we expect the first run of the loop to take more time than the consecutive runs. This is the code we will use: here you can see the disabling of the interrupts; then we have the for loop that we use for the averaging, so we do several measurements; then we read the counter so we have a timestamp at the beginning; then we execute the functions; then we get the timestamp at the end; and finally we compute the difference. Since we are using an unsigned 32-bit integer, even if the counter, let's say, overflows in between, we still get the correct difference between the two timestamps. The last slide from the theory gives some general recommendations for performance. It's a good idea to enable the compiler optimizations. Another good performance boost is to enable the caches: for the instruction cache there is basically no drawback; for the data cache you need to be a bit careful with the DMAs, because the DMA only sees the data in the RAM memory and not in the cache, so you probably need to disable the cache for the DMA buffers, and it requires some additional configuration. And if you have some critical code or interrupts, you can place them into the ITCM memory to run them faster; since this code is in the ITCM, it will not be in the instruction cache, so the instruction cache can be used for some other code. So now we will continue with the more practical part. There might still be some slides that explain in more detail what we did or what some other options are; if a slide has this sort of brain icon, you don't need to follow the steps on it, it's just for the explanation. As for the steps we will do: we will import some project template, and then we want to place some code into
the ITCM RAM; this needs to be done in several steps. Let's start with this slide, to explain how code and variables are placed by the toolchain. By default, all the functions and data are placed to the flash and RAM, so data goes to RAM, code goes to flash; usually that's the basic setup and you don't need to configure anything. If you add some function, by default it goes to the so-called .text section (this terminology is related mostly to the GCC compiler), and then the linker places the .text sections into the flash. If we want to place it somewhere else, the first step is basically to tell the compiler to put it in some different section, and then, when we link the files together, we need to tell the linker to place this section into some specific part of memory. OK, so we will import the project. Here I have the CubeIDE opened with an empty workspace, and I can either click here on 'Import projects' or go through File > Import; here I select the General folder and 'Existing Projects into Workspace', because we are importing a project that is already configured for the CubeIDE. (The other options would be used for projects that were prepared for some other tool.) Here I already have the folder preselected: in this case I go to the C drive, then to the folder I used for the installation, or, if you used the zip file, to wherever you unzipped it, and then I select the hands-on number 2 folder. It finds the project in the folder, so I just need to make sure the checkbox is ticked, and then I click Finish and it imports the project. Now that we have the project, we can also open the text file named 'to copy 02'. We will use it to modify the project: it contains the code snippets from the slides you can use, because copy-paste from the PDF or the PowerPoint sometimes doesn't work, since some characters get replaced by a different version, or some special characters are added which the compiler doesn't recognize. The first step we will do is to tell the compiler to place the main loop function into the ITCM section. First we need to find it in the main.c file; you find it here, and I just copy-paste this attribute line, place it before the function, and save the file. That's it. And here, just to explain what this means: as you probably noticed, this isn't something that is standard in the C language; it's a language extension supported by GCC, and it's also supported by the Keil compiler. There are also some other options for how to manipulate the placement of the functions. One of them is the compiler option -ffunction-sections: if this is selected, then, for example, the main function will be placed in the .text.main section, and then in the linker script we can use this section name to place the function somewhere; so this can be another option. And if you are using the IAR compiler, then you can use #pragma location, or ultimately you can define some universal macro that will work with most of the compilers. Now we need to modify the linker script. Maybe I will first explain the code that is here. Since we store the code in the ITCM RAM, which is a volatile memory and is undefined after reset, we need to initialize it properly, and to be able to initialize it we need to store the initialization data somewhere; we will store the initialization data in the flash memory. The linker script is therefore slightly more complicated. We have these kinds of symbols that define addresses in the section: on the first line, the first one defines the load address, which is basically the beginning of the initialization data; then we have the definition of the section; then we have another, let's say, pointer that defines the beginning of the data in the ITCM RAM; then we decide what we want to place into this
part of memory; then we have the pointer at the end of the data in the ITCM RAM; and at the very end we specify the memories where we want to place these sections. So you will now do this modification; just make sure you open the correct linker script, the one that ends with _FLASH and not _RAM. We will place this code just after the .isr_vector section; this section basically needs to be first in the linker script, because it's the interrupt vector table, which is at the beginning of the flash. So here I open the flash linker script, here I can see the interrupt vector definition, and I just copy the required code from the text file. That's it. So we have now told the compiler to place the function in the ITCM RAM and modified the linker script to include the initialization data, but we still need one more step, and that is to initialize this memory at the beginning of the code. Just to explain how the placement of the functions (and also of data) works in the GCC linker script: basically we select the file and the section we want to place there. In the pure form it should be the file path and, in the brackets, the full name of the section, but we can use the star character as a wildcard. So, for example, if we have the expression *(.text*), we select, from all files, all sections starting with .text. This one is, for example, used with the option I mentioned previously: since each function is in its own section starting with .text, we place all these sections somewhere. Another example, which is already in the linker script, is *benchmark.o, and this basically selects all the code from the benchmark source files to be placed somewhere. The star at the beginning is, let's say, a sort of lazy trick to replace the file path, because I think it depends on how the compiler is called what path gets assigned to the file, and if I don't want to figure this out I can just place the star before it and it will work. Okay, so that was the explanation, and as mentioned before, we now need to load the data into the ITCM RAM. We will use the symbols that are already defined in the linker script; we just use extern to, let's say, refer to them, and the linker will place the right values there, and then we call the memcpy function at the beginning of main. So I will go back to main; first I need to put the declarations somewhere. I think I can place them anywhere before main, but maybe we will use the same section as in the slides and place them there. Then I go to the main function, and here I call memcpy; you see I use the addresses to compute the size of the section and I use the different start addresses, and I think I can now compile it. Now we load the ITCM memory at the beginning of main, but in some cases this might be, let's say, too late: for example, if you want to place the main function itself in the ITCM, then you need to initialize the memory earlier. The other option might be to initialize it in the startup file. That code is written in assembly, so, let's say, it can be a bit trickier, but on the other hand there is already similar code there that takes care of the initialization of the data, so you can basically copy-paste it and replace the addresses it uses. So this could be an option if you need to initialize the memory earlier. For the IAR and Keil toolchains this is done automatically: the compiler itself has some initialization function that is called at the beginning, and it takes care of this case. So now the project should compile, and to check that we have done everything properly we can use the Build Analyzer that is part of the CubeIDE; if it doesn't show here, you can go to Window and find it in the menu
but I already have it here. In this part there is some basic overview of the different memories, or parts of memories, that are used; for example, here we see the usage. For instance, for the flash we can see that it's quite full, around 70%; this is mainly due to the fact that we put in the image for the screen, which is quite big, so it takes quite a lot of space in the flash memory. So this can be used for some basic overview, and if we switch to Memory Details, we should be able to see basically the content of the memories. I think the analyzer takes the information from the map file, so this is like a user-friendly version of the map file. We can have a look: here we have the ITCM RAM, and we see the main loop measure function is placed at address 0, which is the start address of the ITCM RAM, and here we also see the address where it starts in the flash. So this can be useful to check that the functions are really placed where we expect. We can also see these so-called veneer functions: these are, let's say, small helper functions created by the linker when you want to jump between different parts of the memories. The regular jump or branch instruction has some limited range it can jump within, so if the code needs to jump farther, the linker creates such a function. It's also useful for seeing sort of dependencies. Here I have the main loop measure function, which calls this invalidate-instruction-cache function, and I can see that the main loop measure function calls some function that is not placed in the ITCM. So, for instance, I want to speed up some interrupt handler and I decide to put it into the ITCM RAM, but in the end it might happen that I don't observe any performance boost, because it calls some other subfunctions that are still placed in the flash. Looking at these veneers can be useful to detect this kind of situation. And in the text section, well, there are too many things here, but at the end we see the veneer in, let's say, the other direction, for when we jump from the flash to the ITCM. Now we will load the project to the board by running the debug configuration. First make sure you have the board connected; I already have it, and I right-click on the project, select Debug As, and select STM32 C/C++ Application. Now it connects; in some cases there might be a pop-up message to update the ST-LINK firmware. Now it's programming the flash. You can see it takes some time, because we used quite a lot of the internal flash, so it takes some time to load it. Now we stop here; I could resume the execution, but we want to use the CubeMonitor, so I will terminate the debug session and open the CubeMonitor. OK, I already have the dashboard prepared; I think you should also have it prepared from the previous hands-on. So I go to the dashboard and start the acquisition... ah, OK, I need to maximize it... yeah, OK, I think this is correct. So now we see some sort of performance percentage. This is computed relative to the best-case scenario, so with the small code size and the cache enabled, and as you can see, at the moment I don't have the cache enabled; that's why I don't have the full performance. If I enable it, I think we should get to 100%. Then I can decide to enable the LCD display; if you do it now, you should see the picture on the screen. But since I have the cache enabled, and the number of function calls here still fits into the cache, the core basically doesn't access the flash memory for the code execution, so enabling the display doesn't have any effect. But if I disable the cache, I think we should be able to see some difference, even if it's small: now it shows 72%, sometimes it jumps to 73%. You can also see the toggling of the performance; that's caused by the blanking periods of the display, because the LTDC is not reading the data all the time. And if I turn it off, you can see the number of cycles required drops; yeah, here we show the number of cycles, so the lower the value, the better the result. If I enable the audio stream, I don't see any difference; maybe together with the display there might be some slight change observable. So let's now try to enable the cache and disable the display, and here in this field of the demo dashboard we can change the number of functions being called. I can go to 1024 and it still remains the same (I'm not sure why there are the spikes). OK, but if I go, for example, to 1050, I now see the performance start to decrease; I can go to 1080 and it decreases again. If I go to the maximum value, which is 4096, it basically gets to the same point as with the cache disabled; the cache still helps a little bit, but it's not significant. Of course, by overloading the cache we cannot get a worse case than with the cache disabled. There are some values you should be able to observe, just as an example. And yes, some conclusion: as we can see, the internal flash offers quite good performance for the code execution, even with some additional bandwidth load. In the practical part we showed how to place some code into different parts of the memory, and a similar approach can be used also for placing data. And you can use a similar approach, together with measuring through the Data Watchpoint and Trace unit and through the CubeMonitor, to observe the performance of your own application. For example, you think: maybe if I use this configuration, the performance will get better; but you need some way to measure it, so this can also be a good example of how to do that. So I think that's all from my side.