Hello, and welcome to this presentation of the iCache module, which is embedded in all products of the STM32U5 microcontroller family.

The instruction cache, or iCache, is introduced on the C-AHB code bus of the Cortex-M33 processor to improve performance when fetching instructions and data from the internal flash or SRAM memories, or from external memories through the Octo-SPI1 and 2 or FMC interfaces. iCache allows close to zero-wait-state performance on program fetches in most use cases, thanks to its intrinsic caching operation. This performance is achieved through the following two features: hit-under-miss support and a critical-word-first refill policy.

The internal flash is accessed by the dedicated 128-bit AHB fast bus. This is the only difference compared to the iCache implementation in the STM32L5 microcontrollers. SRAM 1, 2 and 3, Octo-SPI1 and 2, and FMC are accessed through a 32-bit AHB slow interconnect. This two-master architecture decouples the cache refill path from external memories from the high-bandwidth path to the flash memory. The remapping logic allows internal or external memory ranges to be cached by defining an alias address for them in the code region, from 0x0000 0000 to 0x1FFF FFFF.

The instruction cache reduces the consumption of the microcontroller by accessing instructions and data in the internal iCache rather than in the larger, more power-consuming main memories. Configuring iCache as direct mapped by software allows an even lower power consumption compared to the two-way set-associative organization, which is also supported.

The multi-bus interface minimizes potential conflicts between memory traffic. The 32-bit execution slave port receives memory requests from the Cortex-M33 C-AHB code bus. The 128-bit Master 1 port performs cache line refills from the internal memories, flash and SRAMs.
The 32-bit Master 2 port performs cache line refills from the external memories, that is, external flash and SRAMs accessed through the Octo-SPI1 and 2 and FMC interfaces. The second slave port is used for register accesses.

When an external memory access is marked as non-cacheable by the MPU, the iCache is bypassed. The request is forwarded to the external memory on the iCache Master 1 or Master 2 port in the same clock cycle. Only the address may be modified, due to the address remapping feature.

The iCache offers close to zero-wait-state data access performance thanks to three features: zero wait states on a cache hit; a hit-under-miss capability, which serves new processor requests while a line refill due to a previous cache miss is still going on; and the critical-word-first refill policy, which minimizes processor stalls on a cache miss. The hit ratio is improved by the two-way set-associative architecture and by the pseudo-least-recently-used replacement policy, which is based on a binary tree. This algorithm is a good trade-off between hardware complexity and performance.

Thanks to the wide 128-bit bus, a cache line refill from flash only requires a single data transfer, because 128 bits represent exactly one 16-byte cache line. Cache lines read from the external memories are transferred with the critical word first, by implementing WRAP4 AHB transaction ordering, in order to deliver the instruction requested by the processor's fetcher first.

The dual master port architecture decouples internal and external memory traffic. For example, SRAM fetches are not stalled by cache line refills from external memories. Interrupt latency is minimized when the interrupt service routines are located in the internal flash or SRAMs.

The iCache implements performance counters: one 32-bit hit counter and one 16-bit miss counter. This performance monitoring helps analyze and optimize code placement according to cacheability, in order to achieve the most efficient code traffic.
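
As an illustration of how the two counters could be used, here is a minimal sketch in C that turns a hit count and a miss count into a hit ratio. The function name is hypothetical, and the counter values are passed in as plain integers: how they are read from the iCache registers is not covered here and is left as an assumption.

```c
#include <stdint.h>

/* Hypothetical helper, not part of any ST library.
 * hits   : value read from the 32-bit hit counter
 * misses : value read from the 16-bit miss counter
 * Returns the hit ratio in tenths of a percent (0..1000),
 * or -1 when no access was counted, to avoid dividing by zero. */
int32_t icache_hit_ratio_permille(uint32_t hits, uint16_t misses)
{
    /* Widen to 64 bits so that neither the sum nor the product overflows. */
    uint64_t total = (uint64_t)hits + misses;

    if (total == 0u) {
        return -1;
    }
    return (int32_t)(((uint64_t)hits * 1000u) / total);
}
```

For example, 990 hits and 10 misses give 990, that is, a 99.0 percent hit ratio. Because the miss counter is only 16 bits wide, a measurement window is presumably best kept short enough that the miss count stays within that range.
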
The remapping logic is very convenient for extending the cacheable region beyond the 512-megabyte code memory address range, which starts at address 0. Up to four external regions can be defined, and for each of them, the refill port can be selected, either Master 1 or Master 2.

Power consumption is reduced when iCache is used, because most instruction accesses are performed from the internal cache memory rather than from main memories. Configuring the iCache as a direct-mapped cache, rather than in the default two-way set-associative mode, also contributes to reducing the consumption, because only one cut of the tag and data memories is accessed instead of two. However, the direct-mapped organization may affect the performance when the distance between two pieces of code needed at the same time is an integer multiple of the cache size.

A dedicated secure bit in the tag RAM of each cache line prevents non-secure requests from hitting secure iCache entries.

An invalidate maintenance operation is supported to invalidate the entire contents of the instruction cache, typically when the main memory content is modified. This operation is controlled by software by accessing a memory-mapped register. This is a fast, non-interruptible command, and the end of the operation raises a specific flag and possibly an interrupt.

An error flag and possibly an interrupt are raised whenever an unexpected cacheable write access is received on the execution port. The iCache does not manage AHB bus errors returned to the Master 1 or Master 2 ports. It simply forwards the AHB response received on the master port back to the processor.

This table summarizes the characteristics of the instruction cache: a 16-byte cache line size, transferred through a burst transaction of four words or a single data transaction of one quad word; a two-way set-associative 8-kilobyte cache that can be configured as a direct-mapped cache; and a global invalidate maintenance operation.
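
To make these figures concrete, the following C sketch derives the number of sets from the 8-kilobyte size and 16-byte line given above, and shows why two addresses that are a multiple of the cache size apart compete for the same line in direct-mapped mode. The function names are hypothetical, and the index calculation is the usual textbook scheme, assumed here for illustration rather than taken from the iCache documentation.

```c
#include <stdint.h>

#define ICACHE_SIZE_BYTES 8192u  /* 8-kilobyte cache */
#define ICACHE_LINE_BYTES 16u    /* 16-byte cache line */

/* Number of sets for a given associativity.
 * ways must be 1 (direct mapped) or 2 (two-way set associative). */
uint32_t icache_num_sets(uint32_t ways)
{
    return ICACHE_SIZE_BYTES / (ICACHE_LINE_BYTES * ways);
}

/* Textbook set index (assumption): drop the byte offset within the
 * line, then wrap on the number of sets. */
uint32_t icache_set_index(uint32_t addr, uint32_t ways)
{
    return (addr / ICACHE_LINE_BYTES) % icache_num_sets(ways);
}
```

Under these assumptions, the cache holds 512 lines in direct-mapped mode and 256 sets of two lines in two-way mode. Addresses 0x08000000 and 0x08002000, which are 8 kilobytes apart, map to the same index in direct-mapped mode and would evict each other, whereas in two-way mode a set has room for both lines.
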
iCache defines an alias address in the code region for up to four external memory regions. The address remapping is applied to the code alias address, transforming it into the external physical destination address. The minimum region size is two megabytes. The maximum size is 128 megabytes.

In this chart, the performance in direct-mapped and two-way set-associative modes is the same. The reason is that the entire benchmark fits into the iCache. Once the code is within the iCache, the flash latency has no impact on the performance. When the iCache is disabled, the larger the flash latency, the lower the performance.

The two sources of the iCache global interrupt are error detection on cacheable write requests, which sets the ERRF bit in the iCache status register, and the end of the full invalidate operation, which sets the BSYENDF bit in the iCache status register. There is no iCache management of errors occurring on a Master 1 or Master 2 port request. The erroneous response is propagated through iCache back to the Cortex-M33.

iCache is clocked at the same frequency as the Cortex-M33 core, because the iCache only caches instructions requested by the Cortex-M33. Consequently, the iCache and the Cortex-M33 have the same state in the various low-power modes. When the microcontroller is in stop mode, the user can decide to power down the iCache.

When the iCache is disabled, it is bypassed, except for the remapping mechanism, which remains functional. C-AHB bus requests, whether they are remapped or not, are just forwarded to the master ports. So the iCache consumes less, because the tag and data memories are not accessed, but each instruction is fetched from the more power-consuming targeted main memory. To reduce power consumption, the performance monitor is disabled by default.

In addition to this presentation, you can refer to the following presentations: DataCache, Flash, FMC, and Octo-SPI.