Hello and welcome to this presentation of the Dcache module, which is embedded in all products of the STM32U5 microcontroller family. The STM32U5 series embeds up to two data caches. Dcache 1 is a 4- or 32-Kbyte data cache, depending on the product. Dcache 2 is a 16-Kbyte data cache available on some U5 products. Dcache 1 is placed on the Cortex-M33 SAHB bus and caches only the external memory region, accessed through the OctoSPIs, the HSPI, and the FMC. Indeed, by placing a bus-matrix demultiplexing node in front of Dcache 1, SAHB bus memory requests addressing the SRAM region or the peripheral region are routed directly to the main AHB bus matrix and Dcache 1 is bypassed. The concurrency between Dcache 1 accesses to external memories and core accesses to internal SRAMs also improves the overall performance of the microcontroller. In the figure, the Dcache 1 master bus used to access external memories is completely independent of the bus used to access internal SRAMs. Dcache 2 is placed on the AHB bus driven by the M0 port of the GPU2D and caches all the memory regions accessed by it. The best performance is achieved by caching only external memories and by bypassing Dcache 2 for internal memories. Both Dcaches autonomously handle cache line reloads, cache line evictions, and write stores to external memories. Performance is achieved through the two following features: hit-under-miss support and a critical-word-first refill policy. These data caches also help to reduce the microcontroller's power consumption by accessing data in their internal memory rather than in the main memories, which are larger and therefore consume more power. The multibus interface minimizes potential conflicts in memory traffic. The 32-bit data slave port receives instruction and data memory requests from the Cortex-M33 SAHB system bus for Dcache 1, and from the M0 port of the GPU2D for Dcache 2.
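As a rough illustration of the demultiplexing node described above, the routing decision can be sketched as a simple address-range test. The boundaries below follow the generic Arm Cortex-M memory map for the external memory space and are an assumption for illustration; the exact product mapping must be taken from the STM32U5 reference manual.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed boundaries of the external-memory space served by Dcache 1
 * (generic Arm Cortex-M map: external memory 0x60000000-0x9FFFFFFF).
 * Illustrative only: check the STM32U5 reference manual for the
 * actual product mapping of the OctoSPI, HSPI and FMC regions. */
#define EXT_MEM_BASE  0x60000000UL
#define EXT_MEM_LIMIT 0x9FFFFFFFUL

/* Returns true when an SAHB request would be routed through Dcache 1,
 * false when the demultiplexing node sends it straight to the main
 * AHB bus matrix (internal SRAM or peripheral regions). */
static bool goes_through_dcache1(uint32_t addr)
{
    return (addr >= EXT_MEM_BASE) && (addr <= EXT_MEM_LIMIT);
}
```

With this model, an access to internal SRAM (for example 0x20000000) bypasses the cache, while an access to an FMC or OctoSPI-mapped device falls inside the cached range.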
The 32-bit master port performs refills of missing cache lines from memory, dirty data write-backs, and data write-throughs to external memories. These memories are external Flash and RAM devices accessed through the OctoSPIs, HSPI1, and FMC controllers. The second slave port is used for register accesses. When an external memory access is marked as non-cacheable by the MPU, Dcache 1 is bypassed: the request is forwarded unchanged to the external memory on the Dcache 1 master port in the same clock cycle. Dcache 1/2 offer close to zero-wait-state data read/write access performance, thanks to zero wait states on a cache hit, a hit-under-miss capability that allows new processor requests to be served while a line refill due to a previous cache miss is still ongoing, and a critical-word-first refill policy that minimizes processor stalls on a cache miss. The hit ratio is improved by the two-way set-associative architecture and the pseudo-least-recently-used replacement policy based on a binary tree, or pLRU-t. This algorithm is a good trade-off between hardware complexity and performance. Cache lines are transferred with the critical word first by using the WRAP AHB transaction ordering, in order to deliver the data requested by the processor's pipeline first. Write-back and write-through policies are supported; the selection depends on the MPU setting for the addressed data region. Dcache 1/2 support all data sizes: byte, half-word, and word. Dcache 1/2 implement performance counters: two 32-bit hit counters, one for read and one for write transactions, and two 16-bit miss counters, one for read and one for write transactions. This performance monitoring allows data placement to be analyzed and optimized with respect to cacheability and the write-back/write-through policy, to achieve the most performant data traffic. Power consumption is reduced because most data accesses are performed to/from the internal cache memory rather than to/from the bigger, external main memories.
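The critical-word-first refill described above can be illustrated by computing the beat order of an AHB WRAP burst: the burst starts at the requested word and the address then increments, wrapping at the cache-line boundary, so the processor's pipeline receives the word it stalled on in the very first beat. This is a behavioral sketch, not driver code.

```c
#include <stdint.h>

/* Compute the beat order of an AHB WRAP burst used for a line refill:
 * the critical (requested) word is delivered first, then addresses
 * increment and wrap at the cache-line boundary. words_per_line is
 * 4 for a 16-byte line (WRAP4) or 8 for a 32-byte line (WRAP8). */
static void wrap_burst_order(uint32_t critical_word,
                             uint32_t words_per_line,
                             uint32_t order[])
{
    for (uint32_t beat = 0; beat < words_per_line; beat++) {
        order[beat] = (critical_word + beat) % words_per_line;
    }
}
```

For example, with a 32-byte line (8 words) and a miss on word 5, the refill delivers words 5, 6, 7, 0, 1, 2, 3, 4: the processor is unblocked after the first beat instead of waiting for the whole line.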
A dedicated secure bit in the TAG RAM of each cache line prevents non-secure requests from hitting secure Dcache 1 entries. As the GPU2D traffic is non-secure, Dcache 2 does not support TrustZone. Software cache coherency is performed through maintenance operations controlled by memory-mapped registers. These are the full cache invalidation operation, which is a fast command, and the invalidate, clean and invalidate, and clean operations, which apply to a programmable address range. The Dcache is automatically invalidated after a reset. The address range maintenance operations are typically used to maintain the coherency of buffers shared by DMA channels and the processor core. These commands are not interruptible; the end of an operation raises a specific flag and possibly an interrupt. The error flag, and possibly an interrupt, are raised whenever a bus error is returned to the master port of the Dcache for a request initiated by the Dcache itself, either a line eviction or a clean operation. When the master port forwards a request received on the slave port, the Dcache simply forwards the AHB response received on the master port back to the processor. This table summarizes the characteristics of the data cache, depending on the product: a 16- or 32-byte cache line size, transferred using a burst of 4 or 8 words; a 4- or 16-Kbyte cache; 2-way or 4-way set-associative. The data cache implements the following write and allocate policies. Write-through, no write allocate: when a store miss occurs, Dcache 1/2 is bypassed and the data is directly written to memory. Write-back, write allocate: when a store miss occurs, the cache line is fetched from memory and updated with the data received from the processor; the resulting cache line is then written to the data cache with the dirty bit set. The supported maintenance operations are invalidate (global and per address range), clean and invalidate (per address range), and clean (per address range).
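The address-range maintenance operations work on whole cache lines, so a buffer shared with a DMA channel must be covered line by line. The helper below is a hypothetical illustration (not a HAL or driver API) of how a byte range is expanded to line-aligned boundaries, assuming a 32-byte line; products with a 16-byte line would use the other constant.

```c
#include <stdint.h>

#define DCACHE_LINE_SIZE 32U  /* bytes; 16 on products with 16-byte lines */

/* Expand a buffer [addr, addr + len) to whole cache lines, as needed
 * by the address-range clean / invalidate / clean-and-invalidate
 * commands. Hypothetical helper for illustration only. */
static void dcache_maintenance_range(uint32_t addr, uint32_t len,
                                     uint32_t *start, uint32_t *nlines)
{
    uint32_t first = addr & ~(uint32_t)(DCACHE_LINE_SIZE - 1U);
    uint32_t last  = (addr + len - 1U) & ~(uint32_t)(DCACHE_LINE_SIZE - 1U);

    *start  = first;
    *nlines = (last - first) / DCACHE_LINE_SIZE + 1U;
}
```

For instance, a 100-byte DMA buffer starting at 0x90000010 spans four 32-byte lines starting at 0x90000000; cleaning fewer lines would leave part of the buffer dirty in the cache, while the neighboring bytes pulled into the first and last lines are why shared buffers are best line-aligned.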
This slide demonstrates the influence of the data cache and instruction cache on performance, expressed in CoreMark, when the processor core frequency is 160 MHz. Four scenarios are described, in which the location of data and instructions varies, as well as their cacheability in the iCache and Dcache 1. In the first case, code is in an external memory accessible through the FMC on the SAHB bus; the iCache is not involved. The lowest performance is obtained when code and data are accessed through the FMC and not cached by Dcache 1, so that code and data are multiplexed on the SAHB bus. This is slightly better when data is in SRAM 3, but still transferred on the SAHB bus, which is also used to fetch code from the FMC. When Dcache 1 is enabled, the performance increases a lot but is still limited by code and data sharing both the SAHB bus and the Dcache 1. The best performance, 487 CoreMark, is achieved when data is in SRAM 3: code and data still share the SAHB bus, but the Dcache 1 is dedicated to code storage only. Code accessible through the FMC could advantageously be cached by the iCache through iCache address remapping, which would split code and data traffic between the CAHB and SAHB buses. In the second case, code is in an external memory accessible through the OctoSPI, on the SAHB bus or on the CAHB bus. Performance is low when data is in SRAM 3 and code is in the OctoSPI external flash accessed through the SAHB bus and not cached by Dcache 1. Performance is also low when code is in the OctoSPI external flash accessed through the CAHB bus and not cached by the iCache. In both cases, the CoreMark score is only 39. The performance is drastically increased when the code accessed through the SAHB bus is cached by Dcache 1, but the SAHB bus remains a bottleneck because it transports both code and data. Almost optimal performance is achieved when the code accessed through the CAHB bus is cached by the iCache.
This requires the implementation of address remapping in the iCache. In this case, code is transferred over the CAHB bus while data is transferred over the SAHB bus. In the third case, code is stored in the internal flash, not cached by the iCache. When data is stored in the OctoSPI PSRAM, the performance is low as long as the data is not cached. When the data is cached in Dcache 1, the performance increases a lot and becomes even better than having the data in SRAM 3: 444 CoreMark instead of 430. In the last case, code is in the internal flash, cached by the iCache configured in two-way set-associative mode. Performance is low when data stored in the OctoSPI PSRAM is non-cacheable, and better when data is stored in an SRAM accessible through the FMC but still non-cacheable. In both cases, marking the address range containing this data as cacheable leads to the same good performance of 622 CoreMark. The best performance is obtained by having data in SRAM 3 and instructions in cacheable internal flash. All Dcache interrupt sources raise the same, unique Dcache interrupt signal and thus use the same interrupt vector. The three sources of the Dcache global interrupt are: an error detected on a data request initiated by the Dcache itself, either for a dirty cache line eviction or a clean operation, which sets the ERRF bit in the Dcache status register; a full invalidate operation, which sets the BSYENDF bit in the Dcache status register; and a cache range maintenance operation (invalidate, clean and invalidate, or clean), which sets the CMDENDF bit in the Dcache status register. The Dcache also propagates all AHB bus errors, such as security issues or address decoding issues, from the master port back to the SAHB slave port. A typical case is an erroneous refill request initiated by an initial data request that misses in the cache.
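Because the three sources share one interrupt line, the handler has to read the status register and dispatch on the individual flags. The sketch below shows that dispatch; the bit positions are placeholders chosen for illustration, not the real DCACHE_SR layout, which must be taken from the STM32U5 reference manual.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed bit positions, for illustration only; the actual DCACHE_SR
 * layout is defined in the STM32U5 reference manual. */
#define DCACHE_SR_BSYENDF (1U << 1)  /* full invalidate complete        */
#define DCACHE_SR_ERRF    (1U << 4)  /* error on a cache-initiated
                                        eviction or clean request       */
#define DCACHE_SR_CMDENDF (1U << 6)  /* range maintenance cmd complete  */

/* Dispatch the single shared Dcache interrupt to its three sources. */
static void dcache_irq_decode(uint32_t sr,
                              bool *err, bool *inv_done, bool *cmd_done)
{
    *err      = (sr & DCACHE_SR_ERRF)    != 0U;
    *inv_done = (sr & DCACHE_SR_BSYENDF) != 0U;
    *cmd_done = (sr & DCACHE_SR_CMDENDF) != 0U;
}
```

A real interrupt handler would additionally clear each flag it has serviced through the corresponding flag-clear register before returning.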
At product level, using Dcache 1/2 reduces the power consumption by loading/storing data from/to the internal cache memory most of the time, rather than from/to the bigger and more power-consuming main memories. This reduction is even higher when the cached main memories are external. Dcache 1/2 and their masters have the same state in the various low-power modes. When the microcontroller is in Stop mode, the user can decide to power down the Dcache, which may first require a complete clean and invalidate maintenance operation. When the Dcache is disabled, it is bypassed: the system bus input requests are just forwarded to the master port. In addition to this presentation, you can refer to the following presentations: Instruction Cache, Security, FMC, OctoSPI.