Hello, and welcome to this presentation of the ARM Cortex-M7 core, which is embedded in all products of the STM32H7 microcontroller family.
The Cortex-M7 core is part of the ARM Cortex-M group of 32-bit RISC cores. It implements the ARMv7E-M architecture. It features a six-stage, in-order, dual-issue superscalar pipeline with a single- and double-precision floating-point unit and SIMD support. The Cortex-M7 core comes much closer to the performance of a digital signal processor than the Cortex-M4 core does. It can execute load and store operations in parallel with arithmetic operations, with zero overhead on loops. The Cortex-M7 core directly interfaces with tightly coupled memories, or TCMs, for very low interrupt latency, resulting in more deterministic execution.
STM32H7 microcontrollers integrate an ARM Cortex-M7 core in order to benefit from the improved performance of the Cortex-M processor architecture, and particularly from its high level of performance in low-power modes. The Cortex-M7 core delivers more performance than the Cortex-M4 core, thanks to an enhanced architecture that brings increased processing capabilities.
To understand this performance improvement, let's look at the three units which are responsible for executing instructions, starting with the pre-fetch unit, or PFU, and then the data processing unit, or DPU, in conjunction with the load store unit, or LSU.
The pre-fetch unit, or PFU, provides one 64-bit instruction word per cycle to the data processing unit, or DPU. It includes a buffer of four entries of 64 bits each to enable fetching ahead of the DPU. It also includes a branch target address cache, or BTAC, for single-cycle branch prediction.
The Cortex-M7 core has a six-stage dual-issue pipeline for efficient operation. It brings the ability to process two instructions in parallel if certain criteria are fulfilled. The data processing unit, or DPU, is split into several pipes. There are two ALUs, one of which is capable of executing SIMD operations.
There is also a single MAC pipeline with one-MAC-per-cycle capability, and one floating-point pipe supporting single- and double-precision operations. When an instruction reaches the issue stage, it is split into micro-operations based on the needed operation and the registers used. It is then issued to the appropriate blocks further down the processing pipe. Forwarding of flags from the DPU to the PFU allows early resolution of direct branches in the decode and first execution stages of the pipeline.
The load store unit, or LSU, provides either dual 32-bit load channels or a single 64-bit store channel, with store buffering to increase store throughput.
The compiler hides the complexity of the core pipeline and optimizes code to take advantage of this architecture. Compared to the Cortex-M4 core, the most important advantage is that code can read two 32-bit values with one double-load instruction and, in parallel, process the previous two data values on the MAC pipe. The Cortex-M7 core is more efficient with long sequences of computations. As a branch can also be dual-issued, it can be executed in parallel with computation.
The branch target address cache, or BTAC, predicts whether a branch will be taken or not and reacts accordingly. It remembers the conditions and, based on the processing, predicts the next address to fetch.
Tightly coupled memories, or TCMs, are dedicated memories connected directly to the processor rather than through a bus, which avoids arbitration and latencies for frequently executed code. The tightly coupled memories for instructions (ITCM) and data (DTCM) allow important data and instructions to be statically mapped and accessed over these interfaces. This can be the case for the vector table, the interrupt service routines, and certain time-critical control loops that are executed often and require low latency and deterministic execution time. The AHBS provides a means for DMAs to access any of the tightly coupled RAMs. STM32H7 devices also implement ECC protection on TCM memories.
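To make the idea of static mapping a little more concrete, here is a minimal sketch of a helper that checks whether a buffer lies entirely inside one of the TCMs. The 64-kilobyte and 128-kilobyte sizes are the ones given in this presentation. The base addresses (0x00000000 for ITCM, 0x20000000 for DTCM) are an assumption taken from the usual single-core STM32H7 memory map and should be checked against the reference manual of the device in use. The names `classify_buffer`, `range_within` and `mem_region_t` are hypothetical, introduced only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed base addresses (typical single-core STM32H7 memory map).
 * Sizes are the ones quoted in the presentation. */
#define ITCM_BASE 0x00000000u
#define ITCM_SIZE (64u * 1024u)
#define DTCM_BASE 0x20000000u
#define DTCM_SIZE (128u * 1024u)

/* Hypothetical classification of where a buffer lives. */
typedef enum { MEM_ITCM, MEM_DTCM, MEM_OTHER } mem_region_t;

/* True if [addr, addr + len) lies entirely inside [base, base + size).
 * Written without computing addr + len, so it cannot wrap around. */
static bool range_within(uint32_t addr, uint32_t len,
                         uint32_t base, uint32_t size)
{
    assert(len > 0u);
    return addr >= base && len <= size && (addr - base) <= (size - len);
}

/* Report which tightly coupled memory, if any, fully contains the buffer. */
mem_region_t classify_buffer(uint32_t addr, uint32_t len)
{
    if (range_within(addr, len, ITCM_BASE, ITCM_SIZE)) {
        return MEM_ITCM;
    }
    if (range_within(addr, len, DTCM_BASE, DTCM_SIZE)) {
        return MEM_DTCM;
    }
    return MEM_OTHER;
}
```

A check like this could be used, for example, in a debug assertion at start-up to confirm that a time-critical buffer really was placed in DTCM by the linker script.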
The ITCM RAM has one 64-bit memory interface to satisfy the core fetch bandwidth. STM32H7 microcontrollers enable access to 64 kilobytes of SRAM over the ITCM interface. The DTCM RAM has two 32-bit memory interfaces to allow more parallelism on requests. Software can use up to 128 kilobytes of SRAM for critical data. ITCM enables an interrupt latency of 12 clock cycles, which is achieved when code is placed in ITCM and data in DTCM.
The AHBS is a 32-bit AMBA 3 AHB-Lite slave interface. It provides system access, for example by DMAs, to the ITCM and DTCM. The AHBS supports simultaneous system and processor access requests.
Here are some situations where you might want to use the ITCM RAM and DTCM RAM.
Compared to the Cortex-M4 core, where several AHB buses are needed for parallel transactions with a memory system, the Cortex-M7 core integrates a single AXI master bus. The AXI master, or AXIM, interface is part of the bus interface unit, or BIU. It is a 64-bit wide AXI interface that connects the CPU to internal and external memories. It can be used for instruction fetches, data cache line fills and evictions, non-cacheable normal-type memory data accesses, and device and strongly-ordered-type data accesses.
STM32H7 microcontrollers integrate an AXI bus matrix to take advantage of the Cortex-M7 AXIM interface. The AXI decouples the access request from the data phase, so the request is independent of its corresponding data. If the memory has latencies, this separation makes the bus available to perform a new request, provided no functional relationship exists between the two requests. For example, an instruction fetch and a data fetch can be performed in parallel.
In addition to the AXI interface, the Cortex-M7 core integrates optional instruction and data caches for efficient memory access. If the cache is enabled, any access that is not for a TCM or the AHB interface is managed by the appropriate cache controller.
In case of a cache hit, data is fetched from or written to the cache RAMs if the cacheability criteria are fulfilled. When the cache is disabled, or when non-cacheable or shared memory attributes are set, the accesses are performed directly to the memory using the AXI interface.
The STM32H7 microcontroller implements 16 kilobytes each for the instruction and data caches, with ECC protection on the cache RAMs. The L1 cache on STM32H7 offers fast access to frequently used code and data from the next level of lower-speed memories, such as external memories. Both caches use a line length of 256 bits, or 32 bytes, with a four-way set-associative scheme for the data cache and a two-way set-associative scheme for the instruction cache. A set is a group of contiguous lines aligned on an appropriate boundary: 64 bytes for two-way, 128 bytes for four-way. Sets allow a faster search for an address to determine whether it is cached or not. The cached instructions or data are fetched from external memory using the AXI interface.
No hardware coherency is supported. Software needs to manage cache maintenance by invalidating and cleaning cache lines before use. An easier way to maintain data coherency is to mark regions as shared. This prevents these regions from being cached in the D-cache, but it results in lower performance since all accesses go to the next-level memory.
On STM32H7 devices, the cache RAM ECC protection uses the SEC-DED algorithm. Cache RAM protection is managed by the core and is enabled by default after a reset. Cache configurations must be changed only when the caches are turned off. A cache flush must be performed after changing the ECC protection settings. The cache RAMs implement seven ECC bits for the instruction tag, data tag, and data. An 8-bit ECC code is used for the instruction cache RAM. The ECC protection allows the Cortex-M7 to recover from a RAM error detected at runtime. The recovery uses a cache clean and invalidate mechanism.
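The seven-bit and eight-bit ECC widths quoted above fit the usual arithmetic for a Hamming-style SEC-DED code: single-error correction over m data bits needs the smallest r with 2^r >= m + r + 1, and double-error detection adds one overall parity bit. The sketch below works this out. The word widths used in the example (32 bits for data cache data, 64 bits for instruction cache data) are an assumption about how the cache RAMs are organized and are not stated in this presentation. The function name `secded_check_bits` is hypothetical.

```c
#include <assert.h>

/* Number of check bits a Hamming-style SEC-DED code needs to protect
 * data_bits bits of data: the smallest r with 2^r >= data_bits + r + 1
 * (single-error correction), plus one overall parity bit
 * (double-error detection). */
unsigned secded_check_bits(unsigned data_bits)
{
    assert(data_bits > 0u && data_bits <= 1024u);

    unsigned r = 0u;
    while ((1u << r) < data_bits + r + 1u) {
        r++;
    }
    return r + 1u;
}
```

Under the assumed widths, `secded_check_bits(32)` gives 7 and `secded_check_bits(64)` gives 8, which matches the figures above.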
For dirty data cache lines, if the data cannot be corrected, then the error is non-recoverable. The write-through policy can be used to avoid data loss. The L2 cache is always coherent with the L1 cache. This is a summary of the cache RAM ECC protection and recoverable errors.
Three solutions exist to overcome cache coherency issues when different masters, for example DMAs, share the same memory buffers as the core. The first solution is to mark regions as shared to prevent these regions from being cached in the D-cache. The second solution is to clean or invalidate the cache when software passes or gets control over memory buffers. There are CMSIS functions that perform all the steps needed to clean and/or invalidate the caches. The third solution is to use a write-through policy for write-only memory buffers where the CPU is the data producer.
In the Cortex-M7 core, the memory protection unit, or MPU, is used to configure the behavior of the cache controllers and the AXIM interface, to enforce access rules, and to separate processes. The MPU in the STM32H7 microcontroller supports 16 independent memory regions with independently configurable attributes: access permission (read/write permissions in privileged and unprivileged modes), execute permission (executable region, or region prohibited for instruction fetch), and cache policy.
The STM32H7 microcontroller design benefits from the new features of the Cortex-M7 core, such as the memory interfaces, the cache system that compensates for slow memories, and the superscalar architecture, to offer high processing bandwidth for every application while keeping good responsiveness.
For more details, please refer to these application notes and the Cortex-M7 programming manual available at www.st.com. Also, visit the ARM website, where you will find more information about the Cortex-M7 core.
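As a small illustration of the second coherency solution mentioned earlier (cleaning or invalidating the cache when a buffer changes hands), the sketch below expands a buffer to whole 32-byte cache lines, since maintenance by address operates on complete lines. The struct and function names are hypothetical. On a real target, the resulting start address and size would typically be handed to a CMSIS-Core routine such as `SCB_CleanDCache_by_Addr` or `SCB_InvalidateDCache_by_Addr`; those calls are only named here, not used, so that the example stays runnable on a host.

```c
#include <assert.h>
#include <stdint.h>

#define CACHE_LINE_BYTES 32u /* line length stated in the presentation */

/* Hypothetical helper type: a range covering whole cache lines. */
typedef struct {
    uint32_t start; /* buffer start rounded down to a line boundary */
    uint32_t size;  /* byte count, a multiple of the line length */
} cache_range_t;

/* Expand [addr, addr + len) to the smallest range of whole cache lines
 * that contains it. Assumes addr + len + 31 does not wrap past 2^32. */
cache_range_t cache_align_range(uint32_t addr, uint32_t len)
{
    assert(len > 0u);

    const uint32_t mask = CACHE_LINE_BYTES - 1u;
    uint32_t start = addr & ~mask;
    uint32_t end = (addr + len + mask) & ~mask;

    cache_range_t r = { start, end - start };
    return r;
}
```

One design point worth keeping in mind: if the expanded range is larger than the buffer itself, invalidating it also discards whatever else shares those lines, so DMA buffers are usually aligned to 32 bytes and sized as a multiple of 32 bytes.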