Hello, and welcome to this presentation of the ARM Cortex-M7 core, which is embedded in all products of the STM32F7 microcontroller family. The Cortex-M7 core is part of the ARM Cortex-M group of 32-bit RISC cores. It implements the ARMv7E-M architecture. It features a six-stage, in-order, dual-issue superscalar pipeline with a single- and double-precision floating-point unit and SIMD support. The performance of the Cortex-M7 core is much closer to that of a digital signal processor than that of the Cortex-M4 core. It can execute load and store operations in parallel with arithmetic operations, with zero overhead on loops. The Cortex-M7 core directly interfaces with tightly coupled memories, or TCMs, for very low interrupt latency, resulting in more deterministic execution.
STM32F7 microcontrollers integrate an ARM Cortex-M7 core in order to benefit from the improved performance of the Cortex-M processor architecture, and particularly from its high level of performance in low-power modes. The Cortex-M7 core delivers more performance than the Cortex-M4 core thanks to an enhanced architecture that brings increased processing capabilities. To understand this performance improvement, let's look at the three units that are responsible for executing instructions, starting with the prefetch unit, or PFU, and then the data processing unit, or DPU, in conjunction with the load store unit, or LSU.
The PFU provides 64 bits of instruction data per cycle to the DPU. It includes a buffer of four 64-bit entries to enable fetching ahead of the DPU, and a branch target address cache, or BTAC, for single-cycle branch prediction. The Cortex-M7 core has a six-stage, dual-issue pipeline for efficient operation. It can process two instructions in parallel if certain criteria are fulfilled.
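To make the point about loads running in parallel with arithmetic more concrete, here is a minimal sketch of the kind of multiply-accumulate loop that tends to benefit. The function name `dot_product` and the unroll-by-two shape are illustrative assumptions, not something taken from the presentation, and whether a given compiler actually pairs the loads with the MAC operations depends on the toolchain and its optimization settings.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical example: 32-bit dot product with a 64-bit accumulator.
   Two elements are handled per iteration so that the loads for the next
   pair can, in principle, overlap with the multiply-accumulates of the
   current pair on a dual-issue core. */
int64_t dot_product(const int32_t *a, const int32_t *b, size_t n)
{
    int64_t acc = 0;
    size_t i = 0;

    /* Main loop: two multiply-accumulates per iteration. */
    for (; i + 1 < n; i += 2) {
        acc += (int64_t)a[i] * b[i];
        acc += (int64_t)a[i + 1] * b[i + 1];
    }

    /* Tail: handle the last element when n is odd. */
    if (i < n) {
        acc += (int64_t)a[i] * b[i];
    }
    return acc;
}
```

The code is plain C and runs on any host; it only illustrates a loop shape, not a guaranteed instruction schedule.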
The data processing unit, or DPU, is split into several pipes: two ALUs, one of which can execute SIMD operations; a single MAC pipeline capable of one MAC per cycle; and one floating-point pipe supporting single- and double-precision operations. When an instruction reaches the issue stage, it is split into micro-operations based on the operation needed and the registers used. It is then issued to the appropriate blocks further down the processing pipe. Forwarding of flags from the DPU to the PFU allows early resolution of direct branches in the decoder and first execution stages of the pipeline. The load store unit, or LSU, provides either dual 32-bit load channels or a single 64-bit store channel, with store buffering to increase store throughput.
The compiler hides the complexity of the core pipeline and optimizes code to take advantage of this architecture. Compared to the Cortex-M4 core, the most important advantage is that code can read two 32-bit values with a single double-load instruction and, in parallel, process the previous two values on the MAC pipe. The Cortex-M7 core is more efficient with long sequences of computations. As a branch can also be dual-issued, it can be executed in parallel with computation. The branch target address cache, or BTAC, predicts whether a branch will be taken or not and reacts accordingly. It remembers the conditions and, based on previous processing, predicts the next address to fetch.
Tightly coupled memories, or TCMs, are dedicated memories connected directly to the processor rather than through a bus, which avoids arbitration and latency for frequently executed code. The TCMs for instructions (ITCM) and data (DTCM) allow important data and instructions to be statically mapped and accessed over these interfaces. This can be the case for the vector table, the interrupt service routines, and certain time-critical control loops that are executed often and require low latency and deterministic execution time.
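As a rough illustration of static mapping to the TCMs, the sketch below tags a time-critical routine and its working data with section attributes. The section names `.itcm_text` and `.dtcm_data`, the macros `ITCM_CODE` and `DTCM_DATA`, and the routine `control_step` are all assumptions made for this example. On a real project the names must match output sections that the linker script places in the ITCM and DTCM address ranges, and the startup code has to copy or zero those sections before use.

```c
#include <stdint.h>

/* Hypothetical section names; they only take effect when building for an
   ARM target with a GCC-compatible compiler and a matching linker script.
   On any other build the macros expand to nothing. */
#if defined(__GNUC__) && defined(__arm__)
#define ITCM_CODE __attribute__((section(".itcm_text")))
#define DTCM_DATA __attribute__((section(".dtcm_data")))
#else
#define ITCM_CODE
#define DTCM_DATA
#endif

#define TAPS 4u

/* Working data of the control loop, intended for DTCM. */
DTCM_DATA static int32_t history[TAPS];
DTCM_DATA static uint32_t head;

/* Time-critical step, intended for ITCM: stores the new sample and
   returns the average of the last TAPS samples. */
ITCM_CODE int32_t control_step(int32_t sample)
{
    int64_t sum = 0;

    history[head] = sample;
    head = (head + 1u) % TAPS;

    for (uint32_t i = 0; i < TAPS; i++) {
        sum += history[i];
    }
    return (int32_t)(sum / (int64_t)TAPS);
}
```

The same pattern would apply to an interrupt service routine; the moving average here is only a stand-in for whatever the control loop really computes.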
The AHBS provides a means for DMAs to access any of the tightly coupled RAMs. The ITCM RAM has one 64-bit memory interface to satisfy the core's fetch bandwidth. STM32F7 microcontrollers enable access to flash memory on the ITCM interface and include an integrated ART flash accelerator for best performance. A 16-kilobyte SRAM is also accessible over the ITCM bus. The DTCM RAM has two 32-bit memory interfaces to allow more parallelism between requests. Depending on the STM32F7 device, software can use up to 128 kilobytes of SRAM for critical data. The ITCM enables an interrupt latency of 12 clock cycles, which is achieved when code is placed in ITCM and data in DTCM. The AHBS is a 32-bit AMBA 3 AHB-Lite slave interface. It provides system access, for example by DMAs, to the ITCM and DTCM. The AHBS supports simultaneous system and processor access requests. Here are some situations where you might want to use the ITCM RAM and DTCM RAM.
Compared to the Cortex-M4 core, where several AHB buses are needed for parallel transactions with the memory system, the Cortex-M7 core integrates a single AXI master bus. The AXI master, or AXI-M, interface is part of the bus interface unit, or BIU. It is a 64-bit-wide AXI interface that connects the CPU to internal and external memories. It can be used for instruction fetches, data cache line fills and evictions, non-cacheable normal-type memory data accesses, and device and strongly-ordered-type data accesses. STM32F7 microcontrollers integrate an AXI-to-multi-AHB bridge to take advantage of the Cortex-M7 AXI-M interface. The AXI protocol decouples the access request from the data phase, so a request is independent of its corresponding data. If the memory has latency, this separation leaves the bus available for a new request, provided no functional relationship exists between the two requests. For example, instruction fetches and data fetches can be performed in parallel. The STM32F7 AXI master runs at the same frequency as the core.
With the AXI-to-multi-AHB bridge also running at the same frequency, memory system performance is optimized even when external memories add latency. In addition to the AXI-M interface, the Cortex-M7 core integrates optional instruction and data caches for efficient memory access. If the cache is enabled, any access that is not for a TCM or the AHBP interface is managed by the appropriate cache controller. On a cache hit, data is read from or written to the cache RAMs, provided the cacheability criteria are fulfilled. When the cache is disabled, or when non-cacheable or shared memory attributes are set, accesses are performed directly to memory using the AXI-M interface.
Depending on the STM32F7 microcontroller, the instruction and data cache sizes vary from 4 kilobytes to 16 kilobytes. The L1 cache on the STM32F7 offers fast access to frequently used code and data held in the next level of lower-speed memories, such as external memories. Both caches use a line length of 256 bits, or 32 bytes, with a four-way set-associative scheme for the data cache and a two-way set-associative scheme for the instruction cache. A set is a group of contiguous lines aligned to an appropriate boundary: 64 bytes for two-way, 128 bytes for four-way. Sets allow a faster search for an address to determine whether it is cached or not. The cached instructions or data are fetched from external memory using the AXI-M interface.
No hardware coherency is supported. Software needs to manage cache maintenance by invalidating and cleaning cache lines before use. An easier way to maintain data coherency is to mark regions as shared. This prevents these regions from being cached in the D-cache, but it results in lower performance since all accesses go to the next-level memory. Three solutions exist to overcome cache coherency issues when different masters, for example DMAs, share the same memory buffers as the core.
The first solution is to mark regions as shared to prevent them from being cached in the D-cache. The second solution is to clean or invalidate the cache when software passes or gets control over memory buffers. There are CMSIS functions that perform all the steps needed to clean and/or invalidate the caches. The third solution is to use a write-through policy for write-only memory buffers, where the CPU is the data producer.
In the Cortex-M7 core, the memory protection unit, or MPU, is used to configure the behavior of the cache controllers and the AXI-M interface, to enforce access rules, and to separate processes. The MPU in STM32F7 microcontrollers supports eight independent memory regions, each with independently configurable attributes for access permission (read and write allowed or not, in privileged and/or unprivileged mode), execution permission (executable region, or region prohibited for instruction fetch), and cache policy.
The STM32F7's design benefits from the new features of the Cortex-M7 core, such as the memory interfaces, the cache system that compensates for slow memories, and the superscalar architecture, to offer high processing bandwidth for every application while keeping good responsiveness. For more details, please refer to these application notes and the Cortex-M7 programming manual available at www.st.com. Also visit the ARM website, where you will find more information about the Cortex-M7 core.
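The three coherency solutions described earlier can be sketched with a few small helpers. This is only an illustration under stated assumptions: the names `cache_span_for`, `rasr_size_field`, `rasr_noncacheable_shared` and `rasr_write_through` are hypothetical, the bit positions are those of the ARMv7-M `MPU_RASR` register, and full read/write access with execute-never is an arbitrary choice for a data buffer. The code only computes values. On a target, the span would be handed to the CMSIS routines `SCB_CleanDCache_by_Addr` or `SCB_InvalidateDCache_by_Addr`, and the attribute word would be written to `MPU->RASR` after selecting the region and its base address.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* D-cache line length given in the presentation: 256 bits = 32 bytes. */
#define DCACHE_LINE_BYTES 32u

/* Line-aligned span that covers a buffer (hypothetical helper type). */
typedef struct {
    uintptr_t start;   /* rounded down to a cache line boundary */
    size_t    length;  /* multiple of the cache line length */
} cache_span_t;

/* Second solution: maintenance works on whole lines, so widen the buffer
   to line boundaries before cleaning or invalidating it. */
cache_span_t cache_span_for(uintptr_t addr, size_t size)
{
    const uintptr_t mask = (uintptr_t)DCACHE_LINE_BYTES - 1u;
    cache_span_t span;

    assert(size > 0);
    span.start  = addr & ~mask;
    span.length = (size_t)(((addr + size + mask) & ~mask) - span.start);
    return span;
}

/* ARMv7-M MPU_RASR bit fields. */
#define RASR_ENABLE   (1u << 0)
#define RASR_SIZE(n)  ((uint32_t)(n) << 1)   /* region is 2^(n+1) bytes */
#define RASR_B        (1u << 16)             /* bufferable */
#define RASR_C        (1u << 17)             /* cacheable */
#define RASR_S        (1u << 18)             /* shareable */
#define RASR_TEX(t)   ((uint32_t)(t) << 19)
#define RASR_AP_FULL  (3u << 24)             /* read/write, all modes */
#define RASR_XN       (1u << 28)             /* no instruction fetch */

/* SIZE field for a power-of-two region of at least 32 bytes. */
uint32_t rasr_size_field(uint32_t region_bytes)
{
    uint32_t n = 0;

    assert(region_bytes >= 32u);
    assert((region_bytes & (region_bytes - 1u)) == 0u);
    while ((1u << (n + 1u)) != region_bytes) {
        n++;
    }
    return n;
}

/* First solution: normal memory, non-cacheable, shareable. */
uint32_t rasr_noncacheable_shared(uint32_t region_bytes)
{
    return RASR_XN | RASR_AP_FULL | RASR_TEX(1) | RASR_S
         | RASR_SIZE(rasr_size_field(region_bytes)) | RASR_ENABLE;
}

/* Third solution: normal memory, write-through, not shareable. */
uint32_t rasr_write_through(uint32_t region_bytes)
{
    return RASR_XN | RASR_AP_FULL | RASR_TEX(0) | RASR_C
         | RASR_SIZE(rasr_size_field(region_bytes)) | RASR_ENABLE;
}
```

For example, a 40-byte buffer starting 4 bytes into a cache line spans two lines, so the helper returns a 64-byte span. Keeping DMA buffers aligned to 32 bytes and sized in multiples of 32 avoids touching neighboring data during maintenance.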