 Hello and welcome to this presentation of the STM32G4 FMAC block. It covers the main features of this block, which is used to perform background signal filtering tasks autonomously. The FMAC unit is built around a fixed-point multiplier and accumulator, or MAC. The MAC unit receives two fixed-point 16-bit operands from an internal 256 x 16-bit RAM and writes the results back to this memory. The address of the input values in local memory is determined using a set of pointers. These pointers can be loaded, incremented, decremented, or reset by the internal hardware. Software does not access them directly. The unit allows frequent or lengthy filtering operations to be offloaded from the CPU, freeing up the processor for other tasks. The FIR and IIR filter functions can be realized by the FMAC. Typical applications requiring these filters are motor control, audio, power supply, lighting, and analog sensing. The FMAC offloads the CPU by executing background signal filtering tasks autonomously, thus freeing up the CPU MIPS for other tasks. The FMAC unit enables the user to select the filter type, the filter order, and the coefficients, which are all programmable. This figure details the architecture of the MAC unit. X is the input sample buffer containing the raw samples to be filtered. B is the array of coefficients of the filter to be applied to X samples. X and B have the same size, N plus 1 entries. Y is the output sample buffer containing the results of the filtering. A is the array of coefficients of the filter to be applied to Y samples. Y and A have the same size, M plus 1 entries. The number of MAC operations needed to obtain Y of n equals N plus 1 plus M: N plus 1 MACs to multiply-accumulate vector X and vector B, and M MACs to multiply-accumulate vector Y of n minus 1 to n minus M and vector A. Inputs and outputs of the FMAC use the fixed-point signed Q1.15 format. In Q1.15 format, the numeric range is minus 1, or 0x8000, to 1 minus 2 to the negative 15 power, or 0x7FFF. 
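To make the Q1.15 range concrete, here is a minimal Python sketch of the conversion between floating point and Q1.15. It is only a software model: the helper names are hypothetical, and rounding to nearest with saturation at the range limits is an assumption of this sketch, not a statement about what the hardware conversion instructions do.

```python
Q15_SCALE = 1 << 15                  # 2**15 = 32768 steps per unit
Q15_MIN, Q15_MAX = -32768, 32767     # bit patterns 0x8000 and 0x7FFF


def float_to_q15(x: float) -> int:
    """Convert a float to a signed Q1.15 integer (assumed: round, then saturate)."""
    n = round(x * Q15_SCALE)
    return max(Q15_MIN, min(Q15_MAX, n))


def q15_to_float(n: int) -> float:
    """Convert a signed Q1.15 integer back to a float."""
    return n / Q15_SCALE


def q15_to_hex(n: int) -> str:
    """Show the 16-bit two's-complement pattern of a Q1.15 value."""
    return f"0x{n & 0xFFFF:04X}"
```

With these helpers, minus 1.0 maps to the pattern 0x8000, and plus 1.0, which is not representable, saturates to 0x7FFF, that is, 1 minus 2 to the negative 15 power.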
32-bit single-precision floating-point numbers can be converted to or from Q1.15 format by dedicated conversion instructions that are executed in the Cortex-M4 FPU. This figure details the various formats used internally by the FMAC. The output of the multiplier, in Q2.30 format, is truncated to Q2.22 and added to the accumulator, LSB-aligned. The accumulator has 26 bits, of which 22 are fractional and 4 are integer with sign. This format is Q4.22. The extra integer bits allow the accumulator to support partial accumulation sums in the range minus 8 or 0x4000000 to plus 8 or 0x3FFFFFF. Such partial sums can occur if there are a large number of successive positive or negative coefficients. When the filter gain is less than unity for all frequencies, the accumulator value always returns to the range plus or minus 1. If the partial sum exceeds the accumulator numeric range and wraps, a sticky flag is set to help debugging. Nevertheless, provided subsequent additions undo the wrapping, a correct result is still obtained. A programmable gain can be applied at the output of the accumulator, from 0 dB to 42 dB in steps of 6 dB. This is necessary for IIR filter implementation. The FMAC unit performs arithmetic operations on vectors, which are arrays of 16-bit fixed-point scalar values. These vectors are allocated in the local SRAM. Software is in charge of configuring the X1 and X2 operand buffers and the Y output buffer through the FMAC_X1BUFCFG, FMAC_X2BUFCFG, and FMAC_YBUFCFG registers. The base addresses can be chosen anywhere in internal memory, provided that all buffers fit within the internal memory address range 0x0 to 0xFF. Both the base address and the size of each buffer have to be programmed. Note that the X1, X2, and Y buffers may overlap. These buffers are not visible in the CPU memory map. Before starting a filtering operation, the CPU or DMA controller initializes the contents of the input buffers using the initialization functions and writing to the FMAC_WDATA register. 
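The statement that a wrapped partial sum still yields a correct result can be checked with a small Python model of a 26-bit two's-complement accumulator. This is a sketch under stated assumptions: the function names are hypothetical, the terms are taken to be already in Q4.22, and the sticky flag is modelled as a plain boolean.

```python
ACC_BITS = 26      # Q4.22: 4 integer/sign bits + 22 fractional bits
FRAC_BITS = 22


def wrap26(value: int) -> int:
    """Wrap an integer into the signed 26-bit two's-complement range."""
    value &= (1 << ACC_BITS) - 1
    if value >= 1 << (ACC_BITS - 1):
        value -= 1 << ACC_BITS
    return value


def to_q4_22(x: float) -> int:
    """Hypothetical helper: scale a float to 22 fractional bits."""
    return round(x * (1 << FRAC_BITS))


def accumulate(terms):
    """Sum Q4.22 terms with wrap-around; return (result, wrapped_flag)."""
    acc, wrapped = 0, False
    for t in terms:
        exact = acc + t
        acc = wrap26(exact)
        wrapped = wrapped or (acc != exact)   # sticky: never cleared here
    return acc, wrapped
```

For example, accumulating ten terms of 0.9 followed by ten terms of minus 0.9 pushes the partial sum past the accumulator range, so the flag is raised, yet the final result is exactly zero because the later additions undo the wrapping.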
The contents of the input buffers can be either data to be filtered or filter coefficients. The data is transferred to the location within the target buffer indicated by a write pointer. After each new write, the write pointer is incremented. When the write pointer reaches the end of the allocated buffer space, it wraps back to the base address. Regarding the X1 buffer, if the number of free spaces in the buffer is less than the watermark threshold programmed in the FULL_WM field of the FMAC_X1BUFCFG register, the buffer is flagged as full. As long as the full flag is not set, interrupts or DMA requests are generated, if enabled, to request more data for the buffer. The watermark allows several data words to be transferred under one interrupt without danger of overflow. Nevertheless, if an overflow does occur, the OVFL error flag is set and the write data is ignored. The write pointer is not incremented in the event of an overflow. Regarding the Y buffer, if the number of unread data words in the buffer is less than the watermark threshold programmed in the EMPTY_WM field of the FMAC_YBUFCFG register, the buffer is flagged as empty. As long as the empty flag is not set, interrupts or DMA requests are generated, if enabled, to request reads from the buffer. The watermark allows several data words to be transferred under one interrupt without danger of underflow. Nevertheless, if an underflow does occur, the UNFL error flag is set. In this case, the read pointer is not incremented and the read operation returns the content of the memory at the read pointer address. Each multiplication takes a value from the X1 buffer and a value from the X2 buffer and multiplies them together. A pointer in the control unit generates the read address offset, relative to the buffer base address, for each value. The pointers are managed by hardware according to the current function. This figure explains the X1 buffer operation. 
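The write-side rules just described (wrap-around write pointer, full flag driven by a watermark, overflow that drops the write) can be illustrated with a toy Python model. Everything here is an assumption made for illustration: the class and attribute names are hypothetical, the watermark is given directly as a number of samples, and a simple counter stands in for the hardware's own bookkeeping of which samples are still in use.

```python
class X1BufferModel:
    """Toy model of the X1 write side: wrapping pointer, watermark, overflow."""

    def __init__(self, size: int, full_wm: int):
        self.mem = [0] * size
        self.size = size
        self.full_wm = full_wm   # watermark threshold, in samples (assumed encoding)
        self.wr = 0              # write pointer, as an offset from the buffer base
        self.used = 0            # samples written and not yet released
        self.ovfl = False        # sticky overflow error flag

    @property
    def full(self) -> bool:
        # Flagged full when the free space drops below the watermark threshold.
        return (self.size - self.used) < self.full_wm

    def write(self, sample: int) -> None:
        if self.used == self.size:
            self.ovfl = True     # write data ignored, pointer not incremented
            return
        self.mem[self.wr] = sample
        self.wr = (self.wr + 1) % self.size   # wrap back to the base address
        self.used += 1

    def release(self, count: int = 1) -> None:
        """The filter no longer needs the oldest `count` samples."""
        self.used -= min(count, self.used)
```

With an 8-entry buffer and a watermark of 4, the full flag stays clear while at least 4 entries are free, which is what lets a 4-sample burst be written under one request without risk of overflow.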
When the write pointer reaches the end of the buffer, it wraps back to the beginning. If the available space in the buffer is less than the transfer size, the input buffer full flag is activated. If the top of the input set, X of n, equals the write pointer, that is, no new sample is available, the filter stalls until a new sample is available. The processor or DMA controller must ensure that the new sample X of n plus 1 is available in the buffer space when required. If not, the buffer is flagged as empty, which stalls the execution of the unit until a new sample is added. No underflow condition is signaled on the X1 buffer. The X1 buffer can be used as a circular buffer. New data is continually transferred into the input buffer whenever space is available. The write pointer automatically wraps around when it reaches the last 16-bit entry in the buffer, as shown in the figure. Preloading this buffer is optional for digital filters. If no input samples have been written to the buffer when the operation is started, the buffer is flagged as empty, which triggers the CPU or DMA to load new samples until there are enough to begin operation. Preloading is nevertheless useful in the case of a vector operation, that is, when the input data is already available in system memory and circular operation is not required. The X2 buffer is used to store coefficients. It is usually loaded once during the initialization of the FMAC. Consequently, it does not support the circular addressing mode. This figure summarizes the operation of the input buffers. During step 1, the filter calculates Y of n from X of n minus 7 to X of n and loads the next four samples. During step 2, Y of n is now calculated. The sample X of n minus 7 is removed. Then n is incremented. The filter calculates Y of n from X of n minus 7 to X of n. No new sample is loaded. During step 3, Y of n is now calculated. The sample X of n minus 7 is removed. Then n is incremented. The filter calculates Y of n from X of n minus 7 to X of n. 
No new sample is loaded. During step 4, Y of n is now calculated. The sample X of n minus 7 is removed. Then n is incremented. The filter calculates Y of n from X of n minus 7 to X of n. No new samples are loaded. Since the upper address of the buffer has been reached, a wrap-around to the beginning of the buffer occurs. This figure explains the Y buffer operation. When the write pointer reaches the end of the buffer, it wraps back to the beginning. A read pointer designates the oldest unread sample, corresponding to the output data register. When a sample has been read and is not part of the output set, its space becomes free. If the write pointer equals the read pointer or the least recent sample in the output set, that is, Y of n minus M, the filter stalls and the output buffer full flag is set. This figure summarizes the operation of the output buffer. During step 1, the filter calculates Y of n from Y of n minus 7 to Y of n minus 1. 11 samples are unread. During step 2, n is incremented. The filter calculates Y of n from Y of n minus 7 to Y of n minus 1. Software or DMA reads four samples, which shifts the read pointer to the oldest sample. During step 3, n is incremented. The filter calculates Y of n from Y of n minus 7 to Y of n minus 1. Software or DMA again reads four samples, which shifts the read pointer to the oldest sample. However, samples Y of n minus 7 to Y of n minus 5 are not deallocated because they are used in the current calculation. The FIR function performs a convolution of a vector B of length N plus 1 containing the filter coefficients and a vector X of indefinite length containing the sample data. To implement the FIR in the FMAC, the buffers are used as follows. The X1 buffer contains the elements of vector X. It is a circular buffer of length N plus 1 plus d. The X2 buffer contains the elements of vector B. It is a fixed buffer of length N plus 1. The Y buffer contains the output values Y of n. It is a circular buffer of length d. Here are the parameters. 
The parameter P contains the length N plus 1 of the coefficient vector B, in the range 2 to 127. The parameter R contains the gain to be applied to the accumulator output. The value output to the Y buffer is multiplied by 2 to the R power, where R is in the range of 0 to 7. The parameter Q is not used. The function completes when the start bit in the FMAC_PARAM register is reset by software. The FIR requires N coefficients and N input samples to calculate one output sample. To optimize throughput, the input buffer size should be larger than N, by a margin epsilon, in order to load the next samples while the filter is working on the current set. For example, when using 4-beat DMA transfers, epsilon should be set to 4. Also, the size of the output buffer should be set to epsilon to transfer the resulting samples in a single AHB burst transaction. The IIR filter output vector Y is the convolution of a coefficient vector B of length N plus 1 and a vector X of indefinite length, plus the convolution of the delayed output vector Y prime with a second coefficient vector A of length M. To implement the IIR in the FMAC, the buffers are used as follows. The X1 buffer contains the elements of vector X. It is a circular buffer of length N plus 1 plus d. The X2 buffer contains the elements of coefficient vectors B and A, concatenated: b0, b1, b2, through bN, followed by a1, a2, through aM. It is a fixed buffer of length M plus N plus 1. The Y buffer contains the output values Y of n. It is a circular buffer of length M plus d. The parameter P contains the length N plus 1 of the coefficient vector B, in the range 2 to 64. The parameter Q contains the length M of the coefficient vector A, in the range 1 to 63. The parameter R contains the gain to be applied to the accumulator output. The value output to the Y buffer is multiplied by 2 to the R power, where R is in the range of 0 to 7. The function completes when the start bit in the FMAC_PARAM register is reset by software. 
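The FIR and IIR functions described above can be sketched as one direct-form routine in Python operating on Q1.15 integers. This is a behavioural model under several assumptions, not the hardware algorithm: the function names are hypothetical, samples before the start of the input are taken as zero, each product is reduced to 22 fractional bits by a floor shift, and the output saturates rather than wraps.

```python
def saturate_q15(v: int) -> int:
    """Clamp to the signed 16-bit Q1.15 range (assumed output behaviour)."""
    return max(-32768, min(32767, v))


def iir_q15(x, b, a, r=0):
    """Direct-form filter on Q1.15 integers.

    x: input samples
    b: the P = N + 1 feed-forward coefficients b0..bN
    a: the Q = M feedback coefficients a1..aM (an empty list gives an FIR)
    r: gain exponent R, the output is scaled by 2**R
    """
    y = []
    for n in range(len(x)):
        acc = 0
        for k, bk in enumerate(b):             # P MACs on the input samples
            if n - k >= 0:
                acc += (bk * x[n - k]) >> 8    # keep 22 fractional bits of the product
        for k, ak in enumerate(a, start=1):    # Q MACs on the previous outputs
            if n - k >= 0:
                acc += (ak * y[n - k]) >> 8
        acc <<= r                              # programmable gain of 2**R
        y.append(saturate_q15(acc >> 7))       # back to 15 fractional bits
    return y
```

Calling `iir_q15(x, b, [])` gives the FIR case, where Q is unused. With b0 and a1 both equal to 0.5 and R equal to 1, an input of 0.5 followed by zeros produces a constant output of 0.5, which shows how the gain compensates for coefficients that have been scaled down to fit the Q1.15 range.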
The IIR requires N feed-forward coefficients and M feedback coefficients, M being lower than N. The input buffer size should be N plus epsilon, epsilon being the number of data words in a DMA burst. The output buffer size should be M plus epsilon. The clock reference of the FMAC module is FCLK, the free-running clock of the Cortex-M4 with FPU. An N-tap filter, such as an FIR, requires N multiplications and additions per output sample, and each MAC requires two memory reads, thus two clock cycles. As a consequence, the maximum sample rate is the FCLK frequency divided by 2N. Equivalently, N must be lower than the FCLK frequency divided by two times the maximum sample rate. Assuming the FCLK frequency of the STM32G4 is 170 MHz, the maximum filter size at a sampling frequency Fs of 2 megasamples per second is N of at most 42 taps, and the maximum sample rate for N equals 128 taps is Fs of about 664 kHz. The table in this slide compares the performance of the FMAC hardware block with a software filter implementation. The CMSIS-DSP library provides optimized filter functions. Software is about 15% faster thanks to the dual-MAC capability of the Cortex-M4. But for large filters at high sample rates, the CPU spends a lot of time on the filtering task. With the FMAC plus DMA, the CPU is free to perform other tasks. Flow control can be source-driven, sink-driven, or filter-driven. This slide describes the source-driven flow control sequence. The source of the samples, ADC, I2C, etc., defines the sample data rate. The source requests the DMA or CPU to transfer data to the filter input buffer. The filter operates at a clock rate faster than 2N times the source sample rate. When the input buffer is empty, that is, the next sample is not available, the filter stalls, waiting for new data. When the output buffer is not empty, that is, one or more samples are available, an output channel DMA request or interrupt is generated. The DMA or CPU transfers the output samples to memory or to another peripheral such as a DAC or PWM. This slide describes the filter-driven flow control sequence. 
The filter clock rate determines the throughput. An input channel DMA request or interrupt is generated whenever the input buffer is not full. The DMA or CPU transfers data into the input buffer from memory or another peripheral. As long as data is available in the input buffer, the filter generates new output samples. When the output buffer is not empty, an output channel DMA request or interrupt is generated. The DMA or CPU transfers data from the output buffer to memory or another peripheral. This slide describes the sink-driven flow control sequence. The destination of the samples, DAC, I2C, etc., defines the sample data rate. The destination requests the DMA or CPU to transfer data from the filter output. The filter operates at a clock rate faster than 2N times the destination sample rate. When the output buffer is full, the filter stalls. When the input buffer is not full, an input channel DMA request or interrupt is generated. The DMA or CPU transfers samples from memory or another peripheral to the input buffer. The FMAC executes the filter algorithm when the input buffer is not empty and the output buffer is not full. A flag called X1FULL is set if the number of available spaces in the X1 buffer is less than the FULL_WM threshold. A DMA request can be generated when this flag is not set in order to fill the X1 buffer. A flag called YEMPTY is set if the number of unread data words is less than the EMPTY_WM threshold. A DMA request can be generated when this flag is not set in order to empty the Y buffer. The management of buffers can also be performed by software, relying on interrupt requests that can be asserted when either flag is inactive. The filter clock frequency must be chosen according to the chosen flow control scheme. The FMAC unit is active in run, low-power run, sleep, and low-power sleep modes. It is not available in the other low-power modes. These peripherals may need to be specifically configured for correct use with the FMAC block. 
Please refer to the corresponding peripheral training modules for more information.
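As a closing worked example, the throughput figures quoted in the performance part of this presentation can be reproduced with a few lines of Python. The function names are hypothetical, and the model only applies the rule stated earlier, two clock cycles per MAC, so it ignores any other overhead.

```python
def max_taps(fclk_hz: float, fs_hz: float) -> int:
    """Largest tap count N such that fs <= fclk / (2 * N)."""
    return int(fclk_hz // (2 * fs_hz))


def max_sample_rate(fclk_hz: float, taps: int) -> float:
    """Upper bound on the sample rate, in Hz, for an N-tap filter."""
    return fclk_hz / (2 * taps)
```

With a 170 MHz clock, `max_taps(170e6, 2e6)` gives 42 taps at 2 megasamples per second, and `max_sample_rate(170e6, 128)` gives 664062.5 Hz, which is the 664 kHz figure for a 128-tap filter.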