So all the DSP performance we are speaking about can be measured by how fast your system is able to multiply and accumulate, over and over. So what can the STM32 offer to speed up such execution? If I come back, well, for sure all these operations are pretty easy. If you want to write this in C, you can do it very easily with one for loop, with one counter, and then indexing into arrays. Okay, no problem. But it will not be as performant as if you use the techniques I would like to explain.

Starting from Cortex-M4 and M7, ARM has introduced the SIMD instruction set. These SIMD instructions sit on top of the Cortex-M3 instruction set; they are added by the M4 and M7 cores. And as the name already says, you use one instruction to manipulate more data. I think here is quite a nice example: you use one such instruction to work over two 32-bit registers, and it assumes that in each 32-bit register you have two 16-bit numbers, one in the low half and one in the high half. This instruction splits each register into two parts, multiplies the 16-bit numbers in parallel, and then accumulates everything together into another register. And all this can be done in one cycle. So you can see the performance may be much better than if you just write a simple loop in C, because the compiler would never use this instruction. Usually the compilers are not able to use these instructions by themselves, and you need to help them. So effectively, it does this whole operation in one cycle, just by calling one instruction. The benefit is that it parallelizes the operation, two times or four times, depending on whether you are using 16-bit or 8-bit operands. For sure there is no benefit for 32-bit operands, because then you use the whole register for one number and you cannot parallelize in this way.
So these instructions are usually not generated by standard C compilers. You are forced either to write assembly to benefit from these instructions, or to use intrinsic functions, or to use high-level functions which use these instructions inside. And that is what is inside the CMSIS-DSP library coming from ARM, and we will see how we can operate with that library. That was common to the M4 and M7.

What is new with the M7? This is actually what we already talked about in the morning: we can do a load or store in parallel with a mathematical operation. This is the dual issue. We can execute two instructions at the same time, and we get much better, much faster loops. If we come back again to the mathematics and take the easiest filter, the FIR filter, what do I need to do? In every step I need to load two numbers, multiply them together, and sum. To load and multiply together, I can use our superscalar architecture. You remember dual issue: I can issue one instruction to load and one instruction to do the mathematics at the same time. So here I double the performance against the M4. Then I need to loop around a certain number of samples, so I use the benefit of the one-cycle-overhead loop. Again, I am getting much better performance than on the M4. And this is common to both cores.