My name is Roberto Sandino. I work on audio and sensing platforms, with a specific focus on home applications. First of all, why audio? Audio has been one of the key drivers of innovation: starting from the old landline phones, it brought us to cordless phones, then to mobile phones, and finally to today's smartphone, where several once totally different applications have converged into a single device, the modern version of that object that was originally designed just to communicate voice. There is a reason for this: voice is one of our basic needs, and it is a need that technology can help with. If the evolution of voice communication has led us to modern smartphones and all the networking infrastructure behind them, voice has also been a key technology driver for automatic speech recognition systems. The so-called ASR systems started appearing to consumers many years ago with PCs, at first as professional applications, for example Windows speech-to-text tools. One very important step came when mobile phones started to include a voice interface: originally you could just pronounce the name of a person in your contact list, and the telephone would automatically call that person. Today we have Siri, we have the Google APIs, we have a number of cloud speech recognition providers. New systems are now beginning to include voice and audio as a user interface. We have smart TVs with microphones mounted either on the remote control or on the device itself, so that you can browse the internet just by giving voice commands, and there is a new generation of devices coming to the market where you can talk to something in your house, maybe without even knowing exactly what you are talking to: you just talk to the internet and the internet replies. So this is becoming something very big for consumers and for smart home applications. Let's look at the technology behind it. First of all, let's consider a generic architecture. We have voice terminals in charge of audio capture and some signal processing, sometimes with low power requirements, sometimes not, because they may be powered by the mains. These devices collect audio, and there is another device, which may converge with them into a single object: the voice and data gateway, which is in charge of providing connectivity so that your voice can reach some remote server. That server performs what we today call cloud processing, which typically gives us natural speech recognition and a number of other features. It is very important to note that speech recognition, when implemented on a cloud server, can be connected with services. So I can talk to something in my house and launch services: control the lights, read the latest news, buy me an airplane ticket, and so on. This is going to have an impact on everyone's life, and it is important to understand how to optimize the electronics behind this kind of infrastructure. We will start from the microphones, because that is where the audio is captured first, and then we will look at all the system aspects.
So, microphones. I'm with ST, and I like to work with digital MEMS microphones because they offer a lot of flexibility to designers. Let's look at what they are. A MEMS microphone is built with two main components inside the package. One is the actual sensor, typically a capacitive membrane (it may be implemented with other technologies), sitting in an acoustic chamber connected to a very small sound inlet. The fact that this sound inlet is so small has a physical implication: the pickup pattern of a digital MEMS microphone is always omnidirectional, because within the acoustic chamber there is an acoustic phenomenon called diffraction. Basically, the microphone has no way to understand where the sound comes from, but a good microphone can capture very accurately the sound pressure level present at the position of the sound inlet. Then we have a digital ASIC in charge of converting this analog signal into a digital one. The output is called PDM, pulse density modulation, a signal that typically has a very high temporal resolution and a very low amplitude resolution. This is the typical interface adopted in these microphone systems. Of course, microphones can come in a plastic or a metallic package, and they can be bottom-port or top-port, which means that the sound inlet may sit between the pads of the microphone or on the opposite side, on top of the package. So, how do we build systems with digital MEMS microphones? We just connect them to an STM32, or to an MCU in general, digitally: we can do this with a serial interface, a parallel interface, or a dedicated interface. We now have a number of STM32 processors with a dedicated PDM interface; through a dedicated peripheral they can convert PDM to PCM. For all the other devices we have libraries: there is a PDM-to-PCM filter library, part of the standard STM32Cube distribution, that implements the conversion from the PDM format to the most familiar and typical audio standard, PCM at 16 bits per sample. Note that one of the hardware interfaces mentioned here, the parallel interface using GPIOs, enables us to build microphone array architectures that are very effective and have a very simple bill of materials: you can plug a number of digital MEMS microphones directly into an STM32. In my lab I typically have arrays of 8 or 16 microphones, and we even managed to build a 32-microphone array with a single STM32. So, how can you explore audio? How can you implement and study your routines? You can do it by going through the datasheet and the technical specification of the interface, or you can use a simple system like this one. It is called BlueCoin; it's a development platform. Today it is still an R&D device; we will distribute it at the beginning of next year, and it is an example of what can be done to support audio experimentation. What is very important is that when you use one of these platforms, such as the BlueCoin, you have a number of software libraries available to start with. So you can experiment with a number of audio processing libraries, which I will explain in the next part of this talk, and you will also be able to put this audio processing in cooperation with other libraries working on other sensors, and with communication libraries exploiting, for example, the Bluetooth Low Energy channel. So much for audio capture.
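To give a feel for what that conversion does, here is a crude, illustrative PDM-to-PCM decimator in C. It is not the STM32Cube PDM2PCM library: it simply counts the '1' bits in a window of the 1-bit PDM stream (a boxcar low-pass plus decimation), while the real library uses a proper multistage filter. The decimation factor of 64 is an assumption, e.g. a 1.024 MHz PDM clock producing 16 kHz PCM.

```c
#include <stddef.h>
#include <stdint.h>

#define PDM_DECIMATION 64  /* e.g. 1.024 MHz PDM clock -> 16 kHz PCM output */

/* Crude PDM-to-PCM conversion for illustration only (not the STM32Cube
 * PDM2PCM library): each output sample is the count of '1' bits in a window
 * of PDM_DECIMATION one-bit PDM samples, i.e. a boxcar low-pass filter
 * followed by decimation. A production filter would use a CIC/FIR chain. */
void pdm_to_pcm(const uint8_t *pdm_bytes, size_t n_pcm_samples, int16_t *pcm_out)
{
    const size_t bytes_per_sample = PDM_DECIMATION / 8;

    for (size_t i = 0; i < n_pcm_samples; i++) {
        int ones = 0;
        for (size_t b = 0; b < bytes_per_sample; b++) {
            uint8_t byte = pdm_bytes[i * bytes_per_sample + b];
            for (int k = 0; k < 8; k++)
                ones += (byte >> k) & 1;   /* density of '1' bits tracks amplitude */
        }
        /* Map the bit density 0..PDM_DECIMATION onto a signed 16-bit sample. */
        pcm_out[i] = (int16_t)(((ones * 2 - PDM_DECIMATION) * 32767) / PDM_DECIMATION);
    }
}
```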
So, the use case I mentioned is the home automation use case. Let's try to understand what problem we want to solve here. We have a person indoors speaking to an automatic system, an electronic system with microphones and with an STM32, an MCU. So, what's the deal? We just use one microphone, connect it to the STM32, record, and the problem is solved. But this may not be the optimal solution, because if you are indoors, the voice of the speaker will reflect against the walls of the room; the bigger the room, the wider the reflections. These reflections of the voice will also reach the audio capture system, creating noise and disturbing artifacts. Then, most of these systems will also have loudspeakers. Maybe they are Bluetooth loudspeakers capable of rendering an MP3 file, maybe they are systems capable of supporting a dialogue with a person; in any case, they will have a loudspeaker producing audio, and this will cause an acoustic echo that goes directly into your microphones. The audio coming from the loudspeaker will also reflect against the walls of the room, and so it will intensify the diffusion effect and the noise surrounding the audio capture system. Finally, there will be some background noise, which could come from an open window, from traffic outside, from a washing machine that is running, and from a number of other factors. So, this is our problem. That is why we need to figure out how to improve the acoustic performance of the audio capture system. How do we do that? We do it by implementing some signal processing. Now, signal processing, as was brilliantly highlighted by the previous speaker, can be done in a number of different ways, and it will always differ depending on the specific use case you are studying: you will need to apply different acoustic and signal processing optimizations to your system. For the purpose of this talk, I wrote down a potential scheme that may be adopted in a home automation situation, where we have a number of basic functions, for example beamforming and source localization. Beamforming means listening in a specific direction: we want to listen to the speaker, if we know where he is. Source localization means answering the question: where is the speaker? Acoustic echo cancellation, which I will explain later, basically means that if our device is reproducing audio, the device knows what audio it is playing back, so it can do some signal processing to have that audio removed from the captured voice. Then we have some generic audio analytics functionality, and there is a very long list here. These help with fine tuning, adapting the algorithms to the different operating conditions, so we can adaptively improve noise reduction, dereverberation, and a number of other functions not explicitly shown here. Finally, the voice captured and enhanced by all this processing is sent to a speech recognition system, where typically there will be a trigger ASR, which means an activation keyword, like "OK Google", like "Hello Blue Genie" as shown in the previous talk; "Alexa" is a trigger word. When the system detects a trigger word, it opens up a channel for further speech recognition algorithms, which can be embedded within an STM32 or operated on some remote server as a cloud service.
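To make the order of these stages concrete, here is a minimal frame-based sketch in C. The stage bodies are deliberately trivial placeholders (a naive two-microphone average, a no-op echo canceller, a dummy trigger detector); they only show where the real routines, whether from Open.Audio, a partner ASR library, or your own code, would plug in. The frame length and the 16 kHz assumption are illustrative.

```c
#include <stdint.h>
#include <stdbool.h>
#include <stddef.h>

#define FRAME_LEN 256  /* samples per processing frame, e.g. 16 ms at 16 kHz */

/* Placeholder stages: each stands in for a real routine. */
static void beamform(const int16_t *m0, const int16_t *m1, int16_t *out)
{
    for (size_t i = 0; i < FRAME_LEN; i++)
        out[i] = (int16_t)(((int32_t)m0[i] + m1[i]) / 2);  /* naive sum beam */
}

static void echo_cancel(int16_t *buf, const int16_t *playback)
{
    (void)buf;
    (void)playback;   /* a real AEC adapts a filter on the playback reference */
}

static bool trigger_detected(const int16_t *buf)
{
    (void)buf;        /* a real trigger ASR looks for the activation keyword  */
    return false;
}

/* One capture frame through the chain described above. */
void process_frame(const int16_t m0[FRAME_LEN], const int16_t m1[FRAME_LEN],
                   const int16_t playback[FRAME_LEN], int16_t out[FRAME_LEN])
{
    beamform(m0, m1, out);       /* spatial processing: listen toward the talker */
    echo_cancel(out, playback);  /* remove audio the device itself is playing    */
    if (trigger_detected(out)) {
        /* open the channel: stream enhanced audio to the embedded or cloud ASR */
    }
}
```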
So this is the complexity we have to deal with. How do we face that complexity in our design? How do we simplify it? First of all, let's look at it from a high level. We have certain audio processing that is spatial and requires multiple microphones in order to work. Some other audio processing routines are generic audio: they work on a single channel. And then we have dedicated speech processing functionalities that belong to a very specific, well-known, and different category. All these kinds of processing interact with each other in order to solve the problem of optimal voice capture and recording. So what do we do at ST? We have developed some optimized signal processing routines that can be used in that signal processing chain, or in your own preferred signal processing chain. They belong to the so-called Open.Audio set of libraries and can be downloaded for free from the ST website, with documentation and user manuals. The important concept is that we developed those libraries to make them available to our customers for free: this software is free for evaluation and free for use in your products. The licensing terms only ask you to please use our hardware; otherwise I have no way to pay for my salary, because I write software and you get it for free. So we provide libraries under the Open.Audio page, and we also provide example projects such as Smart Acoustics, BlueVoiceLink, which deals with voice over Bluetooth Low Energy, and the BlueMicrosystem project, which deals with MEMS microphone processing and streaming over BLE. All of this is available as a starting point for your project, so you don't start from scratch. So what are you going to find within the Open.Audio set of libraries? You are going to find examples of beamforming, acoustic source localization, and acoustic echo cancellation routines, which you may use in your design and in your product, or, if you are an audio expert, you may just use them as a benchmark, as a proof of concept of what can be done with these architectures based on an STM32 and digital MEMS microphones. Each one of these functions may well be replaced by your own processing functionalities, if you are willing to develop them or already have them developed, or you can use ours if you are happy with their performance in your product, without wasting time on further development. It is up to you to test them; we have published all the technical information, and I am here to guide you and help you understand a little better what we did. These routines will also interact with routines coming from third parties. For example, we at ST do not have any speech recognition developed on our own, so we rely on external partners; you have probably seen some of our partners in the big room showing their demos. These partners typically have libraries optimized for STM32. We also demonstrate voice recognition performed by cloud processing. This can be shown with the reference designs we have prepared, for example as part of our BlueVoice project, or another one that is part of our SensorTile project. With these reference designs you will see all the software needed within an Android or iOS app in order to have your speech recognition done by some cloud server. Then, other functionalities are still R&D for us: I am working in my lab in Italy on noise reduction and statistical dereverberation, and we do some audio analytics.
I'm sure audio companies are doing their own experiments, their own routines; we can discuss in a partnership how best to implement this kind of functionality. We will provide these other functionalities as part of our roadmap during the next year. So, let's start looking a little bit at these functionalities. What do they mean? What are they? How did we implement them? Beamforming and source localization. Let's start from beamforming. Beamforming means using a bunch of microphones, typically two, in general more than one, and some signal processing, in order to make sure that the system can listen selectively in one direction and not in another. For example, imagine we are at a concert. We want to record the sound source on stage, and some cell phone starts ringing. We don't want to record the cell phone, which is not on the main stage, and we know it is not going to be on the main stage, so we want to focus the listening capability on the main stage. We do that via beamforming, for example. Let me show a simple example, because we don't really need to push for very high-end things. A microphone array can be built with two microphones, and if they are MEMS microphones, they can be placed very close to each other, so with MEMS microphones we can implement what is called a differential microphone array, where the distance between the sound inlets can be as low as three or four millimeters. This has a number of theoretical advantages, which I am not covering here, but basically it allows us to make a system whose performance is independent of the location of the sound source. Let me open a brief technical parenthesis here. Basically, if you take the outputs of two microphones and subtract them, you get what is called a figure-of-eight polar pattern. If a speaker is in front of the two microphones, at the same distance from each microphone, and you subtract, of course you get zero output. But if the speaker is closer to one of the microphones, then the output is not zero. So this is basically part of a signal processing routine that implements this polar pattern, where all the sound coming along one axis of the two microphones is eliminated and all the sound coming from the other axis is captured. Now, using an additional delay, we can implement what is called a cardioid polar pattern, which is very interesting because it means these two microphones are capable of listening in a single direction, eliminating unwanted noise from other directions. For example, we have a library that is scalable, where the simplest implementation of beamforming is really simple: it is what you find on page one of a beamforming book, a cardioid implementation by the book. Two microphones, one digital delay, and an equalization stage. It is very low on MIPS: 11 MIPS, including the PDM-to-PCM conversion, because in this case, with microphones that are very close to each other, we can exploit the very high temporal resolution of the PDM data format within our signal processing. So, our library will perform beamforming and PDM-to-PCM conversion, giving you as output one nice PCM-format audio stream corresponding to the beamforming output.
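As a rough illustration of the delay-and-subtract idea, here is a minimal cardioid beamformer sketch working on float PCM samples. The real library operates in the PDM domain, where the tiny inter-microphone delay maps onto the much finer PDM time grid; here the fractional delay is approximated with linear interpolation and the equalization stage is omitted, so treat it as a teaching sketch, not the library implementation.

```c
#include <stddef.h>

#define SOUND_SPEED_M_S 343.0f

/* Minimal first-order differential (cardioid) beamformer sketch.
 * out[n] = front[n] - back[n - tau], where tau = mic_distance / c expressed
 * in samples. At 16 kHz and a 4 mm spacing, tau is a fraction of a sample,
 * so we approximate it with linear interpolation. */
void cardioid_beamform(const float *front, const float *back, float *out,
                       size_t n, float mic_distance_m, float sample_rate_hz)
{
    float tau = mic_distance_m / SOUND_SPEED_M_S * sample_rate_hz; /* in samples */
    size_t d  = (size_t)tau;          /* integer part of the delay   */
    float  f  = tau - (float)d;       /* fractional part of the delay */

    for (size_t i = 0; i < n; i++) {
        /* Linearly interpolated, delayed sample of the rear microphone. */
        float b0 = (i >= d)     ? back[i - d]     : 0.0f;
        float b1 = (i >= d + 1) ? back[i - d - 1] : 0.0f;
        float delayed_back = (1.0f - f) * b0 + f * b1;

        /* Sound arriving from the rear axis cancels; sound from the front
         * does not, which yields the heart-shaped (cardioid) pattern. */
        out[i] = front[i] - delayed_back;
    }
}
```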
In our library we can also implement an end-fire configuration with what we call strong beamforming. End-fire means that we are listening in the direction identified by the two microphones, along the axis of the microphones. With strong beamforming we have an aperture of roughly 60 degrees. So, we have four different implementations of beamforming: there is the basic cardioid I was mentioning, and the strong one I was mentioning on the previous slide. The others are well documented in the technical documentation; I don't want to bother you with all the technical details here, but if you have questions, feel free to ask or to contact me after this talk. I draw your attention to one specific kind of beamforming in this library, which is called ASR-ready. In this case we optimized the performance for speech recognition applications. One important consideration regards microphone sensitivity matching. You remember the example of two microphones where I was subtracting the outputs and getting zero: this is true only if the microphones are identical. Every microphone has its own gain, which is described by a parameter called sensitivity. We at ST make digital MEMS microphones that are matched in sensitivity to within plus or minus one dB, so they can be very well exploited for beamforming. But, being an R&D guy, I am never satisfied with taking what comes as it is, so in our software library we do some additional fine tuning of the gain, some additional sensitivity matching. Why do we do that? For one reason mostly: it is very cheap to do in software; it costs essentially zero MIPS. And consider this: you are indoors, it is winter, outside there is snow, you are warm and close to a fire, and your system works at this temperature. Then you go outside, and all of a sudden your system is facing a steep temperature gradient. Your microphone sensitivity will change, by the laws of physics; this is not about ST or our competitors, but in front of a very steep temperature gradient, sensitivity will change. This can be adjusted in software by the routine that is part of the Open.Audio library. Now, let's talk about quality and how we tested our routines. First of all, we built an environment to draw the polar patterns, and here you see the polar patterns for three different use cases. The single microphone is omnidirectional: as you can see, there is a circular polar pattern, meaning the gain is the same from every direction. Then there is the basic cardioid; the name cardioid comes from the ancient Greek for heart-shaped. It is this red one. You see this little bump here? This bump would not exist in the ideal case, where the digital delay exactly matches the acoustic delay between the microphones. We could go into the technical details, but in a real case there will always be some little bump here, due to that factor. The green one is what we call strong beamforming: as you can see, in this case, between plus and minus 30 degrees we listen to the audio with no degradation of the volume, while for all the other directions of arrival we attenuate the level of the audio very strongly.
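As an aside, here is one way the software sensitivity matching mentioned a moment ago could look. This is a minimal sketch, not the Open.Audio routine: estimate the long-term level of each channel and apply a compensating gain to one of them, so that the subtraction used by the beamformer nulls properly again; a real implementation would adapt slowly and only on suitable signals.

```c
#include <math.h>
#include <stddef.h>

/* Illustrative software sensitivity matching between two microphone
 * channels: estimate the RMS of each channel and derive a gain that is
 * applied to channel 1 so its level matches channel 0. */
static float channel_rms(const float *x, size_t n)
{
    double acc = 0.0;
    for (size_t i = 0; i < n; i++)
        acc += (double)x[i] * x[i];
    return (float)sqrt(acc / (double)n);
}

void match_sensitivity(const float *mic0, float *mic1, size_t n)
{
    float rms0 = channel_rms(mic0, n);
    float rms1 = channel_rms(mic1, n);

    if (rms1 > 1e-9f) {
        float gain = rms0 / rms1;   /* correction for mic1's sensitivity drift */
        for (size_t i = 0; i < n; i++)
            mic1[i] *= gain;        /* after this, mic0 - mic1 nulls properly  */
    }
}
```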
So, in terms of application, what does this mean? We tested this system using a speech recognition environment, since speech recognition is the target application, and in this way we can evaluate the quality and the confidence of speech recognition in a controlled environment. Basically, we put our microphone array in an echoic room. In front of it, we place a loudspeaker playing words spoken from a database, and at 90 degrees we place a noise generator. We record the four different outputs of our library, the omnidirectional output and the other three, and then we feed the speech recognition engine with these outputs. Why are we doing the test this way? For one simple reason: because it is simple, and because you can repeat it. We chose the Google speech recognition because it is free, it is accessible, and it is part of our reference design, so I invite all of you who want to work on audio to repeat this test. These are the results. On the y-axis we have the ASR confidence, which is a number provided by the Google APIs telling you how good the speech recognition was, how easy it was for the algorithm to detect the words; on the x-axis we have the signal-to-noise ratio. We tested the system at different levels, starting from 15 dB, which means the voice was 15 dB louder than the noise, and then increasing the noise in 5 dB steps down to minus 5 dB. This is the performance of an omnidirectional microphone, the one in red: basically, as the noise increases, the speech recognition confidence falls dramatically, and at 5 dB the single omnidirectional microphone is useless, speech recognition fails. And these are the various other algorithms. You may appreciate the fact that the ASR-ready beamforming can still give useful results even at minus 5 dB of signal-to-noise ratio, with noise coming from 90 degrees. Please note that this is not the most favorable setup for us: if we had placed the noise at 180 degrees, we would have achieved even better results. That is why I invite you to repeat the test based on your specific use case. There are a number of systems you can use to test this beamforming algorithm, and also the other routines that are part of Open.Audio. You can use the Open Development Environment hardware: a Nucleo board plus the microphone expansion board. With the two microphones provided, you can immediately test two-microphone beamforming; you can also mount additional microphone coupons on the microphone expansion board and test four-microphone beamforming. Or you can build your own board, similar to our BlueCoin that I mentioned earlier. I want to move quickly over the next two algorithms, so I will be very fast. Basically, sound source localization is a routine that tells you the angle, the direction of arrival, of a sound source. We have implemented it with two different algorithms, which offer different trade-offs between MIPS and performance; you will find the details in the technical information and the user manual of this routine. But I draw your attention to one important fact: if you want to use these routines with two microphones, you can detect a range of 180 degrees, not more than that, and this is by geometry; it is not the algorithm, it is the symmetry of the system. If you want to cover a range of 360 degrees, you may do so with four microphones. So I invite you to use four microphones if your system is similar, for example, to the Amazon Echo, which sits in the middle of a room and should detect sound from anywhere around it.
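To see why two microphones only give a 180-degree range, here is a deliberately simple direction-of-arrival sketch (not one of the two Open.Audio algorithms): it finds the inter-microphone delay by brute-force cross-correlation and converts it to an angle with the far-field relation delay = d·sin(theta)/c. Since sin(theta) cannot distinguish front from back, the estimate is confined to a half-plane, exactly the geometric limit mentioned above. With closely spaced MEMS microphones a real implementation would work at a higher rate or interpolate the correlation peak for finer resolution.

```c
#include <math.h>
#include <stddef.h>

#define SOUND_SPEED_M_S 343.0f

/* Illustrative two-microphone direction-of-arrival estimate: brute-force
 * cross-correlation over the physically possible lags, then conversion of
 * the best lag into an angle. Output spans only +/-90 degrees around
 * broadside because of the front/back symmetry. */
float estimate_doa_deg(const float *mic0, const float *mic1, size_t n,
                       float mic_distance_m, float sample_rate_hz)
{
    int max_lag  = (int)ceilf(mic_distance_m / SOUND_SPEED_M_S * sample_rate_hz);
    int best_lag = 0;
    float best_corr = 0.0f;
    int first = 1;

    for (int lag = -max_lag; lag <= max_lag; lag++) {
        float corr = 0.0f;
        for (size_t i = 0; i < n; i++) {
            long j = (long)i + lag;
            if (j >= 0 && (size_t)j < n)
                corr += mic0[i] * mic1[j];
        }
        if (first || corr > best_corr) { best_corr = corr; best_lag = lag; first = 0; }
    }

    /* Far-field model: delay = d * sin(theta) / c  =>  theta = asin(...). */
    float s = (float)best_lag * SOUND_SPEED_M_S / (sample_rate_hz * mic_distance_m);
    if (s >  1.0f) s =  1.0f;
    if (s < -1.0f) s = -1.0f;
    return asinf(s) * (180.0f / 3.14159265f);
}
```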
Regarding MIPS performance: if you need to detect the position of a human speaking to a machine, the human is not changing position every 20 milliseconds. So you will find the MIPS figures in the documentation, but they are not very important for source localization, because you can run it as a low-priority task, maybe every second or every two seconds, whenever the system has time to do so. Let's move to the other routine in the library, which is acoustic echo cancellation. Basically, your system is capable of playing back music: it has microphones, but there is also a loudspeaker, like, for example, the Amazon Echo, and like a number of other systems on the market. So the system plays back an MP3 file through an STM32, and the STM32 knows exactly what audio is being sent to the output loudspeaker. Through some processing called acoustic echo cancellation, it can remove that audio from the captured voice track. This is acoustic echo cancellation. We posted on the ST website last week, on Friday, a new sample project called Smart Acoustics One, which I suggest you look at as a starting point for your design, because it includes the three routines I mentioned: beamforming, localization, and acoustic echo cancellation. Again, you will find all the technical information on the website. What matters here is that you have, at once, all the audio processing routines, and you can test them and make your own experiments with them. You can do experiments using the Open Development Environment hardware, with your own development boards, or with a board like the BlueCoin, which I mentioned earlier. Now, in the final part of this talk, I want to draw your attention to a system-level issue: how do we use all of this, and why would we use these routines? Let's consider a generic system that is an audio interface for a smart home. It has microphones, it has a loudspeaker, and, if it is supposed to launch some cloud-based processing or cloud service, it is also connected to the internet. Let's keep this system as an example and see what these audio processing routines can do to enhance the quality of the audio capture. You have seen this before; let's see how the problem can be stated in this specific case. We have voice, we have reflection and diffusion because we are indoors, and we have the direct echo coming from the loudspeaker. You see that in this case it is much worse, because loudspeaker and microphones are very close. For this reason, I invite you to choose from the ST portfolio those microphones that have a very good acoustic overload point, because this will keep the behavior of the system linear even when the audio is pushing strongly against the microphones. In this system we are still going to have the reflections and diffusion caused by the loudspeaker, and also the background noise. So, what can we do? We can implement beamforming and we can implement acoustic echo cancellation. Let's see what the effect of these routines is. Beamforming means we are listening in this cone here; within this cone, we capture the voice of the speaker and some of the noise, because of all these reflections. So, within the cone of beamforming, we capture a voice that is enhanced, because we have gotten rid of all the other disturbing factors, but it is not perfect.
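Since acoustic echo cancellation keeps coming up, here is a minimal sketch of the core idea described above: the STM32 knows the samples it is sending to the loudspeaker, an adaptive FIR filter (here updated with a plain NLMS rule) learns the echo path, and the filter output is subtracted from the microphone signal. The tap count and step size are arbitrary illustrative values, and a real canceller adds double-talk detection, delay alignment, and residual echo suppression on top of this loop.

```c
#include <stddef.h>

#define AEC_TAPS 256   /* length of the adaptive echo-path model */

/* Minimal NLMS acoustic echo canceller sketch.
 * Zero-initialise the state before first use, e.g. aec_state_t s = {0}; */
typedef struct {
    float w[AEC_TAPS];   /* adaptive filter coefficients (echo path estimate) */
    float x[AEC_TAPS];   /* most recent loudspeaker reference samples          */
} aec_state_t;

float aec_process_sample(aec_state_t *s, float mic, float loudspeaker)
{
    /* Shift the reference delay line and insert the newest playback sample. */
    for (int i = AEC_TAPS - 1; i > 0; i--)
        s->x[i] = s->x[i - 1];
    s->x[0] = loudspeaker;

    /* Estimate the echo as the filter output and subtract it. */
    float echo_est = 0.0f, energy = 1e-6f;
    for (int i = 0; i < AEC_TAPS; i++) {
        echo_est += s->w[i] * s->x[i];
        energy   += s->x[i] * s->x[i];
    }
    float err = mic - echo_est;          /* echo-cancelled output sample */

    /* NLMS update: step size normalized by the reference energy. */
    float mu = 0.1f;
    for (int i = 0; i < AEC_TAPS; i++)
        s->w[i] += mu * err * s->x[i] / energy;

    return err;
}
```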
Coming back to the beamforming cone: we still have disturbances, because they will be included within the cone. If we do acoustic echo cancellation instead, what are we going to solve? We are going to completely eliminate the direct acoustic echo and its reflections. We still have some other factors, like background noise and reflections of the voice, but we get rid of the direct acoustic echo signal very well. So, are we solving the problem completely? No, in either case. So, we need to think of a different, alternative implementation. Look, for example, at this alternative solution: we use beamforming and acoustic echo cancellation at the same time, trying to get the best out of each routine, and we teach the system how to select automatically between one and the other. How do we do the selection? We use an embedded speech recognition library, which provides us with information on the speech recognition quality. So we perform one beamforming with speech recognition after the beamforming, and one acoustic echo cancellation with speech recognition after the echo cancellation, and then we select the one with the best ASR score. In this way we can switch between beamforming and echo cancellation automatically, in a simple way, and you already have everything you need from Open.Audio plus our partner libraries, so you can implement this today. Let's look at another use case, a bit more difficult. We are in a very noisy environment. In this environment, my source localization is not going to work, I tell you, because in a discotheque, at a rock concert, or in a meeting room with people discussing, it is too noisy. The source localization works every 20 milliseconds on a specific sound source; if it is too noisy, it will not work. We can still do sound source localization, but we do it in a different way. If our STM32 is powerful enough, we implement a number of concurrent beamforming routines, for example four in this example. After each beamforming we perform ASR, so on a single MCU we are going to run four microphone connections, four beamformers, and four speech recognition instances, and then we select the single beamforming channel that, in the end, gives us the best speech recognition score. In this way, even in a noisy environment, we can identify and use the right beamforming channel. Of course, in this case we are going to use one of the high-performance STM32s, an STM32F7, or even the upcoming STM32H7, which is even more powerful. Finally, let's deal with the most complex case, where the system works in a noisy environment, must identify the speaker and beamform towards the speaker, but must also get rid of the audio that the system itself is reproducing through the loudspeakers. How do we do that? We follow the same approach: we run four beamformers and acoustic echo cancellation at the same time on an STM32F7, which is capable of doing that. Of course, we need very optimized routines, so that they fit; if you use your own routines, you need to be very careful in optimizing their MIPS performance. Then we perform five ASR instances, and again we select based on the ASR score. In this case, depending on the use case, we may also choose not to use the acoustic echo cancellation when the system is not playing back music.
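Here is a minimal sketch of that selection logic, with the ASR scoring left as a callback, since the confidence value would come from whichever embedded or cloud speech recognition engine you use. The asr_confidence callback name and signature are illustrative only, not a real API: run one recognizer per enhanced channel and keep the channel with the best score.

```c
#include <stddef.h>

/* Callback returning a recognition confidence for one enhanced audio channel.
 * It stands in for whatever score your embedded or cloud ASR engine returns. */
typedef float (*asr_confidence_fn)(const float *audio, size_t n_samples);

/* Run the recognizer on every beamformed (or echo-cancelled) channel and
 * return the index of the channel with the highest confidence. */
int select_best_channel(const float *channels[], int n_channels,
                        size_t n_samples, asr_confidence_fn asr_confidence)
{
    int   best_idx   = 0;
    float best_score = -1.0f;

    for (int c = 0; c < n_channels; c++) {
        float score = asr_confidence(channels[c], n_samples); /* one ASR per channel */
        if (score > best_score) {
            best_score = score;
            best_idx   = c;
        }
    }
    return best_idx;   /* e.g. the beam pointing towards the active talker */
}
```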
So, we can adapt the details of this design, but I am passing on to you the basic concepts that can be adopted for it. Finally, then: I have described to you these voice-based applications that can be implemented in a voice terminal. There is software available that is modular, built on the STM32Cube programming environment, and it allows you to develop an integrated terminal through which you can stream your audio into some gateway; from the gateway you will reach cloud-based services, so that your system will be able to launch internet services and cloud-based systems using voice and speech recognition. So, with this general architecture, I end my presentation.