Okay, I think we are ready to start again. In the second part of my presentation I will talk about signal enhancement algorithms: single-channel noise reduction, but also multi-channel and specifically a dual-channel approach that will be important for the hands-on session later in the afternoon. We went through some introductory slides in the morning and I pointed out that performance in adverse acoustic conditions is very important; users spend a lot of money on these devices and they expect more or less effortless communication, also in complex acoustic environments. That means many spatially distributed, possibly moving sources. The signals we have to deal with are all non-stationary; they are also non-Gaussian; they are composed of noise and reverberation. In the acoustic space you have very long and time-varying impulse responses. So in short, it is really an estimation-theoretic nightmare, or at least it is not what you find in the books. In the books everything is Gaussian, stationary and time-invariant; in acoustics it is the opposite in all these respects. Effortless communication means that you need both high intelligibility and high signal quality. Imagine that users are listening to your processed signals for 18 hours a day: if there are artifacts in the signal, it becomes annoying very soon. We also talked about restrictions, especially in terms of hardware: a very small device size, very low power consumption, typically in the order of one milliwatt, and a very low processing delay, typically below 10 milliseconds. So whenever you process your signal you read in, say, 10 milliseconds of it and you have to make all the decisions within that time — whether it is speech or noise, whether to amplify or not. You cannot just read in one minute of the conversation, process it and stream it out again; that would not work in a conversation context. So the objectives for speech enhancement are manifold. We of course want to get rid of the noise and have a low output noise level, but at the same time we want a high-quality, natural-sounding target signal and high intelligibility. We also want low background noise distortion: for most of the time people are listening to the background and not to the target, because they are not always in a conversation, so for extended periods of the daily use time they are just listening to background. The background noise therefore also needs to have a very natural quality and make sense to the users. And if you think of binaural systems, you also want high consistency of the spatial image, of localization capabilities and so on. So clearly that is a difficult optimization problem, and of course the hearing abilities of the listeners must be taken into account. I will start with single-channel noise reduction, and you may ask why single channel — nowadays we have microphone arrays, which are a lot fancier and more powerful. Well, if you look at the size of the devices again, the devices that users really like to wear are the very small ones, the almost invisible ones in the ear canal, and there is room for only one microphone. You might think the situation is a bit more relaxed when you have the binaural link, because then you can combine the microphone signals from the left and the right side.
But if you just look at one device, for the very small ones you have exactly one microphone and that is it. So here we look at a typical block diagram of a single-channel noise reduction algorithm. We have the incoming signal y(k) in the time domain; then we do a segmentation with some sort of window — we have looked at this already — and a DFT, which gives us the complex Fourier coefficients. Then we do some noise reduction processing on these complex coefficients and obtain the estimated clean speech coefficients; mu is the frequency bin and l is the frame or time index. Then we do an inverse DFT and use the overlap-add procedure to generate a continuous output signal. So in the DFT domain we have complex coefficients, and if the noise is additive in the time domain it is also additive in the complex DFT domain: the target signal S and the noise N add up. The task is: given these noisy DFT coefficients, find some function that estimates the clean speech coefficients. Now let's zoom into this yellow box, the noise reduction box. Typically you find three components in it. Here again is the DFT, and we now look at just one DFT bin, so here are the noisy DFT coefficients. This block is the estimator that takes the noisy DFT coefficients and computes clean speech coefficients. In order to do so, most of these methods require some knowledge about the local SNR in each time-frequency point, so this is an SNR estimator that works in each frequency bin and adapts over time. And for the SNR you need a noise power estimator that tells you the noise level in each frequency bin, again tracking time-varying noise levels. So we already have some important quantities that you find mentioned very often in papers about speech enhancement: xi(mu, l) is called the a priori SNR, the SNR that is more or less given or assumed — I will have another slide on this — and gamma(mu, l) is the a posteriori SNR; I have a slide to explain this as well. First, though, an illustration of how more or less all of these algorithms work. Here is the short-time spectrum of a noisy speech sound — a vowel, you can clearly see the harmonics — over the frequency range from 0 to 4 kHz, and we see the short-time power of this speech sound. In red the estimated noise power is shown. What all these algorithms do, in one way or another, is evaluate the signal-to-noise ratio locally in each frequency bin. For example, at the first harmonic you have a fairly high SNR, because the noise level is here and the signal level is up here, so the algorithm should essentially not modify these frequency bins. In between the harmonics the SNR is relatively low, around 0 dB or below; these are clearly bins where not a lot of speech information can be recovered, or there is even no speech information, and these bins should be attenuated. That is shown in the next step, the enhanced spectrum: by computing these estimators we indeed get a reproduction of the harmonics of the signal — they are not changed, or only changed a little bit — while in between the harmonics we attenuate the signal.
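To make this block diagram concrete, here is a minimal Python sketch of the analysis-gain-synthesis chain just described (window, DFT, per-bin gain, inverse DFT, overlap-add). The window length, hop size and the placeholder gain rule are illustrative assumptions, not the exact values used in the lecture.

```python
import numpy as np

def enhance(y, frame_len=256, hop=128, gain_fn=None):
    """Single-channel enhancement skeleton: STFT -> per-bin gain -> overlap-add.

    gain_fn(Y_frame) must return a real-valued gain per frequency bin;
    a trivial pass-through is used here as a placeholder.
    """
    if gain_fn is None:
        gain_fn = lambda Y: np.ones_like(Y, dtype=float)   # no-op placeholder

    win = np.sqrt(np.hanning(frame_len))      # sqrt-Hann for analysis and synthesis
    out = np.zeros(len(y) + frame_len)

    for start in range(0, len(y) - frame_len, hop):
        frame = y[start:start + frame_len] * win
        Y = np.fft.rfft(frame)                 # complex DFT coefficients Y(mu, l)
        S_hat = gain_fn(Y) * Y                 # apply spectral gain per bin
        s_frame = np.fft.irfft(S_hat) * win    # inverse DFT, synthesis window
        out[start:start + frame_len] += s_frame    # overlap-add
    return out[:len(y)]
```

With a 50% hop and square-root Hann windows, the analysis/synthesis pair reconstructs the input (approximately) perfectly when the gain is one; the interesting part is, of course, the gain rule discussed next.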
And now, if you re-synthesize this frame or segment of speech into the time domain, you get the impression that there is less noise in the signal than before. Note also that by applying just a gain, as we do here, you cannot improve the local SNR within a frequency bin — that is not possible, because you just apply a weight and that weight scales the target and the noise alike. The noise reduction effect comes into play because different bins behave differently: one carries a lot of speech, another a lot of noise, and you can attenuate those with the noise and leave those with the speech unmodified. Okay, so there are a lot of different methods and rules for computing this weighting. The most well-known and prominent one is the Wiener filter, which is the linear estimator, but there are also many popular non-linear estimators: estimators that focus on the spectral amplitude instead of on the complex Fourier coefficients, log-spectral amplitude estimators, and some psychoacoustically motivated models that take spectral or simultaneous masking in the ear into account when computing the weighting. There are all kinds of variants of these methods, and I will introduce a few of them. First, the Wiener filter, which is a linear filter: the noisy input signal y(k), composed of clean speech and noise, is fed through the Wiener filter with impulse response h(k), and the output is the estimated speech signal. The Wiener filter minimizes the mean square error between a desired signal and the output signal, and the desired signal is in most cases the clean signal. So, in a way, for computing the filter we need to assume that we have the clean signal — but only for the derivation. This mean square error is minimized, typically in the frequency domain, and that leads to the following gain function: the frequency response of the Wiener filter has the power spectral density of the clean speech in the numerator and the sum of the power spectral densities of clean speech and noise in the denominator. If you divide numerator and denominator by the noise power spectral density phi_nn, you get signal-to-noise ratios, and you can rewrite the gain as xi over 1 plus xi, where xi is the signal-to-noise ratio. Now you can clearly see how this works: whenever the SNR is high you can neglect the 1 and the gain converges towards one, and whenever the SNR is low the numerator goes to zero and the gain goes to zero as well. That is shown in this plot: here is the a priori SNR, which is the xi from this equation, and the gain is the h here; for high SNRs we do not attenuate a spectral component and for low SNRs we do. So that is, I guess, quite clear. The a priori and a posteriori SNR play a central role in computing all these different gain functions. The a priori SNR is defined as the ratio of the power spectral density of the clean signal over the power spectral density of the noise, so it is a signal-to-noise ratio in each frequency bin or at each frequency.
This is a continuous-frequency definition; if you compute it based on the DFT and the periodogram, for example, then you would simply define it via the statistical expectation of the magnitude-squared Fourier coefficients — here of the clean signal and here of the noise signal. That is the a priori SNR, and obviously we do not know phi_ss; if we knew phi_ss our job would already be done. So the a priori SNR is a quantity that is assumed to be given, or that you have to estimate, and that is the next step. In order to estimate it we introduce another quantity, the a posteriori SNR, gamma(mu, l). Here we plug in the magnitude-squared Fourier coefficient of the noisy signal, which is clearly available because it is the microphone signal and therefore measurable, and we relate it to the power spectral density of the noise. This quantity is a bit easier to estimate than the a priori SNR, because the numerator is directly measurable and only the denominator must be estimated. And there is a nice compact relation between the two: if you subtract one from the a posteriori SNR and take the statistical expectation, you get the a priori SNR. So if we have the a posteriori SNR and do some statistical averaging, we can estimate the other SNR as well. The a posteriori SNR also has some nice statistical properties: like a periodogram, it is exponentially distributed. The exact distribution is maybe not that important to go into here, but it becomes important once you want to set up, for example, a voice activity detector based on this quantity. The exponential distribution can be written with the variances of the speech and noise signals, or with the SNR xi as a parameter. We also note the maximum likelihood estimator of xi, the a priori SNR: in the distribution of gamma, xi appears as a parameter, and if you maximize the likelihood — which simply means maximizing this density with respect to xi — you find that the maximum likelihood estimator of xi is simply gamma minus one, this term here, except that you do not compute an expectation. So gamma minus one is the maximum likelihood estimate under the Gaussian assumption — I should always add that — that is, assuming your signal Fourier coefficients are complex Gaussian distributed. That would give us a first estimator that we could plug into our Wiener filter gain function, and we would be done. It turns out, however, that this estimate is very noisy: if you plug it in as it is, your gain function will fluctuate like crazy and will give you a lot of artifacts in the processed signal. A better way is what is called the decision-directed a priori SNR estimator, which takes this maximum likelihood estimator, weights it by one minus alpha, limits it, and combines it with the best signal estimate from the previous frame. So for frame l you take the new information, the new maximum likelihood estimate, and combine it with the output of your enhancement system from the previous frame l minus one: you take the magnitude squared of the estimate S-hat and relate it to your noise estimate of the previous frame.
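As a small illustration of these quantities, here is a sketch in Python of the a posteriori SNR, the maximum likelihood a priori SNR estimate xi_ML = gamma - 1, and the resulting Wiener gain xi/(1+xi); the noise PSD estimate is assumed to come from elsewhere (for example the minimum statistics tracker discussed later).

```python
import numpy as np

def snr_and_wiener_gain(Y, noise_psd, eps=1e-12):
    """Per-bin SNR quantities and Wiener gain for one frame of DFT coefficients Y.

    noise_psd : estimated noise power E{|N|^2} per bin (from a separate tracker).
    """
    gamma = (np.abs(Y) ** 2) / (noise_psd + eps)   # a posteriori SNR
    xi_ml = np.maximum(gamma - 1.0, 0.0)           # ML a priori SNR estimate (floored at 0)
    # used directly, this noisy estimate causes musical noise
    # (see the decision-directed smoothing sketched below)
    gain = xi_ml / (1.0 + xi_ml)                   # Wiener gain  H = xi / (1 + xi)
    return gamma, xi_ml, gain
```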
That ratio from the previous frame is also a signal-to-noise ratio; it is weighted by alpha, and you combine these two estimates — like a recursive estimate plus the new one — and this gives you the new, smoothed a priori signal-to-noise ratio estimate. This is the celebrated decision-directed a priori SNR estimator, also developed by Ephraim and Malah. Here is a demonstration of the differences: here is the spectrogram of a clean signal, here the noisy signal — the clean signal with some babble noise added — and here is what you get when you use the maximum likelihood estimator. What you find is a fairly good reproduction of the target signal but lots of white spots in the spectrogram, and each of these corresponds to a short-time artifact. If you imagine one isolated peak somewhere in the time-frequency plane and you re-synthesize your time-domain signal, the inverse DFT of a single excited frequency bin gives you a sinusoidal output, and that does not sound natural. By having these random excitations all over the place you get a mixture of sinusoidal artifacts, which is known as musical noise; I will have a demonstration later on. Now plug in the decision-directed estimator and watch the rightmost spectrogram: this picture looks quite different. The speech information is still mostly present, but all these little artifacts from the maximum likelihood estimator are gone — at least many of them — and the ones that are left are much less annoying. This musical noise problem was driving people crazy for many years, because all the algorithms were nice for publishing a paper but were not usable in practice; people would not like to listen to these artifacts all the time. Another algorithm I would like to mention is spectral subtraction, which dates back to Boll, 1979, one of the first attempts to produce an enhanced speech signal. The idea is simply that you take the power spectral density of the noisy signal and subtract the power spectral density of the noise, which should in theory give you the power spectral density of the clean signal. That can then be used to create a filter, as shown in the second line: you factor out the power spectral density of the noisy signal and end up with a term that acts as a filter gain, applied multiplicatively to the spectrum of the input signal. There are many variations of it; it is also related to the Wiener filter — it is simply the square root of the Wiener filter — and it has been generalized with different parameters to make it a bit more flexible and to adapt it to certain noise types. Okay, then there is a whole group of non-linear estimators which use the basic assumption that speech is at least short-time stationary; even though it is never really stationary, this assumption at least makes the problem manageable.
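A minimal sketch of the decision-directed smoothing and of a basic spectral-subtraction gain as just described; the use of alpha, the SNR floor and the previous frame's clean-speech estimate follow the lecture's description, while the specific parameter values are illustrative assumptions.

```python
import numpy as np

def decision_directed_xi(Y, S_hat_prev, noise_psd, alpha=0.98, xi_min=0.1, eps=1e-12):
    """Decision-directed a priori SNR estimate (Ephraim/Malah style) for one frame."""
    gamma = (np.abs(Y) ** 2) / (noise_psd + eps)               # a posteriori SNR
    xi_ml = np.maximum(gamma - 1.0, 0.0)                       # ML estimate from current frame
    xi_prev = (np.abs(S_hat_prev) ** 2) / (noise_psd + eps)    # SNR from previous clean estimate
    xi = alpha * xi_prev + (1.0 - alpha) * xi_ml               # recursive combination
    return np.maximum(xi, xi_min)                              # floor to limit outliers

def spectral_subtraction_gain(Y, noise_psd, floor=0.1, eps=1e-12):
    """Basic power spectral subtraction gain (square root of the Wiener-type ratio)."""
    noisy_psd = np.abs(Y) ** 2 + eps
    return np.sqrt(np.maximum(1.0 - noise_psd / noisy_psd, floor ** 2))
```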
Further assumptions are that the noise is additive and that speech and noise are uncorrelated. Under these assumptions you can develop all kinds of estimators: for the complex Fourier coefficients, where you estimate the real and imaginary part, or the magnitude and the phase, or you estimate only the amplitude. Amplitude-only estimation has been very common, because it is believed — I should say — that there is not so much information in the phase, although currently other people are working a lot on phase estimation, so things change from time to time. You can then use different optimization criteria, for example minimum mean square error estimation or maximum a posteriori estimation, and you end up with estimation rules which tell you how to compute, for example, the clean coefficients given the noisy coefficients. The most famous one is probably the short-time spectral amplitude estimator developed by Ephraim and Malah almost 30 years ago now, a minimum mean square error estimator for the amplitude: here are the clean speech spectral amplitudes, here the estimated amplitudes, and the error between them is minimized under a Gaussian model, which essentially means that you have to solve an integral like this. There is a way to write it in a friendlier form using Bayes' theorem at this point, and there is a closed-form solution, shown here, which uses some special functions but is computable in MATLAB; if you want to implement it in a device you would tabulate these functions and read the values from the tables. So this is a non-linear approach, and here is the gain function plotted as a function of the a priori SNR and the a posteriori SNR, because both play a role in this rule. Essentially you find that when the a priori SNR becomes lower, the gain is reduced, which you already saw for the Wiener filter. What is a bit counterintuitive is the behaviour with respect to the a posteriori SNR: if the a posteriori SNR goes up, which in fact signals that a target signal should be present, the gain also goes down quite a bit. That is counterintuitive, but it is the result of the minimum mean square estimation and it contributes quite a bit to the success of this estimator, because it suppresses fluctuations in the output: whenever there is a fluctuation, it is somewhat suppressed, and that smooths the signal and leads to less musical noise. Then there is the famous log-spectral amplitude estimator, where you use a logarithmic weighting on the amplitude, so as to mimic a function that is more closely related to our loudness perception, and then again do the minimum mean square estimation. There is also a closed-form solution that can be computed with MATLAB fairly easily, and it is very popular in certain applications. Okay, so here is an audio demonstration. I realize that the target signal is in German, so I apologize for that, but you can probably still hear most of the effects. Here is the noisy input signal. Now the output after spectral subtraction: the target signal is fairly clear, but in the background you hear the typical musical noise of speech enhancement algorithms, and it is very clear that nobody would buy this — that is not a signal you would like to listen to over a longer period.
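For reference, the closed-form gain of the MMSE short-time spectral amplitude estimator mentioned above can be sketched as follows, using SciPy's exponentially scaled Bessel functions to keep the computation numerically stable. Treat it as an illustration of the formula under the complex Gaussian model rather than a tuned implementation; the gain cap at one is a common practical choice, not part of the derivation.

```python
import numpy as np
from scipy.special import i0e, i1e   # exponentially scaled modified Bessel functions

def mmse_stsa_gain(xi, gamma, eps=1e-12):
    """MMSE short-time spectral amplitude gain (Ephraim/Malah rule).

    xi    : a priori SNR per bin (e.g. from the decision-directed estimator)
    gamma : a posteriori SNR per bin
    """
    v = xi / (1.0 + xi) * gamma + eps
    # G = sqrt(pi)/2 * sqrt(v)/gamma * exp(-v/2) * [(1+v) I0(v/2) + v I1(v/2)]
    g = (np.sqrt(np.pi) / 2.0) * (np.sqrt(v) / (gamma + eps)) * (
        (1.0 + v) * i0e(v / 2.0) + v * i1e(v / 2.0)
    )
    return np.minimum(g, 1.0)   # cap the gain at one (practical choice)
```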
Now, the minimum mean square short-time amplitude estimator of Ephraim and Malah, which clearly gives you a much better background or residual noise quality — more or less a white character, but still with some characteristics of the original noise. And the log-spectral estimator achieves a bit more noise suppression, but also introduces a bit more target signal distortion. Okay, and again the original. So that was a demonstration of musical noise and of what these established estimators can do. Some years ago — not so many years ago — we then generalized these methods and found a very general and flexible solution. It is again a minimum mean square error solution, where we minimize an error criterion, and we made it more flexible in two ways. First, we abandoned the Gaussian assumption, because in fact the Fourier coefficients of speech are not Gaussian, even though you can read that in a lot of books and papers. The argument that is usually cited is the central limit theorem: when you compute the DFT of a signal you essentially sum over a lot of signal samples, and whenever you sum up statistically independent samples the resulting probability density converges to a Gaussian. That is a frequently cited argument, but nobody cared to really measure a histogram or look at the true distribution. What we found is that it is indeed not Gaussian, simply because these signal samples are not uncorrelated or independent, and the Fourier transforms that we use in this kind of application are much shorter than the span of correlation in the signal. For example, a vowel can easily be 200 to 300 milliseconds in duration while your Fourier transform is maybe only 20 milliseconds, so you essentially see a highly correlated signal within the transform, and that means the distribution more or less follows the distribution of the time-domain signal — and it was already well known that speech in the time domain is not Gaussian but has a peaky, super-Gaussian distribution. So for the amplitudes we then use a parametric distribution, the chi distribution, for the speech spectral amplitudes, which has a design parameter with which you can control it a bit to be Gaussian, super-Gaussian or sub-Gaussian; this is called the shape parameter. As a second degree of freedom we have a compression exponent beta, a compression parameter applied in the exponent of the amplitudes. Again some illustration: the density can be modified from Gaussian to super-Gaussian or sub-Gaussian depending on this parameter delta. Here is the Rayleigh density in black, which is the Gaussian case, and then we can make it more peaky or less peaky and adapt it to the true distribution of speech; and we can introduce some compression. Especially for beta smaller than one this acts as a compression, like a square-root compression or a stronger compression; we could also design it as an expansion, but that is never used, so we really use it for compression.
And again there is a closed-form solution for all the integrals that you have to solve; you need special functions again, which can be seen in the paper, especially the confluent hypergeometric function, but then you have a very flexible gain rule that can be adapted to your problem. This gain rule in fact contains many of the older estimators as special cases: the short-time amplitude estimator of Ephraim and Malah is found here, for a Gaussian signal, so delta being one and beta being one, no compression, and the log-spectral amplitude estimator is also a limiting case. Essentially you can move around anywhere in this pink region, and we found that a good working point is this one here, where you have a square-root compression and a super-Gaussian assumption about the distribution of speech, so delta being 0.5. That has been shown to be a good operating point, giving a good trade-off between noise reduction, distortions and so on. Yes — listening tests and measures. In the end, you do not find a lot of objective measurements in my presentation; of course we do that, in terms of various measures, but I clearly want to point out that in the end listening is the one measure. There is no way around listening experiments and listening tests, and it is really important, whenever you process your results, to listen to them and to be very critical about the target quality, the background quality and possible artifacts. Okay. I pointed out earlier that one of the important ingredients of a noise reduction algorithm is noise power spectral density estimation, because once you have a noise power spectral density estimate you can estimate the signal-to-noise ratio in each time-frequency bin, and with that you can compute the Wiener filter or any of the other estimators and you have more or less solved the problem. There are many different methods around for noise power spectral density estimation, for example using a voice activity detector to detect periods of speech pause, collecting the noise statistics during that time and using them later on. There are soft-decision methods, where you do not make hard decisions on voice activity but use a probability measure. There is a method called tracking of spectral minima, which I will explain, and there are more recent methods which use minimum mean square error estimation for estimating the noise power. The assumptions for these methods are always that speech and noise are statistically independent and that speech is not always present; otherwise most of these approaches will go haywire somewhere. That noise is more stationary than speech is also one of the crucial assumptions, so the worst case for all these algorithms is an interfering signal that is simply a second speaker — that is always the worst case, because then it is very hard to tell apart which is the target and which is the interfering speech.
Just some ideas of how voice activity detectors work — I will not go into much detail, but I want to point out one aspect. Typically they use some features: sub-band power, some pitch detector, and for example the correlation of signal segments over time. They form a decision based on these features, most often also use a background noise estimate, and have a scheme called VAD hangover to make sure that at the end of words you do not clip the signal; you give it sufficient time before deciding on non-speech. Here is the performance of one of these voice activity detectors, in fact a standardized one from the mobile phone world, the GSM adaptive multi-rate codec. You see that at 30 dB it works perfectly — this here is just an initialization artifact, but this region is detected very nicely — and it has some hangover built in so as not to harm the trailing edges of speech sounds. At 0 dB, however, it decides on speech more or less all the time; it still works here quite well, but here, for example, it decides on speech. For this type of application that is a reasonable design, because this voice activity detector is used for what is called discontinuous transmission in mobile phones, where you do not want to transmit all the time, especially if nobody is talking — that is also a measure to improve the battery life of your phone. You only transmit when somebody is speaking, but you want to make sure that even under noisy conditions you do not lose speech, because it would be very annoying for the conversation if your mobile phone did not transmit speech. So this detector is tuned such that it would rather transmit, trying to avoid missed detections. This is another detector, which is more selective; it is based on the a posteriori SNR, and by comparing this SNR with a threshold you can build a very powerful and simple voice activity detector. It is a bit more selective, as you can see, but it might then also miss some of the soft speech sounds when they are drowned in heavy noise. So the message really is: whenever you use a voice activity detector, do not just take any one, because voice activity detectors need to be optimized for the application. This one could, for example, be used for noise estimation, because it also allows you to pick out some noise in between the words, but it could introduce artifacts in other applications. That is the important message: voice activity detection is always tuned for some application, and you have to be careful when you pick an algorithm. When it comes to noise estimation itself, you in fact do not need voice activity detection at all, because you do not want to detect speech or non-speech activity — what you want is a noise estimate. This is realized by the minimum statistics method, where you again assume that the powers, the variances, of speech and noise are additive; then you perform a recursive smoothing of the periodogram — we have seen this before — then you search for the minimum of the smoothed power within a finite window of length D, and then you assume that this minimum is representative of the true noise power.
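A minimal sketch of the minimum-statistics idea just described: recursive smoothing of the periodogram, a sliding-window minimum as the raw noise estimate, and a simple multiplicative bias compensation. The actual method derives the compensation factor from the statistics of the minimum, as discussed next; the constant used here is only a placeholder, and so are the smoothing constant and window length.

```python
import numpy as np

def minimum_statistics_noise(periodograms, alpha=0.85, win_len=96, bias_comp=1.5):
    """Very simplified minimum-statistics noise PSD tracker.

    periodograms : array of shape (frames, bins) with |Y(mu, l)|^2
    alpha        : recursive smoothing constant
    win_len      : search window D (in frames) for the minimum
    bias_comp    : placeholder bias compensation (the minimum underestimates the mean)
    """
    n_frames, _ = periodograms.shape
    noise_psd = np.zeros_like(periodograms)
    smoothed = np.zeros_like(periodograms)
    p = periodograms[0].copy()
    for l in range(n_frames):
        p = alpha * p + (1.0 - alpha) * periodograms[l]     # recursive smoothing per bin
        smoothed[l] = p
        start = max(0, l - win_len + 1)
        noise_psd[l] = bias_comp * smoothed[start:l + 1].min(axis=0)   # window minimum + bias
    return noise_psd
```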
We can see this in an animation. Along the horizontal axis is time, the frame index, for several noisy sentences in time-varying noise, shown at one particular frequency, bin 25, which is in the lower frequency range. The method first smooths this periodogram, so we get the blue curve, which fluctuates much less — it has a lower variance — and then we move along the lower edge of this blue curve and find the minimum, and the minimum is an initial estimate of the noise power. You can clearly see that something is still missing: the mean, for example during speech, is of course still a bit higher than the minimum, and that needs to be compensated. (Is this done on one bin, on bin 25?) Yes, this is done on every bin, and it obviously does not require any voice activity detection — it is a continuous process, you simply follow the lower edge. But there is still a mismatch, a bias, between the red estimate and the true power, which is obviously caused by taking the minimum: the minimum is always smaller than the mean, or in the best case equal to it. This is shown here in terms of probability distributions and can be computed: here is the probability distribution of the smoothed power, and this is the distribution of the minimum, which can be computed under some assumptions, and then you can compensate for the bias. Once you do this, plus some other small enhancements, you get the result shown in red, where you can clearly see that you now match the mean of the noise power, and you are even able to follow time-varying fluctuations — here a decrease in the noise power, here an increase — even during speech activity you can follow the noise power to some extent. That gives a much better noise estimate for your enhancement, and the noise estimate turns out to be a crucial thing: if the noise estimate is not good, your system will not deliver a good output. And once more, you do not need a voice activity detector for this continuous process. So that was a very quick overview of estimators for single-channel enhancement, and to conclude the chapter on single-channel processing I have one more addition, again about musical noise, because musical noise is the biggest problem people were facing even with all these nice estimators, and you have to do something about it if you want to sell your product in the end. Many people said single-channel enhancement would never work, simply because of that musical noise problem, but there are now solutions available which more or less get rid of it. This illustration is just my graphical imagination of how it sounds. So what is the trick here? What we do is a processing step in the cepstral domain, and I would like to briefly remind you what the cepstrum is. It is an often-used concept, especially in speech recognition with the famous mel-frequency cepstral coefficients. The cepstrum is the inverse Fourier transform of the log magnitude spectrum: you first take the magnitude of your Fourier spectrum or DFT coefficients, then you take the log — a compression again — and then the inverse Fourier transform. This dates back to the paper by Bogert, Healy and Tukey from 1963 — Tukey was one of the co-inventors of the fast Fourier transform — and it is a very useful concept, as I will explain. There is also some peculiar terminology attached to working in this domain: cepstrum is an artificial word, as you may notice, created by reversing the first half of the word spectrum, and in the cepstral domain we talk not about frequencies but about quefrencies.
Quefrency is the word frequency with, in fact, the first letters turned around, and you do not talk about harmonics but about rahmonics — so these authors invented a whole set of new vocabulary, and the paper they published has a very strange title, but as I pointed out it is a very important paper. The cepstrum is very important because it separates different components of speech signals, or speech spectra I should say: in particular it enables you to extract the envelope information, the harmonic structure of your signal, and the fine structure of the spectrum. Why is that — how can this simple, or not so simple, equation do all that for you? Essentially you can look at it as follows: here is the spectrum of your signal in the dB domain — dB is nothing else than the log we are used to when looking at speech spectra — and the inverse Fourier transform is not so different from a Fourier transform, so what it essentially does is take this log spectrum and compute a Fourier decomposition of it. It looks for slowly varying components and for fast varying components, just as with a time-domain signal: when you compute a Fourier spectrum of a time-domain signal you get an indication of which frequencies, which types of oscillation, are contained in it, and here we look at the spectrum and get information about which kinds of fluctuations and oscillations are contained in this spectral representation. Here is an example: here is the time-domain waveform of a vowel and here its log spectrum, where you can clearly see the harmonics. If you now think about a Fourier decomposition of this spectral shape, you could say: I find a coarse component here — maybe you have to half close your eyes and stare at the screen — something like a slow sinusoidal wave, which I try to point out here, and the harmonics sit on top of this coarse shape like a fast oscillation. And that is exactly what you find in the cepstrum: after this Fourier decomposition you find strong components at low quefrencies, corresponding to the coarse structure of the spectrum, and you find the fast oscillation, corresponding to the harmonics, encoded in this peak here, which corresponds to the fundamental frequency. You can therefore also use the cepstrum as a fundamental frequency estimator, and that was proposed many years ago already. Here is the first harmonic of the fundamental frequency, also visible as a peak, and in between you have all this fine structure that encodes the deviation of the true spectrum from the coarse envelope and the harmonic structure. So it is a very versatile representation of your speech spectrum. If you do this not just for a single segment of speech but for the whole spectrogram — here again a spectrogram of a clean speech signal — you get what we call a cepstrogram, where again you have time on the horizontal axis and quefrency on the vertical axis. You find a lot of information in the lower quefrency range — that is the envelope information — and then there is one line, not a full line but with some pauses or interruptions, which is the fundamental frequency, so it clearly shows the evolution of the fundamental frequency; you can even see the first harmonic, and everything else here is the fine structure of your spectrum.
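A small sketch of how one might compute the real cepstrum and a cepstrogram as just described (log magnitude spectrum followed by an inverse DFT, frame by frame); the window and frame parameters are illustrative assumptions.

```python
import numpy as np

def real_cepstrum(frame, eps=1e-12):
    """Real cepstrum of one signal frame: IDFT of the log magnitude spectrum."""
    spectrum = np.fft.rfft(frame)
    log_mag = np.log(np.abs(spectrum) + eps)     # log magnitude (compression)
    return np.fft.irfft(log_mag)                 # low quefrencies: envelope; peak near 1/f0: pitch

def cepstrogram(y, frame_len=512, hop=256):
    """Stack frame-wise cepstra into a (frames x quefrency) matrix."""
    win = np.hanning(frame_len)
    frames = [real_cepstrum(y[i:i + frame_len] * win)
              for i in range(0, len(y) - frame_len, hop)]
    return np.array(frames)
```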
This cepstral representation can now be used for a very specific smoothing process, very specifically adapted to speech. The idea is first to separate your signal into coarse and fine spectral features, which is what the cepstrum does, and then to apply a relatively strong smoothing on the spectral fine structure, because that encodes all the little fluctuations in your signal, and no smoothing, or just a little, on the coarse spectral structure, because that encodes the target signal, the speech. In this way you can reduce the residual noise quite a bit with negligible impact on the speech signal, and it is especially nice because it preserves the harmonic spectral structure of voiced speech segments very well. This method can be applied to a lot of different enhancement approaches — single-channel, multi-channel, blind source separation, dereverberation — and we will see it again in the hands-on session. Okay, here is — oops, okay — one application: you can use it in the SNR estimation process, for example for the Wiener filter. Here is a standard SNR estimate using the decision-directed approach, with some outliers that would give rise to artifacts, and here is the result after cepstral smoothing, where this is smoothed out quite a bit and would not give rise to annoying fluctuations. You can also see that it has some interpolating capability: especially in the harmonic region you find that the harmonics are somewhat restored, and that is because we specifically look at the fundamental-frequency quefrency bin and do not harm it, or restore it, so that the harmonics in between come out again to some extent. Okay, here is a quick demonstration: the clean signal, the noisy signal — oops, that did not work, why not — again the noisy signal, and the enhanced signal. What you find is that the residual noise sounds very natural; if you listen to it again and focus on the residual noise, it sounds mostly like the original, just attenuated, with very few artifacts. Yes — so when it comes to hearing aids there are several criteria. One is of course intelligibility, and the most important goal is to restore intelligibility, but the second one is listener fatigue or listening effort: to reduce the listening effort and to present a high-quality output signal without artifacts. There is a kind of trade-off and balance, and for these single-channel algorithms it is known that they do not improve intelligibility — if they are well designed they also do not lower it; here you had some slight distortions of the target signal as well, but I also went for a fairly high amount of noise reduction. So if it is well designed you more or less keep the intelligibility of the noisy signal; you do not improve it, but you reduce the listening effort quite a bit. The purpose of these single-channel methods in hearing aids, as they are now — maybe we have a better method in five years or so — is to get rid of the noise without harming intelligibility, but it is not possible for this type of method, or at least we do not have one yet, to improve intelligibility significantly; sometimes we see very small improvements. It is a bit different for cochlear implants: a cochlear implant is similar to a low-bit-rate speech coder, you have a very low-rate interface over which to transmit your information, and on this interface noise is very harmful. For cochlear implants it has been shown that with single-channel methods you get an improvement of intelligibility of roughly 2 dB.
Not a lot, but a statistically significant improvement. What does 2 dB mean? When I say 2 dB I mean signal-to-noise ratio in a speech reception test, where you typically measure at the 50% point. You have this psychometric relationship between speech reception and SNR, which often looks like a sigmoid function, and the 50% point would be here, with the SNR on this axis — so here would be 0 and here 100 percent. An improvement of 2 dB at this point means quite a lot: it could be a shift from 50 to 70%, depending on how steep the function is at that point. So in this domain of hearing aids even small but consistent improvements — consistent, of course — can contribute to significant improvements in speech reception. Yes — in the end you always have to do a listening test, but for lab evaluations, or intermediate evaluations while you are tuning the algorithm, people typically use, well, PESQ for example, the perceptual evaluation of speech quality. In most cases we look specifically at the distortion of the target signal and at the amount of noise reduction we get — there is usually a trade-off between the two, the more noise reduction the more distortion of the target — and we look at the quality of the residual noise and/or the total quality. In many cases even the simple SNR gives you some indication of what your algorithm does, because most of these filters are linear-phase filters, so they do not introduce much phase distortion, and then the SNR already gives you a good indication. But again, I like to emphasize: in the end you have to listen to the signals. Okay, here is just a statistical analysis, in terms of histograms, of spectral outliers; in fact you can show that many noise reduction algorithms produce heavy tails in the residual noise, and these correspond to the random outliers that give rise to musical noise. So here is a histogram of the noise power bins, and you can see the outliers with large values, which are relatively frequent — a deviation from the Rayleigh distribution, which would correspond to the Gaussian noise case — and after the temporal cepstrum smoothing the amount of outliers is much reduced, and so the musical noise is suppressed by this method. Okay, so that was it, and I propose a quick break, or some more questions if you like. Okay, 63 — okay, that is just a flooring operation on the signal-to-noise ratio: if you use the decision-directed estimate without this floor, you will again have a lot of outliers, so it is better to keep a certain level so that you do not have isolated peaks in your SNR estimate. It also depends on how much noise reduction you want, because the floor limits the amount of noise reduction you can achieve — the two are closely related. If you put in something like 0.1, which is minus 10 dB, then you probably get fairly good output quality but not a lot of noise reduction, on the order of 10 dB, and if you go to lower values you might see too many outliers. So I would say a value between 0.1 and 0.15 or so could be used, but that is something you need to play with, depending on the application. So the state of the art would be to use an estimator which allows you to introduce some compressive action and to fit the a priori assumptions to the true histogram of your clean signal, and then to use a method like temporal cepstrum smoothing to get rid of musical noise.
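Going back to the 2 dB figure for a moment, here is a small numerical illustration using a logistic psychometric function; the slope of roughly 10 percentage points per dB at the 50% point is an assumed value purely for illustration, not a measured one.

```python
import numpy as np

def speech_reception(snr_db, srt_db=0.0, slope=0.1):
    """Logistic psychometric function: intelligibility (0..1) versus SNR in dB.

    srt_db : speech reception threshold (the 50% point)
    slope  : assumed slope; 0.1 gives about 10 %-points per dB near the SRT
    """
    return 1.0 / (1.0 + np.exp(-4.0 * slope * (snr_db - srt_db)))

# A 2 dB effective SNR benefit measured at the 50% point:
print(speech_reception(0.0))   # ~0.50 without enhancement
print(speech_reception(2.0))   # ~0.69 with a 2 dB benefit -> roughly 50% to 70%
```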
What I have to say here is that these are all methods that can still be implemented under the low-latency constraint. The cepstral smoothing essentially does not introduce any additional latency — that is an important point to note, and that is why it is a nice ingredient for these algorithms. Other smoothing processes over time, whether in the time domain or somewhere in the spectral domain, will most probably introduce some latency. There are newer publications, for example by the group of DeLiang Wang at Ohio State University, where a deep neural network is used to estimate time-frequency segments, and in his publications he reports relatively large improvements in intelligibility for single-channel systems, for normal-hearing as well as hearing-impaired people. But that processing has a fairly large look-ahead and uses much longer segments of the signal to compute the output, so at this point it would not be ready for inclusion in a hearing aid. The problem, or let's say the art, is to boil this down to a low-latency system, and the experience we had is that then we are not able to improve intelligibility any more. When you have all the time in the world to process your signal — which is possible in some applications, like speech recognition, where you can read in larger chunks of the signal and do a lot of processing when latency is not a problem — then you can improve intelligibility with a single-channel approach as well. One exception I mentioned is low-bit-rate systems, like low-bit-rate coders, where we have also shown that we can improve intelligibility with systems like this, and cochlear implants as well, where we had experiments in our lab showing improvements in intelligibility. Okay, so I think we continue. So far we have treated the single channel, and now I move on to multi-channel and specifically dual-channel processing methods. Obviously single-channel algorithms are somewhat limited because they do not allow you to exploit spatial information, and, as I said before, single-channel methods — at least those designed for the low-latency constraints I mentioned — do not improve intelligibility but reduce the listening effort and improve the quality of the signal. Multi-channel systems allow you to exploit spatial information and sound field statistics, and again there are many, many different approaches; I can only pick out two or three and explain them. There is the most famous one, maybe, delay-and-sum beamforming, and filter-and-sum beamforming; there are adaptive beamformers; there is blind source separation in various versions; there is the multi-channel Wiener filter, the generalization of the Wiener filter to multiple input signals; and there is what we call a model-based adaptive beamforming approach, which I will explain in more detail. But since it is already late in the afternoon and you are maybe tired, let's do an interactive session, and let me ask the question: suppose you have two microphones — what would you do with these two microphone signals? Any ideas? Direction-of-arrival estimation, beamforming — one keyword was already there. For example, we could simply add the two microphone signals. And what would we gain? Well, that probably depends a lot on the target and noise signals and their spatial properties, but in a situation where the noise received at the microphones is completely uncorrelated and the target speaker is speaking from the broadside of these two microphones, we would gain a maximum of 3 dB.
So for delay-and-sum beamforming, if the noise components in the microphone signals are uncorrelated, you gain 10 log10 N dB, where N is the number of microphones. That is a useful, simple approach, but we also notice that in our application, hearing aids, the microphones are often very close together: if you look at one device, as I showed earlier today, the microphones are only 1 cm apart, and that leads to highly correlated noise signals. A delay-and-sum beamformer is then not very useful, because the two microphones pick up more or less the same signal and it does not help you a lot. So what else could we do, technically speaking? Yeah, perfect — the next step would be to subtract the two signals. What happens now? Well, if the target comes from this direction here, it would probably be bad, because then we would cancel the target signal and get no target at the output; that would not work. But if the target comes from this direction here, it could still work reasonably well — there would be some differentiating, subtractive action on the target — and any noise component that comes from the direction perpendicular to the axis of the two microphones would be cancelled out. So that could in fact be a good approach, and it is in fact used in hearing aids; it is called a differential microphone. Under the assumption that your target signal is aligned with the axis of the two microphones and the noise source comes from this other direction, you indeed get an improvement of the signal-to-noise ratio. You need an equalization filter at the output, because the subtraction acts more or less like a high-pass filter — a differentiating system — so you have to equalize for this with a low-pass filter, but then you get fairly decent output. This also works if the microphones are closely spaced, just one centimeter apart: you would still be able to eliminate the noise source at this point, as long as it is a point source. You can get a very high gain if the noise originates from a single acoustic point in space and there is no reverberation — if you do this experiment in an anechoic chamber you get fantastic results with a very simple system — while under more realistic conditions you might get 3 to 6 dB of gain, which is still much better than the summation, the delay-and-sum beamformer. I want to go into a bit more detail on this system. Here again we have the two closely spaced microphones at distance d, and we have the speech source aligned with the microphone axis — I have turned the arrangement around on this slide. The mechanism is as follows: we have an acoustic delay, sometimes called the external delay, which is simply the delay of the wavefronts travelling from one microphone to the next, and we also insert an electrical delay T in the signal path, sometimes called the internal delay, and then we do the subtraction at this point to change the characteristics of the overall system. Then again we have the equalization filter, which compensates for the subtractive operation. So it works with closely spaced microphones, the equalizer I already mentioned, it is a simple implementation, but there is also some noise amplification at low frequencies.
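A small sketch of the first-order differential array just described: the rear microphone is delayed by the internal delay T and subtracted, and the resulting directivity is evaluated over the angle of arrival. The spacing, delay and frequency values are illustrative assumptions, and the low-pass equalizer is only hinted at by normalizing the pattern.

```python
import numpy as np

def differential_pattern(theta_deg, f=1000.0, d=0.01, T=0.0, c=343.0):
    """Magnitude response of a first-order differential microphone pair.

    theta_deg : angle of arrival (0 deg = target direction along the array axis)
    d         : microphone spacing in metres (1 cm here)
    T         : internal (electrical) delay in seconds; T = 0 gives a dipole,
                T = d/c gives a cardioid with its null at 180 degrees
    """
    theta = np.deg2rad(theta_deg)
    omega = 2.0 * np.pi * f
    tau = T + (d / c) * np.cos(theta)            # total delay between front and delayed rear mic
    return np.abs(1.0 - np.exp(-1j * omega * tau))

angles = np.arange(0, 361, 15)
dipole = differential_pattern(angles, T=0.0)              # nulls at 90 and 270 degrees
cardioid = differential_pattern(angles, T=0.01 / 343.0)   # null towards the back (180 degrees)
print(np.round(cardioid / cardioid.max(), 2))             # normalized pattern over angle
```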
Because of this low-frequency noise amplification you need high-quality microphones to implement it, not cheap ones — it does not work with cheap microphones. Adaptive control of the directivity pattern is possible, and you get a gain, as I mentioned before, between 3 and 6 dB. The good thing is that with this internal delay T, the electrical delay, you can control the directivity pattern of this two-microphone system. If T is zero — no delay — then we have exactly the effect I explained already: you receive the signal from the front without attenuation — this is a polar pattern, where the curve gives you the attenuation of the signal as a function of the direction of arrival — and anything perpendicular to the array will be cancelled out because of the subtraction of the two microphone signals, so from here or from here. But now you can change this internal delay and, for example, achieve a cardioid pattern, or super-cardioid, or hyper-cardioid — the different directivity patterns. For that reason most hearing aids come with two omnidirectional microphones which are then combined in the hearing aid in an adaptive fashion, so that different directivity patterns can be realized simply by software or hardware means, without needing an extra directional microphone for the beamforming. Here again is the relation involving the distance between the microphones and the speed of sound; beta is a parameter that can be used instead of the internal delay T, and here is how to compute beta from T or T from beta. And here are typical parameters used to achieve certain types of directivity; you can see that the directivity index, which gives you the improvement, ranges between 4.7 and 6 dB. This can all be made adaptive by taking the two microphones, forming the dipole characteristic and the cardioid characteristic, and then having an adaptive mixer that mixes these two output signals. In this way you can shift the null, the nulling action, anywhere along the back hemisphere, so if there is a noise source somewhere behind you, the system can find it automatically and eliminate it. It is a very nice and very simple approach that is also often used in hearing aids; it goes back to Elko and Pong, and they also have a patent on it. So what else can you do with two microphones? After the summation and the subtraction, we could filter one of the signals and then subtract it, and that could be beneficial for various reasons. One typical application is what we call noise cancellation: we would try to pick up with one microphone the noise source only, and with the other one the speech source, and then we could try to estimate the noise in the primary microphone and subtract this estimated noise, so that hopefully only the speech remains at the output. This is called noise cancellation, and it is similar, for example, to echo cancellation — someone asked me about echo cancellation before lunch. So what is the problem here? In principle this works fine if you are able to extract a reference noise signal, but since in our application the microphones are closely spaced — in the best case maybe one on the left ear and one on the right ear — the speech signal, which is meant to be picked up only by the primary microphone, will also leak into the reference microphone, and then it will also be cancelled at the output.
So even though this is a very famous method that has been used in biomedical engineering, for example for the compensation of EEG artifacts, and in all kinds of different applications, also echo cancellation, in general for these acoustic noise reduction applications it is not a good idea, because it is difficult to control the target signal leakage into the reference microphone, and that degrades the result. But there is a structure in which we can combine all the different methods we have seen so far. Here we have the summation of the two microphones, so we sum the two signals to build a first signal; here we do the subtraction, to generate a signal in which a source in front of the two microphones — the target, in fact — is cancelled out. So here we do not cancel the noise source, we cancel the target, which is assumed to sit in front of the microphones. Then we use a filter to estimate the noise that we still see in the upper path. The idea is: here some beamforming, some delay-and-sum beamforming, to generate a first estimate of your target signal, which is assumed to sit in front of the microphones; here we generate a noise reference by cancelling out the target; and then we have a filter that matches the noise we see at this point with the noise still present in the upper signal and cancels it out at this point, and this is the output signal. This is also a very famous approach, known as the generalized sidelobe canceller by Griffiths and Jim, and it uses all the different components we have seen so far. This generalized sidelobe canceller, and also the other approaches I have shown, can be generalized to many microphones; they are not limited to two — I just explained them using two microphones. Here, for example, is the delay-and-sum beamformer, where you insert delays to cope with other directions of arrival, and here is the summation point of the beamformer; this can be done with arbitrarily many microphones, although for hearing aid applications the number of places where you can put microphones is limited. And there is also the filter-and-sum beamformer, where you insert filters and can then optimize these filters for maximum attenuation of the noise; this, too, generalizes to arbitrarily many microphones. Okay, so this is just a quick overview of what can be done in principle with the material that we have, and now let's have a look at binaural processing methods, so methods that combine the left and the right ear. Before we had this wireless link and connectivity, the two hearing aids would operate completely independently, and sometimes you could even select different programs independently — which would not be a good idea: in one ear you would have a program for, say, speech in noise and in the other a music listening program, and that would of course be very confusing for users. I would like to present one typical algorithm that is based on two microphones, which we will also explore a bit in the hands-on session, and which of course requires the streaming capability, the streaming of audio between the left and the right side. There was also the earlier question of what the wireless link is good for — it is exactly this kind of algorithm; to implement it you need real-time streaming of audio between the left and the right side. The algorithm I would like to present is a very classic one which was initially developed for dereverberation, for getting rid of reverberance.
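A minimal two-microphone sketch of the generalized sidelobe canceller structure described a moment ago: a fixed sum path as the first target estimate, a difference path as the noise reference (blocking the frontal target), and an NLMS adaptive filter that subtracts the filtered reference from the sum path. Filter length and step size are illustrative assumptions.

```python
import numpy as np

def gsc_two_mic(x1, x2, filt_len=64, mu=0.1, eps=1e-8):
    """Two-microphone GSC sketch: fixed beamformer + blocking + NLMS noise canceller."""
    d = 0.5 * (x1 + x2)          # fixed beamformer (sum path): first target estimate
    b = 0.5 * (x1 - x2)          # blocking path: frontal target cancelled, noise reference
    w = np.zeros(filt_len)       # adaptive filter coefficients
    out = np.zeros(len(d))
    for n in range(filt_len, len(d)):
        ref = b[n - filt_len:n][::-1]                   # most recent reference samples
        y = np.dot(w, ref)                              # noise estimate in the sum path
        e = d[n] - y                                    # error = enhanced output sample
        w += mu * e * ref / (np.dot(ref, ref) + eps)    # NLMS update
        out[n] = e
    return out
```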
It also removes some of the noise that is frequently encountered in closed spaces where you have lots of reverb. So here you see a reverberated speech signal, and in the spectrum you can clearly see that the signal is smeared out a bit; it is not as clear or crisp as the unreverberated, dry original would look, and it also sounds a bit reverberated, so there is clearly reverberation. The idea now is that the direct sound components from the speaker to the microphones, especially if the speaker is close to the microphones, will be highly correlated, because that is more or less the same signal, while the reverberation, which is composed of many reflections from the walls, will come from all kinds of different directions, at least the late reverberation, and will be more or less uncorrelated. So again you have a way to distinguish between the desired and the undesired signal components in terms of statistical properties: one is correlated, one is uncorrelated, and that is a way to get rid of it. There was a very early proposal, in fact from 1977, by Allen, Berkley and Blauert, Jont Allen being well known, and Blauert in Bochum, Professor Emeritus, where they used the so-called magnitude squared coherence function for the suppression of uncorrelated signal components. First let's listen again to the noisy signal, and then to the signal that is reverberated and has some noise added; maybe I go to the second one directly. This is like a typical signal in a train station where you have announcements, and if you ask a hearing-impaired person whether she or he understands announcements at train stations, the typical answer is no, simply because of the reverb and all the noise; but even for normally hearing people it is sometimes very difficult to understand what they present over the PA system. Okay, here is a block diagram, which I have copied directly from the original publication. Here are the two signals, called x(t) and y(t); this block signifies a spectral decomposition using a filter bank, or we could use our overlap-add DFT-based analysis-synthesis system. Then there is some phase compensation here at this point, then you add the two signals to get some gain, a maximum of 3 dB, by adding the two microphone signals, and then there is a gain applied in each frequency band. This gain, for example, could be based on, or is based on, the correlation between the two microphone signals, and as I said before, the direct components would be highly correlated, so the gain should open up, while the uncorrelated components, the late reverb and all the ambient noise, should be suppressed; the gain should be designed such that these components are then suppressed in those frequency bins. Then there is again a synthesis filter bank or an overlap-add system. Okay, a few more words about the coherence function, because the coherence function is a convenient way to compute this gain. The complex coherence is nothing else than a correlation coefficient in the spectral domain: you go to the DFT domain, we have done this now several times, and you compute the cross power spectral density between the two microphone signals, that is Phi_y1y2, the cross power spectral density, and you normalize it by the auto power spectral densities of the first and the second signal. So it is like a correlation coefficient in the spectral domain; you do this in each frequency bin, and then you get frequency-dependent information, and that is called the coherence function.
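Written out, the quantities just described take the following form; this is a sketch of the standard definitions, with Omega as a generic frequency variable and Gamma as an assumed symbol for the complex coherence (the magnitude squared version is the one discussed next):

```latex
\Gamma_{y_1 y_2}(\Omega) \;=\; \frac{\Phi_{y_1 y_2}(\Omega)}{\sqrt{\Phi_{y_1 y_1}(\Omega)\,\Phi_{y_2 y_2}(\Omega)}},
\qquad
\mathrm{MSC}(\Omega) \;=\; \bigl|\Gamma_{y_1 y_2}(\Omega)\bigr|^{2}
\;=\; \frac{\bigl|\Phi_{y_1 y_2}(\Omega)\bigr|^{2}}{\Phi_{y_1 y_1}(\Omega)\,\Phi_{y_2 y_2}(\Omega)} \;\in\; [0,1].
```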
Also very popular is the magnitude squared coherence; for this you simply take the magnitude squared value, which means that you take the magnitude squared in the numerator and the corresponding product in the denominator. The nice thing about this magnitude squared coherence is of course that it is a real-valued function, it is not complex anymore, and it ranges between 0 and 1. So you really have a gain function that can be 0 whenever the two signals are completely uncorrelated and 1 when they are completely correlated, and now you have exactly the type of gain function you would like to have: it takes away those frequency bands which contain only uncorrelated components, so reverb and noise, and keeps those frequency bands which have highly correlated signal components, which most likely stem from the target speaker. Here is just a plot of the magnitude squared coherence for diffuse noise and late reverb. It can be shown that for diffuse noise, which is a noise field where you receive the same signal energy from all spatial directions, the magnitude squared coherence looks like a sinc function, or a magnitude squared sinc function, of frequency. Here the solid line is the sinc function, the model, and the dashed line is a measurement, and you can clearly see that the model works fairly well. So you can assume that when the microphones have a certain distance, you have mostly uncorrelated signal components at high frequencies and some correlation at low frequencies; the further the microphones are apart, the less correlation you see, so here it is confined to only very low frequencies, depending on the distance between the microphones. Okay, and here is what it sounds like: the reverberated and noisy original and the processed signal. You hear clearly that the reverberance has been significantly reduced, and also some of the noise has been reduced, but as we see from these figures, the low-frequency noise cannot be reduced, because if it is a diffuse noise field, as in this simulation, then it will also be highly correlated at low frequencies and will be passed by the system. Okay, then the final approach I would like to explain a bit is based on the generalized sidelobe canceller which was presented earlier; we derived it, in fact, or developed it from basic components, and here is now the multi-microphone version of it. We have a delay-and-sum beamformer to generate a first improved signal; then we have here what we call the blocking matrix, which is in fact the generalization of the simple subtraction that we have seen before. The simple subtraction blocks the target because it has a null, or steers a null, towards the target, and that can now be generalized to many microphones, and then we talk about a blocking matrix whose task is to block out the target signal so that we have only noise signals at this point. Then we have a multi-channel adaptive noise canceller which estimates the noise in this signal and subtracts it, cancels it out, at this point. This is a typical scenario, with five microphones now, where you have one speaker and a second speaker, and the task is to extract these speakers here at the output. How can we do this? What I forgot to mention is that this is not a fixed system anymore; it is fully adaptive, so it estimates the positions of the speakers, the blocking matrix and the adaptive noise canceller, all based on posterior distributions of the speech activity of the two speakers. It is fully online and fully adaptive, and then we can extract the two different sources. So now here is a quick demonstration of it.
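As a small illustration of how such a coherence-based gain could be computed in practice, here is a minimal Python sketch, not the implementation from the talk or the hands-on material: it estimates the magnitude squared coherence of two microphone signals with Welch averaging and also returns the theoretical diffuse-field model, the squared sinc function mentioned above. The microphone distance, sampling rate and segment length are assumed example values.

```python
import numpy as np
from scipy.signal import coherence

c = 343.0    # speed of sound in m/s
d = 0.15     # assumed microphone distance in m (roughly ear to ear)
fs = 16000   # assumed sampling rate in Hz

def estimated_and_model_msc(x1, x2, nperseg=512):
    """Estimate the magnitude squared coherence between two microphone
    signals (Welch averaging) and return the theoretical MSC of an ideal
    diffuse noise field, sinc^2(2 f d / c), on the same frequency grid."""
    f, msc = coherence(x1, x2, fs=fs, nperseg=nperseg)
    model = np.sinc(2.0 * f * d / c) ** 2   # np.sinc(x) = sin(pi x) / (pi x)
    return f, msc, model

# A coherence-based dereverberation gain in the spirit of the 1977 approach
# could then simply reuse the per-bin MSC as a spectral weight between 0 and 1.
```

The model curve also makes the limitation audible in the demo explicit: below the first zero of the sinc function the diffuse noise is itself highly coherent, so a coherence-based gain cannot remove it.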
Here are speaker one and speaker two in the form of spectrograms; one is a male and the other a female speaker, and here is the superposition of both, so they are speaking at the same time, with some additive noise added to the two signals. The separation now works in two steps: the first is to localize the sources, and the second is then to extract the two different target signals. The localization uses an approach that is based on beamforming: we form a beam of high sensitivity, and with this beam we scan the room, or scan a certain angular range, and try to find the two sources. Now, if I step through the different directions, so here the beam is indicated by this red bar, and I move it around, then once I point it into the direction of speaker one, I see here the spectrogram of speaker one; if I continue scanning the room, then at some point I hit speaker two, and here we see the spectral information of speaker two, and then it disappears again. You also notice that at low frequencies we always receive some signal: at low frequencies there is no directivity for this array, because the directivity depends on the aperture, or the distance of the microphones, in relation to the wavelengths. Is this anechoic? No, no, this is slightly reverberant; I have a demonstration audio signal in a minute and you hear some reverberance, not a lot, not as much as in the previous example, but still some. Okay, that is the SRP-PHAT approach, which may be known to you; essentially it acts like a beamformer, and the idea is to compute a measure that allows you to measure the power that you receive from various directions. In the interest of time I will not go into detail here, but let me explain it using this figure, where we look at this example, the two overlapping spectrograms plus additive noise, over time and with frequency on the vertical axis. Now we use the SRP-PHAT measure and find, in each time-frequency point, the angle that maximizes this measure. What you see here are lots of blueish points, I guess they are also blueish on your screen, and yellow points, and if you look at the corresponding angles, these are exactly the angles where the two speakers sit: this blueish colour is somewhere around 60 degrees, and this yellow is maybe 110 degrees, and these are the winning angles in each time-frequency point. So we indeed find the information about the two speakers in the pattern that we receive from the beamformer, and now we can use a lot of different strategies to extract the two speakers. We could, for example, build a mask for one speaker or the other; a binary mask would not sound that good, but it would still be capable of extracting them. Or we can control the model-based generalized sidelobe canceller with this information, and that is the approach we are in fact taking. So here we have a histogram for a single frame: at a single time we take out one slice and we get a histogram of the angles, the distribution, so we get the two peaks here for the yellow and the blueish bins. Then we fit a GMM to this, which gives a stable estimate of the positions of the two speakers, and that is it. We can then show that we get some improvement in terms of signal-to-noise ratio in comparison to a delay-and-sum beamformer, which is the blue line, for two different measures: the segmental intelligibility-weighted signal-to-noise ratio improvement and the segmental intelligibility-weighted signal-to-interference-plus-noise ratio improvement.
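To make the SRP-PHAT idea a bit more concrete, here is a very small Python sketch, simplified to a single microphone pair and a single signal frame rather than the per-time-frequency-point processing used in the system described above: the phase-transformed cross spectrum is steered to the delay implied by each candidate angle, and the angle with the largest steered response power wins. The spacing, sampling rate and angular grid are assumed example values.

```python
import numpy as np

c = 343.0     # speed of sound in m/s
fs = 16000    # assumed sampling rate in Hz
d = 0.05      # assumed spacing of one microphone pair in m

def srp_phat_angle(frame1, frame2, angles_deg=np.arange(0, 181, 2)):
    """Return the direction-of-arrival estimate (in degrees) for one frame
    of a two-microphone pair using an SRP-PHAT style steered search."""
    n = len(frame1)
    X1 = np.fft.rfft(frame1 * np.hanning(n))
    X2 = np.fft.rfft(frame2 * np.hanning(n))
    cross = X1 * np.conj(X2)
    cross /= np.abs(cross) + 1e-12             # PHAT weighting: keep only the phase
    f = np.fft.rfftfreq(n, 1.0 / fs)
    best_angle, best_power = None, -np.inf
    for ang in angles_deg:
        # delay of microphone 2 relative to microphone 1 for a plane wave
        # from this angle (0 deg = endfire towards microphone 1)
        tau = d * np.cos(np.deg2rad(ang)) / c
        power = np.real(np.sum(cross * np.exp(-2j * np.pi * f * tau)))
        if power > best_power:
            best_angle, best_power = ang, power
    return best_angle
```

In the full system the same kind of measure is evaluated per time-frequency point across all microphone pairs, which is what produces the angle histograms that the GMM is then fitted to.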
So compared to the delay-and-sum beamformer you find some nice gain in this system, and it degrades a bit with increasing noise levels, because then it is obviously more difficult to find the two target sources. Now here is an audio demonstration with the two speakers; listen to them, the mixed signal and the output signals. In each output you still hear a bit of the other speaker in the background, and that is due to the reverberant part of the signal. Okay, that brings me to my summary. I have shown that modern hearing systems are highly complex signal processing devices which also have to fulfil a lot of constraints, especially on latency and power. The signal enhancement task is at the core of these devices, because adverse acoustic conditions are still the most challenging situation, and coping with them is quite difficult. There are single-channel and multi-channel approaches, microphone array processing approaches and source separation approaches, and there is also acoustic scene interpretation, which I have not talked about here, but it is also part of the processing, in order to control the different options that you have available for processing your signal. Well, the challenge continues: there is still a lot of work to be done, especially to find better methods to deal with noise and reverberation and to find low-complexity, low-power, low-latency implementations. New solutions will probably arise from better models of speech signals and hearing, from the availability of sensor networks and the wireless connectivity that we now have at hand, and probably from better models of how humans process acoustic signals in general, so top-down cognitive processes that can be included. Some of these problems are dealt with in a project that we are currently working on and that I am coordinating, and we also have research fellows from this project here: the ICanHear Marie Curie Initial Training Network, where we look at modelling the processing in the normal-hearing and impaired auditory system, we evaluate the algorithms that are developed, and we apply new strategies for improved communication. A very central part of the project is that we have a common development platform for implementing these algorithms, so they are implemented in real time and can be compared between labs, and we also have common evaluation schemes, which I have also not talked about here, mostly because of time constraints; and these are the partners in this network. Okay, a lot of these algorithms, or at least the older ones I have presented, are also described in these two books, so a little advertisement at the end, and I have also compiled references for you, five pages of references, where you find the references I have mentioned on my slides in full detail. That is it. I have made up a bit of time, so we have to go to the hands-on session in about 15 minutes; maybe we can take a short break, start at 5 and finish at 7. How long it takes depends on how fast you can program, or type, but don't be scared, it will be step by step, and in fact you don't have to program a lot, it is just a few lines of code, a few optimizations. But maybe, before we leave, I can give you a quick look at the hands-on exercise while I still have the projector here. Essentially we will implement this dual-channel dereverberation system, and when I say we implement it, I should add that it is mostly implemented already and you only have to add a few things. So you would optimize or implement parts of it:
the low-delay spectral analysis-synthesis system, the delay-and-sum beamforming, then the gain function, and, if there is still time, but that is already optional, we can also include the temporal smoothing of the spectral gains using that technique. You need to fill in a bit of code, look at the spectrograms, and most importantly listen to the signals that you get.
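For orientation before the hands-on, here is a minimal sketch of one common way to smooth spectral gains over time, a first-order recursive average per frequency bin; the smoothing constant is an assumed example value, and the exact rule used in the hands-on material may differ.

```python
import numpy as np

def smooth_gains(gain_frames, alpha=0.7):
    """Recursively smooth spectral gains over time, per frequency bin:
    g_smooth[l] = alpha * g_smooth[l-1] + (1 - alpha) * g[l].
    gain_frames: array of shape (num_frames, num_bins) with gains in [0, 1].
    alpha is an assumed smoothing constant; larger alpha means slower gain
    changes and fewer musical-noise artifacts, at the price of reacting
    more slowly to changes in the signal."""
    smoothed = np.empty_like(gain_frames)
    prev = gain_frames[0]
    for l, g in enumerate(gain_frames):
        prev = alpha * prev + (1.0 - alpha) * g
        smoothed[l] = prev
    return smoothed
```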