I would like to thank the PCCA organizers for their invitation to speak here. Let me share my screen now. The title of my presentation is Neural Spectral-Spatial Filter. I'm from Ohio State University, where I direct the Perception and Neurodynamics Laboratory.

Here is the outline of my presentation. I'll first introduce the background of this study, in particular masking-based beamforming. Then I'll present the complex spectral mapping approach to speech dereverberation. And finally I'll present the main theme of this talk, which is the neural spectral-spatial filter.

Beamforming is a widely used, decades-old multi-channel processing methodology, also known as spatial filtering. The very term beamforming suggests that we need to know which direction the target source comes from, for steering purposes. The target direction is encoded in what is known as the steering vector. To figure out the steering vector, we need to rely on so-called direction-of-arrival estimation for the target source, and that is typically conducted through sound localization. But sound localization in reverberant, multi-source environments is itself very difficult. For human audition, it's not even clear that localization should precede source separation; perceptual evidence suggests that localization seems to depend instead on source separation.

Mask-based beamforming is a relatively recent idea in this long history of beamforming development. The idea here is to use a time-frequency mask to guide beamforming. Such masking helps in two ways. First, it specifies what the target source is. This is done through supervised learning, where you can tell the system what the target source is, whether it's a speech signal or a music source, for example. Second, it helps to suppress the interfering sources, which is what time-frequency masking was originally proposed for.

Let me explain this idea a little further, because it is going to be utilized later. Let's take the perhaps most widely used beamformer as an example, which is MVDR beamforming. MVDR stands for minimum variance distortionless response. The purpose of this beamformer is to minimize the noise energy coming from non-target directions, that is, directions not corresponding to the steering vector, while maintaining the signal coming from the target direction.

Let me write down the signal model. We have the array recording, which corresponds to y. It's a vector, indicated by boldface, over the sensors or microphones, and we represent it in the time-frequency domain, so the basic unit is a T-F (time-frequency) unit. It has two components. One is the target source, indicated by S(t, f); if you do speech dereverberation, the target signal would be the direct-path signal, or the dry source signal. It is multiplied by the vector c, the steering vector of the array, and that gives the speech signal received by the array itself. And of course we also have the background noise; n is the spatial vector of the background noise. MVDR beamforming corresponds to quadratic optimization with a constraint, and it can be solved through standard quadratic optimization techniques. We need to figure out the weight vector w that minimizes the term w^H Φ_N w, where Φ_N in the middle of this term is the spatial covariance matrix of the noise.
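To make the spoken description concrete, here is the signal model and the MVDR criterion in equation form, using the notation just introduced. This is the standard textbook formulation rather than a slide from the talk:

```latex
% Signal model at each time-frequency unit (boldface = vector across microphones):
\mathbf{y}(t,f) = \mathbf{c}(f)\, S(t,f) + \mathbf{n}(t,f)

% MVDR: minimize residual noise power, subject to no distortion
% in the steering direction (the constraint discussed next):
\hat{\mathbf{w}}(f) = \arg\min_{\mathbf{w}}\ \mathbf{w}^{\mathsf{H}}\, \Phi_{N}(f)\, \mathbf{w}
\quad \text{subject to} \quad \mathbf{w}^{\mathsf{H}}\, \mathbf{c}(f) = 1
```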
Minimizing this term corresponds to minimizing the total noise energy. But it is subject to the constraint I'm showing now: in the steering-vector direction c, we want no processing, no distortion. That corresponds to the inner product of the weight vector with c; the weight vector needs to be transposed, which in the complex domain corresponds to the conjugate transpose. When that product equals one, the distortionless constraint is satisfied. With the distortionless constraint in place, minimizing the output power is the same as minimizing the noise power, and we can solve this. The solution has essentially two things in it: one is the noise covariance matrix Φ_N, and the other is the steering vector.

Once the beamforming weight vector is solved, beamforming corresponds to, again, the inner product of the weight vector with the array recording y(t, f), and that gives us the estimate of the target signal, Ŝ(t, f). So it boils down to essentially accurate estimation of the steering vector, which, as we said already, is about localization, and of the noise covariance matrix. Furthermore, we know that c corresponds to the principal eigenvector of the target covariance matrix. And if speech and noise are uncorrelated, as they typically are, then we have this relation: the target covariance matrix is the mixture covariance matrix minus the noise covariance matrix. So we can basically say that it really is the noise estimate that is crucial for beamforming performance, just like in traditional speech enhancement, or single-channel noise suppression, where background noise estimation is key.

When you have a time-frequency mask that indicates the target signal in the time-frequency representation, it provides a way of more accurately estimating the covariance matrix of the noise. This was done in two seminal studies published in 2016; one used a recurrent neural network to estimate the monaural ideal binary mask (IBM), and the other estimated a ratio mask. The idea, as illustrated here, is that you have an array recording, and first, as in single-channel speech enhancement, a time-frequency mask is computed for each of the microphone recordings. Then the masks are combined, and the combined mask provides the information for computing the noise covariance matrix in MVDR. Since the suppression done by the beamformer is not complete, one can apply time-frequency masking one more time, using the same kind of mask as a post-processing step, or post-filter, and that gives us the output.

This kind of beamforming was responsible for large improvements in two open evaluation challenges, called CHiME-3 and CHiME-4, with results much better than traditional beamforming. I think it's fair to say that mask-based beamforming represents a major advance in beamforming development, particularly for beamforming-based multi-channel automatic speech recognition.
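Putting the pieces together, here is a minimal NumPy sketch of mask-based MVDR beamforming as just described. The names and shapes are illustrative assumptions, not code from the cited studies, and the small diagonal loading term is an implementation detail added for numerical stability:

```python
import numpy as np

def mask_based_mvdr(Y, mask, eps=1e-6):
    # Y: (P, T, F) multi-channel complex STFT; mask: (T, F) target mask in [0, 1].
    P, T, F = Y.shape
    S_hat = np.zeros((T, F), dtype=complex)
    for f in range(F):
        Yf = Y[:, :, f]                                  # (P, T)
        m = mask[:, f]
        # Noise covariance from T-F units dominated by noise, weighted by (1 - mask).
        w_n = (1.0 - m) / ((1.0 - m).sum() + eps)
        Phi_N = (Yf * w_n) @ Yf.conj().T                 # (P, P)
        # Target covariance by subtraction, assuming speech and noise are uncorrelated.
        Phi_Y = (Yf @ Yf.conj().T) / T
        Phi_S = Phi_Y - Phi_N
        # Steering vector: principal eigenvector of the target covariance matrix.
        _, eigvecs = np.linalg.eigh(Phi_S)
        c = eigvecs[:, -1]
        # Closed-form MVDR solution: w = Phi_N^{-1} c / (c^H Phi_N^{-1} c).
        phi_inv_c = np.linalg.solve(Phi_N + eps * np.eye(P), c)
        w = phi_inv_c / (c.conj() @ phi_inv_c)
        # Beamforming: inner product of the weight vector with the recording.
        S_hat[:, f] = w.conj() @ Yf
    return S_hat
```

A masked output, the post-filtering step mentioned above, would simply be `mask * S_hat` applied after beamforming.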
Let me move to the second part of my talk, which is the complex spectral mapping approach to speech dereverberation. A few years ago we published a study that uses complex spectral mapping to perform multi-channel speech dereverberation, based on a strategy of target cancellation. One nice property of this approach is that single-channel dereverberation is naturally treated as a special case of multi-channel dereverberation.

People in this audience know that reverberation corresponds to reflections in an enclosed space. As illustrated here, you have a speaker uttering a speech signal, and you have a microphone array here. There is a direct-path signal, which is what we would like to obtain, but there are an infinite number of reflections that also get recorded. And in addition, we also need to consider diffuse background noise, which is illustrated here.

The signal model for this speech enhancement problem can be formulated in a similar way as we just introduced. The difference is that we now designate one of the microphones as the reference microphone, indicated by q. We have the total recording of the array, which is y, represented in the STFT (short-time Fourier transform) domain. The first term corresponds to the direct-path target speech as recorded by the microphone array. The second term, h, corresponds to the reverberated versions of the target signal; since we're doing dereverberation, we want to remove them. The third term, n, corresponds to the reverberant background noise. So we have both dereverberation and denoising; we would like to remove both of these terms. We combine h and n together to form a single term, v, so that the array recording has two terms in it: one is the target signal, and the second is the interference.

This complex-domain approach is based on the realization that the phase relations of a signal between multiple microphones encode spatial characteristics, which is obviously extremely important for spatial processing. And we think the complex spectrogram is a natural representation for signal phase in addition to magnitude. So we are doing spectral mapping, but spectral mapping in the complex domain. This boils down to employing a deep neural network to predict the real and imaginary parts of the direct sound from the real and imaginary parts of the noisy and reverberant mixture. We perform denoising and dereverberation on the real spectrogram and the imaginary spectrogram together. As long as we handle the imaginary unit properly, a real-valued neural network can perform this complex-domain processing.

The proposed strategy is to first perform speech dereverberation in the individual channels. You have P channels here, and each channel's recording goes through single-channel speech dereverberation. Even though this is done multiple times, we use the same network to process each channel individually. That gives the initial dereverberated signals, Ŝ_1 through Ŝ_P, and we send them to an MVDR beamformer. The MVDR beamformer then gives us an estimate of the target signal, which is single-channel. But instead of taking this as the final output, and here is the difference, here is where we're actually doing target cancellation: we take this as the estimate of the target signal, and then we compute the difference between the mixture signal, Y_q, and the beamformer output, BF_q, and that gives us an estimate of the non-target signals.
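Here is a schematic sketch of this two-stage pipeline, including the second-stage network described just after. The callables dnn_1, mvdr, and dnn_2 are placeholders I'm assuming for illustration; they do not come from the paper:

```python
import numpy as np

def two_stage_dereverb(Y, dnn_1, mvdr, dnn_2, q=0):
    # Y: (P, T, F) multi-channel complex STFT of the mixture.
    # Stage 1: the same single-channel network dereverberates every channel.
    S_init = np.stack([dnn_1(Y[p]) for p in range(Y.shape[0])])
    # MVDR computed from the initial estimates yields BF_q, a single-channel
    # estimate of the direct-path target at reference microphone q.
    BF_q = mvdr(Y, S_init, q)
    # Target cancellation: the residual approximates the non-target signals,
    # i.e., reverberated versions of the target plus background noise.
    residual = Y[q] - BF_q
    # Stage 2: a second, multi-channel network estimates the target from the
    # reference-microphone mixture together with the cancellation residual.
    return dnn_2(Y[q], residual)
```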
Then, together with the mixture signal, we perform dereverberation a second time, using a multi-channel dereverberation network, and that gives us the output, Ŝ_q.

If we have a single channel, the task of complex spectral mapping can be formulated as follows: we want to predict S_q, where q is the reference microphone, based on the microphone recording Y_q. The loss function is formulated in the real and imaginary (RI) domain. The RI loss is simply the L1 difference between the target signal's real component and the estimated real component, plus the L1 difference between the imaginary components of the target and the estimate. Compared to earlier work, we have added a magnitude term: the first term of the loss is the RI loss, and the second term is nothing but a loss in the magnitude domain, between the absolute value of S_q and the absolute value of its estimate. The inclusion of the magnitude loss leads to much better speech quality, essentially reflecting the relative importance of magnitude over phase.

For MVDR beamforming, we now have the estimated target signals, so we can directly estimate the target spatial covariance matrix: it is just the time average of the outer products of the estimated signal vectors, and that gives us Φ_S. Likewise, the interference signal is nothing but the mixture signal minus the target signal estimate, and with this v̂ estimated, we can calculate the interference covariance matrix in the same way. Note that this is very different from mask-based beamforming, because mask-based beamforming applies real-valued masks to the mixture signals; here, we directly estimate the target signals and calculate the target covariance matrix from them. With the covariance matrices calculated, figuring out the steering vector is straightforward, as already described: it is nothing but taking the estimated target covariance matrix and finding its principal eigenvector, and the rest is done in the same fashion.

Now let's look at the second stage. We feed the RI parts of Y_q minus BF_q, as well as the RI parts of Y_q, to the second network to estimate the real and imaginary parts of the target signal. The term Y_q minus BF_q corresponds, as we said, to a filtered version of all the non-target signals; if the beamformer is accurate, this term is very close to v_q, the combination of the reverberated versions of the target plus the background noise. Why do we do this, and why don't we simply take the beamformer output as the final result? Because without this term, the network can actually confuse the direct-path signal with its reverberated versions. One can view the second stage as a post-filter.
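Before turning to the evaluation, here is a compact NumPy paraphrase of the loss and the covariance computation just described. The shapes and names are my assumptions for illustration, not the paper's code:

```python
import numpy as np

def ri_plus_magnitude_loss(S_ref, S_est):
    # S_ref, S_est: complex spectrograms of the target and the estimate.
    # L_RI: L1 losses on the real and imaginary components.
    loss_ri = np.mean(np.abs(S_ref.real - S_est.real)) \
            + np.mean(np.abs(S_ref.imag - S_est.imag))
    # Added magnitude term, reported to give much better speech quality.
    loss_mag = np.mean(np.abs(np.abs(S_ref) - np.abs(S_est)))
    return loss_ri + loss_mag

def covariance_from_estimates(S_hat_f):
    # S_hat_f: (P, T) directly estimated target signals at one frequency bin.
    # Spatial covariance as the time average of outer products; the
    # interference covariance is computed the same way from Y - S_hat.
    T = S_hat_f.shape[1]
    return (S_hat_f @ S_hat_f.conj().T) / T
```

Note the contrast with the mask-based sketch earlier: no mask weighting of the mixture is involved, since the covariances come from the estimated signals themselves.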
This algorithm was evaluated on the REVERB challenge, which is an open challenge evaluation. Systems were trained with simulated room impulse responses, and the speech corpus is WSJCAM0, which is read speech with a British accent. We evaluated the trained models, without retraining, on the REVERB challenge test sets, which include a simulated version as well as a real recorded version; even the simulated version actually uses measured room impulse responses. The arrays are standard eight-microphone circular arrays, T60 values range from 0.25 to 0.7 seconds, speaker-to-array distances range from 0.5 to 2.5 meters, and the background noise is air-conditioning noise. We chose two baselines for comparison, both based on the WPE method; WPE stands for weighted prediction error. We compare with WPE alone and with WPE combined with beamforming.

Here are the dereverberation results on the REVERB challenge data set. Multiple metrics are reported on the simulated data, including cepstral distance and log-likelihood ratio; for these two metrics, smaller numbers represent better performance. There is also frequency-weighted segmental SNR, which is commonly used for measuring dereverberation performance, as well as PESQ for speech quality evaluation. As you can see here, our approach does best on all the evaluation metrics, including, on the real recorded data, the SRMR metric; SRMR stands for speech-to-reverberation modulation energy ratio. If we have a single channel, there is no point in performing beamforming and post-filtering, so we take the first-stage output as the final result; with more than one microphone, we take the second-stage, or post-filter, output as the result. In all cases, compared to WPE and WPE combined with traditional or DNN-based beamforming, the improvements are substantial. For example, if you look at the PESQ numbers with eight microphones, the algorithm achieves 3.7, and the best result among the WPE versions, the last one, WPE combined with DNN-based beamforming, gives 3.2. An improvement of 0.5 in terms of PESQ is a very large amount.

The algorithm was further evaluated on ASR, with single-channel, two-channel, and eight-channel results. Again, this reflects a large amount of improvement: looking at eight channels, the word error rate is about 6%, compared to 12% unprocessed, and the best WPE version gives 8.5%. Our results are much better than the baselines; these results actually represented the best results on the REVERB challenge test sets.

Let me play some sounds to illustrate the results of speech dereverberation. Let's see here. This is a noisy reverberant mixture: "The plan will give shareholders one right for each share held on May 17." This is the result from the single-channel WPE algorithm: "The plan will give shareholders one right for each share held on May 17." I think the amount of reverberation is somewhat reduced, but frankly not by a large extent. Listen to the single-channel complex spectral mapping result: "The plan will give shareholders one right for each share held on May 17." This, I think, is a very clear improvement. With eight channels, the result is even better: "The plan will give shareholders one right for each share held on May 17."

Now let's come to the last part of my talk, which is the neural spectral-spatial filter. With a fixed array with stable spatial relations, the question we ask is: do we actually need a beamformer? The previous result was achieved with a beamformer in the middle, but is it necessary? In other words, can multi-channel complex spectral mapping itself fully utilize both spectral and spatial cues? The answer, we believe, is yes.
Because all the spatial features are already contained in the multi-channel complex spectrograms of the mixture signal, the input contains all the information that's needed. The complex representation in multi-channel complex spectral mapping naturally encodes inter-channel phase relations. And a very important property is that single-channel processing is, by definition, a special case. It boils down to letting deep learning figure out the most discriminative spectral and spatial features to perform a particular task. We call this style of processing neural spectral-spatial filtering, and that was the topic of a journal paper published last year.

This can be illustrated by the diagram here. We have a multi-channel recording as input, and we first perform STFT analysis, which gives us complex spectrograms; with P channels, we get P spectrograms. The simplest way of utilizing the multi-channel input is to concatenate them together, not doing anything else, just combining them in a straightforward manner. That becomes the input to a DNN, which is trained to produce whatever output you want; in this case, we want to produce the direct-path speech signal as the output, the estimate Ŝ_q. That's the result of the DNN processing. Because the results are in the complex domain, we can then apply the inverse STFT to produce the waveform output as the final outcome of multi-channel complex spectral mapping.

The overall approach is conceptually very simple: multiple channels are simply providing multiple inputs. It's also computationally efficient, because we're not doing separate single-channel processing at all; each channel just produces features, and with multiple channels, multiple features are combined. And there's no need for post-filtering. How does such a simple technique perform compared to more sophisticated combinations of beamforming and deep learning?

We have systematically evaluated several array configurations, from the simplest one, a two-channel linear array, to an eight-channel linear array and a seven-channel circular array, which resembles the array layout of the Amazon Echo. In these extensive evaluations, we compare not only with traditional beamforming techniques, but also with mask-based beamforming, as well as complex spectral mapping-based beamforming. For traditional beamforming, we provide the target direction to delay-and-sum and fixed beamformers, or use premixed target and interference signals for the covariance matrix calculations in the MVDR beamformer. And we also evaluate whether a post-filter should be added.

The DNN architecture we employ for performing complex spectral mapping is called DC-CRN, or densely-connected convolutional recurrent network. It has an encoder component and a decoder component that together form a U-shaped architecture. In between there is a bottleneck layer, which is a recurrent network, and from each encoder layer to the corresponding decoder layer there is a skip connection, implemented as what is called a densely-connected convolutional block. The multi-channel inputs come in and their features are concatenated together; at the end of the decoding pathway, we produce a real component and an imaginary component, which are combined and, through the inverse STFT, produce, as I said, the waveform output.
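To show how little machinery this involves, here is a minimal sketch of multi-channel complex spectral mapping. The callable `network` stands in for the DC-CRN, or any DNN trained for this mapping; the shapes and the use of SciPy here are my illustrative assumptions:

```python
import numpy as np
from scipy.signal import stft, istft

def mc_complex_spectral_mapping(x, network, fs=16000, nperseg=512):
    # x: (P, n_samples) multi-channel waveform.
    _, _, Y = stft(x, fs=fs, nperseg=nperseg)            # (P, F, T) complex
    # Concatenate real and imaginary spectrograms along the feature axis;
    # single-channel processing is just the P = 1 special case of this.
    features = np.concatenate([Y.real, Y.imag], axis=0)  # (2P, F, T)
    # One network maps the stacked features directly to the RI parts of the
    # direct-path target at the reference microphone.
    S_real, S_imag = network(features)                   # each (F, T)
    S_hat = S_real + 1j * S_imag
    # Inverse STFT produces the waveform output.
    _, s_hat = istft(S_hat, fs=fs, nperseg=nperseg)
    return s_hat
```

There is no per-channel enhancement, no beamformer, and no post-filter anywhere in this pipeline; the concatenation is the entire spatial front end.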
First, speech dereverberation. There are a lot of results; let me just focus on one part of them, one array, which is the linear array. The evaluation metrics are PESQ, for speech quality, and scale-invariant SNR. The unprocessed PESQ number is 2.24. The best beamforming-combined-with-deep-learning approach is CSM-based time-invariant MVDR, where the complex spectral mapping (CSM) is performed by the densely-connected CRN, with post-filtering added on top. Altogether, that is a very large amount of improvement, going from 2.24 to 3.57. And this result is very similar to that of the approach we are advocating, which simply performs multi-channel complex spectral mapping using the DC-CRN: 3.57 compared to 3.55. In terms of scale-invariant SNR, the results are also very comparable, except that multi-channel CSM produces slightly better results.

For speech enhancement, we used diffuse noise. If we look at the eight-channel linear array, it is a very similar story: the best combination of deep learning with beamforming produces results that are very comparable to simply doing multi-channel complex spectral mapping. And if you compare single-channel and multi-channel processing, the results are also very consistent. For a single channel, the STOI improvement of CSM compared to unprocessed is 22.4 percentage points; with two channels, the improvement increases from about 22 to 27 percentage points, and with eight channels it is further improved. In other words, adding microphones produces the expected gains over single-channel enhancement on the same array layout, as you would expect, because the input features now carry more spatial information.

As an interim summary: post-filtering substantially elevates beamforming results based on both time-frequency masking and CSM. In terms of neural network architectures, DC-CRN clearly outperforms bidirectional LSTM. And spatial filtering using multi-channel CSM is competitive with the best combination of beamforming, deep learning, and post-filtering.

Here is another sound demo before I conclude. This is eight-channel speech enhancement with diffuse noise added at minus 5 dB. This is the unprocessed signal here; it is probably hard to make out the speech signal itself. Let's listen to mask-based beamforming, using DC-CRN as the underlying neural network architecture for time-frequency masking. This is the result: "A Monsanto spokesman said there is very little we can say." There is a reduction of background noise as well as reverberation. Now listen to the output of the multi-channel complex spectral mapping algorithm: "A Monsanto spokesman said there is very little we can say." It is actually very hard to distinguish between this output and the clean signal itself: "A Monsanto spokesman said there is very little we can say."

All right. To summarize, I have presented a spectral-spatial filtering approach for multi-channel speech enhancement, as well as speech dereverberation and speaker separation; speaker separation is a topic I did not have time to cover, but it was actually studied in the papers I mentioned. Multi-channel complex spectral mapping is a conceptually simple, computationally efficient, and effective approach.
One characteristic I particularly like is that multi-channel CSM reduces to single-channel, monaural CSM if the input is only a single microphone. In other words, this approach treats multi-channel and single-channel processing in truly the exact same way, and this resembles the characteristics of human audition. At a high level, the approach reflects the strategy that we want deep learning to discover the most discriminative spectral-spatial features to perform speech processing tasks. We're not hand-coding anything; we're not inserting beamforming or other spatial filtering techniques in the middle. We just let supervised learning discover the features that are most effective for the task. And with that, thank you for your attention.