So, welcome everybody. I'll be talking about neural net speech enhancement today, as you probably know. Just to give you a little bit of an outline of what we're going to do, here's the next three hours of your life. We're going to start by talking a little bit about background: how we do denoising and how we can get started with all those things. We'll move on to how we get from traditional DSP methods to neural nets, and how there's a continuity between those two things. We'll get into cost functions once we set up the rudimentary version of this process. Then we'll talk more about architectures. There's a whole bunch of ways that you can do enhancement with neural nets, so we're going to go through a bunch of the popular ones and see why certain things happen the way they happen. We'll talk about efficiency: how can we make those things efficient so they can run on small machines, so they can run at a large scale, and all those things. We'll talk about data efficiency: how do we deal with the case where we don't have a lot of data, or only constrained types of data? And then finally, if we have time for it, we'll also talk about non-negative autoencoders, which are a different family of models that can allow us to do much more sophisticated processing than we could otherwise. And then briefly, we'll try to wrap up everything at the end. A first request is that I would love it if you ask a lot of questions. There's no such thing as a bad question or a stupid question, so please don't be shy. Feel free to interrupt me. If something doesn't make sense, do let me know. I've been doing these things for a while, so I take a lot of things for granted, and I think it would be great if you jump in and start a discussion. We have plenty of time, so that's not going to be much of an issue. So let's not make this a one-way lecture. The more interactive it is, the more information exchange we'll have. So let's start with part one, background. What we're going to do in this part is talk just a little bit about the traditional signal processing ways of doing things. We'll talk about basic frequency domain methods for denoising, about how we could perform this type of denoising in the frequency domain, and then we'll slowly try to transform that into a neural net, so we can see how we have a one-to-one correspondence between things that we've done forever and things which are more modern now. So, a quick recap of regular denoising. There's a very long history of denoising in signal processing, of course. Historically, one way to approach this problem was with simple filtering. For example, if you have a lot of hiss in your recording, you apply a low-pass filter; if there's a lot of rumble, you apply a high-pass filter; if there's a humming sound, you can design a band-reject filter that focuses on that particular band. But of course, when you have much more complicated noise patterns, you can't really make filters by hand, and we have to resort to slightly different methods. None of that processing usually happens in the time domain, just because it's very complicated to see what's going on there. So instead we're going to move to the frequency domain, and we're going to be seeing a lot of spectrogram plots like this one.
The color map is: the darker something is, the more energy you have, and then of course we have frequency on the vertical axis and time on the horizontal axis. So this is an example of a particularly noisy speech recording. Let me play this sound so that you hear it. And I just want to verify, you guys get my computer audio, right? Yes. Yes. All right. So what's nice with this representation, of course, is that you can see the noise in this time-frequency domain. We can see that there's a lot of hiss at the top end. We can see that we have this very prominent hum throughout the signal. And then we see all these wavy lines, which are of course the speech signal, which is really deeply buried inside the noise. Is that a question or an unmuted mic? Okay. So, removing this type of noise is not something that's trivial, of course. It's going to require a lot of manual work. We can definitely make a whole bunch of filters that address all of those frequency bands. Or, if you're really patient, you can go into the time-frequency domain and turn off all the time-frequency bins which the noise dominates. But of course, I don't think anybody wants to do that by hand. So the standard way of trying to remove noise in those situations would be spectral subtraction. It's a very, very well-known algorithm, a very old algorithm; basically every device that you're using has some version of it running for speech communications. The basic idea behind it is that we want to find something like the average spectrum of the noise that we have in our signal, and then what we want to do is effectively subtract the magnitudes of that spectrum from every spectrum in our spectrogram. That's the basic idea. So in this particular case, what we would do is say, for example: well, we can see that at the beginning of the sound here there's no speech. So if I were to take the average magnitude spectrum of that section of the sound, what I would get is effectively the spectrum of the noise, because there's no speech in there. And then the idea is that since the spectrum of that noise doesn't really change throughout time, we can take that average spectrum from that section and subtract it from every column that we have in the spectrogram, and that should suppress the noise. So that's the basic idea. It's a very simple concept. Formally, here's what it looks like. Your input is going to be x. So now we're looking at a spectrum at time t at a frequency omega. We take the magnitude of that point. We subtract from it some amount of the noise spectrum that we've estimated; n of omega is the spectrum of the noise. And then we add in the phase of the original signal to construct a complex number, and that complex number is going to be the output time-frequency representation for time t and frequency omega. So the only thing that we're doing is removing from the magnitude of our input the magnitude of the spectrum of the noise. How much we remove depends on the parameter alpha: we can remove a little bit, or we can remove a lot. And of course, there's one extra complication, in that if we remove too much of the noise, then we're going to end up getting a negative value here. And as far as magnitude goes, if you have a negative value, you basically flip the sign of the signal and it still ends up being loud.
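Written out (this is just my LaTeX rendering of the update described above, with $n(\omega)$ the estimated noise magnitude spectrum and $\angle x_t(\omega)$ the input phase):

```latex
y_t(\omega) = \big( |x_t(\omega)| - \alpha \, n(\omega) \big) \, e^{\,j \angle x_t(\omega)}
```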
So if any number is less than zero, we effectively clip it to zero, so we don't get any energy there. By having that alpha parameter and tuning it, we can remove a lot of frequencies, potentially hurting our original signal, or we can do something that's very subtle. So what does this look like? Well, here's what happens when we do this to the signal we have. Let me play the noisy version again to remind you. And then here's what happens when you apply this algorithm to it. We can clearly hear some of the artifacts of all the filtering that happens. But as you can see, all those horizontal lines that were part of the humming noise are gone. The high frequency components are mostly gone. And we're left with these wavy patterns, which are effectively the speech. So it can be a fairly powerful method if used correctly. Now, I've introduced it in a very hacky way. This is not necessarily the way that you want to think about it in DSP terms, because it involves a lot of operations which are a little strange. You can also express it as a linear filter. The formulation behind that is that the spectrum that you're observing at time t and at frequency omega is going to be approximately equal to the spectrum of the speech sound that you have plus the noise spectrum that you're observing. So you're basically going to see those two things mixed. And then we want to design a filter, let's call it g_t of omega, such that when I apply that filter on the magnitude spectrum of the sound that's coming in, I get an output which cleans up the sound as much as possible. Now, the reason I set it up that way is that historically in signal processing this is how we like to think about filters: we're applying an element-wise multiplication in the frequency domain, which is a convolution in the time domain. Now, that filter g_t is going to be a function of the noise spectrum that we've extracted and the input signal that's coming in, so we're expanding it here like that. And we can use different types of gain functions, and those give you different types of denoisers. The most basic one, the magnitude subtraction one, is this one, where we define g_t of omega to be equal to one minus the spectrum of the noise divided by the magnitude spectrum of the sound at that particular time. So if I were to take my input and multiply it with this particular filter, as I'm doing in this part of the equation, this x would go over here and cancel out with the denominator at this point, and what we're left with inside is the magnitude minus n of omega, which is effectively the subtraction. So that's a way of seeing this process as a plain filter. Basically, the only thing we did is apply a linear filter. We deviated a little from that by doing this clipping to zero, of course, but even if you don't clip, it can work well if you're careful with your alpha parameter. And like I said before, this is the backbone of all speech denoising. There is no phone or device that doesn't use some version of this process today. But the problem is that this method has a lot of limitations. The filter itself is a constant filter. It doesn't necessarily react to the input very well. If the noise has a lot of changing characteristics, it wouldn't be able to deal with that.
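To make the recipe concrete before we pick at it, here's a minimal numpy sketch of the whole procedure. It assumes the first few frames are noise-only, and it glosses over window normalization in the overlap-add; the function and parameter names are mine, just for illustration.

```python
import numpy as np

def spectral_subtraction(x, noise_frames=10, alpha=2.0, n_fft=1024, hop=256):
    """Magnitude spectral subtraction, as described above.

    Assumes the first `noise_frames` STFT frames contain noise only;
    `alpha` is the over-subtraction factor.
    """
    # Frame the signal and take the STFT (a simple loop version)
    window = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * window
              for i in range(0, len(x) - n_fft, hop)]
    X = np.stack([np.fft.rfft(f) for f in frames], axis=1)  # freq x time

    # Estimate the noise magnitude spectrum from the noise-only frames
    n = np.abs(X[:, :noise_frames]).mean(axis=1, keepdims=True)

    # Subtract, clip negatives to zero, and reattach the noisy phase
    mag = np.maximum(np.abs(X) - alpha * n, 0.0)
    Y = mag * np.exp(1j * np.angle(X))

    # Overlap-add resynthesis (normalization omitted for brevity)
    y = np.zeros(len(x))
    for t in range(Y.shape[1]):
        y[t * hop : t * hop + n_fft] += np.fft.irfft(Y[:, t], n_fft) * window
    return y
```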
Beyond that, it's not a filter that can be smart and jump in and out or adjust its behavior the way that we want it to. Just to give you one example of something that could go wrong in a situation like this, here's an example where we have non-stationary noise. As you can see in the left recording, we have this sound that keeps going up and down. That's a siren sound. And then we have a couple of sentences happening on top of that. Now, as I'm learning the average spectrum of the noise, as shown on the left part of the left figure, that's going to tell me that the noise has a lot of energy between, what is it, 500 hertz and 1500 hertz. At any point in time, the noise only has a very narrow frequency band that's active. But because we're averaging to try to get a better estimate, we're going to get something that tells us that this entire set of bands is going to be full of noise. So when I take that average spectrum of the noise and subtract it from my data, what's going to happen is that this entire frequency range will seemingly be dominated by noise. If that gets removed from the signal, we're going to end up removing a lot of speech in that section, which is not going to sound very good. So just to play the sounds, here are the inputs. The shelves were bare of both jam or crackers. A joy to every child is the swan boat. And then here's what happens when we do the spectral subtraction. The shelves were bare of both jam or crackers. A joy to every child is the swan boat. So it's really good at suppressing the noise, but if you're listening carefully, and hopefully the audio compression wasn't too bad, you'll notice that we're missing an entire frequency band from this recording, which gives it this weird lack of resonance in the signal. So that's going to be our starting point to get into neural networks. What we want to do is come up with a denoising process that will be able to deal with situations like this, where you have very dynamically changing types of noise, where we have to make a lot of decisions in the process. And the way that we're going to do that is by following a fairly common story of the last few years: we're going to take this linear process that we defined, or almost linear process, and slowly turn it into a neural network, by making it nonlinear, by adding depth to it, and by adding a lot of elements that will make it look more like a neural network and a little less like a regular filter. Historically, yep. Can you hear me? Yes, I can hear you. Okay, so I raised my hand, but then now the slide is over. So regarding the previous slide. Yep. Just a technical question: when you remove the band, would you just take all of the frequencies and remove all of them, or somehow detect the pattern and remove that? No, so what's going to happen is that in those frames that we have at the beginning, you will effectively get the power spectrum of that region, right? And that's going to look something like this. It's not going to have a lot of energy at the top, maybe a little bit here, but then you get a big bump in this region, right? And that's because you have all this energy moving up and down. So that, averaged, will effectively give you a spectrum that looks like the vertical graph that I put here.
Now, if I were to take a spectrum like this and subtract it from every column that I have in my spectrogram, this frequency band will have something bigger subtracted, because it's getting this section subtracted from it. All of the other frequency bands up here will have a smaller spectrum subtracted from them, so they don't get influenced as much. So what you're seeing happening in the right plot is that all of this section over here got this part of the input subtracted, so it's mostly obliterated, except for the very loudest parts that happen to be above the threshold of however much I decided to remove. And that's going to depend on that alpha parameter I mentioned. So we're not necessarily removing all those frequencies; we're subtracting the spectrum of the noise, which has the effect of removing most of them. Does that make sense? Yeah, so we don't just take everything as one big block, we try to get the right values in this part. Yeah. Okay, great. Thank you. Any other questions? Right. But yes, please keep asking questions; it's sometimes hard to see the raised hands, so just, you know, feel free to jump in and say something. All right. So that's of course a very common thing that happens a lot: taking a linear process and making it a neural network. And historically, it's generally given us pretty good results. If you go to a lot of conferences nowadays, you know, ICASSP or Interspeech, the ones that deal with audio signal processing, you will see a lot of papers that work like that. Sometimes people don't give you that train of thought, but it's something that will help you understand things better. So even if you don't see it in the paper, it's a good thing to try to figure out. The reason we want to turn those processes into neural nets is that we're going to get all the benefits that come with them. For example, neural nets are nonlinear; they'll be able to do things that a simple linear filter wouldn't be able to do. Because we have depth, we can have a much, much bigger parameter count, which means we can learn to do things which are more complicated. They're much more flexible, because you can come up with architectures that reuse weights or have certain structures that correspond to your data. So it makes things like that a little easier to do. In order to get there, what we'll have to do is change our notation a little bit and move to a formulation that's a bit more amenable to thinking of things as a neural network. The original model that we had was that we have our input in the frequency domain, we're applying a filter to it, and that gives us our output. Nothing special here; that's straightforward DSP. We're going to rewrite that using matrix notation, just to harmonize it with what people do in the neural net world. And the way we're going to do that is to say that this particular frame is going to be the spectrum at time t. We're going to take all of the frequencies at the same time, and this is going to form a vector. So this is going to be our vector x of t, with all of the frequencies omega spread out as a column vector.
And then to apply this particular operation in matrix notation, what we effectively do is take that vector and multiply it with a diagonal matrix whose diagonal elements contain the omega elements of the gain filter that we talked about. And if you multiply these two things together, a diagonal matrix times a vector, we're effectively doing this element-wise multiplication, and that will give us a vector with all of the frequency elements of the spectrum of the output. And even more compactly, we can just write it as the last equation here: the vector x of t, which contains all of the frequencies at time t, multiplied by the diagonal matrix G, gives us the vector y, which again contains the spectrum of the output sound. Now, the reason I want to write it that way is that we can use it to generalize and make this a neural network. And here's how we're going to do that. The original model is at the top: the input spectrum comes in one frame at a time, we multiply it with this diagonal matrix, we get the output spectrum. The first thing we're going to do is change that gain function. We're not going to have it change over time; it can be a constant now. And we're going to make it a more complete linear transformation. Instead of saying that G has to be a diagonal matrix, we're going to make it a full matrix, we'll call it W. And we're also going to add translation to it by adding a bias vector b at the end. So now we have a transform that can do all kinds of transformations, not just scale the individual elements, and it can also translate our data by using this bias term. So we're jumping from having a number of parameters which is as many as the input size to having a squared number of parameters, or squared plus the input size. To move more towards a neural network architecture, what we can also do is add a nonlinearity. When you look at a neural network node, it's effectively doing a linear transformation, and then we apply some kind of a saturating function; it could be a hyperbolic tangent, a logistic, a ReLU, whatever. So we're going to do that as well. Our filtering process now becomes the linear transform that we defined in the previous equation, but on top of it there's an element-wise nonlinear function that gives us the nonlinearity that we want. And then finally, what we're going to do is stack multiple of those layers on top of each other, and that will give us a deep network, or a deep system, where now we can say: well, at the first level, level zero, our representation, which we're going to call h of t, is going to be equal to the input. And then we're going to take that and figure out the next level of transformation. So we do a linear transformation, we put it through a nonlinearity, and that gives us the next level of latent representation, h of i plus one. And then we put that into yet another layer and yet another layer; we stack a bunch of those operations together. And whenever we've had enough, we take that output and say: okay, the final h that we have is going to be what we define as our output. So this should be a fairly clean translation of the linear model that we started with, going all the way down to becoming basically a pretty standard, very simple dense-layer neural network. So, any questions so far on how we made this translation?
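In code, that whole progression is only a few lines. Here's a minimal numpy sketch; the tanh and the function names are my own illustrative choices, not anything canonical.

```python
import numpy as np

def gain_filter(x_t, g):
    # Original model: element-wise gain, i.e. diag(g) @ x_t
    return g * x_t

def dense_denoiser(x_t, weights, biases, f=np.tanh):
    # Generalization: full matrices W_i, bias vectors b_i, and an
    # element-wise nonlinearity f, stacked into layers.
    h = x_t                     # h^(0) is the input spectrum
    for W, b in zip(weights, biases):
        h = f(W @ h + b)        # h^(i+1) = f(W_i h^(i) + b_i)
    return h                    # the last h is the output spectrum
```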
And by the way, that's something you can do with any linear system you want. It's very easy to do. All right. So, the model we just defined is called the denoising autoencoder. Basically, it's a network to which you give a noisy input, and its job is to produce some kind of a clean signal at its output. Conceptually, the way it's going to look is that you're going to get something like a noisy spectrogram, you're going to put that into your denoising autoencoder, which would be some kind of a neural network, and its job is to produce a clean output. Now, the exact structure that we're going to use is informed by the way we did the denoising before. So again, to show the same thing in more detail: we have our input waveform coming in. We do a short-time Fourier transform, which gives us a spectrogram, and then we're going to split that into two parts. We're going to have the magnitude part, so we're extracting the magnitude of that representation, and that's the plot you're seeing here on the top left. The magnitude is going to go through a sequence of nonlinear transformations, which are going to be the different layers of our network, and at the end they're going to give us something that we're going to call the new magnitude. And hopefully that's going to be the magnitude of the denoised sound. At the same time, we're also extracting the phase of that spectrogram representation, and that one just propagates all the way to the end. So we take the magnitude, we process it, we get something that should be the magnitude of the denoised sound, which is what you're seeing in the right figure. And then we have to combine that new magnitude that we came up with with the phase of the original signal. The idea is that if the output of the last layer did a good job at giving us a clean magnitude spectrum, all the parts where the noise was dominant will be suppressed, so we're not going to get a lot of noise. What the phase is at those parts, we don't really care. So by multiplying this magnitude with the original phase, we're basically just suppressing the parts that we don't want. We put that into an inverse short-time Fourier transform, and that gives us our output, which is going to be our denoised sound. So it's a fairly simple pipeline. If instead of this sequence of nonlinear transformations we just did the subtraction of the noise spectrum, we would effectively have the spectral subtraction algorithm. That's all there is to it. It's not much more complicated than that. Now, there's of course a difference in how we learn the parameters for all of those networks that we have here. What we did before is that we had a section of which we said: okay, this is only noise. And we used that to figure out the noise spectrum. Here, we can't really do that, because we're not subtracting a spectrum; we just have this very abstract notion of parameters that don't necessarily correspond to something physical in the signal. So we're going to do this with a slightly different process, and of course, it's going to be the typical neural network training process. We're going to do the following. We're going to have to get matching clean and noisy data. Now, this is impossible to get in real life. There's no way you can get data like that, because it's very difficult to get the exact clean signal that was inside a noisy recording.
So instead, what we're going to do is take a whole bunch of clean speech recordings, a big dataset, your LibriSpeech or Wall Street Journal or whatever, and then get a big collection of noise sounds as well. And what you can do is make artificial mixtures of those. You can take one random speech recording and one random noise recording, add them together at some specified signal-to-noise ratio, and then say: this is going to be a noisy recording; this is what's going to become the input to my network. And you can make as many of those as you want, as long as we have enough data. Now, once you do that, you know that the ground truth for that particular mixture, for the denoised sound, is just the original speech recording that you used, right? You already know what the clean sound sounds like, and you have a mixture that corresponds to it. So that means that you also know what the target is. You're using the speech to make a mixture, but you're also using that speech as the clean target. So now we have pairs of noisy inputs and clean outputs. What you do is use them as a dataset, and then you train your neural network to figure out how to map the magnitude spectrum of a noisy recording to the corresponding magnitude spectrum of the clean part of that recording. How we do the training is pretty generic at this point: we just use any of those millions of gradient descent variants, do your usual batch training, and after a while, because it takes a while to train those things, you actually get something that can be reasonable. Now, how well does this thing work? Well, it can denoise fairly well in a lot of challenging cases. Here's an example with a noisy input and what the network gave us as an output. This is just a single-hidden-layer network with 1024 nodes, so it's not a big network; it's actually a very small and simple one. But just so you get a sense of how this sounds, here it is. So here's the input sound. She had your dark suit in greasy wash water all year. And then this is what comes out. She had your dark suit in greasy wash water all year. Right, so it's not perfect. We still hear a few artifacts, and we'll be working on those later on, but even for such a simple thing, it actually does a pretty good job. And you can see, and this is maybe not the best example of it, that you don't really have stationary noise here, because the engine keeps revving. It has no trouble dealing with that, which is nice. All right. Okay, may I ask one doubt? Of course. When you feed the network, do you do any kind of normalization, or have you found it useful to normalize the spectrogram before feeding it? At this point, we're not doing anything. We're going to get to that later on. But for now, we're just taking the spectrum as it comes out of, you know, your favorite STFT function. Okay. I have a question. Yes. The question is: how speaker-independent is the output, and how well does it work in a multi-talker noisy environment? We will get into that later on. Right now, the model that I have is actually very crude; it's only doing very, very simple processing in the spectrogram space. This model can surpass the simple denoising technologies you would find today, but it's not that much better. And mind you, at this point we're not doing any post-processing or any pre-processing. So this particular approach is not going to do much there, but we're going to get to more complicated models later on, and those will be able to address a lot of the multi-talker situations and reverberant rooms and all those things.
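Just to make that mixing step concrete, here's a minimal sketch of how one training pair could be generated; the function name and the scaling convention are my own illustrative choices.

```python
import numpy as np

def make_mixture(speech, noise, snr_db):
    """Mix a clean speech clip with a noise clip at a target SNR (in dB).

    Returns (noisy_input, clean_target): the mixture is the network's
    input, and the original clean speech is the training target.
    """
    noise = noise[:len(speech)]              # crude length matching
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    # Scale the noise so 10*log10(speech_power / noise_power) == snr_db
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise, speech
```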
Can I just confirm something? Once we denoise the spectrum, do we use the noisy phase, or do we also do something about the phase in this case? We use the noisy phase. Okay. And the idea behind it is that if you have, for example, I don't know, this frequency bin here, which is just noise, as soon as you set the magnitude to zero, the phase of that particular bin doesn't really matter. Now, you will have situations where some of the noise comes in at a frequency bin that also has voice, and that's going to give you some phase artifacts. But for the most part, they're not that noticeable or not that extreme, unless you have a very strong correlation between your noise and the voice signal. When you do dereverberation, for example, that becomes a little bit more of an issue. But again, that's because we're still using this representation. Later on, we're going to end up with a representation that doesn't really use phase, so things are going to be a little different. But for now, we are using the noisy phase. Okay. Thank you. All right, let's hear that example again. So on paper, here are the advantages that we get. We can deal with non-stationary noise fairly well; we're going to see more examples of that later on. The filter should be able to react to the input, so it's not just a passive filter that does the same thing over and over. And that's because we have more parameters and we can do more sophisticated stuff. We can specialize on specific types of data. For example, we can train our network to work best with our voice in a specific environment, as opposed to, you know, everybody's voice, and that's going to make it perform better in that case. And of course, the other big advantage is that there's no estimation that has to happen during deployment. With a standard spectral subtraction denoiser, you have to periodically sample the noise and try to figure out what it looks like as it evolves over time. In this case, all of the heavy computation and estimation happens in the training process, and during inference time, you just have to do a forward pass over the network. There's no estimation that happens at any point in time, and that of course makes implementations a lot simpler. So to finish off part one: we talked about how we can generalize denoisers into neural nets. Again, we get all the advantages of neural nets by doing something like that. But what I'd like you to do is always look back at that model, try to figure out what it is that we're doing, and how it would simplify into traditional DSP. Those things are important because they can help us understand what the trade-offs are, and a lot of things like that. So before we move on to the second part, do we have any questions? All right. So, moving on: part two. Now we're going to keep going with that process and take this idea a little further, beyond just replacing a filter with a neural network. We're going to do a lot more than that. In general, what you'll notice in this talk, and what you probably figured out from the previous lectures as well, is that if you have any kind of a DSP process, there's always some equivalent in the neural network world, and those two tend to be interchangeable.
In the DSP world, they end up being linear and well-understood processes. In the neural network world, they're nonlinear, but they still have the same underlying functionality. So we're going to do the same thing now, but this time we're going to look at the entire processing pipeline. And the way we're going to do that is by looking at the standard speech enhancement setup, seeing what it contains, and then seeing how we can reimagine it as just one simple neural net. The usual processing pipeline that we have in audio enhancement is that you have a signal that you don't like. Most of the time, you put it through some kind of a filter bank, for example the short-time Fourier transform or the MDCT; that includes a sub-sampling process. Then you do some kind of processing in that space. In the case of spectral subtraction, we're subtracting the noise spectrum. And then you invert that time-frequency representation, you go back to the time domain, and that gives you the signal that you want. This is pretty standard, whether you're trying to do dereverberation or denoising or source separation or a whole bunch of other tasks. Now what we're going to do is take that and change it into a neural network. And the way we're going to do that is by reinterpreting all of those parts. What we're going to say is: well, you know what? What we call a filter bank in the signal processing world is actually a convolutional neural network. In the filter bank, you're convolving with a bunch of filters. In a convolutional network, you're again convolving with a bunch of filters. The sub-sampling process that happens with a lot of those filter banks, that's something we can do with pooling or striding in the neural network world. Again, same thing, different terminology. The processing that we do in the middle, we can usually replace with some kind of a neural network. For example, the subtraction we did in spectral subtraction, we just replace with a bunch of simple layers. And then again, the up-sampling we can do by undoing the max pooling or striding, and we can have an inverse filter bank at the end. So each one of those steps effectively translates into something that we know under a neural net term. And so we can take an entire signal processing pipeline, translate all of those steps one by one, and go from having this DSP system to having just one big neural network that effectively does the same thing. So in the case we just examined, here's what happens. We have the full denoising process, which consists of the following. There's going to be a front end, and the front end is going to be using the short-time Fourier transform to get to the time-frequency domain. From that we extract the magnitude and the phase. This is what we do at the beginning; it's kind of like a fixed process. Then we have what we're going to call the processing step, which is the part where we apply the filter on the magnitudes. This is the denoising itself. And once we do that, we have to re-synthesize, to go back to the time domain. So we take the new denoised magnitude, we take the original phase of the signal, and we use an inverse short-time Fourier transform to get the output sound. So what happened in this case is that the only thing we changed is the processing. That's the only thing that became a neural network so far.
Now what we're going to do is actually look at the front end and the re-synthesis and try to make those be neural networks as well. And we'll see how this gives us a few more advantages. So here's what happens. First, we're going to start with the front end. What's happening in the front end is that we take a frame of sound from the input that's coming in. So we take L samples from the input waveform, and we multiply those with a tapering window, like a Hann window or whatever you want to use; just standard spectral analysis. That gives us a windowed frame of signal. And then we take the discrete Fourier transform of that, and that gives us the spectrum of our signal at a particular time. That's a single column in the spectrograms that we've used so far. We're going to do that for multiple offsets of our input, and that gives us our transformation: we get a whole collection of c's that will effectively be all the columns of the spectrogram. Now, again, we're going to rewrite this using linear algebra notation, because that's going to allow us to think of it as a neural net. We can rewrite this thing as what's called a sliding transform, and the way it's going to look is as follows. We're going to construct a matrix T, and what that matrix will have in every column is a small snippet of the input sound. So, for example, the first column of matrix T is going to be our input samples from the 0th sample to sample L minus 1, so we have L samples in that particular column. Then we're going to move by H samples; that's going to be our hop size. And we take our next snippet of L samples from the input sound, which would be the next part of the sound. And then we do that again and again and again, and we take all those little snippets and organize them into one big matrix that will have L-sized parts of the original input in time order. Now, if I take that matrix and multiply it with a diagonal matrix whose diagonal elements contain the window that we have here, what that will effectively do is multiply every element of each column of our matrix with its corresponding element from the window. So by multiplying the diagonal window-function matrix with the matrix that contains our signal chopped up into smaller frames, we're effectively applying the window. And then if we multiply the result of that with the Fourier matrix, which is a matrix that contains all the Fourier bases in its rows, that will effectively implement a discrete Fourier transform. So what you're seeing in the bottom equation is this big matrix C, a matrix whose columns will effectively be those DFT columns that we have here. We're doing exactly the same operations, but on the bottom we're using linear algebra, and on the top we're using more standard DSP notation. The only thing we're doing is packing our sound into a matrix and then applying a single linear transformation that contains the window and the Fourier transform together, and that gives us the spectrogram of our signal. Now, pictorially, this looks as follows. This is what a segmented input would look like. Every column of this matrix contains a snippet of the waveform, and neighboring columns basically have the same waveform shifted by a few samples.
And that's why you see this sort of smeared structure when you look at it: every column is similar to the column next to it, but with a little bit of an offset. So this is basically a time waveform cut into individual frames. This is what the real part of a Fourier matrix would look like; you can see they're all sines of different frequencies. So if you were to multiply this matrix with that matrix, you would effectively take every basis that you have here and apply it on every column that you have, which is every snippet of the sound. If you do all of those things together, and here I'm plotting the magnitude of the output, you effectively get a spectrogram. And here you can see that we have something like three different notes and then a chord. So it's just a different way of thinking about applying DSP, using linear algebra formulations. A different way we can think of this is to think of sliding transforms as sub-sampled convolutions. This is the formulation we used so far: we said that we have our input, which is cut into small windows; we multiply that with a matrix that applies the window and the Fourier transform, and that gives us a matrix containing the time-frequency representation. If you look at one column at a time, we're just saying: take one column of the big T matrix at a particular time, multiply it with that transformation matrix, which again applies the window and the Fourier transform, and that gives us the spectrum of our sound at that particular time. And then if we expand this expression from vector notation to scalar notation, it looks like this. What we're effectively doing here is saying: let's take a snippet of our input and convolve it with one of the bases that we have in this W matrix, which would be one of the Fourier bases, and then assign that to the output signal. So effectively, we're doing a convolution where the filters are our Fourier filters, and that gives us our output. Right? That's the very definition of a filter bank. There's one small complication here. You might notice that I have a plus instead of a minus. That's because, for some reason that I don't understand, in the neural network world, when people do convolutions in convolutional neural networks, they actually do correlations; that's because they started with a lot of detection work. So wherever you'd expect a minus sign, there's a plus, bearing in mind that I'm trying to keep the notation from the neural network world. So this linear product that we started with is effectively a set of convolutions. Now, if you also want to do the sub-sampling, we can get that by doing what are called strided convolutions, where we only keep every few outputs. You can also explicitly have a decimation or sub-sampling step after that, or you can do some kind of pooling operation, which is very common in the neural network world. But the basic idea that I want you to remember is that we can either think of the front end as being just a simple linear transformation, which we can implement with a simple dense layer in a neural network, or we can think of it as being a set of convolutions, which we can implement using a convolutional layer.
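Here's a minimal numpy sketch of that equivalence: the same windowed DFT front end, written once as a matrix product over framed input and once as a strided (correlation-style) filter bank. I'm only using the real part of the Fourier bases to keep it short; all names are my own.

```python
import numpy as np

L, H = 512, 128                          # frame length and hop size
x = np.random.randn(16000)               # some input waveform
window = np.hanning(L)

# Real-part Fourier bases; each row is one filter of the filter bank
k = np.arange(L)
F = np.cos(2 * np.pi * np.outer(np.arange(L // 2 + 1), k) / L)

# Version 1: pack frames into the matrix T, then one matrix product.
# (F * window) is F @ diag(window), folding the window into the transform.
T = np.stack([x[i:i + L] for i in range(0, len(x) - L, H)], axis=1)
C1 = (F * window) @ T

# Version 2: the same thing as a strided filter bank; note that, as in
# the neural net convention, np.correlate correlates rather than convolves.
C2 = np.stack([np.correlate(x, F[b] * window)[::H][:T.shape[1]]
               for b in range(F.shape[0])])

assert np.allclose(C1, C2)
```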
So these two things are effectively doing the same thing, and it all depends on how you repackage your data. In one case, we use the raw waveform as is; in the other case, we had to repackage the raw waveform in a format that allows us to do all the convolutions using a matrix multiplication. So what that tells us is that what we have as a front end, the short-time Fourier transform, can be a neural network layer: either a convolutional layer, or a simple dense layer if we arrange our input like this. Now, let's look at the re-synthesis as well. That's going to be a little different. The inverse transformation is a filter bank as well, but now we're using overlap-add. If we have a hop size which is less than our window size, that means we have windows that overlap with each other. When we do the inverse short-time Fourier transform, we have to overlap-add those frames to make sure we get a proper reconstruction. You can't necessarily do that easily with a regular convolution, but it turns out there's something called the transposed convolution in the neural network world, which some people, wrongly, also call deconvolution; it's not a deconvolution. The basic idea behind it is that it's trying to invert, or be the transposed operation of, a regular convolution. And just to give you a sense of what it does, to see how it relates to the short-time Fourier transform, here's an example in the 2D space. We have an input that you can see in the blue matrix on the left, we have a filter, and then we have an output. When you apply the convolution, this 2-by-2 filter will first get overlapped with this 2-by-2 subset of our data; we multiply all of the corresponding elements and take the dot product of all of those things, and that gives you the first output value. Then you take this filter again, and now you apply it on those four elements; you take the dot product, and that gives you this element. Take it again, shift it by one sample to the right, multiply it with those four, and that gives you this element. And then do the same thing with the squares in the bottom two rows, and you get these three elements as well. So that's how the standard convolution works in a neural network. The corresponding transposed convolution effectively swaps the sizes of the input and the output. What's going to happen when you do a transposed convolution is you say: I still have the same filter, but I start from something which has the same size as the output of the convolution operation. Now I take the first element of my input, I scale my filter by that one element, and I add the result to this section of my output data. What that does is basically overlay the filter at this location. Then I scale my filter by the second element and move it to the appropriate location, which is going to be this one. So now these two samples get overlapped. And then we do the same thing for all the other parts. If this were a one-dimensional case, the input and the output here would only have a single row, and what would happen is that you would do your regular convolutions on the left side.
But when it's time to do the transposed convolution, you take whatever coefficient you have at that point, and scaling the filter with it gives you a time sequence which is much longer. Then you basically overlap-add all of those sequences in the output. And this is exactly what the inverse short-time Fourier transform does: it takes your sub-sampled time-frequency data and translates it into time waveforms that get overlapped with each other. So why are we doing all this? Yeah. I have a question. One slide before, please. Yes. Is C complex? Is it giving you the magnitude and phase, or just the magnitude? Right now it's a complex number. It's a complex number. Okay. Thanks. So why did we do all these math acrobatics? Well, both of the formulations that we used, whether it's a matrix multiplication or a convolutional layer, are neural net blocks. And what that means is that we can rewrite the entire denoising process as one big neural network. Again, what we had before was: we have an input, we do a short-time Fourier transform, we get a magnitude and a phase, then we have a sequence of neural network layers, which give us a new magnitude, we combine that with the original phase, we do an inverse short-time Fourier transform, and we get an output. But now we can take that process and translate it. We take our input, and the short-time Fourier transform we just replace with a bank of convolutional filters, because that's effectively what it's doing. We're going to make one small change: we're not going to take the magnitude and the phase. Instead, we take the output of that front end and put it through a series of nonlinear transformations, which is similar to what we did before. And then whatever comes out of it, we multiply with the original representation. We have a transposed convolution that is the equivalent of the inverse short-time Fourier transform, and that gives us the waveform. And by doing all of this, what happens is that we don't have any traditional DSP element left in this. We have one big neural network that receives raw audio coming in and gives us raw audio coming out, which makes things very, very convenient. Now, there are a couple of things that I slipped under the rug here, so let's look at them. There is a difference in the front end: we don't use complex values anymore. With the short-time Fourier transform, we had all these complex numbers coming out of it; we would take the magnitude, because we know it makes sense to operate on that, and then take the phase, because we don't care about its effect on denoising as much, and process things that way. We're not going to do that now. Now we just have a convolutional layer. The whole point is to train the basis functions to optimally do denoising, so we don't necessarily have to preload it with Fourier functions. So we're not going to extract magnitude and phase anymore. We just pass the signal as is, directly to a sequence of layers for processing, but we also keep that representation and pass it over toward the end. Now, is that okay? Well, it is okay. It actually works well.
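To make the shape of that network concrete, here's a minimal PyTorch-style sketch, assuming it's okay to use a sigmoid mask and 1-by-1 convolutions for the processing stack; the sizes, layer counts, and nonlinearity choices are my own illustrative assumptions, not a reference implementation.

```python
import torch
import torch.nn as nn

class EndToEndDenoiser(nn.Module):
    def __init__(self, n_bases=512, kernel=512, stride=128):
        super().__init__()
        # Learned analysis filter bank: the stand-in for the STFT
        self.encoder = nn.Conv1d(1, n_bases, kernel, stride=stride)
        # The processing stack; its output acts as a mask on the latents
        self.masker = nn.Sequential(
            nn.Conv1d(n_bases, n_bases, 1), nn.ReLU(),
            nn.Conv1d(n_bases, n_bases, 1), nn.Sigmoid(),
        )
        # Learned synthesis filter bank: the stand-in for the inverse
        # STFT, with overlap-add handled by the transposed convolution
        self.decoder = nn.ConvTranspose1d(n_bases, 1, kernel, stride=stride)

    def forward(self, x):                 # x: (batch, 1, samples)
        h = self.encoder(x)               # latent representation
        mask = self.masker(h)             # values in [0, 1]
        return self.decoder(h * mask)     # skip connection: mask gates h
```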
So why is it okay to drop the complex values? One reason is that when you operate with neural networks in the complex domain, the math is a little different. You have to be careful about a whole bunch of minima that don't exist in the real-valued space. And on a practical note, it's very difficult to do complex-valued logic with a lot of the deep learning frameworks that are out there. But the other reason is that we don't want to constrain things to a representation that we happen to like, right? The short-time Fourier transform is great, but it's not necessarily optimal for what we want to do. So we want to let the network figure things out on its own. Now, like I said before, the polar representation, with magnitude and phase, is actually very intuitive and very easy to work with. But if you really wanted something like that, or if something like that were necessary, you can imagine that all of these nonlinear layers that we have here would be able to extract a magnitude out of those operations, or something that could be combined over here to give us the same effect. So we're not necessarily losing that functionality; it's something that can get hidden inside the nonlinear processing that happens, in other words. We could also say that we're not doing an STFT, we're doing something like an MDCT, in which case it's kind of moot to talk about magnitudes and phases anyway. Either way, we don't have to use complex values. They are convenient when it comes to making plots and interpreting results, but they're not something that we need. And what that means on the re-synthesis side is that we're not modulating phases anymore. We're not taking a new denoised magnitude and combining it with the original phases. Instead, we use what we call a skip connection in the neural network, and we do what we call masking. We basically take what was the original representation of our data, and we do some kind of masking that suppresses some elements or appropriately scales others, and then that's what gets handed to our transposed convolution for the re-synthesis step. Now, these two are in some sense similar. Here, we're multiplying the new magnitudes with just the phases of the signal. And here we're saying that we have some kind of latent representation that acts as a gate on the original latent representation. So there is an equivalence between the two, but it's not a one-to-one correspondence. So now you might ask: well, yeah, but is masking something that's okay? Is it appropriate? Well, masking is actually a pretty powerful tool. Just to give you a little sense of what you can do with masking, and we're going to do this in the short-time Fourier transform space, but it doesn't have to be, here's an example of two people speaking at the same time. Now, if for some reason that were the representation our neural network was using, what masking would effectively do is look at every pixel in that spectrogram and ask: does this belong to source one, or does it belong to source two? And we could make a binary mask such that, when you do this operation here, you'd effectively mute a particular pixel or let it go through to the re-synthesis. Now again, we're not going to do this by hand. Of course, we're hoping that the neural network will do something like this.
But for now, let's consider the oracle case, which is the case where we know the answer and we're just checking whether this process works. So what I can do is artificially say: well, let me make a binary mask which is one wherever the spectrum of my first sound is louder than the spectrum of my second sound, and zero otherwise. So this is the case where one speaker is louder than the other speaker. It's an oracle case: I know exactly what the solution is, and I'm making the mask because I know it. And here's the mask where the second source is larger than the first source. Now, if I were to take these two functions and apply them on the spectrogram, I would effectively be doing this kind of thing, assuming the bases were spectral bases. And if you do that, here's what comes out. So here's the mixture again, two people speaking. I assume moisture will damage this ship's hull. Here's what happens when we multiply the first mask with the representation that we have. She had your dark suit in greasy wash water all year. And the same thing for the second mask. I assume moisture will damage this ship's hull. Now, we do get some phasing artifacts and things like that. But the basic idea is that just turning simple coefficients in your latent representation on and off will be able to give you a pretty good reconstruction of what you want. Now, there's going to be an extra element here, in that we're not going to be using Fourier bases anymore, right? We're going to be using bases that are learned. That means the network has the flexibility to learn the optimal domain in which to do this kind of masking. So what you're seeing here is, in some sense, kind of like a lower bound on the performance we should expect to get. So there's nothing wrong with masking, and it's actually very closely related to the phase modulation we did before. So now the question is: are all of these changes a good idea? Yeah. Well, the short answer is yes. One positive element here is that this really frees us from a lot of user-selected transformations. A lot of the time, when my job is to figure out what kind of algorithm to use for my processing, I would have to make a choice: do I want to use a short-time Fourier transform? How big do I want my windows to be? What's the hop size? Do I need to use wavelets? Do I need a constant-Q transform, an MDCT? In this case, it's not going to matter. We're just going to say: there's a bunch of filters, and the neural network will have to figure out the best way of dealing with them. So we don't have to impose any kind of bias on what the latent space will be like. It also helps us minimize a lot of the parameter selection. We only have to decide how many bases we have; we don't have to worry about what kind of window to use, or the hop size, and things like that. That all gets automatically sorted out in the process. And of course, when you do that, yes, the results do tend to be better. It is, of course, going to be a lot more computationally expensive. We won't be able to use efficient fast Fourier transforms and all these wonderful tools that make things fast. But we'll be able to learn transformations which are specifically tailored to the data and the tasks we're working on. So even though it takes a lot more processing, it's probably not that big of a deal.
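For reference, the oracle masks I just described are one line each; a small sketch in the STFT magnitude domain, with S1 and S2 standing for the two clean sources' spectrograms and X for the mixture (my names, just for illustration).

```python
import numpy as np

def ideal_binary_masks(S1, S2):
    """Oracle case: we know both clean sources' spectrograms."""
    M1 = (np.abs(S1) > np.abs(S2)).astype(float)  # 1 where source 1 dominates
    M2 = 1.0 - M1                                 # the complement
    return M1, M2

# Applying a mask to the mixture spectrogram X (and reusing X's phase
# for re-synthesis) mutes the time-frequency bins the other source owns:
#   S1_estimate = M1 * X
#   S2_estimate = M2 * X
```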
There's also a data processing advantage from an architecture standpoint, which is that when you think of things as a neural network, you're effectively making the processing pipeline very homogeneous. That means that all of the operations we do boil down to a matrix product with a nonlinearity on top of it. And that makes things a lot simpler. If any of you has had the joy, or lack thereof, of implementing things on a DSP chip, you always had to worry about: okay, what's the right way to implement the filter? What's the right way to implement the Fourier transform? And there are all these specialized hardware elements that help you do those things. Now you don't have to worry about that anymore, because the whole neural network pipeline is pretty much the same operation: it's just matrix multiplications over and over and over. So if you want to design hardware, you basically have to worry about just one thing, matrix products, and nothing else. So when it comes to deployment, and we'll be talking about that in a bit, it's going to make hardware design much, much easier, because we only need to fine-tune one type of computation. So to wrap up this section: the entire DSP pipeline that we had, we can reimagine as a neural network. So now we have what we call an end-to-end audio system. That means raw audio comes in and raw audio comes out. There's no notion of pre-processing or post-processing; those things have become adaptive layers. And the huge advantage of that is that we have a fully differentiable system. That means we can say: well, here's the input that I have, here's the output that I want; just optimize the entire processing pipeline. There's nothing left that we have to design ourselves. Everything can be adapted for the task. And of course, it simplifies the processing a lot, in both software and hardware. Right. So that's the end of the second part, and I'd like to know if we have any questions before we move on. Hello. Yeah. Hi. I just have one question. When we use the spectral subtraction, the DSP-based enhancement method, we got some artifacts, right? But when we use the denoising autoencoder, do we still get artifacts? Or is it that there are fewer with neural-network-based enhancement and more with signal-processing-based enhancement? So, the reason why we get artifacts in the DSP approach is because of an effect called musical noise. Let me see if I can get to my slides at that point. Let's see. There you go. So this is the first example that I showed where we're doing denoising. And what you'll notice is that there are certain parts of the noise that actually went through, because we didn't subtract enough of the noise spectrum from them. What that means is that at this point in time you basically get a single sinusoid; it turns on, it turns off. And that creates what's called musical noise. I don't know if we can hear it very well in this example. Let's see. So there's a little bit of a twinkling sound that you can hear, not so much in this case. But if you're not careful, that's a very typical artifact. Now, the reason why we get that artifact is because we have an operation that does something very specific, and that's a side effect of that operation.
Now, when we go back to the neural network methods, what happens in this case is that we have a pipeline that looks like this. And what we're doing is optimizing from beginning to end to get something that sounds good. And what that means is that the neural network has to figure out what's the right domain and what's the right operation in that domain such that we get exactly what the user wanted. So we're not going to get the same type of artifacts, because we're not introducing a process that creates those artifacts. What we're saying to the neural network is, look, here's the data. I want you to give me something that sounds as good as you can, and you have to figure out all the post-processing yourself such that it sounds like that. So the only artifacts you're going to get are going to be because you don't have proper training data, or because your architecture might not be powerful enough. But they're not going to be processing artifacts, not in the same way that you get in the DSP world. Okay. So following this, I have another question. For example, if you do not have enough data to train the neural-network-based enhancement method, and I train my model on one dataset and test it on another dataset, is that likely to produce more artifacts than signal-processing-based enhancement? It could potentially happen. One thing I'd like to note is that if you're dealing with, you know, ordinary situations with stationary noise — like, right now in my office there's some fan noise or AC noise or whatever — I would not use a neural network to remove noise in this environment. It would be overkill. If you have simple stationary noise, just use the DSP stuff. It works. It's been fine-tuned. It's not bad. For more complicated situations, yes, if there's a mismatch between your training data and your deployment data, obviously you're going to get not as good performance as if you had better matching. But also, the results you're getting at that point are still so much better than what you would get from spectral subtraction that it's kind of an uneven comparison. So we're going to talk a little bit about data efficiency later on. But again, it depends on the type of sounds that you have, the type of interference that you have, the type of process; maybe you want to do dereverberation, right? You know, I'm showing all of those things in the context of denoising, but I might as well say I have reverberant sound coming in and clean sound coming out, right? It would be the same process. So again, depending on what you're trying to do, what your data looks like, how big your network is, and how you regularize, you will have different types of artifacts, but they're not going to be as predictable as with standard DSP algorithms. And that is the price of playing with neural networks. You can't predict what's going to happen, right? You just try it, and if it works, it works and you publish a paper, but you still don't know why it works. Whereas with DSP, that's not the case. Yeah, that's true. Thank you so much for answering this. But we'll be touching on some of those things later on as well. So, good questions. Do we have any other questions? I have one question. In this application, with the initialization of the weights, if they are close to zero, can it still work, or do you have to apply some different technique to converge faster or make it more efficient?
I've almost never had to worry about weight initialization in those kinds of models. If you use the standard ways of initializing, they work just fine. So it's not a big deal. Later on, if we have time, we're going to talk about a different family of models where weight initialization actually matters a little bit, but for these models, it's not that big of a deal. The other thing I would like to note is that in a lot of cases, you can start with something that you know works well. So for example, what we did so far is we said, look, the STFT is effectively a bank of convolutional filters, right? When I say that, it means that I could probably take the Fourier bases and put them in as the initial conditions for my network, and that's going to start me from a pretty good representation. So a lot of people early on were doing things like that, where they would take something like the DCT coefficients, preload them as the initial weights for the transformation, do the same thing for the inverse DCT on the other side, and that would speed things up a lot, provided you had matching sizes for your network. So there are ways to get smarter initialization, but either way, you're going to have to wait a few hours for this thing to train anyway. So if you wait ten more minutes for the weights to converge and find their way, it's not going to make a huge difference. And you're not going to run into the kind of very bad local optima that you'd have to worry about otherwise. Okay, yeah.
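(To make the preloading idea concrete, here's a minimal sketch, assuming a Conv1d analysis front end with N filters of length L; the sizes are illustrative, and we fill the weights with paired cosine and sine bases so training starts from an STFT-like transform.)

```python
# Warm-starting a learned front end with Fourier-like bases, a sketch.
import math
import torch

L, N = 512, 512   # filter length and number of filters (illustrative)
conv = torch.nn.Conv1d(1, N, kernel_size=L, stride=L // 2, bias=False)

n = torch.arange(L, dtype=torch.float32)
rows = []
for i in range(N):
    phase = 2 * math.pi * (i // 2) * n / L       # pair cos/sin at each frequency
    rows.append(torch.cos(phase) if i % 2 == 0 else torch.sin(phase))
with torch.no_grad():
    conv.weight.copy_(torch.stack(rows).unsqueeze(1))   # (N, 1, L), still trainable
```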
All right. So let's do the third section. Maybe then we could take a break, or should we take a break now? Any opinions? I think you can decide whether we take the break now or later. Okay, well, let's finish the next section, part three, and then we can take a break. Okay, great. All right. Okay. So, part three. Now we have the question of what do we have to optimize? And, you know, this is one of those questions which is crucial, and probably the most important question that you have to worry about, even more so than architectures and things like that. So at the top, you have one of my favorite quotes that I was told constantly in grad school: everything is optimal given the right criteria. So whenever you make an algorithm and you use your, I don't know, mean squared error cost function or something, and then you don't like the results, the answer is not that the results are bad. Your results are optimal. It's just that the cost function you used was the wrong one. So especially with neural networks, that is something that's very, very important, because we have a choice of optimizing a wide range of loss functions, and we have to know which one to optimize in order to get the best results. So this section is about discussing all of those things. So here's what we did so far. The original model did a very straightforward optimization. When we operated without the convolutional layers in the front and the back, what we said is, well, let's take the output of the network and the intended target, and take the mean squared difference between the two, and that's going to be our loss function, right? So we're doing a mean squared error in the frequency domain. Or, if you have the convolutional layers, you could do a mean squared error in the time domain. And that's kind of like the first cost function everybody uses for regression-type problems. And then the question becomes, is that okay? Well, mean squared error is very common in DSP, but the reason why it's common is because it makes all the math very, very simple. It has a lot of properties that make life easy. But that's not necessarily what we need in our case. For example, it would be very easy for us to swap the L2 distance for an L1. And yes, it does make a difference, but then the question becomes, why is that better or worse? I mean, we can see that it might give you better results, but then again, we need to have a way of justifying why this is the case. So when it comes to evaluating separation or denoising or any other given enhancement task — dereverberation or echo removal or whatever — it is always very important to know what matters. And I can't imagine a single case where something like a mean squared error is the right thing. It's the easy thing. It's the obvious thing. It's just not the right thing. The other thing to note is that what really matters is going to relate to your goals. Some people do denoising or source separation because they want it to sound good. That means that you ultimately have to satisfy somebody's ear. Some people want the output speech to sound intelligible, because somebody has to understand what's being said. Maybe it doesn't sound good, maybe there's still some noise, but if you improve the intelligibility, that could be what you care about. And that might sound a lot different from something that's designed to completely suppress the noise. Or maybe you want to facilitate further processing. Maybe this is a pre-processing step for your automatic speech recognition system, in which case you have to worry about that. And you have to make a lot of choices. Do I want to completely remove the noise and have some artifacts? Or do I want to keep some of the noise and maybe not have artifacts? These are choices that there's no right answer to. It's going to depend on your application. It's going to depend on your taste. So it's not something where I can tell you, this is the way to do it. But these are valid concerns, and how you optimize your neural network is going to reflect that. And you want to be able to do this. So first, let's say we go for maximal separation. If we really care about source separation — separating the noise from the speech — our goal is to suppress the noise as much as we can, and we also want to minimize any artifacts that we get from the reconstruction. There's a very well-known toolkit called BSS Eval that gives us a set of metrics that are very good for this kind of evaluation. We have three metrics defined in this case: the SIR, the SAR and the SDR. That's the source-to-interference ratio, the source-to-artifacts ratio and the source-to-distortion ratio. And these measure a lot of the things that we want. To give you a sense of what they look like: the source-to-interference ratio, or SIR, is kind of equivalent to what we would call the signal-to-noise ratio. In some sense, it measures how much of the interference is left in the output that we have. What it effectively does is a whole bunch of inner products, and what we're trying to figure out is how much of the original noise is left, where X is going to be the output of our network, Y is the ground truth of the target, and Z is the ground truth of the interference. So Z is the noise that we already know what it looks like.
Again, Y is the ground-truth target signal and X is the output of the network. Having those three values and arranging them this way gives us a number that gets maximized the more you suppress Z inside the signal X. We always show that in decibels; the larger the number, the better. If you get anything more than 10 or 15 decibels, it actually sounds pretty good. And if we want to make that a neural net loss, we want something that's minimizable. So we can take that maximization and simplify it a little bit: we're basically flipping the ratio and then removing some of the constant terms, and it ends up being a loss function that looks like this. We want to minimize the correlation with the noise, which is what the numerator does, and we want to maximize the correlation with the target, which is what the denominator does. If we optimize this ratio, we're effectively optimizing the source-to-interference ratio. So that's going to suppress the interference as much as it can while keeping the target intact. Then there's the source-to-artifacts ratio, a little bit of a more complicated expression. What it does is measure how many artifacts we have introduced into our signal. When you get an output, you will have some amount of your signal, some amount of remaining noise, and then some amount of artifacts that the processing introduces. What the SAR does is measure the amount of artifacts, whereas the SIR in the previous slide measured the amount of remaining noise. Again, this is measured in dB, and larger values are better. And again, we can stare at it for a while and do some math to simplify it. If you want to make it a neural net loss, where you have to minimize it, it's going to look like this. I'm not going to try to explain it, because it's not that easy, but basically it's something that we can easily optimize. It's just a bunch of dot products. And then finally, there's the source-to-distortion ratio, which in some sense combines the two — that should say SAR up here — so it combines the SIR and the SAR into one thing. We can use that as a single measure that basically takes the two other measures and tells us how well we do. Again, just a bunch of dot products. And if we simplify it as a neural net loss, it looks very simple. What it tells us is that we want to minimize the energy of the output that comes from the network while maximizing its correlation with the target. So it's a very simple expression. It says, make the signal as similar as you can to the target, and at the same time, make sure it's not just a zero or a very faint signal. So this is a cost that you can use: fairly simple, very easy to differentiate and put into your network. And that can work really well.
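(Here's a minimal sketch of those two simplified losses in code, following the dot-product forms just described: x is the network output, y the ground-truth target, and z the ground-truth interference, all as 1-D tensors; the exact constants from BSS Eval are dropped, as on the slides.)

```python
# Simplified SIR and SDR training losses as plain dot products, a sketch.
import torch

def sir_loss(x, y, z, eps=1e-8):
    # minimize correlation with the noise, maximize correlation with the target
    return torch.sum(x * z) ** 2 / (torch.sum(x * y) ** 2 + eps)

def sdr_loss(x, y, eps=1e-8):
    # output energy over correlation with the target: minimizing this keeps the
    # output close to the target without letting it collapse to zero
    return torch.sum(x * x) / (torch.sum(x * y) ** 2 + eps)
```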
Now, another avenue you can take is to say, well, you know what, we really care about intelligibility. I'm doing this for speech. People are going to listen to it, and I want to make sure they can understand what's being said. And that might mean that you still let some noise or maybe some artifacts go through, as long as the intelligibility is not gone. BSS Eval is not going to help you do something like this. Its job is to measure noise and artifacts, but that doesn't mean it's going to give you something that's as intelligible as possible. So an alternative measure is the STOI measure, which stands for short-time objective intelligibility. It's a measure that spans from zero to one. Zero means that you have no idea what's being said; one means that the sound you have is actually quite intelligible. If we sketch it a little bit as an algorithm, the way it looks is that, again, there's a short-time Fourier transform, there's some octave-band representation of the data up to 10 kilohertz, and then you compute some local correlations and average them over the entire signal. Those give you an indication of how much intelligibility exists in the signal. Again, this is something that people came up with by correlating it with human perception, so don't try to make too much sense of why it works. But you can show that it actually correlates well with how people judge intelligibility, so you can use it as a computational proxy for that. Now, what's a nice property here is that if you look at what it entails — there's an STFT, there's some octave-band processing, there's some correlation — as we've seen before, all of those things can be seen as neural network elements, right? We can combine them and make them a neural network, which means the entire thing is differentiable. So you can actually sit down and implement STOI as a neural network with fixed weights. And that means you can compute its gradients, which means it can now be a cost function for your neural network. So one of the advantages of the end-to-end network we built in the previous section is that, because we're using waveform-level processing, we have an input waveform coming in and an output waveform coming out. Any kind of cost function you can apply on waveforms — whether it's your SNR or your STOI or whatever else — you can now optimize as a cost function. So the reason why we went through all the bother of figuring out how to replace the Fourier transforms is not only to get a better type of representation, but also because it allows us to operate directly on the waveform, which is what most of the standard metrics are defined on. Now, which loss matters the most? Well, again, like I said before, there's no right answer here. A lot of papers report one of those measures, usually the SDR, and a lot of the time maybe they care about STOI. You can also use other ones like PESQ or PEAQ, or whatever measure works well for your space. But one thing we wanted to figure out, in a little experiment we did in my group at some point, is how people feel about different questions and how those relate to different cost functions. So what we did is we took a set of four questions, trained simple neural networks with different cost functions, and tried to figure out where you get the most correlation. So here's the answer that we got. Again, this is mostly to show you that different tasks require different things. One question was: how much was the interfering source suppressed? We put this on Mechanical Turk and asked hundreds of people what they thought about a whole bunch of separation results. And what you're seeing here is the rating that people gave to sounds produced with different cost functions.
So for example, we trained a neural network using mean squared error, and we see that the mean opinion score was about, I don't know, 50%. And then we trained another network where we said the cost function would be 50% the SIR and 50% the SAR. And we see that people felt that that suppressed the interfering source much better than the mean squared error network or the SDR network. So for that particular question, for that particular goal, the cost function we want to use would be the average of the SIR and the SAR. Now, if we change the question and ask how intelligible the separated speech was, things change. What we used before, which gave us the most noise suppression, doesn't necessarily give you the most intelligible speech. The best result we actually got by balancing, what is it, about 75% SDR and 25% STOI. So now we're introducing a cost function that actually cares about intelligibility. So we can see, again, that's different. And again, the MSE does not give you the best results, even though it's kind of the standard thing to do. In listening test number three, we asked people how well the target source was preserved. That means, you know, are we introducing a lot of artifacts or not? It turns out, again, a combination of SDR and STOI tends to do better. And the last question was: how inaudible were the artifacts? In this case, it turns out that using just the SDR is much better, but overall there wasn't as much variation as we had before. So what those tests tell us is that there's no single right cost function; it's going to depend heavily on what your goal is. And if you don't know what your goal is before you start doing this, then don't do it. Figure out what you want to do. And once you figure it out, then you know what kind of cost function you need to use and how you need to optimize things. Another question you might want to ask is, okay, what's going on with ASR? Well, if you want to do automatic speech recognition, it's well known that a lot of separation algorithms do not help ASR. There are a lot of inaudible changes, a lot of changes in the statistics of the speech signals that come out of those systems, that will actually confuse the models. ASR systems do not necessarily work the same way that our ears work. They latch onto features which we might think are not necessarily important. So if you apply a separation algorithm before an ASR system, most of the time it actually makes things worse — unless you train that ASR system with denoised data, but that's usually very expensive. So what you can do is use ASR metrics as a loss for the denoiser. That means you can train your automatic speech recognition system in tandem with a denoiser and then optimize the denoiser to facilitate recognition. Of course, that requires training from scratch. Or the other thing you can do is optimize the denoiser to minimize word recognition metrics. So you can take your noisy signals, put them through the denoiser, then put them through an ASR system that only spits out, let's say, your word error rate, and then you can use that as a cost function for your denoiser, optimizing it so that it improves the performance of the ASR system, not necessarily how it sounds or any of the other metrics. And again, that's fairly easy to do, because a lot of ASR systems are now just a big neural net.
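(Going back to those blended costs for a second, here's a sketch of what one might look like in code. It assumes a differentiable STOI approximation — here the third-party torch_stoi package, though any fixed-weight STOI network would slot in the same way — and uses roughly the 75/25 SDR-to-STOI mix from the intelligibility test.)

```python
# A blended SDR + STOI training loss, a sketch with assumed third-party STOI.
import torch
from torch_stoi import NegSTOILoss   # differentiable STOI approximation (assumed)

neg_stoi = NegSTOILoss(sample_rate=16000)        # returns -STOI, lower is better

def sdr_loss(x, y, eps=1e-8):                    # same form as the earlier sketch
    return torch.sum(x * x) / (torch.sum(x * y) ** 2 + eps)

def blended_loss(est, clean):                    # est, clean: (batch, time)
    # roughly the 75% SDR / 25% STOI mix that scored well for intelligibility
    sdr_term = sdr_loss(est.flatten(), clean.flatten())   # crudely pooled batch
    return 0.75 * sdr_term + 0.25 * neg_stoi(est, clean).mean()
```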
You can also use a lot of perceived quality measures. Sometimes people care about how good something sounds or what the fidelity is like, irrespective of denoising and other things. And again, you can use a lot of standard tools for that. You can use something like PESQ, which is again a computational function that basically approximates how people would perceptually evaluate speech quality, and there are similar things for general audio. There's a whole bunch of metrics that people have come up with in quality assessment. Most of them use very standard signal processing elements, and it's very straightforward to turn them into a neural network and then use them as a cost function as well. One final thing about cost functions: there's also a very strong distinction between denoising and separation. When we talk about denoising, you have a mixture of sounds, and there's one sound that you want; everything else is considered to be the sound that you don't want. So it's kind of a binary problem. That's a fairly simple problem, and one we could easily address with a lot of the metrics from the previous slides. But then there are interesting situations where you want to do, for example, separation of two speakers talking at the same time. If you have two people speaking at the same time and you want to separate both of those audio streams, it's not clear what is the noise and what is the target, right? Because both of these signals are targets, and you need to be able to extract both of them just as well. Now, the problem in those situations is that, because you don't have a clear definition of what constitutes a target and what constitutes an interference, we cannot use the same losses as we did before. And here's the reason why. Imagine that you're training your network, randomly giving it mixtures of speakers, and then you try to extract, let's say, the first speaker out of those mixtures and train that way. If you were to do something like this, at some point you might get a sentence s1 that gets mixed with a sentence s2, and you ask your neural network to optimize itself to give you just s1 as an output. Sometime later on, you might also get a mixture of s2 plus s1, which is going to be exactly the same input, but now the first source is s2, and you're asking the network to give you s2. What happens in those cases is that you get one gradient that says, oh, you want to give me s1, and another gradient that says, oh, you want to give me s2, and those two basically cancel each other out. So you would be asking your network to do something impossible, which is to give you two different outputs for effectively the same input. And what happens in those cases is that you run into trouble. This is a lot more prominent in situations where you want to do things like music separation, where you can have multiple musical sounds, and again, the order could be permuted in all sorts of ways. It's not easy to do something like that. So in order to deal with this, there's something called PIT, permutation invariant training. And what that is, is a way of creating a meta-loss function that we can use. We're going to use the same loss functions that we had in the previous slides, but we're going to add one extra element to help us deal with that ambiguity. And the basic idea behind it is actually quite simple.
So what we're going to do is, let's say we have this case where we're separating speech from speech, and we have s1 and s2 coming in, but then later on we might have s2 plus s1. What we're going to do is calculate the loss under either permutation of the sources. So our input is going to be s1 plus s2, and then we're going to try to separate s1, but also try to separate s2. Then we're going to look at those outputs and see which of the two orderings gives us the best loss — the best SDR or STOI or whatever it is you're evaluating. And let's say that in this particular case, it was s1. If that's the case, we're going to propagate gradients that do that. And then likewise, we can also get s2 out of it as well, once we have s1. Later on, when we get that signal again, even if the order of the speakers is permuted, we still see that the network does better when we extract s1 first and then s2. So we can again say, you know what, even though I'm looking for the first source, which in this case is going to be s2, stick to separating s1, because you already know how to do that, and I'm just going to flip the order of the outputs. So what that means is that when you train your network, you look at all of the outputs you're getting, figure out which permutation gives you the best results, optimize based on that permutation, and then you can undo it to get the order that you want. If you do that, it actually works really well, and it can give you very good separation when you have multiple speakers at the same time. The pro here is that we encourage the network to do whatever it already does best, and we don't confuse it by asking it to do multiple things. One of the cons is that, especially in cases where you say, I have three speakers and I want to separate all of the voices, you don't have a lot of control at inference time over what the order of the outputs is going to be. But that's fine, because when you have three people speaking, there's no implicit order in them, right? You can train a network to specifically focus on specific speakers, and then you don't have to worry about this kind of thing. But if you say, I have a bunch of sources, they're all the same type, and I want them separated, that statement does not imply an order. So that random permutation is going to be there anyway.
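(Here's a minimal two-speaker sketch of that idea in code: score both orderings of the outputs and backpropagate only through the cheaper one. base_loss can be any of the losses from before, and for K sources you would search over all K! permutations instead of just two.)

```python
# Two-source permutation invariant training (PIT) loss, a minimal sketch.
import torch

def pit_loss(est1, est2, ref1, ref2, base_loss):
    loss_keep = base_loss(est1, ref1) + base_loss(est2, ref2)   # as-is ordering
    loss_swap = base_loss(est1, ref2) + base_loss(est2, ref1)   # swapped ordering
    return torch.minimum(loss_keep, loss_swap)                  # best one wins
```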
Finally, the last thing to mention about evaluation: there are a lot of datasets you can use to train your systems and also to evaluate them. You need a lot of data to train these systems. So whenever you try to evaluate any of those algorithms, yes, you have to report some kind of loss function, but you also have to use some kind of standard dataset if you want people to be able to compare what you do with other papers. There are a lot of options for speech. People have been using a mixture of datasets based on the Wall Street Journal corpus and LibriSpeech. For Wall Street Journal, of course, you have to pay to get the data, but LibriSpeech is free. So there's a bunch of datasets built on those, and some of them also add reverberation to make things more challenging. That's for speech and speech separation. For music, again, there's a whole bunch of datasets that have mixtures of musical instruments, where you also get the separated tracks. These are all standardized; you can easily Google them and find them, and most papers link to them anyway. Those are very good for evaluating, and some of them for training as well. So to close this section: it's important to know what you're optimizing for. What sounds good is not necessarily what's going to work well in your application. And if you have end-to-end networks, you can use these differentiable losses and basically optimize the whole process. You can do very complex things, like say, well, here's my input waveform, here's my output waveform, now optimize a very standard industry metric like, I don't know, PESQ or something. And it's actually very easy to do things like that. All right. So I think it's time to take a little break right now. Before that, though, do we have any questions on this section? All right. So I have a question. The question is, for a speaker recognition task in a noisy environment, which of the BSS Eval metrics is more important to improve: SDR, SIR or SAR? So don't think of them as competing. SDR is kind of like a combination of SAR and SIR, so you can usually use that, and it works almost just as well. If you care about identifying speakers, I'm going to conjecture that what matters the most is that you have the least amount of artifacts. So I would say optimizing SAR is probably one of the most important things, but you also need to reduce the noise as well. Now, I haven't done it, so I'm just theorizing here, but I would say a healthy dose of SAR, to make sure that you don't alter the voice of the speaker, is important. And then you also need to have some component of SIR, because you want to suppress any competing speakers to be able to do that. But again, I haven't tried it, so this is just my theory at the moment. I wouldn't worry about intelligibility and things like that, of course. Thank you. Any other questions? Just one more question. In a multi-channel setting, how should I handle real-time non-stationary noise? Because I'm trying to work on it and it's really hard. I tried a masking approach. It's working well with stationary noise, but not with the real-time non-stationary noise. Okay. We'll be getting to multi-microphone stuff and some non-stationarity in a minute, so maybe you can re-ask that question in the next part. Okay. Thank you. Now we're going to move on to what's a very typical thing when it comes to working with neural networks, which is trying to find the architecture that works best. This is probably where most of the action is happening right now. We already have a good sense of what architectures work well, what sorts of setups work well, the cost functions, all that. But there's been a series of papers in the last three or four years where people are trying to figure out what would be the best architecture that can learn most efficiently from the data and give us the best possible performance. So we're going to start from one observation, which is that we need a model that's a bit more elaborate than what we started with. Remember, what we had so far was a very simple model. It didn't really look at any temporal correlations. So now we're going to use more modern building blocks of neural networks to get something better. The general structure that's used today looks like this. It's very similar to what we had before. You're going to have the input.
There's going to be some kind of a convolutional layer — the equivalent of a short-time Fourier transform, or some kind of basis decomposition — that brings us to some latent space. Then, where we so far had a series of linear layers, we're going to have what we'll call the separation modules. And whatever comes out of that sequence of layers gets multiplied with the original representation, gets passed to a transpose convolution that ends up being our synthesis step, and then we get a waveform as an output. Most of the innovation you're going to see happens in the separation modules. And the questions here are: how do we get more performance out of them, how do we structure them to be more efficient, and how do we get the best gradient flows to facilitate training? The first thing to note, of course, is that linear layers are okay, but that's what people did back in the 90s. So what we want to do is incorporate a little bit of time. We're going to take our model, which so far was time agnostic and didn't really look at anything around it, and we're going to build some sense of time into it. What I mean by time agnostic is that you would get one frequency frame coming in as an input — just one column, which would be your c(t) for a specific t — and what would come out would be the equivalent y(t). But the order in which the data came in really didn't matter. We didn't look at neighboring context; we didn't look at anything like that. So now what we want to do is fix that problem and put in structures that actually take a look at what's happening around you. And that's, of course, because we all know that speech is not a signal that's independent across time frames. There's a lot of context that we need to take into account. The usual tools for that would be to replace those linear layers with something like a convolutional network or a recurrent network, or even some kind of an attention layer. And that will give us the ability to look at what's going on around us. So let's start with CNNs, because that's the most obvious and probably the most powerful way to do this. We're going to redefine the separation modules as follows. We're going to have a sequence of convolutional layers, whereas before we had simple dense layers. And what those layers will do is take into account some of the future and some of the past frames of the signal to come up with a better estimate. If our convolutions had a filter size of one, they would only look at the current time point. But what we're going to do is grow them in time so they can see a lot of the context. The length of those filters is going to effectively define what we call the receptive field size. And the receptive field size is basically a fancy way of saying how big of a time window my network or my layer is looking at. Before, we had a receptive field size of just one frame. Now, because we're doing convolutions, that's going to be a bigger receptive field. So if somebody were to give you a bunch of convolutional layers, each of them with a filter that's k taps or k coefficients long, you can use this expression to figure out how big your overall receptive field would be after n modules.
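(The expression from the slide isn't reproduced here, but its usual form is simple enough to sanity-check in a couple of lines: with undilated convolutions, every layer adds k − 1 frames of context.)

```python
# Receptive field of n stacked undilated convolutions with k taps each.
def receptive_field(n_layers, k):
    return n_layers * (k - 1) + 1

print(receptive_field(8, 3))     # 17 frames
print(receptive_field(300, 3))   # 601 frames -- growth is only linear
```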
Now, the thing to remember is that these convolutions happen in terms of the front-end frames, and front-end frames are going to be spaced anywhere from one millisecond to 20 or sometimes 40 milliseconds apart. So the convolutions we're doing are at that time scale; they're not at the sample level. Now, one of the problems with CNNs is that it's going to be very hard to get very long receptive fields. For a lot of speech applications, you're going to need about a quarter of a second to one second of context. That means that, using the formula in the previous slide, you would probably need something like 300 layers of convolutions. And that's a lot of convolutions. That's a very complicated network. That's not something that you really want to train. So we get a whole bunch of problems from doing that. First of all, the number of parameters is going to be very significant, because every CNN layer we have in there is probably going to be on the order of a million parameters. But we also start seeing vanishing gradient problems. If you have a lot of convolutional layers one after the other, eventually the gradient gets diminished, because it is very hard to propagate all the information through so many transformations. That's a well-known problem in neural networks, so we can use standard ways of dealing with it. To deal with vanishing gradients, we're going to use two tricks: skip connections and normalization. These are textbook approaches to this problem, so there's nothing very specific to our setting here. The skip connections are going to help us propagate gradients. What we do in those cases is that, right before you enter a convolutional layer, you take that signal, skip over the processing of the layer and its saturation, and effectively add the output of that section to its input. What that does is make sure that even if you can't propagate the gradient through the convolutions — because information might be lost or whatever — you still have this path, and that ends up helping a lot. The second trick is normalization. What that's going to do is help you avoid a lot of extreme saturations. The way you do it is that every time you have a convolution, either right before it or right after it, you do what we call normalizing the data, which means you shift it to have zero mean and optionally scale it to have unit variance. There are multiple ways of doing that — over the batches, over all the time samples, or with different groups — so there are lots of different norms in neural nets. Most of them work just fine, so there's not a huge difference between them. But the effect of doing this is that you condition the data so that the next convolution doesn't lose a lot of information, because otherwise you might have a shift that keeps growing or a saturation that keeps becoming more extreme. So adding normalization layers actually helps a lot. There's still an active debate on whether you want to normalize before the convolution, after the convolution, or after the saturation. I'm not going to get into that. You get slightly different results every time, but I don't think it's worth getting into that debate.
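(The two tricks in one minimal block, as a sketch: normalize, convolve, and add the input back in so the gradient always has a clean path around the convolution. GroupNorm with a single group is used here as a stand-in for whichever norm you prefer.)

```python
# A residual convolution block with normalization, a minimal sketch.
import torch.nn as nn

class ResidualConvBlock(nn.Module):
    def __init__(self, channels, kernel_size=3):
        super().__init__()
        self.norm = nn.GroupNorm(1, channels)   # layer-norm-like over channels
        self.conv = nn.Conv1d(channels, channels, kernel_size, padding="same")
        self.act = nn.PReLU()

    def forward(self, x):                       # x: (batch, channels, frames)
        return x + self.act(self.conv(self.norm(x)))   # skip path around the conv
```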
So these two tricks by themselves can help deal with the vanishing gradients, but we also have to worry about the fact that we would need a huge number of layers, and we want to reduce that. And of course, the standard way to do that is to use dilated convolutions, which allow us to get much larger receptive fields from far fewer layers. The goal here is to not need 300 convolutional layers to get something like a second or half a second of context; we want a much, much smaller number. And the way we're going to do that is as follows. We're going to use stacks of convolutional layers, and then we're going to use a different dilation in each successive layer of the stack. Now, you've probably already seen dilated convolutions in the previous lectures, but the basic idea is that you take your input and apply the convolution on every d-th sample, where d is the dilation factor. For example, in the top case here, we have a filter that we apply on the input: we take neighboring points, multiply them with the coefficients, and that's the output of the filter. If we have a dilation of two, we take every second point from the input and apply the filter on those points, so we're skipping one each time. With a dilation of four, we take every fourth sample, and so on and so forth. What that allows us to do is have a filter that spans a bigger amount of time while still having only a small number of coefficients. So that allows us to use fewer CNN layers, because we can quickly get a much, much bigger context window without having to wait for all the convolutions to expand all the way and give us the big receptive field. So what we're going to do is stack dilated convolutions, and the standard way to do it in this field is to take a sequence of CNNs where the first one operates with a dilation of one, so just a regular filter; the second one a dilation of two; then four, eight, 16 and so forth. We're going over the powers of two to quickly expand the receptive field. In the graph here, I'm omitting all the activations and normalizations to make things a little simpler. What's nice with this setup is that we get exponential growth in the size of the receptive field as we stack more convolutional layers. If I wanted to get a one-second receptive field at an eight-kilohertz sampling rate, and I was using convolutions with three coefficients, I would need about 4,000 regular convolutional layers to get that much receptive field. Whereas if I use stacked dilated convolutions, I can get the same thing with only 12 of them. So that means I can have many, many fewer parameters and still have a receptive field that sees a lot more. And that, of course, is a big win in not having humongous networks that would take a lot of resources to train and would be prone to overfitting.
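(Checking those numbers in code, assuming 3-tap filters at 8 kHz: a layer with dilation d adds d(k − 1) samples of receptive field, so doubling d every layer makes the total grow exponentially.)

```python
# Receptive field of a stack of dilated convolutions with doubling dilation.
def dilated_receptive_field(n_layers, k=3):
    return 1 + sum((2 ** i) * (k - 1) for i in range(n_layers))

print(dilated_receptive_field(12))   # 8191 samples, about 1 s at 8 kHz
print((8000 - 1) // 2 + 1)           # ~4000 undilated 3-tap layers for the same
```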
Now, there's one more thing we're going to do to reduce the number of parameters even more, and that is to look at how we do the convolutions. If you look at a regular convolutional layer, let's say we have something that goes from n dimensions to n dimensions and has L filter coefficients. The number of weights you have is going to be n by n by L. And that can get pretty big, right? For example, in our case we often have something like 512 dimensions, so that's going to be n, and our filters are going to be about three taps. If you compute the value for that, it's going to be about 786,000 parameters. That's a lot of parameters for something that doesn't do that much. So instead, what we can do is a trick where we factorize the convolutions. And the basic idea is to take this single convolution, which has a lot of parameters, and approximate it with the following scheme. We take one convolution that goes from n dimensions to k dimensions, where k is going to be some latent space. That's what we call a one-by-one convolution: effectively, our filter size L is one, and the layer applies only a linear transformation on the dimensions of our data. It's the same thing as applying a full matrix transformation on our n dimensions to go down to k dimensions. So there's no convolution over time in that step; the only operations happen across the dimensions. Then we apply a convolutional filter in this lower-dimensional space, so now the weights are going to be k by k by L. And then we use another one-by-one convolution to boost our dimensions from k back up to n, the original dimensionality. Functionally, this does the same kind of thing as the original. The only difference is that now we're doing it by going down to lower dimensions, separating the effect of mixing the dimensions from the effect of filtering over time. And that ends up giving us a set of layers with fewer parameters. They also have additional nonlinearities, which is a good thing. So we can do roughly the same job with many, many fewer parameters. There's one more thing to add here: in the temporal convolution part in the middle, we're also using a depthwise convolution, which means that instead of using a full set of filters for every output — in this case, k filters for every output — we have one filter per dimension. That means we can go down to k times L parameters, as opposed to k squared times L parameters. So if we do all of that, we get an equivalent convolutional block that mixes all of our latent dimensions and also does some filtering over time, with much, much fewer parameters. If you do it right, you can go from the almost 800,000 parameters that we had up here down to about 200,000 down here. And this is used a lot because, again, it simplifies things by a huge amount.
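(The parameter arithmetic as a quick check, assuming n = 512 latent dimensions, a k = 128 bottleneck, and 3-tap filters, with biases ignored.)

```python
# Parameter counts: one dense convolution vs the factorized, depthwise scheme.
n, k, L = 512, 128, 3

full = n * n * L                 # dense convolution:     786,432
factorized = (n * k              # 1x1 conv down to k:     65,536
              + k * L            # depthwise conv in time:    384
              + k * n)           # 1x1 conv back up to n:  65,536
print(full, factorized)          # ~786K vs ~131K parameters

# An extra 1x1 skip-connection convolution (another k * n) is what brings the
# full block up to roughly the 200K quoted above.
```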
So putting all of those ideas together gave us one of the first architectures that worked really, really well, and that's called Conv-TasNet. The basic idea is the same as before: there's a front end, there's a re-synthesis, nothing changed, and we're still doing this masking operation. But now the layers that we have here are what we call the TCN modules. The way those work is that, for each of them, you get an input coming in, you have a sequence of convolutions that each have a different dilation that grows exponentially, and you sum the outputs of all of those at the end. So you have a bunch of skip connections in there. Then you get the output, and that goes to the next block, to the next block, to the next block. Additionally, you have skip connections between all of those blocks, all the way to the output of the final block. So there are a lot of skip connections going on, and a lot of simplified convolutions in here. And if you look at the convolution blocks, that's where we start making use of those separable convolutions: the input comes in, we go down to a lower dimensionality using a one-by-one convolution, just a linear transform; we have our activations and our normalization; then we do the temporal filtering with a depthwise separable convolution down here; and then we boost it up again to n dimensions. There's a little bit of a difference here in that we have one path that computes the output, with a residual connection added, and another convolution that gives us the skip connection that comes out from here and gets added at the end. Again, these are things that are not strictly necessary. You could actually do the same thing without most of the skip connections and it would work okay; they just give us a little bit of extra performance. It has an adaptive front end, which is going to be a little different from the STFT in terms of size. Usually you use only about 40 samples, or even 20 samples, of basis length. So it's much shorter than the STFT, which tends to be in the hundreds or thousands of samples. The hop size usually ends up being 10 or 20 samples or so. But if you look at the bases that are learned, you can see that they kind of look like sinusoids. Here we have lots of low-frequency sinusoids, we can see how the frequency keeps going up, and all the way up here we have very fast-changing bases, which correspond to high frequencies. They're not going to be exactly sinusoids; they're going to be whatever bases are optimal for the particular problem we're trying to solve. The TCN modules are usually about 512 dimensions, and in the middle we go down to 128 dimensions for the separable convolutions. The filters are always three taps, and we use stacks of four to eight dilated convolutions. Most of the good versions of that model have about a one-and-a-half-second receptive field. That's a pretty big receptive field, but it ends up having a relatively small size: about 5 million parameters. And if you do that, you get about 15 decibels SNR when you have people speaking at the same time. So that's speaker-on-speaker separation, which is pretty good. Another thing we can do is add recurrent layers. Now, instead of using convolutions to get a receptive field, we can use recurrence to help us see what's going on around our network. Most often, you're going to see either LSTMs or BLSTMs used for that, but of course other RNN variants are fine as well. The two best-known architectures that do that are Demucs and the DPRNN. Those systems do tend to be slower and/or bigger, but they can actually perform slightly better than Conv-TasNet, just because LSTMs can capture temporal structure a little better than convolutions. So let's look at the first one. Is there any question, Felipe, or did you forget? No, it was from before; in the previous block, I already asked my question. Okay. So which slide? No, no, I already asked the question. Oh, okay, okay, okay. Then, I think if you click raise hand again, it will lower the hand. Okay, thanks. Yeah, I can only see a subset of the people on my window, so if somebody raises a hand, I usually don't see it. So again, feel free to just jump in and speak.
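(Before moving on to the recurrent architectures, here's one of those TCN blocks as a rough sketch, using the sizes quoted above; details like the exact choice of normalization differ in the real implementation.)

```python
# A Conv-TasNet-style TCN block with residual and skip paths, a rough sketch.
import torch.nn as nn

class TCNBlock(nn.Module):
    def __init__(self, n=512, k=128, kernel=3, dilation=1):
        super().__init__()
        self.down = nn.Conv1d(n, k, 1)                    # 1x1: mix dims, go down to k
        self.dw = nn.Conv1d(k, k, kernel, dilation=dilation,
                            padding="same", groups=k)     # depthwise, dilated in time
        self.act1, self.act2 = nn.PReLU(), nn.PReLU()
        self.norm1, self.norm2 = nn.GroupNorm(1, k), nn.GroupNorm(1, k)
        self.res = nn.Conv1d(k, n, 1)                     # 1x1 back up: residual path
        self.skip = nn.Conv1d(k, n, 1)                    # 1x1 back up: skip path

    def forward(self, x):                                 # x: (batch, n, frames)
        h = self.norm1(self.act1(self.down(x)))
        h = self.norm2(self.act2(self.dw(h)))
        return x + self.res(h), self.skip(h)              # residual out, skip out
```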
All right. So the Demucs architecture was originally designed for separating musical instruments in music mixes. So it's got a slightly different scope, but you can also use it for speech enhancement as well. The architecture is a little different from what we've seen so far, in that it's got what they call a U-Net-like architecture. The basic idea is that you have an input coming in, and then you have a sequence of encoders. Those encoders have skip connections to the corresponding decoders on the other side. What the encoders do is basically reshape your data to have fewer and fewer samples as it passes through, because they're doing some kind of resampling or striding. At the end, you end up with a representation that goes through a BLSTM that scans the sequence in both time directions, so you get context on both sides, and then that gets fed into the decoders, which also use the skip connections from the encoders. And all of those things come together to produce an output. The structure of the encoder is two convolutions: your data comes in, you project it to some other dimensionality, and there's a stride of four, so you're decimating your samples as they come in. The decoders get their input from the previous stage and also from a skip connection from the corresponding encoder that has the same size; those get added together. And they basically undo the convolutions that were done in the encoding stage. So all of those things together start from an audio input; the encoders incorporate a lot of the front end and the separation as well; a lot of the temporal magic happens in the BLSTM, but also in the convolutions; and then you undo all of that processing to reconstruct the waveform. It's a very large network, and it can be slow because it has a lot of parameters, but the mean opinion scores on it are better than Conv-TasNet's. Yet another version is the DPRNN. The only difference it has from Conv-TasNet is that it uses RNNs instead of convolutions, and it builds on the idea of factorizing those convolutions. With Conv-TasNet, what we do is factorize convolutions by saying there's going to be one convolution that doesn't have any time window — it only operates across our dimensions — then we go to a separable, depthwise convolution that only operates on the time index and doesn't bother with mixing the dimensions, and then we mix them again with a one-by-one convolution. That's how we factorize things. The DPRNN does the same thing, but with RNNs. Effectively, in every block you have one RNN that scans over your dimensions and does all the appropriate mixing that needs to happen there, followed by another RNN that operates over time and deals with the temporal aspects. These two together effectively cover both dimensions: the latent-dimension aspect and the time aspect. What's nice with that is that it ends up being a smaller network, because RNNs tend to be fairly compact, and for speech-on-speech separation you can get up to 19 decibels of SDR, which is a pretty amazing number. But it's really, really slow, because RNNs are not going to be as efficient as convolutions.
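(For reference, the published dual-path formulation first chunks the latent sequence, with one BLSTM running inside each chunk and a second running across chunks; here's a minimal sketch of such a block, with illustrative sizes.)

```python
# A dual-path (DPRNN-style) block, a minimal sketch with illustrative sizes.
import torch
import torch.nn as nn

class DualPathBlock(nn.Module):
    def __init__(self, dims=64, hidden=128):
        super().__init__()
        self.intra = nn.LSTM(dims, hidden, bidirectional=True, batch_first=True)
        self.inter = nn.LSTM(dims, hidden, bidirectional=True, batch_first=True)
        self.proj_intra = nn.Linear(2 * hidden, dims)
        self.proj_inter = nn.Linear(2 * hidden, dims)

    def forward(self, x):                    # x: (batch, n_chunks, chunk_len, dims)
        b, c, l, d = x.shape
        h, _ = self.intra(x.reshape(b * c, l, d))         # scan within each chunk
        x = x + self.proj_intra(h).reshape(b, c, l, d)    # residual connection
        xt = x.transpose(1, 2).reshape(b * l, c, d)
        h, _ = self.inter(xt)                             # scan across chunks
        x = x + self.proj_inter(h).reshape(b, l, c, d).transpose(1, 2)
        return x
```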
So just to give you a taste of what those things sound like, here are a couple of examples. These are all going to be denoising examples. So here's an input; that's going to be the top-left plot. Comma, a couple of years ago, comma, an impossible situation. And here's the separation out of it. Comma, a couple of years ago, comma, an impossible situation. Here's another input. July 31st to stock of record July 2nd. So that's a pretty bad mixture, and here's what comes out as an output. Arrears on July 31st to stock of record July 2nd. Another input. We have no basis to believe it was done to an institutional period. And one last one. So as you can hear, they do pretty well in removing a lot of the noise. They do create some artifacts, because the overlap-add that they're doing is not as sophisticated as the inverse short-time Fourier transform, where we have the tapering windows and everything is kind of smooth. So you get a little bit of roughness there. But for the most part, they actually sound pretty good in a lot of cases. And if you Google around online, you're going to find things that sound even better than that; there's a lot of activity happening in that space. A quick note that you can also have multi-channel versions of those systems. It's very easy to take all of the models that I've shown before and extend them to deal with multi-channel inputs. The only difference is that the front ends now have to combine multiple channels coming in and produce some joint signal coming out. What I'm going to show now is a modification that's more appropriate for things like swarms of microphones, or cases where the microphones' relative positions change all the time, or the number of microphones keeps changing over time. That could be a situation where you have a bunch of laptops in a meeting room and people come and go, and new microphones come and go with them; or all the cell phones in a concert, where people are recording at random times, on and off. So these are things which are kind of like an array, but not quite. Again, we can still use those models in that situation, but it requires a slightly different approach. So here's what we're going to do. The idea is that the number of channels that you're getting is not going to be fixed, and whenever you're dealing with uncertainty in any number within your network, your best bet is to use something like a recurrent network. The models that we talked about so far would get a noisy input, let's say with n channels, and you can fix your front ends to take n channels, apply some transformation, and then give you, say, a single channel that has the output that you want. But if I were to change the number of channels dynamically, that model wouldn't be able to cope. I would have to train a new network to deal with the new number of channels, or if I moved my microphones, it would have to adjust its front ends. So that kind of model cannot facilitate a lot of dynamic setups; it cannot generalize. So what we're going to do is what we call a multi-view network, and the idea here is that you do sequential processing of the input channels. Instead of taking all of the channels at the same time, processing them as a single block and then doing the further processing, we're going to treat the channels as a sequence of events, and we can use an RNN to process them. So we're going to take the first channel and put it through our one-channel denoiser. It's going to give us some kind of an output.
It's not going to be optimal, because it's only looking at one of the channels. But then we're going to feed that output back into the network as we give it the second channel as input. And now it's going to give us a refined output that combines the previous output and the new input. And then we're going to do that over and over and over, scanning over all the channels. The advantage of doing this is that we only need one channel at a time. So if at some point in time I know I have five channels available, I'm just going to run this thing five times. If on the next time step a sixth channel appears, then I can run it on six channels. They're all using the same computational core; the only thing that changes is how many times you run it, because you have a different number of inputs. So it's effectively a recurrent neural network that operates over the channels. The way the overall system looks is that you get an input frame coming in — it could be a raw signal or it could be raw Fourier slices, it doesn't matter what it is. Each channel gets fed to the network, the output of the network gets fed back to itself as it sees the next channel, and so on and so forth. Once we're done with all the channels, we take that output and say, this is what came out at that frame. Then we take that and feed it on to the next time frame, so we expand the recurrence over time as well, and we go through the same process again and again. And what's nice, again, is that at every frame we can have a different number of channels, and things work out really nicely. Just to show a couple of interesting plots that tell us why this is a good idea: this is the SDR depending on the number of channels. This is a system that was trained on five channels, and then we see how it performs. We see that if you give it fewer channels, obviously it doesn't do as well. The blue line is the baseline for the five channels. But as you give it more and more channels, you can see that the performance slowly gets higher. So it's learning to generalize. The other thing that's kind of nice is that the channel order doesn't matter too much. Here we have two experiments. In one of them, we give the multi-view network channels with decreasing SNR as time goes by; this is the blue line. And you see that, as time goes by, the results don't get biased too much by the worse inputs coming in; it actually maintains a decent output. Likewise, if you give it channels in order from worst to best, you can see that it learns how to improve the performance. So at the end, you get very similar performance in both cases, even though the order of the channels was completely different. And what that tells us is that the system has learned to be channel-order invariant as well.
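(Here's a minimal sketch of that channel scan, assuming some single-frame enhancer net — hypothetical here — that takes the current channel stacked with its own previous estimate and returns a refined estimate.)

```python
# Multi-view channel scanning: re-run one shared core once per channel.
import torch

def multi_view_scan(net, channels):
    # channels: list of (frame_len,) tensors -- any number of microphones
    estimate = torch.zeros_like(channels[0])          # start from an empty estimate
    for ch in channels:                               # treat channels as a sequence
        estimate = net(torch.stack([ch, estimate]))   # refine with each new channel
    return estimate
```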
And the only thing we really do with convolutions and recurrent models is reuse a lot of the parameters in a way that simplifies things. There are a ton of variations — start Googling if you want; you can find source code for most of these things on GitHub and start playing with them immediately. So, any questions on the architectures before we move on?

I don't have a direct question on architecture, but it's related to multi-channel and the oracle mask. Should I ask now, or can I ask later?

Yep, go for it.

Okay, so the question is: I have noise recorded in real-life industrial settings, which is non-stationary. The issue is that when I compute my results with the oracle binary mask, it performs worse than the noisy speech in terms of SIR, SAR, and SDR. So when I train my DNN to estimate the mask, obviously it will never work, because even the oracle isn't working. So my question is what direction I should take.

So in this case your oracle is in the short-time Fourier transform space, right?

Yes.

Okay. So the beautiful thing with the Conv-TasNet paper — in fact, it was part of the title of the paper — is the performance they got. If you use the same networks with a short-time Fourier transform front end, they saturate at about 11 dB SDR or so. That's roughly the limit of the performance, and an oracle binary mask maybe gets you a couple more dB, but that's where the ceiling is. What they showed in that paper is that by using an adaptive front end you can get seven or eight dB more than the oracle case, just because you're using a different representation. So I would say that the representation you're using — that short-time Fourier transform latent space — is probably not the optimal space, and the oracle mask you have is not necessarily going to give you the real upper bound. It gives you the upper bound of that representation, but there are probably other representations a network could learn that would get you much better performance. There are also other issues, of course: you need enough parameters, enough dimensionality, big enough receptive fields — there's a million things that could go wrong, which is one of the unfortunate things with neural nets. But I would say: don't take the oracle results from a binary mask on an STFT as guidance at this point. You should be able to do much better than that with an adaptive front end.

Just to add one more point: I also tried with the Wiener mask and the ratio mask, and it was the same. So thank you for the feedback.
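For reference, the STFT-domain oracle we're discussing here is easy to compute when you have the ground-truth components. A quick sketch, assuming you have the clean speech and the noise that sum to your mixture:

```python
import numpy as np
from scipy.signal import stft, istft

fs = 8000
s = np.random.randn(2 * fs)     # stand-ins for real clean speech and noise
n = np.random.randn(2 * fs)
x = s + n                       # the mixture

_, _, S = stft(s, fs, nperseg=512)
_, _, N = stft(n, fs, nperseg=512)
_, _, X = stft(x, fs, nperseg=512)

ibm = (np.abs(S) > np.abs(N)).astype(float)   # keep bins where speech dominates
_, s_hat = istft(ibm * X, fs, nperseg=512)    # oracle-masked estimate
```

Whatever score this estimate achieves is the ceiling for binary masking in that representation — and the whole point of the adaptive front ends is that this is a ceiling for the representation, not for the task.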
Yeah, definitely look at different types of front ends. There was actually a student of mine who had a paper called "Two-Step Source Separation," I think, and the basic idea there was to find the best possible space in which to do the separation. So there's one step where you're effectively doing oracle binary masking, but you're doing it in a space that you're learning — the best space for it — and that ends up getting you the best possible space for the task, or at least for your training data. Then you can fix that and use it with a network like Conv-TasNet. That would be the only real equivalent to what you're doing, in that you'd be simultaneously optimizing the best space and seeing what the optimal binary mask can get you. In those cases you can get numbers up into the mid-40s of dB in SDR, which is really pushing the numerical precision of those systems. So if you were to implement something like this — send me a message on Slack and I can point you to the paper — you would see what the true upper bound is, given a neural network framework.

Okay, thank you.

All right, so now a few words about efficiency and deploying these things and all that. Again, it shouldn't be surprising at this point: these things are huge. There are a lot of convolutions, a lot of matrices, a lot of parameters. I find the Demucs model kind of amusing, because it's more than 400 million parameters — a remarkable amount of stuff to pack into a model. That means a lot of these networks are not particularly practical. Especially if you work in industry and care about deploying these things in real life, a lot of the models I've shown so far are not going to cut it. So we want to see what we need to worry about to get there.

The first thing you need to worry about is causality, and that's a big deal. If you wanted this running on, say, a cell phone or some kind of smart-speaker device, you need a causal system. What I've done in the plots at the bottom here is plot the equivalent of an impulse response for some of these networks. On the left is the impulse response of a Conv-TasNet: I gave it an impulse — just a delta function — with all the weights set to one, and looked at what came out. So this is the signal that comes out when you give it a delta function, and what you're seeing is basically the extent of the receptive field. There's a lot more happening in the middle, because a lot more filters operate there, but the entire response spans from minus 1.2 seconds to plus 1.2 seconds — almost two and a half seconds' worth of receptive field. But that model looks at future values as well as past values, and because we're doing this operation offline, that's perfectly fine: we're just centering our convolutions, so if somebody gives us a sound, we can look into the future and into the past. Likewise with the recurrent models: most of them use a BLSTM that scans the sequence forward and then backward, and again that's a case where we're sort of cheating, because we're looking at the future. In real life you won't be able to look at the future, so you have to change things accordingly. So there are a couple of things you can do.
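By the way, that impulse-response probe is a trick you can replicate in a few lines. Here's a minimal sketch on a toy stack of dilated convolutions — the measurement in the slide was of course done on a full Conv-TasNet, not this stand-in:

```python
import torch
import torch.nn as nn

net = nn.Sequential(                      # toy stand-in for a real separator
    nn.Conv1d(1, 1, kernel_size=3, dilation=1, padding=1),
    nn.Conv1d(1, 1, kernel_size=3, dilation=2, padding=2),
    nn.Conv1d(1, 1, kernel_size=3, dilation=4, padding=4),
)
with torch.no_grad():
    for m in net:
        m.weight.fill_(1.0)               # all weights set to one, as in the talk
        m.bias.zero_()
    x = torch.zeros(1, 1, 1001)
    x[0, 0, 500] = 1.0                    # delta function in the middle
    y = net(x)
    nz = torch.nonzero(y[0, 0]).squeeze()
    print("receptive field spans samples", nz.min().item(), "to", nz.max().item())
```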
The BLSTM models are not going to work in that case; you'd have to move to an LSTM model where you don't do the backward scan. If you do that, it's okay, but of course it hurts performance a bit. With convolutions it's a bit more interesting. Here's the impulse response — let's call it that — of a Conv-TasNet, but this time using only causal filters. Instead of three-tap filters where the center tap corresponds to time zero, with one tap looking into the future and one into the past, I shift things so I'm looking at the two previous time steps and the current one. That makes it a causal filter, and if I do that, the response looks like this: roughly the same extent, but now it only looks at the past. If you do that, performance drops by three or four dB of SDR — that's the price of not being able to look ahead. A compromise is to use asymmetric filters, which is what I'm doing here. In the regular Conv-TasNet we have three-tap filters: one tap in the center and, symmetrically, one looking into the future and one into the past; as you use dilations, those taps spread further apart while staying centered. What we do instead is take the current time step and the next future frame, and only dilate toward the past. That gives us a wide receptive field going into the past and only a very short extension of the receptive field into the future: we're looking 40 milliseconds into the future, which is an acceptable delay, and 1200 milliseconds into the past. So that lets us look a little bit ahead, and it gets us better results because of it.

We also have to worry about real-time processing: processing one second of sound should take less than one second. A lot of these algorithms are computationally demanding — many of them only run on a GPU in a reasonable manner. Some are easy to pipeline; others, like the LSTMs, are very difficult to optimize computationally because of the recurrent computations. So being able to minimize the processing time is very important. Most of the papers you'll see are not focusing on things like that, and a lot of the deep learning toolboxes are not suited to streaming inference on sequential data — for example, you can't easily do convolutions where you keep a buffer filled. So this is something you have to worry about as well.

Finally, memory requirements are a big thing. You have two types of memory consumption. One is the number of parameters in the model itself — Demucs, for example, has a huge parameter space, more than 400 million parameters. But you also have to worry about intermediate representations. Every time you take a signal and put it through a convolutional layer, what comes out is many filtered versions of that signal. So if your signal took a certain amount of space, once it comes out of the convolution it's going to occupy a multiple of that space on your GPU or CPU, depending on where you run things. And it turns out that once you have a lot of convolutional layers, you get a lot of intermediate representations — intermediate variables — that take up a lot of memory.
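Here's what the causal and asymmetric variants boil down to in code — just a sketch with explicit padding; real implementations fold this into the layer definitions:

```python
import torch
import torch.nn.functional as F

def causal_conv(x, w, dilation):
    # x: (batch, ch, time), w: (out_ch, in_ch, 3)
    pad = 2 * dilation                    # shift the whole kernel into the past
    return F.conv1d(F.pad(x, (pad, 0)), w, dilation=dilation)

def asymmetric_conv(x, w, dilation, lookahead):
    # mostly past context, plus a fixed small peek into the future
    assert lookahead <= 2 * dilation
    return F.conv1d(F.pad(x, (2 * dilation - lookahead, lookahead)), w,
                    dilation=dilation)

x = torch.randn(1, 8, 1000)
w = torch.randn(8, 8, 3)
y_causal = causal_conv(x, w, dilation=4)                  # sees x[t-8 .. t]
y_asym = asymmetric_conv(x, w, dilation=4, lookahead=1)   # sees x[t-7 .. t+1]
```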
So most of the memory consumption in these models is not the model, the input, or the output — it's the stuff that happens inside, and a lot of the time it's not easy to optimize for it. That's a big problem in training, and in inference as well.

One model that was proposed to alleviate these things is the whimsically named SuDoRM-RF model. What this model does is basically look for the trade-off between separation performance and computational constraints, and it has a couple of key elements. One is that it uses a lot of resampling operations to minimize memory use: instead of using, for example, dilated convolutions, you can resample the input and use the same convolution as before, which helps you not waste as much memory. It also replaces previous operations with much simpler, faster processes — you'll see that in a minute — and it has a form that is amenable to making a large model if you need one, but also to compressing it down to a much smaller size. So it's more of a family of models than one specific thing. The other thing it's designed to do is learn from data as fast as possible.

The basic architecture looks like this. We have the same framework as before — the front end, the synthesis, and some kind of masking taking place — but now we have these U-Conv separation blocks. The way they work is that in each of these blocks, the input comes in, you convolve it with a set of filters, and you decimate by a factor of two — that is, use a stride of two. That gives you half the size in time samples. Then you get to the second level and do it again, then the third level, fourth level, fifth level — for as many levels as you want — and at every level the output coming out is smaller and smaller, which means you're saving a lot of memory. If instead of this striding I were using, say, dilated convolutions to make up for it, all of those intermediate outputs would be full size and I'd use much more memory. Then, once we get to the very bottom — the last level we want — we invert the process, and all we do is a simple upsampling operation using nearest neighbors; at every level we also have a skip connection where we sum in what the corresponding encoder level produced. We propagate that all the way back up to the size we started with, that comes out, and we move on to the next block. With this kind of structure we save a lot of memory, and we save a lot of computation, because we're just replicating our data at every other time step as opposed to doing bigger convolutions that have to fill in the gaps.

What's nice with this model is that it lets us do things with a lot of computational savings. This table shows the models we examined. The SuDoRM-RF 1x model is the big model, and this is a smaller version that uses a smaller number of those resampling blocks and modules. The results are for speech versus speech — it's Wall Street Journal, two speakers talking at the same time, and we're trying to separate them.
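In code, the resampling trick looks roughly like this — a loose sketch in the spirit of the SuDoRM-RF blocks, not a faithful reimplementation; the channel count and depth are arbitrary:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UConvBlock(nn.Module):
    def __init__(self, ch=64, depth=4):
        super().__init__()
        self.down = nn.ModuleList([
            nn.Conv1d(ch, ch, kernel_size=5, stride=2, padding=2)
            for _ in range(depth)])

    def forward(self, x):
        skips, h = [], x
        for conv in self.down:
            skips.append(h)
            h = F.relu(conv(h))          # every level is half the length: cheap
        for skip in reversed(skips):
            # cheap nearest-neighbor upsampling plus a summed skip connection
            h = F.interpolate(h, size=skip.shape[-1], mode="nearest") + skip
        return h

y = UConvBlock()(torch.randn(1, 64, 1600))   # same shape out as in
```

The thing to notice is that every level operates on half the samples of the one above it, so both the flops and the intermediate activations shrink geometrically — exactly the memory behavior a dilated-convolution stack doesn't have.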
As you can see, Conv-TasNet gets about 15 or 16 dB; Demucs is a little lower — it actually works better for music; DPRNN gets almost 19 dB, which is really close to the state of the art right now, at about 20 or 21 dB; and you can see that SuDoRM-RF is pretty close to that. The difference shows up when we look at the number of parameters: Conv-TasNet is about five million parameters, Demucs is more than 400 million, DPRNN is smaller because it's using the LSTMs, but SuDoRM-RF is 2.6 million parameters for the big model, and you can go down to under a million for the smaller one. If you look at the gigaflops, Conv-TasNet is reasonable, and Demucs is actually a little more efficient in that sense; DPRNN, because it has a lot of LSTMs, is extremely slow to run — nowhere near real time. And if we compare with SuDoRM-RF, we're getting a much, much smaller number of gigaflops per second of input, with comparable performance. Likewise with memory — this is how much memory it takes to process one second of audio at eight kilohertz — SuDoRM-RF gives us the smallest numbers. And if we look at CPU processing time, it again ends up being one of the most efficient. So if I were to build something like a real-time system that runs on a phone, I would probably look at the small SuDoRM-RF model, just because it doesn't require much memory and needs the smallest number of gigaflops for the audio coming in. If I wanted the super-duper state-of-the-art numbers, I would probably use something like DPRNN, which does give me great numbers but ends up being 50 times slower than the other models. Likewise — I'm not going to show this here because we'd run out of time — you can show that training also becomes more efficient with these models, because you can do a lot more epochs of training per allocated gigaflop than you can with the other models.

One other thing we can talk about is hardware efficiency. Even if we use the smaller models, we still have to do a lot of gigaflops of processing. So if you want to run this on an embedded device — a smart hearing bud, a smartwatch, any of those things — we need to find a way to make things smaller. A good way to do that is to look at binary neural networks, which are one way of minimizing the amount of actual processing you have to do in hardware. So we're going to take a little detour and talk about that as well. The idea is to quantize as much as we can. A standard trick in neural networks nowadays is to take your neural network and quantize it down to 16-bit floating point, or maybe 8-bit, which saves a lot of processing power and sometimes makes the math a little faster as well. But that's still a lot of complicated hardware for all the floating-point — or even fixed-point — math. What we're going to do instead is what we call binarization: we're going to take every input and every model parameter and represent them by a single bit, so they're going to be zero or one.
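If you want to build this kind of comparison table for your own models, the parameter count and a rough CPU timing are a few lines each — a small helper sketch (the FLOP counts in the table need a profiler or hand counting, which I'm skipping here):

```python
import time
import torch

def profile(model, seconds=1.0, sr=8000):
    n_params = sum(p.numel() for p in model.parameters())
    x = torch.randn(1, 1, int(sr * seconds))      # one second of mono audio
    with torch.no_grad():
        t0 = time.perf_counter()
        model(x)
        elapsed = time.perf_counter() - t0
    print(f"{n_params / 1e6:.2f}M params, {elapsed:.3f}s CPU per {seconds:.0f}s of audio")

profile(torch.nn.Conv1d(1, 1, kernel_size=16))    # trivial stand-in model
```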
We're not going to get any real numbers or any integers; everything is going to be a zero or a one — all the weights of our network, all our inputs, all our outputs. What that means is that we don't have to use floating-point or integer arithmetic at all, and we can simplify the amount of processing happening on hardware significantly.

So how do we do that? What we have in this equation is the typical formulation of a neural network unit, y = tanh(Σᵢ xᵢwᵢ), with x being the input and w being the model weights. Effectively, your input comes in, you compute a dot product with your weights, you sum everything, and that goes through a saturating function. Now let's look at what happens more carefully. If my x's here are positive numbers and the corresponding w's are positive numbers, each product ends up being positive; if this happens a lot, the summation becomes much bigger, and that means we saturate the hyperbolic tangent toward plus one. Likewise, if the x's are negative numbers and the w's are negative numbers, I get the same effect. But if the signs of the x's and w's differ, I get negative numbers from those multiplications, and if you sum a lot of them together, you saturate the hyperbolic tangent toward minus one. So if there's a lot of agreement in the signs of your weights and your inputs, your output will tend toward plus one; if not, it will tend toward minus one. That's, at a rough level, what's happening in a typical neural network unit.

Now, to map this to binary, let's say our negative numbers are minus ones and our positive numbers are plus ones. Since we can only use a binary representation, we map these to zeros and ones: minus one becomes zero and plus one becomes one. Then I can take the operation from the previous slide and reimagine it using bit operations. The equivalent is the following: the XNOR of x and w is the equivalent of the multiplication table. What we were saying before is that if the signs of x and w are the same, you get a positive number coming out; if they're different, a negative number. What the exclusive-NOR does is give you a one when the bits are the same and a zero when they differ. So where sign agreement saturated the real-valued unit toward plus one or minus one, bit agreement now gives me a lot of ones or a lot of zeros. If I see that I get a lot of ones — say more than half the bits are set to one — then I say my output is a one; if fewer than half of my bits are set to one, I output a zero. So effectively, these operations are a very crude, quantized version of what was happening before. Why do that? It means I can take a real-valued network — an input vector coming in, a matrix full of floating-point values, all those multiply-accumulates to do and nonlinearities to compute, which take time, producing a real-valued output — and replace it with a binary-valued network where, instead of multiply-and-accumulate, I'm doing an XNOR and a bit count.
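Here's the whole binary unit in a few lines, using numpy booleans to stand in for hardware bits. This is purely illustrative — the point of doing this for real is to avoid software arithmetic altogether:

```python
import numpy as np

def binary_unit(x_bits, w_bits):
    # x_bits, w_bits: boolean arrays; True ~ +1, False ~ -1
    agree = ~(x_bits ^ w_bits)               # XNOR: True where signs agree
    return agree.sum() > (len(x_bits) / 2)   # majority vote ~ saturated tanh

x = np.array([True, False, True, True])
w = np.array([True, False, True, False])
print(binary_unit(x, w))   # True: three of the four signs agree, the unit fires
```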
Those are very fast operations to do in hardware. Everything is represented by a bit; if more than half of my bits are set, I output a one, otherwise a zero. So what we've done is a very, very rough quantization of the same operation, but now there's no need for any floating-point logic in our hardware, and our data takes only one bit per parameter, as opposed to the 16 to 64 bits of the real-valued version.

If you want to train this, we still train the model with real values — it's not easy to train with binary values directly — and that's fine. The one trick is that as we train the network, we apply a hyperbolic tangent to the weights, which helps saturate them, and we regularize so that the weights become as large in magnitude as possible, so that most of them end up being either plus one or minus one after they go through the hyperbolic tangent. By the end of training, we have a network where most of the parameters are plus one or minus one; then we map them to zero and one and get the binary form that we can deploy. Likewise, the inputs have to be binary values as well, because we can't feed it real-valued data. We can do that in many ways: we can quantize the data, or use some kind of hash representation — there are different options — but once we do, we end up with a binary representation we can feed to the network.

So how well does that work? Here's an example on MNIST digit recognition, just as a simple benchmark. The dark blue is what the real-valued neural network does, the light blue is what the binary neural network does, across different configurations, and what you're seeing is the error rate. You see that for equivalent sizes, the error rate is about one, one and a half, maybe two percent worse than the real-valued network — but at the same time, the amount of computation, the necessary hardware, and the amount of storage in the binary network are vastly smaller than in the real-valued network. So it ends up being a win. Why we care, you can see in the hardware comparison. One connection of a binary neural network, versus the real-valued version — which would be a 32-bit multiplication, fairly sophisticated hardware — becomes a simple gate in the binary network, and as you move up to layers and whole networks, things become much, much simpler. At the extreme, the hardware area required for a 32-bit float network of about that size is roughly 6,000 square micrometers; with a binary network it goes down to about 100. So we're saving about an order of magnitude in area, and if you look at the power, about two orders of magnitude. These are all simulations, because you'd have to design special hardware for this as well.

Going back to audio: we can take an input spectrogram and represent all the pixels using, say, a four-bit quantization — so we get a four-dimensional bit vector for every time-frequency bin — and then train a binary network whose job is to find a binary mask to apply, on that bit pattern itself, and we can train it.
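The training trick is just a reparameterization plus a penalty. Here's a sketch — the penalty form and its weight are choices you'd tune, and different papers do this differently:

```python
import torch

w_real = torch.randn(64, 64, requires_grad=True)  # latent, real-valued weights

def effective_weights():
    return torch.tanh(w_real)       # used in the forward pass; saturates to +/-1

def binariness_penalty():
    # push |tanh(w)| toward 1 so the final thresholding changes very little
    return ((1.0 - effective_weights().abs()) ** 2).mean()

# during training:
#   loss = task_loss(model_with(effective_weights()), batch) + lam * binariness_penalty()

# after training, snap to bits {0, 1} for deployment:
w_bits = (w_real > 0)
```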
If you do that, here are some examples we're getting for denoising — I think this was some stationary noise. Again, the dark bars are the real-valued networks and the blue bars are the binary-valued networks, for SDR, SIR, and SAR. We're getting roughly the same performance — we're about one or two dB lower — but we're still doing respectably well, even though we're using much, much smaller networks. It also follows that you can do this with hybrid models; Minje Kim at Indiana has figured out how to do it with LSTMs and other models, and you can even start talking about processing PDM audio streams directly — PDM is just a binary audio stream. So again, these things are nice if you care about hardware. If you're a software kind of person, this is not much use, because implementing binary networks this way is not efficient with the general-purpose software and hardware we currently have.

All right, so to recap this part: if we care about deployment, we have to worry about a whole bunch of different things. We need to make sure we don't do more processing than necessary, that we're sensitive about memory, and, at the algorithmic level, that we have some sense of causality. There are a lot of variations on all of the stuff I've said; you don't see many of them published, because they're usually systems that get deployed commercially, so these are things you kind of have to discover on your own as you're building. But these are the things to keep in mind. So, any questions on this part?

I have a question. You said that you train the system with floating point and then at inference you change it to binary. Can you then do fine-tuning with the binary network, to further improve the quality?

Yeah, there's a lot of stuff hidden under the rug there that I breezed through. Yes: the proper way to train a binary-valued network is to train it with real values, and then there's a step where you fine-tune it — you fix the weights, explicitly using quantization in the process, and that forces the model to compensate for some of the quantization error. There are four or five different papers that show different ways of doing that, but yes, there's usually a fine-tuning step. If you don't do it, it will still work; it just won't be as good.

Okay, thanks.

Sure. All right, I'm going to pick up the speed a bit. The next thing to worry about is data considerations. Neural nets need a lot of careful data collection, so we want to deal with that as well. What we're going to do now is talk about how you build your training data set, what's a good way of making that not be extremely time-consuming, and how to get around some of the problems there. Here's the typical training setup we've been using so far: you get a database with a lot of speech recordings, and a database with a lot of noise recordings if you want to do speech denoising, and you create artificial mixtures of the speech and the noise — that's your input — and since you know what the original speech was, it can act as the target. Now, there are a couple of problems with this. First of all, we need an exact target for every input, so we have to use synthetic inputs, and every time you create synthetic inputs, you're introducing some kind of bias.
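That standard recipe is worth seeing in code, because it's so little code — which is exactly why everybody uses it despite the bias problem. A sketch that mixes a speech clip and a noise clip at a randomly chosen SNR:

```python
import numpy as np

def make_pair(speech, noise, snr_db):
    noise = noise[: len(speech)]
    # scale the noise to hit the requested signal-to-noise ratio
    p_s = np.mean(speech ** 2)
    p_n = np.mean(noise ** 2) + 1e-12
    noise = noise * np.sqrt(p_s / (p_n * 10 ** (snr_db / 10)))
    return speech + noise, speech     # (input mixture, training target)

mix, target = make_pair(np.random.randn(16000), np.random.randn(16000),
                        snr_db=np.random.uniform(-5, 10))
```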
Right now, maybe I'm using the Wall Street Journal dataset for my speech and the BBC sound effects library for my ambient noise. Well, Wall Street Journal is mostly people speaking English, and the BBC sound effects are constrained to certain types of noises. If I want to deploy that system with people who speak Mandarin, it's not going to work as well, because the speech examples it has seen are not representative of what happens in a different language, with all the tonal elements and so on. Likewise, whatever noises the BBC people thought were adequate are not necessarily representative of the noises you'd find on some street somewhere. So it's very difficult to get around this idea of bias — it's a very complicated problem, of course — but it would be good to find a way of not having to deal with it, and to be able to learn from real data.

So what we'll do is try to make our system perform some new tricks, and do it without seeing a one-to-one correspondence between inputs and outputs. And the way we're going to do it is with a very human-like approach: we're going to learn from inputs and outputs that do not match. We're going to train the system by giving it a bunch of bad recordings — recordings that have noise, or reverb, or things we don't want — and also some completely different good recordings. The idea is that we tell our network: look, there are some recordings we consider bad, and this is what they sound like; and here's what we consider a good recording. Now, if I give you a bad recording, can you find a way of processing it so that it sounds like a good recording? We're not explicitly making mappings between inputs and outputs; we're giving it examples of what good and bad sound like. And that makes our data collection much, much easier, because now I can go out into the real world and collect real noisy samples where I don't have the ground truth, and use those without having to worry about ground truth to train the system. At the same time, I can collect a bunch of clean sounds and use those as the other side.

So how do we do that? The way to do it is with a style-transfer kind of approach — basically the equivalent of what people do in computer vision. In computer vision, we say: here's our input, let's say this picture of a spectrogram, and here's the style I want — "The Great Wave off Kanagawa," the famous painting. What those style-transfer networks do is take elements from the style image and re-render what the input looks like using those elements. In this particular case, if I did this in computer vision, I'd get a spectrogram that looks like this: it's clearly using the style, while keeping the elements we find in the original input. We want to do the same thing, but in audio. We want to say: here's a noisy recording, and here's another, completely different recording that sounds really good because it has no noise — can you render my noisy recording in the style of the clean recording? That's what we're going to do.
So to do that, we have to design a style-transfer network, and we're going to use the following logic. We'll use two sets of training data: one set with a lot of clean speech, and one with a lot of noisy speech. Again, we don't have to use the same speakers, the same sentences, or the same recordings for the clean and noisy sets — they can be completely disjoint data sets. The idea is that both of those sets are representative of what clean speech sounds like and what noisy speech sounds like — representative of the two styles. Then we're going to train an audio autoencoder for each style — we'll see how in a minute — and train those autoencoders with what we call cycle passes, which will help us figure out how to translate from one style to the other.

Let's look at it in graphic form to make better sense of it. The first step is to make an autoencoder for each style. I take all of my clean speech recordings and train a system that has an encoder and a latent representation — this can be very similar to what we did before: it could be something like a Conv-TasNet or a bunch of LSTMs or whatever you want to use, it doesn't make a huge difference — and then a step where we take that representation and re-render it back as audio, recovering the original input. So this network tries to find a latent representation of the input from which it can reconstruct it as an output, and of course that representation has to be lower-dimensional. Likewise, we do the same thing with the undesirable type of audio — for example, noisy or reverberant speech. So now we have two encoders and two decoders: one pair best suited to the good, desired type of audio, and one suited to the bad audio that we have.

The second step is to train what we call cycle passes, and that's what gets us what we want. The way it happens is that we take good audio, translate it to what we call bad audio, and then back to good audio again, through a path that looks like this. Clean speech goes into the clean encoder and we get a latent representation — and these components, actually, sorry, we're learning them all on the fly. But instead of putting that representation through the decoder that would give us the clean speech back, we put it through the decoder of the noisy speech, and we get some kind of noisy reconstruction out of it. We take that, put it into the encoder of the noisy speech, get a latent representation, put that into the decoder of the clean speech, and we should get something that sounds like clean speech. What we want is that, as we go around this loop, we optimize all of those elements so that we get out exactly the same thing we started with as input. Then we do the same thing the other way around: we take noisy speech, encode it, but decode it through the clean decoder; we get a rendering of the noisy speech in a clean style; we put that back into the clean encoder, get the latent representation, decode it through the noisy decoder, and we should get exactly the signal we started with. And we're going to add — whoops — one more thing, which is that for each of those cycles we want the latent representations along the way to be the same. Effectively, what's going to happen is that we learn to map speech into a space such that, when we get a noisy version of the same content, it still maps to the same point in that space.
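Schematically, one cycle pass and the shared-latent constraint look like this. Here enc_c/dec_c and enc_n/dec_n are the two autoencoders and mse is any reconstruction loss — all of these names are placeholders for whatever networks you use; the mirrored noisy-to-clean-to-noisy pass is the same with the roles swapped:

```python
import torch

def cycle_losses(x_clean, enc_c, dec_c, enc_n, dec_n, mse):
    z = enc_c(x_clean)            # clean -> latent
    fake_noisy = dec_n(z)         # render it in the noisy style
    z2 = enc_n(fake_noisy)        # re-encode the restyled audio
    recon = dec_c(z2)             # and decode it back as clean audio

    cycle = mse(recon, x_clean)   # the round trip must give back the input
    latent = mse(z2, z)           # both encoders must agree on the latent code
    return cycle, latent

# toy check with identity modules: both losses are zero by construction
ident = lambda t: t
x = torch.randn(4, 16000)
print(cycle_losses(x, ident, ident, ident, ident, torch.nn.functional.mse_loss))
```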
I'm going to skip the math explanation of that — it's basically the same thing as on the slide. The end result is that we get encoders and decoders that use the same latent representation for the same kind of content. So if I give it a clean recording of me saying "hello," I can put it through the encoder and get a representation that lets me either use the clean decoder and get a clean version of that recording, or put it through the noisy decoder and get a noisy version of it. Likewise, if I started with a noisy recording, I'd end up with the same representation, so I can render it as a clean sound or as a noisy sound. The advantage of this is that we no longer need matching pairs of inputs and outputs — it works just as well with unmatched data — and we can do all sorts of other mappings: it doesn't have to be denoising, it could be dereverberation, or room-acoustics transfer. So we can do a lot of cool stuff.

Just to give you a couple of examples, here's an input: "We stopped under the willows by Kempton Park and lunched. It is a pretty little spot there, a pleasant grass plateau running along by the water's edge and overhung by willows. We had just commenced…" So you can hear it has a lot of background noise. This is what happens when we put it through the network and render it in the style of the clean recordings: "We stopped under the willows by Kempton Park and lunched. It is a pretty little spot there, a pleasant grass plateau running along by the water's edge and…" So it did a good job removing the noise — and that's a speaker and a type of noise the system had not seen before; it's just re-rendering that particular recording in the style of all the clean recordings we had given it.

And we can do this with different targets. Here's a case where we changed the acoustics of a signal. Here's the input signal: "Although the breeze had now utterly ceased, we had made a great deal of way." And here's the target style: "…pretty little spot there, a pleasant grass plateau running along by the water's edge." Now, what's happening here is that the target style has a lot more reverberation — the speaker is more distant from the microphone than in the input. So what we can do is render the input in the style of the target, and what we're going to see is that we get the same reverberation and the same sort of noise characteristics that we had in the target recording, even though the content is what was spoken in the input recording. Let me play this again so you get the context. Input: "Although the breeze had now utterly ceased, we had made a great deal of way." Target: "…pretty little spot there, a pleasant grass plateau running along by the water's edge."
And the style-transferred output: "Although the breeze had now utterly ceased, we had made a great deal of way." So it's the same speaker, but now with the same kind of muffling and reverberation that we had in the target style.

Okay, since we're running out of time, let me skip ahead a little bit — in fact, let me skip ahead a lot. What I wanted to get across in that section is that we can use a lot of non-traditional training with these neural networks. There's a lot of interesting work happening nowadays with alternative architectures, so we don't necessarily have to have ground-truth data — and in a real-life environment, where you only get some noisy recordings and maybe access to some clean recordings, this can actually help you.

I'm going to skip the last part, which was probably not state of the art and not necessarily something you need to know, but just to give you a sense of what it was about: what I've talked about so far were discriminative models, trained to do regression from noisy to clean. You can use generative models as well — not necessarily GANs — and if you use neural models like that, they allow you to reuse one network for multiple tasks. The problem with the models I've shown so far is that you train them to do one thing and then they're kind of stuck — perhaps not the style-transfer one, but the previous ones definitely — and you'd always have to retrain them for new types of noise, new speakers, or a different task. With generative models, that doesn't have to be the case. Their performance is a little lower than discriminative models, because they're not trained for one specific thing, but they end up being a lot more powerful when it comes to deploying them to do different things. We've run out of time, so we won't talk about them, but I can send you links to read up on them.

So, closing remarks. Everything we do in this space usually goes back to basic DSP. Whenever you use any kind of neural network, try to figure out: if I remove the nonlinearities and all the fancy stuff we do with neural nets, what is the DSP operation I'm doing? That's going to give you a lot of insight; it's going to help you connect these things to techniques that are well known, help you use those tricks and techniques in the world of neural nets, and help you understand what the limitations are. By using deep learning layers as replacements for DSP elements, we're effectively supercharging existing algorithms. It's very important to know what you're optimizing — you have to know what your application is, otherwise you might be optimizing the wrong thing; we talked about all of that. And then efficiency is something that's very, very important. There's a lot going on in these models, and we want to be efficient with our use of data but also with computation, especially if you care about deploying these things — that's a big deal. If you just write papers, then sure, go nuts and build things that don't compute. And finally, you have to keep up to date: every month there's a new paper coming out that's, you know, 0.5 dB better than last month's, so whatever I've said now is probably going to be obsolete in six months or so.
These are all models that came out in the last couple of years or so, so again, you have to keep abreast of all the developments. So what I'm going to do is basically end here — thanks for hanging out with me for three hours. I'll be available on the Slack channel today and tomorrow, and in a few hours I'm going to send out the hands-on exercise, which I hear you'll be doing tomorrow. So if you have any questions, now's a good time; otherwise we can take it offline.

Could I ask you a question on multi-channel dereverberation? You mentioned dereverberation a number of times, and we spoke about how neural processing is kind of catching up with multi-channel DSP approaches. So has it been able to catch up with the likes of multi-channel linear prediction for dereverberation?

The answer is not entirely clear. There are some networks that do dereverberation, and it's been a pretty active area; I think until recently we didn't have a good evaluation data set, so people were a little apprehensive. The main problem in general — and we saw this with speech enhancement, which was in the same position seven years or so ago — is that a lot of these neural network algorithms are very bare algorithms: we don't do anything special. When you compare a bare algorithm like that with modern DSP algorithms, where people have spent decades fine-tuning all the little bits and pieces so that everything works well, it's obviously not going to sound as good. The artifacts we get in the neural network algorithms we have now are solved problems in DSP; it's just not easy to implement those fixes in a neural net. So I don't think the state of the art in dereverberation with neural nets right now is as good as the state of the art in DSP, but it's a matter of time until people get better data sets, more powerful models, and find enough tricks to get there. I have no doubt it will end up being better, but I think we're still at the point of figuring out what a good architecture is, what kind of elements need to be there, and what kind of data you need to train on. So if you just want to deploy something in a product right now, I would say stick with what people do in DSP: it's going to be more efficient, and you're going to know what its limits are. But I'm sure in the next few years — especially since people are now starting to think of denoising and source separation as almost solved problems — attention will shift to other enhancement problems, and I think dereverberation is the next one to come. It's still early steps: people are still coming up with methods that aren't efficient, that aren't causal, that will never work in real time — all the limitations I've talked about. So we're still figuring it out, basically.

All right, thank you.

Yep.

I have a quick question. Are any of these models prone to adversarial examples?

There's a little bit of work on adversarial examples. I think the main problem here is that there isn't a clear adversarial goal for, say, denoising. I mean, you can try to generate speech that somehow defies denoising, but it's a weird operation to attempt. There's a lot more adversarial work being done on speech recognition or speaker identification — tasks where you're trying to force a wrong decision. So even though it's possible, I haven't seen anybody
trying to figure out adversarial ways to confuse a denoiser. I'm sure you can do it; I just don't quite see why anybody would.

Yeah, I was maybe thinking of types of noise that are not really perceptible, but when you feed them to the denoiser, the signal is completely broken.

Yeah, so that would get classified as a sort of out-of-training-data case. Suppose I train my system on speech with just regular ambient noises. If I wanted to sabotage it, I could give it a noisy speech signal plus a very loud, very high frequency tone. That would completely freak out the network, because it's not something it has seen, and it would be otherwise inaudible to a human listener. So you can do little things like that. I wouldn't call them directly adversarial, because they're not particularly sophisticated — it's very easy to trick a network that way — and certainly you can do it, but I haven't seen anybody actively trying to design adversarial types of noise. How well you resist that is really going to fall back on how good your denoising is, and it's also going to depend on the type of network. For example, the style-transfer networks I talked about wouldn't fall into that trap as easily, because they're trying to produce something that sounds clean, not to remove the noise per se. So depending on your architecture you could see a stronger or a weaker effect, but it's relatively unexplored; I wouldn't say people have done much there.

Great, thanks.

I have a question — actually several, but I'll just ask one or two here. For speech enhancement we usually do mask prediction using a DNN. Would it make sense to predict the filter instead of the mask, using something like a U-Net?

I think this is open to debate; there are a lot of different ways people think about it. The simplest way is to think in terms of binary masking, but if you look closer, a lot of the networks I've shown don't do binary masking — they do some kind of soft masking. And I've heard people say that at this point it's actually a better idea to reconstruct the output directly and not bother with masking at all. So honestly, I don't think there's going to be a clear answer; it's a typical neural network situation where you just have to try everything and see what works. It's also going to depend heavily on your data. One thing I've noticed is that these networks can behave very differently with simultaneous male speakers versus simultaneous female speakers: with male speakers there's a lot more low-frequency content, things get mixed together more, and it's a much harder problem than with only female speakers. Little things like that will change what your front end should look like and what the best reconstruction method is. So I don't think there's one piece of advice I can give you, but there are more ways to do reconstruction beyond masking, and you might want to take a look at them, because they might make a difference in your problem.

And my last question: do you think that using a non-autoregressive approach such as WaveGlow for predicting the mask makes sense?

Yeah — so I didn't talk about those, but we had a discussion with Yannis about
that as well. There are a couple of papers where people use WaveNet-like architectures to do this: you can train a WaveNet to receive noisy speech and try to produce clean speech. I think what's been deterring people from doing more of that is that WaveNet is computationally a lot more expensive than all of these models, so from a practical standpoint it doesn't make a lot of sense — but it does give you very good reconstruction quality. I haven't seen any numbers that are competitive with the state of the art in non-autoregressive models, but it might just be a matter of time. I would note that it's very easy to take a lot of the recurrent models I've shown and make them effectively autoregressive by feeding back their own hidden states — so in some sense, some of them are already autoregressive models, just not explicitly so like WaveNet, where you take the output and feed it back in. But yes, you can do that; people have done it; I just haven't seen it compared head to head to see how much potential it has. It does sound okay, though — it's not bad.

No, no — my question was not about reconstructing the speech, but about predicting the mask using something like WaveGlow.

Oh — yeah, in my mind they're equivalent problems: it's the same computational complexity to predict the speech or to predict the mask. That specific combination I haven't seen — which doesn't mean nobody has tried it — but you could definitely do it. I think the one complication is that you also need a space in which to apply the mask, whereas reconstructing the speech directly would be much more natural in that setting.