So thank you very much for joining us, and I hope that next time we can have the same, or even a slightly updated, version of this talk. Okay, it's up to you now. Thank you very much, Yanis. Thank you so much, Yanis. First of all, I'm deeply appreciative for the chance to give this talk because, as Yanis said, we worked together over 20 years ago, and I never thought I would work on low bitrate speech coding again. It was dead, and suddenly it all came back. Unfortunately, there have long been two fields here, the computer science, neural network people and the signal processing people, like myself, with different approaches to these problems, and they rarely spoke in the past. Now we're all talking, which is good. What I want to do with this talk is to show that whatever we do today is basically a continuation of what people have done since the early 40s and 50s with electrical systems, and even before that with mechanical ones. So I'm going to try to connect the old world of speech coding with the new world of generative synthesis, and hopefully by the end of the talk you will see, as I think I have, that this is just another way of generating the speech waveform, following the same principles as the old ones. With that being said, let's dig in. Here is what I hope to cover today; I think I will have plenty of time, so I will cover these six topics. First, I'm going to go back to the roots and look at speech synthesis with the source-filter model. I haven't seen the previous talks, but you have probably heard of the concept, and the earlier talks probably went deeper into the neural networks than I will today, because I'm going to focus on speech compression, that perspective on these things. So speech coding will be the main topic, and especially linear predictive coding, which is a very nice piece of mathematics that merges very conveniently with an anatomical model, and which is the dominant way of doing speech coding up to today. Then I'll move over to the modern way, the venture we started a couple of years ago: coding with generative neural synthesis. I will talk about a couple of different approaches, and especially one that is generative synthesis but also incorporates some old-school digital signal processing, LPCNet. I don't know if you've heard of it, but it basically lets the signal processing take care of some parts and lets the neural network take care of the hard part. At the end, I will hopefully also cover practical aspects of speech coding. In text-to-speech, and in most speech synthesis, we are not concerned with anything other than pure clean speech, and these methods are fantastic and very well suited for that. Unfortunately, in the real world, if we're going to do communication, we need to do it in real time, so we need to process the audio fast. Also, most people are not sitting in a studio booth while talking on a phone, so we must handle background noise and such. And unfortunately, most of these methods don't really handle it. I'll hopefully play some audio samples of what happens with background noise, and I'll talk a little about an approach we've taken to reduce the sensitivity to background noise.
All right, so the first thing is to go back and talk a little bit about the source-filter model for synthesis, and we'll go back to how voice is produced in our bodies. Most of the pictures I'm using here are from the public domain, and I found this one on Wikipedia, which is an anatomical model of speech production. There are a lot of small details here, but I've highlighted the most important ones. Speech sounds are made from air coming from the lungs, going up through the larynx, where we have the vocal folds, which look a little bit like lips that open and close to make a periodic sound. That little opening is called the glottis; I think someone mentioned glottal coding the other day. So that's the real, natural source for periodic sounds. For the unvoiced sounds, the fricatives and the plosives, you still have the airflow from the lungs, but then the articulators in the rest of the mouth form the sound. If you just look at the periodic sounds, the sound goes from the glottis up through the throat into two main cavities: the oral cavity in the mouth and the nasal cavity up to the nose. Finally, the sound is radiated through the nostrils and the lips, and all these things shape the sound. The whole space in the head that forms the sound is called the vocal tract, and our articulators, the tongue, the lips, the teeth and so on, move around and shape this tube of air and all its resonances. This is the origin of the source-filter concept, and it's the basic concept behind the speech processing we're listening to right now in this Zoom meeting. There is a source generating a basic sound, which can be voiced or unvoiced, and then the rest is formed by a filter, a plain, simple, passive acoustic filter: the vocal tract plus the lip and nostril radiation. That's the source-filter model. And it has been around much longer than most people realize, since the 18th century or even earlier. The first mechanical synthesizer using this principle was von Kempelen's; he worked on his machine for many years. It's a very simple model, but I think it's amazing. It's mostly vowel synthesis, although you can mechanically get some consonants like stops. Actually, as an artifact of the model, air leaking through the system could produce some unvoiced sounds too. There is a bellows, the kind you'd use for a kitchen fire, that acts as the lungs: you put your hand on it and press. At the end of the bellows there was a little glottis, a reed from a bagpipe. In his first model there was a hard little clarinet-like horn, but he changed that to a rubber one so that you could deform it. What comes out of that reed is the basic excitation, the source. Then you filter it with your hands: you put your hands over the little mouth and form vowels that way.
It's not very complicated, but you can actually get some vowel sounds with it. I hope that I can share a video. Let's see now, if I do this, stop share and share another screen. Oh, I'm not a host anymore. Okay, sorry. Can you make me a host? Yep, let's do it again. Let's see. Can everyone see this one? Yes. Okay, let's hope that you can hear it too. We can hear it. Okay. So let's see, how do I stop sharing? This was good before. Stop share, then share again, share screen. Are we back? Yes, we are back. Okay. All right. Okay, so that's an old source-filter model. Then fast forward to the electronic era. Homer Dudley worked at Bell Labs in the 30s and 40s, and actually into the 50s. He started what we now call the channel vocoder, a project called the vocoder, for voice coder and decoder. As part of this project they also built what was mostly an exhibition piece called the Voder. The Voder was a full system with unvoiced and voiced speech. There was an operator sitting in the middle, and it was a she, because twenty women were trained for this. She had a keyboard controlling a bank of band-pass filters, plus a switch for voiced versus unvoiced, and she could control the pitch of the pulse train with a pedal. So it was operated completely manually. However, you have to make a lot of different sounds and train hard to generate them, so you needed a highly trained operator; they trained for a year, and they say they started with a couple of hundred women, many of them former secretaries skilled at typewriting, because you need to make about ten different sounds per second, which is the typical rate of phonemes in speech. So this exhibit was completely manually operated. As I said, it was part of the vocoder project, and the vocoder proper had both analysis and synthesis: the synthesis was driven by an analysis that is the opposite of this, estimating the energies in the different band-pass channels plus the voicing and pitch, to control these same parameters. The Voder, though, was controlled manually, and I think it's amazing. So now we have to do the video again: stop share, go to this, no, share screen. For example, Helen, will you have the Voder say 'she saw me'? That sounded obviously flat. How about a little expression? Say the sentence in answer to these questions. Who saw you? She saw me. Whom did she see? She saw me. Well, did she see you or hear you? She SAW me. Now, so far you have only heard the Voder speak in one voice, but the Voder has other voices which he can use when Miss Harper makes a simple adjustment in his mechanism. Helen, will you have the Voder say 'greetings everybody'? Greetings everybody. Now will you have him repeat that in a high voice? Greetings everybody. And now in his bass voice. Greetings everybody. When a boy's voice changes, he's never quite sure whether it's going to be a tenor or a bass, and the Voder, being still a comparatively young man, also has his moments of uncertainty. Let's hear him recite 'Mary Had a Little Lamb'. Mary had a little lamb. Mary had a little lamb.
Okay, if we look at this in signal processing terms, the source can be either a pulse train, a periodic source, or a non-periodic random noise source. So there is an excitation signal, U(z), or u(n) in the time domain, and you have a switch depending on whether the sound is voiced or unvoiced. Of course you can also have mixed voicing, for sounds like /z/ and /v/; how you control this switch is a design choice, but in the basic form it's a hard voiced/unvoiced switch. Then all of the resonances, all of the spectral shaping, is done by a simple filter. The lip radiation, the nostrils and all of that we simply lump together, because they are all forms of a passive filter that we can control with parameters, and out comes the speech. That was the background on source-filter modeling.
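To make that concrete, here is a minimal Python sketch of the two-state source-filter synthesizer just described. The filter coefficients, pitch period, and frame length are illustrative values I chose, not parameters from any real codec.

```python
import numpy as np
from scipy.signal import lfilter

def synthesize_frame(voiced, pitch_period, lpc, gain, n=160):
    """One frame of classic source-filter synthesis.

    voiced:       True -> periodic impulse train, False -> white noise
    pitch_period: pitch period in samples (80 samples at 8 kHz = 100 Hz)
    lpc:          all-pole filter denominator [1, a1, ..., ap]
    gain:         excitation gain
    n:            frame length (160 samples = 20 ms at 8 kHz)
    """
    if voiced:
        u = np.zeros(n)
        u[::pitch_period] = 1.0      # impulse train: the "glottal" source
    else:
        u = np.random.randn(n)       # noise source for unvoiced sounds
    # Vocal tract, lips and nostrils lumped into one passive filter 1/A(z)
    return lfilter([gain], lpc, u)

# Illustrative 100 Hz voiced frame through a crude one-pole "vocal tract"
frame = synthesize_frame(voiced=True, pitch_period=80,
                         lpc=[1.0, -0.9], gain=1.0)
```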
Now we're going to use this for something different from text-to-speech: speech compression. And what is speech coding and compression? Well, it came from the need to transmit and store speech digitally; coding here means digital compression of the speech. So you do an analysis of the speech and extract a bunch of parameters, depending on your type of analysis. These parameters need to be either transmitted to the other side or stored for later retrieval. To do that, you first quantize the parameters so that they take discrete values, and then you represent these discrete values with bits. Then you send them over a channel, which could be a storage channel, like magnetic recording media, or a transmission channel over the air or over a wire. The channel can modify the bits; it's quite likely that some of them will be corrupted. In the decoding part, when you convert the bits back to parameters, you can take care of error correction and handle losses if the channel is lossy. The decoded parameters are of course not exactly the same as the ones from the analysis, because there are one or two degradation steps: the quantization errors, and maybe the channel errors. In the end, the decoded, synthetic speech comes out of the speech synthesizer. This whole chain is called a coding chain. So if we're just going to extract parameters, send them over, and use them to generate speech, how low can we go? What is the most efficient way to do it? That's what coding is about: making this as efficient as possible, as few bits as possible. So what's the limit? Take, for instance, a system that runs the best speech recognition we have on the input speech; this is possible today. The recognizer outputs the phonemes or the text, and if you want a more compact form, you make text from the phonemes. You also have to transmit intonation, prosody, who is speaking, and whether he or she is angry or sad, so the emotions come across. And then you run a regular text-to-speech system on these parameters to reproduce the speech. That uses the speech synthesis most of you are familiar with, and it would be very, very efficient; today we can do this. In the 50s, when Shannon started thinking about entropy and information rate, he couldn't, so he made estimates, and not only him. He devised a simple scheme, looking at the entropy of characters in English, then two, three, four characters at a time, then longer units. Other people took 8,000 to 10,000 English words and tried to estimate the entropy of that, and someone else estimated how quickly we talk; I don't remember who, which is why I'm not naming names. In the end they came up with approximately 50 bits per second as the lower limit for sending speech as data, capturing who's talking and all the essential information, if you go purely by the lexical characters of English and the linguistic information you have. Another information-theoretic approach was taken by Fano; it's not as well known as the other estimate. He looked at it through Shannon's theory of noisy channels and how to decode symbols over them, treating it as acoustical noise and transmission of the acoustic signal, and he got a more pessimistic estimate: about three kilobits per second for wideband speech, and for narrowband, regular telephony, about 1600 bits per second needed to transmit the acoustic waveform of English. This has recently been brought up to date again; it actually never went away. During the 80s and 90s, when I was active in speech coding, people, me included, tried to figure out the lower limit of speech coding. The most recent estimate, from ICASSP 2017 by Van Kuyk and colleagues, takes a similar acoustic approach to Fano's, adds some more recent advances, and lands at approximately 100 bits per second. So that's the current estimate: with 100 bits per second you should be able to transmit anything in clean speech; and it is clean speech, the experiment assumed clean speech. But these are all theoretical limits. If we wanted to do it practically, and I haven't done this, but I was thinking about it yesterday and actually want to try, we could chain ASR and text-to-speech and see how low we can go with today's tools. We'd have to accept a very long delay: for TTS the delay is not so crucial, but ASR may need to look ahead half a second or slightly less, which is far too long for interactive speech communication. So for two-way conversation this is a very cumbersome mechanism. For practical speech coding, and this has been the state of things since the 80s, I think, there are two major groups of speech compression algorithms.
We have the parametric codecs, the vocoders, the robotic-sounding ones, which can go to very low bit rates, even below 1,000, like 300 bits per second. They can of course also go up in rate: you have a model and you extract parameters, and if you quantize them with infinite precision you can spend as many bits as you like, but it doesn't really matter, because it will never sound better. The quality is limited by the model, no matter how many bits you give it. And almost all of these parametric coders, at least when we started these experiments a couple of years ago, I hadn't found a single wideband one; they're limited to narrowband, eight kilohertz sampling. The other class is the classical waveform coders. Take PCM, for instance: you quantize each sample, and the idea is that the coder tries to match the actual incoming samples as closely as possible. In that sense, the more bits you spend, the closer you get to the incoming waveform, so there is no quality ceiling if you define quality as fidelity: at infinite rate you get perfect reconstruction of the incoming wave. Waveform coders go all the way from narrowband telephony to wideband, which is typical of VoIP like we're using here, and to fullband if you want high-quality audio and stereo. As you see in this little schematic plot, for parametric coders the model limits how good they can sound, while the waveform coders have no such ceiling; but at low rates, trying to mimic the incoming waveform does a really poor job, which is why parametric coders are the successful ones there. As an example of a source-filter coder that operates in the frequency domain, I'm taking the open-source codec called codec2. The good thing with most of the codecs I mention today is that you can get them and play around with them. David Rowe built it, and it's based on an old compression scheme called MBE, multiband excitation, which is still source-filter based, operating in the frequency domain. On the analysis side you do pitch estimation, and you take the FFT to estimate how much periodicity there is per band: you divide the speech spectrum into bands and decide how much voicing each one has. So the voicing is not a single switch, because there can be different voicing, voiced and unvoiced, in different bands; that is one of the reasons low bit rate coders started sounding much better, mixed excitation across bands. Then it does an LPC analysis to get the spectral shape, and quantizes the parameters. The interesting part is the synthesis: you use the voicing to build an excitation, apply the inverse FFT, and the post-filters are applied in the FFT domain. So you form the source filter in the frequency domain: you generate an excitation and multiply it with the spectral envelope from the LPC resonances.
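As a rough numpy sketch of that frequency-domain idea, heavily simplified compared to what codec2/MBE actually do (per-FFT-bin voicing flags and a per-bin envelope are my simplifications; real MBE works with harmonic bands):

```python
import numpy as np

def mbe_style_frame(f0_bin, voiced_bins, envelope, nfft=512):
    """Toy MBE-flavored synthesis: put harmonics where bins are voiced,
    random-phase noise elsewhere, shape everything with the spectral
    envelope, then inverse-FFT back to the time domain."""
    spec = np.zeros(nfft // 2 + 1, dtype=complex)
    for k in range(f0_bin, len(spec), f0_bin):        # harmonic grid
        if voiced_bins[k]:
            spec[k] = 1.0                             # coherent harmonic
    noise = ~voiced_bins
    spec[noise] = 0.3 * np.exp(2j * np.pi * np.random.rand(noise.sum()))
    spec *= envelope                                  # source times filter
    return np.fft.irfft(spec, nfft)
```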
All right, let's switch to the most widely used coding scheme, called linear predictive coding. It's based on a mathematical theory and tool called linear prediction analysis. The signal model is an all-pole filter: the signal consists of a source and a filter, a flat excitation driving an all-pole filter. It's called linear predictive because it predicts the next sample as a linear combination of past samples; that's the prediction, and the prediction error is the driving excitation. So each new sample is the prediction from old samples plus the excitation. These polynomial coefficients are easily computed: it's a linear equation system, the normal equations, built from correlations of the speech (a sketch of the standard solution appears below). So it's a very powerful and simple tool; there isn't much complexity in estimating the polynomial, nor in generating the speech, which is why it's been popular for modeling speech since the 50s. Okay, how do we use this in coding? Here is a rudimentary LPC vocoder, a linear predictive vocoder. This principle was used in a U.S. federal standard until 1996: a very common, very well-known parametric vocoder called LPC-10, standardized by the Department of Defense long ago, and it basically looks like this. You do a linear prediction analysis, and then pitch and voicing analysis on the residual after inverse filtering: you filter the speech with A(z) to get the flat residual, and from that you do a pitch analysis and decide whether the frame is voiced or not. The coefficients are quantized, sent over the channel, and used at the decoder, where we have the LPC generator. Depending on the voicing decision you use a pulse generator, very simple impulses, or a noise generator, with the periodicity controlled by the pitch. Then there is a gain, and finally the LPC synthesis filter. It's named LPC-10 because it uses 10 LPC coefficients, the polynomial coefficients. For narrowband speech, about 10 coefficients are sufficient: four formants, four resonances, take eight poles, and the remaining two give you the overall spectral tilt and shape. So 10 has been sufficient, and it's in use even today: every time you make a cell phone call, you're using 10 LPC coefficients in your phone. That's a long historical legacy.
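The linear equation system mentioned above is usually solved with the Levinson-Durbin recursion on the frame's autocorrelation. That is the standard textbook algorithm; the implementation below is my own sketch:

```python
import numpy as np

def lpc_levinson(x, order=10):
    """Solve the LPC normal equations by Levinson-Durbin recursion.
    Returns A(z) = [1, a1, ..., ap] and the residual energy."""
    # Autocorrelation of the (windowed) speech frame, lags 0..order
    r = np.correlate(x, x, mode='full')[len(x) - 1:len(x) + order]
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        # Reflection coefficient from the prediction error so far
        k = -(r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / err
        a[1:i + 1] += k * a[i - 1::-1]   # a_j += k * a_{i-j}, and a_i = k
        err *= (1.0 - k * k)             # prediction error shrinks each step
    return a, err
```

With order 10 on an 8 kHz frame, this gives exactly the ten coefficients LPC-10 is named after.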
All right, here's another one, the most advanced of the classical ones, and the replacement for LPC-10. Around 1994 to 1996, the American defense found that LPC-10 was really hard to use because it sounded so bad and robotic, so they needed something that sounded much better at the same low bit rate. There was a big competition between all the major providers and research labs in 95, and in the end TI won it with their codec MELP, mixed excitation linear prediction. It has some refinements over LPC-10, and one of the major ones, as in multiband excitation, is the mixed excitation: different voicing in different bands. If you look at the synthesis part here, the synthesizer: there is a noise generator for the unvoiced part, but the noise is activated in different bands with different gains; that's the shaping filter, and a parameter called the band-pass voicing strength sets the balance between voiced and unvoiced per band. You also actually transmit a little of the shape of the periodic source, a few Fourier magnitudes of the periodic excitation. Another difference: there is some adaptive spectral enhancement, but that's basically also just an LPC synthesis filter. And at the end they added a pulse dispersion filter, an all-pass kind of phase filter, to make the periodic sound less robotic. It's a major upgrade from LPC-10. I don't know for certain whether it's still used, but I think it is, for ham radio and for some very low bit rate secure voice communication, and it operates at 600, 1200, or 2400 bits per second. Hopefully you can listen to it later; even though it's a major upgrade from LPC-10, it's still pretty robotic to my ears. What happened a little after the LPC-10 era is that a concept called linear prediction analysis-by-synthesis came into scope. That is, we go back to making the output look more or less like the input: you still have a synthesizer, but now you look at the actual waveform and try to match it. So it is waveform matching; it's sometimes called a hybrid, because it comes from LPC-type coding, but it is waveform matching, with an error minimization to match the waveform. And as almost everyone does nowadays, the matching is done in a weighted domain: you take the LPC analysis, inverse filter the speech, generate whatever excitation your LPC system uses from its parameters, and shape the error with a weighting filter, a combination of a special perceptual weighting filter and the LPC synthesis filter. So this e_w(n) is the error in a weighted speech domain, the domain that is perceptually important to minimize in. When this concept came out in the early 80s, 1982 or so, it improved the quality a lot; however, since it's waveform matching, the bit rate has to go up too. The major breakthrough came in 1985, when Atal and Schroeder introduced code-excited linear prediction, CELP, pronounced 'selp' or 'kelp' depending on your flavor. When it came out, it took something like a week on a Cray to produce a second of speech, because in the 80s this was really computationally expensive. The idea is that you have what they call an LTP filter, a long-term predictor, for the periodic part, and an LPC filter for the short-term prediction, the spectral shape. So you have two filters; the LTP can also be viewed as an adaptive codebook, but that doesn't matter here. Two kinds of prediction take care of periodicity and spectral shape, and then you have a long, long table of different excitation sequences, a codebook, and you simply try them all and pick the one with the lowest error energy. And since that is an exhaustive search with a lot of signal computation, it was expensive.
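About the weighting filter mentioned a moment ago: it is typically derived from the LPC polynomial itself. A common textbook form (the exact constants vary between codecs; these are my assumed ballpark values) is

```latex
W(z) \;=\; \frac{A(z/\gamma_1)}{A(z/\gamma_2)},
\qquad 0 < \gamma_2 < \gamma_1 \le 1,
\quad \text{e.g. } \gamma_1 \approx 0.9,\ \gamma_2 \approx 0.6,
```

which pushes the coding noise under the formant peaks, where the ear masks it best.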
It is still quite expensive in general, but nowadays we have the power to do it; this runs today in all cell phones. With a CELP coder you can get below where traditional waveform coders stop sounding good, say 16 kilobits per second, down to barely four, and at six or seven at least it sounds really good. So until a couple of years ago, that was the situation. For internet calls like this one, voice-over-IP and Skype calls and so on, we use wideband speech, which we've figured out is the most pleasant, the best quality. But to make it sound good you need quite a high bit rate compared to the rates we talked about before. Here we're using Zoom, and Zoom uses Opus, and Opus is probably running at, I would say, 24 or 32 kilobits per second right now; my voice is being transmitted to you at maybe 32 kilobits per second. These codecs are all waveform matching, and they sound really crappy at low rates; they get harsh and noisy. However, with the emergence of internet everywhere, people want to use their phones not just for cell calls but for voice calls, Skype or Zoom calls, even on poor networks. So there is a need to make internet calls at low bit rates, and reducing the bit rate has suddenly become important again. That's what I meant at the beginning: speech coding, especially low bit rate coding, has been dead research-wise since the 90s, but now it's coming back. Jan, is it also true that not only for speech but also for music, for audio and general coding, people have restarted looking at the issue? Yes, but for audio it's different: audio can't really rely on models; there are no good audio codecs relying on signal models, I think. All audio coders, at the end of the day, are waveform matching, and they really don't sound good below something like 30 or 40 kilobits per second, even for stereo coding. But yes, there is interest now in pushing even audio coders, waveform-matching coders, down in bit rate. That's a topic for another talk, which maybe I'll come back and give. Here I'm talking about speech codecs at very low rates. For instance, take Duo, an app I've been involved with. People use Duo calls in really extreme conditions: you may have a good phone, but the network is problematic. Phones get cheaper and more advanced, but the infrastructure does not. In some emerging markets, I don't remember if it was India or Indonesia, they saw people using really old 2G networks, GSM networks, to try to make VoIP calls. On those networks the bandwidth available for the codec is around five kilobits per second, and you can't really do it, but there is a need. That's why people started looking into this again. Below five, as a round number, the only thing available that sounds okay is parametric coders, and they sound really bad; better than waveform coders at that rate, but still pretty bad. So what can we do now, knowing what we now know?
Yeah, and then we realized, or people realized: wait a minute, what are the parametric coders, really? They have a speech synthesizer at the decoder, driven by parameters. So you just need a generative model. Are there better generative models today than 20 years ago? Yes, I think there are. That's the segue to the next section. Are people up for continuing, or should we take a break? It's up to you. I think it's probably a good time to take a break, and to ask questions if people have them. We went quickly through so many years of speech coding, so probably people have some. Do you have any questions? It's up to you, Jan. I'll gladly take them. As I said, I think we're halfway, at only one hour, so it's okay. If there are no other questions, let's take a break; we'll pause the recording and come back. You can turn off your video, relax a bit, and then come back. Shall we say ten minutes or so? Yeah, that's fine. Okay, all right. I hope I still have some people; I see people are connected. Yeah, they are connected. Can you also check, since you're the host, whether there are people you need to admit? Again, it's getting late in the East, from where we have many, many participants. I know. Well, I'm just happy that some people are still online. Time zones are a nuisance, I realize that too. I try to be agnostic about them, but my body doesn't comply, because it's sometimes really hard to stay up when traveling, and it's getting worse with age, unfortunately. I have a friend, though, with a superpower: he says, for me, time zones are irrelevant, you can call me anytime, anywhere. That's okay, but I guess he doesn't have such nicely colored hair as you do. So you cannot be perfect, you see? No one is. Okay, let's venture into the actual topic of the entire talk; I felt you needed a proper introduction for perspective on what we're doing. Let's look at the state of the art, in the sense that there are no systems deployed yet, but there is active research going on in universities and companies applying this new technology to low bit rate speech compression. To start off, I think many of you have seen this kind of taxonomy before; unfortunately I didn't see any of the other talks. There are many, many ways of doing generative synthesis, and the methods can be classified in different ways; I think it was Ian Goodfellow who first drew this. There are the models based on maximum likelihood as the goodness metric, generative synthesis in the sense that you try to directly or indirectly model a probability density function and draw samples from it; those samples are your waveform. He classified them like this: for maximum likelihood you need a density, and that density can be explicit, meaning there is an actual distribution you work with, or implicit, in the sense that you never explicitly use the distribution.
The distribution is in there somewhere; GANs fall in that class, and I'm not going to talk about them, hopefully someone else has. To my knowledge they have yet to be used for compression. If we have an explicit distribution, it can be tractable, meaning there is a tangible density you can operate on and derive things from, or you can approximate it because you can't treat it exactly, which gives you the variational methods, the variational autoencoders. To my knowledge, there are no variational autoencoders for speech coding. There are autoencoders for speech coding; they are more of the waveform-matching type, and so far they've targeted higher bit rates. The current state is that most of these systems are really cool, but quality-wise they are on par with traditional methods, and we hope research will push this autoencoding branch of codecs further. Minje Kim at Indiana and his team are working on this intensively, and I'm rooting for them to make good waveform audio coding in that direction. But that's not what we're talking about here. We're talking about tractable distributions, where we can actually get an expression for the PDF and try to optimize its parameters. All the autoregressive models fit in here, and again, I know you've talked about WaveNet and WaveRNN and SampleRNN; I'll be talking a little more about them, and I think it's okay to do it again, because repetition is always good, and I'm going to look at it from my speech coding background. I come from speech coding and DSP, statistical audio signal processing, where we have mathematical models and estimate parameters for all the signal components, and only in the last five or ten years did these notions of neural networks and black-box modeling come in. So I still think in this signal processing vein, which could be a strength and sometimes an impediment, I guess; that's why I surround myself with a lot of people who are good at machine learning, so we give and take. Again, maybe this is repetition, but we try to model a distribution of data. That's a little different from the standard signal processing way: we sometimes model distributions of features and so on, but we rarely sample such a distribution to get the audio samples. That's the novel thing for us, coming from this school of thinking. So we have the signal samples, and we have samples of the distribution; two meanings of the word 'sample', which is tricky. Sometimes I mean audio samples, sometimes I mean draws from a PDF. In any case, we have a set of samples, and they all come from p_data, the true probability density function. We want to fit a model so that the PDF of the model resembles the PDF of the true data. And that's not new either: such parametric models have been fit to speech samples before.
For instance, the Laplacian distribution: if you just look at speech samples as a memoryless source, we know of course that adjacent speech samples are highly correlated, but if you randomize the order and just plot the histogram, it looks very similar to a Laplacian distribution. Similarly, if you look at DFT coefficients, the real or imaginary values, memorylessly, and plot the histograms, they very often look similar to a Gaussian. So this has been modeled before, in speech enhancement and noise suppression and so on, for the actual samples. Let's see, sorry. I've lost the cursor; when I go full screen my cursor is gone. That's Keynote, probably. Anyway, I don't need it. Okay. But all these man-made densities are limited in the sense that they have a few parameters and can't model everything. Of course, some of them, like mixtures of logistics or mixtures of Gaussians, have a little more freedom in their expressivity, but they're still parametric, and there are limits. You want to be completely free, and that's what the deep models give you: they remove this constraint. You can basically model anything with them, and you don't really need explicit parameters as long as you have a network that does it implicitly inside the box. That's their strength. The ones we use mostly here are the autoregressive models. They're all based on the fact that we can factorize the big multidimensional density, for a frame or for a long sequence of speech samples, into a product of conditional probabilities. And if you look at the individual factors, this q of x_k, each individual sample is a function of the old samples: we predict the next sample given all the previous samples. It's very simple, a one-dimensional probability density function, and that's the key to using this. Some people say, with good cause, that this is also a drawback, because you need to generate sample by sample, which can make it a little complex, but it makes it so simple in its raw form. So we have a completely tractable, explicit likelihood here. You can easily draw a sample from it, because you can model it and draw a random number from it, but it also gives an explicit probability per sample given the old samples. This, I feel, is the key and the elegant thing, why we can do what we're doing with autoregressive models. And again, if you have questions, I know it's late, but just jump in whenever, because this is not a real lecture; this is a laid-back late session, I hope.
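In symbols, the factorization being described (my notation, matching the usual autoregressive-model papers) is

```latex
p(x_1,\dots,x_K) \;=\; \prod_{k=1}^{K} p\!\left(x_k \mid x_1,\dots,x_{k-1}\right),
```

and with conditioning features c, used later for coding, each factor becomes q(x_k | x_{<k}, c): a one-dimensional distribution you can both draw a sample from and evaluate as an explicit probability.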
And again, WaveNet: you talked about that earlier, and there is no presentation of WaveNet anywhere that doesn't use this moving image. This animation is in all my presentations of WaveNet, and in every presentation I've seen from the DeepMind people when they talk about WaveNet. It's very beautiful. It says something about dilated convolution, although it's a little misleading: these are not nodes the way you typically draw a neural network; they show more how the samples are related inside, the sample probabilities in the network, not the real computation nodes. And it's autoregressive: you generate a sample and then feed it back. One thing I find interesting about this picture, from my signal processing background, is that it very much looks like an FIR filter. It's a convolutional network that takes everything within a certain window back in time and uses that data to form a nonlinear filter, you could call it, that generates the next sample; recursive in the sense that the output is fed back, but producing one sample at a time from a finite window. And here, again, I think you've seen this picture of the WaveNet architecture. The actual computations are done inside a convolutional network with skip connections, and in the original WaveNet the output is a softmax over 256 probabilities, the 256 levels of an 8-bit mu-law representation of the audio value (a sketch of that mu-law quantization follows below). The way it's displayed in the nice animation, it's completely autoregressive with no control over it. If you run the system like this, it will just produce random speech sounds, basically babble, which, as an art form, sounds pretty cool, but it's not useful. In order to drive it, to control the network, you need conditioning. That's very often omitted in the WaveNet presentations, but for any kind of usefulness, the probability density, the q here, the blue one, needs conditioning: you condition on some features to control what the next sound will be. So I'm going to talk a little bit about a parametric speech codec that uses WaveNet for its synthesis part. It was presented at ICASSP a couple of years ago, and to my knowledge it was the first to use an autoregressive network for low bit rate coding. And this is typically how you train WaveNet when you have conditioning: the generative model is conditioned on some features, so the x here can be described as a scalar, one-dimensional, conditioned on the features and all the past samples. During training you do feature extraction, some kind of features; for TTS it's text or phonemes or pitch and so on. You feed these conditioning features into the WaveNet, and during training you compare against the clean speech, the targets, and you so-called teacher-force it: you put the clean speech into the pipeline during training and optimize the parameters to make the probability of the true next sample as high as possible. You don't actually sample during training; you maximize the probability of the true sample. But you all know this if you've been training WaveNets.
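Those 256 softmax classes come from 8-bit mu-law companding of the waveform. Here is a sketch of the smooth mu-law mapping commonly used with WaveNet-style models (note this is the continuous formula, not G.711's segmented table):

```python
import numpy as np

MU = 255.0  # 8 bits -> 256 levels

def mulaw_encode(x):
    """Compand x in [-1, 1] and quantize to a class index 0..255."""
    y = np.sign(x) * np.log1p(MU * np.abs(x)) / np.log1p(MU)
    return ((y + 1) / 2 * MU + 0.5).astype(np.int32)

def mulaw_decode(idx):
    """Map a class index 0..255 back to a waveform value in [-1, 1]."""
    y = 2 * (idx.astype(np.float64) / MU) - 1
    return np.sign(y) * ((1 + MU) ** np.abs(y) - 1) / MU
```

The companding spends the 256 levels logarithmically, finest near zero where speech spends most of its time, which is why 8 bits suffice for telephony.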
So for the coding application, we use exactly the same thing, except we drive WaveNet not with TTS parameters but with speech coding parameters. In this first project we took codec2, which, as I said before, is an open-source traditional parametric coder. I think it's very nice because it's open source and has straightforward, easy-to-understand parameters. It's a narrowband codec, sampled at eight kilohertz, meaning the signal content only goes up to four kilohertz. And what parameters does it use? Let's look back at codec2. The parameters are the LPC coefficients, coded as line spectral pairs, line spectral frequencies, which is another representation of the polynomial coefficients; very good for quantization, because it can guarantee that the polynomial stays stable, which matters in practice. Here, for WaveNet conditioning, we don't really care, but in a coder where you actually run the filter 1/A(z), it had better be stable, otherwise it may make your ears bleed. So: the line spectral frequencies, the energy, and the voicing, just one parameter for voicing. Those are the conditioning features, and we fed them into WaveNet, quantized at 2.4 kilobits per second. To use this as a speech coder after training, we take the clean wideband speech; the codec itself is narrowband, so we have to downsample and limit the bandwidth to four kilohertz. We transmit the quantized parameters at 2.4 kilobits per second and use them to drive WaveNet as the conditioning, and the output speech actually comes back as wideband, because we trained it to generate 16 kilohertz wideband samples. That's the output of the codec. And this is just some data on what we did; as a codec it repeats what I said: take the parameters from codec2, all narrowband, but the output is wideband. For the ICASSP paper we trained this on Wall Street Journal: 123 speakers, with a disjoint test set of eight speakers. I think that's important to note, because prior to this work, WaveNets for TTS were all trained for a single voice, or a handful of voices, and when you generate speech in TTS you typically generate with the same speaker you trained for. When Aaron talked about his WaveNet a long time ago, the DeepMind people also said that of course one of the applications would be compression, but then you'd need to send a speaker ID as a conditioning feature. That doesn't really fly in a general telephony system: an unknown speaker needs to sound good too. Or, of course, you could make a register of all the speakers in the world, and then you'd only need a few bits to send the ID, but you'd need to store them all and train for all of them. So that is, of course, not very feasible.
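An aside on the LPC-to-LSF conversion mentioned above: the standard trick is to split A(z) into a symmetric and an antisymmetric polynomial whose roots all lie on the unit circle; the LSFs are the angles of those roots. A rough numpy sketch (my implementation of the textbook method):

```python
import numpy as np

def lpc_to_lsf(a):
    """LPC polynomial [1, a1, ..., ap] -> line spectral frequencies
    in radians, sorted and strictly inside (0, pi)."""
    a = np.asarray(a, dtype=np.float64)
    a_ext = np.concatenate([a, [0.0]])
    P = a_ext + a_ext[::-1]     # A(z) + z^-(p+1) A(1/z): symmetric
    Q = a_ext - a_ext[::-1]     # A(z) - z^-(p+1) A(1/z): antisymmetric
    roots = np.concatenate([np.roots(P), np.roots(Q)])
    w = np.angle(roots)
    # Keep one angle per conjugate pair, dropping the trivial roots at 0, pi
    return np.sort(w[(w > 1e-6) & (w < np.pi - 1e-6)])
```

The P and Q root angles interleave exactly when A(z) is minimum phase, which is why you can quantize LSFs and still guarantee a stable synthesis filter.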
That was the one thing we were worried about: does it generalize? Can we train on many speakers and then code other speakers? That remained to be seen. This system is also based on the first WaveNet, which used 8-bit mu-law, so it has limited resolution in the audio sample values; 8-bit mu-law is the standard for landline telephony. We send the conditioning at 100 Hertz, that is, a 10 millisecond update rate, and the network looks back 300 milliseconds. One cool thing is that we get automatic bandwidth extension: we don't transmit any information above four kilohertz, because the codec doesn't have it, but WaveNet has learned to use these features to regenerate the upper frequencies. Here you can see a wideband original and the codec2 synthesis of it: the codec2 parameters, decoded with codec2's own synthesis, will not generate anything above four kilohertz. But the WaveNet generator can recreate these fricatives; you see at about 1.3 seconds, that information is completely gone in codec2 but comes back. Actually, there is one at 0.7 seconds; yeah, what happens there? There's an error there. That's the other side of it: it's all probabilistic, so it can guess wrong sometimes, because it has to inpaint the upper band, and of course it can't always be correct. Sometimes it adds things, like at 0.7 seconds, that were not in the original, and that can occasionally be heard as phoneme errors: it recreates the wrong sound. Hello, Jan? Yes? There's someone in the waiting room, can you please let her in? How do I do that? Let's see. I don't see it. In the participants. Oh, I see. Admit. I can do that. Thank you. Sorry for the interruption. No, that's fine; I learn new things every day, this is the first time I'm using this program. Okay, so how does this sound? I'll let you listen to it soon, but typically in speech enhancement and speech coding you report numbers from objective metrics, which have supposedly been verified to correlate with subjective impressions. For speech it used to be PESQ, which has been replaced with POLQA, which is apparently better. POLQA is really good for waveform-matching coders, in the sense that it correlates with quality. But for generative synthesis at low bit rates, even for vocoders, it's not really suited, because the output waveform is so completely different that the models inside these objective metrics fail. We know, because we have listened to it, that the WaveNet codec sounds pretty good, and you'll see that in the objective scores. Using POLQA, you see here, the rate axis is the bit rate, and we compare with some others. We have MELP, which I talked about before, also at 2.4. And then we have Speex. Speex is an open-source CELP coder, so a waveform-matching coder, but you can actually drive it down to 2.4; it's not recommended to go below 4 or 5 because it sounds really bad.
But we needed it just as an example of how that sounds. We also have state-of-the-art wideband coding, AMR wideband, also a standardized codec, but driven at 23 kilobits per second. And this WW condition is if we don't use our coder's probabilities for generating signals at all, but only for driving an entropy coder: you can use the model's probabilities to losslessly entropy-code the original mu-law signal, and that lands around 42 kilobits per second; that's perfect, bit-exact mu-law reconstruction. The MOS scores from POLQA run from 1 to 5, or actually POLQA doesn't go above about 4.9, I think, at the maximum. And according to POLQA, our codec is only as good as MELP, which, I hope you will agree when you listen to it, is not really true. So what I want to say is that these objective metrics, for generative synthesis, you can throw them out of the window; we need a better objective metric for these kinds of systems, because this does not reflect reality. Listening tests are the only way to do it, and I think that holds in general, for speech enhancement and all of that: it's good to have objective metrics like SNRs and PESQ and so on, but at the end of the day our ears and brains are the final judge. I think that goes for anyone here doing TTS too: in the end you need a subjective test, because how it sounds is what matters, not objective metrics. So we did a MUSHRA test. It's not formally a MUSHRA, because formally you need a low-pass anchor, and since we have narrowband codecs, a plain low-pass signal would sound pretty good; it's basically what a perfect MELP would sound like. Instead we included the Speex condition as kind of the worst case, because when you do this you need an anchor to spread out the scale. That's why I call it MUSHRA-esque: it doesn't have a proper anchor, but it has a more relevant one, Speex, because that's really terrible. So here you can see the listening test results. The original doesn't score perfectly, because sometimes people can't really hear the difference, so it comes down a little. The mu-law condition is labeled WW-R there: 8-bit mu-law by itself would be 128 kilobits per second, and with the lossless entropy coding on top we get to 42 kilobits per second; that's the reference we strive for. AMR wideband is a good codec, and it comes up pretty high, but at a bit rate of 23 kilobits per second. If you look at the really low bit rate codecs, we have MELP and codec2 for comparison, and Speex as an anchor. Here you can also see that Speex, a waveform-matching CELP coder, is really bad at 2.4 compared to the parametric coders: both MELP and codec2 are better than Speex, and MELP is slightly better than codec2. We used codec2 because it's publicly available and open source, so other people can use it. The interesting thing is that our WaveNet codec, WP here, is up there; it's not as good as AMR wideband, but it's pretty high up there.
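To sanity-check the 42 versus 128 kilobits per second figures with a back-of-envelope calculation (mine, using the rates quoted here): an autoregressive model's conditional probabilities can drive a lossless entropy coder at an average rate of

```latex
R \;=\; f_s \,\mathbb{E}\!\left[-\log_2 q\!\left(x_k \mid x_{<k},\, c\right)\right] \ \text{bits/s},
```

so raw 8-bit mu-law at 16 kHz costs 16,000 × 8 = 128 kb/s, while the quoted 42 kb/s corresponds to an average code length of about 42,000 / 16,000 ≈ 2.6 bits per sample; the model removes the rest as predictable redundancy.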
And if you compare to codec2, again, we have the same encoder; it's just that we use WaveNet as the synthesizer, where codec2 has its own traditional synthesis. So basically, we were pretty happy with this result. Now let's hear what happens. Let me stop sharing and share the demo screen. Oops, sorry. Yeah, here it is. Found it; you have to bear with me. Do you see this one with the demos? Yes. Okay. All right. Let's first hear the original of one male speaker. 'Separately, New York State sold about $77.1 million of certificates of participation.' So that's 16-bit PCM. And here is 8-bit mu-law of that; that's what we base everything on, this is kind of our ground truth. 'Separately, New York State sold about $77.1 million of certificates of participation.' Yeah. All right. Let's first hear how good it can get: AMR wideband. 'Separately, New York State sold about $77.1 million of certificates of participation.' Okay. Now let's come up from the other side. So, Speex. 'Separately, New York State sold about $77.1 million of certificates of participation.' Yep, that's 2.4 kilobits per second. Here's MELP. 'Separately, New York State sold about $77.1 million of certificates of participation.' And codec2. 'Separately, New York State sold about $77.1 million of certificates of participation.' Okay. Now we take the same bit stream over the channel and drive WaveNet. Listen to this. 'Separately, New York State sold about $77.1 million of certificates of participation.' 'Separately, New York State sold about $77.1 million of certificates of participation.' And the other sentence: 'In the efforts to restore market confidence, administration officials have emphasized that the economy's fundamentals remain sound.' 'In the efforts to restore market confidence, administration officials have emphasized that the economy's fundamentals remain sound.' And AMR wideband to compare with: 'In the efforts to restore market confidence, administration officials have emphasized that the economy's fundamentals remain sound.' Okay, any questions about that? Hello Jan, one more thing: I think there are still people in the waiting room, can you please let them in? Participants; now I know how to do it, so it's easy. Thank you very much. All right, I had a question about that, can you hear me? Yes. So these speakers are unseen speakers, right? Yes. Okay, thank you, it sounds pretty good. They are taken from the same database, though; that's a caveat, because you always have to be a little critical. It's Wall Street Journal, so they're completely different speakers and different recordings, but recorded in the same way, the same type of microphones and studio conditions, and the texts they read are similar, so the prosody and all that could be the same. That being said, given these very good circumstances, I think it's pretty good. This is what drove us, and others, to continue in this field. So did you try it with other recording conditions? Let me talk about that a bit later. Okay, thank you. All right, any more questions? We have a long day. All right, then I'll go back to the presentation. Okay, the question I got was a really good one; it's perfectly synced with my next slide. So, we were really happy with these results, and others were too, so: let's build a practical speech coder and put it in the phone.
Unfortunately... the WaveNet we used for this was not an open-source one; it was DeepMind's, and DeepMind's own proprietary implementation is the best one I have ever heard. There is some secret sauce in there that I don't know about, but it sounds much better than any other implementation I've heard. So we used theirs, drove it with our own conditioning, and had to hack the code a bit, but anyway, it's the best WaveNet implementation I know of. It is extremely complex. Even with the independently implemented versions, you can't really get the complexity low enough to run in real time on anything; even with GPUs I'm not sure... actually, I'm pretty sure you can't do it. So WaveNet is a good way to see how good the system can get, and hopefully one day we'll have the computing power, or someone will be clever enough to make something as good as WaveNet that is less complex.

People have tried to reduce WaveNet's complexity. The DeepMind folks came up with Parallel WaveNet, where you train a big model and then distill a smaller one that you can actually use; maybe someone talked about that earlier, I saw it mentioned in the chat. Others are WaveRNN, also from DeepMind, and SampleRNN. These replace the big convolutional network, an FIR filter in my way of thinking, with an IIR filter: a recurrent net instead of a convolutional net.

So complexity is one major practical issue: you have to run in real time, interactively, on both sides. Another thing you need in a practical situation is robustness, because you very, very rarely have clean studio-quality speech; you have to cope with background noise. Then there is the recording chain: in a phone especially, you have the microphone, nonlinearities, clipping and so on coming from the hardware and the various processing stages. And of course, as we did here, there are multiple talkers and languages. We showed that multiple talkers are not that big an issue; languages can be a problem, but mostly you need to figure out how to cope with different talkers, and especially different pitch; the prosody and pitch structures are problematic, and sometimes the model can't really cover them. These first pilot experiments were about how well you can do at all, what we strive for; if we want to make something usable, we have to address these other problems.

And since I have it here, we can listen to what happens if you try to code something that is not perfectly clean speech. Let me stop sharing and share the other screen; I'm becoming really good at this now. Do you see "WaveNet conditioning features from noisy audio"? Yes? Okay, the first example. This is an extreme case, but it can happen: you're at home during COVID-19, having a video chat from your living room, and it sounds like this. It's artificial in the sense that you don't typically sit and read the Wall Street Journal over that kind of noise, but still.
Nevertheless, let's listen and see what happens with our WaveNet codec. Since this WaveNet can only do clean speech, it tries to match this input with clean speech. Let's try... So that didn't go so well. And now a truly extreme example, because there is no clean speech here whatsoever; it's pure music. You get the drift: this is how the WaveNet coder tries to mimic it.

Did that answer your question about how it sounds with other kinds of recordings? Yes, I think it's amazing. Well, it is what it is; you get what you wish for. It's a fantastic speech synthesizer for clean speech, and it can extract pitch and voicing for speech sounds; trying to get a speech pitch out of a multi-pitched orchestra will fail, and the same goes for background noise. That is the big problem. So there is work remaining before we can put it in a phone.

(Hi, I have a question. First of all, maybe you've built a converter between classical and contemporary music, right? That's what it's doing. But have you tried changing the database, adding noise for example? I'm going to talk about that in a few slides. Okay, and another quick one: low complexity is of course also part of the spec, right? You cited some lower-complexity models, but aren't those still a couple of orders of magnitude away? Yes, they are; SampleRNN still is, I'll get to it. You're right. Okay, thanks.)

So first I'll talk a little about getting the complexity down, and at the end about making it more robust to background noise. As that excellent question noted, some of these models are lower-complexity than WaveNet, but still pretty high. One of them is SampleRNN, which was used for coding by a group who presented it last year. It is lower complexity than WaveNet, but they did not try to make it real time either; they examined a different type of architecture than we did. SampleRNN is layered, hierarchical: a kind of learned downsampling, or upsampling in this case, with different timescales in different tiers; there is one RNN in each tier, three tiers here, and they can be big ones. The interesting part, compared to what we did, is that they used traditional LPC vocoder parameters rather than Codec 2 features, and they showed something similar: you can use almost any kind of parameters to drive a generative net. I think it's pretty neat work. A nice property of this codec is that, thanks to the hierarchical timescales, it can scale across bitrates, so it can actually sound better at higher rates; it can come up to around three or four kilobits per second, sounding a little better than what the WaveNet system could do. That's one of the advantages of SampleRNN: you can scale it. Anyway, it is still too complex; and kudos to the authors, they were not addressing complexity here, it was more about making a versatile codec.
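To get a feel for why even the reduced models remain heavy, here is a rough back-of-envelope sketch for a WaveNet-style dilated-convolution stack; the block counts, kernel size and channel width below are illustrative guesses, not the numbers of DeepMind's or anyone else's implementation:

```python
def wavenet_stats(n_blocks=3, layers_per_block=10, kernel=2, channels=64):
    """Receptive field and rough multiply-adds per generated sample for a
    stack of dilated convolutions with dilations 1, 2, 4, ... per block."""
    dilations = [2 ** i for i in range(layers_per_block)] * n_blocks
    receptive_field = 1 + sum((kernel - 1) * d for d in dilations)
    # Very roughly kernel * channels^2 multiply-adds per layer per sample.
    macs_per_sample = len(dilations) * kernel * channels * channels
    return receptive_field, macs_per_sample

rf, macs = wavenet_stats()
print(rf, macs)         # 3070 samples, ~246k MACs per sample
print(macs * 16000)     # ~3.9 GMAC/s at 16 kHz: why real time is hard
```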
WaveRNN, though, I think you heard about that too, does lend itself to lower complexity; you can drive it pretty low, and the original WaveRNN paper shows it can go really low without losing too much quality. You will lose some: I've tried the best WaveRNN and the best WaveNet, and unfortunately WaveRNN never comes up to exactly the same quality, but it gets close. In WaveRNN you replace the entire convolutional net with one single GRU, which is a complexity reduction even compared to SampleRNN. Another trick to get the complexity down is sparse weight matrices: many of the multiplications are by zero, so you can omit them, and you don't have to store those weights either. The quality improvement is that they went from 8-bit to 16-bit output. With the softmax representation from WaveNet you have a softmax over all 256 mu-law levels; for 16-bit resolution that is infeasible, because it would be a histogram with 65,536 bins, a little too complex. So they split the sample into coarse and fine parts and do a softmax for each: one for the coarse quantization and one for the fine.

I don't know that we need to go through the update equations, but one thing: in the original WaveRNN paper the equations make no mention of conditioning, which in reality you need. In my version of the equations, the input vector contains not just the past samples but also the conditioning features. In its general form, this can get you very low in complexity.

So now, finally, the next topic is one of the things I was involved with, LPCNet, which takes the complexity down to real-time operation on an Android phone or an iPhone; but you need to do a little more than WaveRNN. In LPCNet we go back to linear-predictive-coding thinking. We know from all past experience, since the 50s, that with ten LPC coefficients, maybe a few more for wideband, you can easily model the spectral envelope of speech by linear prediction, just taking a linear combination of the past samples. So you don't need to spend expensive neurons on modeling the spectral envelope. In LPCNet we let signal processing take care of that, which is simple, and let the network focus on generating the excitation: in LPC source-filter thinking, we basically run a WaveRNN on the residual, the linear-prediction residual. It was presented at ICASSP a couple of years ago; I don't know if anyone earlier this week talked about it, but it has had some resonance in the literature, and I've heard people using it for TTS, where it sounds pretty good. So it works; we're going to use it for coding later on here, but even for TTS it works fine. Besides lifting out the spectral-envelope modeling, we put in a few other things to make it sound better and to make the network's job easier.
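The coarse/fine trick mentioned above is simple bit arithmetic; here is a minimal sketch (the helper names are mine):

```python
import numpy as np

def split_coarse_fine(s16):
    """Split signed 16-bit samples into two 8-bit parts, as in WaveRNN's
    dual softmax: two 256-way softmaxes instead of one 65,536-way one."""
    u = (s16.astype(np.int32) + 32768).astype(np.uint16)  # unsigned 0..65535
    return (u >> 8).astype(np.uint8), (u & 0xFF).astype(np.uint8)

def combine_coarse_fine(coarse, fine):
    u = (coarse.astype(np.int32) << 8) | fine.astype(np.int32)
    return (u - 32768).astype(np.int16)

s = np.array([-32768, -1, 0, 1, 32767], dtype=np.int16)
c, f = split_coarse_fine(s)
assert np.array_equal(combine_coarse_fine(c, f), s)   # lossless round trip
```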
One is pre-emphasis. Pre-emphasis has long been used in speech analysis to flatten the spectral tilt, which helps the LPC track the formants rather than the tilt itself. You can do it adaptively, but you can also just fix it, because in general all vowels have a downward slope, and you can take an average slope over everything. What happens then is that you emphasize the higher frequencies, which gives a better fit on the pre-emphasized speech; and when you de-emphasize back to the speech domain, you low-pass filter slightly, which means the noise added by the 8-bit mu-law representation gets noise-shaped so that it is reduced at the higher frequencies. That is actually audible; I felt it was a good improvement.

Another addition is mostly a computational one: an input embedding. Instead of trying to predict or represent the raw mu-law values, the input is embedded so that the matrix products can be precomputed; it's a simplification to make things faster. The bonus is that, through this embedding, the network will effectively learn something like the mu-law nonlinearity itself.

And here is the system. It is based on two networks around a standard but simplified WaveRNN. There is a sample-rate network that generates the output audio sample by sample, and a frame-rate network for the conditioning, which operates once per frame: for each speech segment it produces a feature vector, held constant over the frame. The features are extracted and, for coding, quantized; if they are unquantized, that's fine too, as for TTS, where you don't need quantized features. From the features you compute the LPC coefficients and use them to form a prediction from the past samples. The conditioning vector is upsampled through convolutional layers and two fully connected ones. It is concatenated with the prediction p_t, which is just the linear combination of the past samples s_{t-1}, ..., s_{t-16}; the LPC order here is 16. The last sample s_{t-1} is also used explicitly, just to make it easier for the network, along with the last excitation, the residual, fed back down. All of that goes into two GRUs, one big main one and a smaller one, then what we call dual fully connected layers, and a softmax as in the original WaveNet, or WaveRNN with only the coarse part. As always, the network gives us a probability distribution, which we sample to get the excitation. That excitation is recursively fed back as input, and it is also added to the prediction from the past output samples, and that gets us the new sample.

That was the speech synthesis part. For coding, again, we need features, and we need to code them. After the original LPCNet paper we actually built the codec, presented at Interspeech last year, and we got it down to 1.6 kilobits per second.
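The per-sample loop is worth seeing in miniature. Below is a sketch of one LPCNet-style synthesis step, with a stand-in callable where the network would sample the excitation; the coefficient values, helper names and the toy "network" are mine, but the split of labor, linear prediction for the envelope and the network for the residual, is the one described above:

```python
import numpy as np

rng = np.random.default_rng(0)

def lpcnet_style_step(history, lpc, sample_excitation):
    """One synthesis step: predict from the past, let the 'network' supply
    only the excitation, and add the two.

    history: the last len(lpc) output samples, most recent first
    lpc:     LPC coefficients a_1..a_16
    sample_excitation: callable(prediction, last_sample) -> e_t, standing
                       in for sampling the sample-rate network's PDF
    """
    p_t = np.dot(lpc, history)                  # spectral envelope part
    e_t = sample_excitation(p_t, history[0])    # network models the residual
    return p_t + e_t                            # reconstructed sample s_t

lpc = np.full(16, 0.05)                         # placeholder coefficients
hist = np.zeros(16)
for _ in range(5):
    s_t = lpcnet_style_step(hist, lpc, lambda p, s1: rng.normal(scale=0.01))
    hist = np.concatenate(([s_t], hist[:-1]))   # shift the history
```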
The features used in this codec version of LPCNet are extracted every 10 milliseconds, but in order to reduce the bit rate we pack them into 40-millisecond packets: the transmission frame goes out every 40 milliseconds, with all four feature frames packed together. The features are a cepstral representation of the spectral envelope, and the pitch period. There is an error on this slide: the gain is in effect covered by the cepstrum, and what's missing is the voicing, which is a correlation, a pitch gain basically: how periodic the signal is. And how do we get the pitch? A simple cross-correlation in the residual domain, over five-millisecond intervals. Anyway, the quantization is the important part here.

We quantize the average pitch over the entire packet. Forty milliseconds is a long time for only a single pitch value, so we need to be able to modulate or change the pitch within the packet; we spend three bits on a modulation, on how the pitch can vary during those 40 milliseconds. Then two bits go to the gain, the periodicity, the voicing value. The spectral envelope is coded through the cepstrum, and the cepstral coefficients are obtained from non-uniform bands on the Bark scale, which resembles the perceptual way of representing the spectrum. So we take 18 non-uniform spectral bands and get the cepstral coefficients from them through an inverse DFT. Those cepstral coefficients are quantized in a somewhat sophisticated way: we quantize the last frame in each packet with a fixed quantizer, then predict or interpolate the other frames, back and forth with different techniques, all of it with vector quantization; I don't need to go into the details. At the end of the day, the bit allocation for this coding scheme totals 64 bits per 40 milliseconds, which works out to 1.6 kilobits per second.

There are some practical training details too. One training mechanism we used to make it sound better was to add noise to the input. Because this is based on linear prediction, at synthesis time the network should have seen its own previously coded samples, but in training with teacher forcing it sees the true ones, and those will deviate after a while; so we add some noise to compensate a little for that. Another thing we noticed, when you do this for coding: start by training with nothing quantized, unquantized features, then freeze the actual WaveRNN part and retrain only the conditioning stack with quantized features. That is what sounds best in the end; just a tip if you're ever going to do this.

Now the more interesting thing: we tried this quantized LPCNet on different systems. On regular x86 CPUs under Linux it works very well: we get something like 14 to 20 percent of one core for real time, which is good. And on phones with NEON, Snapdragons from two different generations, a Pixel 3 and a Galaxy S10, it should also be fine.
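For intuition about these spectral features, here is a heavily simplified stand-in for the Bark-cepstrum computation: log energies in 18 roughly Bark-spaced bands, then a DCT to decorrelate them. The band edges and the Bark formula (a common analytic approximation) are illustrative; the codec's exact bands differ:

```python
import numpy as np

def bark_cepstrum(frame, sr=16000, n_bands=18):
    """Toy Bark-band cepstrum: band log-energies -> cepstral coefficients."""
    spec = np.abs(np.fft.rfft(frame * np.hanning(len(frame)))) ** 2
    freqs = np.fft.rfftfreq(len(frame), 1.0 / sr)
    bark = 13 * np.arctan(0.00076 * freqs) + 3.5 * np.arctan((freqs / 7500) ** 2)
    edges = np.linspace(bark[0], bark[-1], n_bands + 1)
    log_e = np.array([
        np.log10(spec[(bark >= lo) & (bark < hi)].sum() + 1e-10)
        for lo, hi in zip(edges[:-1], edges[1:])
    ])
    n = np.arange(n_bands)
    dct = np.cos(np.pi / n_bands * (n[:, None] + 0.5) * n[None, :])  # DCT-II
    return dct.T @ log_e

print(bark_cepstrum(np.random.randn(320)).shape)   # one 20 ms frame -> (18,)
```

And the rate arithmetic is as simple as it sounds: 64 bits per 40 ms packet is 64 / 0.040 = 1600 bits per second.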
They run it comfortably, and we also have an iPhone; it works there too. I don't remember the number, but it's clearly real time. For fun we also tried a Raspberry Pi; that was not real time, but pretty fast anyway. So that's the motivation for doing it this way: it really does run fast on a CPU. Any questions? Okay.

How does it sound compared to the others, then? First, take LPCNet unquantized; this is just the synthesis part, feature-driven synthesis. Nowadays we typically do a MUSHRA-like test. At Google we have crowdsourced listening tests, mechanical-Turk style, so you can easily get a hundred or more listeners rating all the samples. Of course you have to be careful with crowdsourcing: you have to filter, to do a lot of post-processing of the results. But at the end of the day you can see here that the confidence intervals are pretty small, so I strongly believe these are accurate, reliable results. On the left we have the comparison with the reference, and mu-law of course, which sounds pretty good. The horizontal axis is the size of the network: depending on how much you prune it, a big pruned network and a small unpruned one can come out the same, so to compare them all we call it dense equivalent units. We compare LPCNet to what we call WaveRNN+, though you could equally say WaveRNN plus-minus. Minus because we are not using the two-stage coarse/fine resolution, so the 8-bit mu-law sets how good it can sound, and there is no prediction; it's basically our version of WaveRNN. Plus because we have the noise shaping, which I think sounds really good and might not be in regular WaveRNN, and because with the low-rate 8-bit setup we can do the sparseness tricks and precompute a lot of the matrices, making the computation very efficient. And you can see here that, given the same dense equivalent units, using linear prediction to take care of the spectral envelope improves things. That's the motivation for LPCNet: you either get the quality up at the same complexity, or the same quality at lower complexity.
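The sparseness point can be made concrete too. Here is a sketch of block-sparse weights, where zero blocks are neither stored nor multiplied; the 1 x 16 block shape and the pruning rule are my illustrative choices, not necessarily what the implementation does:

```python
import numpy as np

def prune_blocks(W, block=16, keep=0.1):
    """Keep only the `keep` fraction of 1 x block weight blocks with the
    largest magnitude; the rest are dropped entirely."""
    rows, cols = W.shape
    blocks = W.reshape(rows, cols // block, block)
    energy = np.abs(blocks).sum(axis=2)
    thresh = np.quantile(energy, 1 - keep)
    return [(r, c, blocks[r, c].copy())
            for r in range(rows) for c in range(cols // block)
            if energy[r, c] >= thresh]

def sparse_matvec(sparse_blocks, x, rows, block=16):
    """Multiply-accumulate only the stored blocks; zero blocks cost nothing."""
    y = np.zeros(rows)
    for r, c, blk in sparse_blocks:
        y[r] += blk @ x[c * block:(c + 1) * block]
    return y

W = np.random.randn(384, 384)
y = sparse_matvec(prune_blocks(W), np.random.randn(384), rows=384)
```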
Then the coding part. When we quantize this, we compare to the reference, which is... I think it's mu-law, actually... no, I'm sorry, the reference is the original. We have two test sets. We trained on the NTT database, which has a designated test set that is disjoint: different speakers and all that, but again the same type of recording conditions. That's the first set. We also took another set, from the Opus codec: the IETF Opus has test vectors for compliance testing, and we took those. Funnily, it sounded better on the disjoint test set from the completely different recording, so apparently the Opus set is easier to code. More importantly, here we compare, from the left: unquantized LPCNet; Opus, which is a waveform-matching coder, run at nine kilobits per second; our LPCNet codec at 1.6 kilobits per second; MELP, as in the earlier slides, at 2.4; and Speex wideband at four kilobits per second. Speex wideband at four is almost as bad as the Speex narrowband at 2.4 we had before, but again it serves as an anchor. And LPCNet at 1.6 sounds good: much better than MELP, and getting close to Opus at nine, though not quite there yet.

I'm also going to talk about another way of using LPCNet, which we presented at Interspeech last year. Even when you have an existing codec, say an Opus decoder, a waveform-matching hybrid coder, you can do the same thing: take its quantized features from the bitstream and, instead of driving its own synthesis, use them to drive a generative synthesizer. The new thing here is that we are conditioning on a waveform-matching codec, not only a parametric encoder. We use LPCNet, and we also compare against driving a WaveNet as before, just to get an upper limit on how well we can do with Opus features; remember we could do really well with Codec 2 features, so hopefully we also do well with waveform-matching codec features. The reason it might not go so well is that, as you remember from analysis-by-synthesis, the analysis is done through the synthesizer in a feedback loop; if you then don't use the synthesizer the signal was analyzed with, there might be a mismatch. So let's cross our fingers. We take the conditioning features from the Opus bitstream, and only a few of them: some pitch, some spectral envelope, and some power parameters. Same thing here: a MUSHRA-style test with crowdsourced listening. The three conditions in the middle all use the same features, the same bitstream. If we run Opus at six kilobits per second and let Opus itself generate the speech, it's pretty bad; that's basically where Opus breaks down for wideband. If we use the same bitstream to drive a WaveNet, which of course is not implementable in the real world, you get high quality: actually better than Opus run at nine kilobits per second. A good trade-off is to use LPCNet, which is actually implementable and runs in real time; with these features you land somewhere in the middle. So this is an example of how you can use LPCNet as a post-processor for almost anything: take a CELP coder or whatever, and if you have the processing power, you can choose to make it sound better than its own synthesis.

All right, I'm going to spend the last 10 or 15 minutes, before we get to questions, on an approach we've taken for noise, because if we want to build a real codec, we have to address how to get rid of noise, or cope with it. I'll talk about one approach here, where we focus purely on getting robustness.
Again we use WaveNet, because we want to know whether this approach is a good one or not, and we know it cannot become implementable; we disregard complexity for this study. To my ears, the best synthesizer I have ever put my hands on is DeepMind's implementation of WaveNet, so I'm going to keep using it. This was at Interspeech 2019; the year on the slide is an error.

So, the thing is, we know these generative models perform best when synthesizing a single source. The WaveNet people have shown you can use WaveNet not only for speech: you can do piano, single instruments. But it's for a single source, like one speaker, that it is really good. We want to synthesize clean speech, because we know it's good at that; but how do you do that in a noisy scenario? The idea is to get features that look the same regardless of whether the input is clean or noisy speech: to extract features that are noise-robust. That is what we do in this codec. For instance, take as input a set of recordings of the same clean speech with different types of noise added: a set of perceptually equivalent signals, especially in terms of intelligibility. We then want features that disregard the background noise and capture only the essential features of the speech. That's the goal, at least.

To do this, we have a set of identical networks. We call them clones, because they are identical in the sense of their weights. You may have heard of Siamese networks; this is Siamese but with several twins, so clones, not triplets or quadruplets. Each of these identical networks gets a different input, but the inputs all represent essentially the same thing, and we hope the networks can learn to extract features that come out the same for all of them. This is a learned feature extraction network: instead of man-made spectral envelope, pitch and so on, we hope the system can learn the features itself. For that we need training losses. One loss tries to force the features from each network to be similar; that's the first distortion term. Then we force the features toward a chosen distribution: we want them Laplacian and factorized, because that encourages disentanglement, so you can control different things with simple individual features; for that we use a maximum mean discrepancy loss, which you can learn about elsewhere. Then there is the output of the decoder: this looks very much like an autoencoder, right? You encode the input to a latent feature, decode it, and want the decoded representation to be as similar as possible to the input; I would call that the autoencoder loss. We also add some noise to the latent layer during training, which will hopefully give us a smooth mapping.
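To make the three losses concrete, here is a forward-only numpy sketch of how they could be computed from a batch of clone latents; the kernel bandwidth, reference-sample count and exact functional forms are my assumptions, and the paper's definitions may differ:

```python
import numpy as np

rng = np.random.default_rng(0)

def similarity_loss(latents):
    """latents: (n_clones, dim) features for the same underlying clean
    speech under different noises; penalize spread around their mean."""
    return ((latents - latents.mean(axis=0)) ** 2).mean()

def mmd_loss(z, n_ref=256):
    """Maximum mean discrepancy (Gaussian kernel) between the latents and
    samples from a factorized Laplacian prior."""
    ref = rng.laplace(size=(n_ref, z.shape[1]))
    def k(a, b):
        d = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-d / (2.0 * z.shape[1]))
    return k(z, z).mean() + k(ref, ref).mean() - 2 * k(z, ref).mean()

def reconstruction_loss(decoded, target):
    """Plain autoencoder-style reconstruction term."""
    return ((decoded - target) ** 2).mean()

z = rng.normal(size=(8, 12))   # eight clones, twelve latents, as in the talk
print(similarity_loss(z), mmd_loss(z))
```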
Some details: we don't feed in the speech waveform itself; we extract features first, the log-mel spectral coefficients, and map those to another set of features. The encoder network is a dilated convolutional network plus a few fully connected layers, and from those log-mel inputs, 160 or 360 of them depending on the configuration, we try to learn 12 latent features. Then there is a decoder that looks the same but reversed; the structure is very much an autoencoder, except, remember, we have clones of these autoencoders. Quantization is done in the standard way for high-rate codecs: uniform scalar quantizers, and after quantizing each latent variable with a scalar quantizer, you entropy-code it; everything is scalar. We haven't actually implemented the entropy coding, but we have an upper bound; we know what the entropy coder would give us, within about a quarter of a bit of the maximum. Then it's the same picture you've seen many times now: take the quantized speech features, drive WaveNet, get output speech. The differences from the original are that instead of 8-bit mu-law we use 16 bits here, to get even better quality, and we're not using the coarse/fine structure with a softmax; we use discretized mixtures of logistics, like the DeepMind guys did in PixelCNN: you just have a simple PDF and discretize it.

Some numbers, I don't know how interesting: we use eight clones; that's important to know, maybe. And the bit rate comes out at about two kilobits per second; it's variable bit rate, approximately 1.8. To compare, we went back to the setup of the first paper, but this time with MELP instead of Codec 2, just for the hell of it: take another vocoder, extract its features, and feed the same WaveNet. To keep things comparable we changed the frame size a little, so the MELP rate went to 2.2 to 2.7 kilobits per second instead of the standard 2.4. One perhaps interesting thing: the learned features turn out to be pretty independent of each other, low correlation, and you can see some almost linguistic interpretations here. Feature one seems to be a general power or gain feature; feature two seems to be the power in the upper band versus the lower band, a high-band/low-band spectral feature: you can see it's very low on the voiced segments and high on the fricatives, where the energy sits at high frequencies. I don't know if that's interesting; again, the listening tests are the most interesting thing.

So, crowdsourced raters again. The conditions are clean, 10 dB signal-to-noise ratio, and 0 dB signal-to-noise ratio, with street and cafe noise and such that we took from Freesound. Let's start with the clean case: even for clean speech, the clones' learned features are better than the man-made MELP ones, and that holds all the way across conditions; it's always better than MELP, which is good. Another interesting thing, at 0 dB I think, is the condition where we include the noisy signal itself as one of the anchors.
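The discretized logistic is worth writing out, since it is what replaces the softmax here. The probability of an integer sample is the logistic CDF integrated over its quantization bin (ignoring the edge bins, which a full implementation treats specially); this follows the PixelCNN-style formulation, with illustrative values:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def discretized_logistic_prob(x, mu, s):
    """Probability mass of integer sample x (e.g. 16-bit: -32768..32767)
    under a logistic with mean mu and scale s, integrated over the bin
    [x - 0.5, x + 0.5]. This sidesteps a 65,536-way softmax."""
    return sigmoid((x + 0.5 - mu) / s) - sigmoid((x - 0.5 - mu) / s)

def mixture_prob(x, weights, mus, scales):
    """A mixture just sums weighted components; the network outputs the
    weights, means and scales."""
    return sum(w * discretized_logistic_prob(x, m, s)
               for w, m, s in zip(weights, mus, scales))

print(discretized_logistic_prob(0, mu=0.0, s=2.0))  # mass of the bin at 0
```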
There, the clones come out better than the noisy original. Since the network learns clean features and then synthesizes clean speech, it effectively does a little bit of speech enhancement. So: somewhat okay results, I feel. And I think that concludes my talk; I'm open for questions.

Okay, thank you, Jan. Thank you for all this effort, after a long trip back from Europe with jet lag; we appreciate it. (Oh, actually, that's not the title; let me go back to the first slide. I changed the title because I forgot what I had given Jan.) So, questions. Because many people are not available right now, it's too late for them, we also use Slack: if you are in the Slack channel for SPCC you'll see the days listed, and people can push questions there. So you might get questions through the Slack channel if there are none now. Let's see if someone raises a hand; there should be a raise-hand button under Participants.

Can I ask a question without raising a hand? Sorry about this. I missed half an hour of the talk after WaveNet, so maybe you touched on this, but I want to ask a more conceptual question. My impression is that you didn't say much about rate-distortion theory, and I think I understand why, in a sense. Given your experience, my question is: do you think rate-distortion theory has helped us build better coders, or has it always been more a method of comparison?

It's been really good for the theory, for getting limits: figuring out how much you can possibly do and what's reasonable. But at the end of the day you always have to build something, and rate-distortion and high-rate theory are mostly tools; they help for insight, but for the actual design, not so much.

So it basically starts with an engineering idea, then you build it, test it, and check how close you are to the limits? It depends. High-rate theory is well known and actually shows up in codecs. In Opus, for instance, and SILK, the concept of uniform quantization followed by entropy coding works and is very straightforward. And since scalar quantization is theoretically always worse than vector quantization, you can do a little better by using lattice quantization, and that is what's done in Opus. The entropy coding is essential, and it's very simple. So that part, high-rate theory, has been useful even for design. But rate-distortion theory proper is for limits, because that's what it does: if you have infinite delay, how well can you do. Thanks.

Other questions? Hi, I would also have a question. If I understand well, the design choice for dealing with noise is to get rid of it, right? Yep. Is that because it would be too hard to also model it? Yes. The thing is, speech sounds, as we've seen here, live on a very narrow manifold in the high-dimensional signal space.
With this kind of learning you can learn that manifold and project onto it, down into the clean-speech signal domain. Background noise, however, makes the manifold bigger, and so far, in my experience, that is harder to capture when you try to learn the representation. What happens is that if you train with noisy targets, if the teacher-forcing target is noisy speech and you try to generate noisy speech, you do get slightly better at noisy speech; but the cost is that, because you can now generate noisy speech, you're not so good at generating clean speech anymore. If the input is very clean, you will still tend to generate somewhat noisy speech. You can't get both.

So in some sense the model is already saturating on clean speech: it's good at it, but if you ask it to do more, it has to get worse? Yes, it saturates in the sense that it learns clean speech very well, especially WaveNet, which can really learn the speech manifold; that is exactly why its quality is so high. Noisy speech through WaveNet comes out as babble; through WaveRNN and LPCNet it's babble too, but less severe, because they trade off a little of that clean-speech modeling. If you compare WaveNet and LPCNet, and I might send some samples for the hands-on session, LPCNet is good but WaveNet is better. And a dog barking in the background is going to be really hard to handle. So the trade-off is that you lose clean speech if you want to capture noisy speech, and so far, and this is something we're looking into, and anyone who wants to build a real codec has to look into, the best approach is to get rid of the noise. Thanks.

Okay, there is one more question. Yes: what about speech that is not noisy but poorly recorded, for example with a phone microphone or something like that? Did you try that with WaveNet? That's actually partly handled in the LPCNet training: we randomize the spectral shape, so there is a small random filter applied in front of all the speech when we augment the data by filtering it. So LPCNet is more robust to that, and even for WaveNet it's not so crucial; I don't think any kind of linear filtering is really problematic. Okay, thanks.
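As a closing illustration, the spectral-shape augmentation just mentioned can be as simple as running each training utterance through a small random filter. The first-order pole-zero form below is my illustrative choice, not the exact filter from the LPCNet training recipe:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_spectral_tilt(x):
    """Apply a random first-order pole-zero filter so the model stops
    relying on one fixed microphone/recording response."""
    b = np.array([1.0, rng.uniform(-0.4, 0.4)])   # random zero
    a = np.array([1.0, rng.uniform(-0.4, 0.4)])   # random (stable) pole
    y = np.zeros_like(x)
    for n in range(len(x)):                       # direct-form I filtering
        y[n] = b[0] * x[n] + (b[1] * x[n - 1] if n else 0.0) \
               - (a[1] * y[n - 1] if n else 0.0)
    return y

augmented = random_spectral_tilt(np.random.randn(16000))  # 1 s at 16 kHz
```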