Hi, hi all. I am Sivanand, I am with the CDTTS group at Apple. Thanks, Yannis, for such a kind introduction; I am really humbled by the invitation to SPCC and very grateful to Yannis for inviting me. And thanks also to Vasili for sharing his first session with me, so that I can go back to sleep as early as possible. So thank you.

What I would like to do in this talk is give a very gentle, basic introduction to neural vocoders. The brief outline of the talk is as follows. We will look at the context of neural vocoders through neural TTS, and at the general pipeline of neural text-to-speech systems, because a large part of the study of neural vocoders has happened through the text-to-speech application, so it benefits us to look at them through the neural TTS pipeline. Then we will see vocoders as an independent area and how they have evolved from the signal processing vocoders of the 1960s, 70s and 80s through to some of the latest developments. I will talk mainly about one vocoder, WaveRNN, from the family of autoregressive neural networks, and another vocoder from the non-autoregressive family, which we refer to as WaveGAN. This will give us some background on how progress in neural vocoders has gone in recent years.

Now let us look at how the neural text-to-speech pipeline looks these days. For the folks who are familiar with the unit-selection framework, or the old concatenative speech synthesis framework, you are aware that there are lots of modules: text processing, and then modules associated with tokenization, phrase-break prediction, pronunciation, and so on. There is a huge number of modules associated with what we refer to as text analysis. And then there are similarly other pipelines associated with the backend, which essentially deal with waveform concatenation and other acoustic features, such as spectral features, duration features, and so on. What happened with neural text-to-speech, as you can see on the slide, is that the whole problem was essentially boiled down to taking text and converting it into high-level spectral features, for instance the spectrum, or, to be more specific, mel-frequency spectral features, and then taking those high-level spectral features and converting them into a waveform. So, as you can see in the bottom figure, there is a neural front end, which folks conversant with the text-to-speech application know is typically referred to as Tacotron. And the other part is the neural backend, which is our current interest, also called the neural vocoder, which takes the spectral information and converts it into a waveform.

Now, ideally, at this point I would have loved to show something from the latest developments in neural vocoders. However, because of one of those latest developments, I am going to show you an antagonist rather than a protagonist; you will understand what it is and why in a moment. This is a new paper, which I am sure some of you have seen, from the DeepMind folks, which appeared a month ago. It is called end-to-end adversarial text-to-speech, or, in short, they refer to it as EATS.
Now, the reason why I am bringing this up in this particular context is that in this paper the authors take a sequence of characters, or a sequence of phonemes obtained by passing the characters through a grapheme-to-phoneme conversion module, and directly predict a waveform at the output. That is the reason why I have added this particular subtitle to the title EATS: Eating Away Vocoders in TTS. The reason I want to highlight this is that in the text-to-speech application we see more and more publications going in this direction. Folks familiar with the WaveNet and WaveRNN literature may recall that in the original WaveNet and WaveRNN papers the input features are actually linguistic features and the output is a waveform. So essentially, most recent papers are going in the direction where we don't think of an explicit vocoder at all: there are input text features and there is an output speech waveform, we predict the waveform directly from the text features, and hence the vocoder part is eaten away, consumed directly into the text-to-speech pipeline within a single neural network. You can see here: there is text at the input, there is a sequence-to-sequence neural network, and there is speech at the output. That is what it has all been consumed into, a single network.

But, as was beautifully pointed out earlier by Yannis, those of you still pursuing neural vocoders as your PhD or master's thesis need not panic, because, as you have seen in Yannis's introduction, vocoders have applications not just in TTS, although the recent research on neural vocoders has been heavily driven by the text-to-speech application. They find applications in speech coding, in enhancement, in voice conversion, and, for that matter, in music generation, or anything that involves tweaking parameters of a speech waveform. Whenever you want to manipulate or have some control over speech generation, we see that neural vocoders come into the picture. However, if you are thinking of using neural vocoders for the text-to-speech application in particular, then I think we have to keep this particular trend of research in the back of our minds. So I wanted to highlight that a bit before we delve into the actual vocoder part, and hence this slide.

Now we will get to the actual vocoding problem. Let us first try to understand it; to my knowledge, most of the people who have joined the call today are very conversant with the problem of vocoding, but for folks who are not, this background might still benefit you. In case there are folks who have not had breakfast or coffee, I would suggest, since this elaborate introduction is mostly known to all of you, that you take a break and sit back and relax. But in case you are not aware, this will probably get you started on what the problem of vocoding is. On this slide I have two pictures: on the top is the magnitude spectrum and on the bottom is the phase spectrum, and these spectra are computed, typically, using the short-time Fourier analysis tool. There are many ways, given a speech signal, to compute the spectra; you can use several types of filter banks, but the most prominently used tool is short-time Fourier analysis.
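To make this concrete, here is a minimal sketch of how the two spectra are typically computed; it assumes librosa is available and that "speech.wav" is any speech file, and the STFT parameters are illustrative only.

```python
# Minimal sketch: magnitude and phase spectra via the short-time Fourier transform.
import numpy as np
import librosa

y, sr = librosa.load("speech.wav", sr=None)       # waveform and its native sampling rate
D = librosa.stft(y, n_fft=1024, hop_length=256)   # complex-valued STFT
magnitude = np.abs(D)                             # what the neural front end predicts (often mel-warped)
phase = np.angle(D)                               # the part the vocoder must recover or replace
```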
And when we use Fourier analysis to take a waveform and convert it into spectra, what we typically get, because the transformation is complex-valued, is the magnitude part and also the corresponding phase part. Now, when I showed you the neural text-to-speech pipeline on the earlier slide, let me go back to that slide: what we are actually seeing there is that Tacotron, the neural front end, converts the phoneme or text features into some form of magnitude spectrum. But what is ignored is the phase spectrum. So the problem of vocoding, in general, is: given this partial information, that I only have access to the magnitude component of the spectrum, how do I reconstruct the waveform without having access to the corresponding phase spectrum? That is the problem of vocoding we are dealing with.

Now, I want you to note a couple of things from the magnitude and phase spectrum pictures shown here. If you look at the magnitude spectrum, and if you have a little background in speech, you will know that speech is essentially composed of the pitch, courtesy of the activity of the vocal folds, whose characteristic is to vibrate with a certain pitch period, so it is characterized by pitch; and then there is the vocal tract function, which modulates the airflow and produces the corresponding phoneme. What the magnitude spectrum, the plot on top, tells you is that it gives an insight into the phonetic identity through the resonances: you can clearly see certain resonances in the spectrum, and those resonances correspond to the vocal tract. And then you can also see harmonics, the horizontal lines spaced at almost equal intervals; they correspond to the harmonics, the pitch part, of the vocal folds. That means the magnitude spectrum is essentially capturing all the information of the speech: the linguistic identity, the speaker identity, a little bit of the channel and background noise, and so on.

Now, if that is the case, if the magnitude spectrum is modelling everything, then what is the use of the corresponding phase spectrum? Does the phase spectrum have any information at all? And if it has, how do we model it? Those are the two questions. It so happens that you can take the magnitude spectrum alone and estimate the speech signal by starting from a random phase component and iterating, which is essentially the Griffin-Lim algorithm; or you can take the magnitude spectrum and estimate what is called the minimum-phase component of the phase spectrum, which is essentially what is used in vocoders like the mel log spectral approximation (MLSA) filter, published in the 1980s, or STRAIGHT, which is more recent, from 1997. These two vocoders are based on minimum-phase reconstruction of the speech signal. If you listen to some of those reconstructions (unfortunately I don't have speech samples to play here, but I am sure some of you have already heard synthesis from STRAIGHT, from the MLSA filter, or from Griffin-Lim), you will hear certain artifacts, essentially pointing to the fact that if you want a very natural timbre of speech, you need some information that is embedded in the phase spectrum. So it is important that we actually model the phase spectrum.
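As a small illustration of this quality gap, here is a sketch of magnitude-only reconstruction with Griffin-Lim; the file name and parameters are assumptions, and the number of iterations is arbitrary.

```python
# Minimal sketch: reconstruct a waveform from the magnitude spectrogram only,
# starting from a random phase and iterating (the Griffin-Lim algorithm).
import numpy as np
import librosa

y, sr = librosa.load("speech.wav", sr=None)
S = np.abs(librosa.stft(y, n_fft=1024, hop_length=256))          # phase discarded
y_hat = librosa.griffinlim(S, n_iter=60, hop_length=256, win_length=1024)
# y_hat is intelligible but typically has the buzzy, "phasey" timbre described above;
# that is exactly the gap that neural vocoders try to close.
```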
But then, why do we ignore the phase spectrum? That is the question. You will understand immediately if you look at the picture below. The picture above has a certain pattern, and it looks like it is modelable by a statistical model; but the moment you see the picture below, it looks like an absolutely random image. That is the reason why, more often than not, in speech we have not tried to model phase at all: you can see it is not really statistically modelable, at least the phase spectrum as it is. There are several ways to compute or estimate the phase spectrum in which it becomes slightly more modelable, but the typical short-time Fourier transform phase spectrum is what I am showing you, and you will see it is not really modelable by a statistical model; it looks really random. At this point I would like to point out one particular work, from Yannis himself, published in 2009, which made such an attempt. You will see very rarely that an attempt is made in the speech community to model phase, and this is one of those rare attempts; for those who are interested in mathematical models of such circular data (phase really is circular, it wraps around on itself), I think this is one good paper to read.

So much for the background on signal-processing-based vocoders. I have already highlighted that most of these phase reconstructions, whether minimum-phase reconstruction or starting from a random phase and iteratively reconstructing it with the Griffin-Lim algorithm, do not give the very high quality of speech that we would like a speech synthesizer to possess. If that is the case, where do we go? That is where I think neural networks have made a lot of progress, and the rest of this talk essentially points out two examples of recent neural vocoders which have achieved very high-quality speech synthesis.

What I am showing you on the slide is probably what we can refer to as the first generation of neural vocoders, and they are mostly autoregressive in nature. We can count WaveNet, which is the first one, WaveRNN, then, in the same line of thought, SampleRNN, and finally LPCNet: four different vocoders, all of them autoregressive. If you look at what I term the second generation of neural vocoders, you will see that most of them belong to the non-autoregressive family; in a moment I will come to what autoregressivity means in all of these models, although I guess most of you are well aware of that by now. Here is the list of the family of non-autoregressive vocoders, among which, it is nice that we have Xin Wang and Professor Junichi Yamagishi, who proposed the neural source-filter model, with us; but today I will be talking about MelGAN as an example of a non-autoregressive neural vocoder.

At this point I would like to pause and reflect for a moment: if not for anything else, what can be a motivation to want to learn about neural vocoders? Personally, for me, the biggest motivation is what I am showing you on the slide: the study of neural vocoders gives us a doorway to learn about the variety of generative models that have been proposed of late. If you keep studying from 2014 onwards, the variety of generative models is as follows. [Brief pause while the host admits participants from the waiting room.]
So: there are variational autoencoders, there are autoregressive models, then there are flow-based models, normalizing-flow-based models, and then there are generative adversarial neural networks. Now, the beauty of neural vocoders as a research area is that, except for the fact that variational autoencoders have not been used much, all the other variants of generative model architectures have been heavily explored in neural vocoders, whereas that is not the case with the neural front end. By and large, neural front-end research has revolved around Tacotron, a variant of Tacotron, or the Transformer; there is quite a bit of research in that direction using some of these generative models too, but not as heavily as in neural vocoders. So if you are wondering why you should study neural vocoders, I think this is one of the beautiful gateways: it gives you an insight into a whole set of generative models that have appeared of late.

With that, let me go to the first generation of neural vocoders. I think Vasili will be handling a lot of the design issues and details of WaveNet, but I would like to highlight just one aspect of WaveNet and then move on to WaveRNN. If we look at WaveNet, the first autoregressive model: what we mean by autoregression is regressing the current sample on a set of previous samples that have already been predicted. You are basically regressing onto yourself, except that there is a time-step difference, that is all. In WaveNet, what folks have done is take the input sequence of past samples and use a stack of dilated convolution layers to predict the next sample. Now, if we analyse why we need such a deep architecture, 30 or 60 layers deep, the answer essentially comes out to be that if you have to model the context, to see as much past signal in the waveform as possible in order to predict the next sample, then you need that many dilations in the convolution network, because the receptive field size depends on the dilation factors and the kernel size, and hence also on the depth of the neural network.

Now, the question is: is there an alternative way of modelling the context without having to use such a deep architecture? As an answer to this question we see the WaveRNN architecture. What WaveRNN proposes is quintessentially the opposite: take a recurrent neural network with one single RNN layer. Because the RNN is equipped with a history mechanism, the mechanism of memory, with just one single layer you can memorize as much past context as you would like; except for the optimization difficulties that we generally see with RNNs, it has the ability to capture the past context. So they have tried to compress this whole WaveNet into a single-layer, shallow neural network architecture. Now, if we look at how WaveRNN generation itself works, it looks like the following, as shown on this slide: you have the mel spectrum at the input and the sample probability distribution, in the form of a softmax, at the output, and we generate one sample after the other, autoregressively, as we see here.
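To make the autoregressive generation loop concrete, here is a minimal sketch in PyTorch-style pseudocode. The `model`, its interface, and the number of samples per mel frame are all hypothetical assumptions for illustration, not the published WaveRNN implementation.

```python
# Minimal sketch of sample-by-sample generation conditioned on a mel spectrogram.
# `model` is assumed to be a single-layer GRU vocoder that returns a softmax over
# 256 mu-law bins for the next sample, plus its updated recurrent state.
import torch

@torch.no_grad()
def generate(model, mel, samples_per_frame=256):
    h = None                        # recurrent state carries the past context
    x = torch.zeros(1, 1)           # previous sample, initialised to silence
    out = []
    for frame in mel.unbind(dim=0):             # one mel frame at a time
        for _ in range(samples_per_frame):      # many audio samples per frame
            probs, h = model(x, frame, h)       # categorical distribution over 256 bins
            idx = torch.multinomial(probs, 1)   # sample one bin
            x = idx.float() / 127.5 - 1.0       # rescale bin index to roughly [-1, 1]
            out.append(x)
    return torch.cat(out, dim=-1)
```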
The moment we see such an autoregressive generation, we can imagine that inference, when we are trying to generate sample by sample for a given mel spectrogram, is going to be extremely slow, because it is sample by sample and not everything is done at once. In contrast, if you just take the inverse FFT, or the inverse DFT, it is completely parallel. So how do we increase the speed of generation so that the system becomes practical? That is the other aspect we will dive into briefly; most of the details here stay at a high level, so we will not go into detail that would bore you, but hopefully this gives you some insight into how to make this process a little faster.

This technique is called persistent RNNs, and the reason I wanted to highlight it is that this particular technique of persistence is not unique to neural vocoders, or to WaveRNN for that matter; it is widely used even in RNN training. So I thought it is worth spending some time understanding the persistence logic, and here you go. By default, if you look at the RNN algorithm, there is a set of parameters shared across time, which are essentially the recurrent parameters. In the case of sample-by-sample generation during inference from WaveRNN, what happens is that the recurrent weight matrix gets kicked out of the GPU's cache memory for every sample; essentially the weight matrix needs to be transferred from RAM into the cache memory to compute the matrix-vector multiplication, which takes a lot of time. The logic of persistence makes sure that, if you have a recurrent neural network, you take the recurrent matrix and store all the weights, all the parameters of the model, in the cache or registers of the GPU without having to evict them at every time step. Then the parameters do not have to go back and forth between the cache and the GPU RAM; they always stay in the cache, and the matrix-vector multiplication is extremely fast because they are easily accessible. That is essentially what the logic of persistence is, and it is because of persistence that WaveRNN can be made extremely fast, so that we can have real-time neural synthesis. Otherwise, all the neural vocoders I have introduced up to this point, whether WaveNet or WaveRNN, are extremely slow to synthesize, although the quality is high, which makes their use very limited. Because of this persistence logic on GPUs, you can make it extremely fast and usable for a real-time application.

If you are wondering what exactly the memory transfer is, or how slow or fast it is, here is the memory hierarchy of an example GPU. This is a fairly old GPU, probably 2016 or 2017; nowadays GPUs may have quite different parameters from the ones I am showing, but what you can definitely see in the memory hierarchy is that if the parameters are stored in the RAM, then you need a huge amount of time to access them, whereas if they are in the thread or cache registers, accessing them is very quick. That is essentially the difference between having the parameters in GPU RAM versus in the cache, and why we would like the parameters to be in the cache. This is one of the reasons, as I told you earlier, why WaveRNN is so fast: the RNN weights are cached in the registers.
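As a back-of-the-envelope check of why persistence constrains the model, here is a tiny calculation; the hidden size and the register-file capacity are illustrative assumptions, not figures from the WaveRNN paper or from any particular GPU.

```python
# Rough arithmetic: can the recurrent weights stay resident on-chip?
hidden = 896                           # assumed GRU width
gru_params = 3 * (hidden * hidden      # input-to-hidden weights (roughly)
                  + hidden * hidden    # hidden-to-hidden weights
                  + hidden)            # biases
bytes_fp16 = gru_params * 2            # two bytes per parameter in half precision
print(f"GRU weights: {bytes_fp16 / 1e6:.1f} MB")

register_file = 14e6                   # assumed total register-file capacity of the GPU
print("fits on chip:", bytes_fp16 < register_file)
# If the weights do not fit, they are re-fetched from GPU RAM at every time step and
# the per-sample matrix-vector product becomes memory-bound, exactly the slow case above.
```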
Now, having said all of this, let us look at some of the limitations of the WaveRNN approach; the limitations that we observe in the generation-one neural vocoders, the autoregressive models, give us a pathway to see what we need to address in the generation-two neural vocoders, which are essentially the parallel vocoders. So let us analyse some of the limitations of WaveRNN. Because of this persistence logic, we have to make sure our model fits into the cache registers of the GPU; that means it has to be really small, we can't have as big a model as we like, because it may not fit into the cache of the GPU. So because of this logic we are limited in the model size. And when we reduce the model size, the model capacity gets limited. For example, if you would like to have a universal vocoder, which means you combine data from several different speakers and create one vocoder for all of them, then you need somewhat larger capacity; but if you increase the capacity, it may not fit into the GPU cache. So this trade-off is actually a bottleneck for many of the advancements of the vocoder. We can also see that if we want to generate speech at a higher sampling rate, then, again because of the recurrence, the moment you increase the sampling rate the generation is going to be slower; if it were a fully parallel synthesizer, whether it is lower or higher sampling-rate synthesis would not make such a difference, but because these models are recurrent, it does have such an impact. So the compute trade-offs we see during generation do indeed affect many of the advancements in vocoders, and this gives rise to thinking about alternative architectures for neural vocoders which can address some of the problems we have seen with the generation-one neural vocoders.

That brings us to neural vocoders, part two: the non-autoregressive neural vocoders. I am sure some of you are familiar with the parallel WaveNet or ClariNet vocoders, which were published quite some time ago and are based on a distillation mechanism: an autoregressive neural network is trained as a teacher, and then the essentials of the teacher are distilled into a student which is entirely parallel, in both the parallel WaveNet vocoder and ClariNet. I think Vasili will be able to share some details of parallel WaveNet in his talk, so I will focus on a more recent architecture, which is called MelGAN, and sometimes also referred to as WaveGAN; there are actually two separate publications, but the essential architecture is the same, so I am going to summarize and highlight some of the common aspects of the GAN architectures that have been proposed for neural vocoding. That is what we will do in the subsequent slides.

If we look at WaveGAN: again we have the spectrum as input, the mel spectra, so it is a conditional model with the spectrum as input and the waveform as output; but the major difference is that it predicts non-autoregressively, as opposed to the autoregressive prediction of WaveNet and WaveRNN. At the heart of MelGAN or WaveGAN there is a generator, which takes mels and produces a waveform, and a discriminator, which takes both the real waveform and the synthesized, or fake, waveform produced by the generator; the discriminator tries to distinguish between these two, while the aim of the generator during learning is to fool the discriminator. That is the setup of the GAN and how the GAN is used for waveform generation.
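Here is a minimal sketch of that adversarial setup as a single training step. The generator `G`, discriminator `D`, and the hinge-style losses are illustrative assumptions, one common choice among several, and not the exact MelGAN or WaveGAN objective.

```python
# Minimal sketch of one adversarial training step for a mel-to-waveform GAN.
# G: mel -> waveform, D: waveform -> realism scores; both are hypothetical nn.Modules.
import torch.nn.functional as F

def train_step(G, D, opt_g, opt_d, mel, wav_real):
    # Discriminator update: real waveforms should score high, generated ones low.
    wav_fake = G(mel).detach()
    d_loss = (F.relu(1.0 - D(wav_real)).mean() +
              F.relu(1.0 + D(wav_fake)).mean())
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator update: try to fool the discriminator into scoring fakes as real.
    g_loss = -D(G(mel)).mean()
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()
```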
However, there are very important design choices, which I will highlight in the subsequent slides, that are responsible for such high-quality synthesis from a non-autoregressive network. What are those design choices and how do they impact the learning of the GAN? We will see in the subsequent slides. For folks who have not read much of the GAN literature, it is worth going through some of these papers, because I will highlight two different GAN architectures, but in the process what I essentially want to highlight is that one of the problems with GANs is stable learning: how do we train them in a stable manner? Their optimization is generally not so stable, so various aspects, for example which regularizers we employ and which loss functions we employ, become very important in making the learning stable. I want to highlight that these design choices play a crucial role. I will give two different example architectures, one is TTS-GAN and the other is MelGAN, but it is worth going through the plethora of GAN literature out there, which tells you about the different mechanisms, these regularizers and loss functions, and their importance in achieving stable training for GANs.

First let me go through the MelGAN architecture, and then I will go very briefly through the TTS-GAN architecture; that gives you two different flavours of GAN training, and then hopefully I can pass the baton to Vasili and he can take it from there. So, let us look at the key design of the MelGAN generator. This is the generator design, as you can see: we have the mel spectrum on top, and it goes through a couple of upsampling layers. You can see the upsampling factor at the start is very high: the upsampling is done eight times in a single go, and after that the upsampling factor is reduced to two. So initially there is aggressive upsampling and then less aggressive upsampling towards the end, to finally produce the waveform. The essential building block of the architecture is a dilated convolution block, essentially like what we have in WaveNet: a stack of dilated convolution blocks, together with the upsampling layers. That is the design of the generator.

What is more important in MelGAN is the design of the discriminator. The discriminator takes the generated waveform and also the natural waveform and tries to distinguish between the two. However, there are a couple of ways of doing this. For example, you can take the entire predicted waveform from the generator, and the entire matching natural waveform, feed them to the discriminator, and ask it to discriminate between the two; but if you do so and train your MelGAN generator that way, it produces absolutely robotic speech that is sometimes not even intelligible. The authors of this paper highlight that it is important, for that reason, to take short windows, like the short analysis windows with a certain frame shift that we use in the short-time Fourier transform, and train the discriminator on those windows rather than on the entire waveform. As you will see with TTS-GAN as well, this is one of the key design considerations that made GANs work for neural vocoders: a window-based objective, where the discriminator works on short-time windows.
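Putting the generator pattern just described together, here is a compressed sketch: aggressive upsampling first, gentler upsampling later, with dilated residual blocks in between. The channel counts, dilation schedule, and upsampling factors are illustrative assumptions, not the exact MelGAN hyperparameters.

```python
import torch.nn as nn

class ResBlock(nn.Module):
    # Dilated residual block (WaveNet-style convolutions, but non-causal and non-autoregressive).
    def __init__(self, ch, dilation):
        super().__init__()
        self.block = nn.Sequential(
            nn.LeakyReLU(0.2),
            nn.Conv1d(ch, ch, kernel_size=3, dilation=dilation, padding=dilation),
            nn.LeakyReLU(0.2),
            nn.Conv1d(ch, ch, kernel_size=1),
        )

    def forward(self, x):
        return x + self.block(x)

class Generator(nn.Module):
    def __init__(self, n_mels=80, factors=(8, 8, 2, 2)):
        super().__init__()
        layers = [nn.Conv1d(n_mels, 512, kernel_size=7, padding=3)]
        ch = 512
        for f in factors:              # aggressive 8x upsampling first, gentler 2x towards the end
            layers.append(nn.LeakyReLU(0.2))
            layers.append(nn.ConvTranspose1d(ch, ch // 2, kernel_size=2 * f, stride=f, padding=f // 2))
            ch //= 2
            layers.extend(ResBlock(ch, d) for d in (1, 3, 9))
        layers += [nn.LeakyReLU(0.2), nn.Conv1d(ch, 1, kernel_size=7, padding=3), nn.Tanh()]
        self.net = nn.Sequential(*layers)

    def forward(self, mel):            # (batch, n_mels, frames) -> (batch, 1, frames * 256)
        return self.net(mel)
```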
So, with these as the key designs of the generator and the discriminator, if we now look at how it performs: I have given here tentative numbers for how many parameters a typical WaveRNN has, and the corresponding WaveGAN, and what is important to note is that on CPU and on GPU, that means across varying hardware platforms, WaveGAN is able to do better than real time, whereas WaveRNN cannot unless you apply certain other optimizations (give me a second, I think there is one more participant waiting), such as sparsity or subscale, as proposed in the original WaveRNN paper, to make it work across platforms, on the CPU, the GPU, mobile devices, and so on. For WaveGAN you do not have to go through any of those performance optimizations: without them, the parallel architecture gives you better-than-real-time synthesis across devices. And because it can have such a large number of parameters, it is generally very amenable to building a universal vocoder; you don't have problems of capacity, and you generally don't have problems achieving real time across hardware platforms. These are some of the advantages of MelGAN over the WaveRNN approach we saw earlier.

Having said that, let me give you one insight into the TTS-GAN paper and model, so that you get a different view, a different perspective, on the wide-ranging flavours of GANs that are out there. In the MelGAN paper, if you have seen it, the model is optimized using weight normalization as the regularization technique on both the generator and the discriminator; it is important to keep in mind which regularizer we use during GAN training, otherwise it can become completely unstable. So the authors of MelGAN use weight normalization. Quite in contrast, if you look at the TTS-GAN paper, you see that the authors have used what is called conditional batch normalization: essentially a variant of batch normalization whose affine parameters are conditioned on the random noise vector. And then there is a bag of tricks, which I have summarized on this slide. They also use what is called spectral normalization, which normalizes the weight matrix by its largest singular value, as opposed to the weight normalization in MelGAN. Some of these techniques are worth studying, to see what fits best for the given application when training a GAN for neural vocoding. At this point we have two such examples, MelGAN and TTS-GAN, but there is a plethora of GAN papers out there, for image generation for instance, whose techniques we can combine with these to see how we can improve. So here are some of the bag of tricks used in the TTS-GAN paper for stable training of GANs.
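Both normalization styles mentioned above are available as one-line wrappers in PyTorch; a small sketch (the layer shapes are arbitrary):

```python
import torch.nn as nn
from torch.nn.utils import weight_norm, spectral_norm

# MelGAN-style: weight normalization on generator and discriminator convolutions.
conv_g = weight_norm(nn.Conv1d(512, 512, kernel_size=3, padding=1))

# TTS-GAN-style: spectral normalization, which constrains the largest singular value
# of each (here, discriminator) layer's weight matrix.
conv_d = spectral_norm(nn.Conv1d(1, 16, kernel_size=15, padding=7))
```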
Now, with that, I would like to summarize at this point: non-autoregressive models are promising. I only discussed the GAN variants, but there are other variants, as I highlighted, which the other presenters will go into in more detail; on the whole, the non-autoregressive architectures, whether flow-based or GAN-based, are very promising. However, the only thing to note is that, to this day, they have yet to reach as high a quality as their autoregressive counterparts, so some more research is definitely needed to really bring the non-autoregressive models up to the quality of the autoregressive ones, and hopefully we will be making progress in that direction. I hope this talk gave you some overview of what these models are, what the current state of the art is, and what the directions of progress for the future are. That is it from my end; thanks a lot for having me again, and if there are any questions I will be very happy to take them on Slack and try to answer them. I think I can hand over to Vasili now.

Thank you. I will actually talk about two autoregressive vocoders: WaveNet, and then I will say a few things about WaveRNN that Sivanand has not covered, especially sparsity, which is an important component for making it run faster; and if there is time I will also talk about a non-autoregressive vocoder, which is the parallel WaveNet. WaveNet is the first neural vocoder ever designed that models the raw speech directly. It was presented by DeepMind in September 2016. WaveNet outperformed the best TTS systems which existed at that time, both parametric and concatenative, in mean opinion scores, and up to now it is the neural vocoder that produces the best quality of synthesized speech; we do not have any other neural vocoder that has surpassed the quality of WaveNet, it is still the best in terms of quality.

Let's see a little how WaveNet works. If we have speech and we want to model the probability of all the speech samples, we may use the chain rule of probability and decompose it into a product of terms, and WaveNet models the conditional probability of each speech sample; in order to generate speech, the conditioning also includes linguistic or acoustic information. The conditional probability is modelled by a deep neural network: a one-dimensional causal convolution layer captures the time dependence on the previous samples, and dilated convolutions are used to increase the receptive field without increasing the depth of the network or the filter length.

I will explain a little how the convolution works. There is a convolution filter, and let's suppose we have an input signal; the filter is applied, we multiply the filter element by element with the part of the signal it overlaps, and after the multiplication we create an output sample; we then move the kernel one sample forward, towards the future, produce the next output, and so on, and this is the output signal. Causal convolutions do not consider future samples, and this can be achieved using one of the following two alternative techniques. The first one is to set the part of the filter that corresponds to the future to zero, setting all those elements to zero. The second trick is to shift the output signal towards the future: for example, we multiply the part of the signal with the kernel element by element, and the output is exactly this one, but compared to the previous example the output is now shifted one sample into the future in order to make it causal. For the dilated convolution we do the following: suppose we have a filter of width three; we can construct an equivalent filter of width five by inserting some zeros, and then the multiplication is done as before, and with that trick we increase the receptive field of the filter.
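The receptive-field arithmetic behind these pictures is simple enough to show in a few lines; the kernel size and dilation schedules below are the usual WaveNet-style choices, used here only as an example.

```python
# Receptive field of a stack of dilated causal convolutions:
# each layer adds (kernel_size - 1) * dilation samples of extra context.
def receptive_field(kernel_size, dilations):
    return 1 + sum((kernel_size - 1) * d for d in dilations)

print(receptive_field(2, [1, 2, 4, 8]))                                  # -> 16 samples
print(receptive_field(2, [1, 2, 4, 8, 16, 32, 64, 128, 256, 512] * 3))   # 3 dilation cycles -> 3070 samples
```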
In another example, we can further increase the receptive field by inserting more zeros, so as to have a dilated convolution with dilation four. For an efficient implementation of dilated convolution, we do not actually construct the equivalent filter with the filled-in zeros. Here we can see a high-level view of the dependencies in WaveNet and how it uses the past information of the signal to predict the current sample. In this example we have a stack: the first layer has dilation one, the second has dilation two, then dilation four, and dilation eight, and the output depends on all of those blue input samples; the receptive field in the above example is 16 samples. In WaveNet, the dilation is doubled for every layer up to a certain point, then reset to one again, and it increases again up to that point, and so on. In the original implementation of WaveNet this was done three times, so there were three cycles of increasing dilation; in another implementation there were five cycles. And here is another example of the dependencies.

Another component which is very important for building WaveNet is the residual connections. Why do we need residual connections? The residual connection, from the mathematical point of view, is a trivial change: instead of mapping x to F(x), it changes the output and maps x to x plus a residual F(x), so it is equivalent in expressive power to the original network without the residual connection. However, from an engineering point of view, the residual connection is very important for many reasons. One reason is that F(x) is then a small deviation around the identity mapping of x, and for this reason it is much easier to initialize the weights of the layers inside the residual block: we can safely initialize them close to zero. Another, more important reason is that with the residual connection we can address the vanishing gradient problem, because in the back-propagation the derivative can take two paths: one path is through the weights of the layers, and through this path the derivative starts to lose its power and after some layers becomes pure noise; but it can also take the path through the identity connection, the residual connection, and through this path the derivative is not changed at all, so the derivative can pass from one block to the next without any degradation. And here are two residual blocks, one after the other.

Another very important component we see in WaveNet is the gated activation, the expert-and-gate construct. This component may remind us of the gates we use in some RNNs, for example the LSTM and the GRU networks; here it is a simplified version of gating, with only two branches: the expert learns a part of the hidden space, and the gate evaluates how important or how relevant the information the expert has learned is, and is multiplied with the expert in order to let this information pass to the next layers, or to block it.

In the initial implementation of WaveNet, the output was modelled by a categorical distribution, and in order to compress the information from the raw audio, which was recorded using 16 bits, into 8 bits, while retaining as much of the information as possible, the trick was to use a mu-law transformation; after this transformation, x_t is quantized into 256 values, and finally this is encoded into a one-hot vector.
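For reference, here is a small sketch of that mu-law companding and quantization step; the constants follow the standard mu-law definition, and the example values are arbitrary.

```python
# Minimal sketch of the mu-law companding + quantization used at the WaveNet input/output.
import numpy as np

def mu_law_encode(x, mu=255):
    # x in [-1, 1] -> companded value in [-1, 1], then quantized to mu + 1 integer bins.
    companded = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    return ((companded + 1) / 2 * mu + 0.5).astype(np.int64)   # bin indices 0 .. 255

def mu_law_decode(q, mu=255):
    companded = 2 * (q.astype(np.float64) / mu) - 1
    return np.sign(companded) * ((1 + mu) ** np.abs(companded) - 1) / mu

q = mu_law_encode(np.array([0.0, 0.01, -0.5, 0.9]))
one_hot = np.eye(256)[q]     # each quantized sample becomes a 256-dimensional one-hot vector
```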
A toy example can be seen here. Suppose we have a signal, and suppose, to make the problem easier, that instead of 256 quantization values we have only four. First the signal is passed through the mu-law function and transformed; then it is quantized into these bins; and then from the bins it is converted into one-hot vectors, and this is the input to WaveNet. If I wanted to improve this input procedure, I would add something that has been mentioned in the LPCNet paper, for example: before doing the mu-law transformation we can pre-emphasize the signal, and after generation de-emphasize it. What the pre-emphasis and de-emphasis offer is a reduction of the quantization noise we get by compressing 16-bit audio into 256 values.

For the output: since the original WaveNet was modelled with a categorical distribution, we use the softmax function to convert the output of the network into probabilities. We have an example here, with a receptive field of three, of how the probabilities are predicted: in order to predict the probability at time step four, the three previous input samples have to be used as input (the receptive field is assumed to be three here); then, in order to predict the next sample, we sample from this probability distribution, which is categorical (I will show you how to do that in the next slide), and using the new sample x4 we predict the probability for x5. Here we can see an example, again with receptive field three, of what to compare during training in order to learn the weights of the network: we compare the output of the network after it passes through the softmax, which is a probability distribution, p4, p5, p6 at the corresponding time steps, with the target, which is the samples x4, x5, x6. But how can we compare a probability with a sample? We convert the sample, using the one-hot encoding, into a categorical distribution which has 256 values, all of them zero apart from one, which is one. So this is also a probability distribution, with 255 zero values and one value equal to one. Now we have two categorical distributions, one predicted by the network, which is p, and one which is the target, with all zeros apart from one bin which is one, and we compare these two probability distributions using the cross entropy.

As far as the sampling, the generation, is concerned, there is an example here of how we can generate. We randomly initialize the first samples, up to the receptive field length, which in our example is three; then we use WaveNet to predict the probability of the next sample; then we sample from this probability distribution. There are four bins in our example: bin zero, bin one, bin two and bin three. Let's assume that from this distribution we randomly choose bin one, so the next sample is one; then we use samples x2, x3, x4 as input to predict the next probability, and from this categorical probability distribution we can sample the next sample, and so on. For the sampling methods, we can use direct sampling; temperature sampling; mode sampling, which means we choose the most probable bin; mean sampling, where we take the mean of the distribution; or top-k, which takes the k bins with the highest probabilities and chooses one of them.
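Here is a minimal sketch of those sampling strategies applied to a single categorical output distribution over 256 bins; the distribution itself is random and the temperature and k values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
logits = rng.normal(size=256)
p = np.exp(logits) / np.exp(logits).sum()          # one categorical output distribution

direct = rng.choice(256, p=p)                      # direct sampling

T = 0.8                                            # temperature sampling: T < 1 sharpens the
p_T = p ** (1.0 / T); p_T /= p_T.sum()             # distribution, suppressing rare "click" bins
temperature = rng.choice(256, p=p_T)

mode = int(np.argmax(p))                           # mode sampling: most probable bin
mean = float(np.sum(np.arange(256) * p))           # mean of the distribution

k = 10                                             # top-k sampling
top = np.argsort(p)[-k:]
top_k = rng.choice(top, p=p[top] / p[top].sum())
```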
Of these sampling methods, the most successful is direct sampling, which is also one of the easiest to implement. Some people also use temperature sampling in order to reduce the noise caused when a bin with an extremely low probability is chosen; this noise sounds like clicks. However, we have to be careful with temperature sampling because it may change the quality of the synthesized speech. After the samples have been generated, we scale back to the original audio scale by using the inverse mu-law transformation.

In the original implementation, in the first paper, the generation was extremely slow: for example, in order to generate three seconds of speech we needed something like one day of calculations on a GPU. Later on, people realized, especially Tom Le Paine, who at that time was a PhD student, that most of these calculations are redundant, so he proposed to cache the previously computed values and only compute the current time step. By doing that, he greatly accelerated the generation process, and after that, three seconds of speech needed about three minutes to be generated with TensorFlow or PyTorch on a GPU. With a clever implementation using CUDA and persistent kernels, as Sivanand mentioned before, people managed to synthesize one second of speech in approximately one second of computation, that is, in real time. But this is still not sufficient for real applications, where we need to synthesize speech much faster than real time, because we have other modules to run within the real-time window as well.

The architecture as a whole is the following: WaveNet consists of a stack of residual blocks, and each of these blocks has a skip connection, so as to take into account multi-resolution information; it then combines all the information from the skip connections and passes it through a post-net to predict the probability. In the original implementation from Google, the number of residual blocks was chosen to be 30. And here is another view of the WaveNet architecture and how we can insert conditioning information. Because, if we generate speech using only the previous samples, we create babbling noise which has the identity of the speaker used to train the system but does not contain any global information, so we need to use additional information. One piece of additional information is, for example, the identity of the speaker, if we want to train the model on more than one speaker; for this we use a speaker embedding. Another very important piece of information is to condition WaveNet on acoustic or linguistic features. In the first implementation the local conditioning was linguistic features, but later on people realized that WaveNet performs better if acoustic information is used as the conditioning; this has to be upsampled to the same sampling frequency as the speech, and can be inserted into the system as this slide shows. We can also combine more than one kind of conditioning information, for example the speaker identity and the acoustic features, and use both of them as conditioning for WaveNet. The new architecture after the conditioning can be seen here.
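To tie the pieces together, here is a minimal sketch of one conditioned residual block: a dilated causal convolution, the gated (expert times gate) activation with local conditioning added inside both branches, and the residual and skip outputs. All layer sizes are illustrative assumptions, and for brevity the filter and gate are produced by a single convolution with doubled channels.

```python
import torch
import torch.nn as nn

class ConditionedResidualBlock(nn.Module):
    # One WaveNet-style block: dilated causal conv, gated activation with local
    # conditioning, residual and skip outputs. Sizes here are illustrative only.
    def __init__(self, ch=64, skip_ch=128, cond_ch=80, dilation=1):
        super().__init__()
        self.dilation = dilation
        self.conv = nn.Conv1d(ch, 2 * ch, kernel_size=2, dilation=dilation)  # filter + gate together
        self.cond = nn.Conv1d(cond_ch, 2 * ch, kernel_size=1)                # conditioning projection
        self.res = nn.Conv1d(ch, ch, kernel_size=1)
        self.skip = nn.Conv1d(ch, skip_ch, kernel_size=1)

    def forward(self, x, c):
        # x: (B, ch, T) features of past samples; c: (B, cond_ch, T) acoustic features
        # already upsampled to the audio sample rate.
        h = nn.functional.pad(x, (self.dilation, 0))        # left padding keeps the convolution causal
        f, g = (self.conv(h) + self.cond(c)).chunk(2, dim=1)
        z = torch.tanh(f) * torch.sigmoid(g)                 # "expert" gated by the "gate"
        return x + self.res(z), self.skip(z)                 # residual output, skip output
```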
Some improvements that people have proposed in order to accelerate the generation of WaveNet are shown in the following slide. For example, instead of passing the output of each residual block through its own matrix multiplication, which is called a one-by-one convolution, and then summing all of them, we can concatenate all of them and multiply them with one big matrix, and this can increase the speed by up to 10 percent. An even more important change to the original architecture is to combine the two dilated convolutions, for the expert and for the gate, into a single convolution; you can see the difference between the two, and this greatly accelerates WaveNet, making it up to 40 percent faster. These are the two most important changes that we can make without affecting the quality of the generated speech at all. And now we can see the whole architecture and how the losses are formed in order to train the model.

Another change that has been proposed to the original WaveNet is the following: the original WaveNet uses filters with a kernel of width two, but we can use filters of width three, and this allows us to have a longer receptive field using fewer blocks in the WaveNet stack and to accelerate the generation; here, for example, is a high-level view of how this operates.

In the original implementation, the output distribution was a categorical distribution, predicted using the softmax function. Later on, in parallel WaveNet, another distribution was proposed: a mixture of discretized logistics. This distribution was first proposed in the PixelCNN++ paper. However, as you will see in the hands-on code that I will share in a few hours from now, this distribution was an unlucky choice in the original parallel WaveNet paper and is not very stable, so I propose two other distributions which perform much better. The first one is a mixture of logistics; the difference from the previous one, the mixture of discretized logistics, is the following: the mixture of discretized logistics uses the cumulative distribution, while here we use the probability density function, and although this is actually the same distribution as the previous one, in practice it performs much better and is much more stable. Also, in the ClariNet paper proposed by Baidu, people use a single Gaussian, but we can extend that to a mixture of Gaussians, and this also performs very well in practice.

And now we can hear some samples from WaveNet. First, we can hear some samples without any conditioning information: it is babbling noise; I will just play it so you can recognize the identity of the speaker, but it conveys no linguistic information at all, it is babbling noise which has the characteristics of the speaker, the f0, the formants, all the speaker's characteristics. Here was one of my first attempts, three or four years ago, where the input was the linguistic information; so in that sample I used WaveNet as the statistical model and as the vocoder at the same time, and the front end performed only the mapping from the text to context-dependent linguistic labels, plus the alignment and the upsampling to the sample rate; nowadays this mapping is done by Tacotron in modern systems. Let's hear my first attempt.
[Sample plays: "To make matters worse, the state's largest malpractice insurer, OHIC, announced in March it will no longer renew policies for Wyoming doctors."] There is some background noise, which can be attributed mainly to the small database which was used. Later on, as I said, people proposed to use WaveNet as a vocoder, with the local conditioning information being, for example, mel filter banks. Let's hear some samples of this. [Sample plays several times, once per model variant: "How can you manage all alone, Mr. Young?"] There is one model which has both global conditioning and local conditioning; the global conditioning is the speaker identity. [Samples play: "Author of the Danger Trail, Philip Steels, etc." and "How can you manage all alone, Mr. Young?"] And here the output distribution was a Gaussian, and the network predicts the mean and the logarithm of the standard deviation of that Gaussian; from these two parameters, which vary across time, we can generate the speech.

People realized that WaveNet produces some kinds of artifacts. Wu et al. published a paper, "Collapsed speech segment detection and suppression for WaveNet vocoder", in 2018, which categorizes these artifacts as belonging to two types: the type-one artifacts, which are mainly produced when the output distribution predicts parameters of a probability density, such as the parameters of a Gaussian, a mixture of Gaussians, or a mixture of logistics; and the type-two artifacts, which sound like clicks and are produced during sampling when the output distribution is a categorical distribution. The type-one artifacts are a characteristic of the trained model, which predicts out-of-range values for the mean and the scale of the output distribution. For example, the mean value has to be constrained to the interval minus one to one, but we may get out-of-range values: the mean of the speech can be predicted to be four instead of staying within minus one to one. These type-one artifacts are not caused by the sampling algorithms; they are instabilities of the model, related to two things. One is the form of the loss function, so the loss function plays a very important role in the stability of the model; the other is that this instability arises when there are very small variations around the mean trajectory in the samples of the database, and this usually happens in silence segments. So the most difficult segments to model are the silence segments.

Let's see a little how these errors happen, and why the cross entropy with the categorical distribution does not suffer from the type-one errors. The form of the cross entropy between two distributions p and q is shown here. In the case of WaveNet, the target distribution is a one-hot encoding where all the p values are zero apart from one, which has the value one; so, in this sum of minus p_i log q_i terms, only one term survives, and the cross entropy is actually minus log q_i. If we take the derivative of this, we can see here the plot of the cross entropy and the plot of the derivative, and we can see that this cross entropy penalizes the small q's: whenever class i occurs, the system tries to increase the corresponding probability q_i. Now let's see the negative log-likelihood loss. The form of the negative log-likelihood loss, if we assume that the output distribution is a Gaussian distribution, is shown here.
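For reference, here is the algebra being described, reconstructed from the explanation above (a LaTeX sketch; the observed class is written as $i$, and the Gaussian mean is taken to be zero as in the silence-segment argument):

```latex
% Cross entropy with a one-hot target: only the term of the observed class i survives.
H(p, q) = -\sum_k p_k \log q_k = -\log q_i,
\qquad \frac{\partial}{\partial q_i}\bigl(-\log q_i\bigr) = -\frac{1}{q_i}.

% Gaussian negative log-likelihood for one sample x, assuming zero mean (a silence region):
\mathcal{L}(x;\sigma) = \log\sigma + \frac{x^2}{2\sigma^2} + \tfrac{1}{2}\log 2\pi,
\qquad \frac{\partial \mathcal{L}}{\partial \sigma} = \frac{1}{\sigma} - \frac{x^2}{\sigma^3}.
```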
If we take the derivative with respect to one of its parameters (for simplicity, we assume that the mean value is zero, so we are in a silence segment), let's take the derivative with respect to the standard deviation, the scale parameter. The derivative consists of two terms. The first term is the derivative of the logarithm of sigma, which is one over sigma, and the other is the derivative of x squared over two sigma squared. We can see that if the next sample is close to the mean value, which in this case is zero, not very far away, then the logarithmic term dominates, and this rewards small sigma, which means the scale becomes smaller and smaller and can finally get very close to zero. The problem then happens if, after the silence, a sample appears which is far away from zero, a non-silence sample. Then the second term dominates; this term penalizes the small sigma and causes sigma to become very large, because the numbers involved are huge. And we actually see this situation: the first, logarithmic term, in silence regions, little by little causes the scale parameter to become smaller and smaller, and then suddenly, if a sample appears which is far from silence, the second term sends the derivative, and through back-propagation the weights of the network, very far away from the optimal values. The cross entropy does not have this problem, for the following reason: whenever a sample appears which belongs to class i, the corresponding probability q_i is increased by a small amount, and the other probabilities are decreased through the softmax function, not explicitly but through the softmax, and none of them is forced to go to zero, so there is no instability there.

Let's see an example using a database which has some characteristics that can cause this kind of problem. Here is the LJ Speech dataset 1.1, which has two types of problems: there is frequent audio clipping, and there are long silent segments. This database is ideal for demonstrating the type-one problems. Let's first hear a model trained using the categorical distribution with cross entropy; the model is stable, and the output is this. [Sample plays: "Printing, in the only sense with which we are at present concerned, differs from most, if not from all, the arts and crafts represented in the exhibition."] And now let's hear the kind of problems, the type-one problems, which can be caused when the negative log-likelihood is used and the output distribution is a Gaussian. [Sample plays, with artifacts: "Printing, in the only sense with which we are at present concerned, differs from most, if not from all, the arts and crafts represented in the exhibition."]

How can we avoid the type-one problems? The first way is the most obvious one: use bigger and cleaner databases. The second, which is usually done in the literature, is to reduce the silence regions by cutting them out. The third, which has been used in LPCNet for example, is to add some noise to the input data; however, if you add noise to the input data, some part of this noise can appear in the synthesized speech. Another heuristic is to use dropout. Dropout is a very important component in speech recognition, because it makes the model robust; however, in speech synthesis, dropout may affect the quality of the synthesized speech, and it reduces but does not completely eliminate the type-one artifacts. Another heuristic used in the literature is to clip the gradients: we do not allow the gradients to take extremely high absolute values. For example, in figure A we can see that for some samples the gradients have an extremely high value; after clipping these gradients (we clip at 40, on a logarithmic scale) we get figure B, and we see that there are no longer outliers that affect the training of the model. Also, we can clip the variance parameter predicted by the network to values above a threshold, which is another technique commonly used in the literature. Another technique is adding a term to the loss function that penalizes small variances. And one technique that works very well is to slightly change the architecture of the network using weight normalization; this is a technique proposed by Tim Salimans and Kingma, and it has been used to stabilize the training both of WaveNet when its output distribution predicts parameters, for example of a Gaussian or logistic distribution, and of the parallel WaveNet. There will be an exercise in the hands-on session to write your own implementation of weight normalization. And the last technique is to use a larger batch size.
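Two of these stabilizers fit in a few lines of PyTorch; in this sketch the model and the predicted log-scale are stand-ins, and the clipping thresholds are arbitrary.

```python
import torch
import torch.nn as nn

model = nn.GRU(input_size=1, hidden_size=64)     # stand-in for the vocoder network

# ... forward pass and loss.backward() would go here ...

# Heuristic: clip the gradient norm so that a single outlier sample cannot throw
# the weights far from their current values.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=40.0)

# Heuristic: when the network predicts a log standard deviation, clamp it from
# below so the scale cannot collapse towards zero in silence regions.
log_sigma = torch.randn(8, 1)                    # stand-in for the network's predicted log-scale
sigma = torch.exp(torch.clamp(log_sigma, min=-7.0))
```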
Let's see, if we apply all these techniques, how much we can improve the quality. I will not play the noisy signal again; I will play the original. [Original recording plays: "Printing, in the only sense with which we are at present concerned, differs from most, if not from all, the arts and crafts represented in the exhibition."] This one is the model where the output distribution is a categorical one, trained with cross entropy; there is no instability in the model here. [Sample plays.] There is an artifact, a click, at the beginning. [Sample plays.] And now the output is a Gaussian distribution, and we have applied some combination of the techniques I described before, and we have stabilized the training of WaveNet. Some of these techniques can also be used to stabilize the training of other unstable networks, for example the training of GAN networks. And this was the end of the WaveNet presentation. Now, in order to stay within the time, I will leave some time for questions, and then I will present some improvements to WaveRNN.

Okay, Vasili, thank you very much. If there are questions, probably this is a good time to ask, but we can also take a break. Okay, for me it is fine, we can take a five-minute break, and during that period I can answer questions. Yes, but if there is any question to be asked, ask it now, and then we take a five-minute break so everybody can have a break; you can raise your hand in the UI. Okay, so we have a five-minute break; I will pause the recording, and then we will continue in five minutes. Okay, so we are ready, please go ahead.

Okay. I have presented in the previous talk WaveNet, which achieved state-of-the-art results in audio synthesis.
And this was the end of the WaveNet presentation. Now, in order to stay within the time, I will leave some time for questions, and then I will present some improvements to the WaveRNN. Okay, Vasili, thank you very much. If there are questions, probably this is a good time to ask, but we can also take a break. Okay, for me it's okay, we can take a five-minute break, and during that period I can answer questions. Yeah, it's better if any question is asked now, and then we take a five-minute break, so everybody can have a break. You can raise your hand — there is a UI for that in the window — and okay, so we have a five-minute break. Okay, I will pause the recording, and we will continue in five minutes. Okay, so we are ready, please go ahead. Okay, I have presented in the previous talk the WaveNet, which achieved state-of-the-art results in audio synthesis. However, due to the many residual blocks, the computation is very slow and impractical for real applications. To increase the efficiency of sampling from these autoregressive models, Kalchbrenner et al., in a paper called Efficient Neural Audio Synthesis, proposed a number of techniques. The first technique was to substitute the block of residual connections, which is responsible for taking the context of the speech into account, with an RNN cell. You see the difference: they keep the post-processing network and only replace the stacked residual blocks with a GRU cell, and this greatly increases the efficiency of the model, because the number of calculations has been greatly reduced. Also, in the original implementation of WaveRNN, instead of having a softmax output with 256 quantization levels, they propose an output distribution which splits the 16-bit audio into two bytes, a coarse byte and a fine byte, where the fine byte is conditioned on the coarse byte. Using that, the quality of the speech was further improved. So now we have seen models that have an 8-bit softmax as the output distribution, models that output the parameters of a continuous distribution, such as a Gaussian, a logistic, or mixtures of Gaussians or logistics, and models that have this coarse/fine output distribution. In terms of quality, the best quality is given by the model with the coarse/fine output, followed by the models which output parameters of a continuous distribution, followed by the models which have an 8-bit output. In terms of ease and stability during training, the most stable one is the model with the 8-bit output, followed by the model with the coarse/fine output distribution, and then come the models that output parameters of a continuous distribution. In fact, it is very difficult to train a stable WaveRNN model which produces parameters of a continuous distribution. While this is possible with WaveNet, it is very difficult to do the same with WaveRNN, and this might be the reason why the authors of this paper prefer the coarse/fine output distribution over parameters of a continuous distribution. Let's see a little how the training of the WaveRNN with coarse/fine output is done. It can be done in parallel, because both coarse and fine can be predicted at the same time and compared with the corresponding target values in parallel, since we have the input speech available during training. Unfortunately, during generation, we first have to produce the coarse output probability and sample from it, and then use it to sample the fine part; for this reason, especially on GPU, we have a sequential process of producing these two parts, and this makes the system twice as slow as if we had a single output distribution, a softmax with an 8-bit output. In that case, when we have a single output, the system is much simplified; it is a very simple system. We have a three-layer network: there is a GRU unit and two dense output layers, and then there is a cross-entropy loss function which measures the difference between the network's prediction and the desired output.
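As a rough illustration of that simplest variant — a sketch, assuming PyTorch, with illustrative layer sizes rather than the ones from the paper — the single-output model is just a GRU followed by two dense layers and a 256-way softmax trained with cross entropy:

```python
import torch
import torch.nn as nn

class SimpleWaveRNN(nn.Module):
    def __init__(self, cond_dim=80, rnn_dim=896, fc_dim=896, n_classes=256):
        super().__init__()
        # embed the previous 8-bit (mu-law quantized) sample
        self.sample_embed = nn.Embedding(n_classes, 256)
        self.rnn = nn.GRU(256 + cond_dim, rnn_dim, batch_first=True)
        self.fc1 = nn.Linear(rnn_dim, fc_dim)
        self.fc2 = nn.Linear(fc_dim, n_classes)   # logits over 256 levels

    def forward(self, prev_samples, cond):
        # prev_samples: (B, T) int64; cond: (B, T, cond_dim) upsampled conditioning features
        x = torch.cat([self.sample_embed(prev_samples), cond], dim=-1)
        h, _ = self.rnn(x)
        return self.fc2(torch.relu(self.fc1(h)))  # (B, T, n_classes)

# training: cross entropy between the logits and the next quantized sample
# loss = nn.CrossEntropyLoss()(logits.transpose(1, 2), target_samples)
```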
Similarly, the generation is also very simple, because we generate one sample at a time and use it as input to generate the next sample, but we don't have to do this step twice as with the coarse/fine output. So, in order to make WaveRNN faster, one simplification we can do is to use a single-sample output, and this can make it twice as fast, especially if we run it on GPUs. The paper of Kalchbrenner et al. proposes another two techniques to make WaveRNN fast enough that it is faster than real time. The first technique is to sparsify the input and recurrent matrices of the RNN, which are the big matrices involved there. The other technique is to produce samples in parallel by breaking the dependencies of the autoregressive model. There are many sparsity algorithms. The one proposed in the paper of Kalchbrenner et al. is called weight pruning. It works as follows. We randomly initialize a dense neural network and train it until we believe that it has converged. Then we prune a fraction of the network weights — in the paper they prune the weights that have the smallest absolute values. And then we retrain the pruned model. We repeat steps two and three: we train, we prune, we train the pruned network again, and then we prune again, until we reach a predefined sparsity level. And we can see here, in this picture, that the multiplication of the state by the hidden-to-hidden matrix of the GRU does not actually have to involve the full matrices, but, in the example shown here, only the gray blocks and not the white blocks, and there are special sparsity libraries which can handle that. And now the new A100 GPU from NVIDIA also has support for fine-grained structured sparsity, so this can also be done very quickly in hardware. Another sparsity algorithm is based on the lottery ticket hypothesis. What is this hypothesis? It says the following: a dense, randomly initialized feedforward network contains a subnetwork, or many subnetworks, which we call winning tickets, that, when trained in isolation, reach test accuracy comparable to the original network in a similar number of iterations. This hypothesis has later been extended to most neural networks and also applies to GRU recurrent neural networks. So, the algorithm works as follows: randomly initialize a neural network, train the network until it converges, prune a fraction of the network. Up to now, this is similar to the previous algorithm. Now, the difference is the following: we extract the winning ticket, reset the weights of the remaining portion of the network to their initial random values, then retrain the new network with the reset weights, and repeat these steps. People have found that by doing this resetting, this algorithm performs better than the previous one — especially in classification problems, not necessarily in generation problems like ours here. And in some cases, the sparse network can produce higher accuracy than the dense network that has been trained to convergence from the beginning.
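Here is a minimal sketch of the iterative magnitude pruning just described, with an optional reset step that turns it into the lottery-ticket procedure (assuming PyTorch; train_fn, the schedule, and the sparsity values are illustrative assumptions, not the settings from the paper):

```python
import copy
import torch

def magnitude_keep_mask(weight, sparsity):
    """Keep-mask that zeroes out the `sparsity` fraction of smallest-magnitude weights."""
    k = int(weight.numel() * sparsity)
    if k == 0:
        return torch.ones_like(weight, dtype=torch.bool)
    threshold = weight.abs().flatten().kthvalue(k).values
    return weight.abs() > threshold

def iterative_pruning(model, train_fn, rounds=5, final_sparsity=0.9, lottery_reset=False):
    init_state = copy.deepcopy(model.state_dict())   # kept for the lottery-ticket reset
    masks = {}
    for r in range(1, rounds + 1):
        train_fn(model, masks)                        # (re)train with the current masks applied
        sparsity = final_sparsity * r / rounds        # gradually raise the sparsity level
        for name, p in model.named_parameters():
            if "weight" in name:                      # prune only the weight matrices
                masks[name] = magnitude_keep_mask(p.data, sparsity)
                p.data *= masks[name]
        if lottery_reset:
            # lottery ticket: reset the surviving weights to their initial random values
            for name, p in model.named_parameters():
                if name in masks:
                    p.data = init_state[name] * masks[name]
    return model, masks
```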
And there is an even stranger lottery ticket hypothesis, the extended lottery ticket hypothesis, which is the following: within a sufficiently overparameterized neural network with random weights — that is, at initialization, without any training at all — there already exists at least one subnetwork that achieves competitive accuracy, for example in classification problems, and in our case competitive generation quality. And the algorithm is as follows. Randomly initialize a neural network — and the best distribution from which to draw the random weights is the uniform distribution; this algorithm performs better if we initialize with a uniform distribution than if we initialize, for example, with a Gaussian distribution. Then we prune a fraction of the network; we will see how to do that — let's suppose that an oracle tells us which fraction of the network we have to prune. The remaining network performs as well as or even better than the dense network that has been trained to convergence. So we examine whether the resulting network is indeed a winning ticket. And in this algorithm, there is no training of the weights at all. Pruning is all we do. We do not train the network: we have a randomly initialized network and we only prune it. And after we have pruned it, the sparse network performs at least as well as or even better than the dense network which has been trained to convergence. However, the problem here is the following: which connections should we choose to prune so that we are left with the winning-ticket network? This is not a trivial problem. An algorithm was proposed recently in a paper called What's Hidden in a Randomly Weighted Neural Network. This algorithm works very well for deep, dense neural networks, with one layer after the other, and for classification problems. It has some problems with recurrent neural networks; however, it is an open research area how to make this kind of algorithm work with recurrent networks as well. Of all these sparsity algorithms, the first one, the weight pruning proposed in the paper of Kalchbrenner et al., is the easiest to use and is very efficient, so it is my suggestion if you want to start implementing sparsity. And we also have to consider block sparsity, because unstructured sparsity — although it has the same number of weights as block sparsity — is not so easy to handle when you want the real-time performance or speed that you need, because sparse linear algebra requires a lot of indexing. Finally, there is one more family of sparsity algorithms, which I call rewiring. This family of algorithms starts from a sparse network from the beginning, so as to make the training faster, and it performs the following three steps; most of the algorithms perform only steps one and three, and some algorithms also perform step two. The first step is to prune a predefined percentage of the weakest weights, or of some weights according to some criterion — the most used criterion is the weights that have the smallest absolute value. The second step is to redistribute the percentage of connections, in the case where we have deep neural networks with many layers: allow more connections in the layers that are more important for the task at hand, for example classification or generation, and allow fewer connections in the layers that do not contribute as much — so we redistribute the percentage of connections between layers. And the third step is to grow new weights according to some criterion, which may be random choice or some other criterion, so that the percentage of weights remains more or less the same as in the initial network.
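As a minimal sketch of one prune-and-regrow step in the spirit of this family — assuming PyTorch, a single 2-D weight matrix, and random regrowth as in the first of the two papers discussed next; the names and the fraction are illustrative, and the redistribution step is omitted:

```python
import torch

def rewire_step(weight, mask, prune_fraction=0.1):
    """One prune-and-regrow step on a single 2-D weight matrix and its boolean mask."""
    active = mask.nonzero(as_tuple=False)              # indices of current connections
    n = int(len(active) * prune_fraction)
    if n == 0:
        return mask
    # step 1: prune the n active connections with the smallest absolute weight
    weakest = weight[mask].abs().topk(n, largest=False).indices
    drop_idx = tuple(active[weakest].t())
    mask[drop_idx] = False
    weight.data[drop_idx] = 0.0
    # step 3: grow n new connections chosen at random among the inactive ones,
    # starting from zero and learned with further training
    inactive = (~mask).nonzero(as_tuple=False)
    grow_idx = tuple(inactive[torch.randperm(len(inactive))[:n]].t())
    mask[grow_idx] = True
    weight.data[grow_idx] = 0.0
    return mask
```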
Two representative papers in this family of algorithms are the following. The first is the paper of Mocanu et al., which proposes an algorithm that mimics the way the brain forms its connections: during learning, new connections are formed; during sleep, the weakest connections are removed. The algorithm assumes a constant number of random connections. Every certain number of iterations, the weakest connections are removed, then new random connections are added whose weights are learned with additional training. So the new connections here are added randomly. The next paper, by Dettmers et al., regrows connections according to the mean magnitude of the momentum of the weights. So the biggest difference from the previous paper is that when we grow new weights, we don't choose the new connections at random, but we choose those connections that have the greatest magnitude of the momentum. How do we calculate the momentum? We calculate it from the optimizer and from the derivatives — if, for example, we use SGD with momentum, or Adam, which already contains some momentum terms inside. The third technique that has been proposed in the paper of Kalchbrenner et al. is to produce more samples at a time, up to 16 samples in parallel. This can divide the generation time by 16, which can make it feasible to produce speech faster than real time. This algorithm requires that some of the dependencies of the autoregressive model are not respected; by breaking these dependencies we expect some minor degradation of the quality of the generated audio, and people have to be very careful how to implement these subscale algorithms in order to keep the quality of the generated audio the same as that of the WaveRNN with a single output at each time step. And these were some additional things that I wanted to add to the excellent presentation of Sivanand on WaveRNN, and on how to make WaveRNN fast enough to be used in production. Finally, we can use a model related to WaveRNN which is called LPCNet. I will not explain LPCNet in detail, because other lecturers will explain how LPCNet works. I will only say here that we can simplify the original architecture of LPCNet: we can use only one GRU unit instead of two; it is not necessary to use the features that were proposed in the original LPCNet paper, we can use our own features; and the frame rate network, although it is important to keep it when modeling multiple speakers, can be safely removed in a single-speaker setting without changing the quality of the synthesized speech. And this is the end of this presentation, and I will go to the next one, which is Parallel WaveNet, another kind of vocoder, a non-autoregressive one. Back in 2017, people were impressed by the quality of WaveNet, but they could not use WaveNet in production because it was too slow for real-time generation. So people started to think about how to find alternative neural vocoders that model the samples of the speech.
Some attempts were algorithmic and programming improvements of WaveNet; for example, many of these improvements can be found in Deep Voice 1 and Deep Voice 2. Another attempt was programming optimization; for example, there was a CUDA implementation of WaveNet released by NVIDIA, a publicly available implementation. Another kind of technique was dividing and decimating the full-band signal into subbands, so that waveforms with lower sampling rates could be generated and used in training: training a WaveNet for each subband, generating each subband, and then combining the subbands. This was the subband WaveNet, which was proposed in 2017. Another idea was to consider alternative autoregressive architectures, for example WaveRNN in 2018 and FFTNet, also proposed in 2018. Of these two alternatives, WaveRNN managed to produce samples with quality close to WaveNet. WaveNet is still the one that produces the best quality, but WaveRNN is very, very close, and, after some optimizations, WaveRNN can be used in production. And finally, people started to create non-autoregressive vocoders, and the first attempt was the Parallel WaveNet. It was introduced by DeepMind in October 2017, and it transforms a sequence of noise samples into speech in one forward pass of the model. Here we can see the sequential generation of WaveNet. This image was taken from DeepMind's blog and explains why it is slow to generate samples with WaveNet: because we have to generate one sample at a time. And here we can see how the Parallel WaveNet works. We have an input which is uncorrelated noise samples. The noise is usually drawn from a Gaussian, a logistic, or a uniform distribution, and then this input passes through a network, like a function, and is transformed: this noise is transformed into another signal which has the characteristics of speech. And if we also condition the input on linguistic or acoustic information, then the output will not only have the characteristics of the speech of the speaker that was used in the training of the model, but will also convey that information. Now we will see some details of how the Parallel WaveNet works and some of the principles that underlie this kind of flow model. We have an observed data variable x, which in our case is the speech.
We have a simple probability distribution in a latent space — for example, p of z might be a logistic or a Gaussian distribution, an uncorrelated Gaussian for each sample, uncorrelated through time — and suppose we have a bijection f from the latent space to the observed space, with the property that the inverse exists. Then the change-of-variables formula is shown here, and from this formula we can calculate, or estimate, the probability p of x if we know the distribution p of z; but in this formula we also need the logarithm of the determinant of the derivative of this function f of z, the derivative with respect to z. Let's assume that we have such a function — it is not trivial to find one, but let's assume that we have it. Then we can use this function for generation: we have a latent space, we know its probability distribution, it is for example a multi-dimensional Gaussian or logistic, we can draw a sample, and then we use this function f to transform the sample from the latent space. And we can also use the inverse function to transform the samples of the data — whose probability distribution is unknown to us, we only know it through samples — to the latent space, and then check whether the distribution of the transformed samples is close to the distribution that we have assumed for the latent space. So we can do two things: generation, from latent space to data space, and inference, from the samples of the data space back to the probability of the latent space. And we can see here that we have a sequence of uncorrelated noise, and through the function f we can create noise which sounds like speech; and through the inverse function we can take speech and transform it to uncorrelated noise, and compare this uncorrelated noise to check whether it is compatible with the probability distribution that we have assumed for the latent space. The problem here is how to effectively compute this bijection function and also how to effectively compute the determinant of the derivative. So the problem is the following: the function f has to be continuous, invertible, and efficiently calculated, and the determinant of the Jacobian should also be efficiently calculated. One may ask: is it possible to find a mapping that has all these characteristics? The answer is yes, and the solution is that we can use a normalizing flow, which is a powerful framework for building flexible posterior distributions through an iterative procedure. This procedure works as follows: we start from a sample z_0, which has been drawn from a known distribution — which can be a multi-dimensional Gaussian, logistic, or uniform distribution — and then, through some iterative process, passing through a mapping, we convert z_0 to z_i, and in every iteration z_i approaches our target distribution, whose closed form is unknown to us; we only have samples from it. In our case, in Parallel WaveNet, this normalizing flow has been chosen to be an autoregressive function, a very simple autoregressive function, which is a single affine transformation: for example, x_t is equal to z_t, the latent-space random sample, multiplied by a constant s, and to this another constant m is added.
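Written out — a sketch using the notation above, with latent variable z, data x, and the bijection x = f(z) — the change-of-variables formula and the single affine step are:

\[ p_X(x) = p_Z\!\left(f^{-1}(x)\right)\,\left|\det\frac{\partial f^{-1}(x)}{\partial x}\right|, \qquad \log p_X(x) = \log p_Z(z) - \log\left|\det\frac{\partial f(z)}{\partial z}\right|, \]
\[ x_t = s\,z_t + m . \]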
To make this transformation meaningful and appropriate for encoding speech, we make the constant s at time step t depend on all previous random samples, and we also make the additive constant m depend on all previous samples. These two constants are outputs of a network: there is a network which produces these two constants at every time step, and the parameters of that network are theta. We can see, for example, that if z_t is drawn from a logistic distribution, then x_t, through this transformation, also belongs to a logistic distribution; the only difference from before is that z_t was logistic with mean zero and scale one, and now the mean is m_t, which is different at every time step, so as to follow the evolution of the speech, and the scale is s_t, also different at every time step, which encodes our uncertainty about how close to m_t the next sample will be. And this transformation has a very nice property: the derivative can be very easily calculated because, as we will see in the next slide, the matrix of the derivative is lower triangular, so the determinant is the product of the diagonal elements, and the logarithm of the determinant is the sum of the logarithms of the scale outputs of the network. We can see here, in this example, that the matrix is lower triangular. And the algorithm works as follows: we take the initial sample — which is actually a sequence of samples, taken from a multi-dimensional Gaussian where each component is independent of the other components — and then we take four iterations of this affine mapping that we defined before, and the final parameters at each time step can be calculated through the properties of the logistic distribution; we are shown here, at every time step, the mean and the scale parameter after this transformation has been applied four times. And here, in this slide, we can see the whole algorithm of the Parallel WaveNet, where the network that predicts the mean and the logarithm of the scale has been chosen to be a WaveNet itself. It does not have to be a WaveNet — it can be any kind of autoregressive network — but since the people who created the Parallel WaveNet already had code for WaveNet, it was easier for them to use a WaveNet as the network that generates the predictions for m and log s. One could substitute the WaveNet here, for example, with a recurrent neural network, although recurrent neural networks have some problems with stability in training. The only difference of each of these functions that are used to predict the mean and scale parameters from the original WaveNet is that here we do not use residual connections, but only connect the output of the last layer to the post-processing network. This is the only small difference, but it is not an important one; we can also use residual connections, and the residual connections do not change the properties of the model. Here we can see the training and the generation of Parallel WaveNet. The original WaveNet allows parallel training, because we have all the samples of the input speech, but it requires sequential generation, because during generation we need the previous samples and we generate them one at a time. The Parallel WaveNet allows parallel generation, because we have all the noise available and can make a forward pass through the network to generate all the speech samples.
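Putting the pieces just described into formulas — a sketch, schematic in the sense that in the actual model the scales and means of each flow are produced by that flow's own network from the output of the previous flow — the affine flow, its log-determinant, and the composition of N such flows are:

\[ x_t = z_t\, s_t(z_{<t}) + m_t(z_{<t}), \qquad \log\left|\det\frac{\partial x}{\partial z}\right| = \sum_t \log s_t(z_{<t}), \]
and after N flows the output at each time step is again logistic, with
\[ s^{tot}_t = \prod_{i=1}^{N} s^{(i)}_t, \qquad m^{tot}_t = \sum_{i=1}^{N} m^{(i)}_t \prod_{j>i} s^{(j)}_t . \]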
However, during training, we have to transform the speech back to noise and then compare that noise with the desired noise distribution. This process is sequential — it is done one sample at a time — and therefore this way of training the Parallel WaveNet, the inverse autoregressive flow, is impractical, because it is very slow. In order to make the training practical, we can use the following techniques. The first technique, which is easy to implement, is to compare the output of the Parallel WaveNet with the desired output one sample at a time. This technique allows us to debug the network; however, due to the noise, it is not good practice to compare one sample with another randomly generated speech sample, and here we can hear what the output of the network is if we use this training process — it is a very muffled voice. Another technique, which was used by DeepMind, is distillation: we first train a teacher network, a WaveNet, and then we extract the knowledge of the teacher and train the Parallel WaveNet, the student, through a process called knowledge distillation, by minimizing the Kullback-Leibler divergence between the student and the teacher. Due to the time limitation, I will not explain how we calculate this Kullback-Leibler divergence term by term. I will only comment here that the initial paper of Parallel WaveNet uses the quantized mixture of logistics, which, as you will see in the hands-on, is not a very stable distribution; therefore another implementation of the same algorithm, called ClariNet and proposed by Baidu, uses a single Gaussian, which is much more stable to train and gives, in my opinion, similar or even better quality of synthesized speech. Here is the high-level view of the distillation process, and here we can see how the cumulative and probability density functions differ between the teacher and the student. There is a discrepancy because the teacher uses a mixture of discretized logistics and the student uses a single logistic distribution, and we can see here some examples where the two distributions cannot match theoretically — and it is even more difficult to match them during training. In order to make the synthesized speech very close to the true speech, the Kullback-Leibler divergence loss of the distillation process is not enough, so we need some additional losses — and you also need some of these additional losses in other systems, like in MelGAN, which Sivanand explained before. The most important of all these losses is the power loss, which tries to make the spectral distribution of the generated speech match the spectral distribution of the true speech. DeepMind also proposed the contrastive loss and the perceptual loss, but these two losses offer very, very small improvements, while they are very complex to implement, especially the perceptual loss, which also involves some notions from speech recognition.
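A minimal sketch of a power loss in the spirit just described — match the STFT power spectrum of the generated speech to that of the reference — assuming PyTorch; the frame sizes and the plain mean-squared-error form are illustrative assumptions, not the exact formulation from the Parallel WaveNet paper:

```python
import torch

def power_loss(y_gen, y_ref, n_fft=1024, hop=256):
    # y_gen, y_ref: (B, T) waveforms
    window = torch.hann_window(n_fft, device=y_gen.device)
    def power_spec(y):
        return torch.stft(y, n_fft, hop_length=hop, window=window,
                          return_complex=True).abs() ** 2
    return torch.mean((power_spec(y_gen) - power_spec(y_ref)) ** 2)

# total_loss = kl_distillation_loss + lambda_power * power_loss(y_gen, y_ref)
```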
We can listen to some samples from Parallel WaveNet. The first one was trained with a very weak power loss, so you can hear what the effect of the power loss is — the power loss here is very weak. "Author of the danger trail, Philip Steels, etc." So the distillation process alone produces this kind of speech, which is far away from the desired one. "Author of the danger trail, Philip Steels, etc." "Author of the danger trail, Philip Steels, etc." And now we will hear speech which has been generated when the parameter which multiplies the power loss has been increased significantly. "Author of the danger trail, Philip Steels, etc." So the power loss is very, very important. "Author of the danger trail, Philip Steels, etc." "Author of the danger trail, Philip Steels, etc." "Author of the danger trail, Philip Steels, etc." "Author of the danger trail, Philip Steels, etc." "Author of the danger trail, Philip Steels, etc." As you can hear, the last voice, SLTA-0, has some background noise, and this is one of the drawbacks of Parallel WaveNet: for some speakers it does not perform as well as for others, so its quality is not uniform. However, of all the non-autoregressive models, it is still the one that offers the best quality. So, if I have to categorize the vocoders in terms of quality, the best one is WaveNet, followed very closely by WaveRNN, and from the non-autoregressive vocoders the one that offers the best quality up to now is Parallel WaveNet. But the trend is that MelGAN may very soon, or in the near future, reach or surpass that quality. "Author of the da—." And this is the end of my presentation. If you have some questions, feel free to ask me. Hello. Can you hear me? Yes. May I ask one question about the Parallel WaveNet part? Yes. So you showed the samples where we increase the weight for the power loss and the quality of the speech increases. Have you ever tried to just use the power loss without using the distillation loss? Yes, I have tried it, and the Parallel WaveNet can be trained only with the power loss. But still, the distillation is important: I mean, if you use both of them combined, it is better than if you use only the power loss. OK, thanks. I think I got it. At least in my experiments — and I have seen in the literature some papers where people use only the power loss without distillation. Yes. Yes. I think there are many non-autoregressive models that just directly use the power loss or combine it with the GAN-based approach, and they can still generate good quality — I mean, speech with good quality from the random noise. I think that's pretty appealing and interesting. Yeah. Actually, I stopped doing experiments with that model some time ago. I trained a model once with the power loss, but I did not continue those experiments; I can only rely on the literature to answer your question, and I have seen that some people, using only the power loss, manage to train very good models with very good quality. But I cannot say for sure that you get very good quality, because I have not done it myself. Thanks, thanks. May I ask another question regarding the Gaussian WaveNet in the previous part of the presentation? Yes. I think you mentioned two types of artifacts produced by the Gaussian WaveNet. For the type 1... No, the Gaussian does not produce type 2, only type 1. All right, all right. Oh, so can you go to that slide? I forgot which one is which. I can go to that slide, yes, of course. Wait a moment. All right. Yes, type 1. Type 1 — I use the names that were used in the original papers. Type 1 is the errors that are caused when the network predicts the parameters of a continuous distribution. Type 2 is the kind of artifacts which are usually caused when the model predicts a categorical distribution through the softmax.
The type 2 artifacts are mainly caused by sampling — they are mainly sampling problems — and sometimes they are also modeling problems, if many samples in a sequence go wrong. You will see this type 2 problem, where many samples in a sequence go wrong, not in WaveNet but in WaveRNN; WaveNet does not have this kind of problem. In WaveRNN, because it is recurrent, if one sample goes wrong, or if something is unexpected with the labeling — for example with the mel, if there is a jump in the mels produced by Tacotron — this can produce a small sequence of samples which are outside the expected values and cause some clicks, and not subtle clicks. Thanks for the explanation, I think I got it. Other questions from the audience? Hi, can you hear me? Yes. Yeah, thank you very much for the presentation. So I have a question about the slide you probably skipped in the second part. You said there were theoretical limitations for the student to match the teacher distribution. Could you expand on that? What are those? For example, why can't I just train the student well using only the distillation loss? Right. So — sorry, just one more thing — I remember in the paper they said that using only the distillation loss, you just get whisper-like speech. Yes. So, for example, why does this happen? From a theoretical perspective, I cannot answer your question. We have found that it produces noise, something like that, without using the power loss. So the distillation is not enough, probably because some high-level noise seems to pass through the distillation from the teacher to the student, and this affects the results, producing something like harshness. But I cannot explain why this happens theoretically; I have not found an explanation for it. It is certainly true, though, that it is impossible to train a high-quality Parallel WaveNet without the power loss. It's important. It's very important. I mean, the power loss is even more important than the distillation loss. Yeah, thank you very much.