Anyway, thank you very much for this wonderful invitation. I'm Junichi Yamagishi, and with me is Xin Wang. We will jointly give a tutorial on neural vocoders. In particular, we want to introduce several types of neural vocoders, including autoregressive ones, source-filter ones, and also glottal ones. In the lectures yesterday, I presume you already studied neural autoregressive models, but we will try to give you different ways to interpret those models, and we will also try to explain the relationships between them. And since we are now working not only on speech but also on music signals, we will also talk about those topics in the later part; Xin Wang will mainly talk about those points. By the way, as Yanin said, if you have any questions during our lecture, please don't hesitate, please interrupt us anytime. You know, it's not, how can I say, comfortable for me to keep talking to a display for three hours, so it would be nice to have some social or audio feedback from your side at some point. Anyway, let's start.

First of all, I'm a professor at NII in Japan. I graduated from the signal processing group that proposed cepstral vocoders a long time ago. My supervisor at the time suggested that I work on machine learning within a signal processing group, and therefore I studied hidden Markov models at the time. But since I was, how can I say, educated on those vocoder topics, I was given the wonderful opportunity to work with Professor Paavo Alku on glottal vocoders and their statistical modeling. I was also given another opportunity to work with Yannis on sinusoidal vocoders and their statistical modeling. So I will talk not only about neural vocoders but also about those so-called traditional vocoders, because they are quite relevant to the current neural vocoders, in my opinion. Xin Wang? Yes, I'm currently a post-doc and I graduated from the same lab. I was working on many text-to-speech topics, especially fundamental frequency modeling. After I graduated, I started working on neural vocoders, especially the neural source-filter vocoders. That's the reason I will give the presentation about the NSF, the neural source-filter model, today.

Right, so this is the agenda of our lecture today. In the first part, we will quickly explain what a vocoder is. In the next part, we will explain how a vocoder works. There are three parts in that section: part one is an overview of autoregressive vocoders, part two is source-filter vocoders, and part three is glottal vocoders. After that, we will move on to the part on musical instruments. In the second section, we will actually overview many vocoders. As shown here, in total there are probably 15 vocoders, including non-neural ones and neural ones. I understand that most of you are interested in the neural ones, but I will quickly explain the non-neural ones for each type, because theoretically they are quite relevant, and also once you understand the non-neural ones, you can understand more deeply why the neural ones are better than the traditional ones. In part 2-1, I will explain autoregressive vocoders, with LPC as the non-neural one, and as neural autoregressive vocoders we will quickly talk about WaveNet and Parallel WaveNet, as well as WaveGlow and WaveFlow.
Since this topic is not only, how can I say, relevant to waveform generation, but also to so-called representation learning in the machine learning field, I will also quickly mention how those models are useful for learning meaningful representations for downstream tasks. Then I will talk about older vocoders that are still important, in my opinion, namely cepstral vocoders, mixed excitation, and so on. After that, Xin Wang, xw for short, will talk about his recent vocoders, quite wonderful new vocoders called neural source filter, NSF for short. Then we will go to the advanced topics. This is the most challenging part, so I hope all of you can follow those sections, because he covers, how can I say, quite advanced signal processing and machine learning topics in part 2-3: we introduce harmonic-plus-noise forms, trainable excitations, and even reverberation. Then you will probably wonder about the differences between those models, so I will clarify the differences between them and also their relevance to the most recent models like WaveGAN or MelGAN. Then we will come back to glottal models like the LF model or iterative adaptive inverse filtering, and after that I will talk about their neural variants, such as GlotNet, GELP, or LPCNet. So part three will be a nice introduction for your lectures on day three.

Expected audience and outcomes: Xin Wang and I prepared these slides for students who already use deep learning to some extent but are not so familiar with signal processing for vocoders. Yesterday I checked the participant list and I found a couple of signal processing experts. For such an exceptional audience, this talk might not be so, how can I say, interesting, but at least the overview of the neural vocoders we cover should be useful for anyone. What we try to teach in these lectures are the basic concepts of traditional and neural vocoders, the theoretical and fundamental differences between them, and their close relationships. We will not talk about equations too much, because I like figures, I like visual understanding, so we have many figures in these lectures. We will also not teach detailed vocoder configurations, such as, you know, the number of layers, the type of network or activation layers, the dimensions and so on; we will not teach that kind of boring stuff. Instead, we try to give you, how can I say, the conceptual differences between the neural vocoders. Then after the talk, you will probably understand that speech signal processing and machine learning have a close overlap: although they use different terminology, what we try to solve in these two fields is actually quite related.

Okay, so let's go to part one. Do you want to say something, apart from what I said earlier? It's okay. All right, so let's define the vocoder as a machine learning task. I think the task is simple: the task is to generate speech waveform samples from given acoustic features. Those acoustic features could be the spectrum envelope or other types of features, like linguistic features or other information such as the fundamental frequency. Sometimes they are called conditional features, because those features become the conditions for generating the speech waveform. But simply speaking, you can assume this is the input: the input is acoustic features, and we want to generate speech waveform samples. In the slides, we use mathematical notation like this: we use the variable C for the conditional features, or acoustic features, written C1, C2, C3 and so on, where the subscripts are time indexes.
The conditional features are the per-frame features at times 1, 2, 3, 4 and so on up to N. We input those features to the vocoder to generate the waveform samples O1, O2, O3 and so on. If we use the normal PCM waveform format, O1 would be a scalar value, as you know. If we quantize the waveform using mu-law, for instance, O1 could be a vector. So depending on the vocoder you want to use, there are several formats of input acoustic features and output waveforms. The most basic waveform format is, of course, linear PCM, but instead of this format we sometimes also use continuous-valued waveforms or quantized waveforms. In any case, we use the same notation O in this lecture. Examples of input acoustic features are cepstra, filter banks and so on; they are features representing the spectrum envelope. We may also use other features like the fundamental frequency or aperiodicity parameters to provide noise information in the high-frequency part.

Since some of you might not be so familiar with signal processing or with how we extract features, I will try to visualize how we get those acoustic features. If you segment a speech waveform, especially in voiced regions, you can see speech waveforms like these, where similar patterns are repeated; you can see similar waveforms repeated several times. If you apply a time-to-frequency conversion called the short-time Fourier transform, you can see which frequency is strong and which frequency is dominant. For instance, you can see that these regions are the dominant frequencies of this speech segment, but in the meantime, you can also see that the speech signal has many frequencies mixed together. If you don't know the theoretical background of the Fourier transform, you can imagine it is almost equivalent to a prism. A prism is, how can I say, an interesting crystal-like thing that transforms light into colors; the light is a waveform and the colors are frequencies. The Fourier transform does a similar conversion from time to frequency. But as you can see, there are too many points in the frequency domain after the Fourier transform, which is a little bit challenging for machine learning to handle. So we typically apply post-processing, such as mel filter banks or a cepstral transform. Filter banks basically apply band-pass filters, which I will show on the next slide, to represent auditory scales and also to perform dimensionality reduction. The cepstrum, on the other hand, does a few additional things compared to filter banks: it decorrelates those frequency points, and it also removes F0 information through liftering. In any case, what we have after this processing is basically the spectrum envelope, as shown in this figure. For auditory scaling, the mel scale is quite typical. In addition to the mel scale, we may use other auditory scales like the Bark or ERB scale. The figure on the top left shows the relationship between linear frequency and the warped frequencies. There are some differences between those scales, but the point is that auditory scales provide more resolution in the low frequencies, because low frequencies are more important for speech signals. Once we have a warped frequency scale, we basically place filters equally spaced on the warped scale, which gives a filter bank placement like the right figure: we have many filters in the low-frequency part, as you can see, and only a few filters in the high-frequency part. By the way, Xin Wang and I are now giving this lecture from home, so sometimes my kids might jump in and interrupt my lecture.
In that case, Xin Wang will take over and we will carry on the lecture. I hope I can keep going as long as I can. Anyway, we can also, you know, edit the video afterwards. Yeah, thank you. He might also wish to study, you know. Anyway, after filtering with those filter banks, we basically get simplified spectra like the blue curves at the bottom. By the way, I have a question here. Thank you. I mean, I see that a lot of people use mel filter banks for neural vocoders. Have you tried, for the same neural vocoder, to compare ERB, mel and Bark and see which one provides better results, and to compare as a function of, for instance, gender, or whether it is a tonal language or not? Is there any work around that? That's a good question. I haven't done comparisons of auditory scales for neural vocoders myself, but Xin Wang did comparisons of waveform representations. He compared mel filter banks and mel-cepstra, and he found that filter banks are better than mel-cepstra. I think somebody else also did similar investigations in the past. I did some related investigations for training HMMs, not neural vocoders, and I compared those three auditory scales. According to the HMM training results, the Bark scale was the best, mel was second, and the ERB scale was the worst. But I don't know whether this holds for neural vocoders or not. Thank you. That's a good question.

The point is that after the mel filtering, the FFT points, shown in red, can be simplified to the blue curves, as you see at the bottom of the figure. Basically, those blue curves capture the spectrum envelope in the middle- and high-frequency parts, but in the low-frequency part, the mel filter bank output captures the harmonic structure. I will come back to this point in section two, later. Since this is just an overview of neural vocoders, I also want to quickly explain related tasks like copy synthesis, TTS, and voice conversion; some of you already work on those. In our definition, copy synthesis is the task of extracting acoustic features from a waveform and then re-synthesizing speech using a vocoder. In some sense, this is an autoencoder-like task, if I use machine learning terminology. On the other hand, TTS and voice conversion use different types of conditional features: TTS uses text as the input, and voice conversion uses speech data from somebody else as the input. Those models predict or generate the acoustic features required by the vocoder, and the vocoder generates the waveform samples. So this was a quick overview of what a vocoder is.

Now we go to the second part, how does a vocoder work? We will introduce over 15 different types of vocoders, as I explained earlier. First I want to introduce the simplest vocoders, which are still quite important: linear AR models. The linear AR model means basically that we predict the waveform sample o at time n based on a weighted sum of past waveform samples, plus a noise term e. This is a quite simple assumption, but it is quite suitable for speech and has been used in the speech field for nearly 70 years. It is called linear predictive coding, because it uses a linear weighted sum of past waveform samples. This equation is called autoregressive, because it uses the waveform samples themselves to predict the next waveform sample, so it is a kind of prediction based on the past samples. The typical number of past waveform samples used for the prediction of course varies, depending on the data and the task.
But the number is relatively limited; typical values are around 10 to 16. As you can see from this illustration, if we wish to predict the waveform sample in red here, sorry, let me use a pen, if we wish to predict this sample, what the model does is simple: we basically look at the past waveform samples and take their weighted sum. The linear AR model is simple, but it works pretty well. This is how we normally apply LPC to a speech waveform. We apply framing to get segmented speech waveforms. Then, for each segment, we compute the optimal LPC coefficients, the autoregressive coefficients. If we use LPC in a TTS or voice conversion task, we predict those LPC coefficients from the text or from the source speaker, instead of using LPC coefficients extracted from the input frames. Then we generate the speech waveform back through overlap-add. For TTS or voice conversion, we normally use another format of LPC called line spectral pairs instead of the LPC coefficients directly, because of stability issues, which we will also explain in part 2-1.

But you might not have thought about how we do the generation part from given LPC coefficients. The most naive way to generate speech waveform samples from given LPC coefficients is actually autoregressive generation, just like WaveNet, as you studied: we predict the first waveform sample, and using that predicted sample we predict the next one, and we continue this process until the whole waveform sequence is generated. But actually we don't do that, and there is older work similar to the non-autoregressive research that machine learning people are doing right now: there are clever solutions that speed up this LPC generation so that no autoregressive generation is needed. Instead of repeating those left-to-right predictions, we take the STFT of the residual signal e and apply the frequency response of the filter defined by the LPC coefficients a. Why is this computation faster than the naive way? Because we can perform this process at each frame independently; here i is a frame index. So we can perform the conversion from LPC coefficients to waveform independently at each frame. I borrowed these equations from a textbook, but in machine learning terms, we basically use the mean squared error to estimate the autoregressive coefficients. That is, we compute the residual between the ground-truth waveform at time n and the predicted waveform at time n, which is the weighted sum of the p past values, as you can see here. We take the squared error of those residuals and we look for the LPC coefficients a1 to ap that minimize the energy of those residuals. This is the most basic way to estimate LPC coefficients; of course, there are many other ways. We may compute another spectrum envelope representation like the cepstrum and then convert the cepstrum to LPC. Anyway, I will stop speaking about signal processing, because many students might feel sleepy if I continue with this kind of material. So let's suddenly change the topic to neural networks. But as a starting point I will talk about an in-between model, a waveform model in between LPC and WaveNet, so that we can gradually derive WaveNet from LPC. We have another motivation to explain this in-between model: it also leads to LPCNet, which you will study tomorrow. So please let me explain these neural linear autoregressive models, where we still use the linear autoregressive coefficients a, but we predict the waveform sample o using a neural network.
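Before going neural, here is a minimal Python sketch of the frame-wise LPC analysis and synthesis just described, using the straightforward least-squares (autocorrelation) solution; this is a toy illustration, not production vocoder code:

```python
import numpy as np
from scipy.signal import lfilter

def lpc_analysis(frame, order=16):
    """Toy LPC analysis: solve the normal equations for a_1..a_p that
    minimize the residual energy of this frame (autocorrelation method)."""
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    return np.linalg.solve(R, r[1:order + 1])

def lpc_synthesis(a, excitation):
    """Run the excitation through the all-pole filter 1 / (1 - sum_i a_i z^-i),
    i.e. exactly the autoregressive generation o_n = sum_i a_i o_{n-i} + e_n."""
    return lfilter([1.0], np.concatenate(([1.0], -a)), excitation)
```

In practice the Levinson-Durbin recursion is used instead of the direct matrix solve, but the idea is the same.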
So, coming back to the neural linear AR model: this new symbol here is a neural network now. In LPC, the relationship between a waveform sample and the past waveform samples was linear and deterministic. Here, we instead use a neural network to predict the waveform sample at time n, together with the linear AR coefficients. Just to refresh your memory, C is the conditional features, and this notation represents the set of past waveform samples. So how can we do this? It's actually quite simple. Let's consider a feed-forward neural network, even though you were shown many complicated neural networks yesterday; as I said earlier, we like figures, we like visual understanding, so let's start with feed-forward. Using this feed-forward network, let's predict waveform samples with LPC, which gives basically this network. You can see that the condition C is the input. It may be upsampled to align the time resolutions of frames and waveform samples; it doesn't matter, we can simply repeat the conditional features so that the total length matches the waveform. We first predict Gaussian distributions from those conditional features through the feed-forward network, but those Gaussian distributions have the constraint shown here: the observation at time n depends on the previous waveform sample and the one before it. In other words, this is a second-order autoregressive model. If we express this model in probabilistic form, it becomes like this: the probability of O, given the set of LPC coefficients and the conditional features, can be represented as a product of probabilities at each time n, and each probability at time n is estimated by the feed-forward network that uses the conditional feature C. Of course, we could use a recurrent neural network instead of the feed-forward network by adding a dependency between the states, like in this figure, but the point is still that we attach the autoregressive coefficients a to the output.

Probably some of you wonder why we need such autoregressive coefficients at the output. Maybe an RNN alone would be sufficient: since we have time dependency through the hidden state, maybe the recurrent network captures such autoregressive information well enough. But actually, a plain RNN does not capture this kind of time dependency across the output waveform samples, and I will explain this point explicitly now. Here I simplify the previous model further. It is still almost the same network, but I reduce the RNN hidden size to one, and I also reduce the AR (LPC) order to just one. If you do that, you can explicitly write down the mean of the Gaussian distribution at each time step as follows. Note that this is quite easy; if you have a few minutes you can write down these equations yourself. The point is that those means depend on a and also on the past waveform sample O1. If you rewrite those equations in multivariate Gaussian form instead of the factorized form, you can easily see that the covariance matrix is no longer diagonal. If you use only an RNN, the covariance between observation 1 and observation 2 is diagonal, meaning the outputs are treated as independent. On the other hand, if we attach the AR coefficients to the RNN as in this figure, the output has a full covariance matrix like this, and therefore it is no longer independent. So this is the fundamental difference between autoregressive models and a plain RNN. After we attach such AR coefficients, how do we generate speech waveforms? Of course, we may simply estimate those AR parameters from the conditions, but it is also possible to estimate the residual e between the target waveform and the predicted waveform.
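To make the idea of "attaching linear AR coefficients to the network output" concrete, here is a tiny PyTorch sketch; the class name, the GRU, and the sizes are illustrative assumptions, not the exact model from any paper:

```python
import torch

class LinearARHead(torch.nn.Module):
    """Toy neural linear-AR model: an RNN maps conditional features to a mean
    contribution, and trainable AR coefficients add the dependency on past samples."""
    def __init__(self, cond_dim, hidden=64, order=2):
        super().__init__()
        self.rnn = torch.nn.GRU(cond_dim, hidden, batch_first=True)
        self.proj = torch.nn.Linear(hidden, 1)
        self.a = torch.nn.Parameter(torch.zeros(order))  # AR coefficients, trained jointly

    def mean(self, cond, past):
        # cond: [batch, T, cond_dim] upsampled features
        # past: [batch, T, order] the 'order' previous waveform samples at each step
        h, _ = self.rnn(cond)
        return self.proj(h).squeeze(-1) + (past * self.a).sum(dim=-1)
```

During training, `past` can hold the ground-truth previous samples (teacher forcing); at generation time it would be filled with the model's own previous outputs, which is exactly the autoregressive loop.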
Those residuals e may then be predicted by another model, shown here; this is the residual modeling part. For visualization we split the system into two components, but mathematically it is equivalent. In other words, we can estimate the AR coefficients and, at the same time, the model that predicts the residual component e from C. Sorry, somebody keeps sending me messages, I will close Slack. So this was the training. At inference, we basically do the inverse processing: from the given conditional features, such as the spectrum envelope or other features like aperiodicity or the fundamental frequency, we first predict the residual signal e, then apply the autoregressive generation, and we get the final output. This formulation is actually quite relevant to both WaveNet and LPCNet. If we push the linear AR modeling part further, it becomes a nonlinear autoregressive model, and eventually WaveNet. If we advance and improve the residual modeling part further, it becomes LPCNet or the other types of glottal vocoders you will see shortly.

First, in this part, let me explain the linear AR model a little bit further. If you studied LPC as a student, you were probably taught that autoregressive coefficients predicted by HMM Gaussians or by vector quantization are not necessarily stable, and that we therefore have to check the ordering of the line spectral frequency pairs to guarantee that the synthesis filter is stable. If you haven't studied signal processing, you probably don't understand what I'm saying, but the point is that the same problem occurs with DNNs as well: autoregressive coefficients estimated by back-propagation may be unstable if you estimate them directly. So we have to guarantee that the autoregressive coefficients estimated through back-propagation are stable, and we can do this using tricks similar to LPC. How do we do it? This is probably too technical to explain here in one minute, so ideally it's better to check Xin Wang's past papers, but there is one sufficient condition for estimating stable AR parameters. It is not a necessary and sufficient condition, so, how can I say, it's not a perfect solution, but at least it is a sufficient condition. What we can do is transform the autoregressive coefficients into log area ratios, denoted gamma. Then, if we pass them through sigmoid functions so that the values associated with the gammas lie between 0 and 1, the autoregressive generation is stable; otherwise, you might get unstable autoregressive coefficients. We can prove this theoretically, and we can therefore integrate this constraint into the optimization loss to constrain the backward computation. But of course there is a simpler solution, which is to extend the autoregressive model into a nonlinear autoregressive model. In the previous part, we were talking about neural network models, but the autoregressive coefficients were still linear. From now on, we will consider neural models that model the autoregressive dependencies in a nonlinear way, which includes WaveNet and Parallel WaveNet. Part of this section therefore overlaps with the lectures given by Mercedes yesterday, but our view is probably slightly different from his, so I hope these sections are somewhat complementary. In the linear autoregressive neural network case, we assumed that the waveform sample at time n can be approximated using a linear weighted sum of the past p samples. We will now extend this assumption.
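As a rough illustration of such a stability constraint, here is one standard textbook parameterization: unconstrained values are squashed into (-1, 1) and treated as reflection coefficients, which the step-up (Levinson) recursion converts into AR coefficients of a guaranteed-stable filter. This is a generic trick, not necessarily the exact formulation (log area ratios plus sigmoid) used in Xin Wang's paper:

```python
import numpy as np

def reflection_to_ar(k):
    """Step-up (Levinson) recursion: reflection coefficients with |k_i| < 1
    are converted to AR coefficients a_1..a_p of a stable all-pole filter."""
    a = np.zeros(0)
    for m, km in enumerate(k, start=1):
        a_new = np.empty(m)
        a_new[:m - 1] = a - km * a[::-1]   # update lower-order coefficients
        a_new[m - 1] = km                  # new highest-order coefficient
        a = a_new
    return a

gamma = np.array([0.5, -1.0, 0.2])   # unconstrained values, e.g. network outputs
k = np.tanh(gamma)                   # squash into (-1, 1) to enforce stability
a = reflection_to_ar(k)
```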
Coming back to the nonlinear extension: more specifically, we assume that the waveform sample at time n can be better approximated using a nonlinear transform of a huge number of past waveform samples, say 3,000 of them. So instead of using just 10 past samples, we use 3,000, and we additionally apply a nonlinear transform. That is the motivation for the neural networks we explain in this section. How can we achieve this? Actually, the solution is simple: just feed the predicted waveform samples back into the hidden state. That's it. By doing that, we can basically carry the past information forward into the future indefinitely. The information from the waveform sample at time 1 is fed back into the hidden state; that is used for the prediction of the waveform at time 2; that goes into the hidden state again; and if we attach an RNN, the information in the state is carried forward into the future. Therefore, if the RNN remembers all of the past information, we can carry all the information from the past waveform samples forward to predict the next waveform point. Instead of an RNN or a feed-forward network, we may of course use other types of architectures, like a 1-D CNN or a dilated convolutional neural network. The difference between the two is quite small: instead of recurrence, we basically use the states at other time indexes, as you can see from these two figures. For instance, this state depends on hidden states in different layers of the hierarchy; on the other hand, these states depend on this one and also on another state far away in the past.

Probably some of you have not used a CNN before, so let me visualize how a CNN works. Here we assume a 1-D CNN instead of a 2-D CNN. A 1-D CNN is basically filtering, if I use signal processing terminology: we have filters and we have a vector sequence. In the case of the input layer, the vectors are the acoustic features C; in the case of intermediate layers, the vector sequence is the output of the previous hidden layer. We apply the filters to the vector sequence like this, compute the weighted sum of the filter values and the input vectors, and repeat this process until we get the final output. The difference between classical filtering and 1-D convolution is quite small, in my opinion; here, the filter coefficients are estimated through back-propagation. Dilated convolutions apply the filters to the waveform in a similar way, but we apply the filters every, say, two samples; you may skip a number of samples like this so that we can capture a longer time span with the same number of filter parameters. Basically, by stacking those dilated convolutions in a repeated process for nonlinear autoregressive modeling, we get WaveNet. This is the famous figure from the WaveNet paper. Of course, WaveNet has a few more details, like gated activations and other tricks; if you want to know more, you can look at the appendix slides, where we have some additional information. The key concept of WaveNet is nonlinear autoregressive modeling through dilated convolutions. So far, we have explained one type of input, which is C. Since this is a neural network, we may use other types of input, and those inputs may carry different information from the local conditional features we have explained so far. Those are useful for training WaveNet for, for instance, TTS. For WaveNet TTS, we use speaker and language labels as global, time-invariant features, and we input linguistic embedding features plus F0; then we predict the speech waveform.
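To make the dilated causal convolution idea a bit more concrete, here is a minimal PyTorch sketch of a stack of such layers and of how the receptive field grows; it is a toy, not a full WaveNet (no gates, no skip connections, no conditioning):

```python
import torch

# A stack of causal dilated 1-D convolutions with kernel size 2 and dilations 1, 2, 4, ...
layers = torch.nn.ModuleList(
    [torch.nn.Conv1d(1, 1, kernel_size=2, dilation=2 ** i) for i in range(10)]
)

x = torch.randn(1, 1, 16000)          # [batch, channels, samples]
h = x
for i, conv in enumerate(layers):
    h = torch.nn.functional.pad(h, (2 ** i, 0))   # left-padding keeps the layer causal
    h = torch.tanh(conv(h))                        # gated activations omitted for brevity

# Receptive field: 1 + sum over layers of (kernel_size - 1) * dilation
print(1 + sum(2 ** i for i in range(10)))          # -> 1024 past samples
```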
For WaveNet vocoders, on the other hand, we input acoustic features. Sometimes we also input a speaker embedding vector, s in this figure, but it seems that the acoustic features are usually sufficient. As you already studied yesterday, WaveNet showed that this kind of model can produce high-quality speech waveforms, and it turned out to be much nicer than the previous vocoders at the time. But there is a tricky problem, which is slow inference. Since WaveNet predicts the waveform samples from left to right and uses the predicted samples in its nonlinear autoregressive modeling, the inference time is proportional to the waveform length, that is, to the duration of the waveform. So generation may become extremely slow if we want to generate a very long waveform sequence. This is impractical, because speech synthesis needs to generate speech waveforms in real time, as you know. Therefore, and I think you also studied this part, Parallel WaveNet introduced two new concepts, invertible neural networks and normalizing flows, to speed up the inference.

The key idea of invertible neural networks and normalizing flows is as follows. For training, we still want to use a powerful neural network, and we may spend a lot of time training a good model, but we want to somehow convert that neural network into an efficient form so that we can generate speech waveform samples efficiently. Again, the most time-consuming part is the autoregressive dependency, as you were taught, and this conversion therefore tries to remove the autoregressive dependency somehow. For training, we still use the autoregressive dependency; for inference, we transform the function, from f inverse to f, to remove the autoregressive dependency. In other words, at inference time we want to generate the speech signal from a white noise sequence, while at training time we wish to transform the waveform sequence O into a white noise sequence. But this is quite challenging, because a direct mapping from speech to white noise is, of course, possible but not easily learnable. So what we can do is apply such transformations gradually, repeating the operation many times, to remove the temporal correlations of speech step by step. As you can see in the figure from top to bottom, we apply a neural network f, which is invertible, to remove the temporal correlations gradually and bring the sequence closer to a white noise sequence generated from a Gaussian distribution. At inference time, we generate a white noise sequence, which is independent at each time step, so we don't need a left-to-right waveform generation. Then we apply the inverse of the trained functions to that white noise sequence so that we can gradually change it back into the speech waveform. This is the concept of normalizing flow. Since this type removes the autoregressive dependency at inference time, it is called an inverse autoregressive flow. This is quite nice because inference is much faster, no longer proportional to the waveform length, but I'm afraid that, on the other hand, training becomes extremely slow, because the autoregressive direction is now needed during training instead of at inference. Training takes longer and converges slowly, and therefore the paper proposed by DeepMind uses an additional trick, teacher-student training, to accelerate this slow autoregressive training.
Basically, they first train a normal WaveNet, and then they use this pre-trained WaveNet as the teacher to speed up the training of the student waveform model. The student model uses the inverse autoregressive flow, but it is slow to train, slower than the original WaveNet, so it requires more epochs. Even a normal WaveNet takes quite a long time to train; the student model takes forever. Therefore, through the teacher-student framework, they try to provide more information from the AR teacher, the AR WaveNet, to the non-AR student model. In the end, it becomes a very complex implementation. I don't know if anybody here has tried to implement Parallel WaveNet, but it is really complex, because it has the autoregressive flow, the teacher-student constraint, and also other types of constraints like STFT losses and so on. It is a nice framework, but it is quite complex, and many researchers claim that it is not easy to train because of this complex architecture. Therefore NVIDIA, and also Baidu, proposed more efficient models.

The tricky part was the inverse AR flow, which is powerful but quite slow to train. So, in WaveGlow and WaveFlow, they give up modeling the proper autoregressive dependencies of the entire sequence, and instead they model the dependencies within a frame, O1 to OT. They call this framing operation a squeezing operation, or some such funny name, but it is basically framing: we segment the speech waveform into many frames, each containing T waveform points, and the models learn the dependencies between such short speech segments and the noise sequence. I wrote a lot of text here, but the figure at the bottom is quite easy to understand. The left one shows the dependency, or relationship, between the waveform points and the latents modeled as white noise: instead of the entire waveform sequence, WaveFlow considers only eight waveform points, a quite short frame. The model decorrelates only those eight waveform points and transforms them into a white noise sequence that also has only eight points, and it repeats this operation many times so that we get a truly white noise sequence after decorrelation. WaveGlow has a slightly different dependency, shown in the right figure at the bottom: it models the dependency of half of the waveform points within the short frame and transforms that half into white noise. In the next operation, they swap the regions where the flow transform is applied, and repeat those operations many times. So these operations, as you can see, are quite simplified, meaning the transform has low capacity compared to nonlinear autoregressive models, and therefore we need to apply the decorrelation transform many times because of that low capacity. In the end it again becomes slow to train; WaveGlow takes too much time to train because of those low-capacity flows. It does not have the autoregressive assumption, but it has many layers for the nonlinear decorrelation of the speech waveform, so it is still slow to train. Therefore WaveFlow, proposed by Baidu, is a good compromise: it has better capacity than WaveGlow, and it also requires fewer flow transforms than WaveGlow. But there are more efficient ways to generate speech waveforms, namely the neural source-filter models and the glottal models, which we will explain in part two and part three.
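Before moving on, here is a toy sketch of the squeezing and affine-coupling idea just described; the frame size, the tiny linear coupling network, and all shapes are illustrative assumptions, nothing like the real WaveGlow or WaveFlow implementations:

```python
import torch

def squeeze(waveform, frame=8):
    """Reshape a waveform [batch, T] into frames [batch, T // frame, frame]."""
    b, t = waveform.shape
    return waveform[:, : t - t % frame].reshape(b, -1, frame)

def affine_coupling(frames, net):
    """Toy affine coupling: one half of each frame predicts a scale and shift for
    the other half, so the step is cheap to invert (WaveGlow-style, simplified)."""
    x_a, x_b = frames.chunk(2, dim=-1)        # split the 8 samples into 4 + 4
    log_s, t = net(x_a).chunk(2, dim=-1)      # 'net' can be any small network
    y_b = x_b * torch.exp(log_s) + t          # transform one half...
    return torch.cat([x_a, y_b], dim=-1)      # ...the other half passes through

frames = squeeze(torch.randn(2, 16000))
out = affine_coupling(frames, torch.nn.Linear(4, 8))
```

Because one half passes through unchanged, the step is trivially invertible: x_b = (y_b - t) * exp(-log_s).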
But before that, I want to quickly talk about another relevant topic, which is representation learning, simply because I like this topic: neural waveform models can also be used for another purpose, namely feature extraction. A good example of this is autoregressive predictive coding, proposed by James Glass's group. It has the same, or a similar, task to vocoders: from past waveform samples, they try to predict a future waveform sample, not the next sample, but the sample K steps ahead. As you can see from these equations, the neural network takes as input the samples from time 1 to time n, and using this input it predicts the sample K steps ahead. Of course, this K depends on the task and on the features you use. They propose this model for representation learning, meaning they are not interested in generating speech waveforms; instead, they are interested in using the hidden features learned from the data. They assume that a neural network trained in this way has to compress all the past information in order to predict the sample K steps ahead, and that the output of the intermediate layers must therefore be meaningful for other types of downstream tasks, like speech recognition. They are now investigating more sophisticated networks for learning meaningful representations from waveforms through autoregressive models; I wanted to mention that here. Another topic I will quickly mention is the vector-quantized variational autoencoder, VQ-VAE for short. This is also a quite nice model for representation learning. The motivation is similar to APC, the autoregressive predictive coding I mentioned earlier, but this model tries to extract features from the speech waveform and then further discretize those features. They then propose to use those discretized latents for other tasks. The model is trained in an autoencoder manner: after we extract the discrete latents, those latents are used as the input to a WaveNet to generate the speech waveform, and therefore the model can be trained without any labels. They assume that the learned discrete latents capture phonetic information automatically and are therefore quite useful for low-resource speech recognition or low-resource speech-to-speech translation. Recently, this model has also been applied to music, and it turned out to be quite useful for singing and music generation as well.

All right, so, Xin Wang, I probably should speak a little bit faster, right? Yes, do you need to take a break? I think this is also your part. I think we could carry on a little bit more if everyone is happy. No, no, it's okay. You can rest a little bit, but probably there are questions, you know. Guys, is there any question? Hello, yeah, I would have a quick one, if I may. Sure, of course. So, at some point you said that in these autoregressive models, linear and nonlinear ones, you can choose to model the coefficients or the residual, and this is mathematically equivalent, right? Yeah. But is this also computationally, or in a training sense, equivalent? I mean, is one of the two better? Xin Wang, can you go back to the slide where we explain the linear AR models? 26. Yeah, thank you very much. So basically we simply reformulate the equations using the new variable e, and then we try to predict e from the condition C. But I think this is more a modeling question, Xin Wang, so I will let Xin Wang give his opinion. Okay.
Yes, I think your question is about whether we predict the residual or predict the LPC coefficients, right? Yeah, so I also understand that, I mean, theoretically this is the same thing, right, you can get one from the other, but is it also practically the same thing? Is one better than the other in quality? It depends on the application. For example, for speech, the model shown here is mainly used for predicting other kinds of one-dimensional signals, such as the fundamental frequency of speech. In that case, the LPC coefficient part can be very simple: it can use the same coefficients at all time points, but the residual must be predicted from the linguistic features. For other kinds of applications, for example, what's the name, GlotDNN or GlotNet, it is more important to predict the LPC coefficients with the network; in that case, they also need to predict e, the excitation, from the input acoustic features. So it just depends on the application: sometimes we can directly use time-invariant coefficients a for all time steps, and sometimes we also need to predict a from the input features. It's hard to say which one is better; it really depends on the application. I also want to clarify a few points. In this model, we do not have two-step training, such as AR estimation followed by RNN training as in standard LPC models. The LPC-like coefficients a and the RNN parameters are trained jointly using back-propagation. On the other hand, LPCNet or GlotNet, which you will see later, estimate those two sets of parameters separately: typically, we estimate the LPC coefficients in the standard way, compute the LPC residuals in advance, and then fit another model. In this linear AR model, we train both sets of parameters simultaneously. Okay, I will now go back to part one. Yeah, thank you.

Since I want to preserve some time for Xin Wang to present his neural source-filter models, I will quickly go through this part, which is about non-neural source-filter models, including cepstral vocoders, mixed excitation, and STRAIGHT. They can be represented in a mathematical form like this: here, f is a filter, e is the excitation signal, and c is the conditional features. We should note that this excitation signal e is not simply white noise. Sorry, I cannot... ah, now it's good. The reason why I want to quickly explain this source-filter formulation is that the Parallel WaveNet or WaveGlow that I mentioned earlier is, in my opinion, unnatural or strange if we consider the physics of speech production behind it, because Parallel WaveNet and WaveGlow use only noise signals to generate the speech waveform. This is handy from an engineering point of view, but it is different from the physical mechanism of speech production. So in this short presentation, I will try to convince you why I feel such noise-excited speech generation is strange. The reason is that our speech production can be approximated in a source-filter manner like this, with two types of excitation signals: a pulse train and white noise. You have probably seen this figure before, because it is quite famous. In the source-filter theory, we prepare two types of excitation signals, mix them according to the voiced/unvoiced information, and then apply a filter defined by the spectrum envelope to generate synthetic speech.
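To make that classical pulse-plus-noise pipeline concrete, here is a minimal per-frame sketch; the frame length, gains, and the use of an LPC-style all-pole filter for the spectral envelope are illustrative assumptions rather than any specific vocoder's design:

```python
import numpy as np
from scipy.signal import lfilter

def classic_source_filter_frame(a, f0, voiced, fs=16000, frame_len=400):
    """Toy classical source-filter synthesis for one frame: pulse train (voiced)
    or white noise (unvoiced) excitation fed through an all-pole envelope filter."""
    if voiced:
        excitation = np.zeros(frame_len)
        excitation[::int(fs / f0)] = 1.0               # pulses every pitch period
    else:
        excitation = 0.1 * np.random.randn(frame_len)  # random turbulence
    return lfilter([1.0], np.concatenate(([1.0], -a)), excitation)
```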
But you have probably not seen why this form can be derived from the speech production point of view, so please let me quickly explain why. In order to do that, I need to define a few symbols: the frequency F and the wavelength lambda of one pitch wave. F multiplied by lambda equals c, the speed of sound, roughly 340 m per second. Now we consider fitting waves into a closed tube with a length of about 17.5 cm; the reason we consider this particular length is that it is roughly the average vocal tract length of humans. The fundamental wave that fits into this closed tube has, as you can see, a wavelength of about 0.7 m, which is four times the length of the tube. Therefore, we can say that the resonance of this closed tube is about 500 Hz. We can also compute the other resonances that fit into the tube analytically; you probably studied this kind of physics in high school. The other waves that fit into this closed tube have one-third or one-fifth of that wavelength, so their frequencies are 1500 Hz and 2500 Hz. In other words, the frequency response of the tube has the three resonances I just calculated. Of course, using a uniform closed tube as the human vocal tract is a simplifying assumption; in reality, we should use more appropriate vocal tract shapes and their frequency resonances. I just wanted to explain the derivation of the source-filter model.

The other thing we need, in order to generate sound, is a source of sound energy, which comes from the vibration of the vocal folds, the muscles you have around here. Those muscles repeatedly open and close, and the frequency of this opening and closing corresponds to the fundamental frequency. If you visualize the airflow coming from your lungs, it looks like this: when the vocal folds are closed, the airflow basically stops, and when they open, the airflow comes again. They also open gradually, so the flow is not a hard on/off; it has a gradual change, as in the left figure. If you visualize this airflow in the frequency domain, it typically looks like this: it has F0, the fundamental component, and its harmonics at multiples of F0. So the source-filter theory basically uses these two elements and combines them as an approximation of the spectrum of our voiced sound. In addition, we also need another type of source, the source for unvoiced sounds, where the vocal folds are not vibrating and the airflow is quite random turbulence, like this. In the source-filter model, we therefore make approximations like this: we approximate the vocal fold source by a pulse train, and the turbulence by random noise, and it becomes this format. It has not only white noise but also a pulse train, which contains the fundamental frequency and the harmonic structure representing the glottal spectrum, and those are passed through a filter that approximates the spectrum envelope. If you have used HMM speech synthesis, this framework has been used for a long time to generate speech from vocoder parameters, with a filter called MLSA. It works like this: it is a standard source-filter approach, it uses a pulse train and white noise, and it mixes those two sources to generate the excitation signal. Then the excitation is filtered using the mel-cepstrum. Strictly speaking, we would need to apply frequency warping back to the linear scale to get the linear cepstrum and then apply an inverse FFT, meaning the filtering would be done in the frequency domain.
The MLSA filter, on the other hand, skips those two steps through a Padé approximation. We can also consider slight variants, like mixed excitation, where instead of a binary selection between the pulse train and white noise, we use a weighted sum of the two excitation signals. You might also remember the vocoders called STRAIGHT and WORLD. They are also a type of source-filter vocoder, but they have additional tricks to remove audible artifacts. According to their claim, small changes of F0 at each frame cause audible artifacts. Please remember that our spectrum has two components: an envelope, defined by the formants or vocal tract resonances, plus a fine structure, defined by the harmonics. This means that if the F0 at each frame takes slightly different values, the harmonic positions shift slightly, as you can see from the red arrows. According to Professor Kawahara, he managed to remove the artifacts caused by these small F0 changes by applying additional smoothing to the spectra: he applied a time-frequency smoothing to cancel out such small movements of the harmonics in the STFT spectrum. This was suitable for HMMs or GMMs, but it is not necessarily the best for DNNs, because DNNs can use the raw data; according to our experiments, the simple STFT spectrum is sufficient to train the models. This is the end of my part one. Now I will hand over to Xin Wang to explain his neural source-filter models.

In the first part, Yamagishi-sensei introduced how the traditional source-filter model works. For the neural source-filter waveform model, the idea is inspired by those classical models. As you can see from these equations, we still use an excitation signal as an input to the model, but the difference is that we use a neural network, rather than conventional signal processing blocks, to convert the excitation into the waveform. Let me show how the model works with this picture. As I mentioned, the neural network itself is still a plain network without an autoregressive structure and without normalizing flows, so we just feed in the condition features, and what we want to get is the waveform samples. The difference, of course, is that we provide an additional excitation signal; for this particular model, we use a sine waveform, as you can see from the picture below. Given this kind of excitation signal, the network converts it into some kind of waveform. By looking at the speech waveform and the sine waveform in this picture, you can see how similar they are: we have a periodic structure in the speech waveform, and we have a periodic structure in the sine waveform. As long as we can generate a proper sine waveform with the correct fundamental frequency, or period, it is highly possible that we can convert this kind of sine waveform into a good-quality speech waveform. This is basically the simple idea of the neural source-filter waveform model. But let's look at how we can implement such a model. To implement it, we simply divide the model into three parts: one part works on the source, one part is the neural network itself, and another part processes the condition features. Based on this idea, we have three modules in the neural source-filter waveform models. Although we have proposed a few neural source-filter waveform models, the basic framework of all of them is shown in this picture. We have a condition module to process the input features.
We have the source module to generate the excitation signal from the fundamental frequency, and then we have the neural filter module to convert the excitation into the output waveform. Notice that the input acoustic features must include F0, because we need it to generate the excitation signal that carries the source information. The output of the neural source-filter waveform model is a float-valued waveform; this is different from WaveNet, whose output is a quantized waveform. I think this is important to know. Based on this structure, I'd like to briefly explain each of the modules. The first one is the condition module. This is quite simple. One purpose of this module is to do upsampling. As mentioned in the first part of the lecture, the input acoustic features are defined per frame, but we need to generate a waveform value at each sampling point. One simple way is to directly upsample the acoustic features by duplicating, or repeating, the value of each frame multiple times so that we get upsampled features. Of course, we can also use additional LSTM, RNN, or convolution layers to change the dimensions or to slightly transform the input features before we feed them into the neural filter module. This is the simple idea of the condition module. Of course, many of us know there are many other ways to do the upsampling, for example the so-called transposed 1-D convolution, or deconvolution; you can check those blocks, and I think you can also try different flavors of the condition module in the hands-on session. For our NSF models, we use this simple condition module to process the input acoustic features.

Given the input F0, we use the source module to produce the excitation signal; as I mentioned before, it is based on a sine waveform. To give an idea of how it works, suppose we are given an upsampled F0 curve like the one in this picture. What we can do is use the equation to generate the sine waveform. You can notice that there is the periodic part, but there is also the noise part. For the noise part, because we don't have F0 in those regions, like unvoiced sounds, we just use random noise. For the periodic part, as long as we know the F0, we can use a simple equation to produce a sine waveform that carries the F0. The equation may look slightly complicated, there are many elements in it, such as the sampling rate, the initial phase, and the noise, but you don't need to worry about that, because you can actually try this equation in the hands-on session; there is one chapter, or one notebook, for that. By using this equation, we can produce a sine waveform at the fundamental frequency, and we can use similar equations to produce harmonic overtones, simply by multiplying the F0 values; by doing this, we can generate multiple harmonics. After that, we simply use a feed-forward layer, as shown here, to merge all the harmonics, the fundamental component and the overtones, into the final excitation signal. Notice that this feed-forward layer is trained jointly with the rest of the network; it is not fixed, it is not pre-tuned. I will show a tiny sketch of this source module in a moment. Given the excitation signal, the next module is the neural filter module. For this initial model, we use multiple filter blocks to transform the input excitation into the output waveform; of course, in this case, they are neural filter blocks. This picture shows the structure of one block; we actually use the same architecture for all the blocks in the neural filter module.
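Here is the promised tiny sketch of the source module; the amplitude and noise constants are illustrative values, not the exact ones from the NSF papers:

```python
import numpy as np

def sine_excitation(f0_upsampled, fs=16000, amp=0.1, noise_std=0.003):
    """Toy NSF-style source: a sine with running phase where F0 > 0, plus noise;
    pure noise in unvoiced regions (F0 == 0). Overtones would use h * phase."""
    phase = 2.0 * np.pi * np.cumsum(f0_upsampled / fs)   # accumulated instantaneous phase
    harmonic = amp * np.sin(phase)
    noise = noise_std * np.random.randn(len(f0_upsampled))
    return np.where(f0_upsampled > 0, harmonic + noise, noise)
```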
Coming back to the filter block: if you look at the picture at the top of this slide, you can see how similar it is to what is used in, for example, WaveNet and ClariNet. We use dilated convolution, we use the gated activation function, and we use an affine transformation to map the input signal to the output signal. We also follow the conventions, for example using ten dilated 1-D convolution layers in each block, and we simply repeat this structure for all the filter blocks. Notice that the input to each filter block and the output of each filter block have the same dimension; it is just a one-dimensional signal, and the output of the final block is the waveform. I'd also like to mention that the processed spectral features given by the condition module are added to the output of the convolution layers before they are fed into the gated activation function; this is also the same as what is done in WaveNet and ClariNet.

Given the output waveform, how can we train the network? We need to define the training criterion. Of course, the simple way is to just use a mean squared error, like what we usually use for all kinds of neural networks on regression tasks. But does it work? I think the answer is no. I will play one audio sample, which was generated by such a network. I don't know whether you can hear it, but the sound is quite muffled and you cannot clearly understand what is being said. I think there is a clear reason for the artifacts, or the shortcomings, of this kind of training criterion, but I don't think I have time to explain it. A better way is to use this kind of spectral distance. There are three points I'd like to mention that differ from the mean squared error. The first is that we do not measure a distance per sampling point; we measure a distance over multiple sampling points in a frame. The second is that we do not calculate the distance in the time domain, but in the spectral domain. The third point is, as I will show later, that this kind of spectral distance works for any STFT configuration, so no matter how we change the configuration used to calculate it, we can always do back-propagation like this. As you can see from the blue curves, or blue arrows, the gradients can be computed after we calculate the spectral distance, and we can use them to train the network with standard back-propagation and gradient descent. So this is how the training criterion is defined for the neural source-filter model. Of course, in the implementation, we can leverage the short-time Fourier transform, because it can be computed quickly, and we can use the inverse short-time Fourier transform to calculate the gradient. I think nowadays everybody uses PyTorch or TensorFlow, so you may not care about this, but if you are dealing with CUDA code where you have to implement the back-propagation by hand, then you do have to care about how the gradient is calculated; this is how we did it in our implementation, using the short-time Fourier transform and the inverse transform. As I mentioned on the previous slide, this spectral distance works for any configuration, which means that when we calculate the distance between the natural and the generated waveform, we can use different STFT configurations: for example, we may change the frame length, the window type, or the number of FFT points.
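As an illustration, here is a minimal PyTorch sketch of such a multi-resolution log-spectral-amplitude distance; the FFT sizes and hop lengths are arbitrary illustrative choices, not the exact configurations from our papers:

```python
import torch

def multires_stft_loss(generated, target,
                       configs=((512, 128), (1024, 256), (2048, 512))):
    """Minimal multi-resolution log-spectral distance between two waveforms."""
    loss = 0.0
    for n_fft, hop in configs:
        window = torch.hann_window(n_fft, device=generated.device)
        spec_g = torch.stft(generated, n_fft, hop, window=window, return_complex=True)
        spec_t = torch.stft(target, n_fft, hop, window=window, return_complex=True)
        log_g = torch.log(spec_g.abs() ** 2 + 1e-7)
        log_t = torch.log(spec_t.abs() ** 2 + 1e-7)
        loss = loss + torch.mean((log_g - log_t) ** 2)   # sum the per-configuration losses
    return loss
```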
The point is that no matter which configurations we use, we can simply sum the distances together, and the gradient is just the sum of the gradients from each configuration, as you can see from this figure. This criterion is quite important: if we use only one single STFT configuration, we get artifacts. This kind of so-called multi-resolution training criterion has also been used by many other papers, as I will show later. So this is how we train the network, using the multi-resolution STFT criterion.

Up to now, I have briefly shown how we can implement the neural source-filter waveform model, but you may still wonder how it works. So on this slide, I extract the audio from inside the network, play it, and show the spectrogram, and you will see how the network gradually changes the input excitation into the output waveform. Here is the excitation signal. [The sample "P.R. obeys me when we are together." is played at several stages of the network.] I think from this demonstration you can see how the input excitation, which only contains the fundamental frequency, or pitch, information, is gradually changed by the filter blocks into the output waveform. Although the pictures shown here are spectrograms, remember that all the operations are done in the time domain; we directly generate the waveform. You can see how the neural filters add details to the spectrogram, that is, to the frequency content of the waveform, and our final output is the speech waveform of the target speaker.

Another interesting demonstration I'd like to show is here. Suppose we have trained the network and we feed in the input acoustic features, such as the mel spectrogram and also F0. What happens if we change the F0 input? Remember that in the old days, for example when we used STRAIGHT, once we changed the F0, the output waveform would also change accordingly; that is one advantage of the traditional source-filter waveform model and the STRAIGHT vocoder. Can we do the same with neural source-filter waveform models and with WaveNet? I will play the samples: first the original F0, then the F0 increased by a factor of two, then decreased, and finally with the F0 set to zero. [The sample "He was manifestly distressed by my coming." is played under each F0 condition, for the NSF model and for WaveNet.] From these samples of a male voice, you can probably perceive that the pitch in the waveform generated by the NSF model changes, while for WaveNet it does not seem to change much; even when we feed in a modified F0, the output speech seems to keep the same pitch as the original. Here I'd like to play one more case, for a female speaker. [The same comparison is played for the female voice.] From these samples, you probably get an impression of how different these two types of architectures, I mean NSF and WaveNet, can be. Of course, there are many questions we could ask based on these demonstrations, but let's first move on to the next slide.
Yes, about the generation speed. This might not be so interesting, because as you can imagine, since the NSF model does not use autoregression or the inverse AR flow, the generation speed can be much faster than WaveNet. So this is expected. But here, I'd just like to mention that if somebody happened to use the NSF model before 2020, or if you see the paper from our ICASSP submission, we have a memory-saving mode for the NSF where we dynamically allocate GPU memory. In the original implementation, the speed of that mode was quite slow. But since the beginning of this year, we have implemented a faster version of the memory-saving mode, so that we can generate a long waveform with less than one gigabyte of GPU memory, and the speed is still fast. This is the simple thing I want to mention from this figure, but probably it is not so interesting, because as long as we are using a non-autoregressive model, the speed can be fast. OK, up to this point, I have briefly explained the basic idea of the original neural source filter waveform model. After that, we made a lot of improvements or changes to the original model. For example, as shown here, we may use additional excitation signals, and we may improve the model to cover the reverberation effect in the audio data. So this is what we did for the improved neural source filter waveform models. To give you a brief review of the improvements: as you can see from this picture, in the original NSF we were using a single source and one single stream of neural filters to produce the speech waveform. But we know that the speech waveform has not only a periodic component but also a non-periodic component. So in order to make the NSF model more suitable for speech, we borrowed the idea from the harmonic-plus-noise structure. We derived two different types of models here: one uses fixed bands for the harmonic-plus-noise structure, and another one uses time-varying bands. I will show the details later, but this is the first point I'd like to mention. The second one is, although we are using the sine waveform as excitation, the question is whether a sine waveform is the best for speech production. To improve this point, we introduced a cyclic noise excitation, as I will show later in this part of the lecture. Finally, I'd like to mention that we also simplified the neural filter blocks. I will not explain the details here; you can find the slides in the appendix. OK, for the first part, we incorporate the harmonic-plus-noise structure into the neural source filter waveform model, as you can see from this picture. For this model, we not only have the sine source, we also use another noise source. We not only use a single stream of filters, we use two streams. The top stream receives the noise excitation and produces some kind of waveform. After we produce the two waveforms, we sum them together through conventional low-pass and high-pass filters, and the output will be the final waveform. Here is one example of how it works. This is the spectrogram of the waveforms generated by the two branches. The upper one is generated from the noise signal, and the bottom one is generated from the sine excitation. You can see the harmonic structure in the spectrogram at the bottom. Given these waveforms, we just pass them through the high-pass and low-pass filters to remove the unwanted frequencies, for example the high-frequency region for the signal at the bottom, and the low-frequency region for the signal at the top. After that, we can directly sum them together to get the output.
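Since both the original model and this harmonic branch start from a sine-based excitation derived from F0, here is a minimal sketch of such a source signal, roughly in the spirit of the NSF source module. The frame hop, noise levels, and voiced/unvoiced handling are simplified assumptions, not the exact recipe from the paper.

```python
import torch

def sine_excitation(f0_frames, frame_hop=80, sample_rate=16000,
                    voiced_noise_std=0.003, unvoiced_noise_std=0.1):
    # f0_frames: (num_frames,) F0 in Hz, 0 for unvoiced frames
    f0 = f0_frames.repeat_interleave(frame_hop)                 # upsample F0 to the waveform rate
    phase = 2.0 * torch.pi * torch.cumsum(f0 / sample_rate, dim=0)
    voiced = (f0 > 0).float()
    sine = torch.sin(phase) * voiced                            # sine only in voiced regions
    noise = voiced_noise_std * torch.randn_like(sine)
    return sine + noise + (1.0 - voiced) * unvoiced_noise_std * torch.randn_like(sine)

# toy F0 contour: 20 unvoiced frames, 60 voiced frames at 120 Hz, 20 unvoiced frames
f0 = torch.cat([torch.zeros(20), 120.0 * torch.ones(60), torch.zeros(20)])
print(sine_excitation(f0).shape)    # 100 frames * 80 samples per frame
```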
So this is how the harmonic-plus-noise structure works. I'd like to mention that although I show spectrograms here, all the waveforms are generated in the time domain. We don't directly generate spectrograms; they are only plotted for visualization. Now, somebody may wonder how we can incorporate conventional digital filters with neural networks, and whether we can still train the network through back-propagation. The answer is yes. To give you one example of how it works: once we get the gradient with respect to the output waveform, we can directly propagate the gradient back to the network through the digital filters. This is how it is done when we train the network as a whole. Of course, you may wonder how that works, so let me briefly explain. In this case, we are using finite impulse response filters, the so-called FIR digital filters. You can see the equations on the right side of the slide. When we calculate the output waveform, we just use this kind of weighted sum, where h denotes the coefficients of the filter. One example shown here: if we want to calculate the output value y at time n, we use the input values x_n, x_{n-1}, x_{n-2}, and we compute a weighted sum. So as you can see from this picture, the FIR digital filter is quite similar to a 1-D convolution layer. As we can imagine, we can also do back-propagation through the digital filters, just like what we do for 1-D convolution layers, as you can see from this picture. Once we get the gradients with respect to the output waveform, we do the so-called transposed 1-D convolution, and then we can collect the gradients with respect to the input. This is how we can do back-propagation through the digital filters in the neural network. OK. This is a picture I showed before. You may notice that when we do the filtering, the cutoff frequency of the filter is fixed. For example, in this case, we remove certain frequencies from the harmonic component, and the cutoff changes only between the voiced and unvoiced regions, as you can see from the black lines. But of course, these filters are fixed: when we train the network, we first give the filter coefficients to the network, and then we just let the network back-propagate the gradients and train the network as a whole. Of course, since we are using a neural network, one idea is: how can we change this kind of behavior? How can we predict the cutoff frequency, or the so-called maximum voiced frequency? The idea is like this. We add an additional block in the condition module to predict the cutoff frequency, or the maximum voiced frequency, from the input features. Given the predicted cutoff frequency, we can change the coefficients of the filters. In order to do that, we parameterize the filters as windowed sinc filters. I think not everybody is familiar with this kind of signal processing topic, but here I just show one example of how we can do it. Suppose we are given one number, the cutoff frequency. We simply generate a sinc function, then we multiply it with a Hamming window, and finally we do a simple gain normalization. The output of the final step will be the filter coefficients we use, or you can think of them as the weights of a 1-D convolution layer, if we treat the digital filter as a 1-D convolution layer. Of course, this process is also differentiable. We can propagate the gradients back, and we can use these gradients to train the condition module. So this is how we do it for the filters.
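A hedged sketch of this windowed-sinc design, and of the point above that FIR filtering is just a 1-D convolution through which gradients flow: generate the sinc, apply a Hamming window, normalize the gain, then filter a signal with torch conv1d and call backward. The filter order and the toy signal are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def windowed_sinc_lowpass(cutoff, num_taps=61):
    # cutoff: normalized cutoff frequency in (0, 0.5) cycles per sample
    n = torch.arange(num_taps, dtype=torch.float32) - (num_taps - 1) / 2.0
    n_safe = torch.where(n == 0, torch.ones_like(n), n)
    sinc = torch.sin(2.0 * torch.pi * cutoff * n_safe) / (torch.pi * n_safe)
    sinc = torch.where(n == 0, 2.0 * cutoff * torch.ones_like(n), sinc)   # value of the sinc at n = 0
    h = sinc * torch.hamming_window(num_taps, periodic=False)             # window the ideal response
    return h / h.sum()                                                    # gain normalization: unit gain at DC

x = torch.randn(1, 1, 400, requires_grad=True)          # toy input signal
cutoff = torch.tensor(0.1, requires_grad=True)          # could be predicted by the condition module
h = windowed_sinc_lowpass(cutoff)
# the filter is symmetric, so convolution and cross-correlation coincide here
y = F.conv1d(x, h.view(1, 1, -1), padding=30)           # FIR filtering as a 1-D convolution
y.pow(2).mean().backward()
print(x.grad.shape, cutoff.grad)                        # gradients reach both the signal and the cutoff
```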
So in this case, the low-pass filter; we have a similar procedure for the high-pass filter. And we prepared hands-on materials so that you can see exactly how we implement this in PyTorch or Python code. Here is one example of what happens if we change the input cutoff frequency. The left column shows the filter coefficients in the time domain, and the right column shows the frequency response of the filters. As you can see from the black, the red, and the blue lines, once we increase the cutoff frequency, the cutoff frequency of the filter also changes. So this is how we manage to parameterize the cutoff frequency of the filters and accordingly change the behavior of the filter based on the input cutoff frequency. Again, you can see these examples in the hands-on session. Given this kind of implementation, we can finally get this kind of behavior from the network. As you can see from this picture, the black line in the spectrogram denotes the predicted cutoff frequency from the acoustic features. As you can see, it is not fixed within the voiced or unvoiced regions; it gradually changes as the acoustic features change. This is what we expected, because we want to make everything trainable. So that is the idea of the windowed sinc filters. Up to now, whether it is the harmonic-plus-noise structure or the predicted cutoff frequency, these improvements are about the filter part of the neural source filter waveform model. The next question is: how can we improve the source part? As I mentioned in the previous part of this lecture, we are using a sine waveform as the excitation. But we know that a sine may not be appropriate, because in speech not every sound is purely periodic and not every sound is purely random. Here are some samples for different types of phonation when we pronounce a sound: the modal voice, the breathy voice, the falsetto voice, the creaky voice. A is the waveform and B is the residual from the LPC analysis. From this picture, I think we can get the impression that not all kinds of sounds look like a sine waveform. Sometimes the waveform can be quite noisy, and sometimes it has some kind of pulse-train-like shape. To deal with these different types of speech waveform shapes, we may need a more appropriate excitation signal. That is the motivation to use the cyclic noise. From this picture, you can see how we implement the cyclic noise excitation. The equation may look complicated, but let me show how we do it with an example. The first step is to generate Gaussian noise, as you can see from the black line in the picture. After that, we feed in a beta parameter and use an exponentially decaying function to change the shape of the noise. The output will be the blue curve, as you can see from the picture below: the amplitude of the noise gradually decreases along time. Meanwhile, we generate a pulse train based on the input F0, as you can see from this picture. After that, we do a simple convolution, a convolution in the conventional sense: we convolve the pulse train with the decaying noise, so that we get the final output shown in the picture. From this picture, you can see why it is called cyclic noise: it is a noise sequence within a short time span, but when we look at the waveform over the whole time span, we can see that the noise is repeated according to the fundamental frequency of the input features.
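Here is a hedged sketch of these generation steps, including the beta parameter discussed next: Gaussian noise shaped by an exponentially decaying envelope, then convolved with an F0-derived pulse train. The exact parameterization of beta, the decay length, and the pulse-train construction are illustrative assumptions rather than the formulation used in the paper.

```python
import torch
import torch.nn.functional as F

def pulse_train(f0_frames, frame_hop=80, sample_rate=16000):
    f0 = f0_frames.repeat_interleave(frame_hop)
    phase = torch.cumsum(f0 / sample_rate, dim=0)
    # place a unit pulse each time the accumulated phase crosses an integer
    pulses = (phase.floor().diff(prepend=torch.zeros(1)) > 0).float()
    return pulses * (f0 > 0).float()

def cyclic_noise_excitation(f0_frames, beta=0.1, decay_len=160):
    # smaller beta -> faster decay -> excitation closer to a pulse train
    envelope = torch.exp(-torch.arange(decay_len, dtype=torch.float32) / (beta * decay_len))
    kernel = (torch.randn(decay_len) * envelope).flip(0).view(1, 1, -1)   # decaying noise segment
    pulses = pulse_train(f0_frames).view(1, 1, -1)
    exc = F.conv1d(F.pad(pulses, (decay_len - 1, 0)), kernel)             # repeat the noise at the F0 rate
    return exc.view(-1)

f0 = torch.cat([torch.zeros(20), 120.0 * torch.ones(60), torch.zeros(20)])
print(cyclic_noise_excitation(f0).shape)
```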
Coming back to this picture, you can also see there is a parameter called beta, which controls how fast the noise decays within each period, or each local time span. Of course, we can change the beta parameter like this. If we decrease beta, the noise decays faster and the excitation becomes closer to a pulse-train-like excitation. Of course, we can also increase beta; in that case, the output will be more noisy. So this is how we can use the beta parameter to change the behavior of the excitation. Of course, when we train the network, we can treat beta as a hyperparameter: we just fix the value and let the network use this kind of cyclic noise to produce the output waveform. But since we are using a neural network, the question is whether we can predict this beta parameter from the input acoustic features. And again, the answer is yes. We can simply add another block in the condition module to predict the beta parameter from the acoustic features, and then feed it into the decay module to change the shape of the noise. The good thing is that this kind of network is still differentiable: we can calculate the gradients with respect to the beta parameter and use these gradients to train the condition module. This is how we train the whole network using back-propagation. So although these ideas are based on signal processing, we can implement them in the neural network and train the network using conventional back-propagation and stochastic gradient descent. OK, having explained so many variants of the neural source filter waveform models, here I'd like to play some samples so you can get an idea of how they sound for different speakers. I will play from the natural sample at the top to the bottom, from left to right. He was manifestly distressed by my coming. He was manifestly distressed by my coming. He was manifestly distressed by my coming. He was manifestly distressed by my coming. He was manifestly distressed by my coming. Now, female speakers. He was manifestly distressed by my coming. He was manifestly distressed by my coming. He was manifestly distressed by my coming. He was manifestly distressed by my coming. He was manifestly distressed by my coming. If you want to listen to more samples, you can visit our website on GitHub. But at least I hope you can perceive the difference, especially when we switch from the single excitation to the harmonic-plus-noise structure: especially on the unvoiced sounds, you can hear how the model reduces the buzzy sound in the unvoiced regions compared with the samples from the baseline NSF model. So up to now, I have explained how we can improve the NSF models to better model the speech waveform. But for other kinds of applications, we can also combine the NSF model with other signal processing ideas. One example is reverberation modeling. As we know, if people are using text-to-speech corpora, the speech quality is usually quite high, because the speech is recorded in a good environment. However, when we deploy the model in the user's environment, when users use cell phones or other digital devices, the acoustic conditions may be much worse. For example, one problem is reverberation, which means the sound is reflected or deflected in the room before it reaches the ear of the listener or the microphone, which results in a complicated impulse response of the room. The idea is: how can we deal with this kind of reverberation effect?
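Before the solution on the next slide, it may help to note that reverberation itself is just a linear convolution of the dry waveform with the room impulse response, and that this operation stays differentiable. A minimal sketch, with a toy random impulse response standing in for a predicted one:

```python
import torch

def apply_reverb(dry, rir):
    # dry: (time,), rir: (rir_len,); linear convolution implemented via the FFT
    n = dry.numel() + rir.numel() - 1
    wet = torch.fft.irfft(torch.fft.rfft(dry, n) * torch.fft.rfft(rir, n), n)
    return wet[:dry.numel()]

dry = torch.randn(16000, requires_grad=True)
rir = torch.exp(-torch.arange(4000.0) / 800.0) * torch.randn(4000)   # toy decaying impulse response
wet = apply_reverb(dry, rir)
wet.pow(2).mean().backward()    # gradients reach the dry waveform (and could reach a predicted RIR)
```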
Of course, one simple idea is this: suppose the input features contain this kind of reverberation information; then we can add some layers in the condition module to predict the room impulse response that represents the reverberation of the room. After that, we use the conventional approach, simply convolving the output waveform from the original NSF with that impulse response to obtain the reverberated waveform, so that we can simulate the reverberation effect of the room. It might be hard to explain how well it works, so let me just quickly play some examples. First is natural speech with room reverberation. Ask her to bring these things with her from the store. And then the dry waveform, I mean, before we add the reverberation effect. Ask her to bring these things with her from the store. And then after we add the reverberation effect. Ask her to bring these things with her from the store. Of course, the quality is somewhat limited because of the noisy and reverberant audio data in the training set, but at least you can perceive how the reverberation effect is added to the output waveform once we add this kind of module. Here is another example for a female speaker. Six spoons of fresh snow peas, five thick slabs of blue cheese, and maybe a snack for her brother Bob. Six spoons of fresh snow peas, five thick slabs of blue cheese, and maybe a snack for her brother Bob. That is the simple idea of the reverberation part. To briefly summarize what we have covered in this part: we first introduced the baseline NSF model, which was inspired by the idea of the conventional source filter waveform modeling approach. After that, for better speech waveform modeling, we introduced the harmonic-plus-noise structure. Meanwhile, we introduced the cyclic noise excitation to improve the quality of the speech waveform, especially with respect to different types of phonation. Of course, there are other kinds of improvements, which are basically application-oriented. For example, we can add the reverberation module to deal with the room reverberation effect, and we can also adapt the baseline NSF model for music applications; this is something I'd like to mention in the later part of this lecture. But to briefly summarize what we can get from the NSF model: I think one strong point of the NSF model is its flexibility, because we can plug in so many different types of signal or speech processing algorithms, and we can still do the back-propagation in the neural network. The limitation is that the NSF model is not as strong as other kinds of probabilistic models for waveform modeling. You can find the details in the appendix. I cannot explain it here, but this is one reason why, even if we add more data, the quality from the NSF model may not be competitive with the AR models. But of course, the question is whether the non-autoregressive, non-flow models are doomed, whether we can only get this kind of quality from the NSF model or related models. Well, the answer is no. So this is the reason for the next part, the GAN-based approaches. As Yamagishi-sensei mentioned in the previous part of this lecture, these GAN-based approaches also use white noise as the excitation. This goes against how we understand the speech production process of human beings, but they actually work. Yes, as I mentioned before, for the neural source filter waveform models, we are using the sine waveform or the cyclic noise as the excitation.
So what happens if we directly use white noise as the excitation in a non-AR, non-flow-based approach? Here I'd like to play one sound. He was manifestly distressed by my coming. This is the same sound I played before, where we set the F0 to zero. As you can hear from the sample, the speech is somewhat intelligible, but the sound seems to be harsh; it does not sound like human speech. Of course, this goes against the speech production theory from the conventional literature. But as we see from recent work, as long as we use better networks or different types of networks, such as generative adversarial networks, we can do some kind of speech production from white noise. One example is Parallel WaveGAN, which I think was not mentioned yesterday. This kind of network is very similar to the NSF model: it still uses dilated convolutions to convert the input into the generated waveform, and it also uses the multi-resolution spectral distance as one part of the training criterion. But one difference is here: instead of only using the short-time Fourier transform distance, Parallel WaveGAN also adds the discriminator of the GAN-based approach to judge whether the generated waveform is real or fake, and then they train the network as a whole. From the audio samples in the paper, at least I can say the quality is quite high. It proves that we can generate the speech waveform from random noise, as long as we use different types of networks rather than the simple non-autoregressive, non-flow networks. The second example is MelGAN, which was covered in the lecture yesterday. But here I'd like to mention one point which I think was not covered yesterday. In MelGAN, one interesting thing about the discriminator is that they use not just one discriminator but multiple discriminators. What's more, the input to each discriminator is not necessarily the original waveform. As you can see from this figure, MelGAN also uses downsampled waveforms as inputs to the discriminators. I think this could be useful, because once we do the downsampling, what we can see in the spectral domain is different from the original waveform. So this might give different resolutions, or different evidence, for the discriminators to judge whether the generated waveform can be distinguished from the natural waveform. I think this is one interesting point about the MelGAN approach. OK, again, why does this kind of approach work, although it goes against the speech production theory? I think one important thing we need to notice is this. This is a picture that Yamagishi-sensei showed at the beginning of this lecture, where we derive the mel-spectrogram from the input spectrogram. As you can see, most of the work uses the mel-spectrogram as input, but this kind of acoustic feature actually contains F0 information, as you can see from the feature on the left side of the figure: this is the harmonic structure, which exactly encodes the F0 information. In other words, I think what this kind of GAN-based approach does can be interpreted in this way: although they don't explicitly provide an excitation signal to the model, the model can learn this kind of excitation from the input acoustic features. I think this could be one explanation of why the GAN-based approaches can produce the speech waveform from random noise.
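To make the two points above concrete, here is a hedged sketch that adds a least-squares adversarial term to the spectral loss, with each discriminator receiving a differently downsampled waveform. The discriminator modules, the reuse of the multi-resolution STFT loss sketched earlier, the pooling configuration, and the weighting are all assumptions for illustration, not the actual Parallel WaveGAN or MelGAN recipes.

```python
import torch
import torch.nn.functional as F

def multi_scale_inputs(wav, num_scales=3):
    # wav: (batch, 1, time); each scale halves the sampling rate by average pooling
    scales = [wav]
    for _ in range(num_scales - 1):
        scales.append(F.avg_pool1d(scales[-1], kernel_size=4, stride=2, padding=1))
    return scales

def generator_loss(generated, natural, discriminators, stft_loss_fn, adv_weight=4.0):
    # stft_loss_fn could be the multi_resolution_stft_loss sketched earlier
    loss = stft_loss_fn(generated.squeeze(1), natural.squeeze(1))
    for d, x in zip(discriminators, multi_scale_inputs(generated)):
        score = d(x)                                                  # discriminator judges real vs. fake
        loss = loss + adv_weight * ((score - 1.0) ** 2).mean()       # least-squares adversarial term
    return loss
```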
Of course, there is an open question here: whether we can build neural vocoders without using any excitation signal at all. I don't think I have an answer to this question; it is quite open and we can discuss it later. But this is the part on the non-autoregressive and non-flow-based approaches. I think the next part is the glottal vocoder part. Good, thank you very much. Since I want to keep some time for music processing, I will quickly explain this part and then hand back to Xin Wang soon. Right, so far we have seen neural autoregressive models and the source filter models. Another type of vocoder which you may study is the glottal vocoder family, where basically we combine autoregressive models with another parameterized glottal model, E, as you can see from the figures. In the previous AR models, we used a residual signal E, but E is now parameterized by another function. Before we go to neural glottal models, I will quickly explain the non-neural ones, because they also have a long history. First of all, why do we want to use such a parameterized excitation function? The answer is simple: as Xin Wang already explained, the excitation signal is not a simple pulse and it is not simple noise; it is a complex function. Also, we sometimes wish to control voice qualities like breathy voice, creaky voice, and tense voice, and this is especially important in singing, for instance. So there is a bunch of research on how to define good and controllable functions for the excitation E. The most famous one is the LF model. The idea is basically to fit and model the glottal flow better than a simple pulse. If we compute the derivative of the glottal flow, we get the kind of shape you can see at the bottom left. So the LF model basically assumes a shape of the excitation signal like this, which is represented by six parameters. Out of the six parameters, two of them are not free parameters, so in real applications we need to estimate four model parameters. Later work also proposed further simplified LF models that have two parameters. Another well-known glottal modeling approach is iterative adaptive inverse filtering, where we fit another set of LPC coefficients for the glottal spectrum. So instead of assuming a special shape of the glottal flow in the time domain, we may estimate LPC-like equations, but with low-order models, for modeling and controlling the glottal spectral envelope. Since we are starting to run out of time, I will skip the explanation of how to estimate the glottal spectrum, but you can refer to these nice papers. Part 3.2 is on neural glottal models. As you can easily imagine, we can replace such non-neural glottal functions with neural ones, which include GlotDNN, GlotNet, and GELP. This also overlaps with LPCNet tomorrow. So the idea is to replace the glottal function E with a neural one. Why do we want to model the glottal waveform instead of the acoustic waveform? I already mentioned several reasons, but I want to mention one more, which is that the glottal waveform is much simpler than the acoustic waveform and therefore easier to model: if you apply LPC analysis and compute the residuals, the residual waveform is actually much simpler, much less complex than the acoustic waveform, and therefore it is easier to train a neural network on it. That is another motivation to use neural glottal models. So we compute the residuals and train the models on them.
At generation time, we infer the glottal waveform and then generate the original waveform back through LPC synthesis. The AR models which we explained in the earlier part are somewhat similar to glottal models, but more proper glottal vocoders were proposed by Paavo Alku's group. The first one, called GlotDNN, uses a feed-forward or another type of DNN to model the glottal function. For conditional features, they use a spectral envelope, which can be derived from acoustic parameters such as formants, fundamental frequency, and so on. GlotNet, on the other hand, uses autoregressive models similar to WaveNet: instead of acoustic waveforms, they fit a WaveNet-like architecture to the glottal residuals, so the glottal waveforms. That means GlotNet uses past samples of the glottal waveform to predict the next waveform sample. I think the concept of LPCNet or ExcitNet is also similar to this in my understanding; further details will be given tomorrow. What I want to mention at the end, in the last part of this section, is the GELP vocoder, which is quite similar to Parallel WaveGAN in many senses, because it uses non-autoregressive models: it does not have any autoregressive dependency, and it also has a multi-resolution STFT loss and discriminators to train the model. But it first applies AR modeling to compute residuals, and those residual excitation signals are estimated from the conditional features c. In other words, this is an end-to-end model, quite efficient and also easy to train, because we remove the spectral envelope through the AR part, and therefore the residual signal, or excitation signal, is easier to model. So this is the summary of the many vocoders, or the many pieces of jargon, we have talked about today. In our opinion, there are four categories: the neural autoregressive family, the flow-based family, the combination of linear AR and glottal models, and then the non-autoregressive, non-flow models. The AR ones include WaveNet, WaveRNN, Amazon's universal neural vocoder, and Adobe's FFTNet. Flow models include DeepMind's Parallel WaveNet, NVIDIA's WaveGlow, Baidu's ClariNet, and another vocoder from Baidu called WaveFlow. I already explained the glottal family, which includes LPCNet and GlotNet. The non-autoregressive, non-flow models actually have two types. One of them uses excitation information explicitly, including NSF, our vocoders, and also similar vocoders such as HiNet, which uses NSF for phase modeling; I will not speak more about this. Another type of model in this family is the GAN-based models, where the excitation signal is implicitly used: Parallel WaveGAN, MelGAN, and also a new one called HiFi-GAN from Adobe, which basically combines the concepts of Parallel WaveGAN and MelGAN, as far as I understand. OK, so I understand that our lecture time has become only three hours, but I want Xin Wang to quickly speak about music processing. Yes, compared with the lengthy lecture about the speech part, I think the music part could be more interesting. Oh, it's okay, I'll just click play. Okay, so the music part, I think, will be interesting. As you can read from the news in recent years, there are many big companies working on music generation, for example Google Magenta and the OpenAI work. Yes, this is a kind of fancy topic, but I'd also like to mention that for music information processing, there are actually many different kinds of subtasks, and you can find these topics in the references. For this part, I only focus on music instrument audio generation.
So this task is quite similar to what we face in speech waveform generation. Due to this similarity, let me introduce music audio generation by comparing it with its speech counterpart. In terms of applications, as you know, we can use vocoders in a TTS model to generate the waveform from acoustic features. There is a similar application in music waveform generation, as shown in this picture. We can feed in some kind of piano roll, which specifies the notes and the strength with which to hit the keys, and then we can use an acoustic model to generate acoustic features and use something like a vocoder to produce the waveform for the music. So this is one kind of similarity in terms of application: converting symbolic values into the waveform. Of course, because the piano roll is quite simple compared with text, I think there are also approaches which directly convert the piano roll into the waveform, but this is also an idea similar to what the original WaveNet paper talked about, just converting symbolic values into the waveform. So these are similar applications. Of course, we can give further examples, like voice conversion, where we feed in one waveform and want to change the speaker identity in that waveform. In the case of music audio generation, this task is usually called timbre conversion or timbre interpolation. The approach is slightly different, as you can see from this picture: they use different encoders to extract latent features, and the authors do the interpolation in the latent space, so that they can generate a music waveform that sounds like something in between instrument one and instrument two, or they can create a new instrument by interpolating between the two instruments. So this is the similarity between speech and music audio generation in terms of applications. As you can see from the right part of this figure, we can use a vocoder to generate the waveform. But of course, the question is whether we can use a similar vocoder for both speech and music waveform generation. I think the answer is somewhat positive, because from the point of view of the excitation, source filter models can to some extent be applied to music waveform generation. Let me give one example. This is a picture which you have seen in the previous part of this lecture. For the human speaker, we have the sound source and we have the vocal tract that modulates the excitation into the speech waveform. We can actually draw a similar figure for some of the instrument families, for example for the clarinet: the mouthpiece can be regarded as a source which produces some kind of air flow into the tube, and then the tube modulates this air flow into the output waveform. So this is one case where maybe we can apply the source filter idea. But remember, there are many different types of instruments. This was a woodwind instrument; how about the string family? In this case, we can think of the bow and the string as the source, and the resonant chamber of the violin as the filter. But this is an ideal case. How about the last case, where we analyze a xylophone? In that case, what is the filter? What is the source? It is hard to explain using the conventional approaches from speech production theory.
But at least from this figure, we can understand how different the various kinds of instruments can be in terms of their waveform production mechanisms. Due to the similarities and also the differences in the waveform production mechanisms, we can see some similarities and differences in the waveforms, especially in the spectral domain. In the first row, you can see the spectrograms for four speakers, and in the second row, you can see the spectrograms of waveforms from four types of instruments. I think we can see some kind of periodic or harmonic structure in both types of waveforms. But of course, we can also see differences, for example, whether we have something like consonants in music; this is quite an interesting question to ask. At least this is one difference we can observe by comparing the two rows of spectrograms. Of course, even among the instruments, we can see differences. For example, you can see from the horn, the clarinet, and the violin that the harmonic range of the produced sound is completely different. This is due to the physical properties of the instruments. There is also another issue, which is the polyphonic or monophonic property of the notes or the sound produced by the instrument, for example by the piano or the violin. This is fundamentally different from speech, where we normally can only use one source; we are only monophonic, but that is not the case for instruments. So by showing these similarities and differences, we can see the possibility of using speech vocoders for music waveform generation, but we should also expect that the quality may be limited if we just directly use what we had for speech generation. Based on this idea, I think many people, especially in the big companies, have tried to use neural vocoders for music waveform generation. For example, we can use WaveNet and also WaveRNN for music generation. We can use the neural source filter waveform model and the related model called the differentiable DSP model, DDSP. And we can also use the GAN-based approach for waveform generation. So it is quite interesting how the data-driven approach can handle the difference between speech waveforms and music waveforms. Here I'd like to give some case studies, one by one, to briefly introduce how we can use these kinds of vocoders for music generation. The first case is WaveNet. This is a paper by Google where they derive this kind of encoding and decoding structure for waveform generation. The application of this work is mainly to create new instruments. Suppose we are given one instrument, for example the bass or the flute; after we apply the encoder, we can interpolate the latent features and then use WaveNet to generate the waveform. If you listen to the samples on the website, you can notice how the sound changes when we gradually change the interpolation weights. So I think this is one strong point of using neural vocoders for waveform generation. This also creates new applications and devices like these. With this device, I think if you point your finger at the screen and move your fingers around, it is just like moving through the manifold of the latent space. So by pointing to different places on the screen of the device, you can generate sounds with different timbres. I think this is one interesting application which we have not tried for speech generation.
But this is good for music, because we always want something different from what we have in the data. So this is the first application. For the second application, WaveNet can also be used in so-called Wave-to-MIDI and MIDI-to-Wave systems. You can consider this kind of pipeline as the counterpart of a spoken dialogue processing system. The first part would be music transcription, like the ASR system: we convert the audio into symbols, into notes. After that, they have this kind of music language model to model the symbolic music. Finally, they can use the piano roll as the condition and use WaveNet to generate the waveform from the piano roll. So this is a simple idea, but it actually enables quite new applications in terms of music generation. As you can see from this figure, one application is to generate music in a coherent way. Once they train the network in part two, the language model, the so-called Music Transformer, they can generate a new piano roll from the transformer without any condition, and then they can produce the waveform. Another way is to give a primer, some kind of condition; the language model can then produce a piano roll which is consistent with the primer in terms of style or other aspects of the music, and then they can produce the waveform. So this is the kind of application which I think is interesting. We never think about a talking head which talks forever, but people do appreciate it if we have some kind of machine that can generate beautiful music forever. So I think this is one interesting application of WaveNet. The third case study uses a GAN for music waveform generation, but I think the message we can learn from this paper is slightly different from the first two cases. The message from this paper is that it is quite difficult to generate good-quality music waveforms directly from GAN-based models. Instead, they have to use traditional approaches to generate the spectral amplitude and the phase in order to create the waveform. At least from this work, I think we can get the message that if we want to use a GAN, we have to think about something different, instead of directly generating the waveform in the time domain. So this is the reason I'd like to mention GANSynth in this place. But of course, the final case would be the neural source filter waveform model. The motivation to use a neural source filter waveform model for music generation is quite related to what we can observe from the music waveform. Here is one example from the piano sound at different notes. As you can see from the picture on the right side, the piano sound is quite periodic and it can be approximated by a sine waveform. This might be good news, because the neural source filter waveform model uses a sine waveform as the excitation. So what happens if we directly use this network to generate music waveforms, using the sine excitation and one stream of neural filters? For this, we did a lot of experiments to train an instrument-independent NSF model. We also tried different corpora, but for this experiment we used a multi-instrument corpus called URMP. This is quite different from what they used for the NSynth or the Wave2Midi2Wave work, but it gives an idea of how it works. In the experiments, we tried three different scenarios. The first is whether we can train the model from scratch on the music data, although it is quite small, five hours.
The second approach is, once we have trained a speech waveform model, I mean the NSF model, on the speech data, whether we can directly use it to generate the waveform, given the acoustic features of the music. And the final one is to fine-tune the speech waveform model on the music data. Here are the MOS scores in terms of the quality of the generated instrument sounds. The different rows correspond to different instruments, and we also compare different models, WaveNet and WaveGlow. But here I'd like to mention one result: if we compare the training-from-scratch results and the zero-shot results, we can see how different the results are. If we directly use a speech model to generate the music waveform, it does not work. Instead, we have to use the music data in order to train the model. But of course, there is similarity between music and speech waveforms, which is why, if we fine-tune the model on the music data, we can get better performance like this. Sometimes the quality can even be rated better than the natural audio in the database. Here, I'd like to play some samples from this work. The first sound is the natural one, and then the generated sound from the NSF model. So from this sample, we can perceive that the quality is quite close to the natural sound. Of course, there are artifacts in the sound, but at least we see the potential of using neural source filter waveform models for instrument-independent modeling. Of course, we are also carrying on further work using the NSF model for music modeling. For example, one case is the piano sound. As I mentioned before, the piano sound is polyphonic. Currently, we don't have a good idea of how to generate a sine excitation for multiple simultaneous notes. For this work, we simply use noise as the excitation, and it turns out to work reasonably well. I'd like to play two samples, first the natural one, then the generated one. Now the generated sound from the NSF model. So yes, this work is still ongoing, because we have to explain why it works; probably it is because we are feeding so much data to the model. But it is quite interesting to see how we can produce polyphonic sound from noise for the piano. The final example is about room reverberation. In the previous part of this lecture, we showed how it behaves for speech; here, I'd like to show how it works for music. I'd like to first play the sound coming from the network before we add the reverberation effect, and then after we add the reverberation. You can probably perceive that the dry sound really sounds like playing the violin in an open space, whereas the reverberant sound is like playing the violin in a concert hall. So this is the reverberation effect, which is good for improving the perception of the music. Having introduced so many different types of networks for music applications, you can probably understand how interesting this topic can be, because we can create so many different kinds of applications for entertainment or interactive performance. Of course, there are some crazy applications, such as the never-ending death metal AI, which just uses some kind of language model to produce death metal music. Of course, somebody may be anxious about that. But for us, I think we have more concrete things to worry about.
For example, whether we can use a single model to better model the differences across instruments, whether we just need big data plus a large autoregressive model instead of building an interpretable model, and finally whether we can do better for polyphonic music waveform generation, which is quite different from speech. So that's it for the music part, and we come to the conclusion. I think, Yamagishi-sensei, could you? OK, actually, I didn't plan that I would do that. Yeah, so thank you very much for staying with us for three hours. As you saw, in this course we overviewed the high-level concepts and the visual concepts of non-neural and neural vocoders, which have a clear and close relationship with each other. In particular, we overviewed three types of vocoders, three families: AR, source filter, and glottal types, and their pros and cons. We also mentioned the close relationship between speech processing and machine learning. What I am a little bit worried about is the disjoint speech and music communities. I think those two communities need to be somehow merged, as you could see from the parallel development of speech and music vocoders, which are quite similar. Right, in addition to that, we taught you some basic knowledge of signal processing, like reverberation or the source filter theory. So I hope you enjoyed this course. After this lecture, I think George, or maybe we, will upload the hands-on sessions, which include the NSF code, the Jupyter notebooks, and so on. We also included pre-trained models, so I think you can enjoy generating speech and music if you want. Xin Wang, could you quickly tell them what they can do with those hands-on sessions? Yes, for the hands-on sessions, there are three parts, as shown here. The first part is the Jupyter notebooks, which you can run on your laptop without using a GPU. They basically show how each basic block of the NSF model works, so you can play with the code there. The second part is a demonstration of the pre-trained models, as Yamagishi-sensei mentioned, for speech and music generation. In that case, you can still use a laptop: you load the pre-trained model and you will see how the signal is generated by the NSF model and how it is gradually changed by the filter part. The final part is the training scripts. In that case, you have to download the GitHub repository, which is mentioned in the README of the hands-on session material, and you can just run the commands to train an NSF model, or actually multiple NSF models, on the CMU Arctic database. So I hope you enjoy the hands-on session later. Yeah, good. Thank you very much. I also want to mention that if you download the presentation slides, you can see an additional 60 slides, because we dropped the technical and mathematical parts. Although Xin Wang and I like those mathematical parts too, I thought it would be too much. So if you want to know more about the technical or mathematical details, please have a look at those appendix parts; they will be useful for a deeper understanding of these waveform models. Right, Xin Wang, could you go to the next page? I don't remember what we have. All right, just references. Yeah, just references, OK? I think you can have a look. And then I want to thank these people. I borrowed some of the slides from Simon King. I also want to thank Nauri and Heiga, who contributed to parts of this course behind the scenes.
Right, so thank you very much. Any questions or comments? Junichi, Xin, really, thank you very much for this nice presentation, which includes, as you said, the signal processing knowledge and the neural networks, and shows how we can use the models for speech and music audio. Yeah, if there are questions: first of all, there are some participants from the US and they are not able, I guess, to participate online right now. They will send questions through the Slack channel for day two, and other people can send questions later on. But please, now it's your opportunity. OK, can I ask? Yes, go ahead. OK, thank you very much for the presentation. I am actually a little bit concerned about the F0 condition in the NSF, because you use a separate, well, it's not really separate, it's jointly trained, but there is a separate module that is conditioned only on the F0, right? So I'm just concerned about the mismatch of the F0, like an F0 that is unseen in the training data: how will it perform? I mean, in slide 82, I guess you have shown the modification of the F0, right? But again, if you modify the F0, then naturally the spectral envelope should also change, right? So it would probably be fairer to use speakers with F0 outside the range of the training data, so that the spectrum is matched. And again, in this case, it is a mel-spectrogram, right? The mel-spectrogram already contains the F0, and you perform F0 modification, but I don't know about the condition of the mel-spectrogram. Yes, I think when you mention mismatch, there are two types of mismatch here. One is the F0 range: whether we have seen that F0 range in the training data or not. The other is the mismatch of the F0 within the input features. I think the picture shown here, my answer, addresses the second case, where there is a mismatch in terms of F0 information within the input acoustic features. At least from this figure, you can see that even if we keep the mel-spectrogram the same, we can still get some kind of output waveform after we modify the F0. But of course, this is not an ideal condition. The ideal condition would be to use acoustic features such as mel-cepstral coefficients, which do not contain the F0 information, so that this kind of mismatch in the acoustic features does not happen when we feed the features to the network. But as you may realize, this kind of mismatch actually happens in many places, as long as people don't care about what kind of features they use. For example, people may just like the mel-spectrogram and just feed it in. So the question is, again, once we have this mismatch in the acoustic features, how can we control the F0? One way, of course, is to change the mel-spectrogram. Another way is to keep the mel-spectrogram the same, but provide different ways to change the F0. I think this is one potential application of the NSF model for that. For the first case, where the F0 range is different, I don't have a clear answer, because, I know you are working on voice conversion, we have never tried that. But at least I can say that, for normal human beings, as long as we have a large data corpus with different speakers, hopefully we can cover most of the F0 range for all types of speakers. I don't know, yes. I also have a somewhat different view on this. So if you don't mind, could you go to slide 197? 197? That's not the one. You can use, like, this one. Yeah, thank you.
So yeah, Patrick said that he has some concerns about using F0 as a part of the input to the neural vocoders. Actually, I don't have any such concerns, because in the case of the NSF, it is clearly separated. I mean, you use F0 on both sides, but you have a clearly separate F0 part. So yeah, let me explain my other interpretation. Please assume we have deep-learning-based encoders to extract F0, like in the figure in the middle. Then those latents are further input to the decoders that generate the speech waveform. So in my opinion, NSF is somehow similar to autoencoder-like architectures, where the latents are explicitly, or in a supervised way, trained. We have flow-based neural vocoders, we also have GAN-based vocoders. So another interpretation is that using F0 or other types of latents is actually an indication that we are heading towards VAE-like vocoder architectures, in my opinion. But anyway, this is another view. So if you want to make your neural vocoders robust to F0 ranges outside those observed in the training data, you have to make your generators or decoders robust to such unseen ranges, which is also a typical issue for autoencoders, and there are tricks to train VAEs robustly. So anyway, there are some solutions to make our neural vocoders more robust, although we haven't tested them yet. Yes, one final thing to mention is that, well, if you think there is a mismatch in the acoustic features, for the second case, you can add a simple network to predict F0 from, for example, the mel-spectrogram. In that case, there will be no explicit F0 input. I think it could be used for your application, I guess. OK, I mean, that's fine, because in the WaveNet case, it all produces similar sounds; I think that in the case of WaveNet, it somehow does not really take into account the F0 condition, because the mel-spectrogram is probably more dominant. And in your case, because you have separate modules that control F0, the sounds are a little bit different, even though the spectrogram also contains the F0. Yes, yes, right. From this example, WaveNet does not listen to the input F0, I think, yes. But instead, we can do that because, well, we use a sine waveform as a strong prior on what kind of waveform we can get. We can talk about this later. But I think, yes, I agree with that: WaveNet does not listen to the input F0. I mean, if you use a spectral envelope without F0 information, it might give different results, I'm not sure. But anyway, thank you very much. Yeah, I agree. I think it depends on what kind of acoustic features you use. If you use liftering, if you use signal processing to remove pitch information from the spectral envelope properly, probably you can control the spectral envelope and the F0 in a more separated way. But as long as you input the mel-spectrogram, as you saw from our slides, the low-frequency part of the mel-spectrogram contains F0 information. So there will be some mismatch between the F0 information included in the mel-spectrogram and the F0 given separately, and you need to handle that mismatch and conflict somehow. Yeah, I understand the point that in the NSF, because of the possibility of controlling the F0 separately, even in the very basic NSF, this kind of mismatch can be heard in the source, even though the acoustic features still contain it. Yeah, thank you.