All right, so guys, let's start. Let's talk about how to do synthesis, non-stationary speech synthesis, right? So the presentation's outline is this. First we're going to give a short piece of history of vocoding. We're going to focus then on vocoders for TTS: we're going to discuss synthesis, quality, and our system. Then we're going to discuss a little bit about the speech signal, which was mostly covered by Yannis; I'm going to give my perspective as well. And then we're going to go into the main part, which is Vocaine: give an overview, present a model of speech, actually, because that's what it is — there's an underlying model of the speech waveform. So what is this? Then discuss pitch-synchronous framing, spectral assembling, the deterministic plus stochastic model, and some special sort of splines that I use to synthesize. Then how to synthesize sounds that are not exactly aperiodic and not exactly periodic, like voiced fricatives, using a coherent noise modulation model. And then we're going to talk about how to do fast synthesis. And yes. So I gathered all these publications here, where you can see that vocoders — models of the speech signal, really — go back a long way, starting in 1791 with von Kempelen. That was actually a kind of vocoder where the vocal tract was a tube made out of leather. You could put your hand inside, squeeze it a little bit in several locations, and when you managed to fit exactly the configuration of an A and then passed air through it, you synthesized the A. Then going on to the Euphonia in 1846, and a mechanical talker in 1937. The electrical era found us with the Voder from Bell Labs and then with Fant, you all know him, in the 1950s in Sweden, where he had a two-formant plot, F1 and F2, and by moving these artificial resonances around he was synthesizing some vowels. Then in the computer era the main application was effectively speech coding, for military purposes, where you want an ultra-low or very low bit rate — that's how you see it, like 2.4 kilobits per second for LPC-10; the multiband excitation vocoder, which you all know — you all know the modern version of it, which is STRAIGHT; and the sinusoidal transform coder, which is a split-band sinusoidal, multiband-excitation-style vocoder. Then in speech synthesis there was renewed interest when unit selection was very hot in the 1990s, when Yannis tried to use his harmonic plus noise model at Bell Labs to synthesize units, effectively. His goal was to be able to do pitch modifications at very high quality, and voice morphing. Then when statistical parametric speech synthesis came, most people started using STRAIGHT, because it was available and because of its very high quality as well. And then later on, in 2013, Erro also released his executable, his implementation of the harmonic plus noise model, which is also available for you guys to try out. So what is a vocoder from the TTS perspective, right? It is an analysis-synthesis algorithm. Vocoders are tools we use to parametrize the speech signal in a way that is amenable to coding and statistical manipulation. We want an acoustic space of the speech signal that we can treat with any model; that's what we use a vocoder for. We take the speech signal, we do the analysis, and we obtain some acoustic parameters.
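To make this analysis-synthesis view concrete, here is a minimal sketch of the interface a vocoder exposes to the rest of a TTS system. The names (`analyze`, `synthesize`, the parameter fields) are mine for illustration; they are not Vocaine's actual API, just the shape of the contract described above.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class AcousticFrame:
    """One frame of vocoder parameters (illustrative fields only)."""
    f0: float                 # fundamental frequency in Hz (0 for unvoiced)
    envelope: np.ndarray      # spectral envelope, e.g. mel-cepstrum or log spectrum
    aperiodicity: np.ndarray  # per-band noise-to-harmonic ratio in [0, 1]

def analyze(waveform: np.ndarray, sample_rate: int) -> list[AcousticFrame]:
    """Analysis: map a waveform to a sequence of acoustic parameters.
    Any analysis method can sit here (STRAIGHT-like, HNM, ...)."""
    raise NotImplementedError  # placeholder: analysis is decoupled from synthesis

def synthesize(frames: list[AcousticFrame], sample_rate: int) -> np.ndarray:
    """Synthesis: map acoustic parameters back to a waveform.
    This is the role the vocoder plays inside a statistical TTS pipeline."""
    raise NotImplementedError  # placeholder
```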
The typical pipeline that we use in statistical parametric speech synthesis is that we go from the text, we extract some linguistic features, some parameters that describe the text itself. Then we use a statistical mapping that maps these features into some PDFs. From the PDFs we extract one realization, one sequence, one trajectory of acoustic parameters, to which we then apply a post-filtering mechanism to enhance it somehow, and we do vocoding. In many cases you will see these two steps done simultaneously using an LSTM, but conceptually this is a very good separation, because it delineates the difference between capturing a conditional PDF of the acoustics given the linguistics, and, out of all possible realizations that follow this PDF, picking one. So you can use any method you like; the best methods are the LSTMs right now, and a bunch of the work that pioneered the field was done on exactly this. So what is a vocoder, what is the purpose of a vocoder in a text-to-speech system? Effectively, we're trying to replace the mechanics of speech production. If you see the neural network here as capturing the relationship between text and acoustic parameters, the vocoder is the part that replaces the vocal system. So conceptually it has a different job. When you want to utter some speech you have some sort of inner representation of knowledge, some sort of intention, some content, some context as well. Then you send messages to your vocal system, and your vocal system, with whatever is available, tries to convert the message to speech. And this mechanism, as Yannis noted at the beginning, is trained from a very early age; because you hear yourself, there is a neural mirroring mechanism, a feedback mechanism, that allows you to use your vocal system to utter almost any possible sound that we find in speech. We evaluate quality, and vocoder quality, using an MOS naturalness scale. There are many ways of evaluating; this is the one we use, because we mainly want to capture whether people feel they are talking to an actual human. So naturalness is one of the criteria that is very important for us. The scale goes from one to five, with one being bad and five being excellent: one means completely unnatural speech, two mostly unnatural, three equally natural and unnatural, four mostly natural, and five completely natural speech. Quality below about 2.9 is almost unmarketable; you wouldn't like to listen to such a synthesizer, you wouldn't like to use it, you would find another way of doing your job. Quality above 4.5 I call imaginary quality, because if the test is well constructed and well calibrated you should not be seeing those numbers. We get fluctuations around it, up to 4.6 or down to 4.2, depending on how the human actually articulates speech as well, and on whether naturalness correlates with how pleasant the speech is: you ask a question about naturalness, but you also get an opinion about the actual voice, the quality and pleasantness of the recording. The saturation effect has to do with the fact that we have a scale with fixed endpoints, and when people are asked to vote, the maximum they can give is five, so averages saturate around 4.5.
You see the same thing with stars in restaurant ratings: many times you see 4.5 as a rating, especially when there are many, many samples and nobody is tweaking the ratings. The same saturation happens here as well. This is audiophile quality, this is telephony quality — the one we were used to when we communicated over mobiles. And before Vocaine came, that's where we were: we had statistical systems that were kind of entering the unmarketable-quality era. (Do you hear me back there? Now you hear me back there? Do you consistently hear me back there? I'll take that as a no.) So the unit selection system was rated around 3.7, 3.8, and we clearly had to do something, particularly for this domain. At Google we use the statistical synthesizers for embedded synthesis, whenever the mobile cannot connect to the internet at a sufficiently high speed; so it's a fallback. We also have to use it for accessibility purposes, because we have to utter very fast speech, ultra fast, and people depend a lot on the latency as well. So the Android TTS quality was 3.2 before, and the embedded one, a bespoke vocoder we could ship in a product, fast enough, was about 3.5. The best vocoder we had was 3.7. And this was the upper bound of quality in statistical synthesis, right? You can't do better than your vocoder when you are doing statistical synthesis. So, are you sure you don't need a better vocoder? Well, I guess we do; that's what we're showing right now. And when we started, there was a 0.5 MOS gap between the current system and the state of the art. This gap, 0.5 MOS, is really huge, and we use this in a bunch of applications. Let's talk a little bit about the speech signal now. When we are doing waveform modeling, especially for the speech signal, we incorporate into our models implicit or explicit assumptions. We may pick up principles from the speech production system, from models that we know, like the source-filter model of how speech production occurs; or we may pick up principles from the ear, from auditory models. So there's an interplay there. That's what I call the two pillars: two sources of justification for any design choice in a waveform model or algorithm. As long as you base it on these two, you are doing well. So one of the principles we take from the ear, for example, is the mel scale: we model the mel cepstrum, right? This principle originates from the fact that the ear responds on a scale that is closer to mel. Also amplitude compression, which is likewise related to how the ear listens. And phase coherence is related to modulation detectors inside the ear. From the mouth, we use the concept of the vocal tract, the concept of the nasal tract, aspiration; we know about glottal excitation — you've seen all this already. So when you construct a vocoder, you are picking up ideas from here and there and trying to put them together in one system that has high quality.
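As a small concrete aside on the ear-side principle just mentioned: the mel scale compresses frequency roughly logarithmically above a few hundred hertz. A common formula (one of several conventions in the literature, not necessarily the one used in any particular vocoder) is:

```python
import numpy as np

def hz_to_mel(f_hz: float) -> float:
    """O'Shaughnessy-style mel scale: roughly linear below ~700 Hz, logarithmic above."""
    return 2595.0 * np.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

# e.g. 1000 Hz is about 1000 mel, while 8000 Hz is only about 2840 mel
```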
There are three ways of seeing the speech signal. I call them dichotomies, in the sense that they guide our perception of what the speech signal is. The first is the source-filter dichotomy, which originates from speech production. There we assume a periodic source, of course, then a vocal tract and a radiation impedance, and we get a sampling of the spectrum at the harmonic frequencies: we sample the vocal tract at the harmonic frequencies. Yannis already talked about this, so I'm not going to stress it any more. Then we have other dichotomies, like deterministic versus stochastic. We discussed recovering sinusoids, but what about recovering which parts of the signal are noise-like — in a voiced fricative, for example — and which parts are deterministic, are not noise? In STRAIGHT terminology, we have the spectral envelope, like this, and we have an aperiodicity envelope, and the difference between the two is like a signal-to-noise ratio for every band, for every frequency. So if you're familiar with STRAIGHT, this difference here is effectively the amount of noise you have between your harmonic and where the noise floor starts: this is the noise level, this is the harmonic at this location. And why do we use this sort of per-band noise, aperiodicity, or non-deterministic model? Because a lot of phenomena happen there: we have aspiration, we have frication that happens at constrictions, we have voiced fricatives as well, and we have the fact that the vocal folds do not close completely in many cases — incomplete closures of the vocal folds, which means that air passes through afterwards. Another very famous dichotomy that people carry in their minds comes from the FFT: amplitude and phase. When you are talking in terms of amplitude and phase, you're effectively talking in terms of the FFT, right? Amplitude and phase are constructs of the FFT, effectively, almost. Seen from this perspective, the speech signal is also an amplitude and a phase, a sum of modulated cosines. You've seen this model already, you've talked about it quite a lot. And there's a question there: if you have only the amplitude, what is a good model for the phase? This is one of the questions that Vocaine answers, to a certain extent. So the overview of Vocaine is this. The functional overview is that it has high spectral resolution: when you have some parameters and you synthesize the signal back, the parameters you get out are very close to what you gave as input. Because the process of synthesis — overlap-add itself, for example — does not guarantee that you get back exactly the sinusoid's amplitude and the sinusoid's frequency that you gave for synthesis. There's no complexity penalty for the high resolution. We have a decoupling between the spectral parameterization and the DSP implementation. Many of you who have synthesized using, for example, a filter for the mel cepstrum, or LSPs, or whatever spectral envelope model you may have used, know there's a coupling between the synthesis method and the actual parameterization of the spectral envelope. Vocaine breaks this: you may have a mel cepstrum, you may have LSPs, and you are not forced to synthesize using an autoregressive model or a mel-cepstral filter and so on.
So this decoupling allows you to explore parameterizations — we never really found a better parameterization, but it at least lets you explore which parameterization is best for what you're trying to do. Then you have an asynchronous phase model. The asynchronous phase model is quite interesting because it allows you to blend vocoded speech with recorded speech. If you want to do a statistical-plus-unit hybrid, where you have an actual chunk of audio and then in the middle a chunk of audio synthesized artificially from a model, you want to make sure you don't have huge phase mismatches, and this asynchronous phase model allows you to do that. Of course it goes ultra-wideband, up to 48 kilohertz, and it's a universal decoder, effectively, because you can use whatever model — HNM, the adaptive quasi-harmonic model, STRAIGHT, multi-band excitation — whatever you want to synthesize, you can get it back. So think of it as a synthesis algorithm. High quality: in a fair experiment it compares with STRAIGHT and can beat it, yes. Low computational complexity, yes. Single instruction, multiple data (SIMD) friendly — I already did that, so yes. And it's very simple, which is very good for maintenance, especially when you have engineers together with software engineers, and software engineers together with researchers: you want people to be able to navigate through your code. All right, so the speech model of Vocaine is effectively one equation. It describes the whole speech signal in one equation. You don't have different ways of synthesizing, say, the noise part and the deterministic part which you then glue together somehow, dealing with spill-over of noise from the harmonics to the non-harmonics, from the noise to the non-noise and so on. You don't have any of these problems, because it is all described in this one equation. This is the cosine, right? This is the phase — we'll talk about how to construct a nice phase, because the whole trick is there, how to generate the noise; there's a trick we do, and the trick is exactly in these phases. Then this is a model for voiced fricatives — we're going to talk about it — and this is the instantaneous amplitude of the kth harmonic. And this is the first harmonic. All the harmonics except the first are described by this part here, and the first harmonic — if you notice, the phase of the first harmonic — is used to modulate the rest of the signal. If you remember what Yannis showed you about the time envelope of the noise in speech: that time envelope exhibited a particular pitch-synchronous behavior. Where the pitch marks were, at the glottal closure instants, the noise level fell, and the time envelope had this shape. Vocaine does this explicitly by taking the first harmonic and using it as a model of the time behavior of the aspirated components, the non-periodic components in speech.
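Since the slide with the equation is not reproduced here, the following is a hedged reconstruction from the verbal description above — a sum of modulated cosines, where a fixed term plus an aperiodicity-weighted cosine of the first harmonic modulates every harmonic above the first. The symbols are mine; the exact notation in the Vocaine paper may differ.

\[
s(n) \;=\; a_1(n)\,\cos\theta_1(n) \;+\; \sum_{k \ge 2} a_k(n)\,\bigl[\gamma_0 + \rho_k \cos\theta_1(n)\bigr]\cos\theta_k(n)
\]

Here \(a_k(n)\) is the instantaneous amplitude of the k-th harmonic, \(\theta_k(n)\) its instantaneous phase, \(\rho_k\) a per-harmonic aperiodicity, and \(\gamma_0\) a fixed term; the bracketed factor is the coherent noise modulation driven by the first harmonic's phase, discussed later.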
Let's talk about pitch-synchronous framing. So we've seen the equation; there are several components we haven't gone into in depth yet, but the most important part to understand at this stage is how we do the pitch-synchronous framing — the asynchronous speech model, as I called it. In Vocaine there's no difference between voiced speech and unvoiced speech: they're all synthesized with exactly the same functions. What happens is that there's a common concept of what I call reference synthesis instants, which in voiced speech are the glottal closure instants and in unvoiced speech are pitch marks placed regularly, every 10 milliseconds. We start at time zero, we start from here, and we synthesize one pitch period — it might be voiced, it might be unvoiced; these are RSIs, reference synthesis instants, they are not necessarily glottal closure instants. We compute the pitch period, or the synthesis period to be more exact, and we say: this is the frame where I'm expecting the parameters at the end. So I'm expecting the parameters here; I know the parameters of speech at the beginning, I know the parameters of speech at the end, and I'm interpolating between frames two and three. And I synthesize this pitch period completely independently of the previous one and the next one, as long as I keep some continuity and smoothness constraints here. Once I have synthesized this pitch period, I know this frame and where it ends at T1, and I can add the next pitch period — or synthesis period, if it's unvoiced — and then I go to frame seven. So Vocaine tends to throw away all the frames in between, and that does not impact quality except in the case of super-fast speech; in that case it does use them, but that makes the presentation much more complicated, so I will skip it. So now we can assume we are talking about synthesis of one very particular case: we have the beginning of the pitch period, we have the end of the pitch period, we have the parameters at the beginning and at the end, and we want to synthesize what is in the middle. What we have to do is assemble our spectral parameterization, our spectral envelope, at the locations of the harmonics. We have a spectral envelope, which looks something like this, and we have the harmonics here, here, here, here, and we want to sample the envelope at them. It is very easy and very cheap computationally — that's why we do it in the mel-cepstrum domain, and you can also do it in the LSP domain directly using this formula if you like; the references are in the paper. But in general you don't have to tie your spectral parameterization to the signal processing used for synthesis: you could just give an FFT — and in fact Palavira is doing that — FFT log spectra, whatever you have, you just give it, and you can assemble it.
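As a concrete illustration of this assembly step, here is a minimal sketch that samples a log-amplitude spectral envelope at the harmonics of a given f0. I assume the envelope is simply a discretized log-magnitude curve and the sampling is linear interpolation; Vocaine's actual mel-cepstrum and LSP formulas are in the paper and are not reproduced here.

```python
import numpy as np

def harmonic_amplitudes(log_env: np.ndarray, sample_rate: int, f0: float) -> np.ndarray:
    """Sample a log-amplitude envelope (bins from 0 Hz to Nyquist) at k*f0.

    log_env: log-magnitude envelope, uniformly spaced from 0 to sample_rate/2.
    Returns linear amplitudes a_k for k = 1..K, with K harmonics below Nyquist.
    """
    nyquist = sample_rate / 2.0
    freqs = np.linspace(0.0, nyquist, num=len(log_env))
    n_harm = int(nyquist // f0)
    harm_freqs = f0 * np.arange(1, n_harm + 1)
    # interpolate the envelope at the harmonic frequencies
    log_amps = np.interp(harm_freqs, freqs, log_env)
    return np.exp(log_amps)

# usage sketch: a flat log envelope sampled at the harmonics of 120 Hz
amps = harmonic_amplitudes(np.full(257, -3.0), sample_rate=16000, f0=120.0)
```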
And then we have the phase. We discussed how to recover the amplitudes at the beginning and at the end of the synthesis period, which might be a pitch period or might be completely unvoiced. How do we construct the phase so that we get simultaneously the deterministic behavior and the stochastic behavior of the phase? Because, for example, above eight kilohertz the signal may be completely noise, a harmonic may be completely noisy, or you may want to synthesize, as we said, an unvoiced part of speech, where you have to put in some phase that generates noise. Traditionally we just put in random phases; so it's quite clear what we do in completely aperiodic, noisy speech, and what we do in fully deterministic speech is also very clear. What is not clear is what we do in the in-between cases, where the signal-to-noise ratio is, let's say, 10 dB. At this location here, we want the periodic part of this harmonic, the periodic part of the signal, to be 10 dB higher — actually more, it's like 30 dB higher — than the noise level, and this has to be done while synthesizing one deterministic component. And I'm going to show you how I do it. So now let's talk about the stochastic part. We get the explicitly provided phases from a phase envelope: you may have a phase envelope that you have trained externally somehow, you may have constructed it in any way, or you may give a fixed one. Why would you use a fixed one? So that you don't have to worry much about it, mainly because of speed; but I have used minimum phase, maximum phase, mixed phase — you can use whatever. The other thing that matters here is that the phases of the unvoiced spectra are uniformly distributed between minus pi and pi; between those two points, what you want in order to synthesize completely unvoiced speech is very easy. The voiced phase spectra, now, are a sum-of-sines excitation with some phase dispersion; I'll say exactly how this happens. Now, assume that you want to synthesize a pulse: a sum-of-sines pulse with phase dispersion, where the first harmonic is completely coherent with the reference synthesis instant — which means that if you have the reference synthesis instant at this location, the first harmonic will go like this — and the rest of the harmonics are offset by a dispersion drawn from a uniform distribution. This is the deterministic phase spectrum. It is fixed: we compute it once at the beginning of the utterance and we keep it all the way. Then, when we want to synthesize speech with a particular aperiodicity, we effectively multiply the level of the noise inside the phase: we add noise to the phase itself as a function of the aperiodicity. Which means that if the speech is completely aperiodic... (yes, let me see — you're right, yes.) So: when the aperiodicity is zero, the function of the aperiodicity is zero, and this phase, phi_k, becomes c_k, becomes deterministic. When the aperiodicity is one, which means totally aperiodic, this noise component dominates over the deterministic one. So the more aperiodic the signal is, the more noisy the phase at the harmonic is. (Question: so instead of adding noise...? You mean the phi here, right?) First of all, this one is the component that comes out of our phase model. It may be sampled from the signal itself: we may have the waveform, we may compute the phase envelope, there are ways to do that. And then you want to add to this phase as much noise as your component is aperiodic. If your component here is completely deterministic rather than aperiodic, you would like to keep only this part, only that, right? If it is noisy, you would like to make it as noisy as needed, so you add this component here. This gives you a way of introducing aperiodicity without affecting the amplitude of the harmonic.
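The following is a minimal sketch of that idea under my own assumptions about the details: a fixed dispersion phase c_k per harmonic, plus uniform phase noise scaled by some function of the per-harmonic aperiodicity (here simply the aperiodicity itself). It illustrates the mechanism, not Vocaine's exact formula.

```python
import numpy as np

rng = np.random.default_rng(0)

def dispersion_phases(n_harm: int) -> np.ndarray:
    """Fixed deterministic phase spectrum c_k, drawn once per utterance.
    The first harmonic stays coherent with the reference synthesis instant (phase 0)."""
    c = rng.uniform(-np.pi, np.pi, size=n_harm)
    c[0] = 0.0
    return c

def harmonic_phases(c: np.ndarray, aperiodicity: np.ndarray) -> np.ndarray:
    """Phase at a synthesis instant: deterministic part plus aperiodicity-scaled noise.

    aperiodicity[k] in [0, 1]: 0 -> purely deterministic phase, 1 -> fully random phase.
    The amplitude of the harmonic is untouched; only its phase gets noisier.
    """
    noise = rng.uniform(-np.pi, np.pi, size=len(c))
    return c + aperiodicity * noise

# usage sketch: 40 harmonics, lower ones nearly periodic, upper ones nearly noise
c = dispersion_phases(40)
ap = np.linspace(0.05, 0.95, 40)
phi = harmonic_phases(c, ap)
```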
So you add noise without touching the amplitude of the harmonic: the instantaneous amplitude of the harmonic is exactly what you intended. That's why I call it high resolution. How do I do that now? We now have the following: the instantaneous amplitude at the beginning of the harmonic, the instantaneous amplitude at the end of the harmonic; we've seen ways to compute the phase of the harmonic at the beginning and ways to compute the phase of the harmonic at the end; and we're looking for a way to construct the signal in between. For the amplitude we can do a simple linear interpolation, so that's easy. But for the phases it's not as easy as it seems. The cubic phase model that Yannis described, the McAulay-Quatieri model — going back to the previous slide — is not going to give us a structured way of breaking the harmonicity of the signal. That is the key point. The cubic phase model takes this form: you are given the phase and frequency at the beginning and at the end, and in between the phase can go any way it likes as long as it is cubic. You have no way of controlling how harmonic the components are within the period. If you look across many different harmonics — the first, the second, the third — one harmonic may take all of its change in this part, another may take all of its change in that part, and that is something that introduces glitches from time to time. So I needed a different model in order to be able to synthesize noise with it, and that was the quadratic model — sorry, the quadratic phase splines. In the quadratic phase splines, the synthesis period is split into two halves: the first half up to here, this bit, and the second half here. This corresponds to a piecewise linear frequency model: the instantaneous frequency is piecewise linear. And we have continuity at the center of the pitch period, which is chosen to maximize smoothness; it's very fast as well. So effectively we have one quadratic phase function for the first half and another for the second half. As you can see, this is the phase, and its derivative is the instantaneous frequency: within a half, the instantaneous phase at time n is theta_{k,s} + omega_{k,s}·n + gamma_{k,s}·n², and the instantaneous frequency is omega_{k,s} + 2·gamma_{k,s}·n. So from the beginning to the middle of the pitch period we have one linear frequency function, and then another linear function: one line and another line. And we have a phase-unwrapping integer, which tells us how many multiples of 2π we have gone around the unit circle in the sinusoid. If we now solve these quadratic phase splines to obtain the parameters, we have to solve a linear system with seven constraints. Remember: we have the instantaneous phase at the beginning of the pitch period. We have the frequency at the beginning of the pitch period. We have the phase at the end of the pitch period. We have the frequency at the end of the pitch period.
We have continuity at the middle of the pitch period, at n_c: continuity both in phase and in frequency, so a phase continuity constraint and a frequency continuity constraint at the middle of the pitch period. And then we have a smoothness constraint that says the curve we reconstruct should be maximally smooth, which means the integral of the squared second derivative of the phase is minimized; by minimizing this term we effectively recover the 2π unwrapping multiple. So we started from how many unknowns? One, two, three, four, five, six, seven unknowns; and we have seven equations that we can solve for the quadratic phase spline. Here is an example of it. This is a pitch period, or actually a synthesis period, that goes from zero to 160 samples, and we added several levels of noise. If we have no noise, you can see it goes straight — this is the frequency, right, not the phase — and this is the phase of the same curves, so it's linear phase here, which means fixed frequency. If we add noise, the endpoint frequencies remain the same at 100, but the midpoint frequency goes to 125. So effectively, the more noise we add to the phase at the harmonic, the more the center frequency — the frequency at the center of the pitch period — deviates from the frequencies at the endpoints. You start from a value, let's say 100, and end at a value of 100, and this deviation delta here is related to your level of noise, to the aperiodicity. This is omega_k at the end, this is omega_k at the beginning, and omega_k at the middle is related to omega_k at the end, omega_k at the beginning, and to the aperiodicity. So what do you do with that? You break harmonicity: the signal becomes more and more non-harmonic. That's how you generate noise without having an explicit noise model: you have a set of sinusoids and you can construct noise with them. Effectively it gives you these advantages: the sinusoidal tracks are not harmonically related, which is another way to control aperiodicity, in the sense that you can correlate it directly to the signal-to-noise ratio; sinusoids are guaranteed to be harmonic only at the pitch marks, at the endpoints of the synthesis period, so any discontinuities will be located at particular locations inside the pitch period; and harmonicity breaks according to the noise level.
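Here is a minimal sketch of such a two-piece quadratic phase spline, written from the constraints described above (endpoint phases and frequencies, phase and frequency continuity at the midpoint, and an unwrapping integer chosen for maximal smoothness). The derivation and names are mine; the paper's exact formulation may differ.

```python
import numpy as np

def quadratic_phase_spline(theta0, omega0, thetaN, omegaN, N):
    """Two-piece quadratic phase over n = 0..N (frequencies in rad/sample).

    Piece 1: theta(n) = theta0 + omega0*n + g1*n^2            for n in [0, Nc]
    Piece 2: theta(n) = theta(Nc) + wc*m + g2*m^2, m = n - Nc  for n in [Nc, N]
    with wc = omega0 + 2*g1*Nc (frequency continuity at the midpoint).
    The end-phase constraint thetaN + 2*pi*M fixes g1 for each integer M;
    M is chosen to minimize the integral of the squared second derivative.
    """
    Nc = N // 2
    L2 = N - Nc

    # constant part of the end phase (independent of g1), from expanding both pieces
    base = theta0 + omega0 * N + (omegaN - omega0) * L2 / 2.0

    # candidate unwrapping integers around the one implied by the average frequency
    M0 = int(round((theta0 + 0.5 * (omega0 + omegaN) * N - thetaN) / (2 * np.pi)))
    best = None
    for M in range(M0 - 2, M0 + 3):
        g1 = (thetaN + 2 * np.pi * M - base) / (Nc * N)
        wc = omega0 + 2 * g1 * Nc
        g2 = (omegaN - wc) / (2 * L2)
        smooth = (2 * g1) ** 2 * Nc + (2 * g2) ** 2 * L2  # integral of (theta'')^2
        if best is None or smooth < best[0]:
            best = (smooth, g1, g2, wc)
    _, g1, g2, wc = best

    n = np.arange(N + 1)
    theta = np.where(
        n <= Nc,
        theta0 + omega0 * n + g1 * n ** 2,
        (theta0 + omega0 * Nc + g1 * Nc ** 2) + wc * (n - Nc) + g2 * (n - Nc) ** 2,
    )
    return theta

# usage sketch: a noisy end phase pushes the midpoint frequency away from the endpoints
f0, sr, N = 100.0, 16000, 160
w = 2 * np.pi * f0 / sr
theta = quadratic_phase_spline(0.0, w, w * N + 0.8, w, N)  # 0.8 rad of phase "noise"
```

With zero added phase noise the midpoint frequency stays at the endpoint value; the 0.8-radian perturbation in the usage line pushes it up by roughly 25 Hz, which is the harmonicity-breaking behavior described in the example above.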
So we've discussed the instantaneous amplitudes — linear interpolation for those — and the phases, with the quadratic phase model. Now we're going to discuss this term here, which is the coherent noise modulation term. This is nothing more than a way to shape the noise according to the aspiration pattern that Yannis presented in the previous slides. And it makes sense mainly in voiced fricatives and in the higher frequencies of the vowels, because you don't want it to be buzzy. What does this term do, effectively? Let's look at it a little closer. It has a fixed term gamma-zero here, and then it has the aperiodicity multiplied by the cosine of the first harmonic. So if the aperiodicity is equal to zero, it's just a fixed term, it doesn't do anything. If the aperiodicity is equal to one, it becomes like this signal here, which has this behavior around one. So if you modulate your signal like this — this is your harmonic, right, and as we saw, a sinusoid can itself synthesize noise — then this term modulates that sinusoid with this time envelope and concentrates the noise around the glottal closure instants. That is in the time domain. In the frequency domain, it takes energy from the center of the harmonic and spreads it around, because that's what happens when you multiply a signal by a cosine: it shifts energy to plus and minus the center frequency. And the question is, well, does it work? The French language is particularly rich in voiced fricatives — in French, for example, "vasivaza" — and we got a great improvement there, as we're going to see from the results. Here are some references: of course there is Yannis's work here; McCree as well, who was one of the first to propose this; Bastiaan Kleijn, of course, who did a listening test to show how time-frequency masking of noise happens, because that's what we're effectively trying to do — we're trying to mask the noise we incorporate into the speech signal with this time envelope; and then I also used it in my LF-plus-noise vocoder. As for the experiment, there are several parameters, and I'm not going to go through the details because I only have five minutes. What I want to say is this: in terms of computational complexity, it is exactly as fast as our previous implementation of mixed excitation, which was quite optimized as well. Then, when we did a listening test for English — this is the MOS of the recorded speech; we're talking about a clean database now, a noisy database could be different — this is Vocaine synthesis with STRAIGHT analysis, and this is what Vocaine synthesis with our own version of STRAIGHT got, because STRAIGHT is patented, so we can't use it; we have our own implementation of something similar to STRAIGHT. So we got this number, and the actual implementation of STRAIGHT got this number here when used with Vocaine synthesis. STRAIGHT itself produced this MOS here, 4.07. So you can see we matched the quality of STRAIGHT while being very, very fast and able to put our method on small devices — but we also got something clearly better when we combined the merits of STRAIGHT analysis with Vocaine. And this is where we are on the scale; all of this is for English. When we go to French, we see a great improvement: STRAIGHT goes to 4.016 while the same type of analysis with Vocaine synthesis goes to 4.265. Why? Because of the voiced fricative model. Our previous analysis was getting numbers like this one, 3.3, in French, because it could not deal properly with voiced fricatives; now we are at 4.3. That's the version you are using whenever you use your mobile. The overall summary is that the original speech gets an MOS of about 4.5; STRAIGHT analysis with Vocaine gets 4.2; STRAIGHT alone gets 4.05; our own analysis with Vocaine gets 4.05; and our own analysis with the previous two vocoders gets 3.7 and 3.5. So effectively we got an improvement of 0.5 to 0.75 MOS in the server synthesizer and about 0.7 in the embedded synthesizer. Those are quite huge improvements. Let's wait a little bit, I have five minutes.
Now, I did another experiment that's quite interesting, because it has, in the same experiment, unit selection systems, parametric systems, copy synthesis with vocoding, and recorded speech. If you put them all in the same listening test, you get an idea of how the distortions relate — what is important, for example, which distortion matters when you do unit selection, and how large the degradation coming from vocoding is relative to the degradation coming from the speech synthesis itself. Here you see our unit selection systems, two of them, two different technologies, a first generation and a second generation, at 3.788; they all land around the same 3.8 MOS, let's say. Copy synthesis with STRAIGHT goes to 4.09, copy synthesis with our own analysis and Vocaine goes to 4.176, and with STRAIGHT analysis it goes to 4.337. Effectively this tells you that when you ask where the degradation comes from in speech synthesis, the vocoding itself is not the primary source of degradation; there are much more significant things you have to cater for. And all the parametric synthesizers are here, so you can see that this one here — Vocaine plus mixed excitation with an LSTM synthesizer — was 3.738. So we pretty much got close to the unit selection. The same thing happened for French; I'm going to skip it and go to the next question: OK, we showed that we have a statistical system that gets very close to the unit selection, but how close can you get? Can you beat the unit selection, and to what extent? What we got is that for US English the unit selection system is much better than three different versions of the statistical systems we have here in blue — an embedded one and two non-embedded configurations — and this one is much better. Why? Because American English is a language that Google has really invested in and wants to make sound very good. At the same time, for Spanish the difference is smaller, because we haven't invested as much time; and you see the same picture if you take all the European languages — for some of them you actually have only a small difference. (Question: you said not that much time — you mean not that much data?) What I mean by "not much time" is that it is both the data and cleaning the data, and also how many people you put on it, how much effort you put into fixing the unit selection system, because a unit selection system effectively comes down to going through every single waveform and fixing bugs, yes. (Question: are the evaluations on one speaker, one voice?) All these evaluations are made on one voice, with a single speaker. (Female or male?) Always female, because people want to hear female voices — it's sad, but all the results I present are female only. (Question about how the evaluation is set up and run.) That's a good question, very good question, thank you. In this evaluation we used 100 sentences, selected to reflect the kind of usage: half of them, or a little less than half, originated from DriveAbout, so they were kind of in-domain, semi-in-domain I would call them, and the rest were web answers and material coming from the logs. So they were, I would say, 50% in-domain, 30% out-of-domain, something like that.
Now we use about 1,000 — I think 900 — utterances for evaluations. Still, 100 utterances with seven or eight ratings per utterance is a very, very big test to run, much bigger than what people do in other cases. (The average duration?) The average duration is between one and three seconds; it effectively corresponds to our usage, the way people use Google TTS. (You mentioned mixed excitation in the embedded vocoder — is it using the STRAIGHT information in its algorithm?) It's using STRAIGHT-like parameters. (How many bands are there?) Seven bands, for the 22 kilohertz signal. So it is a typical, standard mel-cepstrum-based mixed excitation, with the filter built from the mel cepstrum. (Those were the STRAIGHT bands, right?) Yes — well, these bands are actually not from STRAIGHT, they come from the mixed excitation vocoder paper of Alan McCree, exactly the same, yes. OK, this is what happens in European languages, but if you go to Asian languages we see a much bigger improvement, mainly because back then we didn't understand the Asian languages as well; now we have more people from China and much better representation, so we're doing better. But when this test was run, there was a clear benefit of the parametric system over the unit selection. So what I would conclude out of this is: if you can spend a lot of money to make your unit selection systems sound good, that's great; if you cannot spend a lot of money on that, then the LSTM may actually be better, and with a slightly smaller amount of data as well. So, there's lots of discussion, and here are the papers. I'm already five minutes late, so I will take questions for five or ten minutes and then we can go. Yes, anyone? (Question: you started out in your description of Vocaine saying that it was universal, in the sense that it was compatible with STRAIGHT; but in what followed I kind of lost track of whether what you presented corresponds to a different model, or a different implementation that one plugs STRAIGHT into.) OK, so STRAIGHT here: you use the STRAIGHT analysis, and we use Vocaine for synthesis. As I said at the beginning, Vocaine is the part that does the generation of the waveform; it is not part of the analysis. Think of it like MP3 implementations: the MP3 standard tells you how the synthesis is done, and you can do whatever you want in the analysis as long as your bit stream meets the specification. This is exactly what Vocaine is, in a way: the thing that takes you from the parameters to the actual signal. So it is STRAIGHT compatible, it is HNM compatible, it is aQHM compatible — you can use whatever model you want. (What about the part with the quadratic phase splines?) That part was about how to generate the signal once you have the two endpoints and, at those endpoints, you know the phases. So it's a very simple question: I give you the phases in this part of the signal and in that part, and the amplitudes in this part and in that part — we're talking about instantaneous phases and amplitudes, because they are given for a particular time instant — and I want you to synthesize something. What you might say is: let's do an overlap-add, right? That's not the best you can do. Assuming stationarity, an overlap-add, yes. And Vocaine is going to do this regardless of whether you feed it STRAIGHT parameters or not.
Exactly, yeah — there is no STRAIGHT in there; it's something that produces parameters that are STRAIGHT-like: it has aperiodicity in bands and it has an amplitude spectral envelope. As long as it gives you the parameters, you can do whatever you want with it. (Question: could you elaborate a bit more on how you get the phase?) That's a good question, a very good question — I should write this down, I should be keeping these, I guess. So, we have two components there, right? Every phase has a deterministic component plus another component that depends on the aperiodicity. Let's write it as a function of the aperiodicity: this is the deterministic part of the phase, this is the stochastic part of the phase. For the deterministic part I have used many things. I have used minimum phase — you can extract it under a minimum-phase assumption from the log spectral envelope. I have used maximum phase, the complete reverse. I have used mixed phase. But I ended up using this very simple thing: I fill up a vector with random values from a uniform distribution between minus pi and pi. Then this entry is the phase that always goes to the first harmonic, this is the phase that always goes to the second harmonic, this is the phase that always goes to, let's say, the 200th harmonic if there ever is one. So as you move across harmonics, as you have different pitch periods and different pitch, you pick up a different phase. If your signal is stationary, this is completely fixed; if your signal changes, this phase changes, almost randomly in a way — if you have a big change, you get a big change there as well. It's like assuming a fixed phase envelope. But this is the cheapest way I could ever find to put in a phase, because, as you remember, Vocaine has to work on these devices here, right? And not only on this one — this is a high-end device — it has to work on a billion devices, so it has to be extremely fast. So I threw away the minimum-phase assumptions, any sort of phase assumptions, and put that in, because it's much faster. And ultimately, if you hear it, can you tell a difference? I have samples — I haven't presented this in any publication — and you can hear a difference only with maximum phase, and only at the voiced onsets. Why? Because in steady, periodic speech there's no difference at all between minimum phase and maximum phase, which is weird, because minimum phase goes like this and maximum phase is the exact opposite, it goes like this; yet there's no audible difference. But at voiced onsets it does matter, because a voiced onset goes something like this, OK? And if I synthesize with maximum phase, the energy goes here, before the voiced onset, so it changes the perception — it is like pre-echo, in a way, right? You don't want that, so there are sometimes audible artifacts there; but apart from that, it's fine. Which raises a bit of a question: up to what quality do we need phase, and for what reason? I have very good examples that effectively tell me that phase is not the primary problem, let's say; I'm not worried about phase.
What about phase when I have to glue it together with real waveforms? There, yes, I'm worried mostly about linear phase; but I'm not worried about phase when it comes to the waveform modeling itself. I hope that answers the question. And you can open up your mobile and I can play you something — I had a demo here; from time to time, you know, whenever you try to do something without practicing it, it may not work, so you'll excuse me in that case. I will put... hopefully... ah, sorry. Yes. I think this is where you are. I think this is where you are again. OK, hello world. Ah, OK, I found it. Give me a second. OK, it keeps failing. As demos go — I had prepared that, but your mobile actually uses it: if you have the latest Android, you can pick any synthesis voice, set it to local only or disconnect, put it in airplane mode, and you will hear the LSTM. (What about the iPhone?) Yes — I mean, Google Now on the iPhone: well, we do Google Now, but we don't do it for the iPhone, because it's a different company. Google Now uses the unit selection system when you're connected, so if you put it in airplane mode you will force it to use the embedded one. Yes? (Can researchers outside Google use Vocaine?) Of course not. You know how patent law works: you can copy any patent you want — any at all, right? — as long as you don't put it in a product. So you can implement it yourself — there's a bunch of things I haven't told you, because how much can I tell you here, right? — but you cannot use it without violating the primary claims. I'm always pro open-sourcing, and I will open-source a new vocoder, a STRAIGHT-like, yet-another-vocoder that Hideki made when he was in our lab working with me and Heiga; that will happen pretty soon, in due time. So we'll open-source that, but we will not open-source Vocaine, because it is hard to implement and it gives us kind of an edge — it's very fast. You may be able to copy the quality, maybe, but this speed at this quality is a hard thing to do. Yes, other questions? (Just to make sure my understanding is correct: earlier you put up that phase equation, so do you expect the actual phase to be the sum of vocal tract properties and noise properties?) Ah, that's a very good question. It breaks down differently — it's not a source-filter vocoder, right? (I see. I just wanted to ask about the noise part: do you find any correlation between the noise components?) No. If you remember the equation, this is multiplied by a random number — a uniform one that goes from minus a function of the aperiodicity to plus — so it's always random, and you will not have any correlation because of that. It's actually this bit here. Because I don't want to use overlap-add: I do not want to use overlap-add as the way to reconstruct the waveform between two points, right? If you use overlap-add, you are making an implicit assumption about how the interpolation is done, right?
But the problem with overlap-add is what happens when you have two quite different frequencies — say 100 Hz and 140 Hz. You can deal with those differences, OK, there's no problem, but that is exactly the point of interpolating rather than doing overlap-add with its stationarity assumptions. The other thing that happens with overlap-add is in the higher frequencies. With overlap-add you make the assumption that the pitch period represents your pitch; that's true if your f0 gives you an integer pitch period, but if it doesn't, if there are deviations, then the overlap-add process has two main disadvantages. One is that it dampens the higher frequencies, because it produces small random misalignments of components that interfere destructively at the higher frequencies: one sample or half a sample of misalignment at one kilohertz is nothing, but half a sample at eight kilohertz is a really huge amount, and that introduces a damping of the third formant, the fourth formant, which sounds like less vivid speech. The other thing is that if you try to synthesize noise with overlap-add, the noise has the mean you expect, but its variance fluctuates with your synthesis window. The variance of the noise is reduced wherever you average two noise samples: the average of epsilon-one and epsilon-two has less variance than epsilon alone if they follow the same distribution. So in the overlap region, where you sum up the noise from this frame and the noise from that frame, the variance of the noise goes down. You have a fluctuation of variance, and I don't want to have that. Some high-quality methods in audio were actually trying to minimize this with special windows or something like that.
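A tiny numerical check of that variance argument (a generic sketch, not Vocaine's or any particular vocoder's code): overlap-adding independent white-noise frames with a Hann window leaves the mean alone, but the local variance dips and rises with the window, whereas a single continuous noise stream keeps it flat.

```python
import numpy as np

rng = np.random.default_rng(0)

frame, hop = 256, 128                      # 50% overlap
window = np.hanning(frame)
n_frames = 200
length = (n_frames - 1) * hop + frame

ola = np.zeros(length)
for i in range(n_frames):
    start = i * hop
    ola[start:start + frame] += window * rng.standard_normal(frame)

# estimate the local variance at each offset within the hop cycle
offsets = np.arange(hop)
local_var = [np.var(ola[frame + off : frame + (n_frames // 2) * hop : hop])
             for off in offsets]
print("min/max local variance across the hop cycle:",
      round(min(local_var), 3), round(max(local_var), 3))
# the max is clearly larger than the min, i.e. the noise variance pulses with
# the window; a continuous noise source (or phase noise on sinusoids, as above)
# has no such pulsing
```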
(Question: did you compare your quadratic phase interpolation against the previous cubic interpolation?) Well, unless you're asking about formal experiments — the short answer is that there is no formal experiment for this. I've used the cubic interpolation method quite a lot, and it had random glitches, because you cannot control where most of the change goes. The maximal-smoothness criterion — the second derivative of the phase — spreads the change out, but when you have a phase difference from here to there, you have to distribute it somehow within the glottal cycle, and when you distribute that difference in order to transition from this phase to that phase, you would like it to have certain properties. The property I wanted was to break harmonicity as much as possible, and the cubic model doesn't explicitly break harmonicity for me. So if you try to synthesize noise with it, occasionally you get glitches, or you don't get clean noise, you get noise with a pitch in it, and so on. This one, this guy here, allows you to do it, because it goes and puts the deviation straight in the middle, so it breaks harmonicity very efficiently. Also, that model has an extra parameter, and when you are counting every computation in order to fit the specifications, having a model with less complexity is a much more desirable thing to have. Yes. We're done, thank you — I think we're already way past four. Thank you very much.