The first day of this school is dedicated to signal processing. We will show you how to analyze, process, and synthesize signals. These models are very important because, as you move towards more statistical approaches, you need a vocoder, at least at the moment, and the quality you can get from a statistical model depends a lot on the vocoder. Recently there have been advances in sinusoidal modeling. The sinusoidal model has been around for speech since 1984, starting from a conference paper by McAulay and Quatieri in 1983. This is very important work, and it actually received the IEEE Best Paper Award in 1985; probably some people here were not even born yet, but it was an important model that we used a lot in speech.

We will first show you the basics of sinusoidal modeling. Why? Because the next technical lecture is on the advanced sinusoidal model, so it is important to make sure we all have a common understanding of the basic one. I apologize to those of you who are experts in sinusoidal modeling; you may not be so interested in the first part. But first, I will introduce a notation, so even if you are an expert it is worth watching. Second, the point of view from which we look at the sinusoidal model matters: the view for speech coding is different from the view for speech synthesis, and the sinusoidal model was originally suggested for speech coding. So we will first see the basics of the sinusoidal model and how it is connected to the speech production mechanism. I also apologize to people who are not signal-processing oriented, because this part is a bit heavy in mathematics, but I will try to explain as much as I can.

When we get to the advanced sinusoidal model, we will see more opportunities for using it in speech synthesis. Then Yannis Agiomyrgiannakis from Google will show you how to combine this into a real vocoder that is currently in use at Google; he will teach this afternoon, and later on he will also cover voice morphing. That is why we start from sinusoidal models. Ranniery Maia from Toshiba has a different view of vocoding: he uses complex cepstra, built from the minimum-phase and all-pass information of speech, but unfortunately he is not here with us yet. So we will study the sinusoidal model in depth, and hopefully when Ranniery joins us he can talk about complex cepstra in one of the later sessions.

So that is the introduction and why we picked the sinusoidal model. I will talk about speech production first, just to see the principles and the modulators, and to see how the sinusoidal model fits in that context. You can view the sinusoidal model very simply as a Fourier transform, so why bother with speech production and modulators? Because if you want to do something useful for speech synthesis, voice conversion, and so on with the sinusoidal model, it is very important to see the connection between it and the speech production mechanism. Then we will look at another category of sinusoidal models, the harmonic models, and there we will talk about the harmonic plus noise model.
That is a model I developed during my PhD, when I was around your age. We suggested three harmonic plus noise models there, so we will quickly look at these models and try to explain where they come from and why we suggested them at that time. We will focus especially on one of them, because it is the starting point for the next technical lecture, which covers the quasi-harmonic model and the adaptive harmonic model.

Here is a simple view taken from an excellent book on speech processing by Quatieri, from MIT Lincoln Laboratory. If you are really looking for a book, that is a very good one; it is the number one reference here, and you will see it again later. We start with the power supply, which is our lungs. The air passes through a modulator, the vocal folds, and then through the pharynx the glottal airflow goes towards the oral cavity and the nasal cavity. At that level several things are produced: periodic puffs, as shown here, for voiced speech, and noise for unvoiced speech. The first gives periodic signals like a sustained vowel, and the noise gives sounds like "shh". Impulsive sounds like t and k are not generated here; they are generated in the vocal tract, which is this one. Although only this part is labeled as a modulator, all of these are modulators: we have a major modulator here, which is very important for the quality of speech, and another modulator that makes sounds recognizable. The modulator at this level is associated with the quality of speech, and when we do speech synthesis the first thing we are asked about is quality; that is why this part is so important to address.

The problem is that we do not have access to these modulators. We only have access to the signal we measure here, as a change of pressure; that is what a microphone collects. This signal then travels and arrives at this lady's ear, where there is another mechanism. We will talk about it a little, because we will discuss speech intelligibility, which comes from a mechanism of our auditory system. There are the outer, middle, and inner ear; the inner ear is the cochlea, where the signal is analyzed into frequencies. This is the place theory of Békésy, and on Friday, when I talk about the 25th anniversary of the Cambridge Research Laboratory of Toshiba, I will show you some very interesting pictures of the inner ear and the decomposition of sounds into frequencies. That is Békésy's 1961 place theory, and he received a Nobel Prize for it. The signal then goes to the acoustic nerve, which sends the stimulation to the brain, and somewhere there, in areas that we more or less know, the representation of the sound is located. The patterns we create inside the brain then drive our modulators, and through the sounds we produce we have a feedback mechanism through our own ears: we have stored a pattern for, say, our mother's or father's speech, and we try to mimic it.
We move our articulators, notice that the sounds are not exactly the same, and correct them; this is how we learn to speak in the first three to five years. The next level, of course, is the greatest invention humans have made, language, which comes through speech and carries a very high level of information. Language is associated with many things in everyday life, like this person telling this woman to go and do the shopping. You see here the unvoiced sounds, the voiced sounds, which come from this mechanism, and the impulsive sounds, like p, which is created at the level of the lips: the airflow stops there, then there is the release, p. So we recognize three sources: noisy, periodic, and impulsive.

The first source, as I said, comes from the vocal folds. Again, on Friday I will show you the vibration of the vocal folds; right now this is just a schematic, the left one here and the right one there. This is when there is voicing: the vocal folds are under tension, the air passes through, and there is oscillation. For breathing, by contrast, the vocal folds are quite relaxed, and that is very important, because we want to take in as much air as we need in a very short time, so they should not be an obstacle. When there is voicing, it is important to control the vocal folds, making them more relaxed or more tense, in order to produce the different sources we have seen. That means I use a different tension to produce a sustained vowel, a different position of my vocal folds. I control this through the brain, and I can control it because I have heard these sounds and I have the pattern.

At some point I might lose that control. For instance, as I get older I will probably start losing the capability of controlling my vocal folds, and that can be one of the first signs that something is going wrong in the brain, because I can no longer control what I learned. This is not an animal process any more; it is a human process. So when problems start, first with memory, I may eventually also lose the capability to control my vocal folds. The same thing happens if I can control them but the vocal folds themselves have problems; that is, first of all, speech pathology. Yes, please? "Will these slides be available online?" Yes, so you do not even need to video them. These are very important functions, and our lab has a lot of experience with quality of voice: how to assess it, either to check for pathology in terms of vocal fold vibration, for instance polyps (I almost forgot the term), or for damage in the brain. But we are not going to look at that in this summer school.

Regarding the vibration: people sometimes call these the vocal cords, but we discovered that they are not really cords, they are folds; they have mass, and they move in this way. I have a video showing the vibration during sustained phonation, something you can also find easily on the internet, and it suffers from aliasing.
The thing is that videos run at 25 frames per second, but my vocal folds right now, when I speak, vibrate about 120 times per second. So how can I see that? The answer is that you have to use a high-speed camera, and then play the recording back at 25 frames per second; then you see the movements without aliasing. If you film the vibration with an ordinary camera you will see aliasing: you see some of the activity, but not the actual vibrations.

When there is vibration, we recognize two, or rather three, major phases. First there is the closed phase, when the vocal folds are closed and no air passes through; that is here. Then there is the open phase, this one: the glottal airflow velocity gradually increases, whereas in the closed phase essentially no air goes through. Then there is the return phase, where we go back to the closed phase, and the cycle continues. This is the periodic pattern we observe when we say, for instance, "ah". When we say "shh" there is no such periodicity.

Why is there this return phase? Everything has to do with pressure. Why do the vocal folds open here, and why do they enter the return phase there? It is always controlled by the air pressure, and there are two air pressures: below the glottis and above the glottis. When the air pressure below the glottis is smaller than the pressure above it, we are in the closed phase. When the pressure below the glottis is higher than above it, the glottis starts to open and air passes through. As this happens, the pressures above and below the glottis equalize, and then the return phase starts: the folds are open, but they have mass, so they cannot stay there; they come back down. Depending on how big your vocal folds are, this happens more or less quickly. This is one difference between males and females: males have a bigger mass of vocal folds than females, so through inertia the folds vibrate more slowly, which is why male speakers have a lower pitch, a lower fundamental frequency, than female speakers. The same reasoning applies to children, whose vocal folds have even less mass than female speakers'.

Another thing to observe is that this is again not the real glottal flow velocity, just a schematic representation, and there are two issues with it. First, the glottal airflow does not move that smoothly from here to here; there are oscillations, not a straight line or a kind of hyperbolic curve. These oscillations are related to what we call the glottal formant: this is a filter, the first filter (a modulator means a filter), a system that has an eigenfrequency, and that eigenfrequency is the glottal formant. It is obviously related to the particular vocal folds that I, Yannis, have.
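As a small illustration of these three phases, here is a sketch of one period of glottal flow using a Rosenberg-style pulse, a common textbook approximation; the function name and the parameter values are my illustrative choices, not something from this lecture.

```python
import numpy as np

def rosenberg_pulse(P, open_frac=0.6, return_frac=0.1):
    """One period of a Rosenberg-style glottal flow pulse.

    P is the pitch period in samples; open_frac and return_frac set the
    relative durations of the open and return phases (illustrative values).
    """
    n = np.arange(P)
    Tp = int(open_frac * P)             # end of the open phase
    Tn = int(return_frac * P)           # duration of the return phase
    g = np.zeros(P)                     # remainder of the period: closed phase
    g[:Tp] = 0.5 * (1 - np.cos(np.pi * n[:Tp] / Tp))                 # opening
    g[Tp:Tp + Tn] = np.cos(0.5 * np.pi * (n[Tp:Tp + Tn] - Tp) / Tn)  # return
    return g

pulse = rosenberg_pulse(P=133)          # roughly 120 Hz at 16 kHz sampling
```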
I have these vocal folds, you have different ones, and that is also a clue that people exploited around 1995, extending their feature sets for speaker identification using source information as well. I will come back to that when I talk about the 25th anniversary of Toshiba; I keep mentioning Toshiba. Anyway, that is one thing. The other thing is that when the vocal folds open, the air does not just travel straight into the vocal tract; a lot of noise is also generated around the vocal folds. That means some sounds are not really periodic: besides the quasi-periodic pulses, there is also noise produced at the edges of the vocal folds. This is well known in acoustics wherever sound is produced.

The last thing concerns the claim that this is a periodic signal, so what happens here will be repeated. There is no such thing in nature, and I would claim not even in machines, that is periodic in a strictly mathematical sense. First of all, you can see that this cycle cannot be identical to that one, so the signal is not periodic in the strict sense. You may say: we are engineers, not mathematicians, it is about the same. Fine, but there is another problem: the distance from here to here, the period itself, not just the shape, is not the same as from here to here. This is never true in speech, and again I would claim no machine can achieve it either. So there is no strictly periodic sound. Why do we bother with periodic sounds, then? For the same reason we bother with complex numbers: because they help us develop systems that we can then use as engineers. So these are the observations about this little modulator. You have to pay attention to it, and protect it, otherwise you will have problems, like the problems I will have at the end of this talk if I do not drink water during my speech. It is very important.

Now we move from that modulator to the vocal tract. There we have very important articulators. One of them is the tongue, which matters because it can move; you have probably seen videos, and I will show one on Friday, of how the articulators move. It is amazing how much we can control them. The tongue can be positioned from back to front, giving back vowels or front vowels; it can touch the palate, like here, which is the way to produce plosives; or it can be high but not touching the palate, producing the fricative sounds. And of course we have the lips, and the teeth: if something is missing here, I can no longer produce the "sh" sound very well.

That was the high-level description of the speech production mechanism. In order to actually study it, you have to use mathematics and signal processing theory, and the important operation here is convolution. Let us assume that the vocal tract is a filter with an impulse response h(n). The excitation signal that drives it can itself be written as a convolution: the glottal signal g(n), the opening-closing-return shape, convolved with a series of pulses p(n). These are Kronecker pulses spaced by P, where P is the period. When we convolve the glottal signal with this series of pulses, we create the familiar sequence of glottal airflow velocity periods that we have seen.
This gives us u(n), which looks like this one. Then u(n) excites h(n), the impulse response of the vocal tract. Here we use the hypothesis that h(n) is a linear and time-invariant filter; otherwise we are not allowed to write convolutions in such a simple way. The result is the speech produced outside the lips. We then observe this signal through windows, in order to make the assumption that the signal we are observing is stationary. Why? Because the mathematical tools we use assume stationarity. We will see that this is not really true, and the further you go towards advanced modeling of speech, the more you have to take that into account. So we have the windowed speech around time t, or tau as we say in Greek. We can then apply a Fourier transform, and those of you familiar with signal processing know that convolution in time means multiplication in the frequency domain, so these convolutions become multiplications: this is the Fourier transform of h(n), this is the Fourier transform of g(n), the glottal signal, and p(n), which is periodic, transforms to another train of pulses.

"What is u(n)?" u(n) is the excitation of the vocal tract. "And what is h?" h is the impulse response of the vocal tract. "Is the lip radiation missing from this analysis, or included?" It is included: h(n) is considered to contain the lip radiation together with the vocal tract.

When we take the Fourier transform, p(n) becomes a pulse train again, scaled by 1/P, with pulses repeated not every P but every 1/P in frequency; these are the frequencies omega_k. The convolution with the window remains, and when you write out the equations (the details are not necessary here) you can move the window transform inside the sum: the result is essentially a modulated window, the Fourier transform of the analysis window sitting on multiples of the fundamental frequency if the signal is periodic, or at the frequencies omega_k. That is the Fourier transform of the windowed speech signal around that time instant.

How does it look? If we take the magnitude spectrum of H(omega)G(omega), we get this bold line, which is called the spectral envelope. But we do not actually observe this. What we observe, based on this equation, is a sampled version of that envelope, with the sampling defined by the frequencies omega_k. In short, we observe these peaks. What is each one of them? It is the main lobe of the Fourier transform of the window, modulated so that it sits at the different frequencies omega_1, omega_2, and so on. So this is what we really observe, and we know that the underlying process has a filter, combining the vocal tract and the glottal filter, with a spectral envelope; but we do not observe that envelope, only these peaks. How to go from these peaks to an envelope is a very important research problem; it is at the heart of coding and vocoding for speech synthesis, and it is where STRAIGHT, a very important vocoder that we all use for speech synthesis, comes from: spectral envelope estimation is one of its keys.
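To make the source-filter picture above concrete, here is a self-contained sketch of the chain u(n) = g(n) * p(n) followed by x(n) = h(n) * u(n); the glottal pulse shape and the filter coefficients are toy choices of mine, not values from the lecture (the Rosenberg-style pulse sketched earlier could equally be used for g).

```python
import numpy as np
from scipy.signal import lfilter

fs, f0 = 16000, 120.0
P = int(fs / f0)                     # pitch period in samples

p = np.zeros(fs)                     # one second of excitation
p[::P] = 1.0                         # train of Kronecker pulses spaced by P

ng = np.arange(P)                    # toy glottal pulse g(n): open phase, then closure
g = np.sin(np.pi * ng / P) ** 2 * (ng < 0.7 * P)

u = np.convolve(p, g)                # glottal airflow u(n) = g(n) * p(n)

# Toy all-pole vocal-tract filter h(n): a single resonance standing in for a
# formant; in the lecture's convention h also absorbs the lip radiation.
x = lfilter([1.0], [1.0, -1.3, 0.9], u)   # speech-like signal x(n) = h(n) * u(n)

w = np.hamming(400)                  # 25 ms analysis window at 16 kHz
frame = x[2000:2400] * w             # windowed speech around one time instant
```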
Regarding the spectral envelope, you will see more from Yannis later on, who will show how that information is used. Very important for speech production are the frequencies where this underlying, unobserved envelope has its maxima. In mathematics these are called eigenfrequencies; in engineering and speech processing they are called formants. Finding these maxima matters a great deal, because it is through them that we recognize speech, and that we do automatic speech recognition, either explicitly or implicitly. Now, when a lobe happens to sit exactly where the formant is, we are lucky, because we can find the first or second formant directly; normally this does not happen very often. However, people trained to sing opera know this mechanism and change their vocal tract accordingly, so that the formant and omega_1, the fundamental frequency, coincide, one on top of the other. Why? Because here you do not have much energy, while here you do. The goal is to reach omega_1, the tone the orchestra is playing, so you have to move your formant onto the pitch the orchestra is playing without changing the identity of the vowel you are singing. This is learned over many courses when you study opera singing.

After that general production mechanism, we arrive at the more specific models, the sinusoidal models. What is a sinusoid? A sinusoid is an oscillation: A is the maximum displacement, the distance from here to here is the period P, the fundamental frequency is f_0 = 1/P, and this delay from the observation time t = 0 is called the phase displacement. That gives the signal A cos(2 pi f_0 t + phi). The displacement phi is very important in speech synthesis too, because it is associated with the excitation phase, and the excitation phase matters for the quality of the speech you generate. So that is a single oscillator, and we can write 2 pi f_0 as omega_0. Now imagine a sum of such oscillators. This is a powerful model, because it can model a lot of sounds, including speech and music. So we go from that simple oscillator to a sum of oscillators, and this is the sinusoidal model.
But cosines are even functions, which means that in this simple way we can only produce even functions, even sounds; yet a sound also has odd parts, and any signal can be written as the sum of even and odd parts. So we have to add a sum of sines, b_k sin(...), with another phase. If we use Euler's formula, instead of cosines and sines we can use complex numbers and move to exponentials, and that is what we have here: exponentials in 2 pi f_k, that is omega_k, times time, with K components. The gamma_k are the complex amplitudes; this is a polar format, and the gamma_k, being complex, carry both the amplitude information and the phase displacement.

A specific case of the sinusoidal model arises if I constrain f_k to be k times f_0. Then I do not have many frequencies to bother with, just one, f_0, under the assumption that all the frequencies f_k are multiples of f_0. Once we have the model, we have to estimate its parameters from speech, and usually a linear approach is used. Which are the unknowns? x(t) is known; the unknowns are the gamma_k and the f_k. This system is non-linear: you cannot solve it directly in a linear way. However, if you know the f_k, or if you know f_0 and hence the f_k through the assumption f_k = k f_0, then the system becomes linear and you can solve it very easily. What does that mean? When I know the f_k, I can construct a space generated by the exponential functions, and this operation projects s, the speech sound we have measured through the microphone and analog-to-digital conversion, onto that space; the output is the gamma_k, the complex amplitudes. That is the linear-algebra way to obtain the parameters.

There is also an engineering route that arrives at the same equations: mean squared error estimation. These are the measurements (I said x was the speech before, but actually it is our model), and we try to minimize the distance between the model and the observations; this is called mean squared error for obvious reasons. We take the derivative of the error with respect to the unknown parameters, and again, if we know the f_k, the system becomes linear and the solution comes from the linear system I showed here.
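Here is a minimal numpy sketch of that projection, assuming the frequencies f_k are already known; the function name and the numerical details are mine.

```python
import numpy as np

def estimate_amplitudes(s, freqs_hz, fs):
    """Least-squares estimate of the complex amplitudes gamma_k for the model
    x(n) = sum_k gamma_k exp(j 2 pi f_k n / fs), with the f_k known.

    The columns of E span the space generated by the exponentials, and the
    least-squares solution projects the observations s onto that space.
    """
    n = np.arange(len(s))[:, None]                   # time indices as a column
    f = np.concatenate([-freqs_hz[::-1], freqs_hz])  # +/- f_k so a real signal fits
    E = np.exp(2j * np.pi * f[None, :] * n / fs)     # N x 2K basis matrix
    gamma, *_ = np.linalg.lstsq(E, s, rcond=None)    # solves min ||E gamma - s||^2
    return f, gamma
```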
One point: if you look at this, it is simply the Fourier transform. You can apply it to other sounds, to the noise that cars generate, or the noise from these ventilators. And when you come to speech, of course, you can also just take the Fourier transform, get amplitudes and frequencies, and do something with them; I will show you how. But if you do not see what is happening behind the scenes, and you do not take the speech production mechanism into account, you have no chance of enhancing this model and applying it effectively and efficiently to the speech synthesis problem. So let us take a step back and ask: can we see the same model in the context of a source, meaning the glottal airflow velocity, and a filter?

Traditionally in speech coding we had a series of pulses: if the frame is voiced, we use a pulse train, if it is noise, just noise components, and then we can mix them; we built linear predictive coding, code-excited linear prediction, and so on. Here instead we say that the excitation, the source signal u(t), is a sum of sinusoids, because we know that with a sum of sinusoids we can produce any sound, from noise to purely periodic signals. So I can have a general representation of the excitation. This is the amplitude information, so this is an AM-FM signal: it is modeled using amplitude modulation and frequency modulation components. The frequency modulation comes from the phase, and the phase is defined as the integral of the instantaneous frequency plus the phase displacement I mentioned earlier. This is a time-varying function. You can simplify it by assuming the frequency is k times omega_0, constant over time; then the integral is trivial, and the phase is just a straight line in time. If instead you let the frequency move linearly over time, the integral makes the phase a quadratic function, and so on, each time adding one order.

The impulse response of the vocal tract, including the lips as was pointed out before, also carries magnitude information and frequency information, the latter through its phase; so it is again an AM-FM representation. Now we pass u(t) through this filter. If u(t) is not a sum but a single component, a_k times an exponential, we know from system theory that an exponential is an eigenfunction: the output of the filter is exactly the same exponential, with its amplitude modified by a constant, which is the eigenvalue. If the system does not possess that eigenfrequency, the output at that frequency is zero. We can extend this to a sum: because the system is linear and time-invariant, the output is the corresponding sum of eigenfunctions, and that is the speech shown here. What is the resulting amplitude? It is the amplitude coming from the vocal folds multiplied by the amplitude response of the vocal tract. And because of the exponentials, while the magnitudes multiply, the arguments add: the phases are summed, the time-varying phase of the excitation signal plus the phase of the vocal tract filter. If we now replace this phase phi_k(t) with the integral we had before, plus the phase displacement, and add the phase from the vocal tract, speech can be seen like that: as simple as a Fourier transform, but when you decompose it into excitation and filter, you have to account for this information. So we are making connections between the sinusoidal model, not just as a Fourier transform, but as a system that describes the speech production mechanism with high fidelity.
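In symbols, the eigenfunction property used above is the standard LTI-system identity (written in the notation of this lecture):

$$
a_k\,e^{j\omega_k t}\;\longrightarrow\;H(\omega_k)\,a_k\,e^{j\omega_k t}
\;=\;|H(\omega_k)|\,a_k\,e^{\,j\left(\omega_k t+\arg H(\omega_k)\right)},
$$

so the magnitudes multiply while the phases add, and the filter is sampled exactly at the excitation frequencies omega_k.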
Question: "Does the big omega_k there depend on the little omega_k, the frequencies being sampled before?" Could you please repeat your question? Yes: we use omega_k here and omega_k there because, when we pass this eigenfunction through the system, the output is the same function at the same frequency, the eigenfrequency, multiplied by an eigenvalue which is the product of the two amplitudes, a_k and the value of M at omega_k. The M here is a function, and its value at omega_k, at time t, is the sampled version that we see of that function, both for amplitude and for phase. So when we speak, we produce different frequencies, and those frequencies sample the system at those specific frequencies. And this is time-varying: I have not written the Greek letter tau here, but you can think of this omega_k as a function of time as well, because it changes while I am talking. I will go through this slide, then we have a break, and then we will come back.

I go slowly here because it is very important to understand this part; if you understand this, you do not need me for the rest. The important notion is stationarity. The basic tools we use to analyze speech, like the Fourier transform, assume that the signal is stationary. What does that mean for this model? It means that around the instant at which we observe the signal, the amplitudes, the frequencies, and even the number of components, which in general is a function of time, are constant inside the analysis window; they do not change. What are the implications? If, for instance, we consider that the frequency does not change within a 25-millisecond analysis window, then the phase is a linear function of time, a polynomial of order one, because of the integral, and this is exactly what is shown here. The phase is a function of order one in the variable t; t_l is the time at which we do the analysis, the center of the window. Under this assumption, t lives inside this window: t_l is the center, and we go from t_l minus T/2 to t_l plus T/2, where T is the length of the analysis window, in samples or in seconds.

If we do a discrete-time analysis instead of a continuous-time one, we have samples n. Then this phasor, as we call it, appears: this part is real, but this is a complex number, a phasor, gamma_k, and the term omega_k t_l is absorbed into it. Because of the discretization of time, in signal processing we write the lowercase omega rather than the capital omega, to show that we are in discrete time. That is how we arrive at this model. Again, you can ignore everything else and treat this as a Fourier transform of speech, but you have to keep all the previous steps in mind to understand that this is not just a Fourier transform: it can characterize the speech production mechanism efficiently, as I said. So it is up to you whether you see a plain Fourier transform, or you know what this gamma_k is and where it came from. Of course, you do not have direct access to all of this information; it is hidden.
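Putting this into one formula, the locally stationary discrete-time model inside a single analysis window can be written as (my transcription of what the slide expresses):

$$
s(n)\;\approx\;\sum_{k=-K}^{K}\gamma_k\,e^{j\omega_k n},
\qquad n\in\Big[n_l-\tfrac{N}{2},\,n_l+\tfrac{N}{2}\Big],
\qquad \gamma_k=A_k\,e^{j\phi_k},
$$

where the phasor gamma_k carries both the amplitude and the phase displacement, and n_l is the center of the window.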
The next question is how we go to the mean squared error, how we estimate these parameters. But I suggest we take a break; then we come back, see how we estimate the parameters of the model, and after that the rest will go much faster. Any questions before you leave the room? Okay, let us take a five-minute break.

So now we can continue. You will probably notice a change in the notation here, and I apologize for that: the observed waveform is no longer s but y, while the model is no longer x but s. We want to minimize this mean squared error, which is this one, and it can be rewritten in this way. This is the Fourier transform: y(n) is the observed signal, we take its Fourier transform, going from n to omega in general, and here we have what in signal processing is called the energy of the observations y(n). We can show that the error can be written in this form, and we want to minimize it. Obviously you cannot do anything about this term; what you can play with is this part. You cannot touch this term either, because it is the Fourier transform of the speech at specific frequencies. But you can minimize this one, and by minimizing it you minimize the total error. This is actually a maximum likelihood estimation: you have a model and you try to match it to the observations.

Question: "But in the omega-prime_k there are still parameters." Yes, because of the omega, indeed; I will come back to that. The first thing to observe as a way of minimizing this is the following: this term is minimal, obviously, when the difference here is zero, and it is zero, for instance, when these omega_k are the ones at which we evaluate the gamma. Remember, we assume we know the omega_k, so that we are solving a linear system; that means these are the peaks (I just said the keyword). We have to minimize this difference, and these are the phasors. Since we know the omega_k, we know where they are located, so we set gamma_k equal to the values of Y(omega) at the frequencies omega_k. In short: I have a function, I know my omega_k are located at omega_1, omega_2, and so on, so I take the values of the function there. Setting gamma_k to those values makes this term zero, and the error reduces to this.

Now comes the point you raised: we want to minimize this remaining term. What is this minimization? It is the energy of the signal minus another energy; you probably remember the wonderful Parseval theorem, which relates this to an energy in time. So, out of an energy spread over many frequencies, you want to capture as much as possible with a few frequencies. Going back to the picture: which omega will you pick, this one or that one? Obviously you will select the red one, because it is a peak; you will probably not select that one. So now you really have an estimate of omega_1, then of omega_2, and so on; I have drawn it quite sparse here.
Nevertheless, of course, I could keep all the frequencies, sample everywhere, and then this error would be exactly zero, because of the Parseval theorem. But I do not want that; I would have learned nothing. I want to minimize the error by picking a few frequencies that are related to speech production. So I have to select a subset of frequencies. Zooming out, this is Y(omega): from where should I pick frequencies to reduce that error? Again, taking all frequencies drives the error to zero but teaches me nothing. If instead I take these, the maxima of Y(omega), I get an estimate of where the omega_k should be. So now I have an estimate of omega_k, and at each omega_k I take the value of the function, which minimizes the error. That is how we estimate the sinusoidal parameters, and why this is a maximum likelihood approach.

We do that for this type of signal, but this type is mostly periodic, or quasi-periodic I would say. What about noise? Can you represent noise with a sum of sinusoids? If you cannot, you need a switch: whenever the frame is voiced you use the sinusoidal model, whenever it is noise you use something else. There are, of course, many speech models that work like that, but it is not an efficient way to do things. So, can the sinusoidal model characterize periodic sounds and noisy sounds in the same way, so that you do not need the switch? That is very important, and the answer is yes. There is an expansion, the Fourier series not for deterministic signals but for stochastic signals, otherwise known as the Karhunen-Loève transform (KLT), which allows us to construct a random process using harmonic sinusoids with uncorrelated complex amplitudes, so as to reproduce the random characteristics. However, there are conditions on how closely spaced the frequencies must be and on how fast we sample the amplitudes over time. You can understand this quickly from the sampling theorem, but applied in the frequency domain: if the spectrum looks like this, you obviously need just one sample to represent it, but the more it looks like that, the more samples you need to represent the process nicely. And again, not all frequencies: in the end we want to represent the speech production mechanism. Remember, this system was proposed for speech coding, and in speech coding you do not have an infinite number of parameters: you have a few parameters and you quantize them, so you cannot have an infinite number of frequencies, which would mean an infinite number of complex amplitudes. The point is to use the minimal information needed to represent the process. The bottom line is that, thanks to Karhunen-Loève, you can represent both the periodic and the non-periodic parts as a sum of sinusoids. That is the key: you do not need a switch any more. This figure shows peak-picking for a voiced frame, and this one for an unvoiced frame, the Karhunen-Loève-like case next to the deterministic representation. In short: a sum of sinusoids for everything.
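Here is a minimal sketch of the peak-picking itself; the helper name, the FFT size, and the amplitude scaling are my own simplifications.

```python
import numpy as np
from scipy.signal import get_window, find_peaks

def peak_pick(frame, fs, n_fft=4096, max_peaks=80):
    """Pick sinusoidal parameters as the strongest local maxima of |Y(omega)|."""
    w = get_window("hamming", len(frame))
    Y = np.fft.rfft(frame * w, n_fft)
    mag = np.abs(Y)
    idx, _ = find_peaks(mag)                             # all local maxima
    idx = idx[np.argsort(mag[idx])[::-1][:max_peaks]]    # keep the strongest
    freqs = idx * fs / n_fft                             # omega_k in Hz
    gammas = Y[idx] / w.sum()                            # gamma_k = Y(omega_k),
                                                         # up to window scaling
    order = np.argsort(freqs)
    return freqs[order], gammas[order]
```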
We found that, for roughly 4 kHz of bandwidth, about 80 parameters are enough to represent speech, both voiced and unvoiced: when we reconstruct the signal, you cannot really tell the difference between the original and the reconstruction. We will talk more about that, and you have a hands-on session in the afternoon where you will play with these kinds of signals, make reconstructions, listen to the difference between the reconstructed and the original signal, and experiment with the number of parameters.

So we call this peak-picking. Now we have the complex amplitudes and the frequencies, so we can put them in the formula. We shift by, say, half the window, the analysis offset, and generate another frame, the next frame, and so on; these are called frames. Then we put them together with overlap-and-add: we add the frames and generate the speech. The slide says "synthesized", but since we do text-to-speech and call that synthesis, let me be precise: this is not text-to-speech, this is the reconstructed signal. That is one simple way to do it.

There is another way, which is to match frames. At time instant l and at time instant l+1 you have estimated components; say you picked omega_1, omega_2, omega_3 in one frame and three components in the next. Since my frequency content changes as I speak, the peak-picking mechanism picks different frequencies in each frame. With three components in each frame, we can do a frequency matching to follow how each frequency changes over time. Of course, I have no information between the centers of the analysis windows, only at the centers, so a simple assumption is linear interpolation from this value to that one, and that works well. But what if we picked three components here and five in the next frame? How do we do the association, the frequency matching? Probably this one goes there, and that one appears somehow without having existed in the past. Or, even with three and three, the change may be so fast, as happens in emotional speech, that the association is unclear. Do you connect this one to that one, or to nothing? These are "to be or not to be" kinds of questions, and to analyze them a bit further than Shakespeare did, we proceed as follows, with a sketch after this paragraph.

At the current frame, the left frame l, and the next frame l+1, each frequency of frame l defines a search interval, plus or minus delta, in which you look for frequencies picked by the peak-picking mechanism in frame l+1. Say there are two peaks inside this interval: the one closest to the previous frequency is associated with it. The other one has two options: to die, or to be born. Because this one appears here for the first time, we say it is born; this is the birthday of that frequency track. This is, obviously, a dynamic programming kind of problem, and here we see how it evolves over time: this one is not associated with anyone, so it must die; this one is new information that was not here before, so it must be born. That is how we get the birth and death process.
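A simplified, greedy sketch of that birth/death matching follows; the full method is closer to dynamic programming, and the names and the default delta are mine.

```python
def match_tracks(prev_freqs, new_freqs, delta=50.0):
    """Associate frequencies of frame l with frame l+1 inside +/- delta Hz.

    Returns (matches, deaths, births): matched index pairs, tracks of the
    previous frame that die, and new-frame peaks that are born.
    """
    matches, deaths, taken = [], [], set()
    for i, f in enumerate(prev_freqs):
        cand = [j for j, g in enumerate(new_freqs)
                if abs(g - f) <= delta and j not in taken]
        if cand:                                  # closest peak wins the match
            j = min(cand, key=lambda j: abs(new_freqs[j] - f))
            matches.append((i, j))
            taken.add(j)
        else:
            deaths.append(i)                      # no continuation: track dies
    births = [j for j in range(len(new_freqs)) if j not in taken]
    return matches, deaths, births
```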
This is how the association looks from one frame to the next, and this one is from real speech. You can see that we have peaks in the high frequencies here, and not much information there. Where does this come from? The answer: from an unvoiced segment. Sounds like "s" have more energy, hence more peaks, in the high frequencies, and since I have put a constraint on the number of components I can pick, the maximum energy for unvoiced sounds comes from the high frequencies. For the voiced segment the energy is distributed more evenly, and not at random places but at positions related to f_0: f_0, 2 f_0, and so on, the fundamental and its harmonics. Here you see no structure, but here you see structure; if you look at a high-resolution spectrogram you can see this kind of picture.

Question: "Every time one of these new peaks is born, doesn't that change K? In the synthesizer you have to go back and pick new gammas, because you have more of them." K can be kept constant. For instance, for the 5 kHz bandwidth here, through trial and error we found that approximately 80 components are good, so we pick a fixed number. This is also very convenient in speech coding, because you have a constant-size vector to work with.

Once we have made the association from this frame to the next, and to future frames, we know which component connects to which. Then we have to generate samples not only at the centers of the analysis windows: speech does not exist only there; it has samples between the frame centers. So we have to generate samples between the analysis (or synthesis) instants, and to do that we have to interpolate. The sinusoidal model has mainly two things to interpolate, since the frequency is already inside the model: amplitude and phase. You know the amplitude here and the amplitude there, but what is it in between? You know the phase here and the phase there, but what is it, as a time-varying function, between the analysis instants? This interpolation is largely why the paper got the best paper award. For amplitudes, the easiest thing is to say the change is linear, a simple linear model. For phase, however, a cubic phase interpolation model is used. I must say that although Quatieri and McAulay published it, Almeida in Portugal was also talking about a cubic phase model at the same time; these people were moving in the same direction independently. McAulay and Quatieri were fortunate that their paper was accepted first, so everybody recognizes their work, and they got the best paper award.

If the phase follows such a model, what is the frequency model, do you think? What is the connection between phase and frequency? To get the phase you integrate the function that describes the frequency; so if the frequency were a polynomial of order 10, the integration would increase the order and the phase would be a polynomial of order 11. Now I have said that the polynomial order of the phase is 3; so what, then, is the order of the frequency?
It means the frequency is allowed to move between the two estimates, not to stay constant, following a quadratic polynomial; so it really has some freedom. The problem is how to estimate the unknowns. For the amplitudes there are no unknowns to speak of, but here you do have unknowns: zeta (as we say in Greek), alpha, beta. I do not have the derivation on this slide, but if you go to the paper there are sufficient initial and final conditions: the derivative of the phase at this time instant must equal the frequency you estimated there, and likewise at the other instant; the phase at the analysis instant must equal the measured phase zeta, and the same at the next instant. You also have to take into account that the phase is always estimated modulo 2 pi. Do you know why? Because it is estimated through an arctangent function, and the arctangent returns values between minus pi and pi; these are the principal values, as we say, of the function. That does not mean the phase takes only these values: adding any multiple of 2 pi gives valid values of the same function, and so on. So modulo 2 pi means your measurements live between minus pi and pi, the principal values, and the ambiguity must be resolved. If you take this into account, you can estimate the coefficients; the details are in the paper, and it is not necessary to go through them now.

If we build the full analysis-synthesis system, you see that we take the phases and frequencies, we do the frame-to-frame matching, then frame-to-frame phase unwrapping and interpolation; this unwrapping determines the zeta, alpha, beta parameters, while for the amplitudes we do linear interpolation. Now we have the phase and the amplitude at every sample, not just at the centers of the analysis windows. Then there is the sine-wave generator: cosine of this theta multiplied by the amplitude; we sum over components and get the signal. That is the block diagram of the synthesis. Does it work? Well, here the upper panel is the original speech, actually from a jazz singer, a very difficult voice to model, and this is what the sinusoidal model produces. Visually it is the same. There are some details missing, however: we ought to plot the error between the two. We will see that later, and you will certainly have the opportunity in the hands-on session to explore these things. So that was the basic model; I hope you have learned a lot and that you now know what gamma_k is and how to estimate it, what f_k is and how to estimate it, and so on.
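As a concrete rendering of the cubic phase interpolation described a moment ago, here is a sketch following the McAulay-Quatieri formulation; the variable names are mine, and the integer M is the term that resolves the modulo-2 pi ambiguity by keeping the frequency track maximally smooth.

```python
import numpy as np

def cubic_phase(theta0, w0, theta1, w1, T, t):
    """Cubic phase track between two frames a distance T (samples) apart.

    theta0, w0: measured phase (rad) and frequency (rad/sample) at frame l;
    theta1, w1: the same at frame l+1; t: sample offsets in [0, T].
    Boundary conditions: theta(0) = theta0, theta'(0) = w0,
    theta(T) = theta1 (mod 2 pi), theta'(T) = w1.
    """
    M = np.round(((theta0 + w0 * T - theta1) + (w1 - w0) * T / 2) / (2 * np.pi))
    d = theta1 - theta0 - w0 * T + 2 * np.pi * M     # unwrapped phase increment
    alpha = 3 * d / T**2 - (w1 - w0) / T
    beta = -2 * d / T**3 + (w1 - w0) / T**2
    return theta0 + w0 * t + alpha * t**2 + beta * t**3   # theta'(t) is the
                                                          # quadratic frequency
```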
As I said, we can constrain the frequencies to be k times f_0. Then we do not need to estimate the f_k by peak-picking: we estimate only one thing, we find the period of the signal, take f_0 as one over the period, and constrain all the frequencies to be integer multiples of that fundamental frequency. That leads to the harmonic model, and since we then know the frequencies, things become very simple: we do not need to search, the projection of the observations onto that space is enough.

That leads towards the harmonic plus noise model. Why a noise model, where does that come from? Because speech cannot contain only harmonics, or only sinusoids. Of course, we said sinusoids can also represent noise, but if we have a mechanism that treats noise separately, we also gain different tools for modifying speech, not just reconstructing it. With a decomposition of speech into a harmonic part and a noise part, we can apply different strategies to modify each part, then put them back together and reconstruct the modified speech. That necessity is why we started to think about a hybrid system, a more structured kind of harmonic plus noise model.

So here is the harmonic part (from now on I use omega_0): like the sinusoidal part, but with only the harmonics. The noise part is modeled as a modulated noise component: white noise convolved with a filter, which yields colored noise (it is no longer white), then modulated by e(t). This modulation is very important, so that when you put the harmonic part and the noise part together the result does not sound like noisy speech, noisy harmonics, but like harmonics extended by the noise part. How to fuse these two components matters a great deal; otherwise you simply get a noisy h(t).

We will look at three models for h(t) and one model for n(t), as collected in the formulas after this paragraph. The first model uses simple exponential functions, without a slope, as we say, because the next one has slopes; going back to the earlier notation, the frequencies are k times omega_0, and this is a plain harmonic model, nothing more to say about it. In the notation, A stands for analysis and i for the i-th frame. HNM2 is similar, but the amplitude function A_k(n) is this one: you can recognize that the first term is exactly the same, but there is another term, b_k, like a slope, making it order 1 as a function of n. These could simply be slopes, but because a_k and b_k are complex numbers, not real, it is a bit more subtle than that; all the work on the advanced, adaptive sinusoidal models grows out of this equation, and we will see it after lunch. Then there is a third model, an AM-FM decomposition, where the AM part is a polynomial in time like the previous one, but of order p rather than 1, and the phase is just linear: a displacement plus a constant frequency, so a simple phase description. The amplitude is a p-th order polynomial. What is the difference from the previous polynomial? This one is order 1 and this one is order p; that is one difference.
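Collecting the three variants in the notation used here (my transcription; n is the per-frame discrete-time index):

$$
s(t)=h(t)+n(t),\qquad
h(t)=\sum_{k=-K}^{K}A_k(t)\,e^{jk\omega_0 t},\qquad
n(t)=e(t)\,\big[v*w\big](t),
$$

where w(t) is white noise, v a coloring filter, and e(t) the time-domain envelope, with the per-frame amplitude models

$$
\text{HNM1: } A_k(n)=a_k,\qquad
\text{HNM2: } A_k(n)=a_k+n\,b_k\ \ (a_k,b_k\in\mathbb{C}),\qquad
\text{HNM3: } A_k(n)=\sum_{i=0}^{p}c_{k,i}\,n^i\ \ (c_{k,i}\in\mathbb{R}).
$$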
But it is not really the polynomial order that matters most. The real difference is that these parameters c_{k,0}, ..., c_{k,p} are real numbers, not complex ones; that is the big difference between HNM3 and HNM2. So now we have modeled the harmonic part. If from the speech I remove the harmonic part, I am left with the residual part, which hopefully is just the noise component, and then I have to model that. How do I estimate the harmonic part? I minimize the mean squared error between the harmonic model and the speech observations, under the analysis window. For HNM1 and HNM2 the criterion has a quadratic form and can be solved easily through linear equations; HNM3, however, is a non-linear system and has to be solved differently. Once this is estimated (the hat denotes estimation), we subtract it from the speech and obtain the estimate of the residual signal.

Here is an example of this subtraction, the noise component. This speech is a sound from a phrase I used throughout my PhD, "vazi vaza"; that is where the z sound comes from. It matters because that noise must be modulated: if you do not do it well, it will sound rough or noisy; it will not sound like "vazi vaza". So we care about the localization of the noise part. We estimate the harmonic components, subtract the harmonic part, and hopefully only the noise part remains: this is the error. Using the first model, the simple harmonic model, you see there are a lot of components left in the low frequencies. This is the error with the complex slope model, call it order 1, and this is with HNM3. These two produce similar errors, but the difference between them is that one uses a polynomial with real coefficients while the other uses an order-1 polynomial with complex coefficients.

Here we show the variance of the residual signal for HNM1. You can see that at the center of the analysis window the variance of the error is very small. Why? Because, due to the window, the estimation is very accurate at the center for HNM1; towards the edges of the analysis window the variance is large, which you do not want, because it means a poor estimate there. With HNM2 and HNM3 you also get a good estimate at the center, but the variance stays low towards the edges too, so the estimates are good there as well. That means HNM2 and HNM3 are simply more powerful models. You can of course object that HNM2 and HNM3 use more parameters than HNM1, so such performance is to be expected; we will talk about that later.

Here is the original speech and here is the power of the error in dB. The sampling frequency is 16 kHz, so the bandwidth is 8 kHz; we model only the first 4 kHz with harmonics and everything from 4 to 8 kHz with noise. So here we see the modeling error over the whole bandwidth. The error we make around the z is big: for HNM1 the average error around the z is about -20 dB. With HNM2 and HNM3, in the regions without the z, the error is much smaller; the z itself is not modeled by the harmonic part h, since h is not meant for the unvoiced or fricative part of the z, which is why all the models show errors there.
Where we do model, the error goes down to minus 40 dB, or minus 35 dB on average. To understand the meaning of these numbers: if the error is below minus 25 dB, you cannot hear any difference between the reconstructed signal and the original.

So, that was the sinusoidal model. Now, there is a question; could you repeat the question? First of all, if we use HNM2 and HNM3 compared to HNM1, we gain a lot in these areas. In those other areas, it is because we model only up to 4 kHz, and most of the energy of the voiced fricative is above that. Not all of it, there is some below, which is why we are at minus 10 dB. The models cannot touch, are not allowed to model, information above 4 kHz; that is why we see this error for all the models. If I allow the models to go beyond 4 kHz, and we will see results on that later with the adaptive sinusoidal models, you will see that HNM2 and HNM3 are much more effective even in these areas, once I let them model that information. Here I put a barrier: I do not go beyond 4 kHz. The most important point is that whenever most of the energy is below 4 kHz, let's put it that way, the modeling is very good. Yes, please? Would we prefer HNM2 or HNM3: that is the next lecture, sorry, we will talk about it there. We will focus on HNM2, but after that I will comment on HNM3.

Now, the noise. If this axis is frequency, the noise spectrum is, let's say, something like this, and you model it with an autoregressive model, the red line here. With this autoregressive model you capture the frequency content, and if you then pass white noise through this red filter, the output will be like the blue one: you have transformed white noise into colored noise, and that takes care of the frequency information of the noise.

But now we come to the time-domain characteristics, which are very important in order to fuse the harmonic and the noise models. Here is a zoom of the sound, and this is the error when I subtract, let's say, the HNM2 model from it. You see that the error is quite noisy, but it is localized in time, and this localization is very important for the fusion: if you are able to reproduce it, to put it back on the harmonic speech, then in the end you get something like the upper panel. How to do this localization is the next topic, because if you just pass white noise through that filter to get colored noise, you obtain a similar frequency content to this process, but not its localization properties. That is the modulation. There are simple ways to do this. One is a simple time-domain envelope. Another is to take the Hilbert transform, low-pass it, and model that. A third, because the Hilbert envelope has many parameters, is to model the envelope with a few sinusoids, let's say three or four Fourier-series components, after smoothing with a moving average of length n equal to 7.
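The "pass white noise through the red filter" step can be sketched as follows: fit an all-pole (autoregressive) model to the residual's autocorrelation with the Levinson-Durbin recursion, then drive the resulting filter with white noise. The AR order and the gain normalization here are my guesses for illustration, not values from the lecture.

```python
import numpy as np
from scipy.signal import lfilter

def levinson_durbin(r, order):
    """AR coefficients a (with a[0] = 1) from an autocorrelation sequence r."""
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        k = -(r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / err   # reflection coefficient
        prev = a[:i + 1].copy()
        a[1:i + 1] = prev[1:i + 1] + k * prev[i - 1::-1]
        err *= 1.0 - k * k
    return a, err

def color_noise_like(residual, order=18, seed=0):
    """Shape white noise with the residual's AR spectral envelope (the 'red line')."""
    r = np.correlate(residual, residual, mode="full")[len(residual) - 1:]
    a, err = levinson_durbin(r, order)
    white = np.random.default_rng(seed).standard_normal(len(residual))
    # all-pole filter 1/A(z); the gain roughly matches the per-sample error power
    return lfilter([np.sqrt(err / len(residual))], a, white)
```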
This is what we get with n equal to 7; and what is n? It is just the length of the moving-average filter, a short delay: we smooth with this moving average, and then we do a Fourier-series decomposition with only a few terms. Here we use just three or four parameters, and with that we can model the envelope (a sketch of this follows at the end).

And here is a comparison. I think I am out of time; just a second, please, I have two slides and then we are done. The upper panel is the signal. This is the triangular envelope, used as a simple deterministic envelope: you can see that you no longer really have the localization, and when you put this back on the harmonic part and add them together, the reconstructed signal will probably sound like a noisy h(n); the two components do not really fuse. This is when we use the Hilbert envelope, and this is when we use the energy envelope, after averaging, with three or four Fourier-series components. This last solution is not deterministic in that sense, it is estimated from the data while the triangular one is imposed, and it has better localization properties; compare this one with the first. I also had some samples here; it is not necessary to play them now, because you will have a hands-on session with headphones, so you can really compare.

Now, this is my last slide. f0 is of course a simplification of the frequency content of speech, and you can see that by putting together the Fourier transform of the speech, the continuous line, which is the magnitude spectrum, and the harmonic model, that is k times f0, the dashed line. We have very good modeling accuracy up to, let's say, 3000 Hz, but above that you see there is drifting. That is normal, because k times f0 is not a good representation of the speech spectrum up there. Now, there are ways to estimate f0 better, and then you can indeed do a very good job even up to 4 or 5 kHz, depending on the voice. But the reality is that these are just snapshots: the frequency content can change fast, and we do not really know whether such a thing as f0 even exists in speech. So there are problems in estimating the frequencies. If you say k times f0, it is not accurate: it is a good model, but even with a good f0 estimator it does not work all the time. And if you do peak-picking, or estimate the frequencies with what we call spectral estimation methods, a topic of statistical signal processing, you will not get far either, because there are errors in those frequencies. And if the frequencies are good, everything is good: amplitudes, phases, everything is well estimated. How to address this problem of poorly estimated frequencies is exactly why we started to develop the advanced sinusoidal models, which are the topic of the next lecture; we will start from that problem.

Concluding: I tried to give you the basics of the sinusoidal model and, more importantly, to make the connection between speech production and the sinusoidal model. I also introduced you to the notation that I am going to use in the next lecture, so the next lecture will go much faster; it is going to be about the advanced sinusoidal model, how to estimate its parameters, and how it can be used in many applications, one of which is speech synthesis. So thank you very much for your attention, and hopefully we will see later on the advantages of using HNM2. Thank you.
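Finally, the energy-envelope variant referred to above, sketched under the stated assumptions: a length-7 moving average of the rectified residual, then a least-squares fit of three to four Fourier-series terms. The function names and the fusion snippet at the end are illustrative, not the lecture's code.

```python
import numpy as np

def energy_envelope(residual, n=7, n_terms=4):
    """Moving-average smoothing of |residual|, then a least-squares fit of a
    short Fourier series (DC plus cosine/sine pairs) over the analysis frame."""
    env = np.convolve(np.abs(residual), np.ones(n) / n, mode="same")
    N = len(env)
    t = np.arange(N) / N
    cols = [np.ones(N)]                          # DC term
    for m in range(1, n_terms):
        cols += [np.cos(2 * np.pi * m * t), np.sin(2 * np.pi * m * t)]
    B = np.stack(cols, axis=1)
    c, *_ = np.linalg.lstsq(B, env, rcond=None)  # few parameters, estimated from data
    return B @ c                                 # smooth, time-localised envelope e(n)

# fusing the two parts: modulate the colored noise, then add the harmonic part back
# noise_part = energy_envelope(residual) * color_noise_like(residual)
# s_hat = h_hat + noise_part
```

The point of the fit is exactly the one made in the comparison above: the envelope is estimated from the data rather than imposed, so the modulated noise keeps the time localization that lets it fuse with the harmonic part.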