My name is Yannis. I'm a professor here, but I'm also a group leader at Toshiba Labs in Cambridge, UK. Today I will introduce you to speech production and advanced techniques for speech modeling. This is the outline. We will look at speech, and we will speak many times about modulations and modulators. Then we will see classic modeling approaches like linear prediction and the sinusoidal model, with which you are probably less familiar. After the sinusoidal model, we will go one step further, to advanced modeling of speech using adaptive sinusoidal models. There is now a whole family of algorithms in that area: the quasi-harmonic model (QHM), the adaptive quasi-harmonic model, and what we lately call the extended adaptive quasi-harmonic model. These are all new models. Some of them were introduced in a tutorial at Interspeech 2012 in Portland, but since then we have made further progress, so today I will show you the details of this new approach to speech modeling.

This is a simple view of human-to-human speech communication, and here we see how speech is really produced. A very important modulator is this one, the vocal folds. It is also related to the second part of today's technical discussion in the afternoon, on voice pathology, because the sound that comes out of this first modulator is called voice. The signal then travels through the vocal tract, which is the filter — another modulator. If you listen to the signal just above the first modulator, you cannot recognize whether it is an A, E, or O; but with the appropriate modulation, which you learn as a child of two or three years old by moving the articulators of the vocal tract, you transform this voice into recognizable sounds like A or E, and that, in part, is what we call speech. Then of course we connect these into words, we process them with our ear and our brain, and that is language.

If we look at speech over time: this portion is the unvoiced part — for example the "sh" if this person says "shop". These regular oscillations are the periodic, or voiced, part; and this portion is impulsive. The impulses are produced mostly in the vocal tract: pressure builds up, I have my lips closed, then at some point I release it, and that produces the burst. The vocal folds, on the other hand, vibrate. When there is voicing, as in the left part here, there is tension, there is vibration, and we see this vibration as a periodic part of the airflow. Here is a simplified view of this mechanism, with the vocal folds opening and closing. While we breathe, the folds are very relaxed — simply because we want to maximize the amount of air that comes into the lungs. These are real vocal folds vibrating, and as you may already know, the rate depends on the speaker: for males the folds vibrate roughly 100 times per second, an F0 of about 100 Hz; for females about 200 times per second, 200 Hz; and for a child about 300 times per second, 300 Hz. Obviously we cannot observe this movement with the naked eye, simply because we can only resolve about 25 to 30 events per second.
Of course, since we under-sample — we down-sample the whole process — there is an aliasing effect on the apparent movement of the vocal folds. So now you will see it; there should be sound with this clip, but unfortunately it is not coming out. That is the vibration, but there is an aliasing effect: this is a female voice, so you cannot actually observe 200 openings per second. You cannot see it unless you slow it down with what we call a high-speed camera; then you see the true movement. But more about that in the afternoon. These are real vocal folds. Now, while the vocal folds vibrate, we get what we call the glottal airflow velocity: the airflow passes through and is modulated by the vocal folds. If we measure the amplitude of the airflow velocity, there is a region with no velocity — again, this is a simplified view — called the closed phase: the vocal folds are closed. Then they open gradually, so the airflow velocity increases gradually; this is the open phase. Then it decreases quickly again, and this is the return phase. How fast the vocal folds return to the closed phase depends on their mass, and that mass again depends on the gender: for males the inertia is large, for females it is smaller, and for children smaller still. That is why F0 is low for males, around 100 Hz, about 200 Hz for females, and so on — it comes from the mass of the vocal folds. This point is what we call the glottal closure instant. Many algorithms in voice pathology, but also in speech technology such as speech synthesis, are based on the glottal closure instant — not all of them, but many — and we try to find this instant from the speech signal. Then, as I said, the sound goes to the second modulator, the vocal tract shapes. Depending on where the articulators are, we can produce vowels, back or front; we can produce plosives — you see here, we stop the air and then release it, for instance at the lips; and we can produce fricatives. Here is the difference between vowels and fricatives: there is a constriction here, and because of the Bernoulli effect noise is produced, which we perceive as high frequencies — sounds like an "f". So this noise is produced in the vocal tract. Of course, this is again simplified: we can also have voiced fricatives, where there is voicing at the vocal folds and noise at the constriction at the same time, so two source regions. This is important, but I don't want to spend more time on it, because I'm sure you have seen it in your speech signal processing courses. So let's move on to how we can model this whole process. Let's assume that the impulse response of the vocal tract is this one, and that this is the excitation signal — the glottal airflow. How do we construct the excitation? You may remember this figure: we convolve g(n), the glottal pulse, with p(n), which is just a train of impulses, and the distance between the impulses is one pitch period.
So the convolution between the glottal airflow pulse and this periodic impulse train creates the excitation, for the periodic sounds of course. Then, assuming the system is linear, we convolve this excitation with the vocal tract; and if we also apply a window, we get a frame of the speech signal in time. We can take its Fourier transform, and over successive frames we have a time-frequency distribution, what we call the short-time Fourier transform. If we look at the magnitude spectrum, we see the product of the vocal tract filter and the glottal contribution as what we call the envelope, and these here are the harmonics. The thing is that we do not observe the envelope itself; we only observe these points. In other words, during speech production the periodicity samples the spectral magnitude at the harmonics, so we only know the envelope at those discrete frequencies. There are many algorithms that try to go from these circles — discrete samples — to the continuous spectral envelope; that is another research area, seen mostly in speech synthesis but also in speech modification (a toy sketch of the idea follows below).

Now, this is the speech signal over time and over frequency: a time-frequency plane, with a single time dimension below it. We can recognize the things we just described — unfortunately the audio is not working, so I hope the support people can fix the sound. This is an English sentence. Here you see the periodic part, and here an aperiodic part — unvoiced sounds — and some stops. In the time-frequency plane, during voicing you see the structure of the formants, which is exactly what we indicated with the dashed lines: F1, F2, F3. The formants are also not directly known, simply because we do not have continuous information over frequency, only the discrete samples — the circles I mentioned — so there are many algorithms that try to estimate the formants from that discrete information as well. Wherever the sound is unvoiced, you see high energy in the high frequencies, as here, here, and here. Depending on your analysis window, you see the periodicity either as vertical striations — wide-band analysis, short windows — or as horizontal lines — narrow-band analysis, using long time windows. So it depends on the analysis window we use.

If we zoom in, you see that, as I said, there are regular oscillations, and that this is a windowed signal — there is a window here, which is why it has this shape — and its magnitude spectrum is this one. So this is over time and this is over frequency. For regular oscillations there is periodicity, something we can predict: we can probably predict the signal from previous samples. The periodicity also shows up in the frequency domain: you see a kind of harmonic structure there.
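To make the idea of "sampling the envelope at the harmonics" concrete, here is a toy sketch in Python/NumPy; all numbers and the linear interpolation are my own illustrative assumptions (real envelope estimators use cepstral or all-pole fits):

```python
import numpy as np

def envelope_from_harmonics(harmonic_freqs, harmonic_amps_db, freq_grid):
    """The envelope is only observed at the harmonics k*f0; recover a
    crude continuous estimate by interpolating those samples."""
    return np.interp(freq_grid, harmonic_freqs, harmonic_amps_db)

# Toy numbers: f0 = 200 Hz (female voice), 8 kHz bandwidth, and a fake
# -6 dB/octave decay standing in for the true envelope samples.
f0 = 200.0
harm_f = f0 * np.arange(1, 41)               # harmonic frequencies (Hz)
harm_a = -6.0 * np.log2(harm_f / f0)         # "observed" amplitudes (dB)
grid = np.linspace(f0, 8000.0, 512)
envelope_db = envelope_from_harmonics(harm_f, harm_a, grid)
```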
And this is why we say: if we look at the frequency domain, a harmonic model is probably a good idea; if we look at the time domain, linear prediction is probably a good model for speech. Now, if there are no oscillations at the vocal folds, then — again, this axis is time and this one is frequency — you see higher energy in the high frequencies and low energy in the low frequencies. The scale here is in dB; on a linear scale you would see almost nothing in the low frequencies. These observations again lead us to specific models: the linear prediction model or the sinusoidal model.

So let's start with linear prediction — we now move to the classic modeling of speech. Assume that the input to the vocal tract system is a train of unit samples, as I said, from this glottal signal. We have an input, we have a system, and we have an output, which is speech. Assuming again that the system is linear, we can write the equations connecting the output, the input, and the system. We can also write the Z-transform, as a polynomial in z, and H(z) is this function. Going back from the Z domain — this is still the Z domain — by taking the inverse Z-transform we arrive in the time domain. The time-domain form shows that we can write an equation connecting the current speech sample with the P previous speech samples, plus another term, which is of course the error of that prediction, because we can write this as a predictor. If we write it as a predictor, the question is how to find the coefficients. Here is the terminology we use: they are the linear prediction coefficients, the process is autoregressive, and the whole procedure is linear prediction analysis. Of course there is more advanced linear prediction analysis: we can have not only an autoregressive but also a moving-average part — that is ARMA modeling of speech — and we can consider many options for the excitation signal, or for the prediction error.

We have two major techniques to estimate the linear prediction coefficients: one based on the autocorrelation method and one based on the covariance method. The covariance method is more accurate, but it is not stable — there is no guarantee that the filter you get will be stable. To make sure it is stable, you try to do the analysis during the closed phase; that is why detection of the closed phase, which I just mentioned, is so important, but it is very hard to detect that region from the speech signal. The covariance method does have the advantage that it requires only a few samples to estimate the autoregressive coefficients. The autocorrelation method, on the other hand, is stable — we can guarantee that the estimated filter is stable — but it requires more samples, and one can show that as the window length goes to infinity, the coefficients a_k are estimated with high accuracy (a minimal sketch of the autocorrelation method follows below).
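A minimal sketch of the autocorrelation method just described, assuming a single pre-windowed frame and a direct solve of the Yule-Walker normal equations (real coders use the Levinson-Durbin recursion, pre-emphasis, and more careful framing):

```python
import numpy as np

def lpc_autocorrelation(frame, order):
    """Autocorrelation-method LP analysis: solve R a = r so that
    s[n] is approximated by sum_k a[k] * s[n-1-k]."""
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, r[1:order + 1])
    return a

def prediction_error(frame, a):
    """Prediction error e[n] = s[n] - sum_k a[k] * s[n-1-k]."""
    p = len(a)
    e = frame.copy()
    for n in range(p, len(frame)):
        e[n] = frame[n] - np.dot(a, frame[n - p:n][::-1])
    return e

# Hypothetical usage on one Hamming-windowed frame of speech s at 16 kHz:
# frame = s[start:start + 400] * np.hamming(400)
# a = lpc_autocorrelation(frame, order=16)
# e = prediction_error(frame, a)          # the residual / excitation
```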
But then we would have to model a long stretch of signal, and the speech signal really changes fast, so we cannot use a long window — we always have to use short-time analysis. We also have to assume there that the filter is linear and time-invariant in order to write down all these convolution equations; otherwise they are not valid. So this is the analysis part, where we estimate the filter coefficients. Here you can see the predictor: we predict speech at time n and then we measure the error, the prediction error. This is a very important step, because we can then use this error as input to the 1/A(z) filter — the synthesis filter, the inverse of the prediction-error filter — in order to synthesize speech back. Based on this simple approach, models were developed in the past like LPC-10, a low-bit-rate speech coder used mostly in military applications; and by making the excitation signal more and more sophisticated we ended up with algorithms like CELP or MELP, which are now the state-of-the-art systems used in your cell phones. Code-Excited Linear Prediction — CELP — is how your cell phone works. I must say that linear prediction also has a problem with female voices: it does not work as well for female voices as it does for male voices. I may ask you later why.

So, as I just said, if the coefficients l_k that we use are equal to the a_k, the autoregressive coefficients, then we have a chance to synthesize speech. In standard linear prediction synthesis we have to decide whether the excitation will be a periodic impulse train, to produce voiced speech; a single impulse, to produce stop sounds like "t" or "k"; or white noise, to synthesize unvoiced sounds. That makes it a bit complicated — which is one reason people developed algorithms like CELP. But there is another approach, still within classic modeling: can we unify this excitation — the periodic impulse train, the single impulse, the white noise — into one model? That question is the genesis of the sinusoidal models.

But before we go there: you see here the original speech signal, and here the signal synthesized with simple linear prediction. Since the LP filter is autoregressive, it is a minimum-phase filter, and if you excite it with just impulses, this is what you get — a minimum-phase signal, the minimum-phase version of the original signal shown on the left. There is a big difference. How does it sound? It sounds quite buzzy. Why? Because the minimum-phase property increases, in an unnatural way, the correlation between the samples, and this unnaturally high correlation is perceived as buzziness. So much for linear prediction. Now we go to the sinusoidal models, which try to unify the excitation using sinusoids. We will see how the sinusoidal model can be viewed as a source-filter model, just like the linear prediction model. First let me introduce the basic sinusoidal model. This is the model; these are the complex amplitudes.
From the complex amplitudes we get the magnitude by taking the absolute value, and the phase information by taking the angle; and these are the frequencies of the sinusoids. In short, the sinusoidal model is defined by two sets of parameters: complex amplitudes and frequencies. If we restrict the frequencies to be harmonic, so f_k = k times f0, we call it the harmonic model. Then, of course, we have to estimate these parameters, and this is done by projecting s, the speech signal, onto the space generated by these basis functions — either free sinusoids, or sinusoids constrained to be harmonic, which are these two cases. So we have a space represented by these basis functions, and projecting s onto it gives gamma, the complex amplitudes (a small numerical sketch follows below). Another way to see it: we minimize the mean squared error between the original speech, which is this, and the model x.

Now, how do the sinusoids relate to source and filter? This is the source — again, a sum of sinusoids — and this is the excitation phase, which is the integral of the instantaneous frequency plus a phase term we call the phase offset, which delays the individual components in order to reconstruct an impulse, or noise, or a periodic part. The filter also has a magnitude and a phase — the system phase. Assuming the filter does not change over time, so it is linear and time-invariant, the convolution operation can again be applied, and combining the two gives us a model for the speech file in which the amplitude is just the product of the excitation amplitude and the system amplitude, while the phase we measure is the sum of the two components, the excitation phase and the system phase. Again we assume the signal is stationary because our analysis window is short; that means the instantaneous amplitudes, the frequencies, and even the number of harmonics are constant during the analysis window. Under that assumption we can very easily show that the instantaneous phase is a linear function of time. Putting this into our model, and calling gamma_k the complex amplitude in terms of magnitude and phase, leads to the model I showed you two slides ago, where omega_k is just 2*pi times the f_k we saw on the previous slides.

How do we estimate this? We can discuss the details if you have questions, but essentially one can show that a maximum-likelihood approach leads to a very simple solution based on peak picking: we look at the magnitude spectrum, for voiced or unvoiced sounds, and take the local maxima; this defines the frequencies and the complex amplitudes, in terms of magnitude and phase. Of course, because the resolution is relatively low, you try to increase the resolution of your Fourier transform. There is a technique called the quadratically interpolated Fourier transform, suggested around 2005 at Stanford University; we will see it in the hands-on session, where it will be compared against the standard sinusoidal model and the adaptive models. Synthesis can then be done by overlap-add.
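Here is a hypothetical illustration of estimating the complex amplitudes by least-squares projection onto harmonic basis functions, as described above; the function name, the Hamming windowing, and the centred time axis are my own assumptions, not the hands-on MATLAB code:

```python
import numpy as np

def harmonic_ls(frame, f0, fs, n_harm):
    """Weighted least-squares fit of a harmonic model to one frame:
    x[n] = sum_{k=-K..K} c_k exp(j*2*pi*k*f0*n/fs)  (conjugate pairs
    cover the real-valued signal).  Returns c_k for k = 1..K."""
    N = len(frame)
    n = np.arange(N) - N // 2                            # time centred on the frame
    k = np.arange(-n_harm, n_harm + 1)
    E = np.exp(2j * np.pi * np.outer(n, k) * f0 / fs)    # harmonic basis functions
    w = np.hamming(N)                                    # analysis window
    c, *_ = np.linalg.lstsq(w[:, None] * E, w * frame, rcond=None)
    return c[n_harm + 1:]                                # positive-frequency amplitudes

# Hypothetical use: a 30 ms voiced frame at 16 kHz, f0 = 120 Hz, 40 harmonics.
# c = harmonic_ls(frame, f0=120.0, fs=16000, n_harm=40)
# amplitudes, phases = np.abs(c), np.angle(c)
```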
We generate each frame as a sum of sinusoids, as shown on the previous slides, and then the different frames are overlap-added; this is how we generate speech (a rough sketch of this overlap-add step follows below). Another way is to track the frequencies over time, in what we call a birth-death process with continuation of tracks. Using that, we can associate frequency components from frame to frame, and once we have this association we can find a rule for going from one frame to the next in terms of amplitudes and phases, because these are the two quantities we need to interpolate. For amplitudes, linear interpolation is the simplest choice. For phase we can also use linear interpolation, which implies a constant frequency; or a second-degree polynomial, which lets the frequency be a first-degree polynomial of time; or a third-degree polynomial — what we call cubic phase interpolation, suggested in 1985 by McAulay and Quatieri from MIT in a very nice paper, and that is part of this work.

Here you can see the reconstruction: this is the original and this is the reconstruction. It is a characteristic signal — it actually has diplophonia. You can see a secondary excitation between the major excitations; these are the major excitations, the glottal closure instants, and in between there is a secondary excitation, like a secondary glottal closure instant. That is what we call diplophonia. It is a very difficult signal to model, but the sinusoidal model, as you can see by comparing the two waveforms, does a very good job. Of course it is not a minimum-phase reconstruction, it is mixed-phase, so we cannot directly compare linear prediction and the sinusoidal model.

The harmonic model, again, simply takes multiples: f_k is k times f0, and f0 is what we call the fundamental frequency, or pitch. That is obviously very convenient, because we no longer need to estimate the f_k, just f0. But, as you already know, nothing comes easily in life: even if we say we only need f0, we still need an algorithm to estimate it. You may have seen hundreds of F0 estimation algorithms, and they work reasonably well. But look at this magnitude spectrum: obviously, if you make a mistake in f0, you pay for it by not modeling your signal well. In the low frequencies there is no real discrepancy between the continuous line, which is the original speech, and the dashed line, which is the harmonic model. But in the high frequencies — if you can see it from where you sit — the two maxima are not aligned, and this is because of small errors in f0. If we reduce those errors, we also improve the modeling in these frequencies. How do we perceive an error in f0? With small errors like this first one, we simply do not model enough of the signal's energy, and the signal has what we call a lack of presence: when we listen to it, it is not like the original speaker being next to us, but a bit farther away — simply because much of the energy of the signal was not modeled, due to these f0 estimation errors.
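To tie the synthesis side together, here is a rough overlap-add resynthesis sketch from per-frame sinusoidal parameters; it deliberately omits the birth-death tracking and cubic phase interpolation discussed above, and the names and window choices are my own assumptions:

```python
import numpy as np

def ola_synthesis(frames_params, frame_len, hop, fs):
    """Each frame is rebuilt as a sum of sinusoids from its
    (amplitude, frequency in Hz, phase) triplets, windowed, and
    overlap-added at its hop position."""
    n = np.arange(frame_len) - frame_len // 2
    w = np.hanning(frame_len)
    out = np.zeros(hop * (len(frames_params) - 1) + frame_len)
    norm = np.zeros_like(out)
    for i, params in enumerate(frames_params):
        frame = np.zeros(frame_len)
        for amp, freq, phase in params:
            frame += amp * np.cos(2 * np.pi * freq * n / fs + phase)
        out[i * hop:i * hop + frame_len] += w * frame
        norm[i * hop:i * hop + frame_len] += w
    return out / np.maximum(norm, 1e-12)

# Hypothetical use: two 20 ms frames at 16 kHz, hop 10 ms, one component each.
# y = ola_synthesis([[(0.5, 200.0, 0.0)], [(0.4, 205.0, 0.3)]], 320, 160, 16000)
```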
So now this gives us enough motivation to move towards the adaptive sinusoidal models and to show that even if there are mistakes in f0, even if there are mistakes in the f_k, even if your analysis window is very small, you can still do sinusoidal modeling properly. These are what we call adaptive sinusoidal models. But do you have any questions so far? You know everything, so I'm sorry to repeat things that you know. Okay, please. Yes. First of all, when we measure the phase from the signal, it is already mixed-phase. Mixed phase means the phase can be factorized into different terms — all-pass phase, minimum phase, maximum phase — and all of these combined give the mixed phase. So what we measure in speech is the mixed phase; that means we measure minimum-phase and maximum-phase information, and also all-pass information. Now, the linear prediction system, because it is an autoregressive model, is a minimum-phase system: all its poles, and any zeros, are inside the unit circle, which makes the system stable. If you want to capture the maximum-phase part, you have to put something outside the unit circle, but then you have problems estimating it and keeping the system stable during the estimation. Other questions? Are you familiar with the sinusoidal model? More or less? Okay.

So let me introduce the adaptive sinusoidal models; you will have the hands-on session in the afternoon — please come — where, using MATLAB, you will see how these adaptive sinusoidal models perform and how they compare to the standard sinusoidal model, and even to the quadratically interpolated Fourier transform from Stanford. This first equation shows again the sinusoidal model: here are the complex amplitudes, the frequencies, and the window that we apply in order to do short-time analysis and estimate the amplitudes and frequencies. The methods we have for estimating these are, first, FFT-based, which we call non-parametric approaches, like the quadratically interpolated FFT by Abe at Stanford University — with the professor, Julius O. Smith, who is quite famous in music signal processing but also in speech. Then we have parametric approaches: subspace methods — Denmark is quite strong in that area — least-squares methods, and I can add Bayesian methods as well, for estimating the sinusoidal parameters. Nevertheless, there is always a mismatch: we would be very naive to believe that the frequencies we estimate are the true frequencies of the signal. That is why, from now on, we consider that between the true frequency and the one we measure there is an error, and the key is to estimate that error. If we estimate the frequencies well, we will then see the mechanism to re-estimate the amplitudes, even in noisy conditions and despite this frequency error; that is what gives us the adaptive sinusoidal models. So now, instead of f_k, we have f-hat_k, the frequencies that we measure, which contain an error.
To address this point we rely not on the Fourier transform but on what we call the Prony model. De Prony published his work in 1795; it has been revisited several times since, but a really big step happened in 1999, when Jean Laroche in France revisited the Prony model. Laroche was one of my supervisors when I was a PhD student in France, and I studied the Prony model with him; we produced some results for speech synthesis, voice conversion, and so on, based partially on a quasi-harmonic model. At that time we knew the model was very powerful, but precisely because it was so powerful we did not know what to do with it, so we ended up with simplified versions of it. During my thesis we produced three models: one was the quasi-harmonic model; another, simplified version became the harmonic-plus-noise model, HNM, which has been used a lot in speech synthesis and speech modification; and there was a third model which is not used at the moment but also has great potential. Later my PhD student, Yannis Pantazis, revisited the model to see why it is so powerful.

What is the model? You simply augment the amplitude a_k with the term b_k·t, where t is time, so each component becomes (a_k + t·b_k) e^{j2π f̂_k t}; you can think of b_k as a complex slope, and all of these are complex values. If we consider one component and look at it in the frequency domain, with W being the Fourier transform of the analysis window, that component can be written like this, where a_k is the standard complex amplitude and b_k is the complex slope. The key idea is to decompose b_k relative to a_k: one part perpendicular to a_k and one parallel to a_k. Then we can show that X_k(f) can be written in this way, using this projection of b_k onto a_k. Next we take a Taylor expansion of the Fourier transform of the window around f̂_k. The Taylor series gives these three terms, and you see that the third term depends on this rho_2 and on the second derivative, with respect to frequency, of the Fourier transform of the window. If that term is small, we can throw it away, and then X_k(f) can be written like this, without that parameter. When is that term small? First, it contains rho_2, and we will show that rho_2 is directly related to the error we make in frequency; so if that error is small, the term is small as well. Second, the shorter the analysis window, the smaller the second derivative of the window spectrum at the frequencies f̂_k. In short, we require a small frequency error — which is a wish — and a short window. The bigger the window (remember this for your hands-on session), the less valid this assumption is. Having found this form for the component, we go back to the time domain, and initially we assume that the true f_k is f̂_k — the frequency we have — plus the error. If we compare these two expressions (forget this term for the moment), comparing this part with this part gives an association between the error we are looking for, eta, and this rho_2.
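A minimal sketch of one QHM analysis step under the assumptions above: fit the (a_k + t·b_k) basis functions by weighted least squares and read the frequency mismatch off the part of b_k orthogonal to a_k. The function name and conventions (time in seconds, centred on the frame) are illustrative only:

```python
import numpy as np

def qhm_frame(frame, f_hat, fs, window):
    """Fit x[n] = sum_k (a_k + t_n * b_k) exp(j*2*pi*f_hat_k*t_n) by
    weighted least squares, then estimate each frequency mismatch as
    eta_k = Im(b_k / a_k) / (2*pi).  For real-valued speech the conjugate
    (negative-frequency) components would be included as well; this
    sketch keeps only the positive-frequency ones."""
    N = len(frame)
    t = (np.arange(N) - N // 2) / fs
    E = np.exp(2j * np.pi * np.outer(t, f_hat))        # one column per component
    B = np.hstack([E, t[:, None] * E])                 # [E | t*E]
    theta, *_ = np.linalg.lstsq(window[:, None] * B, window * frame, rcond=None)
    K = len(f_hat)
    a, b = theta[:K], theta[K:]
    eta = np.imag(b / a) / (2 * np.pi)                 # frequency mismatch in Hz
    return a, b, f_hat + eta                           # corrected frequencies
```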
In short, b_k provides a mechanism, under some assumptions, for estimating the frequency error eta_k for each k. Let's see how we can use this. It is very important to know the constraints we have imposed. This slide shows again the Taylor series expansion of the Fourier transform of the window around f̂_k, and rho_2 divided by 2π. What we would like — this is what I just said, but with my Greek accent it is better to see it written as well — is a small value of the second derivative of the Fourier transform of the window at frequency f̂_k. The length of the window influences that value: for a rectangular window it is easy to show that the second derivative is proportional to the cube of the window length. So what we would like is a short analysis window. That is actually not a bad result, because speech is non-stationary: if I told you we needed a very long analysis window, that would be a drawback, since many things change during a long window. Instead we say: no, we require a short analysis window, so we can do genuinely local analysis — this is a very important result. Obviously, however, long analysis windows provide robustness from one frame to the next: if you make your analysis windows very short and estimate here and then again later, the two estimates have no connection between them, and it becomes very difficult to regenerate the speech during synthesis, for instance. And the length of the analysis window, as you know from your signal processing courses, is connected to its bandwidth, so we can relate the bandwidth of the analysis window to the error.

That was the influence of the window; now let's look at the second factor. Remember that W was one factor influencing our Taylor-series assumption; the other was the frequency mismatch. This slide shows the influence of the frequency mismatch. Assume this is the model — the same quasi-harmonic model, with just one component. We can write a least-squares solution for a and b as a function of the magnitude spectrum and of eta, the mismatch. By projecting b onto a and finding the coefficient rho_2, we can show that the estimated frequency mismatch has this form. Let's plot it to see how it behaves. This shows, in short, the frequency mismatch that the algorithm can tolerate — let me zoom in — solid for a rectangular window and dashed for a Hamming window. It means that if your initial mistake is around zero, say between minus 50 and plus 50 Hz, you are guaranteed that the algorithm will find the frequency mismatch and the residual error will be zero. But if you are in this region — say around 80 Hz for the rectangular window — there is no longer a guarantee that the algorithm will find the mismatch. If you use a Hamming window — and this is why I say the bandwidth of the window plays a role — you extend that region, so you can tolerate a mismatch of even 100 Hz.
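A hypothetical experiment, reusing the qhm_frame sketch above, to probe how large a mismatch the rho_2 correction recovers for a rectangular versus a Hamming window; the exact tolerance values quoted in the talk depend on the window length used there:

```python
import numpy as np

# Sweep the true frequency mismatch of a single complex tone and check how
# much of it one QHM correction recovers for two analysis windows.
fs, N = 16000, 512
t = (np.arange(N) - N // 2) / fs
f_init = 1000.0                                   # fixed (wrong) initial guess, Hz
for win_name, w in [("rect", np.ones(N)), ("hamming", np.hamming(N))]:
    residual = []
    for mismatch in range(-200, 201, 25):         # true error in Hz
        x = np.exp(2j * np.pi * (f_init + mismatch) * t)
        _, _, f_corrected = qhm_frame(x, np.array([f_init]), fs, w)
        residual.append(abs((f_init + mismatch) - f_corrected[0]))
    print(win_name, np.round(residual, 1))        # remaining error per mismatch
```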
QHM also has an iteration mechanism: you make a first assumption about the frequencies and estimate the amplitudes; then you can use the new estimates as initial points, re-estimate the amplitudes and slopes, find a new frequency mismatch, and correct the frequencies again. So QHM not only corrects the frequencies, it has a mechanism for applying the correction iteratively. With the sinusoidal model and peak picking, once you have done the peak picking, that's it — there is no mechanism to update that information. If we do iterations, as shown here, we can improve even further: the dashed line is with two iterations, the solid line with no iteration. Here, for instance, there is a problem in this region — the estimate starts to degrade — but with two iterations, even around 150 Hz of mismatch, you can still recover the original frequencies. We found that this bias — the mismatch — can be removed provided it is smaller than the bandwidth of the window divided by three. If you are within that region, there is a guarantee that the algorithm will find the frequencies: even though you give it wrong initial frequencies, it will correct them, using the mean-squared-error criterion and this Prony-style representation.

Before we make the problem harder by adding noise: we have shown mathematically that if you make a frequency mistake, you can still recover the true frequencies with the quasi-harmonic model. But if I make the case more difficult and add noise to the speech, can you still estimate the frequencies? That is the next topic: the quasi-harmonic model on noisy signals, where not only do you have a wrong frequency estimate, but noise on top of it. But I think we need a break — five minutes — and then we will come back to see how to use the quasi-harmonic model, this advanced speech modeling technique, in noisy environments as well. Thank you for your attention.

Okay, let's continue the course. We have seen that the algorithm is able to estimate the mistakes in the frequencies and correct them; now we will see what happens if we add noise. The model is now sinusoids plus noise, and we run the experiment with four components, just to keep the measurements quick. We measure the mean squared error in terms of frequencies and in terms of amplitudes, and we compare with the Cramér-Rao lower bounds and with the quadratically interpolated fast Fourier transform, using Monte Carlo simulations. These are the results for the frequencies. This dark black line is the Cramér-Rao lower bound; the triangles are the quadratically interpolated FFT; these are QHM without iterations; and this is what happens when we start iterating. You see that after three iterations we essentially reach the Cramér-Rao lower bound for frequency one, two, three, four. The frequencies were chosen so that some of them are very close together and some far apart, precisely to stress the experiment. There is a paper in the references, in the Transactions, with the details.
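The iteration loop described at the start of this passage can be sketched very simply, again relying on the hypothetical qhm_frame helper defined earlier:

```python
import numpy as np

def iterative_qhm(frame, f_init, fs, window, n_iter=3):
    """Re-use the corrected frequencies as the next initial guess and
    re-fit the QHM parameters at each pass."""
    f = np.asarray(f_init, dtype=float)
    a = None
    for _ in range(n_iter):
        a, _, f = qhm_frame(frame, f, fs, window)
    return a, f
```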
Now for the amplitudes — that was the frequencies. You can see that QHM with three iterations, the curve with the circles, is again very successful at estimating the amplitudes. So QHM is robust both against estimation mistakes, such as providing wrong f_k, and against additive noise. We have not tested it on convolutive noise. Remember that QHM relies on an approximation, valid under the assumptions I showed you in the first part, and we also have to respect the relation between the frequency mismatch and the bandwidth of the analysis window. We have shown that QHM is robust against noise; it can also be shown — and if I have time I will show you — that QHM is approximately equivalent to the Gauss-Newton method, and that there is a relationship between the frequency estimates of the quasi-harmonic model and the reassigned spectrogram. I have some slides on these two parts, so let's start now.

Regarding the Gauss-Newton method: we have N samples of this signal, again with one component; W is the analysis window, C is the complex amplitude, omega is the frequency. The Gauss-Newton algorithm also provides a mechanism to correct the frequencies, and it can be shown that its update follows this equation. Now let's compare it with the quasi-harmonic model. QHM, instead of just C, adds one more component, B: it estimates B, projects B onto C, and one part of the projection, rho_2, is used to correct the frequencies. This rho_2 is exactly what I said before — the perpendicular projection of B onto C — and B can be shown to be estimated in this way. If you compare the two, you can see quite easily that the two algorithms are in fact the same; this is what we call the approximate Gauss-Newton solution for the frequency. If we compute the root-mean-squared error of the frequency and compare the Gauss-Newton method, the blue line, with the iterative quasi-harmonic model, the red line — that is, QHM with iterations — we see that they are very close, as expected from the equations I just showed you, and after a few iterations they reach the Cramér-Rao lower bound, the CRB. This is for the frequency estimates — we have two frequencies here — and this is for the amplitudes, at an SNR of 0 dB; we always use 0 dB, meaning the signal and the noise have the same power. As you can see, after a few iterations we reach the Cramér-Rao lower bound.

Now, regarding the connection between the quasi-harmonic model and what we call the reassigned spectrogram: the reassigned spectrogram was suggested by Patrick Flandrin in France and is considered today one of the most successful time-frequency representations of non-stationary sounds. Let's see whether there is a connection, and what knowledge we can gain from that comparison and bring into the quasi-harmonic model. In essence, the reassigned spectrogram, by taking derivatives of the phase spectrum, relocates where your analysis sits in time and in frequency; these are the two equations, for the time relocation and the frequency relocation.
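As a rough illustration of relocating spectrogram energy with phase information: the sketch below approximates the phase derivative by a finite difference between consecutive STFT frames (a phase-vocoder style estimate), rather than the derivative-window formulation used in the reassignment literature; window length, hop, and function names are my own assumptions:

```python
import numpy as np

def stft(x, win, hop):
    """Plain one-sided STFT as a (frames x bins) matrix."""
    N = len(win)
    frames = [x[i:i + N] * win for i in range(0, len(x) - N, hop)]
    return np.fft.rfft(np.array(frames), axis=1)

def relocated_frequencies(x, fs, win_len=512, hop=64):
    """Relocate each bin to its local instantaneous frequency using the
    frame-to-frame phase difference of the STFT."""
    win = np.hanning(win_len)
    X = stft(x, win, hop)
    k = np.arange(X.shape[1])
    bin_freq = k * fs / win_len                       # nominal bin frequencies (Hz)
    expected = 2 * np.pi * k * hop / win_len          # expected phase advance per hop
    dphi = np.angle(X[1:]) - np.angle(X[:-1]) - expected
    dphi = (dphi + np.pi) % (2 * np.pi) - np.pi       # wrap to principal value
    inst_freq = bin_freq + dphi * fs / (2 * np.pi * hop)
    return np.abs(X[:-1]), inst_freq                  # magnitudes, relocated freqs
```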
That means the reassigned spectrogram takes just the short-time Fourier transform and then relocates the information we see in it — but it does so using the phase information. That is how the reassigned spectrogram does its job, and it is very, very successful. Now let's see the connection with the quasi-harmonic model. Unfortunately you have to remember some earlier slides, but I will go back and forth to make the connection clear, hopefully. Again, this is the quasi-harmonic model, this is the projection of b onto a, and this is a. It can be shown that the relocation — if we also want to relocate in time and frequency — is actually the same as in the reassigned spectrogram if we weight rho_2, the frequency-mismatch estimator, by this ratio of moments of the analysis window. If the analysis window is Gaussian, it can be shown analytically that, using W_2 and W_0, the second and zeroth moments of the window, weighting the frequency mismatch by this ratio gives exactly the same frequency relocation as the reassigned spectrogram. Similarly — and this is the first time rho_1 enters the picture; it has not been used in any of the previous slides — rho_1 is related to the slope over time. That slope carries information related to what we call the group delay, i.e. where to relocate the energy of the signal in time; it is a time-relocation mechanism. It is shown again that if we weight it by this ratio, for Gaussian windows, we arrive at the same time-relocation property as the reassigned spectrogram. In short, the quasi-harmonic model and the reassigned spectrogram have the same mechanism, provided we weight rho_1 and rho_2 by these two terms.

This also gives us an idea: we could weight our analysis of the speech frame not with the window itself, but with the first derivative of the window over time. If we do so, we can show that we improve further what we call the reassigned-spectrogram quasi-harmonic model. This is shown here — let me zoom in — using, again, the bandwidth of the window; I believe this is for a Gaussian window. This curve is the iterative quasi-harmonic model, and this one is obtained when we weight the time-domain speech signal with the first derivative of the analysis window. You see, for instance, that there are regions where we make mistakes — imagine these are in hertz, around 1,000 Hz, so the mistake here could be something like 200 Hz — but with the reassignment-based weighting the error is much, much smaller. So not only did we establish a connection with the reassigned spectrogram, we gained some knowledge from it and transferred that knowledge into the quasi-harmonic model.

Okay, so that was the quasi-harmonic model. But we go one step further, to what we call the adaptive quasi-harmonic model — the adaptive sinusoidal model. The quasi-harmonic model still makes a stationarity assumption: yes, you can retrieve, you can re-estimate the frequencies, but you still assume that your signal, even within that short analysis window, is stationary. And that is again a strong assumption for speech.
So the question is how we move from the stationarity assumption to a non-stationary, adaptive formulation. The key here are these basis functions. As they stand, they are stationary functions — that is partly where the stationarity comes from. If I modify them and make this term dynamic over time, then my basis functions become adaptive to the speech signal, because that term — the instantaneous phase, built from the estimated frequencies — is itself estimated from the speech. So you make your basis functions adapt to the local properties of the speech signal. Graphically, the difference between the quasi-harmonic model and the adaptive quasi-harmonic model is the following: this is the quasi-harmonic model — this is, say, the initial estimate, this is the true value, and we try to estimate the difference, the mismatch, which is rho_2; but during the analysis window everything is constant. In the adaptive case, this is no longer a straight line but can move over time, and that is what gives the adaptation. There are several ways to implement this; one is to interpolate between estimates, or not to interpolate, and that depends on the frame rate of the quasi-harmonic model. One straightforward way is to go from the quasi-harmonic model to the adaptive one sample by sample; or you can estimate at one point and again, say, three, four, five milliseconds later, and between the two estimates make an assumption about how the information evolves over time.

Here is, graphically again, how the system works. There is a black line and a gray line — one is the true instantaneous frequency and the other is the estimated one — and you will see step by step how we move from the initial estimate to the true instantaneous frequency. This is, let's say, QHM; then we do the adaptation — look how close we now are to the true instantaneous frequency — and with a second iteration we are essentially on top of it. We also did experiments with a signal for which we know exactly how the amplitude and the frequency modulation evolve over time — in short, an AM-FM model with a single component, with a deterministic amplitude evolution and phase. We introduce a frequency mismatch of 35 Hz and use a Hamming window, and we compare against the standard sinusoidal model in terms of the amplitude-modulation error and the frequency-modulation error in Hz. The quasi-harmonic model does analysis every 10 milliseconds, with its mechanism for re-estimating the frequencies; the sinusoidal model uses the same 10-millisecond frame rate, so the frame rates are identical; and aQHM also uses the same 10-millisecond frame rate, but it takes the first estimates from QHM, makes a spline interpolation of the amplitude and frequency estimates, and then changes the basis functions. The system then adapts very well to the input signal — which is not speech here, it is this synthetic signal — and you can see the advantage of this adaptation by comparing the errors in amplitude and in frequency, in Hz.
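A small sketch of the adaptation idea: build each basis function from the running integral of an interpolated instantaneous-frequency track, so the basis follows the local FM instead of staying at a fixed frequency. The anchor values, the linear interpolation standing in for the splines, and the names are illustrative assumptions:

```python
import numpy as np

def adaptive_basis(f_track, fs):
    """exp(j*phi(n)) with phi(n) the running integral of an interpolated
    instantaneous-frequency track (sample-by-sample adaptation)."""
    phase = 2 * np.pi * np.cumsum(f_track) / fs
    return np.exp(1j * phase)

# Hypothetical anchors estimated by QHM every 10 ms, interpolated sample
# by sample (linear here; splines in the actual method).
fs = 16000
anchor_t = np.array([0.0, 0.010, 0.020])              # s
anchor_f = np.array([210.0, 214.0, 209.0])            # Hz, first component
t = np.arange(0, 0.020, 1.0 / fs)
f_track = np.interp(t, anchor_t, anchor_f)
basis_1 = adaptive_basis(f_track, fs)
```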
So adaptation is very important. Based on this, starting from the quasi-harmonic model, we can generate the instantaneous parameters for amplitude, phase, and frequency. This is how we estimate the amplitudes, this is the phase — you can see how b_k enters the equations — and here is how we track the frequencies. In short, from the quasi-harmonic model we go one step further, to advanced modeling of speech — and not only speech, music too, signals in general — using a high-resolution AM-FM model of the signal. Those are the key words to keep in mind: high-resolution AM-FM signal modeling.

Here we show the connection with a general AM-FM signal. To claim that your representation really is an AM-FM representation, you have to put your model and the standard AM-FM model side by side and see how the two are connected, and this slide is about that connection. This is the standard AM-FM model. For the phase we can make a Taylor series expansion — this equation — and the instantaneous frequency, which is its derivative divided by 2π, is this one. If we set t = 0, meaning the center of the analysis window, then at the center of the window the instantaneous frequency of any AM-FM model is given by this expression. Previously we had that the instantaneous frequency at the center of the window is whatever we assumed as the initial estimate plus rho_2. In short, we propose that rho_2 equals this first-order term, which can then be used here to write the model as an AM-FM model.

Now, how does this work on real signals — so far we applied it to synthetic ones? Here is the original speech over time. This is the reconstruction error with the quasi-harmonic model, and this is what you get with the sinusoidal model; you see that the sinusoidal-model error is much higher than that of the quasi-harmonic model. But very importantly, when you do the adaptive version, the reconstruction error is really very small. So adaptation is a very important part, and it means we can reconstruct speech with this AM-FM model — the high-resolution AM-FM modeling of speech I just mentioned. When we applied it to real speech from TIMIT — five minutes of speech from 20 female and male speakers — this is the signal-to-reconstruction-error ratio in dB, for males and females, for the quasi-harmonic model and the sinusoidal model, as well as for the adaptive quasi-harmonic model with three iterations. You see a big improvement in dB. Do you know the value of the signal-to-reconstruction-error ratio above which you cannot distinguish the original signal from the reconstructed one? Thirty? Forty? Of course, the higher you go, the safer you are; and it depends on the sensitivity of your hearing system and how used you are to comparing reconstructed and original speech. For a listener who has compared many signals and tried to identify differences, it is about 25 dB — and above 25 dB, I don't think there is a human ear that can distinguish the original from the reconstruction.
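For reference, the signal-to-reconstruction-error ratio quoted here can be computed as follows (a trivial sketch; framing and alignment details are ignored):

```python
import numpy as np

def srer_db(original, reconstructed):
    """Signal-to-reconstruction-error ratio in dB; in the talk, values
    above roughly 25 dB are reported as perceptually transparent."""
    err = original - reconstructed
    return 10.0 * np.log10(np.sum(original ** 2) / np.sum(err ** 2))
```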
This is so that you can also interpret these numbers perceptually, because they may mean nothing if you don't know that threshold. Now we go one step further and say: well, we did QHM, we did adaptive QHM; now we will do an extended adaptive quasi-harmonic model. What is this extension? It is motivated by the following observation. We said there is adaptation — the basis functions adapt to the signal — but the amplitude information is still deterministic: you impose the rule for how it evolves over time. The question is: can you make that adaptive as well? That is the extension, and it is done by adding this part here. Of course these two parts, one and two, could be combined, but you want to keep them factored — a factorization of the two kinds of information — in order to estimate them cleanly, simply because the quasi-harmonic model already provides you with the first one; if you merge them you may lose that capability. So the non-parametric adaptive amplitude information is introduced simply through this alpha(t). I will not derive it here — it is a lot of material and many steps — but we will see the result, and during the hands-on session you will also get your hands on the extended adaptive quasi-harmonic model. The name is getting longer and longer.

Let's look at a two-component example: amplitude modulation one, amplitude modulation two, frequency modulation one, frequency modulation two. The true value is the solid line, the estimate is the dashed line. You see here how well the FM part of the adaptive quasi-harmonic model does: we track the instantaneous frequencies very nicely. But we cannot say the same for the amplitude information — as you can see, there is an error in the amplitude. That was exactly the motivation for the extended version. And this is the same signal with the extended version: the frequencies are as good as before, but now we also get a better estimate of the amplitudes, because of the alpha(t) we introduced into the system. In short, this extended adaptive quasi-harmonic model is considered today the state of the art in high-resolution, high-accuracy AM-FM modeling of speech, and of other signals too.

How do they compare — original speech, adaptive, and extended adaptive? If you look at these signals, one, two, and three, they actually look the same; and if you listen to them, since the signal-to-reconstruction-error ratio is above 25 dB, as I mentioned, you cannot easily hear the differences. But if you look at the error over time, you see that this model, the extended version, has a much lower error than this one, everywhere. This artifact here is, I think, just an initialization issue: you have to initialize the first window, and since you do not yet know how the parameters evolved before it, you cannot adapt there, so the first analysis windows may contain some errors. But obviously a real signal never starts like that — it probably starts with silence and then moves into speech, so there is time to adapt.
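A sketch of the extension's key ingredient, under my own naming assumptions: the basis now carries an interpolated instantaneous-amplitude envelope from the previous pass as well as the instantaneous phase, so the least-squares fit only has to correct what the previous pass missed:

```python
import numpy as np

def extended_basis(amp_track, f_track, fs):
    """Basis function carrying both the interpolated instantaneous
    amplitude (normalised at the frame centre) and the instantaneous
    phase from the previous analysis pass."""
    centre = amp_track[len(amp_track) // 2]
    amp = amp_track / max(centre, 1e-12)
    phase = 2 * np.pi * np.cumsum(f_track) / fs
    return amp * np.exp(1j * phase)
```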
Where does this matter? It is not so easy to see during voiced parts, but it shows very clearly in highly non-stationary parts — for instance during stop sounds. Look at this: this is an original "t" sound. This is its representation with the sinusoidal model, and the information inside the red circle is called pre-echo: energy is smeared over time towards the past. It means that when you listen to that sound, the "t" is not well represented — it is not clearly conveyed to the ear — and that has consequences for the intelligibility and the overall quality of the sound. This is the adaptive quasi-harmonic model: much better than the sinusoidal reconstruction, but there is still some pre-echo, although you do get a strong onset. And when we go to the extended model, there is no pre-echo any more. How do we manage it? Let's look at the individual components. The dashed line marks the time of the release of the pressure — when we do the release, it is at the dashed line. Look at this for the sinusoidal model, for the adaptive quasi-harmonic model, and for the extended adaptive model: just before the release there is essentially no energy in any of our basis functions — the weights are nearly zero — and at the release the weights rise abruptly. That is the recreation of the abrupt release of the speech production mechanism. We tested this on many sounds, like puh-tu-ku-boo and so on, in small and large databases — this is a large-scale evaluation — and here you see the signal-to-reconstruction-error ratio in dB, focusing on these parts, for the sinusoidal model, the adaptive quasi-harmonic model, and the extended adaptive quasi-harmonic model. You see a really big difference, thanks to this extended adaptation, i.e. weighting the amplitude term of our Prony-style model with alpha(t).

Key papers to read to understand this theory: of course the sinusoidal model from McAulay and Quatieri, the 1986 Transactions paper. Then there is a Transactions paper on what we call the adaptive sinusoidal model by my former PhD student, Yannis Pantazis. Then there are three interesting, connected papers: they are actually a discussion between our group and Petre Stoica's group. We submitted a paper to Signal Processing Letters; the reviewer was Petre Stoica, and he asked some questions and identified himself. His questions were then published as well, and we had to respond with comments — that is where we started to look at the Gauss-Newton connection, because he remarked that this looked like a Gauss-Newton method and wanted to see whether there was a connection between the two. That discussion between us and Petre Stoica — who is well known in frequency estimation and many other areas of signal processing; his team is very good — is recorded in those three Signal Processing Letters. The extended adaptive sinusoidal model — the extended adaptive quasi-harmonic model, which I forgot to mention here explicitly — is from my former PhD student.
He just had his viva, his PhD defense — George Kafentzis. The work has been published in various conference papers, and he is now, hopefully, writing a journal paper on it. In collaboration with Gilles Degottex, a postdoc who was here for about two years, we also wrote a paper on how this quasi-harmonic model can be taken back to an adaptive harmonic model — using only harmonics. That means that for full-band speech, even at high frequencies, or even for unvoiced sounds, if we use harmonics we can still do very well, especially if the model is adaptive. It reminds me of my own thesis, when I went from what we now call QHM — at that time we called it DSM or something like that — back to a simple harmonic-plus-noise model. Here, starting from the quasi-harmonic model and all the work I have presented, I wanted to go back to the harmonic model again, and that is how we ended up with another model, called the adaptive full-band harmonic model, recently published in the Transactions. But I have not talked about that today; I have only talked about the quasi-harmonic model and tried to show you why it works — the mechanism of the quasi-harmonic model, of the Prony-style model — because we now understand why it works, and this understanding can be extended and help us in modeling speech. For which applications? That is up to you. I'm done with that.

These are the references — I gave some of them during the talk: the quadratically interpolated fast Fourier transform from Abe and Julius O. Smith (I don't know why LaTeX did not print "Julius O. Smith III"); some papers from Laroche, where he revisited the Prony model; of course the book, which I will mention in a moment; and the papers I already cited. There was also a collaboration with Olivier Rosec, who was at France Telecom and is now at Voxygen, a speech synthesis startup that came out of France Telecom. During my talk, especially for the production part and the sinusoidal model, I used some figures from Thomas Quatieri's book, Discrete-Time Speech Signal Processing, published by Prentice Hall; I obtained permission from Thomas and from the publisher to use these figures in my slides, and I would like to thank them for it. I also thank my students who worked on this topic and did a great job — George Kafentzis, who, as I said, worked on the extended quasi-harmonic model; today is his first day of military service, so we wish him good luck. And thanks again to Tom Quatieri and the publisher for permission to use the figures from Tom's book. And thank you — because this is probably a heavy signal processing course for a summer school, and on the first day; I apologize. But as I said, I hope I have given you motivation to revisit the slides. If you have any questions, I will be happy to answer them now or later by email, and I invite you to come to the hands-on session, to see with your own eyes and hands, and to believe — or not — in this nice model. Okay, thank you very much.