Okay, we are ready to start after a long break. We are now going to see how these models, but not only these models, can be used for voice pathology, or better, as we say, for voice function assessment. You may remember from the first slides of the previous session what I called voice. When I was a PhD student it was not so clear to me what is voice and what is speech; of course these are English terms, but I don't think even native English speakers always know the difference. So remember: voice is not the modulation, it is the modulated signal, but only the part coming from the vocal folds. Voice pathology, then, is about how we can assess the quality of the voice, or the pathology, if there is one, of our vocal folds.

This is very important, because we use these vocal folds to produce our speech, and if we overuse them, or if there really is a problem, we may develop certain diseases. Many people end up with serious problems where they cannot talk, or, if they develop cancer for instance, they lose their voice. Then very quickly you are isolated from society, which is very, very bad. All of us take voice for granted, because we just produce it without paying attention to it.

So all these models that I showed you contribute to the analysis of speech: they produce very good systems and robust estimation. We will see how they can be used to develop mathematical models for voice pathology, and there is a hands-on session later on this part as well.

Let me introduce the topic. If you are not familiar with voice function assessment, I hope you will get a good idea of the tools we currently use to assess voice quality. There are two important parameters: one is jitter and the other is shimmer. We will see how we can develop mathematical models of these two phenomena — one of them is the quasi-harmonic model, together with AM-FM modeling — and how this modeling can be used for voice function assessment and the estimation of jitter and shimmer. Then there is a fourth topic, tremor estimation. I will explain later what tremor is, but it is an important parameter which is actually very difficult to estimate from the voice. Downstairs, if we succeed in installing WaveSurfer, you can record your voice and check its quality in terms of tremor and jitter. And then we will see how modulation spectra can be used for voice function assessment.

We can use invasive and non-invasive approaches: voice production models, algorithms like the ones I just mentioned, and also special devices to assess voice quality. The goal depends on your point of view. If you are a medical doctor, for instance, you are looking for objective ways to assess voice quality: if you apply a treatment, you want to know the situation before and after the treatment; or, if you suggest a treatment, you want to follow its evolution and how well it works for a specific person.
So you want an objective way to do that. At the same time, if you use subjective assessment — humans rating the voice — there is large variance between their opinions, so you would like something scientifically proven to be good: if there is ever a problem and you have to face a court process, you can say, look, I did my best, this is the treatment I applied, and according to an objective measure these are the results, and they are good. Then you are safe. From an engineering point of view, you want to know what is really going on in the voice production process, because the more you know, the better your models will be for, say, modeling or controlling speech, which is very important for applications like speech synthesis, where we want to control the voice. So it depends on where you come from.

Now, there is a long list of devices. The first three are invasive techniques, and we will see examples of them. Then there are non-invasive techniques, starting from the electroglottograph. The cameras here refer to assessing voice quality just by watching a video, from the way you move your articulators; that would be additional information. But the main target is contact microphones or plain microphones: estimating, just from the speech signal, by inverse filtering or reverse engineering, the signal not at the lips but right after the vocal folds. This is called inverse filtering, and it is part of what we call, in signal processing, blind deconvolution.

Now, invasive techniques. Using a camera, an endoscope goes in here, there is light, and we try to see the vocal folds while they vibrate. Because there is an endoscope inside your mouth and the mouth is held open, you cannot say anything other than a sustained vowel. It is very tough; for me it was not successful. And this is first of all a medical act — it is not something an engineer can do. I had a very good medical doctor try to apply it on me, but it really did not work.

Now, the vocal folds vibrate at, let's say, 100 times per second, so we cannot see the vibration directly. How can we then assess the quality with an endoscope? The key is the light. It is like a disco: the strobe light samples the scene, and you see the other person not in continuous movement. If you do the same thing, but synchronize the light with the vocal fold vibration — that is, with the F0 — then you see the vocal folds frozen; this is stroboscopy. Of course, voiced speech is not completely periodic, or harmonic I would say, so you get mistakes in the F0 estimation, and all of this may make it more difficult to see the vocal folds. Nevertheless, this is what we use with the endoscope.

Now, I am not sure we can see this video. These are vocal fold movements like the ones you have seen before, recorded at just 25 frames per second for a few seconds, so there is an aliasing effect here. There is no sound, sorry, but the subject says "ah", breathes, "ah" — three times: ah, breath, and so on.
Now, the interesting part is when we record with a high-speed camera at 4,000 frames per second, but play those frames back at 25 frames per second, so it looks like slow motion. The recording is just two seconds, but these are the real movements of the vocal folds. Pay attention here: for two seconds you have 8,000 frames, and you are watching them at 25 frames per second — think how many frames you have to sit through. This is the real movement of the vocal folds; I hope you can see the video, because it is an amazing movement.

Note that I don't believe the folds close completely. Why should they? Sometimes yes — that is why the term "glottal closure" is, for me, misleading. You can see it if we play it again. Of course they do close, but there are areas which do not close completely, and the more pathology there is, the more you see that they cannot close. That generates noise, as we will see.

Now, what is the problem? This is the real movement of the vocal folds, but who can wait through two seconds of recording played back in slow motion? No one — especially not medical doctors. So the question is: is there a way to present this without waiting? That is the topic of the next slides: how people try to represent this movement compactly.

The first attempt: you do edge detection, you find where the glottis is exactly, and you put some colors, like the colors here. If you follow, for instance, the movement of the yellow and green dots in the middle, you can monitor these dots over time. That is one way to show it. Another way — let me zoom in. These are the vocal folds; you do edge detection on the right and the left side, and this is what you get. You take the values from the midline, then you open the contour up, unfold it like this, and you create this image: now you see it over time, with a color map depending on the amplitude of the right and the left part.

Let's see how such a representation helps. This is what you get from the movement of the vocal folds, showing again the right and the left part. You can zoom in and try to define metrics to find out whether there is a problem with the vocal folds. The point is that such a representation is quick: from these vocal fold movements you end up with just one picture, which is nice.

Let's see how it can help with some pathologies, like this one. On the left side there is a polyp, which means the vocal folds do not close there — you see the black line. The second one is paralysis: the right part does not vibrate, only the left part, which creates very noisy, very annoying speech; and you see here it is not really red, just a thin line of red. This one is nodules, on both sides. Such a representation, called a phonovibrogram, is one way to visualize high-speed data quickly.

Then there is another technique, because these high-speed cameras are very expensive.
Not many labs have them, and not many doctors use them: first, they are expensive; second, we still do not know what the best representation is — the one I showed you is one candidate. In our case we have a lab in the medical school, where we have a high-speed camera and we work with medical doctors to collect data.

Now, videokymography. To have a camera that is not so expensive, you select which line of pixels you want to follow, and you record just that line over time. Then you see, for that specific point of the vocal folds, the opening and the closing, the opening and the closing. So here you can see that there is an opening phase and a closing phase, but at that specific place on the vocal folds. And here is creaky voice, vocal fry: you see a major opening, then a small opening. Here there is full glottal closure, while here there remains a gap, which creates a secondary excitation, though not as big as the main one. [In response to a comment:] That's true, but you would have to pay about 40,000 euros more to buy the other camera.

This is falsetto voice, which means high pitch: there is really no closing here at all, and yet we are supposed to find a glottal closure instant. That is the problem: when you ask for glottal closure instants in female or children's voices, it is very difficult. Here, for instance, you see there is a problem. So if with the endoscope you see that there is a problem somewhere, then to get a better idea you use videokymography: you adjust which pixel line you want to monitor, and you get this picture with the problem here — there is no good closure, because of the pathology.

Okay, now we pass to the non-invasive approaches. The most used one is electroglottography, EGG, which tries to measure the activity of the vocal folds: while the vocal folds move, we pass a weak electric current and measure the impedance across them. So it is a representation of the activity of the glottis.

Then we move to speech itself, which is non-invasive as well, and we will look at it from the perspective of modeling and analysis. You know this model by now: this is the AM-FM model; you have seen the original, the reconstructed, and the error between the two. This signal is what we call sustained phonation. Here you see time and frequency because, as I told you, this model offers a high-resolution AM-FM decomposition, so you can obtain a time-frequency distribution. Here you see the signal and here the time-frequency representation of a normophonic voice using the AM-FM model. The lines are very parallel; that means there is structure, which means good control of your vocal folds. Now, for a dysphonic voice, the first thing we see is an area of strong modulations. The stronger these modulations are, the higher the probability that your ear perceives them as noise, and this noise is the first indication of a pathology in your vocal folds.
For instance, if I say a steady "ah", that is a good "ah"; if I say an unsteady "ah", I cannot control my vocal folds, probably because there is a pathology, a polyp or something. That creates noise and modulations, as we will see.

Another tool, since we are talking about modulations: if we compute modulation spectra, there is a chance to detect voice pathology, because we can detect the modulations that exist in speech. So another tool we have developed is the modulation spectrum. You do a Fourier transform, or whatever transform you want, to get a time-frequency representation; then you take one frequency line, consider it as a time-domain signal, and apply a second transform. That creates a frequency-frequency representation: one axis is the modulation frequency and the other is the usual acoustic frequency. As an introduction: this is the modulation spectrum for a male voice of a normophonic speaker — you see it is around 200 Hz; this axis is acoustic frequency and this is modulation frequency. If you go to a dysphonic speaker and look at the same picture, it looks like this: a lot of modulations.

Okay, that was the last slide of the introduction; now we go to the estimation of jitter. Jitter and shimmer, we can say, are perturbations — irregularities from period to period — and jitter is about perturbations of the periodicity: we do not have an exactly periodic signal. Of course, a truly periodic signal can only be created by machines; humans create natural signals, which are not periodic. You can consider them periodic, but in reality they are not. Even if my voice sounds nearly periodic, my F0 is not constant; this period-to-period perturbation of the periodicity is called jitter. So I have jitter, you have jitter, but our jitter, hopefully, is not pathologic, because it has a low value — and you will see what I mean by value. If your jitter grows in value, there is a chance that there is a problem with your vocal folds.

Since jitter is the period-to-period variability of pitch, one estimator is the local jitter, where this is the pitch contour (I cannot write on the PDF very nicely, but anyway). Here n is the time index: you look at the period after and before and take the difference — a derivative — and absolute jitter is this one. Now, what is the problem with this measurement, which has been used for the last fifty years? It is exactly the estimation of the pitch contour itself. To estimate the F0 you assume that your signal is stationary — that it has the same F0 over some stretch of frames; how long depends on the method, but with the autocorrelation method, for instance, you make that stationarity assumption over a long window. After making that assumption, you estimate a value — a measure of periodicity obtained by assuming everything is periodic, which it is not — and on top of that you take the derivative to measure the period-to-period variability. That, I believe, is the mistake: jitter as an idea is good, but the way it is estimated is very tricky, because that u(n) makes strong assumptions about speech. What is the alternative? We suggest what we call the spectral jitter estimator, SJE.
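To make the classical measure concrete, here is a minimal Python sketch of the textbook jitter computation (the function name and toy values are mine); note that the `periods` input comes from an F0 tracker, so it inherits exactly the stationarity assumption criticized above.

```python
import numpy as np

def classical_jitter(periods):
    """Textbook period-to-period jitter from a sequence of pitch
    periods T[i] in seconds (as produced by an F0 tracker, which is
    where the stationarity assumption sneaks in):
      absolute jitter = mean |T[i] - T[i-1]|
      local jitter    = absolute jitter / mean period (in percent)"""
    T = np.asarray(periods, dtype=float)
    jitter_abs = np.abs(np.diff(T)).mean()   # period-to-period derivative
    return jitter_abs, 100.0 * jitter_abs / T.mean()

# Toy example: a ~100 Hz voice with ~50 microsecond perturbations.
rng = np.random.default_rng(0)
T = 0.010 + 50e-6 * rng.standard_normal(200)
print(classical_jitter(T))
```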
With the SJE we model the jitter with a mathematical term. We put impulses — P is the pitch period — at every 2P, two times the period, and then we add another term: one train holds the even periods, the other the odd periods. If I add them together, and assume that this epsilon equals zero, I have a fully periodic signal with distance P from one impulse to the next. But if I now make epsilon non-zero, the two trains together are no longer periodic with period P, and I have created jitter. Of course, this jitter is cyclic: it repeats every two periods. Nevertheless, if you assume that in a short window you only see two to three periods, you have a chance to measure jitter without estimating the pitch, or at least without relying heavily on it. Let's see how.

To do that, we write, in the frequency domain, the magnitude spectrum of this jittered impulse train. There is a very important term here, which again depends on epsilon, and we have to track it. We take that important part, and we can easily show that it equals this expression. When we multiply cosines, we generate what we call a beat spectrum — we say beat spectrum because we are in the frequency domain — and the beat spectrum looks like this; you may remember beat signals from physics. Where the crossings occur over frequency depends, again, on epsilon, which is our jitter.

Let's see where we can use this. Developing the mathematical model further, we see that we have one part located at the harmonics, which we call the harmonic spectrum, and another part located at half the harmonics — at the sub-harmonics. So whenever we have jitter, we have two spectra to deal with: not only the magnitude spectrum of speech at the harmonics, but also the magnitude spectrum at the sub-harmonics, and these two interact. More interaction means more conflict between them, more conflict means more perceived noise, and this is jitter. So this mathematical model explains why jitter produces noise. Let's see how.

If I set epsilon to zero, there is no jitter. With no jitter there is no interaction — we can show that the cross-spectrum vanishes — so there is no interaction between the harmonic spectrum and the sub-harmonic spectrum. Here, with epsilon equal to zero, the sub-harmonic spectrum essentially does not exist, simply because everything is harmonic and all the energy sits at the harmonics. Now we add jitter: with epsilon equal to one, this is the harmonic spectrum and this is the sub-harmonic spectrum — how many crossings? One, because epsilon equals one. If we increase the jitter, this is the harmonic spectrum and this the sub-harmonic spectrum — how many crossings? Two; epsilon here is two, for the case marked with squares. In short, with epsilon I control how many crossings there are between the harmonic spectrum and the sub-harmonic spectrum, and the more crossings, the more noise I perceive in those areas.
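As a small sanity check of this picture, the following sketch (all names and values are mine, not from the slides) builds the two interleaved impulse trains and confirms that sub-harmonic energy appears only when epsilon is non-zero, growing with epsilon and with frequency:

```python
import numpy as np

fs = 16000   # sampling rate (Hz)
P = 160      # pitch period in samples -> F0 = fs / P = 100 Hz

def jitter_train(n_impulses, P, eps, N):
    """Two interleaved impulse trains of period 2P each: even impulses
    at 2kP, odd impulses at (2k+1)P + eps. With eps = 0 the sum is
    perfectly periodic with period P; eps != 0 creates cyclic jitter."""
    s = np.zeros(N)
    for k in range(n_impulses // 2):
        s[2 * k * P] = 1.0              # even train
        s[(2 * k + 1) * P + eps] = 1.0  # odd train, shifted by eps
    return s

N = 6400
f = np.fft.rfftfreq(N, 1.0 / fs)
S0 = np.abs(np.fft.rfft(jitter_train(20, P, 0, N)))  # no jitter
S4 = np.abs(np.fft.rfft(jitter_train(20, P, 4, N)))  # eps = 4 samples

# Sub-harmonics sit at odd multiples of fs/(2P) = 50 Hz. Without jitter
# they vanish; with jitter their level grows with eps and frequency --
# the "beating" between harmonic and sub-harmonic spectra.
for k in (1, 2, 3):
    b = (2 * k - 1) * 20  # bin index of (2k-1)*50 Hz for N=6400, fs=16k
    print(f"{f[b]:6.1f} Hz  eps=0: {S0[b]:5.2f}   eps=4: {S4[b]:5.2f}")
```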
Now, using this measure, the spectral jitter estimator, we tested it on two databases. One is the very typical one from the Massachusetts Eye and Ear Infirmary in the US, MEEI. It is a very standard database for voice pathology, but not a good one, simply because the normophonic and dysphonic speakers were recorded with different microphones; so even from the silences you can, to some extent, separate the normophonic from the dysphonic speakers, and everybody gets good results on that database. This is the Multi-Dimensional Voice Program, MDVP, a commercial program sold by Kay Elemetrics, now KayPENTAX. Since we are doing detection, we use receiver operating characteristic (ROC) curves and measure the area under the curve (AUC): the bigger the area under the curve, the better the detector. Here, MDVP, which measures jitter the way I described initially, gives this score, which is very good; Praat, which is freely available, is also very good; SJE is at 94, an improvement. If we go to another database, a Spanish one provided to us by a university in Madrid, you see that MDVP really cannot cope: its score is low — of course all techniques score lower there, but SJE scores much higher than the others. So SJE provides a better score, which means this mathematical modeling of jitter makes sense.

However, medical doctors do not care about this. It is really just a first test, to see whether you can separate normophonic from dysphonic voices. Engineers pay attention to these results; medical doctors are not interested. What they are interested in, you will see later on.

So here we put a threshold on jitter. From some databases we know which speakers are really dysphonic and which are normophonic, and from the AUC analysis we chose the threshold that achieves the best area under the curve. Let me zoom in again: this is a normal signal; this axis is the value of jitter in microseconds; the threshold is at about 124 microseconds. If you stay below that value, you are fine, but occasionally, of course, you go above it. This is a normal speaker reading text: while you speak, we can measure the spectral jitter — from the harmonic spectrum and the sub-harmonic spectrum — and we see values that indeed go higher. Then we can measure how long, i.e. how many consecutive frames, the jitter stays above or below the threshold. If we stick with "over" — the percentage of samples above the threshold — then for a normophonic speaker this is 13%, but for a dysphonic speaker it is 80%. Based on that, we define metrics we call "over", "max over" and "max under": "max under" is the longest run below the threshold, "max over" the longest run above it, and plain "over" the overall percentage above. This is the AUC for this database, but not on sustained vowels: the database has two types of recordings, sustained vowels and what they call the Rainbow Passage, where people are reading — and obviously you cannot have an endoscope in place while someone is reading.
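Before moving on, here is a minimal sketch of these threshold statistics, including the cumulative "running" variant discussed next; the function names are mine, and the 124-microsecond threshold is just the value quoted on the slide (it is database-dependent):

```python
import numpy as np

def over_statistics(jitter_us, thr_us=124.0):
    """Threshold statistics on frame-wise spectral jitter values (us):
      over     - overall % of frames above the threshold
      max_over - longest run of consecutive frames above the threshold
      running  - cumulative % above the threshold up to each frame"""
    above = np.asarray(jitter_us, float) > thr_us
    over = 100.0 * above.mean()
    max_over, run = 0, 0
    for a in above:                       # longest run above threshold
        run = run + 1 if a else 0
        max_over = max(max_over, run)
    running = 100.0 * np.cumsum(above) / np.arange(1, len(above) + 1)
    return over, max_over, running

def local_over(jitter_us, thr_us=124.0, win=50):
    """Sliding-window version: % of frames above the threshold inside
    a window that is shifted along the recording."""
    above = (np.asarray(jitter_us, float) > thr_us).astype(float)
    return 100.0 * np.convolve(above, np.ones(win) / win, mode="same")
```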
Medical doctors believe that a lot of interesting phenomena happen while you are talking, not just during a sustained vowel; that is why they do not like the endoscope and want voice function assessment on running speech. This is one way to do it. So we tested it on the Rainbow Passage, and indeed the area under the curve is good for MEEI.

Then we can define two metrics. One is called "local over": you put a window, you track the value of "over" inside that window, and you shift the window along. The other: if you have a problem, you may start talking with good quality, but since you do not control your vocal folds very well, after a couple of seconds you lose control, more and more noise builds up, and you end up with very bad quality. That is what we call "running over": it accumulates statistics, so that in the end, with more and more data, you arrive at a statistic saying whether something is pathologic — a decision not based only on the local view, which by itself does not work. I will play some sounds.

This is an example from MEEI, running over, for a normal signal. Here we accumulate statistics over seconds — this axis is time — and after 12 seconds of voicing we are pretty sure there is no problem. We have separated the "over" range into three categories: normal jitter, shown in green; an intermediate, undecided gray area where you cannot really say anything; and the pathologic area, in red. So while the person is talking, occasionally we would flag a decision, but after accumulating statistics we find: no, your voice is fine. This is the pathologic signal, the second one: you see that while the person talks, we become pretty sure that something is wrong. And this is "local over" — just a window, where you see what happens inside it: even normophonic speakers occasionally have jitter while talking, but dysphonic speakers go over the threshold very often. Obviously, if you change the threshold you change the outcome, so the method is threshold-dependent. That is good to know, because you will play with the threshold downstairs in the hands-on session.

Remember that the medical doctor wants to see what happened after a treatment — can we give them an objective measure? They do not care whether you can distinguish dysphonic from normophonic voices: what would they do with that? But if you can find a metric that shows the difference between before and after treatment, that is very, very important for them; it has real clinical value. I will turn this on so you can listen from the speakers. This is before treatment... so that is before treatment, and this is after treatment — the same speaker, before and after. You will also test your own voice downstairs.

[Audience: so, drink raki for five days and smoke a lot?] Yes — obviously, because if there is something on your vocal folds they acquire big inertia. That is why, when you smoke, a lot of particles accumulate here and your vocal folds become really heavy, like mine; if we took them out, your voice would become a female-like voice again, with nice characteristics. [Audience question] Yes, that is the pitch — but that noise comes from the interaction.
According to the spectral jitter model, it is the interaction with the sub-harmonic spectrum that creates these zones of noise at low frequencies. In speech synthesis we sometimes tend to say: this is the periodic part, this is the non-periodic part — but a major part of that comes from jitter, because jitter exists in normal voices too. Of course, if you have a lot of jitter you have a lot of noise, and you perceive it as such. Here the jitter is low, the interaction between the harmonic spectrum and the sub-harmonic spectrum is weak, so you do not perceive it, and the result is less noisy. Okay? Good — you will play with exactly this signal downstairs.

If you want to read more about spectral jitter estimation, there are journal papers with Miltos Vasilakis on spectral jitter modeling, and on these green and red areas for short-time jitter estimation in running speech. And here is a nice application: right now, while I am talking, I could have an indicator — I have a close-talk microphone here — showing the quality of my voice by measuring jitter. If it is always green, it is fine; of course I have to adjust the threshold, because the threshold changes the values a lot, but given a threshold, if it turns red and stays red for ten minutes, that means I am misusing my vocal folds and it is better to stop talking. That is a good application.

Now we move from jitter estimation to jitter and shimmer together, using another mathematical model, based on the adaptive quasi-harmonic model of speech. Here we show shimmer: shimmer is also a perturbation from period to period, but in amplitude — everything is periodic, but the amplitude oscillates. And this is jitter: from period to period, the F0 really changes. Jitter is modeled here by this delta, while shimmer is modeled by this gamma.

[Audience question: is this for running speech?] You mean the spectral jitter estimator, the previous topic? You do indeed need a pitch value, but you do not rely on it heavily: it only gives a first estimate of where the harmonic spectrum and the sub-harmonic spectrum lie, and you do not make measurements directly on those harmonics. So a simple F0 estimator works. Of course, in the unvoiced parts there is no activity trying to be periodic, so the notion of jitter does not apply there; but in the other areas — in running speech you also have voicing — we look at the voiced parts, and that voicing can be anything; it does not have to be a sustained vowel.

Now, this is another mathematical model for jitter and shimmer. This is the principle, and you will see how such a model can be embedded in sinusoidal models. So again, this is a sinusoidal model in AM-FM terms: this is the excitation and this is the vocal tract; this is the excitation phase and this is the vocal tract phase. The phases are added, the magnitude information is multiplied, and the excitation phase is the integral of the instantaneous frequency plus a phase offset. This is the model we saw in the previous slides. Here is how jitter and shimmer enter the equations: we already introduced a model for jitter and a model for shimmer, and these are their values. You can also make delta a function of frequency.
That would give jitter in each frequency band, but we choose here not to do it; we could do the same for shimmer, and for a reason we keep it as it is. Now let's go back. You have to remember this one — the excitation amplitude — and of course this one. We now change this excitation amplitude and frequency accordingly: we put a formula for these two components, introduce them into the AM-FM model, and after some multiplications and so on we end up with this model. That means a sinusoidal model which also accounts for jitter and shimmer has this form, where jitter and shimmer now appear as functions of frequency. These are the unknowns — the unknowns for jitter and shimmer; this is the jitter. The question is how to solve the mean-squared-error problem to estimate gamma_k and delta_k.

To do that, we suggest a variant of the quasi-harmonic model, where the time term is not just t: here it is a sine of pi f0 t and here a cosine, and these b_k and c_k are complex values. We decompose b_k and c_k over a_k, as we did for QHM — this is the decomposition of b_k and c_k over a_k — and we end up with this model. By doing this, this expression becomes this one. So now this is our model; let's compare it with the initial one. What we wanted was this, and what we suggest is this. You start to see the relationship between the jitter-and-shimmer sinusoidal model and the model we get from this version of the quasi-harmonic model: you can easily see, for instance, that this gamma_k cosine is actually this rho_1 here, or that this delta_k sine times c_k is this rho_2. So if you estimate a_k, b_k and c_k, then, based on this comparison, you can estimate the jitter and the shimmer — these are the equations. Just by associating the two models — this is the association — you very easily end up with this solution for the shimmer and for the jitter.

Okay, that is the theory; how are we going to validate it? We can generate synthetic signals where we know the jitter and the shimmer, and check whether the model can really cope with the jitter and shimmer effects. You may remember this — the jitter model we had, the spectral jitter model — but now the amplitudes a1 and a2 are given by these equations. What do these equations account for, jitter or shimmer? A perturbation of amplitude — so, shimmer. So this is the shimmer and this is the jitter; this is how we create the synthetic signals. We take what we call the glottal air flow: these are the impulses, and we convolve them with the glottal air-flow pulse; after that we pass the complete excitation signal through a vocal tract, that is, an AR filter. We can use LPC on real speech from a male speaker to estimate this AR filter, pass the combined excitation through it, and create the synthetic signal. So now, using this epsilon and this alpha, we can control jitter and shimmer.
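As a rough illustration of this validation setup, here is a sketch that synthesizes a voice-like signal with known jitter and shimmer. All names and numbers are mine: the glottal pulse is a crude stand-in for the glottal air-flow model, and the AR filter uses hand-placed resonances instead of the LPC fit to real male speech used in the actual experiment.

```python
import numpy as np
from scipy.signal import lfilter

fs, P = 16000, 160     # 100 Hz male-like pitch period in samples
eps, alpha = 3, 0.3    # jitter in samples, shimmer as relative amplitude

# Excitation: impulse train where odd periods are shifted by eps (jitter)
# and amplitudes alternate between 1-alpha and 1+alpha (shimmer).
N = 60 * P
e = np.zeros(N)
for k in range(50):
    e[k * P + (eps if k % 2 else 0)] = 1.0 + (alpha if k % 2 else -alpha)

# Crude glottal air-flow pulse convolved with the impulse train.
g = np.hanning(P // 2) * np.sin(np.pi * np.arange(P // 2) / (P // 2))
u = np.convolve(e, g)[:N]

# Vocal tract as an all-pole (AR) filter with two resonances placed
# near typical /a/ formants (my assumption, instead of real LPC).
poles = [np.exp(-np.pi * bw / fs + 2j * np.pi * fc / fs)
         for fc, bw in [(700, 100), (1200, 150)]]
a = np.poly(poles + [p.conjugate() for p in poles]).real
x = lfilter([1.0], a, u)  # synthetic voice with known jitter and shimmer
```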
Let me zoom in here: the solid line is the signal with shimmer, and the dashed line is the fit of the simple sinusoidal model. You can see that we cannot estimate the amplitude, simply because the sinusoidal model is stationary: all the amplitudes come out the same. Of course we see a windowed signal here, but nevertheless, at the center of the analysis window we cannot get a good amplitude estimate. And you must know that least-squares estimation means the estimate should be very good at the center of the window, with the modeling error growing as you move away from the center. Yet here, at the center of the analysis window, where you expect the estimator to be at its best, it is not good — because of the shimmer. With the suggested sinusoidal model you now see a difference between the two: we take into account that there may be shimmer when estimating the amplitudes. And this is the error for each case: with the standard model there is this error at the center — we do not capture the amplitude well — while for the suggested sinusoidal model the error is essentially zero, the dashed line here.

Then we go to jitter — same thing. This is the standard sinusoidal model, the dashed line; the amplitude is not modified now, there is no shimmer, only jitter. You see that the standard sinusoidal model cannot synchronize its pulses with the original signal, because of the jitter. If you look at the suggested sinusoidal model, even with this jitter, even if you give it a fixed f0, the model can still measure these perturbations of the period, and the perturbations are captured in the signal model; and once they are modeled, you can use them to estimate the jitter and the shimmer. Again, this is the error — the same story.

Then we tested again on MEEI, to see whether jitter and shimmer together can tell us anything about voice pathology. This is the ROC curve, and the area under the curve is 92%, which is not bad. If you want to learn more, there is a transactions paper on the AM-FM decomposition, and a conference paper we wrote on how to model jitter and shimmer.

[Audience question] Well, in natural speech you have both jitter and shimmer, because nothing is so nicely periodic, either in time or in amplitude; a normal voice has both. So you can apply this as is, and it gives you a measurement of both jitter and shimmer. How you combine the two in order to take a decision is, of course, another discussion; we did not pay attention to that — for instance, weighting them to maximize the area under the curve — and we tested only on MEEI. But the main message is: yes, with this model you can measure both things, starting from the mathematical modeling of speech. Yes — I believe so; it would actually be more accurate; but indeed we have not tested it on speech synthesis or anything else. Okay — other questions?

I see that you ask a lot of questions about voice pathology — may I ask why it interests you? [Audience: medical applications are always interesting.] Yes, that is a very interesting direction of speech signal processing, I think, because this is a signal that we ourselves produce. There are a lot of things to study: from the vocal folds, to the hearing aids that Rainer Martin will talk to us about tomorrow,
to how we process sound here, and how that in turn controls the articulators. If there is any motor problem, as we say, in the way we control the vocal folds, speech is probably an early indication of a problem in the brain. So this is really a nice research topic for speech. For voice pathology there have been papers — there are people working on it — but I do not see many papers at Interspeech; the community is mostly oriented towards the human-machine communication problem. I think there are very interesting applications on the speech signal itself, because it is a signal that we humans produce.

Yes, please. [Audience question] You mean, if you start to have some problems, then... yes, jitter and shimmer try to detect that, because all of this mainly creates noise and modulations, and the perception of that noise and those modulations. [Audience question] Actually, that is about the articulators; I have not tested it and I am not sure — we can discuss it, of course. [Audience question] Yes, that is another application: protecting professionals. A teacher, for instance, is a professional who uses the voice to teach; it is not only singers and pop musicians.

Yes, please. [Audience question] I see — no, we have not tested this model on emotional data. MEEI was constructed only for voice pathology: they recorded normophonic and dysphonic speakers saying sustained vowels and the Rainbow Passage, that is all. These people were examined with a camera — not a high-speed camera, but definitely a camera — so we know for sure who is normophonic and who is dysphonic; but they made the mistake of recording with different microphones, which is a big issue. No emotions there. So it would be a good idea to test these models on emotions too — actually, since you work on models for emotions, I think that is a good idea. Other questions? No? Okay, we will take a break — drink some water, relax — and then we go to the second part. I hope I have given you motivation to come to the hands-on session.

Okay, so we stopped at the sinusoidal model: how that model can incorporate jitter and shimmer, and what the mechanism is for using sinusoids to estimate them. Now we go to another application, a very important one: how to estimate tremor in voice. Tremor is not only in voice — mechanical devices also, unfortunately, produce tremor, and you may want to estimate how much tremor they produce — but the same thing happens with humans. So the first question is: what is tremor? To be honest, six years ago I had no idea what tremor was.
It was introduced to me by Dr. Jean, from the University of Brussels, and since then we have been working on tremor: how to estimate it, and what it is. Tremor consists of modulations that we do not control — modulations of frequency and amplitude — and we try to estimate it in sustained phonation, a sustained "ah" and so on. As with jitter, there is pathological and physiological (that is, normal) vocal tremor: our voice has vocal tremor already. As I said, strong motor synchronization means no tremor, but when that starts to fail, we begin observing tremor. Tremor below a certain value is physiological; above that value it is pathological.

The key question is which modulations to look at, and the answer is: modulations between 2 Hz and 15 Hz. Below 2 Hz there is also modulation of our voice, but that is because of our heart — cardiac modulation — so we have to remove the heartbeat that modulates our voice and look at the very low modulation frequencies above it. And since you are looking at very low modulation frequencies, your analysis window cannot be very short; that is one big problem, and you are invited to solve it. Even when we use a sinusoidal model like the AM-FM model I showed you, you still need a fairly long window to estimate very low modulations.

What are the vocal tremor attributes? When you look for tremor, you look for the modulation frequency and the modulation level: the modulation frequency is how fast the modulations are; the modulation level is how strong they are. We again use an AM-FM decomposition algorithm, the QHM, because it gives us high resolution in the time-frequency plane, and we can estimate vocal tremor, if we want, for any sinusoidal component. I don't know whether that has any interest in practice — tremor is related to the modulation of the vocal folds, so why estimate tremor at high frequencies? It might be interesting, but what I will show you, and all the methods we have developed, are based on just one component.

Many people, when they report tremor, give you a single value for the whole phonation; you will see, according to the attributes I will show you, what your tremor is. But it is actually interesting to estimate tremor as a function of time, in order to understand it well. It is like the following: when you go to give a presentation at a conference as a fresh PhD student, you are very anxious — you don't know how it will go — and if we measured your tremor as you start talking, it would be high, probably in the pathologic range; but as you gain confidence, you control your tremor better. Tremor has even been used, in lie detection for instance, to check whether you are telling the truth — as a stress estimator.

As I said, we know all this already, so nothing more on this slide. Here is the time-frequency plane computed using the AM-FM components — we have seen this figure. You see the tracks of a sustained phonation: this is a male voice, 100 Hz, then twice the F0 at 200 Hz, 300 Hz, and so on. Now we isolate just one component — again, if you think there is a difference, you are invited to try the same algorithm on the higher harmonics. We select one, and because we are looking for modulations between 2 and 15 Hz, we can downsample that signal from 16 kHz to just 1 kHz. Then we must remove the very slow modulations.
These come from our heartbeat, and we remove them with a Savitzky-Golay filter, a smoothing filter that is not much used in signal processing — it has been used in chemistry. It is a very nice filter, and there is a nice article about it and its properties in the IEEE Signal Processing Magazine, written because someone noticed it is an interesting filter that nobody in signal processing was using.

Let's see how we apply the Savitzky-Golay filter. You see this solid line: it is the F0 measurement after removing the mean value, which was around 100 Hz, so now it varies around 0. Then we estimate the heartbeat component with the Savitzky-Golay filter — this is the dashed line; the dashed line is actually the modulation caused by your heart. If we take the Fourier transform — assuming this signal is stationary over time, which is a wrong assumption, but nevertheless — and average the spectral information, this is what we get for the solid line: there is a strong modulation from the heart, which the dashed line captures. The Savitzky-Golay filter estimates exactly that, and indeed this is the activity below 2 Hz. Subtracting the dashed line from the solid line, we obtain a component free of the heartbeat.

On that signal we apply a one-component AM-FM analysis using the quasi-harmonic model. Interestingly, this AM-FM model is a non-stationary model, so you make no stationarity assumption — your signal is quite dynamic and changes. The solid line is again the original signal, and the dashed line is the AM-FM fit; it tracks the waveform very well, better than what a Kalman-type tracker can do. How can you tell whether this is good or bad? You reconstruct, you overlay the original signal with the reconstructed one, and you see that the reconstruction follows the original. And remember, this comes from a single AM-FM component — not from a Fourier transform.

When people tell you "this is your tremor", they actually take the Fourier transform of this signal and report, for instance, the center of gravity: here the center of gravity is around 5 or 6 Hz, because of these big components here. If someone takes the maximum instead, that is 7 Hz. But again, this comes from a stationary analysis of the signal, which is not valid; the dashed line corresponds to this frequency. So this is not nice. What is nice is to look at the modulation frequency and the modulation level as functions of time. The AM-FM component that you saw tracking the original signal so well is produced as an amplitude multiplying the cosine of a phase, at every sample, so you have dynamic information. Two things follow. First, the modulation frequency actually varies as a function of time, so you can have a tremor estimator as a function of time — which was a very important goal. Second, you can read off the tremor frequency and also the modulation level, and according to these your voice can be classified as normophonic or dysphonic in terms of tremor.
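A minimal sketch of this pipeline, under my own assumptions: a 1 kHz F0 track a few seconds long, a one-second cubic Savitzky-Golay smoother for the cardiac trend, and a plain FFT in the 2-15 Hz band standing in for the one-component QHM tracker of the actual method.

```python
import numpy as np
from scipy.signal import savgol_filter

def tremor_attributes(f0_track, fs=1000.0):
    """Sketch of the tremor pipeline described above:
    1) remove the mean F0,
    2) estimate the very slow (< ~2 Hz) cardiac modulation with a
       Savitzky-Golay smoother and subtract it,
    3) read modulation frequency / level in the 2-15 Hz tremor band.
    Assumes f0_track is sampled at 1 kHz and is a few seconds long."""
    x = f0_track - np.mean(f0_track)
    # ~1 s window, cubic fit: keeps roughly only the sub-2 Hz trend.
    slow = savgol_filter(x, window_length=1001, polyorder=3)
    x = x - slow
    X = np.abs(np.fft.rfft(x * np.hanning(len(x))))
    f = np.fft.rfftfreq(len(x), 1.0 / fs)
    band = (f >= 2.0) & (f <= 15.0)
    mod_freq = f[band][np.argmax(X[band])]            # tremor frequency
    mod_level = 100.0 * np.std(x) / np.mean(f0_track)  # level, % (one
    return mod_freq, mod_level                         # possible choice)
```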
I hope that no one's voice will be detected as having a tremor problem — if it is, either you were stressed during the hands-on, or there is a real problem. I hope everyone who records will come out with nice results and a very low modulation level. Maria has been working on tremor. So that is the part on modulations for tremor estimation using AM-FM — another application of the AM-FM model — and these are the papers by Maria on tremor.

Now we move to the modulation spectra. As we said, voice pathology means modulations, but actually there are always modulations while we are speaking, at several levels, so modulations are very important in speech production. There was a paper recently by Tomoki Toda showing that synthetic speech has a problem partly because it lacks the natural modulations, and he was trying to restore the modulations of the synthetic signal.

Again, the principle: you take a Fourier transform to get a time-frequency distribution; then you take one frequency line, consider it as a time-domain signal, apply another Fourier transform (or another transform), and you end up with a frequency-frequency representation. In equations, with the Fourier transform: this is the standard short-time Fourier transform; these are the basis functions, the regular Fourier basis; this is your analysis window, and this is your signal. I1 denotes the number of frequency bins along the acoustic frequency axis. Then we take the amplitude spectrum as the input signal over time — this is M — and apply a Fourier transform again, with a window, which gives us the frequency-frequency modulations; I2 is the number of frequency bins along the modulation frequency axis. So now we have an acoustic frequency axis, indexed by k, and a modulation frequency axis, indexed by i. The size of this second analysis window determines which modulation frequencies we are looking at.
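Here is a compact sketch of that double transform (window sizes and names are illustrative, not those of the original system):

```python
import numpy as np
from scipy.signal import stft

def modulation_spectrum(x, fs, n_fft=512, hop=128, n_mod=256):
    """Acoustic-frequency x modulation-frequency representation:
    a short-time Fourier transform first, then, for every acoustic
    frequency bin, a second Fourier transform of its magnitude
    trajectory over time."""
    f_ac, t, X = stft(x, fs=fs, nperseg=n_fft, noverlap=n_fft - hop)
    M = np.abs(X)                        # |STFT|: freq bins x frames
    fs_frame = fs / hop                  # frame rate = modulation-domain fs
    M = M - M.mean(axis=1, keepdims=True)  # remove DC per frequency line
    W = np.hanning(M.shape[1])           # window along the time axis
    MS = np.abs(np.fft.rfft(M * W, n=n_mod, axis=1))
    f_mod = np.fft.rfftfreq(n_mod, 1.0 / fs_frame)
    return f_ac, f_mod, MS               # MS[k, i]: acoustic k, modulation i
```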
Such a method, however, cannot be used for very low modulation frequencies. Why? Because you would need very high resolution to see differences of a few hertz, as we saw with tremor; but it is good at the level of pitch modulations, at those kinds of frequencies.

These are the modulation spectrum images we get for normal and abnormal phonation; you saw them during the introduction. And we see that even between two dysphonic voices — adductor spasmodic dysphonia on the left and vocal nodules on the right — there is a difference in how the energy is distributed over the acoustic and modulation frequencies. So these are pictures; how can we use them to get information about pathology in a voice?

What we do: we take instances of your sustained "ah", make that picture for each, and stack all these images together — we put the matrices one after the other, and that creates a tensor. Because these are far too many features, we end up with a feature selection problem: how to select, from that tensor, features that let us detect whether a voice is pathologic.

So we have a tensor — acoustic frequencies by modulation frequencies by the number of frames, that is, how many instances you have — and we apply what we call higher-order SVD (HOSVD). We decompose the tensor using what are called n-mode singular vectors and, as in ordinary SVD, we can rank the singular values. To keep a certain percentage of the energy, as we say, of the tensor D, we put a threshold; by doing so we keep a few singular values — that gives the ranks R1 and R2 — and we truncate the matrices holding the singular vectors of the acoustic-frequency and modulation-frequency modes. When new data B arrives, we project it onto the truncated matrices, which yields another representation Z of dimension just R1 by R2. So with the HOSVD you reduce the tensor to a matrix, in a principled way. Just as an example: starting from a tensor of this dimension, we end up with this matrix — a big feature reduction.

To see how the HOSVD reduces the redundancy that exists in the features: here is the distribution of mutual information using the original features of the tensor, with substantial redundancy, and this is the low redundancy of the packed features, the 31-by-31 matrix. So we reduced the number of features and essentially removed the redundancy as well.

That is the first step. The second step is feature selection, done through mutual information. There are many ways to measure mutual information — you have to pay attention, because it is not straightforward and the computation can contain mistakes — but one criterion you can apply is the maximal relevance criterion: you try to find the features that are most relevant for your classification problem. Using mutual information, we compute the mutual information between the features and the class, rank the values, and select the top M features — the standard approach to feature selection based on information-theoretic criteria like the maximal relevance criterion.
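A sketch of the truncated HOSVD and the projection of new data, under my own naming; the energy threshold plays the role of the percentage mentioned above:

```python
import numpy as np

def hosvd_truncate(D, energy=0.95):
    """Truncated higher-order SVD over the first two modes of the
    tensor D (acoustic freq x modulation freq x utterances): the
    ranks R1, R2 are the smallest keeping the given energy fraction."""
    def mode_basis(unfold):
        U, s, _ = np.linalg.svd(unfold, full_matrices=False)
        r = int(np.searchsorted(np.cumsum(s**2) / np.sum(s**2), energy)) + 1
        return U[:, :r]
    I1, I2, I3 = D.shape
    U1 = mode_basis(D.reshape(I1, -1))                     # mode-1 unfolding
    U2 = mode_basis(np.moveaxis(D, 1, 0).reshape(I2, -1))  # mode-2 unfolding
    return U1, U2

def project(B, U1, U2):
    """Project a new modulation-spectrum image B (I1 x I2) onto the
    truncated bases: the packed R1 x R2 feature matrix Z."""
    return U1.T @ B @ U2

# Usage: D stacks the training images; each new image B is reduced to
# Z = project(B, U1, U2), e.g. a 31 x 31 matrix as quoted above.
```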
Here we show two results. The first is simply detection between normophonic and dysphonic voices — using, I believe, the MEEI database again — and again the area under the curve is high; no surprise, of course, since it is an easy database. But then we went one step further and asked: can we use modulations to distinguish not only whether a voice is pathologic, but what kind of pathology it is? Quite tricky, I would say, and I am not sure the results are fully valid, but nevertheless we showed that we can discriminate well — look at these discrimination scores — polyps from adductor spasmodic dysphonia, for instance, or polyps from keratosis or from vocal nodules. It suggests that, yes, these pathologies probably generate different types of modulations, and using HOSVD and the maximal relevance criterion you can select features good enough even to separate pathologies. But that was one test; I would like to see more tests before saying more. Still, it is a motivation for further experiments with modulations.

If you show these results to medical doctors, they do not care. They say: I don't like this result — first of all, are you trying to replace me? And of course we do not want to replace the medical doctor. Why try to determine whether it is a polyp or adductor spasmodic dysphonia? They don't really like that. For screening it may be useful, but not for taking decisions. Actually, as an engineer you should not think of building an algorithm for a system that takes the decision, because that has very serious consequences if you make a mistake, and no authority will certify such a device. But providing measurements — yes, that is good.

Then we did some work to fuse modulation spectra with other features, because modulation spectra are not like magnitude spectra: from magnitude spectra we can derive mel-frequency cepstral coefficients, as people usually do, and the question is how to fuse those with modulation spectra. The other issue is having different databases — one from Spain, one from Greece, one from the United States, with different recordings and different microphones. How are you going to run the same algorithm on all of them? If you claim you have an algorithm that works well, and someone from Japan sends you one signal, how can you say whether it is pathologic or dysphonic if you don't even know the recording conditions? To handle this we suggested — not shown here — what we call normalized modulation spectra, to account for such cases.

And then comes a very nice application that medical doctors really like, something they pay attention to: using modulation spectra to assess voice quality objectively along the grades of the GRBAS scale — Grade, Roughness, Breathiness, Asthenia, Strain. This comes from Japan, and many people now use this scale, assessing the voice along these five dimensions. To do that, you need people who are experts in listening to voices — voice pathologists, phoniatricians — who listen and grade. The problem is that between different human raters, the correlation of the evaluations is less than 60%.
If the same person takes the test again two days later, the correlation is about 75%. So it is not easy to assess voice quality with this scale. Using modulation spectra, just for grade — or hoarseness, as it is sometimes called, meaning noisy — and breathiness, we reach nearly 79% correlation, which is better than what the humans achieve. That is an interesting result for voice pathology; this is nice for the medical doctors.

I already mentioned fusion: this black line is MFCC, mel-frequency cepstral coefficients; the blue line is the maximal-relevance modulation spectra; and when we do the fusion, it shows that they provide complementary information. This is what we get — not as good as the earlier results, of course, since this is a more difficult database — but it shows that the fusion is good.

Now, key references. For the work on modulation spectra there is a transactions paper on modulation spectra for voice pathology with Maria Markaki, an ex-student of mine, plus some collaborative work with people in Spain, who actually made the recordings of the PdA database in Spanish, and some other conference papers. During this presentation I referred to some works: if you are interested in phonovibrography — it is very difficult for me to pronounce — this is the key paper for understanding the high-speed camera and how to visualize its data quickly. If you keep just one line of pixels, that is videokymography, an invention of Jan Švec, a very talented person in the Czech Republic. Then there is a person from LIMSI, from Christophe d'Alessandro's lab, who used electroglottographic devices to assess pathologic voices. Atlas and Shamma suggested the modulation spectrum. And, yes, some work from us.

For this part I would like to acknowledge Miltos Vasilakis for the spectral jitter estimator, Maria Markaki for the modulation spectra, and Yannis Pantazis for the AM-FM model, which was a collaboration with Olivier Rosec; and my current PhD student, Maria, who worked on tremor estimation. She continues, of course, to work on voice pathology — she has a paper at Interspeech — but she now works on another aspect of voice, the intelligibility of casual speech: how to improve it, which is a bit different from pathologic voice but related to some extent.

So I would really like to thank all these people for producing these results over the years and making it possible to show them to you. Thank you again for your attention in this part as well — thank you very much.