So, today we will move forward in many respects. Let's assume that we now have synthetic speech. We have learned about HMMs, about LDMs, about DNNs and LSTMs, how to prepare databases, how to do labeling and even recording, which you could do at home after the great hands-on sessions that Simon and Vasilis have prepared with the group here at the university; with these other representations you will probably then want to try more advanced vocoders. Suppose you have done all that great work and it works. There will probably still be artifacts here and there, and even once you have solved those problems, the prosody may not be quite right, so you will have to spend more time on it; in fact we should come back some day and have a summer school devoted entirely to prosody, because prosody is very important. As humans, we receive calls every day, we communicate with others, and the communication over the phone is far from perfect: there is a lot of noise, but if we are interested in the topic of the conversation we can keep going, even for an hour. It is not really the artifacts or the noise that annoy us; what matters in that scenario is that we understand what the other person says, otherwise we do not want to continue the communication. So we are robust, to some extent, to the presence of noise, and to the imperfect decoding of the linear prediction or CELP coding that our mobile phones do, imperfect for reasons X, Y and Z. But over a long discussion we are not robust to incorrect prosody; this is very important. Indeed, the more we improve what I call the segmental quality of speech, the more attention we have to pay to higher-level characteristics, and that is prosody; prosody is absolutely crucial. So I would love to have another summer school just on prosody.
But before going there: you have now, as I said, synthesized nice speech with good prosody, and you want to use it in a real environment. There is a good chance that the sound you produce will not be fully understandable there. When I was at AT&T Labs we prepared what was called, at that time, the AT&T Next-Generation TTS system, and it really was the next-generation system in 1998; it was a very good speech synthesis system. But when we sold it, with the software, to a company that wanted to use it at an airport, in the end they did not use it at all, because of the reverberation and the ambient noise, the high noise level in such listening environments. So listening to your synthesis system only in your room or your office is necessary but not sufficient. Today we will talk about two things. First, how you are going to evaluate your synthesis system: the speech synthesis community has developed the Blizzard Challenge, a lot of effort goes into it, and the University of Edinburgh is really behind these challenges; Simon will talk about the Blizzard Challenge during the second hour. Before that, I will present work on how to improve the intelligibility of speech in noise, and we will look at this not only for synthetic speech but also for natural speech, so I will talk a lot about natural speech in order to motivate the discussion. Even if your synthesis is perfect, free of artifacts, with excellent prosody, in effect natural speech: how will it be perceived in a noisy environment?
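Later in the talk, "how well is this understood in noise" is quantified by fitting a logistic psychometric function (word score versus SNR) to listening-test data and then inverting it, which leads to the "equivalent intensity change" metric. As a preview, here is a toy sketch of that inversion idea; the function names and all parameter values are my own inventions for illustration, not values from the experiments described here.

```python
import math

def logistic(snr_db, midpoint, slope):
    """Toy psychometric function: word score (0..1) as a function of SNR in dB."""
    return 1.0 / (1.0 + math.exp(-slope * (snr_db - midpoint)))

def inverse_logistic(score, midpoint, slope):
    """Invert the curve: the SNR (dB) at which plain speech would reach `score`."""
    return midpoint - math.log(1.0 / score - 1.0) / slope

def eic_db(score_modified, snr_db, midpoint, slope):
    """Equivalent intensity change: extra dB of SNR that plain speech would
    need to match the score obtained by the modified speech at `snr_db`."""
    return inverse_logistic(score_modified, midpoint, slope) - snr_db
```

With these invented parameters (midpoint -14 dB, slope 0.2), plain speech scores 0.5 at -14 dB; if a modified version scores 0.75 at the same -14 dB, the EIC works out to ln(3)/0.2, roughly 5.5 dB: the modification is "worth" about 5.5 dB of extra SNR.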
How much can I understand of the communication? I will also take the opportunity to speak about two projects, one we were involved in and one we will be involved in; Simon was a partner in the first and will be a partner in the second. There we will meet another challenge, not the Blizzard Challenge but the Hurricane Challenge, which was also organized by the University of Edinburgh. The second project is where the word "beyond" on the slide comes from; we will see what that means later. So I will talk about the LISTA project, a European Union project that was very big in terms of effort and, I think, in terms of the results we got, and in which we set up the Hurricane Challenge. We will look at some selected results, and among them one model, SSDRC, showed up as very effective, so we will go into some detail about how SSDRC works, and then we will see how it can be helpful not only for natural speech but also for synthetic speech. Then I will move on to some results from last year and this year, more tests that we are doing, and a new project that is just starting; that is where "beyond" comes from. I will only mention it briefly; if you are interested there is a web page, which I will link, where you can get more information, and the project is supposed to start in October. OK, so that is what I am going to talk about now. As I said, I am speaking and also showing text because of my Greek accent and my not-always-correct English.
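The Hurricane Challenge stimuli described later in the talk are built by adding measured noise to clean speech at fixed SNRs, with the level of the speech left untouched. A minimal sketch of that mixing step, scaling the noise so a target SNR is hit exactly; the function name and conventions here are my own, not from the challenge materials.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db, rng=None):
    """Add a randomly chosen segment of `noise` to `speech`, with the noise
    scaled so that the speech-to-noise energy ratio equals `snr_db`.
    Both inputs are float arrays at the same sampling rate."""
    if rng is None:
        rng = np.random.default_rng(0)
    # pick a noise segment as long as the speech
    start = rng.integers(0, len(noise) - len(speech) + 1)
    seg = noise[start:start + len(speech)]
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(seg ** 2)
    # required noise gain: p_speech / (gain^2 * p_noise) = 10^(snr_db/10)
    gain = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + gain * seg
```

Note that the speech itself is never rescaled: only the noise gain changes, which matches the constraint, repeated throughout the talk, that the modifications under test are not allowed to increase the signal's energy.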
I am also using these slides so that you understand me better: it is speech plus text, to improve the communication. If I am talking and nobody understands what I am saying, why am I talking? What is the point of the effort? So of course I have to improve the way I talk. Even if you were Greek and I were speaking Greek to you in a noisy environment, I would still have to make an effort to speak better. You know this very well: if you go to a cafeteria or a restaurant, you do not speak the same way; even now I am not talking the same way I will talk with you during the coffee break. In a noisy environment I make an additional effort, which is also why I cannot talk with you for very long in a noisy environment. That effort is called the Lombard effect, after the researcher Lombard, who studied it. There are also different styles of speech; two important ones are what is called casual speech, meaning conversational, spontaneous speech, and clear speech.
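The talk later describes the Lombard effect in signal-processing terms: speakers move energy into the mid frequencies, where the auditory system is most sensitive, without raising their overall level. As a toy illustration of that "redistribute, don't amplify" idea, the sketch below boosts a mid-frequency band of the magnitude spectrum and then renormalizes to the original RMS. The band edges and gain here are my own arbitrary choices, not the values used by Lombard talkers or by SSDRC.

```python
import numpy as np

def midband_boost_equal_rms(x, fs, band=(1000.0, 4000.0), gain_db=9.0):
    """Boost a mid-frequency band of signal `x` (sampling rate `fs` in Hz),
    then rescale so the output RMS equals the input RMS."""
    spec = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    in_band = (freqs >= band[0]) & (freqs <= band[1])
    spec[in_band] *= 10.0 ** (gain_db / 20.0)  # raise the chosen band
    y = np.fft.irfft(spec, n=len(x))
    # equal-energy constraint: take back exactly the energy the boost added
    y *= np.sqrt(np.mean(x ** 2) / np.mean(y ** 2))
    return y
```

After the renormalization the output carries the same total energy as the input, but a larger share of it sits in the boosted band, which is the spirit of the constraint used in all the experiments described below.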
So when I try to speak clearly, I definitely improve the intelligibility of my voice, and that is true for normal-hearing listeners. About "normal hearing" I should say the following: these can be normal-hearing, linguistically experienced listeners, but they can also be inexperienced listeners. Inexperienced listeners are not only foreigners, people for whom English or Greek or whatever is a second language; they can also be children, who are inexperienced listeners and have particular problems tracking a human voice. And of course there are hearing-impaired listeners; you understand why. That is the list of communication barriers. To motivate the discussion, let me show you a result first. This picture carries a lot of messages and I will try to decode them as I show you more slides, but a few things now. We ran some tests, and I will give you more information about them later, in which we measured the intelligibility of natural speech in a noisy environment, and then of the same speech modified before it goes into that environment. We do not control the noise: the noise is always the same, and the energy of the modified signals is always the same as that of the natural speech. You are not allowed, for instance, to simply increase the volume. Zero on the axis is whatever the intelligibility score of natural speech was; that is the reference point. Positive means you have improved intelligibility, negative means you have decreased it. The first thing to see is this TTS system: the synthetic speech, from an HMM-based synthesis system, was less intelligible than the natural speech. Even if the database you learn from is Lombard speech, the TTS improves, but it is still less intelligible than natural speech. That shows the problem I mentioned: even with a very good synthesizer, you have to pay attention to where that speech is going to be listened to. Another result: more positive is better, and you can see that this SSDRC has the best intelligibility compared to the other systems, and of course natural speech is at zero. Again, this is just to motivate the discussion and to see where we stand; we will see this picture again later, and by then you will understand better what all these numbers mean. So, we had a project that ran from 2010 until 2013; as I said, Simon and myself were partners, and the PIs were Bastiaan Kleijn, at that time at KTH, and Martin Cooke of Ikerbasque in Spain. The problem was this: we have machines that talk but never listen. As humans, when I talk I also listen, or I watch you, and that feedback mechanism lets me modify my speech; a machine that talks does not do that. So we are talking about a listening talker, which is where LISTA comes from: the LIStening TAlker. We would like to make a machine that talks and at the same time listens to the environment. To do that, we first wanted to learn from humans how they really react to changes in the listening environment, and once we learned that, to apply it and hopefully improve the intelligibility of machines, of synthetic systems, as well. This link is still valid, so if you want to know more about LISTA, go there; you will also find a lot of results from the Hurricane Challenge. The results I showed you to motivate the discussion come from the Hurricane Challenge, and I will talk more about it. We used what is called the Harvard data set, with sentences like "The key you
designed will fit the lock." These are phonetically balanced sentences, and in general we can say they are more representative of everyday speech; they are, however, a data set designed for intelligibility tests. We had a UK speaker, recorded under very good conditions, downsampled to 16 kHz, and for the Hurricane Challenge we used just 180 sentences out of the full set, which is much bigger, 720. That was the speech material we used in LISTA. We also used two types of maskers. One was a fluctuating masker, a noise that changes all the time; it was a female speaker, and not by chance: the target speaker was a male voice and the masker a female voice, to make the task a bit easier. The other was a steady-state masker, speech-shaped noise: we average spectra of speech and then filter random noise, so we get something roughly constant that has the colour of the average speech spectrum. You will see these as CS, competing speaker, and SSN, speech-shaped noise. Of course we had to make sure that in these tests people never listen twice to the same sentence, because then they would probably remember it and the test would not be fair; each listener hears a given sentence only once. And here is a detail: when you run a listening test you do not want the signal to start abruptly at time zero; you need a little noise first, to prepare the ear that a sound is coming, otherwise you get a start-up effect, which is not good. So you add about 500 milliseconds at the beginning and at the end, to prepare the listener. Then there is the signal-to-noise ratio: we decided on three levels, and I will show you in a minute how we chose these SNRs for combining the natural speech with the noise. We chose the SNRs at which humans had 75% intelligibility, 50% intelligibility and 25% intelligibility. That is how we decided on the three SNRs to use. Now let me show you the test, the so-called baseline results, from about 50 native UK listeners. This is competing speaker, this is speech-shaped noise, plotted against SNR; the smaller the number, the harder the condition. You see that as humans we are more robust to a competing speaker than to speech-shaped noise, probably because the competing speaker, although it talks continuously, nevertheless leaves gaps here and there, moments with less energy. [Question] To do SSN properly, wouldn't you, at least in most cases, have to condition on the f0 of the voice, to avoid the harmonics being spectrally masked? [Answer] The speech-shaped noise does not carry any information about pitch; it only has the average spectrum, so it is constant information over time, with no pitch at all. The competing speaker does carry pitch, but it comes from a female voice while the target is male, again so as not to make the task too difficult; we wanted to explore that possibility. So yes, SSN has no pitch information. OK. So here you can see where we got, for instance, 25% at -20 dB, 50-something at -14 and 75% at -7 for this curve: these are the results we obtained, and we fit a logistic function to them, so we can find at which SNRs we are going to work when we have a competing speaker, and the same for speech-shaped noise. Now, if I invert this function, then from a result, the probability of correct recognition, I can compute the SNR at which I am effectively working. If I do that, and I also know the value for normal
speech, then when I subtract the two, the difference is called the equivalent intensity change, and it is in dB. What does that mean? If it is positive, the method has done a good job; if it is negative, meaning this value is smaller than that one, then you have not improved the intelligibility, on the contrary. So this EIC is the metric we used, and now you probably understand this number better: equivalent intensity change. When it is positive we have improved intelligibility; when it is negative we have not done well. Now we look at these EICs in dB, and this method, SSDRC, is at about 5.5 dB. What does that mean? Let me make it easier and say the EIC is 6 dB. You probably know that 6 dB means doubling the amplitude: if you double the volume, you gain 6 dB. So this number means that if you were allowed, depending on the loudspeakers, to double the volume of the normal speech, you would get this gain in intelligibility; in other words, you would have to nearly double the volume of the natural speech to match this intelligibility. And you see that as you go in this direction it is good, but not as good as in this area. Again, TTS does not do a very good job, not even for Lombard speech: even if we train on Lombard speech, it is not good. That was for the case of speech-shaped noise at -9 dB, with about 140 listeners. We will look briefly at the methods shown here, but I will give details only on the winner, SSDRC. This is not a new problem; people have worked on it in the past. For instance, a very effective method is high-pass filtering plus amplitude compression, and that goes back to 1976. We also have models for measuring intelligibility: the speech intelligibility index, SII, the glimpse proportion, and STOI are some of the models people use. If we have such a criterion we can try to optimize it, that is, modify the speech so as to maximize the intelligibility criterion. Another approach is selective enhancement. For instance, Valerie Hazan at UCL found that if you detect nasal sounds and increase their level, intelligibility increases, and the same holds for onsets and offsets. Why? For obvious reasons of the speech production mechanism: nasal sounds do not carry much energy, because there are a lot of spectral zeros, so a lot of energy is absorbed and little comes out; and at an onset, when I start, I do not yet have high level, and the same at offsets. Onsets and offsets, however, are very important for distinguishing which word I am saying, and the same is true for stops; these are very important. If you want to improve intelligibility, you have to pay attention to these classes of sounds. And we already know about the Lombard effect; I have talked about that. What does it mean in terms of signal processing? If we analyze Lombard speech, we see that in the mid frequencies, around three to five kilohertz, we increase the energy when we talk in a Lombard style. Why three to five? It is a feedback mechanism we have learned: that region is the most effective. It is like the opera singer I mentioned the first day, who is trained to sing in a certain way; in the same way, we are trained every day in how to modify our style in a noisy environment, and three to five kilohertz is the region where your auditory system is most sensitive. What does sensitive mean? It means it needs the least energy in order to be excited. So I put more of my energy in that region, moving it there from other regions, in order to increase the intelligibility. That is also why, for instance, when
there are clipped samples in the audio, the clipping unfortunately creates aliasing components, and those components land exactly in that three-to-five-kilohertz region; that is why intelligibility drops when there is clipping. So much for Lombard speech and its signal processing. Clear speech has similar characteristics: we see an expanded vowel space, and some people, when you ask them "speak clearly, please", simply slow down, but if you are trained to speak clearly you can even talk faster and still be clear. And as I said already about nasals and onsets, they have low energy. These are the observations; can we use them to develop a model? That model is SSDRC. SSDRC stands for Spectral Shaping and Dynamic Range Compression, so there are two stages. The upper panel is the spectral shaper. These peaks are, roughly, formants; you do not explicitly detect the formants, but you know that the spectral maxima are very important for your auditory system, so you try to enhance them, while the valleys are not that important. If you take energy from the valleys, and also from the low frequencies where you already have a lot of energy, and redistribute it to the maxima, that is good; it also matches what we saw in Lombard speech, where energy is increased at certain frequencies and taken from other areas of the spectrum. Why am I redistributing energy? I have said it, but perhaps it was not clear: the constraint is that you may do whatever you want to improve intelligibility, but without increasing the energy. If you increase it somewhere, you have to decrease it somewhere else. All these modifications should also depend somehow on the state of the voice, whether it is voiced or unvoiced; you have to pay attention, otherwise you will introduce artifacts. Also, when we play the sound back through ordinary loudspeakers, headphones or earphones, these have a low-pass characteristic, while intelligibility also comes from the high frequencies, so you have to protect your speech by pre-boosting the energy at those frequencies, to guard against the loss of the high frequencies. That is the spectral shaper. The dynamic range compression stage is not new either, but you have to know how to design an effective DRC that does not create artifacts. DRC basically works on an input-envelope-to-output-envelope function that defines the relationship: when the input envelope is in this region we expand, when it is in this region we compress, and so on; and here, at 45 degrees, we do nothing, which is mainly where there are silences or very low-energy components, which we know from statistics are not speech. To do this you have to detect the envelope, and there are many ways to do that; you can pick one, such as the Hilbert envelope. Another constraint is whether you want this to be a real-time system; if so, all your choices must be real-time. Once you have the envelope, the key is this: in the regions that are like nasals, onsets and offsets, down here, you expand, and in the regions with a lot of energy, like sonorants, an "ah" that is already high, you compress. So you take a lot of energy from here and move it there; it is like redistributing the wealth of the planet in a better way. [Question] Just a small clarification: why does the high-pass filtering help us so much, given that we are most sensitive to the three-to-five-kilohertz range? [Answer] You mean this filter here? This one, and note I have not marked any frequencies on it yet, is mainly there to increase, to mimic, the Lombard effect, in the mid range. [Question] That's kind
of like a pre-emphasis thing? [Answer] Kind of pre-emphasis, yes; I have the equations on a later slide. What exactly is your question? [Question] Shouldn't it be more like a band-pass? [Answer] Band-pass, yes; here it is only drawn schematically, and it is indeed more of a band-pass, but the other filter is a high-pass. High-pass, again, because we know the devices that will play back the sound have low-pass characteristics, so you would lose the high frequencies, which are very important; we saw from the 1976 results that if the high frequencies are there, you get a more intelligible sound. You know this from speech synthesis too: with better high frequencies the sound is crisper. [Question] Then why don't we account for that in our actual features, in the mel-scale features? We still give more importance to the lower-frequency features. [Answer] Because the low frequencies are also important, for the identity of the vowels; you cannot sacrifice one for the other. But once the first part is right and the vowel is correct, you have to decide what to do with the high frequencies: it is not just noise up there, there is information, and your ear is, as you know, very sensitive to it. OK, so that is the big picture of SSDRC. Now, in terms of equations, I will go fast. First, as I said, speech cannot all be treated the same way: you have to estimate a kind of probability of voicing. Based on that probability of voicing we measure the tilt function, which can be computed in many ways, for example from the low-order cepstrum, and this is the magnitude spectrum of speech; when you divide by the tilt you get a tilt-free envelope. Then you sharpen it: where the peaks are, you raise them to a power, putting them in a higher position, depending of course on whether the value is greater or less than one, and so on, but you will work that out. The pre-emphasis is also a function of the probability of voicing, and it is a kind of band-pass filter, depending on where these frequencies are; in SSDRC, although it is shown here as a high-pass, depending on this omega, and another term you can add, it turns out to be more like a band-pass. And then there is a fixed filter to boost the high frequencies. So for the spectral shaper you take this filter, this one and this one, multiply your magnitude spectrum by them, and that is your modified spectrum; then you keep the original phase, take the inverse Fourier transform, overlap-add, and you have your spectrally shaped signal. For the dynamic range compression, as I mentioned, first you detect the envelope, and then there are two phases to understand: the envelope does not behave the same during attack as during release. It is very important for this idea to separate the two; for example, you have to respect the attack, because it is very important for intelligibility. Then we have to find the gain with which to modify the signal, and that will be the DRC. How do we get this gain? From the detected envelope we estimate what we call the input envelope, and there is a reference envelope, which is just a number fixed by design in the DRC; once you have those, you compute this input-output relationship, which gives you the gain g(n) that you put here and multiply by, and that is the dynamic range compression. How does it look? Here I show only the DRC output, because if I showed the spectral shaper alone you would not see any difference between the original and the modified waveform. You can see that these regions are indeed magnified, the nasal sounds are magnified, while the sonorant parts are reduced; so now this nasal sound has a higher chance of being heard. We also do an objective evaluation, of course, because when you design this you really have to tune this
e0, these release constants and attack constants; you have to design your system carefully. A measure we used is the speech intelligibility index, SII. We show the result of SSDRC in contrast to the original signal, and we also contrasted it with Bastiaan Kleijn's method. Bastiaan has done excellent work; he worked on this topic for many years, he called it something like near-field speech intelligibility boosting, and he even has the MATLAB code on an excellent web page, so we were able to compare on the same data. I forgot to say that Bastiaan's method optimizes the modifications with respect to the SII, so it is really an optimization algorithm towards SII. This plot is for speech-shaped noise, and this axis is intelligibility: as the input SNR increases the task gets easier and the SII increases, which means intelligibility is expected to increase. That is natural speech; Bastiaan's method does very well, it improves the intelligibility, but SSDRC makes a really big jump, which is good. What happens with a competing speaker? The dashed line is the original, and Bastiaan's method is this one: it does not do very well for a competing speaker, because the masker is now changing fast, it is not constant, so in short the optimization is not so robust, not so effective, while SSDRC continues to do better. But anyway, this is objective evaluation, and who really cares; it is only there to evaluate objectively what you are doing. You have to do subjective tests. So we designed the Hurricane Challenge, and here you can see: we had many listeners, native UK speakers, and we made sure they could hear well; as I said, two maskers, speech-shaped noise and competing speaker, and three SNR levels; we used the Harvard sentences, and we made sure nobody heard the same sentence twice, and so on: it was very rigorous. [Question] The noise conditions: was the noise in the listening room, or was it added? [Answer] It was added: we had clean speech and we had measurements of noise, and then we added them together. I will show you results for a few conditions. Normal speech; Lombard speech, and how did we create the Lombard speech? We had the same speaker wearing headphones, with the noise played into them while we recorded his voice, so he was really producing Lombard speech. Then two spectral modification methods, one based on the glimpse proportion and the other based on the SII, the Bastiaan Kleijn method I told you about; and the proposed method from Cătălin, SSDRC. So here are the results, for speech-shaped noise and for competing speaker. Zero is always natural speech; since this is equivalent intensity change, if you are positive you do well, if you are negative you do not. Here we see Lombard speech: as the SNR drops, as you go in this direction, the conditions get harder; humans do very well, but they give up here, you cannot compete with the noise any more. For the competing speaker it is different: as humans we continue to do better with Lombard speech. Here you see the GP and SII methods doing well, but not as well as Lombard speech; the same here, but much better here: the machine beat the human for the competing speaker. However, the picture is a bit strange: GP is doing well, but the SII-based method, Bastiaan's, which in the objective measures looked similar to natural speech, is really bad here: you would have to decrease the natural speech by 2 dB to get the same intelligibility as the method optimized according to SII. So, lower the volume of natural speech by 2 dB in order to have the same intelligibility: that is actually a loss. Now, this method, and there is nothing personal about the colours, this one is SSDRC, and you can see it consistently improves the intelligibility. This one here, GP, is the winner in this particular case; however, if we test whether that difference is significant, we
can find that this is not significant while the others are significant on synthetic speech we collaborate with people at the university of andyburg and here are the results this is normal natural speech this is natural speech this is tts and as we go along this axis the conditions are worse i mean the s and r drops it's more difficult so you see that as we do that the tts output is less and less intelligible but now if we apply ssdrc as post processing filter on the synthetic speech this is what we get so this becomes like that this becomes like that this becomes like that even better than the natural speech of course if you apply ssdrc on this then you will have also gains but nevertheless that shows how also we can improve the intelligibility of synthetic speech now with toshiba in japan we made some tests because all these were near filled but we can do also far filled for instance imagine that we have a loud speaker and you want to make an announcement in a big area and outside in the city so this is called far filled and here we measure this is the simulation 120 meters away from the speaker from the loud speaker and this is outdoor test with 100 meters and 200 meters so it shows that indeed the intelligibility is increasing when we use ssdrc contrary to plain that means unmodified speech and that has recently has been accepted and it will be probably it is already available and Masami Akamina was the co-author of that work okay so first conclusions is that ssdrc indeed it is very quite nice and effective provides big gains a it is robust no artifacts are generated and I can show that it is even for near and far filled is shows good results and can be run also in real time now however when we made all these things we assumed that the unmodified signal and the modified we must have the same energy so but that is not what the humans perceive because we do not listen by energy we listen and we let's say the perceptual energy is called loudness so all these 
results we have seen are under the constraint of equal energy, equal RMS. Now we have developed the same methods under the constraint of equal loudness. The problem with that is that we have to repeat all these experiments again. But what does it mean in practice? You can keep the same energy, but since you move energy around, say into three to five kilohertz, then you increase, probably not the intelligibility, but the audibility, and that can be a concern: when you listen to two sounds, you say, "but that sound has more volume". That is the problem when you do not consider loudness. If you equalize loudness, then you listen to the two sounds, you have done your modifications, and the one and the other sound as if they have the same perceptual energy. There is an ITU standard, which is under discussion to be redesigned, and there are about five or seven models of perceptual energy, and they are competing. Because energy we know from physics; we got the term in signals and we say the sum of x squared. But what is the perceptual energy, the loudness? It is an estimation, and there are models: for instance, a very famous model, which is close to the ITU standard, is Brian Moore's model from the University of Cambridge, and there are also very good loudness models from Germany. There are loudness measurements, for example in broadcasting; there are standards. They definitely take the energy, but then they measure masking effects in the time-frequency domain, the effects of the outer and middle ear characteristics, and the inner ear characteristics; all of these are combined into a measurement of loudness, which can be one number for a sentence, or can be per frame, so we have a time-varying loudness. We now have a paper at JASA that talks about how to equalize energy taking into account the perceptual energy, the time-varying loudness, and a
paper at Interspeech; if you are around there, we will show results on intelligibility using loudness. We will show the loudness model and the dynamic range compression. In both cases you redistribute energy over the time-frequency plane; that is what you do here too, you redistribute energy, but you should not keep the energy the same, you have to keep the perceptual energy the same; it is more constrained. Because again, somebody will say: yes, you have done a good job, but you do not increase intelligibility, you increase the audibility, which is different. However, if you show good results under this constraint, nobody can object about audibility; you have indeed touched intelligibility. In short, at Interspeech we introduce, for the first time in this context, equal-loudness constraints, and then I believe, Simon, that we have to repeat all the tests we have done under that constraint; we will probably need another EU project for that. So these are the results. Here I have plain speech, this is SSDRC, this is normal hearing; we will see another condition, speech-shaped noise and competing speaker. There are plain speech, SSDRC, and another method that we developed, actually at Toshiba in Cambridge, which is inspired by the method TSER, the time-domain spectral energy relocation, used by a broadcasting company in Japan but mainly based on earlier work from MIT. We have done a frequency-domain, very fast spectral energy relocation and combined it with dynamic range compression; that means we use it as a new spectral shaper in order to improve SSDRC in the competing-speaker case. And this is another work, from Petko and Bastiaan Kleijn, published recently. All the methods do well for speech-shaped noise; they do better than the plain, unmodified speech. SSDRC and FSER plus DRC are doing very well, better than the others. As you go towards this
way, it is easier, so you mainly reach the ceiling effect and there is no interest there any more. When we go to the competing speaker, the results are more interesting. First of all, we see that this new method of ours outperforms SSDRC, which is good. The method from Japan and MIT is not doing very well; it is very close to plain speech, as you can see. And the methods from Petko do not do very well either, especially for the low and medium SNR conditions; those are the harder conditions, this one is easier. So it shows that, under equal loudness, SSDRC and the other methods improve not only audibility but indeed intelligibility. Now, we also went to hearing-impaired people. These people do not have hearing aids; they just have some problems, they have lost the ability to hear the high frequencies effectively. Again for speech-shaped noise, hearing-impaired is here, and in this condition we improve from 50 percent to nearly 90, 85 percent. Amazing: these people can get a lot of help based on these results. Here, however, the SNRs are not like the ones we had for normal hearing; we make it easier for them, so for all these SNRs, when we say low or medium, the low is probably what was medium for normal hearing, and so on; there is a shift. It is not necessary to talk more about this now. We did a lot more tests: casual-to-clear speech, for instance; we actually had a PhD student, Maria, who had her viva in December, on how to move from casual to clear speech. On synthetic speech we worked with Daniel Erro to do the modification inside the vocoder, before synthesizing, not as post-processing. We are doing some very interesting work on special groups, like kids with auditory processing disorder and dyslexia. And there is this speciality: noise-dependent SSDRC. SSDRC is independent of the noise; it does not
listen, actually. But if we put an ear on SSDRC, we can listen to the type of noise and then improve SSDRC further by taking it into account, like the optimization algorithms do; but we have to do it in a way that does not destroy the quality, and that has also been shown at ICASSP a year or two ago. We made a special session at Interspeech and then a special issue in Computer Speech and Language, the second one, and we have developed a real-time SSDRC, which was shown at the Show and Tell at ICASSP. Yes, please? [Question:] The noise-dependent SSDRC: is it fair to say it is a kind of combination of the speech-shaped-noise and competing-speaker conditions? SSDRC does not really listen, it just modifies the speech, right? But if you put an ear on SSDRC, then it starts seeing what the condition is, whether it is speech-shaped noise... so the difference might be... I see, so there is non-speech noise... [Answer:] Yes, it can be speech noise, it can be anything; we listen to the environment and then we modify SSDRC accordingly. Yes, please? [Question:] About testing people with dyslexia: I am wondering why. Dyslexia is a condition where you cannot read, so speech in noise... I do not know how those two connect. [Answer:] Well, there are theories; I can support some of these theories, others argue against them. I will not talk more about the theories now, but we can talk during the coffee break. Indeed, the listening tests with people with and without dyslexia show that people with dyslexia lose intelligibility compared to people without dyslexia. Why does that happen? As I said, it happens also with kids: even if they do not have any problem, still the auditory system, or actually the brain, has not yet processed enough sounds to be able to track the teacher's voice in a noisy environment. If we have, let us say, 80 percent
intelligibility in a class, they have 40 percent, even without any problem; if they have more problems, the intelligibility goes even lower, and dyslexia is one of them. But we have to do more tests; probably in a year from now I will be able to show you more. We had some preliminary results with the University of Edinburgh, by the way, but these were very preliminary; I would like to redo them in a better context and then tell you whether the theory holds. Now, this again is necessary but not sufficient; at every level you can say the same thing. That is why we created the new project called ENRICH, because we are targeting enriched communication. Intelligibility is good but not enough; the question is how much listening effort, how much cognitive effort, we have to put in to understand the message. You could understand it, but in a multitasking society this matters. We have 14 projects across many universities; you will see the universities, the partners, at this link. The tasks fall under three pillars: reducing the listening effort, so not only increasing intelligibility but also reducing the listening effort, which is one more constraint; enriching, or augmenting, speech in some way so as to reduce the listening effort; and benefits, as we said before, for individuals and groups. There is a call for PhDs; this programme is to train PhD students, so it is probably not for you, but if you know people who are interested, please point them to this link. We will start interviews by Skype, and then we will have face-to-face meetings in London in November for a final decision. So if you think this is a nice environment to work in, please let your colleagues or friends know about it. These are some key papers to read, and some of the references I showed during the talk are
here. That is it; thank you for your attention. In the demo session, which is in the afternoon (I actually have it here, but it is time for a break), I will show you, among other things, a real-time version of SSDRC. That is all about intelligibility. Questions? More questions? Yes? We will allow, you know, three questions for you, especially for you, not more; I am kidding. [Question:] Can you speak a little more about the subjective evaluation? Was it done with headphones? [Answer:] Headphones, yes. [Question:] But you did one outdoor test; can you comment on that? Was the outdoor test also with headphones, simulating the condition? [Answer:] (That was my son asking.) I have to get back to you on the specifics. We had a reviewer asking exactly the same thing, it was probably not you, and Catalin, who is the first author, replied after consulting Toshiba Japan, but I do not remember the answer. Definitely, for all the other tests, Hurricane and so on, we used headphones; that I know. [Question:] Would you consider something more radical, like taking the actual samples out into a public space, a crowded bar for example? [Answer:] Actually, I did that, with my son and someone else. We went to the Eagle pub in Cambridge, which was quite noisy, and we were listening to a story with and without SSDRC; the result was impressive. [Question:] But you just did that informally, amongst yourselves? [Answer:] Yes, but you could feel the difference; it was so obvious. The big thing, however, is that this works on clean speech. How does it work on noisy speech? That is another constraint. Can you do better; I mean, can you improve the intelligibility of noisy speech? Then, beyond face-to-face communication and telecommunication, you can improve there too, and
you have a bigger area of applications. Yes, please? [Question:] You mentioned briefly that, in the context of synthetic speech, you tried tuning the vocoder. Could you maybe explain a bit what you did? I think in the work of Tuomo Raitio some years ago there were pretty nice results in this realm of tuning the vocoder. [Answer:] That was indeed work we did with Daniel Erro, from Aholab in Bilbao, Spain. Daniel has a sinusoidal representation for speech synthesis, so we redesigned SSDRC to work with the harmonics and do similar things, but inside the vocoder, before synthesizing; it was mainly a redesign of SSDRC for a harmonic vocoder. Now, the work you mentioned by Tuomo, and other people too (Emma is also working on intelligibility in your lab), they do very well. However, in the Hurricane challenge we had two challenges: one in-house, with just the partners of LISTA, and the other open; we had many entries, and the results I showed you are representative. I think those experiments took place before Tuomo's work, though he had some input there. Yes, please? [Question:] You are modifying speech in this method; would there be any impact on speech naturalness? [Answer:] Ah, that is a good one. Of course, you might have a problem with speech naturalness; the question is to what extent, and that is the critical point. For instance, with these optimization algorithms, if the noise is coming in, you just try to avoid the noise, and if you do not take into account that the modified signal should still sound like speech, that is a problem. That is why, in the competing-speaker case you saw, those optimization algorithms do not perform very well: they do not take into account the properties of the speech sound. There is actually a PhD that I
will examine at the University of Sheffield that tries to solve this problem. But it is an example that if you go above a certain level, the results will not be good. Now, if you change the quality of the sound a little, assuming you are listening in a noisy environment, it will probably not be noticeable. I will show you a real-time demo here when we come back from the poster session; I will have the demos from industry and the lecturers, and I will do two demos, one of which will be this real-time demo. There I have the possibility to decrease or increase the environmental noise, and you will hear the quality change with or without SSDRC, in real time. [Question:] You have conducted a lot of subjective studies; how good are the objective, instrumental methods at predicting intelligibility in your cases? I would assume rather low performance here. [Answer:] Well, indeed, the SII is a good measure, but not enough, and the same story holds for the others. Martin Cooke, from Ikerbasque, in the end had a lot of data from the Hurricane challenge, so many evaluations, so many systems, and he came up with a modified version of the glimpse proportion method; he has shown good correlations, above 80 percent, for these conditions. Okay, thank you very much, and let us make a break.
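For readers who want to experiment with the idea discussed in the talk, here is a minimal sketch of the two SSDRC stages: spectral shaping followed by dynamic range compression, with the output rescaled to the input RMS, which is the equal-energy constraint mentioned above. This is not the published algorithm: the real spectral-shaping stage is adaptive, whereas here a fixed boost of the formant-carrying band stands in for it, and all parameter names and values (`boost_lo`, `boost_hi`, `gain_db`, `drc_exponent`, `env_win_ms`) are illustrative assumptions, not values from the papers.

```python
import numpy as np

def ssdrc_sketch(x, fs, boost_lo=1000.0, boost_hi=4000.0, gain_db=6.0,
                 drc_exponent=0.5, env_win_ms=10.0):
    """Toy illustration of the SSDRC idea (NOT the published method):
    spectral shaping (SS) + dynamic range compression (DRC), output
    rescaled to the input RMS (the equal-energy constraint)."""
    # --- Spectral shaping: fixed boost of a formant-region band via the FFT.
    # (The real method sharpens the spectrum adaptively, frame by frame.)
    X = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    shape = np.ones_like(freqs)
    band = (freqs >= boost_lo) & (freqs <= boost_hi)
    shape[band] = 10.0 ** (gain_db / 20.0)          # flat gain in the band
    y = np.fft.irfft(X * shape, n=len(x))
    # --- Dynamic range compression: envelope-based gain with exponent < 1,
    # which attenuates loud segments and lifts quiet ones.
    win = max(1, int(env_win_ms * 1e-3 * fs))
    env = np.convolve(np.abs(y), np.ones(win) / win, mode="same")
    env = np.maximum(env, 1e-8)                     # avoid division by zero
    y = y * env ** (drc_exponent - 1.0)
    # --- Equal-RMS constraint: no overall level (energy) advantage.
    rms_in = np.sqrt(np.mean(x ** 2))
    rms_out = np.sqrt(np.mean(y ** 2))
    return y * (rms_in / max(rms_out, 1e-12))
```

Under the equal-loudness constraint discussed above, the final RMS rescaling would instead be driven by a loudness model (for example Moore's), which this sketch does not attempt.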