It's great to be here, and thank you, Yannis, for the invitation. I'll be talking about concatenative, unit selection synthesis, which is one of the most established approaches to text-to-speech. First we'll do some scoping, then we'll go through the core components of a unit selection system, how the units are selected and how they are concatenated, then some discussion on tuning the system and on building the corpus behind it, and then some closing thoughts and a peek into the hands-on session that's in the afternoon. We'll split it over the two hours, depending on how much time we need for each part.

Okay, so let's do some scoping. There are a lot of things in common between unit selection text-to-speech and diphone-based text-to-speech. They both work in the time domain, usually. They both employ diphones as their main units, more about this in a second. And what they do is they slice segments of natural speech and then concatenate them to generate synthetic speech. However, a critical difference between diphone and unit selection TTS is that traditionally, when we say diphone-based text-to-speech, we refer to systems that have only one instance per diphone type, which is very carefully recorded and stored in a database and manipulated at synthesis time as necessary. In unit selection, on the other hand, we have multiple instances of each phoneme, the phones, and that's exactly what necessitates the unit selection algorithm, which is one of the critical components of such a text-to-speech system: how to select among these tens, hundreds or thousands of instances that we have in our database. More on diphone TTS in a later lecture, I think it's Thursday, right?

There are also some important links between unit selection text-to-speech and hybrid systems, what we call hybrid systems for text-to-speech. Most of these systems are unit selection at their core but employ concepts from statistical systems to fill in some of the missing components, to use statistics to assist the unit selection. There will also be some more discussion on hybrid systems in the context of HMM-based TTS or evaluation, I'm not sure which, but Simon King will say some more about them.

So let's move on now to the core of the unit selection system. Let's see where we are now; let's get a feeling of how well we're doing in text-to-speech, as a community that is. Here is a selection of voices that I happened to be involved in developing, but there are many, many very high-quality voices available. These are all unit selection voices. Actually, most of the commercial voices you find out there are based on unit selection, so it's still our best candidate, let's say, as a system blueprint for developing quality voices.
Of course, TTS is far from being a closed field. There are many, many open issues in text-to-speech in general, but also in unit selection more specifically. The overall block diagram of a unit selection system is not different, in broad lines, from other speech synthesizers. We have a text processing part, and a much more detailed discussion of that was given this morning. Then we have the signal processing part, which is the main point of real difference between the different types of speech synthesis methodologies. Just to run very quickly through it: we have tokenization and some parsing, disambiguation and so on. We normalize the text, we turn everything into plain readable text, taking care of numerals, addresses and so on. Then we have the letter-to-sound rules that produce the phonemes that feed as input to the synthesis back end. And then we have some prosody module, whatever that is; this may deviate quite significantly between different systems. Luckily, Simon gave a detailed account of prosody this morning. And then we move to the core of the unit selection system, where we have a method to select among the various units and a method to concatenate the selected units.

So let's get deeper into that. What's the premise of a unit selection system? The premise is that synthetic speech can be generated from a corpus of pre-recorded natural speech units by appropriately selecting, modifying and concatenating them. Some key concepts stand out in this definition: what are the units, how do we select between the various instances available, and how do we modify and concatenate them?

The units, depending on the case and the application, could be segments of speech of any size: phrases, words, syllables, phones or lower than that. And of course there are variants: we have half-phones, we have diphones, half-syllables and whatever else. The most common choice for a general text-to-speech system is the diphone. When considering the granularity of the segmentation, how large a unit should be, we need to make a trade-off between two things. If the units are small, then the types of units are fewer. If the unit is a phoneme, then we have 40 or 45 phonemes, let's say, for English, depending on how you define them. If you go larger than that, the inventory grows exponentially: if you want all combinations of two phonemes, you get 40 or 45 squared, so the number immediately blows up. Smaller units are better because we need fewer of them, we have fewer classes of such segments, but we have more joins. If we splice speech into small units, we then need to concatenate many such units, and concatenation is exactly the point which introduces all the artifacts and all the problems in synthesizing speech.

Well, the most intuitive selection would be to use phones. After all, an ASR system, for example, works at the phone level. It's intuitive to write down phones, and one may then think of synthesizing by concatenating phones. Well, that's not actually how it's done. For example, consider this natural recording here; if one spliced it into phones, you would get something like this. Don't pay too much attention to the precise notation; every TTS system could use its own, it's just a notation.
Just pay attention to the boundaries between the different phones. If you observe this segmentation closely, you will see that the boundaries of the phones are usually the most unstable parts of speech, because each is the turning point from one phone to the other. Practically, the articulators are moving to different positions. So it's not quite a robust point to cut at and then concatenate at. From the point of view of cutting and gluing together, it makes much more sense to slice differently, at the diphone level. We hypothesize, and most of the time it's true, that the center, the nucleus of a phone, is its most stable region. The articulators have reached their target positions, whatever those are, and there is a slight, let's say, small interval of stability. Nothing is completely stable, of course, but it's much more stable than at the points where the articulators are rapidly moving to take up their new target positions. So it makes much more sense, at least engineering-wise, to work with such boundaries, to slice at the more stable regions and then try to glue there. So, as we said, diphones are much more commonly used for cutting and pasting.

And another bonus we get for using diphones is that we don't pay too much of a penalty if we miss the precise boundary of a phone. Splicing a corpus, a speech database, in order to produce the units that will be used in synthesis is usually done using recognition techniques, forced alignment and so on. So it's quite common for a recognition system to miss some boundaries, or to be imprecise about the boundaries of phones. I mean, returning here, it would be really difficult to say at exactly which point this cluster of consonant sounds should be separated. In many cases there is no correct point; the sounds are transient. In a diphthong articulation, you cannot say that here the A stops and then the E starts. It's quite counterintuitive to go that way. So there's often no right boundary point for a phone, and if you employ diphones, you get away with it: you don't need to decide where the A stops and the E begins.

Continuing the discussion of how we choose units for unit selection text-to-speech, there are some additional things to consider. In many unit selection TTS systems, stressed vowels are treated as different sounds. If you have a large enough corpus, you can afford it, you can treat them differently. If you have a much smaller corpus, then you might need to say that all A's are one class, and it's just a different feature value that one A is stressed while another is not. If you need to reuse the diphones a lot, you might need to fall back to that. But in general, if you have a large enough corpus, you can treat stressed vowels as different sounds. And it's not just intensity or some single characteristic that differs between stressed and unstressed sounds; many things are different. So if you can afford to treat them as separate sounds, you should probably do it.

Also, this splicing into diphones is not necessarily rigid. Some phone clusters lend themselves less to manipulation, let's say, so you would prefer not to cut there, because you would then need to glue at such positions too. For example, vowels around an R are very unstable. You do not want to cut there and have to glue them back together.
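To make the cutting scheme concrete, here is a minimal sketch of slicing a phone-level forced alignment into diphone units by cutting at phone midpoints. The alignment format and the list of fragile clusters are illustrative assumptions, not tied to any particular toolkit; the cluster handling anticipates the point continued just below.

```python
# Sketch: turn a phone-level alignment into diphone units by cutting at
# phone midpoints, the (hopefully) stable nucleus of each phone.
def phones_to_diphones(alignment, no_cut=frozenset()):
    """alignment: list of (phone, start_sec, end_sec) from forced alignment.
    Returns (unit_name, cut_from, cut_to) tuples. Pairs in `no_cut`
    (e.g. vowel+R combinations) are not split at their shared midpoint;
    they are kept whole inside a larger unit."""
    mids = [(p, (s + e) / 2.0) for p, s, e in alignment]
    units, i = [], 0
    while i < len(mids) - 1:
        (p1, m1), (p2, m2) = mids[i], mids[i + 1]
        if (p1, p2) in no_cut and i + 2 < len(mids):
            # fragile pair: skip the cut inside it, span to the next midpoint
            p3, m3 = mids[i + 2]
            units.append((f"{p1}-{p2}-{p3}", m1, m3))
            i += 2
        else:
            units.append((f"{p1}-{p2}", m1, m2))
            i += 1
    return units
```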
So, if your corpus is large enough, again, you can afford to treat clusters of such sensitive, or less robust, phones as a single unit. Never splice them; treat them as a whole, so that you never need to cut them and remix them differently, because that's quite problematic in some cases. Also, sometimes it's not such a rigid, hard decision, that I need to use diphones, or larger units, or whatever. Almost all unit selection systems employ a bias which makes them prefer selecting consecutive diphones from the database. The benefit of that is clear: selecting consecutive diphones from the database minimizes the need to concatenate things that came from different contexts. You expect this joining of consecutive units to be perfect, a full reconstruction of the original sound, and that's the ideal case. So it's often the case that a bias is introduced, for example in the form of a penalty for non-consecutive diphones, so that the system tends to select more and more consecutive units from the database. At the other end, there are cases where you don't have enough recordings in your database, or you need to keep it very small and compact, so you may run into cases where you just do not have a diphone that you need. Many systems fall back to half-phones in that case. Their normal mode of operation is to manipulate diphones, but if none is available, or if the ones available are very inappropriate for the specific context, then they go down the, let's say, risky path of cutting diphones in half and keeping just the two half-phones.

Okay. So, having an overview of what the units are and what unit selection plays with, let's have a look at how you select between them. Consider a unit selection TTS system where you have many, many instances of each diphone you will need, and it's time for the system to synthesize. For synthesizing, let's say, the word "two", you need three diphones. The notation here is irrelevant too; don't pay too much attention to how it is written, but assume that this symbolizes the silence. Everything starts with a silence and ends with a silence, so you have three diphones: silence-T, T-U, U-silence, right? And assume that you have three, four and three instances of each of those, respectively. What you need to do is explore all the possible paths that can lead to a rendering of "two", all the available combinations that is. From an algorithmic point of view, that's a graph, and you need to explore its paths in order to find the best one; we'll see what best means. A path is an actual rendering of the word "two", given the diphones that you already have in your database.

So, thinking simplistically and intuitively, there are two things you need to match when selecting diphones from your database. One is that the particular instance you select for a specific place should be appropriate for the place you're about to put it in. It would be great if we could have a T-U diphone in our database which came from a similar context, preceded by a silence and followed by a U-silence, because we have the feeling that if the phonetic environment, the prosodic environment and so on are similar, then this sound has all the characteristics required for the sound we want to synthesize.
So, intuitively speaking, the more similar the context you find in your database to the one you want to put your diphone into, the better you feel your choice will be. That's the target cost: you're looking at the place where you want to put your diphone. So, for the target cost, you're looking at the neighboring phonemes, and if you don't have a match, you might want to look at similar classes of phonemes, right? You don't have exactly an S-something, but you might have a different fricative-something, so you can fall back to classes if you don't have the precise context for the diphone you want to select. You want to look at the prosody, whatever that means; we'll talk more about that in a few seconds. And you might want to match other things, such as the part of speech, the type of phrase you're placing it in, the size of the word and of the sentence, many, many surface characteristics of the diphone you're selecting from the database, which you want to match to the location where you're placing it. So that's the target cost.

But that's not the whole story. There's also, okay, let me go quickly to the join cost, which is the second component, and then I'll return to talk some more about prosody. There's also a join cost. You may have selected the perfect candidates looking at the specification of where you want to place them, the perfect candidate for each of the diphones, and yet they do not fit together. Fitting is equally important, maybe more important, because any pair of diphones that does not fit properly together will immediately produce some unnatural sound, an artifact which is very readily perceived by the human ear. So you want to examine how the end of one diphone joins the beginning of the next one, and so on. And you need a sense of what matching is, what it means for two segments to match. You want to look at the acoustic properties, MFCCs, and you may also want to look at derivatives, because matching the plain values alone is not enough. If, for example, there was a motion for the pitch to rise while in the next unit there is a motion for the pitch to fall, this may produce an artifact, something that doesn't sound quite right. So just matching plain values may not be enough; you may need to look at derivatives. So you look at the spectral content of the two pieces you want to glue together, and at the prosodic things, and prosody is not just pitch of course, it's also duration and intensity. You don't want to mix a fast diphone, a diphone coming from a fast sentence, let's say, with a much slower diphone; this would introduce something like an arrhythmia, which again would be immediately detected by the human ear. And there are many, many other things you might want to include in your mix of things to look at when deciding whether two segments fit well together or not. Sometimes people involve quantities that have to do with the voicing source, especially when working with richer speech, emotional or expressive speech, where the voicing source becomes more and more relevant.

Okay, so there are two main components to take into account when selecting units: the target cost, where you want to place the units, and the join cost, how well the units fit together. You need to get them both right.
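As a rough illustration of these two components, here is a sketch with assumed feature names; the dictionaries, weights and feature choices are mine, for illustration, and real systems differ in all of them.

```python
import numpy as np

# Symbolic target features: mismatch penalties between the wanted context
# (spec) and the context a database candidate originally came from.
TARGET_WEIGHTS = {"left_phone": 1.0, "right_phone": 1.0, "stressed": 0.5,
                  "syllable_pos": 0.5, "phrase_final": 1.0}

def target_cost(spec, cand):
    """How far the candidate's original context is from where we want it."""
    return sum(w * (spec[f] != cand[f]) for f, w in TARGET_WEIGHTS.items())

def join_cost(left, right, w_spec=1.0, w_f0=0.5, w_df0=0.5, w_nrg=0.2):
    """How badly two candidates fit at the concatenation point: spectra
    (MFCCs), pitch, pitch slope (the derivative point made above), energy."""
    c = w_spec * np.linalg.norm(left["mfcc_end"] - right["mfcc_start"])
    c += w_f0 * abs(left["f0_end"] - right["f0_start"])
    c += w_df0 * abs(left["df0_end"] - right["df0_start"])
    c += w_nrg * abs(left["energy_end"] - right["energy_start"])
    return c
```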
Yes, please? How much overlap is there, generally, with what the vocoder uses as a representation? Are the parameters you use to calculate these distances the same as the ones in the vocoder? Well, to some degree yes, but here you do not care about reconstructing. You do not care about representing your speech signal precisely, in all its detail. All you need is an indication of how well these fit together. So you don't have to, but in practice do you? How much overlap is there? Well, I would say MFCCs, pitch and some intensity would be fine for unit selection; you don't need much more. This certainly wouldn't be enough, I guess, to reconstruct speech in a parametric synthesizer, but these are the most sensitive things to look at in unit selection. There's also another, let's say, forgiving aspect there: when it's time to glue things together, you do some overlap, and that smooths out differences to a degree, so you can get away with more. Do you perform some normalization while constructing the database, in order to help the join cost, for example for rate or intensity? Well, that's quite a large discussion; maybe we can have it when it's time for that, but let me give you some short hints. It all depends on where your data came from. If you have control over the recording process, things are quite different; you can avoid all of this. Because normalizing, even just the intensities, is anything but straightforward. You cannot do it per utterance; you need an overview of what's going on in your corpus, maybe an understanding of its different parts, because especially with richer speech, or speech coming from uncontrolled or varying recording conditions, even getting the intensity right can be a problem.

Okay. So we have a join cost and a target cost. Prosody generally falls under the target cost branch of the selection system. Prosody is, okay, F0, duration and intensity. There are mainly two ways of dealing with prosody in a unit selection system. There is an indirect way, where you do not explicitly set targets for any of the prosodic features. You do not have an explicit target value for pitch, for duration, for intensity or whatever, but you make sure to include, among the other features, features that are well correlated with pitch, with prosody. So you don't have a prosody model which you run to get target values for the pitch; you just describe where you want to place your diphone, and you rely on unit selection to pick units that fit well in that position, and moreover that fit well with each other. So even if an extreme diphone, let's say, is picked for a place, unit selection will seek good units to frame it with, to put before and after it, so that overall it sounds plausible, consistent with what is in your database, right? But of course, as Simon also said this morning, and as he will discuss some more when talking about hybrid systems, I guess, you can instead have a model of your target prosody: you train models on your data, and then you run them to get target values for the pitch, the duration and whatever other prosody-related features you need, and you use those calculated values as targets for the unit selection.
And then you don't say to unit selection, find me an A that's near a comma, a punctuation mark; you say, find me an A with a pitch of 120 Hz and a duration of 20 milliseconds or whatever, and so on and so forth. Having such models, I mean training a model to give you the predicted value for pitch, duration or whatever other prosodic features, can be convenient, because you just take the target values they calculate and plug them into the unit selection, saying, find me such units; and unit selection will go and find in your database units that match your specification, so those values become part of your input specification. But usually such models, when you train them on data, end up with just a single solution. There may be a wealth of prosodic patterns in your database, but if you train a model on the prosody, this model will only ever give you one answer to a question. Unit selection will potentially give you many, and so, along with all the load of carrying your huge database with you, you also maintain some of the richness of the prosodic patterns in your database. Because if you do not use a model, you do not need to make hard decisions of the kind: in this context the prosody is like that, in that context it's these specific numbers. Another problem with models, of course, is that they accumulate errors. If, for example, you train a model that passes through ToBI symbols, as Simon was discussing this morning, then you have one model that takes you from context to ToBI and then another model that takes you from ToBI to target values. Models do not come at no cost; each has some error, and in the end too much error may be accumulated.

Then again, carrying the entire database with you, as unit selection requires you to, has a considerable cost, and, not so much with normal voices and normal speaking styles but with expressive voices, it also has another price: uncontrollability. What I mean by that: we often came across this in cases where we experimented with very expressive voices coming out of audiobooks and the like. It's not unusual that when you change some slight thing in your input, what you get at the output is, let's say, prosodically completely different. Here is a real example from one of our experimental expressive voices. The only difference between the two utterances is just a small word and a comma. Of course, that's funny stuff. That rendering probably comes from a dialogue sentence where something great was happening or whatever, and one diphone matched so well in its context that the selection picked it, and then all the consecutive diphones to match it. Okay, that might be fine for, let's say, a toy voice, but you certainly don't want your system to have such uncontrollable behavior, where a small change in your input specification leads to a completely different rendering at the output. Any unit selection system out of the box has such problems with uncontrolled, rich speech. Well, if you look at this problem, what happened there is probably that there is a feature you did not take into account. That feature might be that the sentence the rendering came from was part of a dialogue, part of a different context, whatever that context is. You're looking at 10 features in your context, but that does not mean that this is all there is.
There may be 100 features, 100 things that can differentiate between two different renderings. There may be some feature at the semantic level, which you cannot or did not take into account, that clearly separates one rendering from the other. But as long as you do not have that feature in your system, whether because you cannot have it or do not want to have it, you are left at the mercy of luck for the selection to pick between those two renderings. Of course, there are many things you can do to get rid of such cases and be safer, let's say. You can prune, you can do several things; we'll talk some more about this later.

Okay, as Simon said this morning, there is a distinction by Paul Taylor, right, between two ways to formulate your cost function. We have a bunch of target costs and a bunch of join costs, and we need to arrive at an overall function. So you have the IFF, the independent feature formulation, as Taylor called it, which is the most usual one, I would say, for pure unit selection; and then you have the acoustic space formulation. The actual difference is whether you use an acoustic target or not, whether you calculate an acoustic target, I mean a specification, for the prosody, let's say. If you do, then you are probably in the acoustic space formulation; if you do not, and use plain features to specify the prosody only indirectly, then you're probably in the independent feature formulation. I will not go further into that, because I think the discussion of hybrid systems will go into more detail.

In the most usual case, where you work with the features and not with calculated target values, you have a set of distance measures over the features, and you try to measure how different an instance of a diphone in your database is from the place where you want to put it. So I have an instance of a diphone in my database, the same type as the one I want to employ, but taken from some context, and that context is described by, say, 10 features. Your target context, where you want to place it, has different values for those 10 features. You need a distance for how much the one context deviates from the other, in order to have a measure of how well a specific instance in your database fits the specific target position you want to place it in. So for the target cost, you have a distance: you see how much your specification deviates from what you find in your database. And for the join cost, you have another set of distances: how much your spectral parameters deviate, how much your pitch values deviate, and so forth. For each deviation you impose a penalty. And then you put in some weights, because not all deviations are born equal; some deviations are much more important than others. For example, and it's just an example, it doesn't mean anything: you may afford a larger deviation in your spectral parameters than in your pitch parameters, hoping that the overlap-add that follows the unit selection will smooth out the difference and produce a rather smooth transition from the one acoustic setting to the other. So you select weights, weights by which you multiply each deviation; then you sum the whole thing up and get a single measure of how good a diphone is for a specific place and a specific context.
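Written out, the weighted sum being described is essentially the classic Hunt and Black formulation: with $t_i$ the target specification and $u_i$ the candidate unit at position $i$,

$$C^{t}(t_i,u_i)=\sum_{j} w^{t}_{j}\,C^{t}_{j}(t_i,u_i),\qquad C^{c}(u_{i-1},u_i)=\sum_{k} w^{c}_{k}\,C^{c}_{k}(u_{i-1},u_i),$$

and the selected sequence is the one minimizing the total cost

$$\hat{u}_{1}^{n}=\arg\min_{u_1,\dots,u_n}\Big[\sum_{i=1}^{n} C^{t}(t_i,u_i)+\sum_{i=2}^{n} C^{c}(u_{i-1},u_i)\Big].$$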
And then you have a target cost for this instance, a different target cost for that instance, and a different one for that instance. Then you have a join cost for each pair of instances, this and this, this and that, and similarly for the others. And what you do is run through each path, as you do in a typical Viterbi-style search, and accumulate the cost. If you go down the very first path, you have the target cost for this instance, plus the join cost for this pair, plus the target cost for this instance, plus the join cost for this pair, and so on. Each path bears some cost; you accumulate it, and when you arrive at the end you see which of the full paths has the lowest cost, and you consider that to be the best path. Each path is a combination of diphones; actually, each path is a solution to your unit selection problem. And then it's time to take action.

Before that, here is just a short indication of the usual features that unit selection systems employ, for the target component and for the join component. For the target component: the phonetic context, left and right, i.e. how similar the phones before and after are between your target and the database instance you're measuring. The part of speech, which you want to be similar. The position within the syllable and the word: is it at the beginning, the middle or the end? The prosodic boundary events: are we near some significant prosodic event, just before a comma, some punctuation, or sentence-final? These are significant prosodic events; you expect the prosody to be quite different at those points, so you expect those features of the context to be critical for selecting the proper unit. You don't want to pick an A for the end of a sentence that comes from the beginning of a sentence; you expect very different things to happen prosody-wise in the two positions. And at the join cost, as we said, it's the spectral features, F0, duration, intensity and so on. In the join cost there may also be the bias we mentioned, where you favor retrieving consecutive units from your database.

And in your target cost you may also start considering things such as style. Maybe this will become clearer later on, when we talk some more about expressive speech, but let me give you a short description now. If you develop a synthetic voice from an audiobook, you might expect your audiobook to have a narration, some narrative utterances, and some dialogue utterances, and you would intuitively expect those to be quite different in their properties. So you would probably not want to synthesize a sentence mixing some diphones from a dialogue part, I mean character imitation, with some diphones from the narration part. If you need to, you will fall back to picking whatever you can find rather than a diphone with the exact specification you want, but you would like to favor picking diphones from the database that are similar in style, whatever that means in the application at hand, to what you want to synthesize. This would qualify as a target cost, because it's part of the specification of where you want to put your diphone.

So, after you have done all these exercises and found a good sequence of diphones, the one your unit selection algorithm pointed out as the best choice there is in your graph, it's time to take some action and actually concatenate them.
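Before moving on to concatenation, here is a sketch of the path search just described: dynamic programming (Viterbi) over the lattice of candidates, reusing the hypothetical target_cost and join_cost functions from the earlier sketch. `lattice[i]` holds the candidate instances for diphone slot i, and `specs[i]` the target specification for that slot.

```python
def viterbi_select(specs, lattice, target_cost, join_cost):
    # best[i][k] = (cheapest cost of any path ending in candidate k at
    #               slot i, backpointer to the chosen candidate at i - 1)
    best = [[(target_cost(specs[0], c), None) for c in lattice[0]]]
    for i in range(1, len(lattice)):
        row = []
        for cand in lattice[i]:
            prev_cost, prev_k = min(
                (best[i - 1][k][0] + join_cost(p, cand), k)
                for k, p in enumerate(lattice[i - 1]))
            row.append((prev_cost + target_cost(specs[i], cand), prev_k))
        best.append(row)
    # backtrace from the cheapest complete path
    k = min(range(len(best[-1])), key=lambda j: best[-1][j][0])
    path = []
    for i in range(len(lattice) - 1, -1, -1):
        path.append(lattice[i][k])
        k = best[i][k][1]
    return list(reversed(path))
```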
The standard method for doing that is an algorithm called pitch synchronous overlap-add, PSOLA. The idea is quite simple. It requires, at design time, I mean when you prepare your database, when you segment it and do whatever you need to have it ready for synthesis, one additional thing: finding the pitch marks. The pitch marks are some characteristic point of each pitch period, and this concerns the voiced sounds: some specific point in the pitch period, for example the highest-energy point in each period. So you have a pitch mark here, and a pitch mark here, and a pitch mark here, and so on. Actually, the truth is it's not important that you pick the highest point. What is important is to be consistent about it, and not only within a specific utterance, but throughout your database. Because your unit selection system will not only concatenate diphones coming from the same utterance or the same paragraph; it may concatenate diphones coming from anywhere in your entire corpus. So it doesn't need to be the glottal closure instant or some specific point carefully chosen within the pitch period, as long as it is consistently calculated throughout your corpus. So at design time you need to calculate the pitch marks; that's part of the database building process, your offline process.

Then at runtime, when you need to actually perform the concatenation, you apply a window, so you kind of separate out one pitch period plus some context, some frame around it, and you do this for each of the periods. If you do so, you can then push them further apart, or pull them closer together, and this is perceived as making the pitch lower or higher. So increasing or decreasing the pitch actually comes down to placing your windowed pitch periods closer together or further apart. As a side bonus, you can also change the duration: if, by some specific strategy, you start duplicating pitch periods, you increase the duration of your segment without affecting its acoustic properties. You are not changing any of the qualities of your signal, not its spectral qualities; you only affect its duration. So if you maintain the distance between your pitch periods and just repeat a period here and there, or drop one, you can increase or decrease the duration without any significant degradation in what you hear. Of course there are limitations. If you move things around too much, you will start hearing artifacts; you would expect large changes to have audible effects and to degrade the quality of your signal. But if you keep the degree of change low, you can change the pitch and the duration without the signal losing its character, its qualities. Small modifications are okay.
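Here is a minimal numpy sketch of the TD-PSOLA mechanics just described: window one period of context on either side of each pitch mark, then overlap-add the windowed grains at new spacings. Real implementations put their care into mark placement and duration handling (duplicating or dropping periods), which this deliberately skips.

```python
import numpy as np

def psola_pitch_shift(signal, pitch_marks, factor):
    """factor > 1 raises pitch (marks re-spaced closer), < 1 lowers it.
    pitch_marks: sample indices of consistently placed per-period anchors."""
    periods = np.diff(pitch_marks)
    out_len = int(len(signal) / min(factor, 1.0)) + 1
    out, norm = np.zeros(out_len), np.zeros(out_len)
    t_out = float(pitch_marks[0])
    for i in range(1, len(pitch_marks) - 1):
        left, right = periods[i - 1], periods[i]
        seg = signal[pitch_marks[i] - left: pitch_marks[i] + right]
        win = np.hanning(len(seg))              # one windowed "grain"
        a = int(round(t_out)) - left
        b = a + len(seg)
        if a < 0 or b > out_len:
            break
        out[a:b] += seg * win
        norm[a:b] += win
        t_out += periods[i] / factor            # re-spaced mark positions
    norm[norm == 0] = 1.0
    return out / norm                           # compensate window overlap
```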
Okay, so this is... yes, please? Does this somehow affect the join costs when you select the actual units? Because maybe, through changing the pitch of two consecutive units, you might connect units that don't actually fit that well together but have a better target cost, compared to units whose target cost isn't as good but which already fit well together. Well, what we select on in the end is a bag of things; everything is summed up together. You do not have, I mean usually you don't have, different ways of manipulating the different components; you just have a weight. You could express the strategy you just described through the weights, giving a larger or smaller weight to something; maybe you could implement it that way. But generally, yes, you can get away with some small deviations in your pitch by employing this, and this is actually one of the main things you use it for: to move things around a bit at the concatenation points. One of the things you can smooth this way is small rate or pitch deviations; that's straightforward to do through pitch synchronous overlap-add. But it will only take you up to a point; you cannot go too far. Okay, thanks.

So, should I move on, or are we on time? Okay, maybe better to have a break, because the rest of the part is... That's fine. Okay, so let's take like a ten-minute break and then come back.

Let's now move on to some discussion of how we could tune the unit selection system. Obviously, some of the most important parameters of a unit selection system are its weights. The two obvious ways to come up with decent values for the weights are to set them manually or automatically. Setting weights manually still seems to be the most common case. Of course, as with many other things in unit selection TTS, it's not exactly science: there is some intuition, some listening tests, some trial and error, and a lot of engineering involved in arriving at a good set of weights for your cost function. It's very labor-intensive, since it involves a lot of listening, experimenting and so on. And one hopes that one's unit selection cost will be set well enough to handle different contexts. That means your weights should ideally be the same for different unit types, different contexts, different speakers, different styles and different languages. If you have to retune your unit selection cost for each of these, you are in a very bad position: you would continuously be doing nothing else than tuning your unit selection weights. So it's very important, mainly through intuition, to arrive at features that generalize well, that make sense for different speakers, languages and so on, so that you don't have to retune for each new speaker or language you want to port your system to. But in some cases it might be necessary, for example, to use a different set of weights for different unit types: one set of weights for joining, say, CV diphones, and a different one for joining diphones that involve difficult sounds, for English the R sounds. So depending on the case, the features, your needs and the application, you hope you can do with only one set of weights, but you might find yourself with several.

Having said that, it's very tempting to try to automate the process of tuning the unit selection cost weights. But to train weights, what you need is an error, an error measured somewhere, and a rule that, based on that error, changes the parameters of the system producing it. But where do you measure the error in a unit selection system? Theoretically at its output; its output is speech, but what do you compare this speech to in order to arrive at an error? Do you have some target speech? You don't; it's synthesis. There is no reference rendition of what you need to say from which you could calculate an error and then feed this error back to change the parameters through some learning process.
So, two indicative strategies that one could exploit for training. One is a supervised way: for example, there is a strategy that takes the speech database, leaves 10% of it out, develops a synthesizer using the other 90%, then tries to synthesize each of the sentences in the held-out 10% and calculates some errors against the natural recordings. I don't know how successful that is, but it certainly cannot generalize to all the circumstances in which you would like to train your unit selection text-to-speech system. Or one could follow a more unsupervised style of training the weights, by observing that there is a wealth of good examples of speech within the speech database itself. If you look in your database, there is a very large number of good examples of joins, for example: every join inside your database is great. So you have a case where you have good examples but no bad examples, and it's practically impossible to create bad examples in order to train, for example, a supervised classification system. So, at least for the join cost, one could try to employ one-class classification strategies, that is, strategies that rely only on good data, on positive data of a single class, in order to develop a system that can classify something as belonging to that class or not. I find that quite interesting, but let me repeat that you will probably not get away without some manual tuning.

What do you mean by one class? What is the class here? There's a family of classifiers called one-class classifiers that rely only on positive data, let's say one-class data. You give it a bunch of examples of data that belong to a class, and through those it formulates a model, let's say, of that class, and can then answer the question: is this new thing part of the class or not? And how would this work, for example, with the variability in speech rate and so on, where the speech material varies across different conditions? Well, the most obvious example is to replace the join cost part of the function with such a classifier. You look at the joins in your database: you have your database, you have segmented it, so you have a bunch of joins there. You take the features from the left part of the join and the features from the right part of the join, and this combined vector is a good example of things that fit well together. Extrapolating to runtime, you then get two candidates for joining, and you can get a sense of how good, how natural, how consistent with your database their join would be.
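A sketch of that idea with scikit-learn's OneClassSVM (my choice of classifier here; any one-class method would do). Every within-utterance boundary in the database supplies a positive example; at runtime the model's decision score, negated, can stand in for a join cost.

```python
import numpy as np
from sklearn.svm import OneClassSVM

def fit_join_model(natural_join_vectors):
    """natural_join_vectors: (n_joins, n_features) array, each row the
    concatenated [left-side | right-side] features at a natural boundary."""
    model = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale")
    model.fit(natural_join_vectors)
    return model

def join_cost_from_model(model, left_feats, right_feats):
    """Higher decision score = deeper inside the class of natural joins,
    so negate it to get something that behaves like a cost."""
    v = np.concatenate([left_feats, right_feats]).reshape(1, -1)
    return -float(model.decision_function(v)[0])
```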
Okay, having made this reference to automatic training: evaluation is important. It would be great if we had a system that could perform an objective evaluation that correlates well with how humans evaluate speech. If we had such a magical system, it could close the loop: we could put it in a feedback loop, generate speech, evaluate it through this automatic system, get an error value, which we are so sorely missing, and use this error value to tune the parameters. So all the work already taking place in automatic evaluation could also lead to better settings for TTS.

There's another concept: pruning. The complexity of the unit selection search grows dramatically with the size of your database. If you have not 10 but 20 instances of each unit, this multiplies all the dimensions of your problem, and soon you end up with millions or billions of paths that you need to evaluate. So, in one way or another, you will need a pruning strategy. There are two things you want to do. You want to prune offline, which means chopping down the database, throwing away parts of it in order to keep it smaller. But you also need some online pruning: given the database, you would like to avoid evaluating the full graph of the unit selection, to throw paths away rather than keep expanding them, because their number is getting larger and larger. For the online pruning, one simple strategy, for example, is to keep track of the paths in your graph, and if the accumulated cost of one path is getting too large compared to the best one in your collection, you just stop expanding it, you throw this path away, in order to keep the number of paths manageable. Or you could employ some pre-selection strategy: you do not pull out of your database all the instances of the specific class you are looking for, all the instances of a diphone, but you throw away, before even starting to expand, the instances whose critical features do not fit well. And there are plenty of ways to do that.

Now, regarding the offline pruning, that is, which parts of your database to throw away, there are many choices, again depending on your problem. Developing a speech database involves many steps, such as segmenting into phonemes or diphones. So, things that did not score well there, instances where you are not too confident that the recognizer segmented them successfully, just throw away. This is also a safety policy, an insurance policy, to keep bad units out of your database. You can also, let's say, sniff out that something is not quite right by looking at its features: if an instance has double the duration of most instances of its class, or is far too short or too long, there might be something wrong with it. So in some ways you can sniff out outliers, and if you can afford it, throw them away; this again keeps the number of units smaller. In unit selection we maintain this whole large database just to keep our options open; so if some diphones do not offer us additional options, they do not belong there, and we can throw them away. For example, if one has a way to measure how similar two diphones are in terms of features, acoustic or whatever, then one could throw away diphones that are near-duplicates, since having identical diphones does not offer much more to your system, broadly speaking. Finally, one thing to consider when pruning is that some diphones might not be that similar to others, and yet, due to the workings or the specific tuning of your unit selection algorithm, they just never get selected. So after you have designed your text-to-speech system, you can run it over an enormous amount of text and, at the end of the day, measure the frequency with which each diphone has been used, not as an absolute number but relative to the others. A frequency of zero would mean that your unit selection algorithm is blind to those diphones; it never even looks at them, never uses them, so it might well be safe to throw them away. And when throwing away diphones, usually what you want to do is throw away whole phrases, from silence to silence, or complete utterances. You usually do not want to throw away just one diphone and leave the neighboring diphones hanging there.
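Going back to the online side for a moment, here is a minimal sketch of the beam idea described above: after expanding each slot, partial paths too far above the current best are dropped, and the frontier is capped. The `expand` callback and both beam parameters are placeholders.

```python
def beam_search(initial_paths, n_slots, expand, beam_width=50, beam_cost=10.0):
    """initial_paths: list of (cost, path) tuples for slot 0.
    expand(cost, path, slot) -> extended (cost, path) tuples for `slot`."""
    frontier = initial_paths
    for slot in range(1, n_slots):
        candidates = []
        for cost, path in frontier:
            candidates.extend(expand(cost, path, slot))
        best = min(c for c, _ in candidates)
        # keep paths within beam_cost of the best, capped at beam_width
        survivors = [cp for cp in candidates if cp[0] <= best + beam_cost]
        survivors.sort(key=lambda cp: cp[0])
        frontier = survivors[:beam_width]
    return min(frontier, key=lambda cp: cp[0])
```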
We have a path in unit selection, and we accumulate costs, right? These simple approaches may have some problems. For example, consider this path and that path, all other things being equal. If these join costs are like this and those join costs are like that, then the unit selection algorithm would pick this path as the best, because its score (these are not actual numbers, but you get the picture) appears lower than the first path's. But is it actually better? All these small join cost errors could go unnoticed by the human ear, while in the other path the joins are great except for one isolated but large exception, which might be significant and clearly perceptible to a human listener. So numerically this seems to be the better path, but the other might actually be preferable to a human listener. So maybe you shouldn't just blindly accumulate cost. Maybe you could have a strategy with a concept of good enough: everything below some threshold is no cost at all; it's okay, it's not perceivable by a listener, so consider it zero. If you did that, then the path with the many small errors would stand out as the better one. But of course you have just moved the problem to a different question, and now you need to answer what the threshold is. There are maybe quite a few ways to go about answering that, but I think it makes sense to have this in mind when designing cost functions.

Something quite good about unit selection text-to-speech, something that pays you back for all the heavy lifting of carrying your entire database around, is that you don't have only one possible rendering of your utterance: you have in your database many, many different renderings. All these renderings are there, and you can have strategies to recall them. You don't need to confine yourself to the best path that your unit selection suggests. There are many ways to pick a second best to the one the unit selection algorithm chooses: don't choose the best path in your graph, choose a different path; or exclude all the diphones that the best path includes, run your unit selection again, and see what different path it comes up with. Listen to this sample, which employs the second strategy with nothing further: the first version is what the unit selection gives as the best; then we exclude from the unit selection all the diphones participating in the best path, and listen to what it comes up with. Okay, these are quite different renderings, right? I'm not sure you can describe why they are different, or how different they are, but they are different renderings. And, for example, for interactive TTS, where some dialogue designer is designing a dialogue that involves TTS, these might be quite useful options to provide. I don't know if somebody could put a label on the different renderings and the features that distinguish them; it's probably too nuanced to describe. The focus in the first rendering might be at a different place than in the second, but you cannot exactly say that it is focus, or what it is. Anyway, these things are quite difficult to describe, even to specify, but it's nice, and in certain circumstances very useful, to have the options. And unit selection can provide many, many options: you just apply some naive strategy of choosing different paths and you immediately have different renderings.
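Two small sketches for the ideas in this passage, both assuming the hypothetical functions from the earlier sketches: a "good enough" floor on join costs, and the naive alternative-renderings strategy of banning the best path's instances and searching again.

```python
def floored(cost, floor=0.5):
    """Treat sub-threshold join costs as imperceptible; the floor value
    itself is exactly the open question raised above."""
    return 0.0 if cost < floor else cost

def n_renderings(specs, lattice, target_cost, join_cost, n=3):
    """Produce n distinct renderings by repeatedly excluding the instances
    used by the previous best path (viterbi_select from the earlier sketch)."""
    renderings = []
    for _ in range(n):
        path = viterbi_select(specs, lattice, target_cost, join_cost)
        renderings.append(path)
        used = {id(u) for u in path}
        # drop used instances; keep a slot unchanged if it would empty out
        lattice = [[u for u in row if id(u) not in used] or row
                   for row in lattice]
    return renderings
```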
So all the patterns present in the database are there to be had. This is what you get in return for the cost of carrying the whole database with you.

So far we have looked at the TTS system itself, but there is a bunch of processes around it. For example, in the premise of unit selection, the keyword is corpus. We assume that we have a corpus, a database, from which we draw our diphones, and we select and concatenate them and so on. But where do we get the corpus? You can use a pre-existing open corpus that is carefully designed, for example balanced, or curated in some way; there are a number of open speech corpora that one can readily use to develop a TTS system. Or you can feel more lucky and go and take some resources that are free and open for you to use but are more challenging, because they are not prepared as carefully as those resources are. And there is also the final choice, that you build your tailor-made voice, your own voice, which involves quite a large number of things.

In a corpus, in a database (I use these terms interchangeably, but I think you get the idea), coverage is the key. Unit selection looks for diphones that conform to a specific specification, in order to use them to build sentences. So at the very least you need to make sure that the database contains all the instances that the unit selection will require: you need all units in all possible contexts. Of course, that's not always possible. So what do we mean by different contexts? A context is, narrowly, the set of features that we describe and take into account in unit selection; but in general, a context is anything that can affect how a sound, a diphone, a word or whatever, comes out. So you have the phonetic context, what precedes and what follows a diphone you're using, and you can ensure coverage of this: there are 40 or 45 phones, there are a thousand and a half diphones, so you can be rigorous, careful and thoughtful, and make sure that you prepare and record a corpus that has all the phonetic contexts you may require. But then you also need to make sure your diphones are present in the kinds of places you want to put them, because the context is not only phonetic but also prosodic, linguistic and so on, and you may want to take that into account too. Of course, there is no closed answer here: if you are designing an expressive synthesizer, things are quite different than when you design a more formal voice meant to read the news or whatever; the contexts you need to take into account differ a lot between these two systems. And sometimes you just don't get to choose your database: you use an audiobook, and that's what there is.
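As a sketch of what checking phonetic coverage might look like when designing a recording script: count the diphone types a candidate script would contribute, against the full inventory. The g2p (grapheme-to-phoneme) callback stands in for whatever front end is available, and treating silence as just another phone is my simplification.

```python
from collections import Counter

def diphone_coverage(sentences, g2p, phone_set):
    """Fraction of the diphone inventory covered by a script, plus
    per-diphone counts (useful for spotting units with few instances)."""
    inventory = {(a, b) for a in phone_set for b in phone_set}
    counts = Counter()
    for sentence in sentences:
        phones = ["sil"] + g2p(sentence) + ["sil"]
        counts.update(zip(phones, phones[1:]))
    covered = set(counts) & inventory
    return len(covered) / len(inventory), counts
```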
This is not meant to scare anybody, but here is maybe a block diagram of the overall thing. Here we have the TTS system, and of course all the linguistic resources, for expanding pronunciations and so on. We always need to keep in mind that not all speakers follow the norms, so there may be a speaker-specific pronunciation dictionary, speaker-specific exceptions for how words are expanded and so on; in the general case there are always some speaker-specific linguistic resources. And we have the database. One of the most important things is the voice-building process. The voice-building process is what produces the speech database: the database with its diphones segmented, pitch-marked, curated in many ways, labeled with the proper feature values. What you need to do is, at the first stage, preprocess, normalize, clean, whatever that takes, be it equalization of the spectrum or of the intensities; make sure that everything is, let's say, of similar quality, as much as possible. Then you have the segmentation, which is a critical part: you align the script you have with your audio. You need to close this loop and have a way of detecting errors, because scripts and audio are not always consistent. You will most probably find, in practice, if you use an audiobook of an open-source book, that things are not always said the way they are written, and you should have a strategy, a way to identify such cases as automatically as possible and correct them. And if you want a commercial system, or if performance is important, you need to encode, compress, pack, and do everything needed to keep your database compact and as small as possible.

But segmentation and voice building require a corpus. If you are handed a corpus, you don't go beyond that: you use that corpus and that's it. But if it's in your hands, you need to do all your balancing, you need to find a voice actor, you need to give specific instructions for the recordings and so on. Things are much easier now than they used to be, I guess, ten years ago. Ten years ago, many of the algorithms at the engineering level were not as robust as they are today, so you had to be quite specific and careful in giving instructions to the voice talent on how to pronounce things, so that they ended up in your corpus the way you expected them to. Now you can be more relaxed about that, and you can even afford to work with corpora recorded in a completely uncontrolled way, by volunteers, or by somebody in a home studio with a cheap microphone and so on, and still make sense of it and work with it. Of course, developing the language resources for the front end is a huge story, and an expensive one, as Simon explained this morning. And at the end of this overall thing there is the evaluation, which actually puts a grade on how all these things turned out. If you have problems in any of these processes, you will more or less find out only when it's quite late. So along the way you have many checks; you continuously check your data, your outcomes, your intermediate results, because you don't want your listening tests to be what tells you that your synthetic voice is crap.

Now I'm going to rush through some thoughts, rather than give answers, on some things about expressiveness in speech. Before one tries to imitate expressiveness, one would probably go through analyzing expressiveness, to see what it's made of in real human speech. And there are quite a few theories around speech emotion recognition; most of these define some primitive emotions, the big six, and work on those. There are categorical approaches, dimensional approaches and so on. However, thinking in terms of the big six, the joy, the fear, leaves a lot of things out. There is so much expressiveness in speech that you certainly cannot describe in terms of "this is joyful, this is sad". There is so much expressivity; it's not only emotion, it's not only the big six; there is so much going on. Yes, please? Could you tell us the big six, please? Okay, it's six primitive emotions: let's say joy, fear, I think disgust, sadness... if you just google
it, you'll get a better answer. Somebody says it's five? You'll find quite some variance on this; it depends on who is looking. There are people who say there are 700. People who work with autism mention that big number, 700 emotions. How did they come up with that? They are psychologists, they work with autistic children, and they say 700 is a good number. Think about annotating speech with 700 possible values. There are categorical approaches, dimensional approaches... but I don't know, I'd have to ask; I don't remember the name of the professor who works on autism research. What do you mean, categories, then? If there are categories, you can fit them inside a dimensional space. But I'm not sure it's really a categorization into dimensions; I think it is just a combination of some basics, that they really can distinguish 700 cases, points, in that space. And the dimensions of the space? You can have positive-negative, like the valence and arousal axes we mentioned. Yes, that is the idea here. But the point is, how many discrete points do you have in that space? And they ended up with 700 discrete points to describe the space; that's it. But it's a continuous space, not really a discrete one. Well, what is the resolution that you have in your ear? Of course it is a continuous space, but what matters is what you can perceive, and they end up with 700 because from this point to that point you cannot tell the difference, but from this one to that one there is a difference, so you identify that as a distinct point. Okay, that's the way they work. I would be very interested in seeing the cross-annotator agreement on that.

Okay, so there are many things going on in expressive speech that cannot be easily expressed in such terms, for example speaking style. I'm not sure it would be easy to discriminate between the different renderings of those sentences. It maybe partly has to do with focus, partly with some emotion; I just feel that if you ask ten people to describe it, you will end up with twelve different descriptions or whatever. Anyway, it's too rich to describe. But all those samples derive from a unit selection system, using alternative paths. Nobody specified how to say anything; we just explored some of the alternative paths. We took the best one as the first rendering, then we excluded those diphones and asked unit selection again, and it came up with the second, and so on and so on. This is quite an intuitive way to see that all the speech patterns, all the richness, are there in your database, and that you need efficient strategies to explore it.

So from that we see that a unit selection system can mimic expressiveness, I mean just render it, without having any understanding of what it's doing. But can we imitate it? What I mean is: can we go a bit deeper, and of course not have the machine understand the semantics or the full scale of what is carried in a sentence, but at least get some feeling of it? I said this discussion will only raise questions; don't expect any answers right now. A very interesting case is expressive synthetic storytelling, where the task is to deliver, through the synthetic speech, a listening experience that is equally engaging as that provided by a human storyteller. That's quite an interesting definition, and there are many questions that one comes up with when facing it. Can we, as a practical matter, discriminate between the different things
So from that we can say that a unit-selection system can mimic expressiveness, I mean just render it, without having any understanding of what it's doing. But can we imitate it? What I mean here is: can we go a bit deeper, and of course not have the machine understand the semantics or the full scale of things that are carried in a sentence, but at least get some feeling of them? I said this discussion will only raise questions; don't expect any answers right now. A very interesting case is expressive synthetic storytelling, where the task is to deliver, through synthetic speech, a listening experience that is equally engaging as that provided by a human storyteller. That's quite an interesting definition, and there are many questions one comes up with when facing it. The practical things first: can we discriminate between different things that we perceive as different styles? Can these be linked to measurable features in speech? Sure, but what these features are and how they are linked is a quite complex question. And can we actually have some data-driven way, an unsupervised clustering of different expressive patterns, and this way arrive at an emerging taxonomy of expressive styles: not emotions imposed top-down through an emotion theory, but from the data up, by clustering or other data-driven means, by just looking at the surface features of the different expressive styles? It would be quite interesting to see if a typology emerges from the data, and then face the nightmare of trying to put labels on it, if that is indeed the case.
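As a sketch of what "from the data up" could look like: standardize a handful of surface prosodic features per utterance, cluster them, and then listen to what each cluster has in common. This assumes the feature extraction is already done; the feature list in the docstring is only an example, not a recipe.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

def emergent_styles(features, n_styles=8):
    """features: (n_utterances, n_features) array of surface prosodic
    measurements, e.g. mean F0, F0 range, mean energy, speaking rate.
    Returns one cluster label per utterance; each cluster is a candidate
    expressive style that emerged from the data rather than from a theory."""
    z = StandardScaler().fit_transform(np.asarray(features))  # comparable scales
    return KMeans(n_clusters=n_styles, n_init=10).fit_predict(z)
```

The hard part, of course, is the interpretation afterwards: deciding whether a cluster is a real style or an artifact of the chosen features.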
Anyway, to comfort us a bit: in text-to-speech we don't need to have a precise understanding of what's going on. Actually, my personal feeling is that the most successful systems are the ones that understand the least of what's going on, and I don't say it just to be provocative. What I mean is that the less explicit modeling one imposes, relying more on the data than on some specific model or theory, maybe that's a better, more successful way to imitate what is in the data, instead of imposing a model that one considers correct and trying to push the data to fit that model. So in expressive TTS I think we don't need to care about the speaker's state and the difficult questions that emotion theories need to address, or at least have in their repertory. We just want to imitate it, with varying degrees of understanding of what's going on. We don't need to put labels on things; we can just use what emerges from the data. We don't need to know what to call an expressive pattern: if we know through context when to employ it, even if the system doesn't know exactly what it's doing, if it can use an expressive pattern in the proper place, that's good enough for meeting the expressive storytelling definition. And also, there is no concept of the one "right" prosody: there are many, many different prosodies one can apply to an utterance, and many of them will be perceived as equally natural.

OK, I don't know how much time we have, so let me run through some things about Blizzard. This year's Blizzard was very interesting; it tackled, I think, the full problem. Just some remarks: it's an international contest devised to better understand and compare different research techniques, and on Friday there will be a very interesting presentation on it. It is organized yearly by CSTR and started more than a decade ago, and it has put to the test not only text-to-speech systems but whole production pipelines, because all participating teams need to produce both the databases and the systems that exploit them. I'll skip some things. This year's Blizzard Challenge had an English corpus and a very interesting task: 50 books for children, about 5 hours of speech, targeted at children between 18 months and 6 years plus, and the task was to use those roughly 5 hours of data to develop an expressive voice that would be used, again, for reading children's stories. To get a sense of the richness of the data: it's the full package, everything is there. You have the turning of the pages, you have imitations, you have acting, you have non-linguistic vocalizations and special effects made by the reader, you have full emotion; it's all there. Plus, I don't know if it was the choice of the organizers, but you also had to deal with errors in the script: some of the scripts were in PDF, and there were quite a few inconsistencies between text and speech. So what I mean to say is that it put the whole production pipeline to the test; you had to deal with all the things that are considered open problems, and I think that's great, a very good thing to keep pushing on. You had errors, you had missing quotation marks, and you need quotation marks in order to understand what is part of a dialogue and what is part of the narration, and so on; I will say some more about that right after. You needed to update your pronunciation lexicon, because you had some Shakespeare in there, so there were names that a normal lexicon would not know how to pronounce. So you had to do more or less everything.

As you heard, we had many different types of sounds, and it's interesting to see how the different types of expressiveness were dealt with. First you had things like a "boom!" made by the reader; normally you throw these away, but your target application is still children's books, so it might be quite a good idea to try to keep them, and you don't want to chop them out manually. What we did was to provide an equivalent way of writing these things down, so that the segmentation process could chop them; then we can give them a label and let the designer of the audiobook use such effects, audible emoticons let's say, special effects for them to choose from. There was some very, very extreme voice acting which we had to throw away, but we tried to keep most of the data that was there, even data that were quite challenging even to segment. Then what we did, of course, is a sentence-level alignment; we did our own alignment, we didn't use the one provided by the organizers. And a reasonable choice we made, which quite paid off, was to discriminate two major styles, or categories, or domains, or whatever: we created one database with narration, which was outside quotations, and another database with dialogue, the things that the characters would say. Actually it was one database, but we labeled the diphones differently. And it's easy to predict: for each feature you put into your system, it's important that you have a good way of predicting it, and predicting whether something is part of a dialogue or part of the narration is straightforward. We just look for quotations: everything within a quotation you assume is part of a dialogue, everything outside is part of the narration. So you have a good predictor, and it's a good feature to have; we used it as a feature.
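That predictor really is as simple as it sounds. A minimal sketch, assuming straight and curly double quotes are the only quote characters in the corpus (an assumption you would adapt per book):

```python
QUOTE_CHARS = {'"', '\u201c', '\u201d'}   # straight and curly double quotes

def label_spans(text):
    """Yield (substring, 'dialogue' | 'narration') spans for one sentence:
    everything inside quotation marks is dialogue, everything outside is
    narration."""
    inside, start = False, 0
    for i, ch in enumerate(text):
        if ch in QUOTE_CHARS:
            if i > start:
                yield text[start:i], 'dialogue' if inside else 'narration'
            inside, start = not inside, i + 1
    if start < len(text):
        yield text[start:], 'dialogue' if inside else 'narration'
```

For example, `list(label_spans('He said "hello" softly.'))` gives a narration span, a dialogue span, and a narration span, which is exactly the label you would attach to the diphones from each stretch of audio.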
Then we went through the typical segmentation process, with many errors; each of those steps had many traps. And this is the resulting output. This is one of the test audiobooks; it's not part of the training set. I'll just run it through here. OK, the general idea is that it's quite good. If you listen carefully you will find a few artifacts; some of them are masked because it's mp3 audio, but if you are wearing headphones you will be able to spot some things. Not enough to make us embarrassed, though; I think it's quite good, and you will hear quite a difference in the way dialogue is rendered compared to how narration is rendered, especially in parts where there is a lot of dialogue.

OK, and I'll just take three more minutes to think out loud about whether we've had enough of unit selection, whether it should be discontinued or whatever. Well, unit selection is less intelligible, or it's considered to be, but we usually measure intelligibility in the lab using nonsense words, or semantically unpredictable sentences as we call them, and that's not a typical application case. Of course intelligibility is very important in noisy environments, in your car, or when you are listening to something on the road, but typically intelligibility is quite OK; I don't think that in a real application, as opposed to tests with nonsense words and so on, there would be a huge difference between the intelligibility obtained by a parametric system, which performs best on intelligibility, and a unit-selection system. Coverage is its weak point: if you don't have it in your database, there is nothing you can do about it, you're stuck. Well, that could be a valid point, but actually I was expecting this one to come out in favour of the parametric systems: there are so many things going on that having a model might be a safe strategy, and unit selection has no model, so I would expect things to be quite in favour of the parametric systems. Yet the fact is that the unit-selection systems did, I think, quite well; some of the best systems in these years' Blizzard Challenges are still unit-selection, hybrid unit-selection systems; all top unit-selection systems are nowadays hybrid. Another thing: unit selection has a large footprint. Yes, it does, but with some efficient pruning, coding and so on, it does fit in your mobile phone, so it's okay. Right, so isn't that cheating, you just copy-paste your recordings? Yes, but that's the idea. If you are a researcher, that's not a fair thing to do, but if you are developing applications, that's exactly what you want to do: you want to maximize the chances of doing copy synthesis. So yes, it's not science, but it's the best thing you can do if you are designing applications, especially for limited-domain systems where you can expect the same things to come up over and over and over.
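A minimal sketch of that copy-synthesis shortcut, assuming the database compiler has built a `sentence_index` mapping normalized text to recorded utterances; both the index and the fallback function names are illustrative, not part of any real system:

```python
def synthesize(text, sentence_index, full_unit_selection):
    """Prefer whole-utterance playback (copy synthesis) over search."""
    key = " ".join(text.lower().split())    # crude text normalization
    if key in sentence_index:
        return sentence_index[key]          # the whole sentence is one big "unit"
    return full_unit_selection(text)        # general path, with joins
```

In a limited-domain system the index can also hold phrases rather than whole sentences, so that even partial hits reduce the number of joins.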
Oh, it's old tech? Yes, it's been around for quite some decades, but as we've seen it can be used to tackle new problems, expressiveness, in ways that are quite manageable. You can gain insight into what's going on with expressivity by designing an expressive text-to-speech system and trying out different features that may relate to expressiveness. And you can hybridize it; there's a lot of room for hybridization, I mean using your models, DNNs, HMMs or any other statistical parametric component, along with your unit-selection core. So it's fine, and we shouldn't disregard it: it's still the best there is in terms of naturalness, it uses the best vocoder there is, which is of course no vocoder at all, and it still has many, many interesting questions to answer. And as Simon King said in one of his survey papers on the ten years of the Blizzard Challenge, there is a lot of engineering challenge in designing and tuning a well-performing unit-selection text-to-speech system; it's not a trivial thing that everybody can take out of the box and play with. There are challenges there, even at the engineering level.

And just one word about the hands-on session, which was generously provided by Simon King; many thanks for that, Simon. Oh, I thought I had a preview; yes, I do, sorry. So at this address, this afternoon, you will find a voice-building tool chain that is based on Festival, which is an open-source pipeline. You have a preconfigured workspace, you have some data, the freely available ARCTIC corpora, and you will learn how to label, pitchmark, build and test-run your voice.

OK, and I think that's more or less everything. If there are any questions or whatever, I'll be glad to talk now, or in the break, or whenever you want; until Friday I'll be here. OK. I have a question: can you summarize a bit how you really get expressiveness with unit selection? And a variety of speakers, I don't know, speakers also? Can we get, in short, flexibility? Flexibility for me includes expression and also identity, speaker identity. Right. So a unit-selection system, like all text-to-speech systems, is tailored to a single speaker. You don't have a database of many speakers from which you choose; you have a database from a single speaker. It seems that unit selection does have the basic tools to reproduce many different expressive patterns. Now, to make this controllable, because uncontrollability is rather a negative than a positive, I mean having a system that is very flexible but that you cannot control is not a good thing, if you want the expressiveness to be controllable you need to apply some specific features: you need to develop a way of specifying what type of expressivity you want your system to exhibit, and then have the system do it. For example, in our case we had only two styles, narration and dialogue; that was a kind of expressivity-specification feature. One can think of different sets of such features; one could even think of joy and pleasure as a set of features, as long as there is a way of predicting them. But in short, you have to have data for that: no data, no expressiveness. OK? OK, thanks. Questions?
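To make that last answer concrete: a minimal sketch of how such a style-specification feature could enter the target cost of a unit-selection search. The weight, the dict-based unit representation and the `base_cost` hook are illustrative assumptions, not how any particular system stores its units.

```python
STYLE_WEIGHT = 1.0   # illustrative: how strongly style agreement is enforced

def target_cost(target, candidate, base_cost):
    """target/candidate: dicts of linguistic features for one diphone slot.
    base_cost: the usual weighted sum of the other feature mismatches."""
    cost = base_cost(target, candidate)
    if target["style"] != candidate["style"]:  # e.g. 'narration' vs 'dialogue'
        cost += STYLE_WEIGHT                   # pay extra for a style mismatch
    return cost
```

A soft penalty like this, rather than a hard filter, lets the search fall back on the other style when the requested one has no good candidates, which matters exactly when coverage is thin.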