We will know which bits matter and which bits don't matter. To make that super clear I will recap quickly my view of unit selection. I hope it is harmonious with yesterday, so we can smoothly make this transition into HMM synthesis. We will have a quick look again at unit selection and pick out the key concept of unit selection, which is in fact going to be the same as the key concept of HMM synthesis. It's all about turning an almost infinite set of possible linguistic contexts into some finite set that we can then pull out of a database or find a model for. That finite set might be flexible in its size depending on the data we have to support the building of it. And again we're going to repeat that speech synthesis could be seen as a regression problem, a really difficult regression problem, and just narrow down a little bit the scope of where we can actually apply the machine learning to the regression problem and where we still do lots of intuitive, knowledge-driven engineering in the features. So we don't actually put text into the TTS system and we don't actually get speech out. We narrow the scope of that a little bit and engineer the two ends of that. And I'm sure Hager tomorrow will tell us a bit about how we can expand back out and get certainly closer to the waveform on the far end of the machine learning. Let's just be 100% clear we're okay with terminology, because there's terrible terminology in the field: it's been invented organically and there's some potentially conflicting or confusing terminology. So in speech synthesis the standard unit is the diphone, and it's okay just to assume that pretty much all speech synthesis systems use diphones. If they don't, they probably use half-phones, but with a strong preference for trying to make diphones out of them. So there's a strong preference not to make joins at phone boundaries. So unit selection uses diphone-sized units. Unfortunately there is also a form of speech synthesis called diphone speech synthesis, which is the same thing except with only one copy of each diphone. But that's the first bit of confusing terminology. So diphone is a type of speech unit. It's not a method of speech synthesis really, but it gets used in that way. So that's what that means. Unit selection we understand from yesterday. So let's say they both use diphones.
We might generalise that, but it's okay just to say that. We're going to use some other units now in HMM synthesis. In fact we're probably going to use quinphones, and then we're actually going to make them something more than quinphones. So let's just be clear what the difference in these things is. And if you know this, that's good. If you don't, it's worth recapping. So here's some speech, and it's got the boundaries of the segments, the boundaries of the phones. The phones are the tokens; phonemes are the classes. And so a diphone goes from the middle of one phone to the middle of the next. So that's the diphone, and we might call that the diphone. These labels go at the end of segments; that's the convention. So this is the second half of this and the first half of this, so this might be this diphone here. That's fine. Triphones are a little different. So here's the segment called this, and a triphone is exactly the same shape and size as a phone. It's just a context-dependent copy or version of that. So it's rather confusing: diphones and triphones are different sorts of things. And then we won't get into biphones, because we're not sure what those are. So we might want to give names to triphones; we might want to have some notation to talk about them. We'll use HTK-style notation, because that's the only standard available and it's perfectly sensible. And we might have something like: this is this segment, this phone, in the context of Z afterwards and D before. But remember it's still a model of a phone-sized unit. It's just that there are many, many different models of this unit depending on the context it appears in. And you can of course generalise that by adding the next-next phone and the previous-previous phone. So going out to minus two will give you quinphones, quin meaning five. And you can go on from that if you want, but quinphones will do for us. It's also worth recapping that there's been an implicit assumption in everything we do in speech processing, everything that most people do most of the time, the mainstream, and that's to pretend that speech is a string of units around which we can draw boundaries and segment it. And you'll have discovered, if you try and do that on speech, that it's not quite true. It's really quite hard to draw those boundaries. However, for engineering purposes it's extraordinarily convenient to make this assumption that we can represent this continuous signal, this waveform that's continuously changing and really doesn't have many discontinuities except things like stop closures, as a discrete sequence of symbols. Because if we don't do that, we're not sure what to do. And it's close enough to the truth that it works, because we can make it even closer to the truth by making those units context sensitive, context dependent, such that instead of pretending that speech is a string of phones, we say it's a string of triphones, and that's now closer to the truth. Okay, there's still a problem of where exactly the times of those boundaries are, but it's closer to the truth in the set of units that there is. There aren't just 45 units; there are many, many thousands of types. So the key is context; it's all about context. And in unit selection, in HMM synthesis and in DNN synthesis, it's all about discovering what the context is. In other words, what affects the current realisation of this current sound that we're in, and what doesn't. Because if it doesn't, we don't need to make the model sensitive to that aspect of the context. So how can we possibly know what all these contexts are?
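Before getting to that question, here is a minimal Python sketch of the naming schemes just described: diphones, triphones in the familiar L-C+R style, and quinphones with two phones of context on each side. The separators and the padding symbol are illustrative assumptions, not any toolkit's exact label format.

```python
# Turning a phone string into context-dependent unit names (a sketch).

def diphones(phones):
    """Name each unit from the middle of one phone to the middle of the next."""
    return [f"{a}_{b}" for a, b in zip(phones, phones[1:])]

def triphones(phones, pad="sil"):
    """One context-dependent name per phone: previous-current+next."""
    p = [pad] + phones + [pad]
    return [f"{p[i-1]}-{p[i]}+{p[i+1]}" for i in range(1, len(p) - 1)]

def quinphones(phones, pad="sil"):
    """Extend the context to two phones either side (quin = five)."""
    p = [pad, pad] + phones + [pad, pad]
    return [f"{p[i-2]}^{p[i-1]}-{p[i]}+{p[i+1]}={p[i+2]}"
            for i in range(2, len(p) - 2)]

if __name__ == "__main__":
    seq = ["sil", "ao", "th", "er", "sil"]     # roughly the word "author"
    print(diphones(seq))    # ['sil_ao', 'ao_th', 'th_er', 'er_sil']
    print(triphones(seq))   # ['sil-sil+ao', 'sil-ao+th', ...]
    print(quinphones(seq))
```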
And if we just write down the product of all the possible linguistic environments that a sound can appear in, even if we restrict ourselves to within a sentence, there are an infinite number of possible sentences in every language, because you can say anything and you can always invent a new sentence, and therefore there's an infinite possible set of linguistic contexts in which a sound can appear. And therefore if we make a sound sensitive to its entire context, we have an infinite set of classes, which is going to be pretty inconvenient. We could never collect examples of them all. So it's all about dealing with this infinite variety of linguistic contexts and collapsing it to a finite and hopefully reasonably small number of contexts that we can do something with, such as store units of, make models of, or write symbolic representations of. And there's one key concept that's assumed across all of speech synthesis, and it would be impossible to do speech synthesis if we couldn't make this assumption: it's not the complete context that matters. What matters is what effect the context has on the current sound. So if something in the context changes, such as if we change a word four words away in the current sentence, but that doesn't make any difference to the current sound, that context doesn't matter. So we don't need to multiply all the different contexts by that factor. So it's as much about discovering what contexts don't matter as what contexts do matter, and then making this mapping from the infinite to the finite. And so we need to discover, automatically, whether a particular contextual factor, whether it's the previous phone or the stress of the syllable three syllables in the past, or whether the current sentence is a question or a statement, whatever the context factor is, changes the current sound or not. Now, probably all contexts make some change to the current sound, but maybe not a significant one. So we're going to have to make approximations here. This is reality. So it's those that have no possible effect, or no effect that's worth modelling, that we want to find. This linguistic context is a complicated beast, because it spans many timescales and many linguistic levels. Yesterday in the lab you built a unit selection voice, and in that unit selection voice are these mysterious things called utterances, utterance objects, these .utt files. And if you are brave enough to look inside one, you realise they're not human friendly, but they are a representation of linguistic structure. They've got all sorts of exciting types of structure: they've got simple sequences such as the phonetic string, they've got tree-like structures such as how phones make up syllables that make up words, and they've got relationships between strings of things and trees of things. So strings of phones belong to tree-like structures of syllables and words. So here's a picture that attempts to capture some of that. This is my in-a-nutshell simplified utterance object with its linguistic structure. So that's a very heterogeneous object. It's got different types of things in it, and it's not really obvious how to start doing statistical modelling with this kind of mixed bag of things. And remember that in the front end, we said you can extract all sorts of things from the text. If you think it matters, you extract it from the text, and you might extract a tree, a parse tree, or extract boundaries of things between words.
And you might have all sorts of different representations of these things. That's fine. You might represent them in the most linguistically intuitive way, but it might not be immediately obvious how to map that onto a statistical model. So what we're going to do is take our assumption that speech is a linear string of things, a linear string of context-dependent things, and we're going to convert this structured linguistic object, this utterance, into a linear string of things. The string is going to be rooted around the basic unit of sound, the phone, and we're going to attach all the other context to the phone. I like to call this flattening. We're going to flatten the context. So we're going to take this deep linguistic structure and we're just going to squash it, and everything that's above the phone gets squashed and attached to it. This is not linguistically very sophisticated. It loses structure. It loses information. But it's convenient for modelling. So we've now converted rich linguistic structures into a sequence of symbols, and the symbols are going to be rather complicated, because they're going to be phones with all sorts of things attached to them, and their names are going to look really messy, but there'll be a string of things with boundaries in between, a string of isolated, separate items; maybe we can make the assumption that they're statistically independent. So the key, then, is that of all these many different types we've created (because flattening linguistic structure onto the phone string creates many new types, the product of all the different factors, this seemingly possibly infinite number of types), we're now going to discover what the actual natural classes of sound are. In other words, which things sound the same. So that when we, for example in unit selection, take a unit out of a linguistic context, out of one sentence that we recorded, and play it back in a different sentence, in a different linguistic context, it sounds the same as it would have sounded if it had been recorded in that context. So it's equivalent, or nearly equivalent, or imperceptibly different, and that's what we're aiming for. That's what unit selection's all about. If you can't find the unit you want, go and find one that's sufficiently similar, and you measure that with some handcrafted thing called a target cost. And that target cost is a function of the linguistic context. It might be as simple, in Festival, as a weighted sum of mismatches. So every time there's something different in the linguistic context, you incur a penalty, and that's weighted by how important you think that mismatch is. So if the next phone mismatches, you get a big penalty, because that's probably important. So that's the gist of unit selection: we've got this thing called a target cost. I'm going to fly through some slides; they're in the pack for you on the website if you want to follow along, just like yesterday. And it's all about crafting a target cost that appropriately uses these criteria. We also mentioned, and Spiros mentioned in his unit selection talk, that we can go a step further in unit selection and, instead of using symbolic features, use acoustic features. So we need to do what is effectively partial synthesis. Now in unit selection it doesn't matter if we synthesise something that we can't get to the waveform from. It doesn't have to be a comprehensive representation of the signal. It just has to be something that's useful for measuring distance.
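As a concrete illustration of the weighted-sum-of-mismatches target cost just described, here is a minimal Python sketch. The feature names and weights are invented for illustration; a real unit selection system defines its own set.

```python
# A hand-crafted target cost: penalise every mismatching context feature,
# weighted by how much we believe that feature matters.

TARGET_WEIGHTS = {
    "next_phone":   10.0,   # mismatching the next phone is probably important
    "prev_phone":   10.0,
    "stress":        5.0,
    "phrase_final":  2.0,
}

def target_cost(target_context, candidate_context, weights=TARGET_WEIGHTS):
    """Sum a weighted penalty for every contextual feature that does not match."""
    cost = 0.0
    for feature, weight in weights.items():
        if target_context.get(feature) != candidate_context.get(feature):
            cost += weight
    return cost

# A candidate recorded in a slightly different context than the target:
target    = {"next_phone": "er", "prev_phone": "sil", "stress": 1, "phrase_final": 0}
candidate = {"next_phone": "er", "prev_phone": "dh",  "stress": 0, "phrase_final": 0}
print(target_cost(target, candidate))   # 10.0 + 5.0 = 15.0
```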
So we need some acoustic space where we can measure difference. We don't need a full vocoder specification; we could do something simpler. For example MFCCs, which are not the most convenient thing to synthesise a waveform from, but are quite good for measuring acoustic distance. But if you're going to do partial synthesis, why not just go the whole way? If you're making all of these steps down your journey, instead of stopping at some point and concatenating units, why not go the whole hog and actually generate speech? And that's what we're going to do. So in HMM synthesis we're going to take those extra steps to get to a comprehensive acoustic specification, and then we're going to use that to drive a vocoder. We're not going to say much about vocoding here, because we've already heard about that in the lecture. And here's where we stop doing statistical modelling, draw a line, and hand over to feature engineering. The vocoder is an engineered object. It's designed by an expert engineer with good intuitions about the speech signal and how it works, how to deconstruct it and reconstruct it. And that's handcrafted, that's expert: a crafted object, a crafted piece of machinery. And maybe later we could talk about how you might go even further with the machine learning and get closer and closer to the waveform, maybe all the way to the waveform. So that's engineering, this is machine learning, and the front end we've seen already is all engineering. It might be made out of bits of machine learning, but as an object it's an engineered, crafted thing, a made thing. The machine learning is then in the middle. There are linguistic features that have been engineered, and vocoder parameters which have been engineered. So that's what we're going to constrain ourselves to now. We're going to use regression to get from the linguistic features towards the waveform. We won't actually get all the way to the waveform; we'll stop at some point short of a waveform, and that's going to be our vocoder parameters. The vocoder makes that last step to get to some waveform we can play back, and this thing in the front end has already been made. So we're going to get from here, the context flattened onto the phonemes, to the vocoder parameters. So that's definitely regression: symbols in, continuous values out. There are lots of examples out there you can find of good parametric synthesis. I'm not going to play you necessarily state-of-the-art samples, and we're not going to listen to a lot here, because we won't hear big differences over the sound system here. Let's listen to the sort of quality you could easily make yourself with the available tools, the tools that you can get freely: things like the HTS toolkit for building the models, the STRAIGHT vocoder for doing the vocoding, or any other vocoder you like. So let's just listen to a few of those, just to get an idea of what things sound like. Let's try that again. "Statistical parametric speech synthesis is actually more intelligible than unit selection." So, hopefully it doesn't sound as good as unit selection; you hopefully can hear that it's been vocoded, and not super well. "Statistical parametric speech synthesis can use high quality data." So that sounds a bit better. "But also works well with more variable recordings." So it's clearly different to unit selection. There are no concatenation artefacts, because there's no concatenation per se, but there are obvious vocoding effects there.
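As an aside, the sort of vocoding audible in those samples is easy to try for yourself by encoding a recording and immediately reconstructing it. This sketch uses the freely available WORLD vocoder via the pyworld package purely as a stand-in for STRAIGHT or whichever vocoder you prefer; "input.wav" is a placeholder for your own mono recording.

```python
# Analysis followed immediately by resynthesis ("copy synthesis"), a sketch.
import numpy as np
import soundfile as sf
import pyworld as pw

x, fs = sf.read("input.wav")                    # mono speech, float64 samples
x = np.ascontiguousarray(x, dtype=np.float64)

# Analysis: F0, smooth spectral envelope, and aperiodicity, frame by frame.
f0, sp, ap = pw.wav2world(x, fs)

# Synthesis: reconstruct a waveform from those vocoder parameters alone.
y = pw.synthesize(f0, sp, ap, fs)
sf.write("copy_synthesis.wav", y, fs)
```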
But with a better vocoder, you would expect to do better there. Those are not state-of-the-art examples. Those are the sort of quality you could build yourself from two to four hours of data, say. So there are a lot of different ways, a lot of different angles, from which we could come at HMM speech synthesis to try and understand it. What we'll do is come at it from a few different angles, and maybe one of them will make the most sense to you. The first way of thinking about it is this: we've got a vocoder, and the vocoder can code speech and reconstruct it. That's something that's often called copy synthesis, and we'll often do that anyway to test the vocoder. We might often do it for the purposes of evaluation, to set some top line, because that's the best we could ever achieve with HMM synthesis: we can't exceed the quality of just vocoding the natural speech. So one way of understanding statistical parametric speech synthesis is to say it's a kind of speech coder that takes speech in and eventually produces speech out. But whereas the purpose of a speech coder, a vocoder or a coder like CELP or something else, is to compress or encode the speech so we can transmit it or store it efficiently, here we do something different. In a coder we encode the speech as some sequence of features, speech features, we do something like transmit it, and then we reconstruct some sequence of features, the same features, possibly with errors introduced by the transmission or the compression. We're going to break that chain, and instead of just storing the features we're going to distill them. We're going to do a very special form of compression: we're going to compress them into a statistical model that makes a relationship between the text and the features, and then we're going to store that model. And then later, at any time of our convenience, we can retrieve that stored model and, given only the text, decompress the speech. So if you're a speech coding person, if you'd like to think of speech as something you can compress and uncompress, maybe this is one way of thinking of what statistical parametric speech synthesis does: it's kind of the ultimate speech compression. It compresses speech into something that's text, and then decompresses it. So that's what we're going to do: we're going to model this coder representation. We're never going to model the waveform directly; at least until we maybe get to very advanced DNN synthesis, that's still not working as well as modelling these vocoder parameters. So we're actually going to model the vocoder parameters themselves, as we saw in the first lecture: we're going to model the F0, the intensity, the spectral envelope, and we're of course going to need some model of duration. So we're going to have statistical models, and the models are going to be of some linguistic unit. We've already decided we'll use phonemes as the linguistic category, so these are models of context-dependent phones, and they're just going to generate some duration of speech at some fixed frame rate, because a fixed frame rate is more convenient than doing things at a variable frame rate, which is always very messy. And then the rest is the job of the vocoder, and we don't need to worry about that because we already understand it. So that's one view. I don't necessarily think it's the easiest way to think about what statistical parametric synthesis does; it's not a very obvious form of speech compression. So what we're going to do now is give you some other views that are much more helpful, and I'm going to give you first the conventional view, which is the
sort of procedural view. It's the view of how you actually build a system, and it's the view that comes from automatic speech recognition, which is where all of this stuff is borrowed from. Is that Yanis with a question, or stretching? Stretching. So let's start with this procedural, sort of mechanistic view: what happens when we try and build a system where we'd like to train a statistical model of every phone in every possible linguistic context, realise why that's impossible, and then fix it, find a solution to that problem. And that solution is borrowed directly from automatic speech recognition. That will help us understand what happens when we actually make such a system, and it will be very practical. But then, once we've built the system, we can look at it again and ask what it is really doing, what the system is really doing, and we'll see it's just doing regression. So, to set the whole thing in context before we get lost in any detail, let's just see the whole chain of events that's going to happen, first in building and then in using an HMM-based speech synthesis system. We're going to need a front end, and the front end is going to be perhaps exactly the same front end we used in unit selection or at any other time. It's the exact same linguistic processor, and it produces the same linguistic features that are available for unit selection. We may use them all, we may use a subset of them, it doesn't matter. So the linguistic processing is exactly the same, and all that really differs is that we flatten the context and attach it to the phone, at this phone level. Now, that's kind of happening anyway in unit selection, because in unit selection we're proceeding phone by phone, or in practice diphone by diphone, but it's effectively the same thing: we're proceeding through a linear string of units, and for each of them we calculate a target cost. That target cost might query this rich hierarchical linguistic structure, but it's doing so for each diphone, so it's still pretending that speech is a linear string of context-dependent units, and the context is associated with things at the phone level, or they could be diphones. So we're still doing the flattening thing in unit selection, but we do it explicitly in HMM synthesis: we explicitly rewrite the string of phones as a string of context-dependent phones. Then, for each of those classes, this very, very large number of classes, we need a statistical model, so we can store that model. And then, when it comes to synthesis time, we put our new text through the linguistic processor, through the front end, and it tells us the string of context-dependent phones we need. We have already got a model for each of them; we retrieve it, we string them together, and we perform some process to generate speech from that, which will first generate speech features or speech parameters and then use the vocoder to actually get us a waveform. So all that seems fairly straightforward: train your models, store them, use them to synthesise. And the core problem is going to be that there are more models than we could ever train from any reasonable data set. To put that more specifically: at synthesis time, when a new sentence comes in and our front end rewrites it as a sequence of context-dependent phones, there will be a model in there with a name that we don't have in our set of trained models, because it never occurred in the training data, and then we need to know what we are going to do when we don't have this model. There's only one thing we can do, and that's use one of the models we do have, and this is the moment at
which we collapse this infinite set of linguistic contexts down to a finite set of things we've actually modelled, and make this assumption that some things are equivalent, or nearly equivalent, that they sound sufficiently similar. Now, it's obvious we're going to use HMMs, because that's the title of this class. Why HMMs, you may ask. You may well ask: not the smartest model, not the most sophisticated model, in fact a very, very simple model. And the answer is that we use HMMs because they are a very simple model, and the consequence of them being a very simple model is that they've got really nice algorithms. The algorithms are tractable and efficient, they converge, and they train quickly, so they're very nice models to compute with. We'll see maybe later on that there are other, more sophisticated sequence models, such as linear dynamical models or better models; these are less convenient to compute with, although they may be more faithful models of the speech signal. We're going to stick with HMMs. HMMs are simple generative models. An HMM is a finite state machine, and in each of the states of the finite state machine lives a probability density function. It might be a discrete distribution if we wanted to generate symbols, but we want to generate continuous vocoder parameters, so they are probability density functions, and they're going to be Gaussians, because that's most convenient for computation; anything else is more difficult. And although in speech recognition it appears that we use HMMs as a classifier, it looks like we turn speech into a sequence of classes, HMMs are not classifiers. An HMM is a generative model, so an HMM can indeed be used to generate, and it can generate a sequence of observations like that. So, as an aside, how do you use a generative model to make a classifier? Very simple: you have a generative model of each of your classes, you compute the probability that your observed data was generated by each of your models, and whichever could generate it with the highest probability, you assign that class label. So you can form a classifier from a set of generative models, and that's the standard form of speech recognition with HMMs. Things have moved on since then in ASR, but that's the basic form of speech recognition. We're actually going to use them to generate; we're going to use them as generative models. So the whole process is going to look a little bit like this, just to keep setting ourselves back in the overall context of things. We've got some input text, and in that input text we've got a sequence of letters. Those letters get turned into phones, either through looking up in the lexicon or through letter-to-sound rules. They get lots of context attached to them by all of the modules in the front end, all those processing modules, and each of those processing modules is effectively attaching features to the phones. And eventually we're going to find out that the features we're attaching are really just binary features: we can collapse everything down to binary features, like is it or is it not something, is its part of speech a noun or not. So we're eventually, effectively, going to treat all of these features as binary features, as ones and zeros, yeses and noes. It's true in HMM synthesis and it's true in DNN synthesis that we collapse almost all the features down to binary yes-or-no features, and it's pretty much true in unit selection synthesis too, because the target cost effectively revolves around things that match or don't match, and we can only do that by knowing whether they have the same feature or a different feature. So it's effectively
collapsing everything to binary. All that beautiful linguistic structure that we spent ages constructing, trees and things, is flattened down to these rather long sequences of ones and zeros, which we will either probe with questions or use as features in a DNN. In the HMM case, then, think of these features as some large binary number with a very large number of bits. We'll see tomorrow, when we build a DNN voice in the lab, that there may be 300 such binary features or more. So each of these models has got a name, chosen from a set, which is a 300-bit binary number. That's 2 to the 300 possible models, which is quite a lot; I suspect it's more than the number of atoms in the universe. It's a very large number, and so obviously we can't model all of these, but we're going to try. We're going to try and build all these different models, so that at synthesis time we'll have every model we could ever need and we can retrieve it. We can perform some generation process from these models to generate a sequence of vocoder parameters, and then we can vocode to produce the waveform. So far so good; chime in with questions at any point. Now, it's going to get less and less human readable. I'm not going to draw 300-bit binary numbers, so let's draw them in a slightly more readable form, because this is what we use in the code, and if you know HTK you'll be familiar with this. We've taken the front end, we've produced this structured object, we've flattened it onto the phone level, and each of those phones now has a name that encodes its entire linguistic context. Let's pick one: this is the model of OR in "author", and it's in the context of ER to the right and silence to the left, ER to the right of that and more silence to the left of that. And all of this horrible stuff here is an encoding of all the other context: maybe this one encodes its syllable structure, maybe this one here encodes its distance from a prosodic phrase break, or something like that. So we've encoded all of this stuff, and we could further expand that out into just binary features: is there a phrase break four words from now, or not? It's linguistically very naive, but very convenient for what is to come. So we've now got this very, very large set of categories, of classes, and we would like to build a model for every single one of them, this vast set. Okay, now I did imply earlier that there's an infinite set of such categories, because there's an infinite number of sentences in the language. This is true; however, our front end can only extract so many things from a sentence, and it's going to do the same processing for every sentence, so the number of features that we extract is in fact going to be finite. So the context is going to be somehow limited; it's not going to encompass every single word in a sentence, because that's a variable-length thing; it's going to become a fixed size, but still ridiculously large. So it's not quite infinite, but it's 2 to the 300, which is practically infinite as far as the data is concerned. So we couldn't possibly have a training example of every one of those units in the data. The context is very, very rich. In fact it's so rich that if we took our training data, for example the database for unit selection, which we might use to train our statistical model, rewrote all of the phones with all of their context attached, did a sort on those and then looked at them, we'd find there's exactly one occurrence of each. Never does the same thing happen twice, except in some weird cases like a special sort of silence, or if the same sentence happens to occur twice in the database, like we've got someone saying "okay" twice as a single isolated sentence.
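To make the flattening concrete, here is a minimal Python sketch that builds a full-context name for one phone and, equivalently, unrolls the same context into a vector of ones and zeros. The field layout and separators are invented for illustration; they are not the real HTS label format.

```python
# One phone plus its flattened linguistic context, as a name and as binary features.

def full_context_name(ctx):
    return (f"{ctx['ll']}^{ctx['l']}-{ctx['c']}+{ctx['r']}={ctx['rr']}"
            f"@syl:{ctx['pos_in_syllable']}"
            f"@stress:{ctx['syllable_stress']}"
            f"@brk:{ctx['words_to_phrase_break']}")

def binary_features(ctx, phone_set, max_break_distance=5):
    """Rewrite the same context as ones and zeros: 'is the next phone /b/?',
    'is the syllable stressed?', 'is the phrase break 4 words away?', ..."""
    feats = []
    for p in phone_set:                        # one-of-K coding of the next phone
        feats.append(1 if ctx["r"] == p else 0)
    feats.append(1 if ctx["syllable_stress"] else 0)
    for d in range(max_break_distance + 1):    # one-of-K coding of break distance
        feats.append(1 if ctx["words_to_phrase_break"] == d else 0)
    return feats

ctx = {"ll": "sil", "l": "sil", "c": "ao", "r": "th", "rr": "er",
       "pos_in_syllable": 1, "syllable_stress": 1, "words_to_phrase_break": 4}
print(full_context_name(ctx))
print(binary_features(ctx, phone_set=["ao", "th", "er", "b", "d"]))
```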
So context is so rich that we've got, pretty much all the time, exactly one training example of each of the things in the training data, the few tens of thousands of segments in the training data, and exactly no training examples of the billions of other things that are possible. That's problematic. Not only can we not train models of the things that are not in the training data, we can't really train models of the things that are in the training data either, because there's only one example, and that's not enough to robustly estimate a statistical model. So we need to develop a solution which solves both of those things at the same time: it can create a model for something we never saw, which sounds a bit tricky, and it can robustly estimate a model of things we have seen only once. And the solution to both is to pool data. Once we've got enough data pooled together we can train a model, and we use that model for all the contexts we've pooled together, and for all the other contexts that have enough in common with that group of contexts. So we're going to do some form of clustering of these models. This is what we're calling view one: the "how to build an ASR system" view converted into "how to build a TTS system". Try to train models, fail, come up with a clever solution, which is called clustering or tying, and then we get to train models. So it's going to be parameter sharing, which in the world of speech recognition is usually called tying. In other words, there are two models that each apparently have their own parameters, but they just point to a common underlying parameter, so they're tied together, joined. We can achieve that very straightforwardly by clustering. Effectively we're going to cluster the data; we're actually going to train very bad models and cluster the models, but it almost amounts to clustering the data. Those clusters of data are like pooled training data across groups of different contexts, different types, different classes, and that will give us enough data to robustly estimate a model. So we're just doing what we said we have to do all the time, which is to make the assumption that this sound in this context sounds so much like the sound in that context that in unit selection we can use one if we don't have the other, and in parametric synthesis we can use the same model for both. So that's what we're going to do. Now, cast your mind back two lectures, to the lectures on the front end: remember we looked at classification and regression trees, CART. There are lots of different ways of clustering things, and we need a very special way of clustering things here, because we don't just want to do bottom-up clustering and form clusters of training data. We need, at synthesis time, to be able to take just the name of a model which we don't have and find the right model in the set of models, based only on its name, in other words only on its linguistic context. So given only linguistic context we have to retrieve a model from the set of models. We always know the linguistic context, because the front end tells us it, for the training data and for the test data. In the training data we also know the acoustics, the speech, and there we can get a model. That sounds awfully like the sort of problem we saw in the front end, and this is the sort of problem we can solve with a classification tree or a regression tree.
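Before we get to the trees, here is a minimal sketch of what tying amounts to in practice: several context-dependent names all point at one shared Gaussian, estimated from their pooled data. The contexts and data below are invented; numpy only.

```python
# Parameter sharing ("tying"): pool data across contexts, estimate once, share.
import numpy as np

def pooled_gaussian(datasets):
    """Maximum-likelihood mean and diagonal variance of the pooled frames."""
    pooled = np.concatenate(datasets, axis=0)
    return pooled.mean(axis=0), pooled.var(axis=0)

# One training example each of three different contexts: far too little data to
# estimate three separate models robustly...
rng = np.random.default_rng(0)
ctx_a = rng.normal(0.0, 1.0, size=(30, 2))    # frames of "ao" in context A
ctx_b = rng.normal(0.1, 1.0, size=(25, 2))    # frames of "ao" in context B
ctx_c = rng.normal(-0.1, 1.0, size=(40, 2))   # frames of "ao" in context C

# ...so tie them: one underlying parameter set, shared by all three names.
mean, var = pooled_gaussian([ctx_a, ctx_b, ctx_c])
tied_models = {name: (mean, var) for name in ["ao_ctxA", "ao_ctxB", "ao_ctxC"]}
print(tied_models["ao_ctxB"][0])   # all three names share the same mean and variance
```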
So we're going to build a regression tree, and the regression tree is going to take as its inputs, its predictors, the names of the models. Let's just pretend they're triphones; it's quicker to write and easier to read. The names of the models are things we always know, so those are the predictors, and the thing we're trying to predict is a model parameter, such as the value of the mean of a Gaussian in the third state of the HMM. Those are the predictees. So we're just going to build a regression tree that regresses from names of models to parameters of models, and that's it. Very simple. We could try and do it by rule, but of course we're going to learn it from data. So we're going to build a regression tree, and it's going to query these linguistic features, and now we can see why all of these linguistic features are effectively going to be treated as if they're binary: because in our regression tree, assuming it's a binary tree, we always have to ask yes/no questions, is it this or not this, and that's just like rewriting the features as binary. So we might have a feature that is the next phone. The next phone can take 40 possible values, but when we query it we'll ask questions like: is it a B, yes or no; is it a D, yes or no. That is just equivalent to rewriting the name of the phone as a sort of one-of-K vector of ones and zeros. We could ask, is it a vowel or not? That's equivalent to rewriting the list of 40 phones as a list of the vowels and a list of the not-vowels, and then having a feature that's one or zero depending on which of those is the case. So we're effectively rewriting the features as binary. We don't explicitly do it in HMMs, because we can query symbols and ask pattern-matching questions about symbols; in DNNs we will literally rewrite it as a binary vector, because we need those numbers as input to the DNN, but they're being treated exactly the same, very naively. So we've got this regression tree that, given only the name of the current segment we're trying to synthesise and its surrounding context, can descend the tree, and at the leaf of the tree there is a model. The clustering is going to be done state by state for the models. It could be done on whole models, but it tends to be done state by state, and we'll refine this a little bit later on; actually things will be a little bit more sophisticated than this later on. Yes, go ahead. As you grow this tree, you need to keep track of the entropy of the resulting mixtures, or something like that. What are you measuring the entropy of?
So the question is about how you build this tree. For the full details you need to read the PhD thesis of Odell, who basically came up with this for HMMs for speech recognition. It's not going to be entropy, because these things down here are continuous-valued things, they're not symbols anymore, so we're going to measure something else. What would we like to measure? This is a generative model of data, and how do we measure how good a generative model of some data is? We measure the likelihood that it generated that data. And what does it mean to train a model on the data? It means to find the values of its parameters, and the normal criterion will be to maximise the likelihood of the training data, so we do maximum likelihood training, ML. So what we would genuinely like to measure when we're growing this tree: let's just make it really clear what happens as we grow this tree. What does it mean to be at the root of the tree? It means to have just one model for all contexts; let's imagine we do this per phone class, or maybe for all phone classes. Being at the root is saying there's one model and you should use it all the time, regardless of context. To go down the tree is to say we'll have two models, one for the case where there's, say, a B to the right and one for everything else; that's what a split means. And we would like to measure, when we're building the tree, how much better it is to have two models instead of one model, how much better it gets. The obvious thing to measure would be the likelihood of the training data. Splitting will always increase the likelihood of the training data: you've got a model with more parameters fitted to the data, and unless you've done something horribly wrong, it will increase the likelihood of the training data. And as long as it increases it enough, we'll take the split. This increase in likelihood will be the criterion for choosing the question, and it might also eventually be the criterion for stopping, if we can't increase it by enough anymore. To measure the actual likelihood of the training data would involve actually training these two models and then going through all the training data and computing its likelihood, for every possible question we could put here; that's far too expensive. But, as you can read in Odell's thesis, we can actually approximate this very, very well by just storing certain statistics of the training data, and the only assumption we make is that the alignment between the models and the data doesn't change as we make the model set more sophisticated. It's very beautifully explained in the thesis, and it's actually quite straightforward. So we're going to measure the increase in likelihood of the training data, and we're going to approximate that with a very good approximation. In speech synthesis the trees tend to get much larger, especially for parameters like F0, because there are very few parameters at each leaf, so they are very, very deep trees, and it becomes more important to control the size of the tree more carefully. Somebody asked a question, I think, when we were talking about CART: there are different stopping criteria available. The naive one is just to put a threshold on the increase in likelihood and, if it's not big enough, to stop. We might also use some information-theoretic measure like minimum description length to control the depth of the tree. We don't need to understand all those details to understand the concepts, though. So here's a picture borrowed from the HTK manual; HTK is just a standard toolkit for speech recognition which has been extended for speech synthesis.
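Here is a minimal, self-contained sketch of that split criterion: fit a single diagonal Gaussian to the data falling on each side of a candidate yes/no question and measure the gain in (approximate) training-data log likelihood over not splitting. Real systems work from accumulated sufficient statistics rather than raw frames, as in Odell's thesis; the data and questions below are invented for illustration.

```python
# Choosing the best question by maximum increase in training-data log likelihood.
import numpy as np

def node_log_likelihood(frames):
    """Log likelihood of frames under a diagonal Gaussian fitted to them by
    maximum likelihood: -0.5 * N * (d*log(2*pi) + sum(log var) + d)."""
    n, d = frames.shape
    var = frames.var(axis=0) + 1e-8
    return -0.5 * n * (d * np.log(2 * np.pi) + np.log(var).sum() + d)

def split_gain(frames, answers):
    """Increase in log likelihood from splitting on one yes/no question."""
    yes, no = frames[answers], frames[~answers]
    if len(yes) == 0 or len(no) == 0:
        return -np.inf
    return (node_log_likelihood(yes) + node_log_likelihood(no)
            - node_log_likelihood(frames))

def best_question(frames, questions):
    """questions: name -> boolean answer per frame (a simplification; really the
    answer is per context, and frames are grouped by context)."""
    return max(questions, key=lambda q: split_gain(frames, questions[q]))

rng = np.random.default_rng(1)
frames = np.concatenate([rng.normal(-2, 1, (50, 3)), rng.normal(2, 1, (50, 3))])
questions = {
    "is the right context a vowel?": np.arange(100) < 50,            # separates the data
    "is there a phrase break 4 words away?": rng.random(100) < 0.5,  # irrelevant
}
print(best_question(frames, questions))
```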
Let's see if we can understand the whole picture. Ignore the phone set; I don't know what phone set this is, it's a speech recognition person's idea of a phone set, I think. In speech recognition we use triphone models; in speech synthesis we're going to use quinphones plus all that other stuff, but the concept is exactly the same. In speech recognition we typically take all the models with the same centre phone, so these are all the models of "ah", whatever sound that is, and we cluster separately for each state position. So we're going to have one tree specifically for clustering the centre state of all of the triphones of "ah". There'll be three trees for this phone, and then there'll be three more trees for every one of the other forty-odd phones in the language. And we just query the name: we ask, is it voiced to the left, silence to the right, and so on, and the leaves of the tree are groups of contexts that have this much in common, but we don't know anything about the things we didn't ask. This maps directly onto what we were talking about at the beginning. Some aspects of the linguistic context make a difference to the current sound, so we should ask about them and we should try and match them. Some aspects of the current context appear not to make a difference to the current sound; in other words, in the training data we didn't find enough evidence that they made a big difference to the sound, and so it wasn't worth querying them. In other words, when we built the regression tree, at some point we tried asking about that context feature, we tried partitioning the data, and it didn't increase the likelihood very much, or as much as some other question; there was not much acoustic difference between the two sides of the split, so it wasn't very predictive of the acoustics, so that question didn't find its way into the tree. This is how we discover automatically that some contexts, some features in the context, don't have much effect on the acoustics and we don't need to ask about them, and that's good news, because it lets us have multiple contexts at the leaves of this tree. Now, I'm not going to go into huge detail about the mechanics of training this; let's just verbally summarise how the recipe would go. To cluster the models you need some models. Where did these models come from? Well, in the training data we do have one training example of some of the models, so we train a model on that one training example. It's a rather badly trained model, but it'll be different from a model trained on some other context, so it'll be enough information to know which ones turned out somewhat similar and how to cluster them. So we will cluster based on rather badly trained models; once we get to this point we'll make the models share their parameters, and then we'll retrain them, and in this retraining we'll have a lot more data available, because they've pooled all their data together, and we'll get much better models. So the clustering is based on rather badly trained models, and then we train better models after that. In the really fancy HMM synthesis recipes we then repeat this several times, so we can use the slightly better models to recluster, and we can go round and round and round, and the recipe starts to take a week to run on your big computer instead of a day. But that's just a refinement; it's not important to understand it. At each step we measure the increase in likelihood to get the best split.
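Here is a minimal sketch of how the clustered models end up organised and used: one small tree per (centre phone, state position), each leaf holding shared parameters, so retrieving a model is just a matter of descending the tree with the context in hand. The questions, contexts, and parameter values are invented for illustration.

```python
# One decision tree per (phone, state); leaves hold tied Gaussian parameters.
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Node:
    question: Optional[Callable[[dict], bool]] = None   # None means this is a leaf
    yes: Optional["Node"] = None
    no: Optional["Node"] = None
    leaf_params: Optional[tuple] = None                  # (mean, var) at a leaf

def descend(node: Node, context: dict) -> tuple:
    while node.question is not None:
        node = node.yes if node.question(context) else node.no
    return node.leaf_params

# Centre state of phone "ao": first ask about the right context, then stress.
tree_ao_state2 = Node(
    question=lambda c: c["r"] in {"th", "s", "f"},        # "is the right context a fricative?"
    yes=Node(leaf_params=("mean_A", "var_A")),
    no=Node(
        question=lambda c: c["syllable_stress"] == 1,
        yes=Node(leaf_params=("mean_B", "var_B")),
        no=Node(leaf_params=("mean_C", "var_C")),
    ),
)
trees = {("ao", 2): tree_ao_state2}     # one tree per (phone, state position)

# A full context never seen in training still reaches some leaf, so we always
# have a model to use.
unseen = {"l": "zh", "r": "th", "syllable_stress": 0, "words_to_phrase_break": 4}
print(descend(trees[("ao", 2)], unseen))    # ('mean_A', 'var_A')
```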
And how do we get models for unseen contexts? Well, hopefully it's completely obvious. Imagine we wanted a model of "ah" in a context that we've never seen before, but at synthesis time we want to say something that involves needing this model, and we don't have it anywhere in the training data. It doesn't matter, because you just descend the tree, find out which leaf you end up at, and use the model that you find there. We're saying that this context is equivalent to whatever we found down there. So that's trivial at runtime; it's going to be super fast, there's going to be no problem doing that. The computation is going to be in building the tree, which is the training process, and that's offline. Any questions about that conventional view, this view of attempting to train models you can't really train and then finding this nice solution of sharing their parameters? Now we've got this system, and it's tempting to think of it as hidden Markov models doing lots of the work: they're state machines, they've got Gaussians in them, and what's really happening, on this view, is hidden Markov models, nice sequence models, and then there's this unfortunate bit of mechanism in the background that provides the models with their parameters. It's just sort of hidden, it's inconvenient: we wish we could train separate models for everything, we can't, so there's this kind of messy tree thing, which means that when a model says "give me a Gaussian", it happens to get the same Gaussian as some other model; it's a hidden mechanism. That would be one view of it, the mechanistic how-do-we-build-it view, but it's not really what's going on. What's really going on is that we've got a big tree that is making predictions from names of models to parameters of Gaussian distributions. That's regression. That's doing all the work of speech synthesis; everything is happening in the regression tree. All the HMMs do is tell us what order to do things in. They tell us to go 1, 2, 3, 1, 2, 3, 1, 2, 3; they're just a timekeeper, a sequencing device, not really doing very much more than that. So the second view, the view that is easiest to connect back to unit selection and forwards on to DNN synthesis, is that this is speech synthesis based on a regression tree. We've got a big box that's doing all the work in the chain of events from text eventually to speech, or to speech parameters to drive a vocoder, and it's all about regressing from linguistic context features, flattened onto the phone and then expanded out into this big sparse binary vector of 300 or 400 bits, and we're just querying individual bits of this vector. In arriving at any leaf in the tree we've only queried a very few bits. Think about how deep this tree is compared to the number of questions we could ask. The whole point is that we don't query all the features; the point is that we query the small subset of features that are predictive of the acoustics, and we don't even look at the values of most of the features. We just go down a few questions. So the depth of the tree is going to be of order 10, maybe of order 100 for F0, but it's not going to be hundreds and hundreds, it's not going to be all 300 questions we could have asked. So we just ask a small subset of questions, in a specific order, and end up with a prediction of an acoustic parameter. Now, it's going to be a distribution from which we still need to generate, but effectively we've already got the parameters here at the leaves of this tree, so it's almost a lookup table, a kind of fancy lookup table. So I think we'll take a break in a minute, but I'll just leave you with some thoughts to think about in the break, to help us connect when we come back. We've queried the linguistic context,
which is extremely rich, and we've made a prediction of an acoustic value, a set of vocoder parameters. But hopefully it's immediately obvious that, depending on which acoustic parameter we're predicting, imagine F0 compared to the spectral envelope, we might want to ask about very different linguistic features, in a very different order. If we're predicting the spectral envelope, we probably first want to know what phone we're in; that's pretty much crucial. In fact we probably want to know whether it's silence or speech first, then, if it's speech, which phone it is, and then what's to the left and right phonetically, and then eventually we might ask about syllable structure or something like that. If we're predicting the value of F0, we might not care what the current phone is; we might first want to query whether we are in a stressed syllable. That might be the biggest acoustic difference: the first partition of the training data might be into things that have high F0 and things that have low F0. So we might want to ask different questions in a different order, and hopefully it's obvious then that we couldn't just use the same regression tree to regress onto the spectral envelope as to regress onto F0. We might want to have separate regression models for those, and that's not going to be problematic at all; it's going to turn out to be dead easy. It sounds like it might be really messy and horrible, but it's not: we're just going to have separate regression trees. We'll see that after the break, but I'll leave you with a question: if we have some intuition about which features are most predictive for each of these different acoustic parameters, should we pre-select those features? Should we decide that for F0 we're only going to consider this part of the linguistic context, and for the spectral envelope we're only going to consider that part, and then build our regression trees? That's the question for you to ponder in the break. Now, I have been assuming a little bit of background in the basics of speech recognition and HMMs, because we can't cover all of the background here. If you're missing that, if you don't know some of this stuff, there's lots of basic material on the Speech Zone website; it's not all complete yet, but it'll be complete in the next month or so. So I left you with this question. We're clustering our HMMs, in other words we're building a regression tree, and we've decided that we can't use the same regression tree for all of the different acoustic parameters of the vocoder, because different contextual factors will have a much bigger influence on some than on others, so we want to ask different questions in a different order. And you pondered whether we should then engineer that: whether we should decide which features to use for each stream, as they're called. So let's think about that, and let's just clarify some terminology first, excuse me, there's some terminology here. Depending on the vocoder it might have different parameters, but broadly speaking we're going to have something to do with the spectral envelope of the voiced part, the F0, the energy, and maybe something about the aperiodic, non-periodic energy, maybe the spectral envelope of that as well. And we can divide each of those quite different acoustic types of features into what are called streams; that will become clear in a moment. So we're going to build a different regression tree to predict each stream of vocoder features, and of course, yes, we could engineer it, we could decide which features to use, but the regression tree will learn that for us. That's the point of classification and
regression trees, as we already saw in the front end: we just think of every possible feature, and we've thought of 300 binary features we could use from the front end, and we just provide all of them to the algorithm that builds the tree, the greedy algorithm, and it will select the ones which partition the training data, in other words which cluster the models, most effectively. So there's no need to pre-select those features for each stream. Given enough data, and as long as we don't over-cluster and grow the tree too deep, we'll automatically discover which features are the most useful. We don't need to do pre-selection, unless we've got really strong engineering intuitions about excluding features because they'll cause problems, but that shouldn't happen. So, still assuming a little bit of background in the basics of speech recognition using HMMs and triphone models, let's just clarify some of the differences between what we're doing here in HMM synthesis and what we do in HMM speech recognition. The most obvious one is that we're considering a much richer context when building our acoustic models. Why? Why doesn't speech recognition use this very rich linguistic context? Surely it'd be better to take more context features into account. You might indeed find systems that use quinphones instead of triphones, and you might get some small improvement by doing that, but the goals of recognition and of synthesis are quite different. Although we might appear to be using the same models, we're doing quite different things with them. In recognition we're trying to draw boundaries between categories, and to draw a boundary between categories we want to use the simplest model we can, so that it's as robust as possible, as well-trained on the data as possible; and if a triphone model can draw the boundary, there's no reason to move to a quinphone model, which has more parameters, might be less well trained, and could actually do worse. In synthesis we don't care where the boundaries between categories are; we care where the middles of the categories are in acoustic space, because we would like to generate these sounds from the canonical, the middle, the prototypical sound. So when we've got a model of an R in a certain context and we generate from it, we want it to sound like an R; we don't care where the boundary between that sound and an OO is. So in ASR we want the simplest possible models that can draw boundaries between categories, and ultimately those categories are words, so we just need to disambiguate the phonemes sufficiently so that we can then decode the words, with the simplest possible model, in other words with the fewest parameters, so we can train it as well as possible on the available data. In synthesis that's not the case: we want to know where the middles of the models are, where the peaks of these Gaussians are, because, as we'll see in the generation algorithm, that's really what we're going to use when we generate: the modes of these statistical distributions, not the boundaries between them. So that explains why we really want these full-context models: because we want to be very specific about what it means to sound like an R in a certain context, where the average is, where the middle of that sound is. And the second obvious difference is in the observation vector, in other words the thing that we're modelling, the thing that is emitted or generated by our generative model. In the case of ASR, again, it only needs to discriminate between sounds, so it just needs to be the least possible set of features, in some sense, that helps us tell the difference
between sounds. In synthesis we actually need to reconstruct the waveform, so it needs to be as rich as possible, to give us the best reconstruction of the waveform we can get. In all cases we're going to trade the richness, the dimensionality of that, off against the number of parameters in the model and the amount of data, so we need to keep things controlled for that reason. So I think we now have to look at this picture, because it appears in a lot of papers on HMM synthesis, having been copied many times; it's not obvious who it belongs to anymore. I've copied it from this paper, but I'm sure they copied it from some other papers. This is the complete flowchart, so again the procedural view of how you make and use an HMM system. It doesn't really tell you how it works; it just tells you how things go when you sit down and try and build such a thing. You start with your speech database and you parameterise it using your vocoder. Your vocoder has got two parts: a part that encodes the waveform into parameters, and a part that decodes them back into a waveform. So you use the encoder part and you encode into whatever features your vocoder uses. You might not use the raw vocoder features; you might need to apply some transformations to make them suitable for modelling with Gaussian distributions. So you might need to do some feature engineering to give them the right statistical properties, such as the assumed independence between the parameters within the vectors; you might, for example, use a cepstral representation of the spectral envelope to make those statistical properties more nearly true. You've got your parameterised speech, then you train your model, and that's what you store. So you've got this stored model. This is a picture of some separate HMMs, but that isn't a very helpful way of thinking about what the stored model is. What the stored model really is, is some really big regression trees; that's really what's in the stored model, and that tells us what parameters to use when we synthesise. So in the training part, training our HMM synthesiser, the big part of the computation isn't running the Baum-Welch algorithm, the forward-backward, to effectively align the data with the model and then estimate the Gaussians; that's quite fast. The big part of the computation is building the regression trees, especially if we rebuild and rebuild them; even just building them once is computationally expensive, because they're big and there's a very large set of questions. For every possible split in the tree we've got to try, at the root, all 300 questions, and then at the next level down the remaining 299 questions, and then 298 questions, and that multiplies up going through the tree, so there's a very, very large number of trial splits we have to make when we're building the tree. We store the model, and then we can use it to generate, and we're going to see in a moment how we do that. The model generates back into vocoder parameter space, so it generates the same sort of features it was trained on, of course, and then we need to use the remainder of our vocoder to actually generate a waveform. Text-to-speech, then, is just a question of putting text through the front end to convert it into a sequence of names of models, sometimes called labels, then going and retrieving those models from the stored set of models, sequencing them together, in other words concatenating them, and then generating our sentence from this concatenated model. So let's just say a little bit more about the fact that our vocoder parameters are also
heterogeneous: they're in different categories. F0 is quite a different thing to the spectral envelope; it will be predicted with different linguistic features. How do we achieve that? We can break the vocoder parameters down into these things called streams, and the streams are stacked together to make a big observation vector for our HMMs. That might look a little bit like this; here's one version of that, a little bit simplified. This is an observation vector: we've got an HMM state, and inside the HMM state is a Gaussian distribution. I'll draw a univariate Gaussian, but it's going to be a high-dimensional Gaussian of the same dimension as the observation vector we're going to generate, and that might be quite a big number, it might be 60, it might be 100, it will be quite a large number. The observation vector is broken down into streams, and we need to spell out now what's in these different streams. One part is going to be the spectrum; this might be represented using something like cepstral coefficients, a statistically decorrelated version of the full spectral envelope. Another part might be to do with the excitation; the simplest possible case will be the value for F0, and we might have a flag that tells us whether there is an F0 or not. We could encode that as a 0 value for F0, or we could do something really messy, which we'll see in a minute, which is to have no value for F0 in the unvoiced parts. What's not shown here is the non-periodic energy, so we might add additional parameters for that; exactly how this is structured depends on your vocoder, on what parameters your vocoder needs. So those are the streams. We can see something else here that's going to have to be made clear in a minute: in this observation vector we don't just store the basic cepstral coefficients of the spectral envelope, this thing here that might have, let's say, 40 coefficients, rather more than the 12 for speech recognition. We're going to store its deltas, in other words its velocities: is it increasing or decreasing compared to previous and next frames? And we're going to store accelerations, in other words the velocity of the velocity: is it getting faster and faster or is it slowing down? We'll see why we need those in a moment. In speech recognition we also use those, because they're discriminative, they help us tell the difference between classes; here they're going to help us more faithfully generate the trajectories that we need. So those are the things divided into streams, and we're going to have separate trees now for each of these streams. The fact that all of the deltas and delta-deltas for the spectrum can be collapsed into one stream, whereas the ones for F0 somehow stay in their own streams, is a bit mysterious just here, but it'll become clearer in a moment or two.
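To make the stream structure concrete, here is a minimal sketch in Python (numpy only, with made-up frame data and a deliberately simple delta window) of building the kind of per-frame observation vector just described: static cepstral coefficients plus their deltas and delta-deltas, stacked alongside a small F0 stream.

```python
# A minimal sketch of stacking streams into one observation vector per frame.
import numpy as np

def deltas(x):
    """Velocity of a (T, D) parameter track, using a simple +/-1 frame window."""
    padded = np.pad(x, ((1, 1), (0, 0)), mode="edge")
    return 0.5 * (padded[2:] - padded[:-2])

T, D = 100, 40                    # 100 frames, 40 cepstral coefficients
mgc = np.random.randn(T, D)       # stand-in for real cepstral analysis
lf0 = np.random.randn(T, 1)       # stand-in for log F0 (only voiced frames, in reality)

mgc_d, mgc_dd = deltas(mgc), deltas(deltas(mgc))
lf0_d, lf0_dd = deltas(lf0), deltas(deltas(lf0))

# Spectral stream: statics + deltas + delta-deltas collapsed together;
# F0 kept in its own much smaller streams.
observation = np.hstack([mgc, mgc_d, mgc_dd, lf0, lf0_d, lf0_dd])
print(observation.shape)          # (100, 123)
```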
But the output vectors, the observation vectors (press the thing so people at the back can hear you), they should be vocoder parameters at the end of the day, right? So let's just clarify where the vocoder parameters are in here. This is the cepstral representation of the spectrum that our vocoder will make, or that you may choose to make; we could have chosen to store some other representation of the spectrum, but the cepstrum is convenient because it's decorrelated and suitable for modelling with a diagonal covariance Gaussian. So that's some vocoder parameters. These are not needed by the vocoder; they're going to be used as part of generation, which we'll see in a minute: these deltas and delta-deltas likewise. Let's assume this value here is just exactly F0; that's needed by the vocoder. The vocoder doesn't need to know the deltas of F0, but we do need them for generation; again that will become clear in a minute. So the vocoder parameters are indeed in there, along with some extra stuff to help us with generation. Now, when we use a vocoder, hopefully it's obvious but it's worth stating, the number of parameters that we need for the vocoder has to be fixed in each frame. It can't vary from frame to frame, because then the dimensionality of our Gaussians would vary from frame to frame, and then I have no idea how to cluster them with a regression tree; that's a nightmare, there's no sensible machine learning to do that. So if we have a vocoder, for example, whose parameters were the amplitudes of the individual sinusoids in a sinusoidal model, and the number of those sinusoids varied because F0 is varying and the number of harmonics might vary, we can't directly model that; we would need to make that somehow a fixed number of parameters. So the vocoder has to have a fixed dimensionality, and there are various ways of doing that: we could just fix the number of sinusoids, or another way that's more common would be to actually model the envelope as the cepstrum, so in effect we would do the same sort of thing as any other vocoder. These sinusoidal models give very, very high quality, but in that kind of raw, naked form they're not so convenient for direct modelling; we need to do some feature engineering to give them features which we can model. That's fine for the spectral envelope: we can have a fixed dimensionality that represents the spectral envelope, the formant structure of the speech. But there's a parameter in speech which is really annoying because it doesn't always exist, and that's the fundamental frequency, because the vocal folds sometimes vibrate and sometimes don't vibrate. It's very, very inconvenient in statistical modelling to have a parameter that sometimes doesn't exist: how can we train a model of it, and what do we do at generation time? There are lots and lots of different solutions to this. One would be to give F0 a magic special value when it doesn't exist, and the typical value would be zero. So if we draw this, this is time and this is frequency, if we draw an F0 plot it's going to have values sometimes but other times it's not going to have values; we could give it a special zero value, and we could effectively redraw F0 like this. That's one possibility, and then ask our model to model that, but that's kind of weird, because at this moment in time the model has to predict that F0 suddenly plunges all the way to zero, and then stays there at exactly zero for a while, and then suddenly goes right back up, because this zero value is a fake value; it doesn't exist, it's not really F0 at all, it's just a dummy, it's a flag. So that's going to be a bit of a mess for statistical modelling. Another solution which is used, which is much more sensible, is just to interpolate a value of
F0 across this gap here, to fill in the blanks, and ask the model to effectively model F0 when it doesn't exist by giving it interpolated values in the training data. At generation time we could generate these interpolated values plus a flag that says, actually it's unvoiced, we're just interpolating at the moment, and then voicing comes back on again. That would be a sensible and good solution, but it's actually not the most common one. The most common solution is rather more mind-bending, and I know people that don't believe this is statistically possible; I'm just going to tell you how it's done without telling you whether I believe it or not. This is how it's done: we say that F0 is a vector of dimension one. It's a number, a scalar, which is also a vector of one dimension; we're happy with the fact that a scalar is a one-dimensional vector, that's not too difficult, right? That's not the mind-bending part. When F0 doesn't exist, it's a vector of zero dimension. I don't know how you draw that. OK, so F0 has varying dimensionality, and I said before that's statistically very inconvenient; indeed it is, but it's the standard solution, and the standard toolkit, HTS, does exactly that. So how do we model something that's sometimes one-dimensional and sometimes zero-dimensional? We effectively model it with a mixture distribution, a mixture of two densities. One of them is a one-dimensional Gaussian; this is fine, I can understand that. The other is a zero-dimensional Gaussian, which has got no parameters, because it doesn't have a value, it doesn't have a mean or a variance; it's empty, it's a blank space. So that's OK, you can train that, and of course, because it's a mixture distribution, we have weights, mixture weights on the two distributions, and the mixture weights tell us if we're in the voiced or the unvoiced thing. So there are some mixture weights, and this R1 means we're in one-dimensional vector space, the real numbers, and this is zero-dimensional space, and we have some weight that tells us which of these is more likely, and effectively that's the voicing probability. This is called a multi-space distribution, because it's in two different spaces, two different vector spaces: a one-dimensional space and a zero-dimensional space. And that's how it's done. So effectively what we're really doing is having a flag for voicing, which is the mixture weight, and if the voicing weight is higher then we generate from the voiced distribution; that is indeed what we do as standard. When we generate from the model, and we'll see the generation algorithm in a minute, we're going to take a walk through the HMM; let's assume it looks like this for now. We're going to take a walk through the HMM and generate some observation, and then maybe we'll make a transition to the next state and generate an observation. It's possible that in moving from one state to the other we might move from unvoiced to voiced; that's very likely to happen if these states are in adjacent models, different models, and it might happen within a certain sound. So the dimensionality of this bit of the vector can change from frame to frame, it can come and go from frame to frame, and that's going to be a little messy in a minute.
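Before moving on, here is a minimal sketch in Python (numpy only, made-up F0 values) of the simpler, "sensible" option described above: interpolate F0 across the unvoiced gaps and keep a separate voiced/unvoiced flag, so the model always sees a value for F0 rather than a dummy zero or a missing dimension.

```python
# A minimal sketch of F0 interpolation across unvoiced regions plus a voicing flag.
import numpy as np

f0 = np.array([0, 0, 120, 124, 130, 0, 0, 0, 110, 108, 0], dtype=float)
voiced = f0 > 0                       # the flag the model will also predict

t = np.arange(len(f0))
f0_filled = f0.copy()
# linearly interpolate (and edge-extend) over the unvoiced frames
f0_filled[~voiced] = np.interp(t[~voiced], t[voiced], f0[voiced])

print(np.round(f0_filled, 1))   # continuous contour, no sudden plunges to zero
print(voiced.astype(int))       # ...and the voicing decision kept separately
```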
Let's talk about the remaining parameter that we haven't talked about at all, that's not a vocoder parameter but is a parameter we'll have to model, and that's duration: we need to model how long speech sounds last. In a normal HMM that's what this thing here does, the self-transition probability, and the self-transition probability would normally be a number that's close to, but a bit less than, one, such as 0.9, and that's the probability of going round and round and round and eventually leaving. So that's the model of duration that's used in ASR. If we think about what that duration distribution looks like, we can do that with a pocket calculator: take 0.9 and just multiply it by itself over and over again, and it's going to be this exponentially decaying distribution, like that. If we plot the real duration distribution of speech sounds it doesn't look like that; it might look a bit more like this, kind of like a skewed Gaussian sort of distribution. So this self-transition probability is a very weak model of duration, because it doesn't have the same shape. It's got a much more fundamental problem though. We're using generative models, these Markov models, and we're going to have to generate in a minute, and we need to generate under some sound statistical principle, and the obvious one is to use maximum likelihood, because that's what we used in training, so we should just do the most likely thing at generation time. Here's a model of some speech sound; let's use it to generate. What should we do? We'll just generate the most likely thing from this model. That would make the most likely duration of everything one frame: we'd just fly straight through the model at maximum speed, because the longer we stay in the model, the less probable things get. That's a really rubbish model of duration; that wouldn't work at all. So we're going to actually use a proper parametric model of duration in speech synthesis, and we're going to fit a Gaussian to this, or to the log of this, just another parametric model of duration. There are some implications of that which I'm actually going to skip over and refer you to the readings for: it's mathematically a bit challenging to have explicit models of duration. It's fine at generation time if you're just doing maximum likelihood and picking the mode of this distribution, but it makes it very difficult to do decoding, which is why we'll very rarely see that kind of model in speech recognition; it makes Viterbi decoding impossible, essentially. So for synthesis we might train with a transition model and then, after the fact, fit our special parametric duration model and only use it for generation, or we might get very clever and try and actually train with this full parametric duration model, which is probably unnecessary. And of course the duration models, because they're just parametric, they're Gaussians at the leaves of a regression tree, just like all the other parameters. In other words, duration is also a regression problem: we predict the duration by regressing on the linguistic features, so it's actually no different from any of the other parameters in the way it's treated; it's just that at generation time we'll use the duration to decide how many frames to generate, not what goes in each of the frames, but in terms of learning it, it's not any different from anything else.
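Here is a minimal sketch in Python (numpy only, made-up numbers) contrasting the two duration models discussed above: the implicit geometric distribution you get from a self-transition probability, whose most likely duration is one frame, versus an explicit Gaussian fitted to observed state durations, whose mean gives a sensible duration at generation time.

```python
# A minimal sketch: implicit geometric duration vs an explicit Gaussian duration model.
import numpy as np

a = 0.9                                   # self-transition probability
d = np.arange(1, 31)                      # durations in frames
geometric = a ** (d - 1) * (1 - a)        # P(stay d-1 times, then leave)
print(d[np.argmax(geometric)])            # mode is 1 frame: "fly straight through"

durations = np.array([7, 9, 11, 8, 10, 12, 9])   # made-up training durations
mu, sigma = durations.mean(), durations.std()    # explicit parametric duration model
print(round(mu))                          # at generation time, just use the mean
```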
This little bit: we've talked quite at length now about the different streams and the clustering and so on, so the full model actually looks a little bit like this. We've got some cepstrum, or some other representation of the spectrum, almost certainly on a perceptual scale if we're sensible, and we're going to have a regression model, a tree, that predicts its value. We're also going to have some source parameters such as F0; F0 is a bit funny, it's a one-sided thing, it's never negative, so we might make its distribution look more Gaussian by taking the log or some other function, and we've got regression trees to predict that too. If our vocoder also has special parameters for non-periodic energy, so for the difference between an s and a z, some aperiodic parameters, we'll have another stream for those. And somewhere else there is a model of duration, which is just another regression tree. So we've just got lots and lots of regression trees; exactly how we configure them is up to us, it's a design decision. In ASR we might have one regression tree per monophone, in other words per centre phone, per state. In synthesis we might actually cluster all of the phones together, because for example in F0 we might not care very much about the centre phone until we've asked lots of other questions, so we might have just one regression tree per state position: one tree for the first emitting state, one for the second emitting state and one for the third emitting state. This picture has got three emitting states; it's actually much more common to have five emitting states, so five-emitting-state HMMs are the standard for HMM synthesis. So how, finally then, given these models, these things we're calling HMMs but which are really lots of regression trees telling us the parameters of some HMMs, how do we actually generate from them? And to remind you, we're going to look at this conceptually rather than mathematically; if you want the mathematical formulations go to the papers, I think pictures are much easier to understand. We're going to generate all of the vocoder parameters and then we're going to drive the vocoder. So we're going to generate from a model; the model is a model of a phone in a particular context, concatenated, sequenced together with all of the other phones in context that make up the whole sentence, as predicted by the front end. And just to make it super, super clear in case it's not obvious: it's completely trivial to concatenate two HMMs together. Let's draw two HMMs; here's one, I'll just draw some simple ones with self-transitions, here's an HMM and here's another HMM, badly drawn, and if we just join these two things together, this is just an HMM. The beauty of HMMs is simply that we can make one HMM by just gluing two other ones together, by sequencing them and concatenating them, and any algorithm that we know for this one works for the whole one. So we're going to generate from an HMM which has been made by joining together all the phone-sized HMMs, but it's just a really long stringy HMM with lots and lots of states, so if I draw pictures of little HMMs generating, everything we know there extends to really long HMMs generating; they're not different in any way except having more states. So if we want to generate we need some principle, and we've already said that in the absence of any other principle we'll always go for maximum likelihood. That's the most basic principle of machine learning: do the most likely thing; get a model which generates the data with maximum likelihood; given a model, generate its most likely observations. If we did that, if we generated from an HMM, and we went into an HMM state and said to the state, emit an observation please, and emit the most likely observation, what's the most likely observation we can emit from an HMM state? Just the mean of the Gaussian. And then we go to the next frame, and let's say we're staying in the same HMM state, our duration model says we need to do several frames in a row from this HMM state, so let's generate again from the HMM state, and we generate the next time the mean again, so we're just going to generate the same
exact same value again and again and again for a sequence of frames, until the duration model tells us to move to the next state. So if we drew such a thing, let's just draw it for F0 because it's easier to draw things in one dimension, so that's F0 in hertz: we're going to generate a constant value for some number of frames, and then move, change state, to a different value, and then some different value. Have you ever seen an F0 contour that looks like that? Except when you generate wrongly from an HMM, F0 doesn't look like that; F0 is smooth, and so are all the other acoustic parameters. So we've missed a trick, we've done something wrong; the maximum likelihood generation algorithm there seems to have failed us. There's nothing wrong with the principle, it's just that we haven't modelled something properly, and now we need to think about what we need to model to get that smooth trajectory. So we're going to sequence through this model, we're going to visit the states in order, and at each time frame, and the time frame is typically going to be a 5 millisecond step, a 200 hertz frame rate, we're just going to generate little bits of speech features. Let's imagine there's a bit of spectrogram, but in reality of course they're vocoder features, and then we move on and generate the next one and the next one and so on, and we keep sequencing through, emitting from this model over and over again, and eventually print out a set of vocoder parameters which we can then use to get the waveform. Now, real vocoder parameters are smoothly varying; we don't see sudden discontinuities in here. As we move through this spectrogram here, things are all nicely, smoothly changing, and our waveform is also continuously changing; we see boundaries between some sounds, but they're smooth transitions, and within vowels for example we get nice smooth trajectories. So how are we going to get a model that has got discrete states, in which we stay for a fixed number of frames, a fixed period of time, and then suddenly move to the next state, how are we going to get such a very discrete model to generate something smooth and continuous? The answer is we'd better look again at the data and think about what it is about the data we haven't quite captured in its parameterisation, in the vocoder parameterisation. At the moment we're just thinking about generating the mean of these Gaussians, and that's going to be this piecewise-constant thing that we just drew, this kind of square wave, and we don't want to generate that; we want to instead generate something that's smoothly varying. So we're going to kind of join those dots up, it's like a dot-to-dot drawing, and we want to draw a smooth trajectory through those dots. If we just follow the mean we'll get these very nasty square-looking trajectories, like that, and that's the wrong solution, that will sound terrible. The solution we do want is to have a nice smooth line, perhaps something like that. So how are we going to do that? Well, there are a couple of observations we can make about that. First of all, although this is the mean, maybe this is the mean of F0 in the beginning of a certain sound in a certain context, we know more than just the mean: we already know that there's some standard deviation about that mean that we observed in the data, so just generating the mean over and over again is very naive, there's natural variation about that mean. That's one factor, but a more important factor is that we have to get from here to here in a natural and plausible way, and we can only move F0 at a certain velocity.
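As a reference point, here is a tiny sketch in Python (numpy only, made-up means and durations) of exactly the naive scheme just criticised: emit each state's mean for its whole duration, which produces the piecewise-constant "square wave" contour that real speech never has.

```python
# A minimal sketch of naive mean-only generation: flat segments with sudden jumps.
import numpy as np

state_means = [120.0, 150.0, 135.0]     # e.g. mean F0 (Hz) of three states
state_durations = [8, 12, 10]           # frames to spend in each state

naive_trajectory = np.concatenate(
    [np.full(dur, mean) for mean, dur in zip(state_means, state_durations)]
)
print(naive_trajectory)   # flat, then a sudden jump, then flat again
```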
You know how to control F0 with your vocal folds; what do you do, how do you change the pitch of your vocal folds? We do it with tension: there's a muscle, we stretch it, just like a guitar string. So there's a muscle, and that muscle is only so strong and can only act so quickly, it's governed by physics, and so we can only move at a certain velocity between here and here, and we can only change that velocity within limits as well, so it's got some acceleration constraints; everything has a mass. So that's why we need to model the velocity and the acceleration of this parameter, not just its value, and that's indeed what we're going to do. Let's just stick with the velocity and then generalize that to acceleration. So we've got some coefficient, maybe it's one of the cepstral coefficients, maybe it's F0, and that's my picture of a Gaussian, and let's see what happens if you try and generate from that. Let's draw that on a time axis like this, and always label your axes, so this is time and this is going to be a generated parameter coming out of the model; eventually it'll go into the vocoder, but think of it as a formant or F0 or anything that helps you understand. So I've drawn the Gaussian aligned with the axis of the parameter it's a distribution of, and of course it's got a mean, and if we just generated from that mean, so the frames go through time like this, that's the clock ticking away, and we generated over and over again from the mean, we'd just generate this piecewise-constant value, which would be unnatural, so that's obviously the wrong thing to do. What we're going to do instead is model not just the distribution of the value we're generating but the distribution of its velocity. Velocity is speed but it includes direction, because it can go down as well as up. So we're still going to take the distribution over the parameter itself, and we're still going to use that to generate from, but we're also going to generate something that's not only likely under this Gaussian but is also likely under the other Gaussian; it's the most likely thing given what we know about its average value and about its speed. So its average value is this, but we know that its speed is positive; generating this is actually quite unlikely, because that's got a zero velocity, and in the distribution of velocity that will have a very low probability, that's a very unlikely thing for this model to do. We've got some distribution over velocity that says the most likely velocity is plus six, and so the most likely thing we would generate would actually be this: it still has the same average, so it's centred on the mean, but it's got a velocity equal to the mean of the velocity distribution. And hopefully you can mentally generalize that: if we had a distribution over acceleration we could say not only would it have this average value, and in general it would be increasing, but it would be getting faster all the time or it would be slowing down, so acceleration would give us curvature. So we've got value, slope and curvature, and that's what those three coefficients are going to give us, and that's why we need to model them. Once we've done the generation we only care about the actual value, that's what's going to drive the vocoder; those deltas are just thrown away. We can generate them but we don't care about their values; they're constraints on generating the base coefficient, this c. The terminology that's usually used for that is 'the statics', which is a stupid terminology because they're not static at all; they're called the statics to
differentiate them from the deltas and the delta-deltas. So here's another picture borrowed from a paper, unattributed because it's borrowed from a paper that's borrowed from a paper and I don't know how far back to go to attribute it. This is a picture of the whole generation process. Let's suppose this is now the second cepstral coefficient of the spectral envelope, or any other parameter that you like to imagine, and this is time in frames. Here's a model at the top; for some reason this model has got variable numbers of states for silence and for speech sounds, and it's showing self-transitions instead of duration distributions; none of that matters. We're going to visit the first state, we're going to stay in it for some period of time, this period of time, and from that state we're going to generate the first few frames, the first eight frames let's say. That state has a mean for this coefficient and it also has a standard deviation, and a mean and standard deviation for velocity, and a mean and standard deviation for acceleration; that's what these grey bands show us. And what we're going to do is make a trajectory in this base coefficient space that's the most likely trajectory given the distributions over its own value, so it's going to go quite near the mean quite a lot of the time, that's going to be a likely thing to do, but is also as likely as possible given what we know about its velocity, and has the curvature that we've also learned from the data. So if we zoomed in a little bit, let's look at this little bit of the picture here, hopefully this is big enough to see: here's a little bit of trajectory where the distribution we've learned from the training data, in other words the statistics, the mean and standard deviation, are like so, and we've learned that the velocity is positive. So when we generate from this state we generate something that's around the mean value, I'm hoping you can see that, but the trajectory is increasing, and that's likely because this has a mean that's a positive value, and if we zoomed in further we could see it had some curvature, and its curvature, its acceleration, was that it was slowing down slightly. So does that mean that through this special kind of trajectory generation you do eventually use the standard deviations of the Gaussians, or does it in fact just keep the means and integrate through this?
Good question. We of course do use the standard deviations; we store them as variances, but standard deviations are the thing we plot. Yes, we do indeed use them. So let's have a look at some cases, and compare the difference between what we're generating here and what we're generating here. The grey bars are showing you the standard deviation, so the wider they are, the bigger and flatter this Gaussian is, in other words the more we can deviate from the mean while staying reasonably likely. So indeed, the wider the standard deviation, the more we deviate from the mean. And when we have something like this, with a very narrow standard deviation, all of the training examples that were used to train this Gaussian had really similar values to each other; the natural standard deviation of the data was very small, and therefore the Gaussian had a small standard deviation. It's saying that this sound always sounds really similar, and so when we generate from that we will indeed vary only very slightly away from the mean. Take some other sound, like this sound here, and the data that this state was trained on, there we observed greater variability in the data, so the Gaussian ended up with a larger standard deviation, a higher variance, and when we generate from it we can deviate further from the mean while staying likely. So indeed, if all of these Gaussians had very tight variances we'd slowly get back to that square wave; it would lock us back onto those trajectories because that would be the most likely thing to do. So we do need those variances: they tell us how sloppy we can be, or how accurate we need to be. But do we only have variances for the static parameter? No, we have variances for everything, and we find the most likely trajectory under all of these distributions. So if the velocity distribution is negative and has a very tight variance, we'd better make sure that we're decreasing in value while we're generating from that state, whereas if the velocity distribution is negative but has a really wide variance, it's not as important to be decreasing, we can still be quite likely. So these are constraints. Now, what's really going on? If we generated in the naive way we'd generate this kind of square wave, and that's going to sound unnatural; in nature, in real data, what we see is a smooth curve. Now, there are other ways of smoothing, there are much easier ways of smoothing than doing this: we could just run any simple moving average over it, or any simple smoothing algorithm we liked, and smooth out the bumps. So we could take things that look like this and run whatever your favourite smoothing algorithm is over it, a moving average, and that would just smooth out the bumps for us. That's not so different to what's happening here, except here we're learning how smooth to make it: this tells us, can we move very quickly to get between the two values, or should we move really gently between the two values? So all we're doing here is smoothing, but the smoother is controlled by these parameters, and these parameters are learned from the data, so we learn to be as smooth as natural speech seems to be. This whole algorithm, and we're not going to look at the maths for it, is called maximum likelihood parameter generation. It's still maximum likelihood, that's still the most basic of all principles, but it takes account of the deltas and the delta-deltas, the velocities and accelerations. Now, you could probably run an arbitrary simple smoother over it and get very, very similar results.
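Here is a minimal sketch in Python (numpy only, one-dimensional parameter, made-up statistics, a deliberately simple delta window) of the idea behind maximum likelihood parameter generation: find the static trajectory c that maximises the likelihood of the stacked statics, deltas and delta-deltas under the state sequence's Gaussians, by solving the weighted least-squares system (WᵀΣ⁻¹W)c = WᵀΣ⁻¹μ. The result is a smooth transition between state means rather than a square wave.

```python
# A minimal sketch of maximum likelihood parameter generation for one parameter.
import numpy as np

# Per-frame means/variances for static, delta, delta-delta, after mapping each
# frame to its HMM state (here: 8 frames of state A, then 8 frames of state B).
mu_s  = np.array([120.0] * 8 + [150.0] * 8);  var_s  = np.full(16, 25.0)
mu_d  = np.zeros(16);                          var_d  = np.full(16, 4.0)
mu_dd = np.zeros(16);                          var_dd = np.full(16, 1.0)

T = len(mu_s)
I = np.eye(T)
D = 0.5 * (np.roll(I, 1, axis=1) - np.roll(I, -1, axis=1))   # delta(t) = 0.5*(c[t+1]-c[t-1])
D[0, :] = D[-1, :] = 0                                       # ignore edge frames
W = np.vstack([I, D, D @ D])                                 # statics, deltas, delta-deltas
mu = np.concatenate([mu_s, mu_d, mu_dd])
prec = 1.0 / np.concatenate([var_s, var_d, var_dd])          # diagonal precisions

A = W.T @ (prec[:, None] * W)
b = W.T @ (prec * mu)
c = np.linalg.solve(A, b)          # smooth trajectory, not a square wave
print(np.round(c, 1))
```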
And in DNN synthesis we might do maximum likelihood parameter generation, or we might do very naive smoothing, and we might not find a huge difference; so this is just fancy, fancy smoothing, that's the key. This algorithm is computationally very expensive; go to the papers, look at the maths, it's lots of big matrix operations, because we're solving a complicated equation to find the most likely thing under a big set of constraints over a sequence of hundreds of frames. You can essentially stack that up into some enormous matrix operation and then solve a set of simultaneous equations, and it's not cheap; that's a reason to avoid it, but it's the standard thing to do. Okay, so let's just wrap that up, and we're going to spend the last 10 minutes on one final topic, which is adaptation. We've talked about unit selection and we've talked about HMMs, and hopefully we've seen lots and lots of connections between them. This regression tree is regressing from linguistic features to Gaussians which we then generate from; pulling out a Gaussian is not a whole lot different to pulling out a bit of speech waveform, they're really not that different. Sure, they're distributions, but the mean is by far the most important part of that distribution, and they've got little durations, and we tend to use sub-phonetic units rather than diphones, but really we could see the big regression tree as just a tree-shaped look-up table that says: given these features, look this thing up, it happens to be a Gaussian, pop it somewhere, we have its duration, and then vocode it. That's not so different to retrieving a fragment of waveform of the right duration with the right features; they're really not so different. So you can see lots of connections between what the target cost does, which is just probing, querying those linguistic features, and the regression tree, which is also probing or querying those linguistic features. The target cost pulls out a set of candidates from the database, which we choose between to get a smooth join, but it could just have chosen the top one, so it really gives us just one set of parameters, effectively a mean. And in order to make that join well to the next thing, we make sure we make a smooth trajectory between them; we don't just concatenate them and do the square thing, we smooth them with this generation algorithm, whereas in unit selection we choose amongst the candidates those that already naturally have that smooth transition. Before we move on, any last bits of questions on that? Can we use a mixture of Gaussians in speech synthesis? So the question is, would you use a mixture of Gaussians as the distribution? In a sense yes, of course we can do that, but let's just draw a picture of why we don't, or why we generally don't. This is time, and this is some parameter we're trying to generate, and I hope it's okay that I'm drawing my Gaussians sideways, along the dimension c. Let's imagine we've got this Gaussian at this moment in time and then this Gaussian here, right, and the normal generation algorithm is to draw a nice smooth trajectory joining the dots up, like that. Okay. Now, if I had a mixture of Gaussians, let's just have two, imagine I had a mixture distribution here and a mixture distribution here, and these are very badly drawn Gaussians, I've got them sideways, and I've now got to find the most likely trajectory through here. Before, there was only one obvious trajectory, and it was simply to join the means up, constrained by the velocities and so on; not so anymore. If we draw the trajectory now, we could join this Gaussian to this Gaussian, or this Gaussian to this Gaussian, or this
one to this one, or this one to this one; there are now four paths, and the further we go on, the more paths, because each of these could now go through two Gaussians and two Gaussians, and the number of trajectories grows exponentially with the number of frames, which would be a very, very large number. We could actually never do exact computation with such a model; we'd have to do some approximations. Elsewhere this doesn't apply, because we just make this extreme and strong Markov assumption: we're not drawing these trajectories at all, we're just computing independently. But here we are not doing independent computations; what we do here depends on what happens here, and that's why MLPG is expensive. To do MLPG with mixtures is going to be exponentially expensive, in fact not possible to do exactly, so the answer is that in general we're probably not going to use mixture distributions. I have one comment about that: there are three versions of the parameter generation algorithm, and the one Simon explained is the very basic one. The third one is based on the EM algorithm, where we regard the mixture component as a hidden variable, and then by iterating the EM algorithm you can find a maximum likelihood trajectory with mixtures of Gaussians. So it's still possible, but as Simon mentioned it's more computationally expensive, though not exponentially expensive, and it is implemented in HTS. Right, any more? Okay, so let's just wrap up and remind ourselves why we're doing parametric models at all, given that unit selection sounds fantastic, and the key really is that we can perform principled operations on statistical models that are much harder to perform on recorded speech. The most interesting of those is adaptation: to change the model after we've trained it, to make it into a different model, and that's adaptation. Again we're going to do adaptation in a very simple form and do it with pictures; it's actually a very, very simple idea. Let's start with this idea: imagine that I've got my distribution of F0, so I've got some data, lots and lots of data, and I've trained a really robust model of F0 for a particular state in a particular model in a particular context. It's trained on hundreds of samples and I like it, it's robust. Imagine I would now like to make this into a model of somebody else. Before getting to adaptation, I'd just like to say that the biggest advantage of statistical models is that they are very compact, so they can be placed inside much smaller devices, where we cannot fit a unit selection system; that's the main one, adaptation is still not there for use in an industrial environment. I guess that's true commercially, but in terms of research, the reason people keep researching statistical models is that one; I'm standing here now, you're standing in the future. So imagine we've got a statistical model that needed lots of data to learn; this F0 model is not going to need a lot, but we would like tens of samples to get a good estimate of this model. So we've estimated its mean and its standard deviation, they're good solid estimates, this is a nice shaped Gaussian. I'd now like to make this a model of a new speaker, but imagine I've only got a very small sample of speech from that speaker, and we just know the average F0 of that speaker, so I just know one value. I can't train a new model from scratch on that data, that's a tiny, tiny amount of data, but what I could do is adjust this model: I could move this model to have the mean of that data, very simply by transforming it into that. Now let's generalise to some other
parameter: let's not just think about F0, but let's think about cepstral coefficients, the spectral envelope, which is very high-dimensional, and let's think about all the parameters in the model, all the different parameters in a very big trained HMM. I've got a big statistical model trained on many, many hours of data, and I've got a very small sample of data I'd like to make this model fit, but it's not big enough to train a model from scratch. What can I do to this model to maximise the likelihood of this new adaptation data, without retraining the model? This example is of F0, because that's a one-dimensional parameter; it's not a very good example because at such low dimensionality it would be easy. Let's imagine this was all in high dimensions: we have a well-trained high-dimensional Gaussian, in fact lots and lots of them, and we have a small set of data points, some adaptation data from a new speaker, a new individual, that we would like to fit to, but we don't have enough to retrain the whole model. Conceptually, what we're going to do is just warp the distributions; we're going to perform transformations to the distributions to make them fit this new data. Now, it's not completely obvious yet why we can't just retrain on the new data, so we'll make that clear as we go along. What sort of transformations might we think of applying? We're going to keep them simple, because we've got to be able to learn them on a very small amount of data, so we need to learn very simple linear transformations. The simplest linear transformation is just a shift: we could just add something to the mean, so we need to learn a number of parameters equal to the size of the mean, and add that; that's one thing we can do. If you're paying attention you'll see that I've written here one of the means of my huge set of Gaussians in my big trained model, indexed like this, i; so i ranges over thousands and thousands of Gaussians that I've trained at all the leaves of my regression trees. And if you're sharp-eyed you'll have seen that the thing that I'm going to transform it by, the offset, is not indexed by i, it's indexed by something else, and there are going to be a lot fewer of these; the number of values of k is going to be a lot, lot smaller than the number of values of i, and we'll see how that works in a minute. That's crucial: if there were as many of them as there are values of i, the adaptation would need as much data as just training a model from scratch. So we can take a very small set of transforms, here they just shift, and apply them to a model with a very large set of parameters; that's the key, the number of transform parameters is very small compared to the number of model parameters. A shift is a rather basic transform; we could do something like shift the pitch range, but that's all. We could then apply a more sophisticated transform: we might multiply by a matrix. Multiplying by a matrix can do things like rotations and shears and scaling, so we can expand everything and shrink everything or shear it, and if we combine multiplying by a matrix with adding an offset, we have an arbitrary set of linear transforms; we can move things all around in the space. So we're going to apply linear transforms to, in other words, every single parameter in the model; just think of the means, it's easiest to think of the means: every mean in the model, which we've already said is not that different from a
little fragment of speech sound ready for vocoding, we're going to move them all around in acoustic space. We're going to move them all to the left a bit, we're going to make everything expand a bit, we're going to rotate things a little bit; maybe someone's vowel space is a bit different and needs to be a bit rotated. We're going to learn these transforms from adaptation data, so we're going to have a small number of transforms, and the transforms are going to apply to whole classes of parameters. Let's think of a simple example, let's go straight to the picture. Imagine we've got our well-trained HMM synthesiser with all the big regression trees, and these are the Gaussians at the leaves of the regression trees, and let's imagine that in this language there are only vowels and consonants, so maybe these are the vowels and these are the consonants, and they happen to fall like this in acoustic space. If we're going to apply some transform to these, to move them all around, we might apply the same transform to all the vowels and some different transform to all the consonants, so we've just got two transforms but eight model parameters, and in reality we wouldn't have eight, we'd have 8000 or 80000. So we've formed these things into classes; we can learn that from the data, and you'll be unsurprised to hear that we do it by clustering them with a tree, you like trees, so we'll cluster these things into a tree, forming these classes known as regression classes, and then we'll just apply a linear transform to every model parameter. So these are the transform things, that's the matrix and the vector of the transform, and this and this belong to the regression classes, and these are the two sets of those things; so the number of values of k is far, far smaller than the number of values of i, and we can just apply these affine transforms. Here I'm just going to apply a shift and a rotation, but we could also stretch and shear and scale; we just move things through acoustic space with these linear transforms, and those transforms are just learned by looking at the adaptation data and the model parameters and maximising the likelihood. Because all these parameters belong to classes, they all get transformed, so every single one of the thousands and thousands of model parameters changes, but whole groups of parameters change in the same way. In the limit, with a very, very tiny amount of adaptation data, we might just learn a single transform that applies to every single model parameter; that's the simplest possible case, and it will have a very tiny number of parameters in that transform, one matrix of the same dimensionality as the observation vector and an offset vector of the same dimensionality, so a very, very tiny number. We could learn it even on one sentence, and it would do an approximate job of transforming the model to sound like, say, a new speaker or expression. And if we draw the model parameters but not the classes, it looks like a very complicated non-linear transform happens to the model: the entire model changes, everything in the model changes, but the transform is structured in this special way. So you can use that technique; those transforms are learned automatically with respect to some adaptation data, we adapt a model that's been previously trained, and it's just up to us, it's a design choice, what that model is and what that adaptation data is.
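Here is a minimal sketch in Python (numpy only, made-up model, untrained transforms) of the shape of this kind of adaptation: every Gaussian mean is mapped to A_k @ mean + b_k, where k is the regression class that Gaussian belongs to, so a couple of small affine transforms move thousands of parameters at once. Estimating A_k and b_k by maximising the likelihood of the adaptation data is not shown here.

```python
# A minimal sketch of applying per-regression-class affine transforms to means.
import numpy as np

dim, n_gaussians = 3, 8
means = np.random.randn(n_gaussians, dim)             # mu_i, i = 0..7
reg_class = np.array([0, 0, 0, 0, 1, 1, 1, 1])        # e.g. vowels vs consonants

# One affine transform per regression class (far fewer transforms than Gaussians).
A = np.stack([np.eye(dim) * 1.1, np.eye(dim) * 0.9])  # A_k: scale a little
b = np.array([[0.5, 0.0, 0.0], [0.0, -0.3, 0.0]])     # b_k: shift a little

adapted_means = np.einsum("nij,nj->ni", A[reg_class], means) + b[reg_class]
print(adapted_means.shape)                            # every mean moved, in class-wise groups
```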
An obvious example would be that the model is trained on many speakers, so it's speaker independent, and that the adaptation data is of one particular new speaker, so it's speaker dependent, and we make a speaker dependent model from the speaker independent model with a very, very small amount of data, maybe 10 sentences or 100 sentences, something we could never build a good system on, but we'll get a very good system because the underlying model was trained on a large amount of data. And you can extend that: an expression independent model made expression dependent, an emotion independent model made emotion dependent, and so on and so forth; there's any number of things we could do with adaptation, the limit is your imagination. So let's just wrap up, and I'd like to look ahead to what you're going to hear about a little bit later on, I think it's tomorrow morning, just to make the link on to DNN synthesis. Here's the picture of HMM synthesis, and deliberately most of the picture is a regression tree; there's a little bit of stuff happening here, and there's a front end happening here, but all the action is here, in a big regression tree. If it's a regression tree that's doing all the work, the problem is regression, and we can immediately ask ourselves why on earth we chose a regression tree. Regression trees are ancient, old-fashioned, and make very inefficient use of the data: when we learn a regression tree we start with all the training data here, and we do a very, very bad thing, we make the data smaller and smaller, we train on less and less data as we go down the tree. So deep down the tree, and it's obviously going to be much deeper than this, several levels down the tree somewhere, when we're growing the tree here, we're saying: which question, of all the remaining few hundred questions, should we use to split the data? And the data we're thinking of splitting is a tiny subset of the original training data, so we're going to make very poor decisions, because we've just thrown away a lot of the data, we're not using all the data. So it's not a very clever regression model. It's got advantages, such as being very, very fast at run time, and it's human-readable, if you like to read that kind of thing; otherwise it's a poor choice of regression model. So pick a different one, pick any other regression model you like, and the choice of the moment is the neural network. It's the most generalised regression model we can think of; think of it as a learnable, nonlinear regression model. And so we're just going to insert a neural network where there used to be regression trees, and everything else will be pretty much the same. Okay, that's it.
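As a pointer towards that next step, here is a minimal sketch in Python (numpy only, untrained random weights, made-up sizes) of the idea in the last paragraph: a small neural network standing in where the regression trees used to be, mapping a frame of binary linguistic context features directly to a frame of acoustic parameters. In a real system the weights would be learned by backpropagation on aligned training frames, which is not shown here.

```python
# A minimal sketch of a neural network as the regression model: features -> acoustics.
import numpy as np

rng = np.random.default_rng(0)
n_features, n_hidden, n_outputs = 300, 256, 123   # e.g. 300 context questions in,
                                                  # statics + deltas out per frame
W1, b1 = rng.normal(size=(n_features, n_hidden)), np.zeros(n_hidden)
W2, b2 = rng.normal(size=(n_hidden, n_outputs)), np.zeros(n_outputs)

def predict(linguistic_features):
    """One frame of acoustic parameters from one frame of context features."""
    h = np.tanh(linguistic_features @ W1 + b1)
    return h @ W2 + b2

frame = rng.integers(0, 2, size=n_features).astype(float)  # answers to the questions
print(predict(frame).shape)   # (123,) - the same kind of vector the tree's Gaussian gave us
```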