OK. So, you're going to hear way too much of me starting now, and you'll have even more of me, if that's not enough, on this website, speech.zone — that's the domain name. On there you'll find a variety of things, and if you like to have a copy of the slides in front of you while I'm speaking, you'll find them already on there, in this special course at the top of the courses section. And later, when we do the hands-on, we'll use this website for the instructions. OK. So, if you want to follow along, there's a PDF of the slides there, and we'll look at this site a little bit later on as well.

Right. So, we've got two hours to talk about text processing, and there's a lot to say, and yet there isn't so much to say, because I think it's fair to say that there have been no major advances in text processing — only very incremental, grindingly slow advances in the last 10 years — and a lot of people are very lazy about the text processing in their speech synthesis systems. Certainly we are. People working on English will typically use Festival, which is our system, as their standard front end. We haven't touched the text processing in Festival for a decade — more than a decade, really. We built it, it was very good at the time, and it hasn't got any better. In industry, of course, things are very different, but the techniques haven't really advanced a whole lot. People have just kept fixing the exceptions, writing ever larger dictionaries and rule sets, and catching the edge cases, which is fine, and it works, but it doesn't scale.

So, before we talk about all of this, let's talk about other places you can get information from. I would recommend two. One, if you're not at all familiar with any sort of natural language processing, is just to start with this book here, the excellent Jurafsky and Martin. It's a thousand pages of excellent stuff, and for what it is, it's very good value — a big, thick book. An equally thick book, but a bit more expensive because it's more specialised, and even better, is Paul Taylor's book with the excellent title Text-to-Speech Synthesis, and it does what it says on the cover: it covers everything in text-to-speech synthesis, including Paul's attempt at signal processing as well, which is pretty good. These are the sources I would refer my students to to find out more about this.

So here's what we're going to look at in the next couple of hours, and we'll take a break halfway. What is this thing called a front end? We're going to start a little bit philosophically, by thinking: how does this problem of going from text to speech break itself down naturally into stages — if it does? It may eventually be useful to think that it doesn't really break down into discrete stages at all. But for the purpose of explanation, and certainly for this summer school, we're going to pretend that it breaks down into various discrete stages. And the first stage is to deal with this messy input text and get it into good shape, in a way that gives us some chance of knowing how to say it. So it's getting from how it's written to how it's said, in the most general sense. And I'd like you to view text-to-speech, throughout this and all of the other presentations, as a straightforward regression problem. A hard regression problem, but one we can define very straightforwardly: input text, output waveform.
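In code terms, that's nothing more than one function from a string to a waveform. Here's a minimal sketch of that framing (the function name and signature are just my illustration, not any real system's API):

```python
import numpy as np

def tts(text: str) -> np.ndarray:
    """All of text-to-speech viewed as a single (very hard) regression
    problem: arbitrary raw text in, a waveform (array of samples) out."""
    raise NotImplementedError  # the rest of this session is about what goes in here
```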
And how we get from one to the other is entirely up to us. We could imagine some incredibly sophisticated, powerful regression model that gets us all the way there in one giant step. And given enough data, we might imagine being able to learn that relationship from arbitrary, raw text — unnormalised, with punctuation and abbreviations and everything else that's going on — all the way to the waveform. Nobody can do that. That's just too hard. There are too many different things going on to pile it all into one model, and it's also very hard to get enough data to do it. So it does get broken down into stages, and those stages are decided by experts: where to draw the lines along this long journey from text to speech.

And just to be clear what we mean when we say a regression problem: regression is the prediction of a continuously-valued quantity — in this case, the waveform. If we're predicting a discretely-valued quantity, we call it classification. And later on we're going to see that in this front end we're doing both classification and sometimes also regression.

So let's get a quick overview of what we need to do in this whole front end — a complete idea of what happens in this pipeline. We've got our input text, and we'd like to start discovering more about it: things that are useful for knowing how to say the text. Some of them will be directly useful for saying it, and some will just be needed by another stage further down the pipeline. One thing we'd like to know, once we've got words, is their parts of speech. Most importantly for spoken language: are they closed-class words — words from a fixed list that we can't invent new members of, in any language, like the determiners and the articles? Or are they open-class words, where we can invent new ones — content words, names of objects and things in the world? Open-class words are more likely to get prosodic emphasis, for example. So part of speech is not directly going to tell us how to say something, but it might tell us later on what to do with prosody. And we might go on and on: we might think that knowing something about the structure of the text would be useful — that might help with where to put phrasing. We certainly want to know what the pronunciation of each thing is, the sequence of sounds; in some languages that might be a trivial mapping from the spelling, and in English it's certainly not. And we might want to start building some sort of representation of prosody, and so on. So that's what we're trying to do: we're trying to enrich the text with lots of information, and we need to bring in external knowledge sources to do that. And when we've got all that, we can then generate our waveform.

We're going to look at waveform generation in the other parts of the summer school, but we'd better just check we know what we're going to do with all this, so we know what the point of the front end is. Let's say there are two choices. One is to concatenate bits of waveform — we'll find out this afternoon one way of doing that, called unit selection. Or we could build statistical models, which won't directly generate waveforms; they'll drive a vocoder of the sort we saw yesterday. These look like discrete and very different ways of doing things, and it looks like there might be a very clear distinction between the stuff we do to the text and the stuff we do to generate the waveform.
Hopefully, in an hour and a half from now, it'll be quite clear that that's not a clear distinction at all. The responsibility for some of the steps along this pipeline might rest with what we call the front end, or it might rest with what we call the regression model, or the waveform generator. So this two-stage pipeline is already looking a bit dubious.

This is how we might classically describe it — let's call it a three-stage pipeline. First we go to a linguistic specification: all the classical text processing stuff — pronunciations, where phrases fall, things like that. We then — optionally, explicitly or implicitly, depending on the type of system — get from that linguistic structure to something that's a specification of the acoustics: not just how to say it, but what it's going to sound like. That might be the input to a vocoder; it might be a spectral envelope and F0 — some acoustic specification. And from there we actually generate the waveform. What we found out yesterday was how to do that last part really, really well, with very high quality; what we didn't yet know was how to drive such a thing. And there's not a lot of point in having a very high-quality model of the speech signal — whether it's sinusoidal, or aQHM, or however many letters it's got in its name — if we can't drive it, if we can't predict its inputs well. Choosing a vocoder might be constrained by how well we can drive it: how convenient its inputs are, how convenient its representations are.

So we've got this journey from text to speech: our input text goes in here, our output speech comes out here, and there's a rather long and complicated journey to get there. We're going to do it with a sequence of techniques, and exactly how many there are in the pipeline, and where their responsibilities fall, actually varies a little between systems. So let's talk a little about different ways of getting through this journey, so it becomes obvious what the front end is all about. We'll introduce some terminology here; it might or might not be what we'll hear in the unit selection session, and I'll use it again later.

In a classical unit selection system, we take the text and we predict a whole lot of symbolic things, the obvious one being the phonemes. We might predict symbolic representations of prosody, such as prominences and boundaries and so on. And then, given only these symbolic things, we go and choose waveforms from a big inventory, and our choice is based on matches or mismatches in these purely symbolic features. In Paul Taylor's book he introduced some terminology — some acronyms; we don't need to worry too much about them — to talk about this. There's a way of working in which we only ever predict symbolic things in the front end, and everything else happens when we select the waveform. Paul calls that the independent feature formulation, because these symbolic features are treated independently and we just take some weighted sum of them when we try to retrieve things from the acoustic inventory. Some systems — in fact, pretty much all systems, even Festival, which claims to only predict symbolic things — go one step further and predict some acoustic properties as well, before going to choose the waveforms. Paul called that the acoustic space formulation.
So you might imagine that when we're doing unit selection, retrieving bits of waveform from a database, we want them to be of the right sort, and that matching might be based on just their names, their symbols — or it might be based on their acoustic properties. Are they the sort of duration we would like? Have they got the pitch that we would like? Or are we going to hope that the symbolic features will just get us that automatically? And then it's a very small step from there to what is now called a hybrid system. In a hybrid system we go one step further: we predict all the acoustic properties, and in the simplest case we use only acoustic properties to retrieve waveforms from the database. That's going even further down this pipeline. And going even further still, we don't just predict all the acoustic properties and then retrieve a waveform; we use the acoustic properties to actually generate a waveform with another regression model — another continuous prediction — and that's statistical parametric speech synthesis, using hidden Markov models or deep neural networks. So we could really see this as something of a continuum. It's a slightly artificial categorisation, and it's just a question of how we divide up this pipeline and how we assign the responsibility for getting a little further down the journey to some module of the system.

So in a classical unit selection system, which is the one you're going to hear about next, we've got a front end just as I'm about to describe it in the next hour or so. We have a very implicit regression model: we may or may not ever explicitly predict any acoustic properties. The regression might be implicit in retrieving units from the database — by retrieving a unit we get a continuously-valued thing, a waveform, and that's a prediction. It's a very implicit, funny sort of regression model, but it is a regression model. And the waveform generator is almost trivial signal processing: concatenation of pre-recorded waveforms. Like all trivial signal processing, it's really hard to implement well — the theory is very simple; the implementation is all about details.

So that's the picture you should have in mind as we talk through the front end. If you find yourself thinking — does this belong in the front end? is this part of the regression model? is this waveform generation? — don't worry too much what you call it. We just step along the journey, incrementally adding information to the text. Some of the information might be symbolic; eventually some of it might be continuous — durations, say: we might predict the durations of things. And how we eventually generate the waveform might be to retrieve things, or it might be to produce an acoustic specification. How we divide this journey into little steps is entirely up to us. It's a design choice. There are no right and wrong answers; some just work better than others, and that's an empirical question, not a theoretical one.

So let's see, in pictures, the sort of things that typically happen in the front end, and then we'll dive into detail on a few of them. I'm not necessarily going to use all the slides, if you've downloaded them; we'll pick and choose to see how the time goes. Inside the front end, it should be unsurprising that the typical architecture is what we call a pipeline.
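Concretely, you can picture something like the following toy sketch, where each stage consumes the previous stage's output; all the names and the stub implementations are my own illustration, not Festival's actual API:

```python
# A toy front-end pipeline. Each stage consumes the previous stage's
# output, so an early error propagates: nothing downstream will undo
# a bad tokenisation or a bad expansion.

def tokenize(text):
    return text.split()                          # real systems use rules/regexes

def normalize(tokens):
    abbreviations = {"Dr.": ["Doctor"]}          # tiny stand-in look-up table
    return [w for t in tokens for w in abbreviations.get(t, [t])]

def pos_tag(words):
    return [(w, "NN") for w in words]            # stand-in: everything is a noun

def pronounce(tagged):
    return [list(w.lower()) for w, _ in tagged]  # stand-in letter-to-sound

def front_end(text):
    words = normalize(tokenize(text))
    tagged = pos_tag(words)
    return list(zip(tagged, pronounce(tagged)))

print(front_end("Dr. Smith lives here"))
```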
It's a very naive architecture: just a concatenation, a sequence of processes, many of which require the output of the previous process to do their job. We need to predict part of speech before we can look words up in the lexicon — and we'll see why. In general, we need to predict certain things before other things. That's kind of linguistically obvious: there are hierarchies of linguistic information, and some things depend on others. From an engineering point of view, though, this is a terrible architecture, because we get a cascade of errors. None of these processes is perfect, and any error we make early on propagates through the modules. Take tokenisation: if we break the text into the wrong parts — if we cut something that should have been one word into two — it's very unlikely that anything further down the pipeline will think to glue the pieces back together and recover from that error. So we'll make lots of unrecoverable errors. Part-of-speech tag errors, too: if we tag something as a noun when it should have been a verb, and we go and look it up in the dictionary, and there happens to be a difference in pronunciation, we're probably not going to recover from that. We're making hard decisions all the way down. So that's a weakness of this kind of architecture, and you can immediately start thinking there might be better architectures than this one — which is what Festival uses, and, I'm going to guess, what a lot of commercial systems use as well.

The other weakness of this architecture is that each of those steps is expensive to make: expensive in human effort, expensive in knowledge required, and expensive in data required. We need to know a lot, and we need to be expert, to build all of those modules. We might even need to be very expert just to annotate the data for machine learning, if we're using machine learning. Sometimes we can get naive people to label things, but very often we need experts — labelling prosody is something you need to be trained to do. So that makes it expensive, and now you understand why even Nuance's language catalogue is only around 40 languages, and why Google say, very ambitiously, that they've got a huge target of 200 languages. That's 200 out of 6000. And the rest of those will be very, very expensive to do, because we don't know much about the language, we don't have a lot of data, and we might not even be able to find someone who's linguistically expert.

So, to summarise: the architecture is a chain of processes, each of which can be rather expensive to make in various ways — in time, in money, in data, in expertise. Let's look at a few of those processes to see why that is: what sort of data they need, and what sort of knowledge you need, before you can build any of these modules. Imagine you'd like to make a system for a new language — what would you have to go and do?

A typical process is a thing called a part-of-speech tagger. Who doesn't know what a part of speech is? Don't be afraid — OK, we're all happy with this concept. It's going to affect our look-up in the lexicon, so that we get a matching part of speech for homographs: spelled the same, different part of speech, different pronunciation. This is well-established technology — pretty much a solved problem in NLP, if you've got the data. And the data are simply large amounts of text that have been tagged. That's boring to produce, and you need a lot — millions of words — to make a really good part-of-speech tagger. So we need to annotate actual text with part of speech.
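Just to see a tagger in action before we talk about the data: here's a minimal sketch using NLTK's off-the-shelf tagger — my choice for illustration, not something Festival uses — assuming NLTK and its pretrained models can be downloaded:

```python
import nltk

# One-time model downloads (resource names may vary slightly between NLTK versions).
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")

tokens = nltk.word_tokenize("The project records their lives.")
print(nltk.pos_tag(tokens))
# Something like: [('The', 'DT'), ('project', 'NN'), ('records', 'VBZ'),
#                  ('their', 'PRP$'), ('lives', 'NNS'), ('.', '.')]
# Tagging "lives" as a noun (NNS) rather than a verb (VBZ) is exactly
# the decision the lexicon lookup will need later.
```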
The good news is we don't need it read out loud — we just need text. The bad news is that when we do that, we tend to choose text that's completely inappropriate. We'll pick newspaper text because it's convenient — we'll pick the Wall Street Journal or something really stupid — and that'll be nothing like the speech we're ever going to say with our synthesiser. So we'll already have a mismatch. Nevertheless, these part-of-speech taggers are pretty accurate and are not really a limiting factor.

Another process we might find further down the line is finding the pronunciations of words. An obvious way to do that is to look them up in a pronunciation dictionary. We'll always find that there are words not in the dictionary, and we'll have to do letter-to-sound conversion for those. We'll see a bit later that there are classical methods for converting letters to sounds; we'll look at a really simple one, and I'll explain later why we're doing the simple one. How would you write a dictionary? A lexicographer needs to be a native speaker, needs to be linguistically sophisticated, needs to choose the set of symbols for pronunciations — needs to choose the phonemes, which might be non-obvious — and then just needs to sit down and write a dictionary. That's kind of tedious, and we might write 20,000 entries, then go and synthesise, and find there's a word we forgot. And there will always be a word we forgot, because there are always new words. So we need to extrapolate from that dictionary: we need to learn what are often called letter-to-sound rules. They're not really rules; think of it as a statistical model — a classifier predicting symbols. The training data for that classifier is just the pronunciation dictionary.

For languages where we don't need a dictionary, we do literally have to sit down and write rules, and for languages like Spanish we can probably do pretty well with rules — fairly comprehensive ones — until we get a loan word or an exception, and then we'll have to go and write a dictionary anyway. So how do we train letter-to-sound? We take all the words in our dictionary and find some correspondence between the letters and the sounds — some sort of alignment. That might be slightly expert-driven, a somewhat ad hoc alignment. Then we reformat our data so it looks like nice, neat machine learning data, and we learn a sequence-to-sequence model over the letters. It might learn to map from this letter to this sound, and in doing so it will look at the context in which the letter occurred. So we build a classifier whose inputs are the letter and its context, and whose output is the sound.

A question from the audience: would Nuance's catalogue, or the catalogue Google aspires to, include letter-to-sound rules as such, or just the annotations — do all those languages need letter-to-sound rules? So: in a big catalogue of many languages that some company or some researcher has constructed, each and every language would involve building all of these modules — which is why such catalogues are rather limited in size. For each language, we'd need to sit down and either write the rules by hand, or create the dictionary and then use machine learning to extrapolate from it. I would say that every commercial system does something for letter-to-sound.
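Putting those two pieces together — dictionary first, letter-to-sound as the fallback — might look like this toy sketch; the entries and the phone symbols are invented for illustration:

```python
# A toy pronunciation module: dictionary lookup with a letter-to-sound
# fallback. The entries and phone symbols are made up for illustration.
LEXICON = {
    "speech": ["s", "p", "iy", "ch"],
    "zone":   ["z", "ow", "n"],
}

def letter_to_sound(word):
    # Trivial fallback: treat each letter as if it were a sound.
    # A real system would use a trained classifier here.
    return list(word.lower())

def pronounce(word):
    return LEXICON.get(word.lower(), letter_to_sound(word))

print(pronounce("speech"))   # found in the dictionary
print(pronounce("snark"))    # unseen word -> letter-to-sound fallback
```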
You may occasionally be able to treat the letters as if they were sounds and just do a one-to-one mapping — a trivial letter-to-sound model. And when I say catalogue, I mean the list of languages they have available for sale: actual working systems.

Let's move on in this pipeline. It gets worse. Some of the modules further down are not just doing classifications for which we can craft the data; they're doing regression onto acoustic properties, and therefore we're going to need speech to learn them. If we're doing classical supervised machine learning — the thing that works best — we need annotated speech to learn some of these steps. An example is predicting where to put phrase breaks. Phrase breaks are not just signalled by pauses; they're signalled by drops in F0, by resets, and by duration effects around the boundary. Here's how it's done in Festival. You take some speech — which, fortunately, you find someone else has already annotated, but which isn't very large — annotated with the thing we're trying to predict: prosodic phrase breaks. I still don't have audio playback here, so imagine someone saying this sentence; a trained annotator will mark that there was a prosodic phrase break there. The output of such a process is a data set — probably not that big; you'd be lucky if it's an hour or two of speech — annotated with where the breaks were, and then we use another sequence model to predict breaks at run time, at synthesis time. Typically we won't predict from the words themselves: words are too sparse and there will be unseen items, so we collapse them to parts of speech — to categories — and learn a sequence model over those. And you can see that for every sentence we might only have one positive example of a break. So even from an hour of speech — let's imagine 500 or 1000 sentences — we're not going to have a whole lot of positive examples of phrase breaks to learn this model from. This is tough for machine learning. This is very sparse data.

That was a little whistle-stop tour of the sorts of things that happen. Let's go into a bit of detail on some of them — not all of them. A question: do we annotate more than phrase breaks? Yes — we'll see in a minute, or in quite a few minutes, that we could choose to annotate ToBI symbols, a full prosodic annotation, if we wished to do so. We would need to manually annotate the training data in order to train a classifier to do that, and indeed that is what Festival does.

One thing from the organisers: since we make recordings of these sessions, it is nice for those who are not here to hear not just the answer but also the question. I'll try to repeat the questions. OK, let's try that — I can't repeat that last question anyway.

I'm going to take the Festival speech synthesis system as a canonical, boring, traditional, conventional example of a text-to-speech system. It's the one you're going to use in the lab. I couldn't possibly say it's state-of-the-art — I've already said the front end hasn't been touched for a long time — but I think it's still pretty typical of what is done, and if you wanted a commercial system like this, you'd just add more rules, more exceptions, longer dictionaries, longer lists of acronyms, and clean up all of those edge cases.
Maybe you'd use smarter machine learning models, a fancier part-of-speech tagger, and so on, but the concept is going to be very similar. So that's the list of things Festival does; let's start going down the pipeline and see how far we get before the first break.

The first thing — what Paul Taylor calls, in his book, text decoding — is actually to find the words, because text isn't made of words; text is made of characters. So we might have characters like these. Here's a sentence. It's not made of words; it's made of some things that are very obviously words — that's a word; I'm assuming this is a word, but it might not be in my dictionary, so I'd have to check — and some things that are something else: that's not a word, and that's not a word. What do I mean by a word? I mean something we might plausibly find in a pronunciation dictionary. Imagine a really big pronunciation dictionary, one of those 20-volume ones on the shelves, in which someone's listed every word we can think of. If we think we could look a thing up directly in there, then we can call it a word. This thing here is not going to be in even the Oxford English Dictionary — it would be ridiculous to start listing all possible numbers — so we're going to have to turn things like that into words.

So we need to find the words, and before we can even do that, we need to break the text into chunks that we can process further, and that's called tokenisation. Tokenisation is typically done with some handcrafted machinery — something like a regular expression, or some rules — and it depends a lot on the language. In English, white space is a pretty good clue. It's not the only clue, because we can see punctuation attached to a word without white space, so we need to know what counts as punctuation and what doesn't, and that's language-specific: there are punctuation symbols in Spanish that don't occur in English, like upside-down question marks, and if those aren't in our list of punctuation they'll get treated like ordinary characters and things will go wrong. But this is pretty straightforward, and we can do a decent job of it.

Once we've chopped the text into tokens, we go through each token one at a time and decide if it's a word or not, and if it's not a word, we do something about it: turn it into words. So we need to detect whether things are words. The classical way of doing that is to write long lists of all the things we know about — all the acronyms, all the abbreviations, and so on — and just have a big look-up table that expands all the ones we've seen before. For the ones we haven't seen before, we might do something like regular expressions: to detect numbers, we can easily write a pattern that matches a sequence of digits. The key problem is that these things are ambiguous. You might have detected that a token needs some further processing, but what you then do to it — what it expands to — depends on the context in which you find it. We can't do everything with a look-up table, because we wouldn't know whether 'Dr' is 'Doctor' or 'Drive'. And likewise numbers: we could sit down and write, by hand, lots of expressions that match currency amounts, years, dates, and so on — and this is kind of what Festival does — and when something appears in a format we didn't think of when we wrote those rules, it gets skipped by them and handled by some default rule that maybe just reads the digits out as individual words, as a fallback.
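A minimal sketch of that kind of machinery — the tokenisation pattern, the look-up table, and the digit fallback are all deliberately naive and purely illustrative:

```python
import re

ABBREVIATIONS = {"Dr": ["Doctor"]}   # ambiguous: context would be needed for "Drive"!
DIGITS = "zero one two three four five six seven eight nine".split()

def tokenize(text):
    # Split into word-like chunks and peel punctuation off as separate tokens.
    return re.findall(r"\w+|[^\w\s]", text)

def expand(token):
    if token in ABBREVIATIONS:
        return ABBREVIATIONS[token]             # look-up table for listed items
    if re.fullmatch(r"\d+", token):
        return [DIGITS[int(d)] for d in token]  # fallback: read the digits out
    return [token]                              # anything else passes through

sentence = "Dr Smith lives at 221 Baker St."
print([w for t in tokenize(sentence) for w in expand(t)])
# -> ['Doctor', 'Smith', 'lives', 'at', 'two', 'two', 'one', 'Baker', 'St', '.']
```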
It's hard to say anything theoretical about all of that; it's a bit ad hoc. That's because language is messy. There's no reason to think there will be some beautiful theory of text processing: text is an evolved way of communicating; it doesn't necessarily obey rules; people are creative and productive; they invent new things. So there's no reason to think we should be able to write one elegant model that does all of text processing, and it's perfectly reasonable that it will be a sequence of kind-of-messy things.

We can state a general principle, though, for how we might go about processing the text, breaking it down into stages. We might first detect that things are not standard words — we might look them up in the dictionary, and if we don't find them, declare them non-standard; that would be a very simple way of detecting them. Once we've done that, we might ask what sort of non-standard thing each one is, and we might have a set of categories for that: years, currency amounts, dates, times, and then some catch-all general category that just says "read it out as a sequence of words as best you can". And then, for each of those separate classes, some way of expanding it. There are very specific ways of expanding: if something looks like a year, there are very specific rules about how to say that in each language. We can just write rules — this is a case where rules work fine; we don't need machine learning for that.

Exactly how we detect, how we classify, and how we expand is up to us, and there are lots of classical and simple methods for each. Detection might be some simple rules; classification might be a very simple machine learning classifier such as a decision tree; expansion could be look-up tables or transducers or anything else you like. Typically these will be very simple forms of machine learning, because with more sophisticated forms — we might be tempted to throw a neural net at everything — we might find that we just don't have enough data to get good performance; they're not as efficient with the data. We might find that a regular expression, a transducer, or a handcrafted decision tree actually makes better use of the very small amount of data we've got.

So now we've spelled everything out as words. It's going to be somewhat language-specific, it's going to be tedious to build, and we're not going to get many papers out of doing it, but it's got to be done. Then we've got to start our journey towards how to say those words: some of them we can look up in a dictionary, and if we don't find them, we can convert the spelling into a pronunciation some other way, and then we can start saying them — start adding more acoustic information. So let's talk about some of the steps that might happen along that journey; some of them are much more helpful in some languages than in others.

I'm not going to say much about morphology. Morphology would be breaking words down into their smaller constituent parts, in the hope that the pronunciation of a big complicated thing is made from the pronunciations of its smaller parts, concatenated, possibly with some context-dependent rules. It doesn't help very much in English on the analysis side: given an arbitrary word that's not in our dictionary, automatically decomposing it morphologically is error-prone, and not necessarily going to be super helpful.
In some other language, like Finnish, where there's an impossibly large number of possible word forms, it might be essential to try to break words down. But morphology is useful in English in actually creating the dictionary in the first place: the lexicographer, when he or she wrote those 20,000 entries, might not have written them all by hand, but might have written a lot of base forms and some simple rules — let's just make lots of plurals by adding 's' wherever possible, or 'es' when necessary. So we can generate forms in the lexicon by rule; that's where morphology is going to work. Decomposing unseen words is much harder.

We've already talked a little about part-of-speech tagging, so I'm not going to say much more about that. Let's go straight on to looking things up in the dictionary, and then worry about what to do when we don't find them there. So: the pronunciation of words — what to do about turning a string of letters into a string of pronunciation symbols, phonemes typically. For a language like English, it's important to have a dictionary. Here's some annoying misuse of terminology: lexicon, dictionary — people use them interchangeably, and I don't know what the difference is anymore. Dictionary is arguably more correct, because it says how to say things; a lexicon is more just a list of words. Nevertheless, you'll find both used in the literature.

A dictionary for text-to-speech needs to be a bit more sophisticated than one for speech recognition. In speech recognition, all we care about are spellings and pronunciations, and we use the dictionary as part of the generative model of speech recognition, to go from words to subword units. In synthesis we need a little more than that, because we're not going to use it in that way: we're going to have spellings, we're going to look up the phonemes, and it's more important here that we get the right phonemes. If we've got two identical spellings, we have to make a hard choice between the two entries; we can't make some probabilistic choice and try saying it both ways to see what happens. Speech isn't like that. These are hard decisions: we need to pick one and stick to it. We can't say something halfway between two pronunciations, and we can't say a distribution over them. So we need to choose between two homographs, and in this case it's easy, because one's a noun form and one's a verb form: we've got 'lives', the plural of 'life', and 'lives', as in 'to live'. So now we know why part-of-speech tagging is important: to go and look up the right thing in the dictionary.

But there will be words that have the same spelling and the same part of speech and still have different pronunciations. There are examples of this in the textbooks — lots and lots of them — and the favourite one in all the textbooks is this word here, 'bass'. It's kind of a silly one, because it's quite a low-frequency word: it's either bass, the musical term, or bass, the fish. I've never had to synthesise 'bass' meaning the fish, except when testing Festival. Nevertheless, they're both nouns, and we couldn't discriminate between them with the part-of-speech tagger. So we need to do something called word sense disambiguation: we essentially just look at the words in the context. If, in the rest of the sentence, there are some music sort of words co-located, we'll pick the musical sense; if there are some fishing sort of words, we'll pick the fishy sense. And this is a standard natural language processing problem — word sense disambiguation.
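In its simplest form, that could be little more than counting topical words in the surrounding context. A toy sketch, with invented cue-word lists (real systems train a classifier over much richer context features):

```python
# Toy word-sense disambiguation for "bass": count cue words in the
# sentence for each sense. The cue lists are invented for illustration.
SENSES = {
    "bass_music": {"guitar", "band", "played", "drums", "song"},
    "bass_fish":  {"caught", "river", "fishing", "lake", "rod"},
}

def disambiguate(sentence):
    context = set(sentence.lower().split())
    # Pick the sense whose cue words overlap most with the context.
    return max(SENSES, key=lambda sense: len(SENSES[sense] & context))

print(disambiguate("He played bass in a band"))      # -> bass_music
print(disambiguate("He caught a bass in the lake"))  # -> bass_fish
```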
We can do it very straightforwardly and fairly accurately. So, looking things up in the dictionary is fine if we get the part-of-speech tag right; we might need to do this clever word sense disambiguation first, and the lexicon would have to contain both senses, with some extra annotation saying this is the fishy sort of B-A-S-S and this is the musical sort — an extra tag, so we can look the right one up from our word sense disambiguator. That's a completely solvable, straightforward natural language processing problem. It could be hard for unseen words, though: if someone invents a new word, and then uses the same spelling for two newly invented words, we'll have a very hard time discriminating between them.

But we need more than this string of pronunciation symbols. We also need some structure within the word — structure of the sounds, not structure of the spelling; not the morphology, but the syllable structure of the sound. This is going to be absolutely essential when we come to deciding how to pronounce the word, and that's true whether we're doing unit selection — waveform concatenation — or a statistical model. In all of those cases, the structure of the syllables will be very useful, whether in choosing appropriate units and where to make joins, or as predictive features for our regression onto the acoustic parameters of a vocoder. There are various reasons for that; the most obvious is that syllable structure helps us decide where to put prominences — where to put prosodic events.

So in the dictionary itself we might already start assigning some prosodic features: there are prosodic features that belong to words themselves, and they tell the difference between two words. In tone languages, the pitch features called tones, which we could also mark up in the dictionary, discriminate between words. Here we've got two words with the same spelling and different parts of speech — and not only is the phoneme sequence different, the pattern of emphasis is different. We call this lexical stress: lexical, because we can mark it in the lexicon, and it's specific to the particular word. In Festival, everything's marked up in this horrible bracketed notation — that's just the language it uses — and we mark stress with ones and lack of stress with zeros. So we might have PREsent and preSENT as these different word forms, and getting that wrong is going to be a very salient error to the listener: it's a different word, with a different meaning.

A question: "I'm a little bit confused about the one and zero syllable stress." So, the question is about the ones and zeros. I'm going to call this lexical stress, and it's a feature that attaches to the syllables of a word. In every word there's going to be one syllable that has the primary lexical stress, and in a long polysyllabic word there might be another syllable that has some secondary stress — you might even imagine some sort of tertiary stress: in a sufficiently long word there might be a third place that gets a little more prominent as well. But the key thing is to get the primary lexical stress right, and we can mark it in the lexicon. It doesn't mean we'll get some enormous pitch movement at synthesis time; it means it's the sort of place where we might do something with prosody. So what does lexical stress do for us? It says, when we come later to predict prosody, that this syllable is a very likely candidate for doing something interesting with — making it longer, louder, moving the pitch around — and this other syllable is a very unlikely place to do anything. You probably don't want to make that syllable longer and louder with pitch movement; you might want to do exactly the opposite — make it shorter, quieter, and even make the acoustics somewhat reduced. This one's already got a reduced vowel in it.
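As a concrete picture of what such a lexicon entry holds — spelling, part of speech, and syllables each marked 1 (stressed) or 0 (unstressed) — here's the same idea as a Python structure; the phone symbols are rough and this is not Festival's exact bracketed syntax:

```python
# Each entry: (spelling, part of speech, [(phones_of_syllable, stress), ...]).
# Phone symbols are illustrative approximations, not a real phone set.
LEX = [
    ("present", "n", [(["p", "r", "eh", "z"], 1), (["ax", "n", "t"], 0)]),
    ("present", "v", [(["p", "r", "ax", "z"], 0), (["eh", "n", "t"], 1)]),
]

def lookup(word, pos):
    for spelling, p, syllables in LEX:
        if spelling == word and p == pos:
            return syllables
    return None  # not found: fall back to letter-to-sound

print(lookup("present", "v"))  # the verb: stress on the second syllable
```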
Phrasal stress — how does this work with phrasal stress? So, the question: syllable stress, or lexical stress, is different from phrasal stress. Let's think of lexical stress as marking possible locations where something prosodically interesting might happen. It's not guaranteed to happen; we don't know yet — we're only talking about words in the dictionary, in isolation, in their canonical form. Things are going to happen to them when we start putting them in context: for example, whether a word is the focus of the sentence or not, and which other words around it are going to carry prominences or saliences. So think of these as candidate sites, ready to receive some prosodic event. They may or may not get one; the words are going to have to compete with each other to see who gets them, because in any one phrase, or any one sentence, we can only have so many prominent or salient syllables. If we put pitch accents on all possible lexically stressed syllables, the speech will sound hyper-articulated and a bit over-excited and unnatural. So some of them won't get these events, or will get smaller events — we'll come to that.

Another question: "How do you actually split a word into a bunch of syllables, and are there multiple ways of splitting — like, in the case of the first 'present', is it p-r-e-z plus e-n-t, or p-r-e plus z-e-n-t?" Yeah, excellent question. So the question is: how do we syllabify the words? First, in the lexicon, we do it by hand, because we have knowledge — we are experts; we have linguistic knowledge of the language — so we write down the correct syllabification. For unknown words — words whose pronunciation we're predicting with letter-to-sound, as we'll see shortly — we also have to predict the syllabification, so we need some statistical predictive model of syllabification. Can we use rules? Yes — this model of syllabification could be a fancy statistical model, or it could be just some rules. And indeed, for the words in the dictionary, we might have made some explicit or implicit choices when we manually syllabified them, because, as the question implied, there may be consonants where it's not entirely obvious which syllable they belong to. Is it PRE-sent or PRES-ent? Both of those seem plausible and reasonable to me as a native speaker. There are various principles and rule-based systems for deciding, one of which is called maximal onset, which says: put as much as possible at the beginning of syllables. But there are other schemes as well.

Another question: "This is for English, and I guess it's maybe in part following linguistic tradition, to think about stress accent as either primary, secondary, or off. When you have a Chinese lexicon, I'm assuming you need lexical tone in here as well, and then you probably again follow the traditional grammarians by putting in one of four possible tones. I'm wondering whether there are any anecdotes or experiences in TTS where the linguistic background story about stress or tone, or whatever it is that goes in this lexicon, has proven to be simply insufficient, and you need something else to support inference of the kind all of these people are talking about — in order to actually find the trajectory and know the duration and all of that. Or do people just generally follow the linguists?"
So, in terms of what you manually mark up in the lexicon when you construct it, I suspect people stick to well-established knowledge and conventions. I think you're implying that the knowledge that's present in the dictionary may or may not be acoustically accurate, and that's a fair question. I think in statistical parametric systems we do have some chance of recovering from that: if we systematically see disagreements between these annotations and the way our speaker spoke, and there's enough data, and we can somehow predict from context what the difference between the two is, then we can still get the correct pattern of stress. The same probably applies to tone. What's problematic is that there's probably simply not enough data in our speech database to find those sorts of systematic discrepancies between what we marked up in the dictionary and what we get in the speaker's rendition of the text. I don't have a lot more to say about that, other than that it's one of these things people haven't looked at much recently, and don't tend to evaluate in these terms, so as to detect whether it's a particular problem. I'm going to hypothesise that it's not a particular problem, because the intelligibility of current systems is extremely good, and errors here would affect intelligibility to some extent.

As we get towards the first break, let's start talking about going from letters to sounds. For some languages — we'll always use Spanish as the canonical example — it's phonetically beautiful: five vowels, very neat and tidy, you'd think. It's not quite so neat and tidy. The relationship between spelling and sound is very regular; as my wife, who actually teaches Spanish, says: you say every letter — except you don't say H, except in some cases, except in some parts of Spain — and if you go to Seville they just chop the ends off things, and so on. So it looks neat and tidy in its canonical, perfect form, but it's not so with every speaker. Still, in that relatively simple case, we could actually sit and write rules by hand, of a certain sort of form.

So let's try it for English, because that's the language we have in common. We sit down and we'd like to write some rules for the letter C. C appears in lots of different contexts and has lots of different pronunciations: it can be a K sound, an S sound, a CH sound, or it can just be silent — it just depends. And we could start thinking: right, if it's at the beginning of a word, with nothing before it, and it has an I after it, then it will be an S sound. So we can start writing these rules down, and this works very well if things are regular. It looks like it's going to start working for English, but it won't, because we'll need a very large set of rules, and as soon as we have a large set of rules we get inconsistencies between the rules — we get conflicts. Rule-based approaches are great if the set of rules is small and the order in which you apply them is obvious. If the order is not obvious — if changing the order changes the outcome, and there are nasty interactions, where some rules need to know the decision of a previous rule — then before we know it we've got some horrifically big bowl of spaghetti, and the rules become impossible to maintain: fixing an error in one place creates two errors in another place. For that reason, we tend not to sit down and write these rules for English, and probably not for some other languages where the same applies.
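Just to see where the spaghetti comes from, here's what the first few hand-written rules for the letter C might look like as code; the contexts and outputs are grossly simplified for illustration:

```python
# A few hand-written letter-to-sound rules for the letter "c", applied in
# priority order. Grossly simplified: a realistic rule set for English
# would run to hundreds of interacting, order-sensitive rules.
def pronounce_c(word, i):
    before = word[i - 1] if i > 0 else "#"            # "#" marks a word boundary
    after = word[i + 1] if i + 1 < len(word) else "#"
    if after == "h":
        return "ch"    # "church" ... but already wrong for "chemist"!
    if after in "iey":
        return "s"     # "city", "cell", "cycle"
    return "k"         # default: "cat", "cream"

for w in ["city", "cat", "church", "chemist"]:
    print(w, "->", pronounce_c(w, w.index("c")))
```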
I was going to finish off by saying one little thing about letter-to-sound — what we do after we've looked things up — and then we're going to rewind and look at one very simple form of machine learning, which you might be a little familiar with, but which is so important that we'll go over it in a bit more detail: classification and regression trees. First of all, we use classification and regression trees for everything — it's the neural net of the early 1990s: if we don't know what to do, we throw a classification or regression tree at it, and it works great, up to a point, if there's data.

Let's just say what happens after letter-to-sound, to wrap up that part. Pronunciation can't be entirely determined on a word-by-word basis; there are things that happen when we put words in sequence. So there are certain things we can only do after we've looked words up in the lexicon or the dictionary, and they're called post-lexical effects. In the vast majority of systems — all systems I'm familiar with — the number of post-lexical effects we attempt to deal with is very, very small, and we just pick the most important ones. In French it might be liaison; in some accents of English it might be the linking R, where we need to make an R sound to link two words together; it might be devoicing things at the ends of words in certain contexts; and so on. A very, very small number of rules. In Festival this set is small, and some of it is just hardwired into the code, because those effects are so important, so predictive, so useful, that they're just always there and always on. Some might be specific to the speaker, in which case we might need to turn them on or off depending on the accent or the language we're dealing with. This will be completely language-specific, but all languages do things to words as we string them together — no language sounds like isolated words concatenated.

Right. We're going to watch a five-minute video and then take a break, and then we'll come back and talk about the video, so you have some time to think about it. We do need audio for it — yes, we do. Actually, let's take the break first, and then after the break we'll watch the video with sound. That's a good idea. OK, sure. Thank you very much.