 I'm Sandy Ritchie. I'm a linguist at Google, in the speech department. My main responsibility is working on speech recognition for new languages which Google currently doesn't support. So a lot of my job is finding out basic things about the phonology, certain things about the writing system, how they say things like numbers, in order to create speech recognition systems from scratch for new languages. Today I thought it would be interesting to complement the talks about Persephone with another aspect of speech recognition, one which is complementary and would work very well in tandem with Persephone in certain situations. It's got a technical title: I'm going to talk about grapheme-to-phoneme conversion, which is the process of changing spellings into phonemic transcriptions, or pronunciations, using finite state transducers, which are a type of, let's say, machine grammar which you can use to convert from spellings to pronunciations. So I'll just give a very brief overview of how automatic speech recognition — ASR, I'll call it ASR a lot — and text-to-speech, TTS, work. Then I'll talk a bit more about how grapheme-to-phoneme or G2P conversion works. Then I'll talk a bit about what finite state transducers or FSTs are. And I'm going to talk a bit more about Tibetan, why it's a super interesting case, and why it's been a really interesting challenge for us to work on Tibetan G2P. So how does ASR work? Basically — and this is abstracting a lot — at a very high level, it's audio in and well-formed, formatted transcriptions out. In most state-of-the-art ASR systems, you have three sub-components. (The text on this slide doesn't match the audio — I can't read waveforms, I just found the picture on the internet.) So there are three sub-components of an ASR system. You have the acoustic model, which will get you from the sound waves to phonemic transcriptions. 
This is Persephone within this system — that's what Persephone is doing: taking you from audio to phonemic transcriptions. Then you have the pronunciation model, which takes your phonemic transcriptions and turns them into possible spellings of that phonemic transcription. Then you have a language model, which disambiguates between possible hypotheses, possible variant transcriptions. If I say "your book", the sound gets turned into a transcription, "your" and "book". Then the pronunciation model will map "your" onto either "your" or "you're", but it doesn't know which one — it will just give you both. The language model will choose between "your" and "you're", depending on its context within the sentence. It's called an n-gram model: it's looking at the words preceding and following, to any n degree, in order to find out which is the best spelling hypothesis. The previous talks were about the acoustic model — that's what Persephone is. I'm going to talk mostly about the pronunciation model today, and maybe a little about the language model at the end. Synthesis, or TTS, text-to-speech, is basically the same thing in reverse. You have text in and audio out. So you take a well-formatted string like "12 minutes". The first thing to do in text-to-speech is to change things like the digits 1-2 into the word "twelve" — a process called text normalisation. After normalisation you have, let's say, the written form of the spoken words, and that's turned into phonemic transcriptions. Then you have a synthesis model which will turn that phonemic transcription back into audio, so it will speak back to you. So if you say something to your Google Home or your Alexa — I'm not going to promote Google Home over Alexa — you can say "set my alarm", some NLU/NLP magic happens in the middle, which I don't really know about, and then it will say "I've set your alarm". Okay, so the pronunciation model in particular. 
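Before zooming in, the three-stage pipeline just described can be sketched as a chain of functions. This is purely illustrative: every "model" below is a hard-wired stand-in I've made up for the "your book" example, nothing like a real component.

```python
import itertools

def acoustic_model(audio):
    # Waveform -> phonemic transcription (the Persephone part of the
    # picture). Hard-wired stand-in for the "your book" example.
    return ["jOr", "bUk"]

def pronunciation_model(phoneme_words):
    # Phonemes -> every candidate spelling; it can't choose between
    # homophones like "your" and "you're", so it returns both.
    reverse_lexicon = {"jOr": ["your", "you're"], "bUk": ["book"]}
    return [reverse_lexicon[p] for p in phoneme_words]

def language_model(candidates):
    # Disambiguate using context: a made-up n-gram score table.
    scores = {("your", "book"): 0.9, ("you're", "book"): 0.1}
    best = max(itertools.product(*candidates),
               key=lambda h: scores.get(h, 0.0))
    return " ".join(best)

def asr(audio):
    # Chain the three sub-components together.
    return language_model(pronunciation_model(acoustic_model(audio)))

print(asr(None))  # -> your book
```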
The pronunciation model is generally, in most production systems, made up of two components: a lexicon and a grapheme-to-phoneme model, or G2P. The lexicon is very high quality, hand-curated phonemic transcriptions of spelled words in the language. These are made by linguistic experts who are trained in phonemic transcription, and they will produce really good quality transcriptions for words like "colonel", which is pronounced "kernel". You wouldn't guess it from the spelling, and that's why we need a lexicon: it is almost impossible to guess from the spelling that that word is pronounced "kernel". The pros of a lexicon are that it's really high quality and very reliable — you can almost certainly rely on it to get the pronunciation right. The cons are that it's static, so new words, neologisms, slang, and non-standard words are usually not included in the lexicon. And it's extremely expensive, because you have to pay people with linguistic training to hand-transcribe every single word. So to complement the lexicon, we also have another component called a G2P, a grapheme-to-phoneme converter. That is basically a rule-based or machine-learned model which will produce pronunciations — phonemic transcriptions — for words that aren't in the lexicon. I don't know if anyone knows this word; I found it on the internet. Apparently it means flirting in London slang. I didn't know either. It's unlikely to be in the lexicon. It's the kind of thing someone would say, and where the lexicon falls down and doesn't have an entry for that word, the G2P will kick in, take over, and try to produce some kind of useful phonemic transcription for the spelling. So the pros of the G2P are that it's very dynamic — it can produce a pronunciation for any word — and it's quite cheap. But it's unpredictable, and the results are, let's say, variable. Depending on the script, it can be really, really bad. 
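The lexicon-plus-G2P division of labour can be sketched like this — the entries and the fallback are toy stand-ins (the transcription notation is made up, and a real fallback would be the rule-based or machine-learned model just described):

```python
# Hand-curated lexicon: high quality and reliable, but static and
# expensive. Transcriptions here are in an ad-hoc made-up notation.
LEXICON = {
    "colonel": "k er n ax l",   # impossible to guess from the spelling
    "knight":  "n ay t",
}

def fallback_g2p(word):
    # Toy stand-in for the G2P: one letter -> one "phoneme".
    # Dynamic and cheap, but unpredictable -- fine for "dog",
    # terrible for "colonel".
    return " ".join(word)

def pronounce(word):
    # Lexicon first; where it falls down, the G2P kicks in.
    return LEXICON.get(word, fallback_g2p(word))

print(pronounce("colonel"))  # -> k er n ax l  (from the lexicon)
print(pronounce("dog"))      # -> d o g        (from the G2P fallback)
```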
So really, in a production system, you need both. They work in tandem: the lexicon and the G2P complement each other. Where the lexicon falls down, the G2P takes over, and vice versa. So let's talk a little more about spelling and pronunciation. This is an obvious point to make, but the relationship between spelling and pronunciation is not always predictable. Even in languages with very shallow orthographies — and by shallow, I mean that the orthography matches quite closely the phonemic realisation of the language — there are going to be context-dependent rules. Take this really simple example from Italian. The Italian grapheme n is pronounced as the phoneme /n/ in "naso", which means nose; in "finché", which means until; and in "mano", which means hand. So — initial, final, medial — it must be /n/ everywhere, right? It is /n/. It just is /n/. So maybe we can just do a blanket replacement, grapheme n to phoneme /n/, and we're done. And then you get to cases like this, where it's part of a digraph, gn. If you try to just map n to /n/ regardless of context, your system is going to produce /agnello/, which is incorrect — it's pronounced /aɲello/; agnello means "the lamb". So given that simple example, with a very easy, shallow orthography, think about English words like "yacht", or "colonel", or "knight". I mean, you're lost. What does the ch represent in the spelling of "yacht" and its pronunciation? I don't know. There must be some L2 learners of English in the room — any spellings you found particularly annoying to remember? O-N-E? Yeah — "done" and "gone". Right, exactly: the same spelling, two completely different pronunciations. Or o-u-g-h, right? "Through", "bough", "cough". "Persephone"? Okay, yeah, exactly. English orthography is etymological, right? 
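The agnello trap comes down to rule ordering: handle the digraph before the blanket single-letter mapping. A plain-Python sketch, with ɲ standing in for the palatal nasal:

```python
def naive_g2p(word):
    # Blanket mapping: every written n becomes /n/. Works for
    # naso, finché, mano -- but treats the gn digraph as g + n.
    return word.replace("n", "n")  # identity here, so gn comes out /gn/

def digraph_aware_g2p(word):
    # Rewrite the digraph FIRST, then the leftover single letters.
    word = word.replace("gn", "ɲ")  # agnello -> aɲello
    return word

print(naive_g2p("agnello"))          # -> agnello  (read as /agnello/: wrong)
print(digraph_aware_g2p("agnello"))  # -> aɲello   (correct)
```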
You need to know about Greek orthography to be able to give pronunciations to English words. Yeah. Okay, so: grapheme-to-phoneme conversion. There are various ways to do it; it's not one-size-fits-all. The most common is to use a machine learning model, and for that you need lots of data — you basically already need a pre-existing pronunciation lexicon. A lexicon is just a huge list of mappings between spellings and phonemic transcriptions, and if you have that, then you can train a G2P model. It's just a sequence-to-sequence problem: the model sees that this spelling corresponds to this phonemic transcription in all these different contexts, and it learns to produce the correct phonemic transcription for any word. But of course, in many situations, we don't have that yet — in fact, that's exactly what we need to make in order to get a system going. So it's not viable for many, probably even most, languages in the world. I mean, there are 7,000 languages in the world; at Google, I think, for ASR, we support something like 120. That's not even 2% — very, very few. Though if you think about it in terms of the number of speakers rather than the number of languages, we're already covering quite a large part of the population just with those 120 languages. So, as an alternative to machine learning models, you can also use a rule-based approach. This is super cheap: you just need one linguist to sit down, study the phonology, study the orthography, and come up with a system which will allow you to generate phonemic transcriptions for spellings. And there are various ways to produce a rule-based G2P. You can use ICU transliteration rules, which are produced by the Unicode people, but I've tried, and they're very hard to read — and it's hard to read other people's rules if you don't have pre-existing knowledge of their thinking, the language, and many other things about the system. 
You could also use regular expressions, which let you do context-dependent replacements. That is even harder — it's almost impossible to read. As soon as it gets complicated, you're lost, and the code becomes unmaintainable; trying to fix a bug in someone else's regular expression, forget it. So the best thing to do, if you want to make a rule-based system, is to use finite-state transducers, written in one of two implementations of the same idea: Thrax and Pynini. Thrax is a custom grammar language for finite-state transducers, and Pynini is another implementation which you can import as a Python library, so you can use it within Python. Finite-state transducers written in Thrax are super easy, super fun, and a really interesting tool to use. So I'll talk a bit about that — you can see I'm a bit biased. FSTs operate on strings. (Here comes the terminology.) Strings can be single characters, words, sentences, or longer text — any level of character-based input. An FST takes a string and either changes it, lets it pass unchanged, or fails, refusing the string. There are various types of FSTs. The main one, of course, is the actual transducer. A transducer takes an input like "fish" and produces an output like the phonemic transcription of "fish". That's what a transducer does: it changes a string into a different string. An acceptor — which is also an FST; all of these are FSTs, which is a bit confusing, but it's worth getting your head around — takes a string and just returns the same string. So it is a transducer, but all it's doing is mapping from the thing to itself. That's called an acceptor. Then you have union, which is a collection, a set of FSTs — it's like "or", right? So "fish" or some other string — that union is itself an FST. 
So it's a union of two things together. Then there's concatenation, which is strings appended to other strings — a bit like "and". So you have "fish" and then the suffix "-y"; those two things concatenated together give you a new string, "fishy", right? And then finally, a bit more complicated, there's composition. I'll talk about this in a bit more detail in a minute. That's chaining together one FST after the other to produce a larger FST. This is very basic computability theory — I mean, I'm a linguist and I did my PhD on language documentation, so I'm not necessarily the world's leading expert in this — but if you have A to B, and then you compose A-to-B with B-to-C, your result will be A to C: you get all the way from A to C by composing those two rules together. Finally, there are context-dependent rewrites, which are basically just transducers — changing one string to another string — plus some context in which the change should occur. And context-dependent rewrites are really the ones which are crucial for the grapheme-to-phoneme problem. Okay, so here's a context-dependent rule written in Thrax. So, okay, forget "colonel", forget "yacht" — we're never going to get those. It's just not going to work; you have to enter those in the lexicon. There's no G2P rule you can write which will produce the pronunciation for those spellings. But for a word like "knight", okay, let's think about what we can do. It is possible to get from the spelling k-n-i-g-h-t to the phonemic transcription n-a-i-t. That's what we're after. Okay, so first of all, we're going to deal with the silent k. In a CD rewrite — this is the definition of a CD rewrite — you have a transducer (we're going to rewrite the string "kn" to "n" on its own), and we give it a preceding context, a following context, and a sigma star: the field within which it searches. 
And in this case, our preceding context is the beginning of the string, so the rule applies only at the beginning of the word — you'll see why that's important in a minute. And the following context is always going to be a vowel: you can't have word-initial k-n followed by another consonant. These are — I mean, this is shorthand — but these are acceptors for the beginning of the string and for any English vowel grapheme: a, e, i, o, u. And then this one, which is a bit confusingly named — that's just a convention — means any Unicode character: it's the field, the sigma star, over which the rule does its transduction. Any character at all is accepted by this part of the rule, so any string is allowed to pass through, in any language, any script, any writing system. Okay, so with this rule, we can now get to n-i-g-h-t, but we're not quite there yet, because we need another rule for the middle. So we have a second rule, another CD rewrite, which is basically going to rewrite i-g-h to a-i, the phonemic transcription of the vowel. The left context in this case is a consonant, and the right context is a union of a consonant and the end of the string — so it can either be a consonant on the right, or the end of the string, the end of the word. And again, we've got the same search field, any Unicode character. And now, with just those two rules, you can take this string — and this is a bit confusing, but a string is itself an FST, an acceptor of itself. So this is an FST: it takes the spelling in, and it just gives you the spelling back out. Then you compose that FST with the kn-rewrite rule. So that's an FST; this is an FST; and now the whole thing is an FST — A to B. And then you take that FST and you compose it with the igh rule. So now you've got A to B to C. 
So A to C — and this whole thing is an FST, and it will spit out, hopefully, the correct output, the phonemic transcription of "knight". So with just two rules we've got from the spelling, which is nothing like the pronunciation, to the correct pronunciation, okay? And what's really crucial and important to remember is that you can apply these rules to (hopefully) any English word, and they will always produce the correct phonemic transcription when those sequences are in there — and they won't misfire and produce bad transcriptions. So take an example like "breakneck". I'll just give you a minute — write it down if you like. Look at "breakneck" and think about why it won't be changed by this rule. You can shout it out. Exactly, okay? It's string-medial. The kn is not at the beginning of the string. So if you pass "breakneck" to the kn-rewrite rule, it's just going to pass unchanged. So we're not inappropriately rewriting this sequence, which is actually a coda plus an onset, not a silent k. And in exactly the same way, if you pass a string like "pigheaded" — which I spent a long time looking for — why won't the igh rule change "pigheaded"? Okay, it's got a consonant on the left, so that's the correct context, but on the right? Okay, exactly: there's a vowel. It's not a consonant, and it's not the end of the string. So again, it just passes through unchanged. So this little toy FST works beautifully — no problems, no exceptions. Can anyone think of any exceptions? I couldn't. Are there any exceptions to this in English orthography? I was trying to think about it. I added end-of-string here because you might have something like "thigh", right? Thigh: i-g-h and then the end of the string. So it's not necessarily a consonant following; it could be the end of the string. But I don't think there are any other problems with it. I mean, it's just a toy example. 
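The two CD rewrites just walked through can be imitated in plain Python with regular expressions. This is only a sketch of the semantics — and, ironically, exactly the kind of regex code I just warned becomes unmaintainable at scale; in practice you'd write these as Thrax or Pynini CDRewrite rules.

```python
import re

CONSONANT = r"[bcdfghjklmnpqrstvwxz]"

def rewrite_kn(s):
    # CDRewrite: "kn" -> "n" / beginning-of-string _ vowel
    return re.sub(r"^kn(?=[aeiou])", "n", s)

def rewrite_igh(s):
    # CDRewrite: "igh" -> "ai" / consonant _ (consonant | end-of-string)
    return re.sub(rf"(?<={CONSONANT})igh(?={CONSONANT}|$)", "ai", s)

def g2p(s):
    # Composition: string acceptor @ kn-rule @ igh-rule.
    return rewrite_igh(rewrite_kn(s))

print(g2p("knight"))     # -> nait
print(g2p("breakneck"))  # kn is string-medial: passes through unchanged
print(g2p("pigheaded"))  # vowel after igh: passes through unchanged
print(g2p("thigh"))      # end of string as right context -> thai
```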
We don't actually use a G2P like this for English — it would be terrible. Yeah, so FSTs for G2P: finite-state transducers for grapheme-to-phoneme conversion. The pros are that it's deterministic, so, like I said, it's always going to get it right — there are never going to be any exceptions. It's lightweight, it's small, it's cheap. And it really works, it works really well, even for languages which have large discrepancies between spelling and pronunciation, as long as the correspondences are, as we're going to see, fairly regular. The cons are that — you know, I said before that it's easy and fun to read, but even with this nice context-dependent rewrite language, it can get extremely difficult to read; people do all kinds of interesting things with these rules, which maybe they oughtn't, and it can become quite hard to read. It is sometimes quite inaccurate, due to things like the complexity of the writing system or the phonology. And it doesn't really handle exceptions very well: if you have some kind of lexical exception to your rule, that's where you'd need a lexicon — that's where the G2P won't really work. And yeah, it definitely doesn't work very well for messy orthographic systems like English or French, and there are many other such cases. But let's try Tibetan. Okay, so before I start: who knows much about Tibetan in the room? Okay, a few. So this is going to be really boring and basic for you guys, and super complicated for everyone else. Tibetan orthography is — wow, I've never seen anything like it. Okay, first thing to say: it's a Brahmic abugida, or alphasyllabary. If you don't know what that means: Brahmic means that it's ultimately derived from the Brahmi script, and an abugida means that, like modern Devanagari, you don't write the inherent vowel — you mostly just write consonants, and only when the vowel is something other than the inherent a do you write a diacritic to represent, say, e or u. 
So it's mostly consonants plus an inherent vowel. The current standard for Tibetan orthography emerged in the 9th century and has barely changed since then — correct me if I'm wrong. So the written and spoken forms of words now differ widely. There are many silent letters, like the k in "knight" in the English example, and lots of sound changes — vowel and consonant changes between the spelling and the pronunciation. Unlike English, what's very interesting about Tibetan is that the correspondences between the spelling and the pronunciation are mostly quite regular and predictable, so it is possible to write a G2P for Tibetan with a rule-based approach. I mean, I'm not recommending it as the optimal solution — it's probably better to find a lexicon or pay people to write pronunciations — but this is what we can do in a low-resource situation. So it's good to think about this, even in a complicated case. Okay, so this is what a Tibetan syllable looks like. So Tibetanists can — what is this word? Sorry, I don't have the transcription. Drumton? Drumton. So we have — how many of these letters are actually pronounced in that? It's a philosophical question; you'll see what I mean in a minute. Just as a high-level introduction: you usually have what's called a prefix — nothing to do with morphology, okay, it's purely about the orthography; it means a silent letter at the beginning. This is a vowel diacritic. Then you have a root character, which is probably pronounced, but maybe not the way you'd expect. Then you can have a subscript — another consonant appended to the first one. Then you can have a suffix, okay? That's a syllable. Usually the prefix, the suffix, and the secondary suffix are not pronounced — they're all silent. Then, in the second example, you have a vowel character. 
In this case, the main vowel — sorry, the main consonant — is not pronounced. This is a superscript, and actually this one is the root of the word, so the appended consonant is the one which is pronounced in this case. So that's another pattern. Then you have another suffix. Okay, so yeah. I'm just going to go through a few problems in Tibetan G2P, to illustrate the complexity of the problem. So, as I said, the inherent vowel is not marked in the orthography. So the first thing to do when you're going from grapheme to phoneme — well, first you convert every single letter in the Tibetan string to a Latin character, but then you need some insertion rules to add the vowels, right? That's the first problem. So what is actually written here? We've written the consonants, and we want to say "da" — forget the other letters for the moment — so we want to add that a after the d. And the same thing here: there's no vowel written in this one, so you want to insert an a there too. That's the easy bit. Okay, then there are special rules for this initial d-b sequence. If you have db at the beginning, the d is not pronounced, and the b can be lenited to w, or just deleted altogether. So this one is spelled d-b-u-s, and it's pronounced "ü" — no d, no b, no s, just ü. And the vowel is not even the same as the vowel in the orthography. So there's nothing left from the original spelling in the pronunciation. Suffixes: okay, there are four types. They can be silent or pronounced, and they can affect the preceding vowel or not. So in this case we've got b, a suffix which is pronounced in this case, and it does affect the preceding vowel. And then here, in this case, we've got l, a suffix which is pronounced but does not affect the preceding vowel. So there are special rules about suffixes, and you can group them into those four categories. Prefixes are typically not pronounced, but they can have effects on other sounds. 
So a preceding — I don't know what you call this character in English; never mind. I think it was originally a voiced velar fricative, and people have written whole articles about it. All right. So yeah, okay — you can see that I'm giving the Linguistics 101 version here, and some people are just shaking their heads, going "what is this?". So in the very simplest case, this prefix will de-aspirate the initial consonant: aspirated pʰ becomes plain p. And then here, this initial g — sorry, I'll start again. This is a g followed by the rest of the syllable, and the g is not pronounced, but it raises the tone. This was originally low tone, and the prefix raises the tone on the vowel after the following consonant. So: not pronounced, but having an effect on something way down the string. So imagine, bearing in mind the CD rewrite, right — how are you going to do that? How do you jump over all the other letters and make sure that that tone is raised? But you can. It's perfectly possible to do this with FSTs. Another interesting thing is what's called consonant stacking. So you can stack up to three consonant graphs one on top of the other. So this stack is the consonants and then an e, right? And the e is pronounced after the whole consonant stack, even though it's written on top of it. Don't ask me. Here's that stacking in action. So in this word, the subscript in this case — help me out, it's the y, the y at the bottom — will change the preceding sound from p to ch. I don't even know what to say about this. There's no rule; you can't call this affrication or something — I mean, what does p have to do with ch anyway? Here you go from k-r to t-r because of the subscript. Right? Am I right? I hope I've got this right. And then here — yeah, this is my favourite. So z with an l subscript is pronounced d. That's what goes on in Tibetan. 
That is what's happening in a literate Tibetan's mind when they see the sequence z-l: they go, "ah yes, d". That's how it is. Superscripts are a bit like prefixes. So the superscripts r and s raise the tone of vowels following nasals, but they're generally not pronounced. So in this stack here, the superscript s raises the tone of the vowel following a nasal: if you have a superscript s and then a nasal, the tone will go from low to high. And in this one special case — uniquely, I think — the stack is pronounced j, so from this spelling you get a j in the pronunciation. Finally, yeah, tone is not overtly marked in the orthography at all — there's no overt marking of tone. There's nothing in the orthography we can use to predict it directly, like an acute accent or a grave accent in a Latin-script system. But we can work out how to insert it in a rule-based manner. So basically, the Tibetan consonants are traditionally grouped into four columns based on their manner of articulation. If you have a vowel following a column-one or column-two consonant, it will generally have a low tone, whereas vowels following column-three and column-four consonants generally have a high tone. And remember, those tones can also be changed by prefixes and superscripts. So yeah — all that was just to say that this is an extremely complex case of the difference between an orthography and a phonemic transcription, and yet even in this case it's perfectly possible to get all the way from graphemes to phonemes just with a rule-based system. I'm already relying on a lot of work which has been done for the Tibetan script by the Unicode people and various others. So for example this stacking thing: the stack s-p-y-i is written from top to bottom, but it's already encoded linearly in Unicode as s, p, y, i — so the vowel follows the consonants. So yeah, the real key to G2P for Tibetan is to classify the graphemes. 
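As a toy illustration of what such a classification-then-rewrite cascade might look like, here's a plain-Python sketch for the single romanized syllable bsgrubs. Everything is hard-wired and greatly simplified: the letter classes and the g-to-d and b-to-p steps are my reading of the walkthrough, a real grammar is a chain of many context-dependent rewrite FSTs, and I've left tone assignment out entirely.

```python
def tag(s):
    # First pass: classify each letter. Hard-wired for "bsgrubs";
    # a real grammar derives this from context-dependent rules.
    assert s == "bsgrubs"  # toy: only this one syllable is handled
    return [("b", "prefix"), ("s", "superscript"), ("g", "root"),
            ("r", "subscript"), ("u", "vowel"),
            ("b", "suffix"), ("s", "suffix2")]

def drop_silent(tagged):
    # The prefix, superscript and secondary suffix are silent here.
    return [(c, t) for c, t in tagged
            if t not in ("prefix", "superscript", "suffix2")]

def apply_subscript(tagged):
    # The r subscript changes the quality of the plosive: g -> d.
    return [("d", t) if (c, t) == ("g", "root") else (c, t)
            for c, t in tagged]

def finalize(tagged):
    # Drop the class tags and devoice the final b to p.
    letters = [c for c, _ in tagged]
    if letters and letters[-1] == "b":
        letters[-1] = "p"
    return "".join(letters)

def g2p(s):
    # Chain the stages together, like composing the FSTs.
    return finalize(apply_subscript(drop_silent(tag(s))))

print(g2p("bsgrubs"))  # -> drup
```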
So the first pass is just to say: is it independent or dependent? Is it appended to another character, or is it an independent character? And then it comes to linguistic knowledge: is it a consonant or a vowel? Is it a prefix or a root? Is it a suffix? Is it a subscript or a superscript? And then you have many, many context-dependent rewrites to deal with all these sound changes — vowel changes, consonant changes. So here's a very, very simplified, abstracted version, but this is what's going on. You can imagine that each of these is an FST, a bit like a context-dependent rewrite. So you pass the string b-s-g-r-u-b-s to this rule, and it will tag all the letters for their classes. From this string we get: this one's a prefix, this one's a superscript, that's the root, that's a subscript — and so we know what the class of each character is. Then you just change all the characters to the Latin script; that's the first step towards the phonemic transcription. Then you want to remove all the silent letters, change the quality of the plosive from g to d, then remove all the tags, apply the final devoicing, and then add the tone marker, because we know that the root is column three. So each of these is an FST, and if you just chain them all together, you get all the way from graphemes to phonemes. What's very important to remember is that if you apply these rules to any Tibetan word, then hopefully these and many other rules will produce something like an accurate pronunciation for the word. So, to summarise: I talked a bit about how G2P fits within pronunciation models specifically, and within ASR and TTS systems more generally. I talked about how G2P can be achieved in a rule-based way for languages which have a systematic correspondence between the graphemes and the phonemes — the spelling and the pronunciation. 
I talked a bit about how you can write G2P rules in Thrax, and then we looked at the issues in Tibetan orthography and gave a simple G2P example for Tibetan. So yeah, if you're interested, you can try this in your own time — maybe it would be interesting to try to write a context-dependent grammar for Tibetan, or any other language that you're interested in. And I did want to say, in the context of this discussion, what this would be really useful for: if you can create this FST for the language you're working on for your documentation project, then you don't necessarily need to produce phonemic transcriptions for your one hour of recordings. All you need to do is get someone who knows the orthography to transcribe it orthographically — and you'll probably need to do some kind of normalisation on that text to get it into a state where the G2P will parse all the words — but then you just put that text through the G2P, and it will produce all the phonemic transcriptions for you. So this is good in some contexts: maybe you have lots and lots of recordings, and all you have is an orthographic transcription by a native speaker, and there's no analysis. Then you can use the transcriptions produced by your G2P for that data and throw them into your model, and that will probably make your acoustic model a lot better, if you have lots and lots of data like that. So that's a really important use case for this. Just a couple of resources: to learn more about FSTs, go to openfst.org — that's where you can find the downloads and resources for producing Thrax and Pynini rewrites. And I found this website really useful when I was working on Tibetan, so if you're interested, you can go there. I'll stop there — happy to chat.