I'm Sandy Ritchie. I'm a linguist at Google, in the speech department. My main responsibility is working on speech recognition for new languages which Google doesn't currently support. So a lot of my job is finding out very basic things about the phonology, about the writing system, how they say things like numbers, in order to create the first speech recognition systems for those languages.

Today I thought it would be interesting to complement the talks about Persephone with another aspect of speech recognition, one which is complementary and would work very well in tandem with Persephone in certain situations. It's got a rather technical title: I'm going to talk about grapheme-to-phoneme conversion, which is the process of changing spellings into phonemic transcriptions, that is, transcriptions of pronunciations, using finite state transducers, which are a type of, let's say, machine grammar that you can use to convert from spellings to pronunciations. I'll give a very brief overview of how automatic speech recognition, or ASR (I'll call it ASR a lot), and text-to-speech (TTS) work. Then I'll talk a bit more about how grapheme-to-phoneme, or G2P, conversion works, then about what finite state transducers, or FSTs, are. And then I'll talk a bit more about Tibetan, why it's a super interesting case, and why Tibetan G2P has been a really interesting challenge for us to work on.

So how does ASR work? At a very high level, abstracting a lot, it's audio in and well-formed, formatted transcriptions out. Most state-of-the-art ASR systems have three sub-components. (The text on this slide doesn't match the audio, if you can read waveforms. I have no idea. I just found it on the internet.) First there's the acoustic model, which gets you from sound waves to phonemic transcriptions. Within this system, that's basically what Persephone is doing: taking you from audio to phonemic transcriptions. Then there's the pronunciation model, which takes your phonemic transcriptions and turns them into possible spellings of those transcriptions. And then there's a language model, which disambiguates between the possible hypotheses, the possible variant spellings.

So if I say "your book", the sound gets turned into phonemic transcriptions of "your" and "book". The pronunciation model will turn the first one into "your" or "yore", but it doesn't know which; it just gives you both. Then the language model chooses between "your" and "yore" based on the context within the sentence. It's called an n-gram model: it looks at the preceding and following words, out to some degree n, to find the best hypothesis for the spelling.

So, the previous talks were about the acoustic model; that's really what Persephone is. I'm going to talk mostly about the pronunciation model today, and maybe a little about the language model at the end.

TTS, text-to-speech, is basically the same thing in reverse: text in, audio out. You take a well-formatted string like "12 minutes", and the first thing to do in text-to-speech is to change things like the digits 1, 2 into the word "twelve". That's a process called text normalisation.
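To make the division of labour between those last two stages concrete, here is a toy sketch in Python: a hypothetical pronunciation-model output feeding a hypothetical bigram language model (all data and names are mine, not Google's):

```python
from itertools import product

# Pronunciation-model output: each phone sequence maps to candidate spellings.
CANDIDATES = {"jɔː": ["your", "yore"], "bʊk": ["book"]}
# Toy 2-gram language-model scores.
BIGRAMS = {("your", "book"): 0.9, ("yore", "book"): 0.1}

def decode(phone_words):
    """Pick the spelling sequence the n-gram model scores highest."""
    best, best_score = None, -1.0
    for seq in product(*(CANDIDATES[p] for p in phone_words)):
        score = 1.0
        for a, b in zip(seq, seq[1:]):
            score *= BIGRAMS.get((a, b), 0.001)  # small floor for unseen bigrams
        if score > best_score:
            best, best_score = seq, score
    return " ".join(best)

print(decode(["jɔː", "bʊk"]))  # -> "your book"
```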
And then after normalisation you have, let's say, the written form of the spoken speech. That gets turned into phonemic transcriptions, and then a synthesis model turns the phonemic transcription back into audio, so it speaks back to you. So if you say something to your Google Home or your Alexa (I'm not going to promote Google Home over Alexa), you can say "set my alarm", some NLU/NLP magic happens in the middle, which I don't really know about, and then it says "I've set your alarm".

OK, so the pronunciation model in particular. In most production systems it's made up of two components: a lexicon and a grapheme-to-phoneme model, or G2P. The lexicon is very high quality, hand-curated phonemic transcriptions of spelled words in the language. These are made by linguistic experts who are trained in phonemic transcription, and they produce really good quality transcriptions for words like this one, "colonel", which is pronounced "kernel". You wouldn't guess it from the spelling, and that's why we need a lexicon: it's almost impossible to guess that that word is pronounced "kernel" from the spelling. The pros of a lexicon are that it's really high quality and very reliable; you can almost certainly rely on it to get a pronunciation right. The cons are that it's static, so new words, neologisms, slang and non-standard words are usually not included, and it's extremely expensive, because you have to pay people with linguistic training to hand-transcribe every single word.

So to complement the lexicon, we also have another component, the G2P, a grapheme-to-phoneme converter. That's a rule-based or machine-learned model which learns, or is told, how to produce pronunciations, phonemic transcriptions, for words that aren't in the lexicon. I don't know if anyone knows this word. I found it on the internet. Anyone? Apparently it means flirting in London slang. I didn't know either. It's unlikely to be in the lexicon. It's the kind of thing someone would say, and where the lexicon falls down and doesn't have an entry for the word, the G2P kicks in, takes over, and tries to produce some kind of useful phonemic transcription for the spelling. The pros of the G2P are that it's dynamic (it can produce a pronunciation for any word) and it's quite cheap; but it's unpredictable, the results are, let's say, variable, and depending on the script it can be really, really bad.

So really, in a production system, you need both. They work in tandem: the lexicon and the G2P complement each other, and where the lexicon falls down, the G2P takes over, and vice versa.
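A minimal sketch of that tandem in Python, with a hypothetical one-entry lexicon and a deliberately dumb stand-in for the G2P:

```python
# Toy lexicon-plus-G2P fallback (all names and data hypothetical).
LEXICON = {"colonel": "ˈkɝnəl"}  # hand-curated, high quality, but static

def toy_g2p(word):
    # Stand-in for a real rule-based or learned G2P model; a letter-by-letter
    # guess like this is wrong for English, but it shows where the G2P sits.
    return word

def pronounce(word):
    # Lexicon first; where it falls down, the G2P kicks in.
    return LEXICON.get(word, toy_g2p(word))

print(pronounce("colonel"))  # lexicon hit  -> ˈkɝnəl
print(pronounce("fishes"))   # lexicon miss -> G2P fallback
```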
So let's talk a little more about spelling and pronunciation. This is an obvious point to make, but the relationship between spelling and pronunciation is not always predictable. Even in languages with very shallow orthographies (by shallow I mean that the orthography matches quite closely to the phonemic realisation of the language), there are going to be context-dependent rules.

Take this really simple example from Italian. The Italian grapheme N is pronounced as the phoneme /n/ in this word, "naso", which means nose; in this word, "finché", which means until; and in this word, "mano", which means hand. So: initial, medial, final; it must be /n/ everywhere, right? So maybe we can just have a context-independent system that replaces the grapheme with the phoneme, and we're done. And then you get to cases like this, where it's part of the digraph GN. If you just map N to /n/ in a context-independent way, your system is going to produce /agnello/ for "agnello", which is incorrect: it's pronounced with a palatal nasal, /aɲello/. It means lamb.

So given that even a simple, shallow orthography has cases like that, think about English words like "yacht", "colonel", "knight". You're lost. What does the CH represent in the pronunciation of the word "yacht"? I don't know. There must be some L2 learners of English in the room; any spellings you found particularly annoying to remember? O-N-E? Yes, "done" and "gone", exactly: the same spelling, two completely different pronunciations. Or O-U-G-H, right? "Through", "bough", "cough". "Persephone"? Yes, exactly. English is etymological: you have to know about Greek orthography to be able to give pronunciations to English words.

OK, so, grapheme-to-phoneme conversion. There are various ways to do it; it's not one size fits all. The most common is to use a machine learning model, and for that you need lots of data, so you basically already need a pre-existing pronunciation lexicon. A lexicon is just a huge list of mappings between spellings and phonemic transcriptions, and if you have that, you can train a G2P model. It's just a sequence-to-sequence problem: the model sees that this spelling corresponds to this phonemic transcription in all these different contexts, and it learns to produce the correct phonemic transcription for any word. Of course, in many situations we don't have that yet; in fact, that's exactly what we need to build to get a system going. So it's not viable for many, probably most, languages in the world. There are about 7,000 languages in the world, and at Google I think we support something like 120 for ASR. That's under two percent; very, very few in terms of the number of languages, though in terms of the number of speakers those 120 languages already cover quite a large part of the population.

So, as an alternative to machine learning models, you can use a rule-based approach. This is super cheap: you just need one linguist to sit down, study the phonology, study the orthography, and come up with a system that generates phonemic transcriptions from spellings. There are various ways to produce a rule-based G2P. You can use ICU transform rules, which come from the Unicode world, but I've tried, and they're very hard to read; it's hard to read other people's rules unless you have pre-existing knowledge of their mind, the language, and many other things about the system. You could also use regular expressions to do context-dependent replacements, but that's even harder, almost impossible to read. As soon as it gets complicated you're lost and the code becomes unmaintainable; trying to fix a bug in someone else's regular expression, forget it. So the best thing to do, if you want to make a rule-based system, is to use finite-state transducers, written in two languages, or really two implementations of the same thing: Thrax and Pynini.
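As a first taste of what those look like, here is roughly the Italian GN case from above as a context-dependent rewrite in Pynini (a sketch of my own, ignoring gemination and stress):

```python
import pynini
from pynini.lib import byte

# Sigma star: the search field of any byte sequence (any UTF-8 string).
sigma_star = pynini.closure(byte.BYTE)

# Rewrite the digraph gn as the palatal nasal, in any context; because the
# rule consumes "gn" as a unit, a naive letter-by-letter n rule applied
# afterwards can no longer break the digraph apart.
gn_rule = pynini.cdrewrite(pynini.cross("gn", "ɲ"), "", "", sigma_star)

print(pynini.shortestpath(pynini.accep("agnello") @ gn_rule).string())  # aɲello
```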
So Thrax is a custom grammar language for finite-state transducers, and Pynini is another implementation of the same thing as a Python library, so you can use it within Python. Finite-state transducers written in Thrax are super easy, super fun, and a really interesting problem to work on, so I'll talk a bit about that. You can see I'm a bit biased.

FSTs operate on strings; this is the terminology, so here it comes. Strings can be single characters, words, sentences or longer text: any level of character-based input. An FST takes a string and either changes it, lets it pass unchanged, or fails and rejects the string.

There are various types of FST. The main one, of course, is the actual transducer: it takes an input like "fish" and produces an output like the phonemic transcription of fish, /fɪʃ/. It changes a string into a different string. Then there's the acceptor, which is also an FST; all of these are FSTs, which is a bit confusing but worth getting your head around. An acceptor takes a string and just returns the same string: it is a transducer, but all it does is map the thing to itself. Then there's union, which makes collections or sets of FSTs; it's like "or", so "fish" or some other string, the two together, is itself an FST. And there's concatenation, where a string is appended to another string; it's a bit like "and", so "fish" concatenated with "y" gives you a new string, "fishy". And finally, a bit more complicated, there's composition, which I'll talk about in more detail in a minute: chaining one FST after another to produce a larger FST. This is very basic computation theory; I'm a linguist, I did my PhD on language documentation, so I'm not necessarily the world's leading expert in this. But if you have an FST from A to B, and you compose it with an FST from B to C, the result is a single FST from A to C: you get all the way from A to C by composing the two together.

Finally, there are context-dependent rewrites, which are basically just transducers (changing one string to another string) plus a context in which the change should occur. Context-dependent rewrites are the ones that are really crucial for the grapheme-to-phoneme problem.

So here's a context-dependent rule written in Thrax. Forget "colonel", forget "yacht"; we're never going to get those. You have to enter those in the lexicon; there's no G2P rule you can write that will produce those pronunciations from those spellings. But for a word like "knight", let's think about what we can do. It is possible to get from the spelling K-N-I-G-H-T to the phonemic transcription /naɪt/, and that's what we're after.

First of all, we need to deal with the silent K. This is the definition of a CD rewrite: you have a transducer (here, we rewrite the string KN to N on its own), and you give it a preceding context, a following context, and a field within which it's searching. In this case, the preceding context is the beginning of the string, so the rewrite happens only at the beginning of the word (you'll see why that's important in a minute), and the following context is always going to be a vowel: you can't have KN before another consonant.
These are shorthand on the slide, but they're acceptors: one for the beginning of the string, and one for any English vowel grapheme, so A, E, I, O, U. And then there's this one, which is a bit confusingly named sigma star; that's just a convention. It means any Unicode character, or a space: it's the field in which you're searching, or looking to do this transduction. Any character will be accepted by this rule, so any string in any language, in any script, in any writing system, is allowed to pass through.

OK, so with this rule we can now get to N-I-G-H-T, but we're not quite there, because we need another rule. The second rule is another CD rewrite, and it rewrites IGH to the phoneme /aɪ/. The left context in this case is a consonant, and the right context is a union of a consonant and the end of the string: there can either be a consonant on the right, or the end of the string, the end of the word. And again we've got the same search field of any character.

Now, with just those two rules, you can take this string. This is a bit confusing, but a string is itself an FST, an acceptor of itself: it takes its spelling in and just gives you the spelling back out. You compose that FST with the KN rewrite rule, so that's an FST composed with an FST, A to B, and then you compose it with the IGH rule. So now you've got A to B and B to C, which gives you A to C, and this whole thing is one FST, which will spit out, hopefully, the correct output: the phonemic transcription /naɪt/. So with just two rules we've got from a spelling which is nothing like the pronunciation to the correct pronunciation.

What's really crucial to remember is that you can apply these rules to any English word, and they will produce the correct phonemic transcription wherever those sequences occur, without mis-producing bad transcriptions elsewhere. Take an example like "breakneck". I'll just give you a minute if you want to write it down. Look at "breakneck" and think about why it won't be changed by the KN rule. You can shout it out. Exactly: it's string-medial. The KN is not at the beginning of the string, so if you pass "breakneck" to the KN rewrite rule, it just passes through unchanged. We're not inappropriately rewriting this KN, which is actually a coda plus an onset, into just an onset. And in the same way, if you pass a string like "pigheaded", which I spent a long time looking for: what's wrong with this? Is the IGH rule going to change "pigheaded"? It's got a consonant on the left, so the preceding context matches, but on the right, exactly: there's a vowel. It's not a consonant and it's not the end of the string, so again it just passes through unchanged.

So this little toy FST works beautifully: no problems, no exceptions. Can anyone think of any exceptions? I couldn't. Are there any exceptions to this in English orthography? I was trying to think about it. I added the end of the string here because you might have something like "thigh": IGH and then the end of the string, so what follows isn't necessarily a consonant, it can be the end of the string. But I don't think there are any other problems with it. It's just a toy example; we don't actually use G2P for English, it would be terrible.
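Here is that toy example sketched in Pynini (hedged: the talk's slides are in Thrax; this is my rough Python equivalent):

```python
import pynini
from pynini.lib import byte

sigma_star = pynini.closure(byte.BYTE)        # "any character": the search field
vowel = pynini.union(*"aeiou")                # English vowel graphemes
consonant = pynini.union(*"bcdfghjklmnpqrstvwxyz")

# Rule 1: kn -> n, only at the beginning of the string, before a vowel.
kn_rule = pynini.cdrewrite(pynini.cross("kn", "n"), "[BOS]", vowel, sigma_star)

# Rule 2: igh -> aɪ, after a consonant, before a consonant or end of string.
igh_rule = pynini.cdrewrite(
    pynini.cross("igh", "aɪ"), consonant, pynini.union(consonant, "[EOS]"), sigma_star
)

def g2p(word):
    # string -> acceptor of itself, composed with rule 1, then rule 2
    return pynini.shortestpath(pynini.accep(word) @ kn_rule @ igh_rule).string()

print(g2p("knight"))     # -> naɪt
print(g2p("breakneck"))  # kn is string-medial: passes through unchanged
print(g2p("pigheaded"))  # igh is followed by a vowel: passes through unchanged
```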
Yeah, so, FSTs for G2P: finite-state transducers for grapheme-to-phoneme conversion. The pros are that it's deterministic, so, as I said, it always gets it right, there are never any exceptions; it's lightweight, small and cheap; and it really works, it works well even for languages with large discrepancies between spelling and pronunciation, as long as the correspondences are, as we're going to see, fairly regular. The cons: I said it's easy and fun to read, but even with this nice context-dependent rewrite language it can get extremely difficult to read; people do all kinds of interesting things that maybe you wouldn't do. It's sometimes quite inaccurate, due to time constraints or the complexity of the writing system or the phonology. And it doesn't handle exceptions very well: if you have some kind of lexical exception to your rule, that's where you need a lexicon, that's where the G2P won't really work. And it definitely doesn't work well for messy orthographic systems like English or French, and there are many other cases.

But let's try Tibetan. Before I start, who knows much about Tibetan in the room? OK, a few, so this is going to be really boring and basic for you guys and super complicated for everyone else. Tibetan orthography is, wow, I've never seen anything like it. First thing to say: it's a Brahmic abugida, or alphasyllabary, if you don't know what that means. Brahmic means it derives ultimately from the Brahmi script of ancient India, and an abugida means that, like in modern Devanagari, you don't write the inherent vowel: you mostly just write consonants, and only when the vowel is not the inherent "a" do you write a diacritic, to represent i or u and so on. So it's mostly consonants plus an inherent vowel. The current standard for written Tibetan emerged in the 9th century and has barely changed since then, correct me if I'm wrong. So the written and spoken forms of words now differ widely: there are many silent letters, like the K in "knight" in our English example, and lots of sound changes, vowel and consonant changes, between the spelling and the pronunciation.

Unlike English, though, what's really interesting about Tibetan is that the correspondences between spelling and pronunciation are mostly quite regular and predictable. So it is possible to write a G2P for Tibetan with a rule-based approach. I'm not recommending it as the optimal solution; it's probably better to find a lexicon, or pay people to write pronunciations. But this is what we can do in a low-resource situation, so it's good to think through even a complicated case like this.

So this is what a Tibetan syllable looks like. Tibetanists, what is this word? Sorry, I don't have the transcription. "Dromtön"? Dromtön. So, how many of these letters are actually pronounced? It's a philosophical question; you'll see what I mean in a minute. Just as a high-level introduction: you usually have what's called a prefix, which has nothing to do with morphology, it's purely about the orthography; it's a silent letter at the beginning. This is a vowel diacritic. Then you have a root character, which is probably pronounced, but maybe not. Then a subscript, another consonant appended to the first one. Then you can have a suffix, and a secondary suffix. That's a syllable. Usually the prefix, the suffix and the secondary suffix are not pronounced; they're all silent.
In the second one, you have a vowel character, and in this case the main consonant is not pronounced: this one is a superscript, and the appended consonant below it, which is actually the root of the word, is the one that's pronounced here. And then there's another suffix.

OK, so, as I said, I'll just go through a few problems in Tibetan G2P to illustrate the complexity of the problem.

First, the inherent vowel is not marked in the orthography. So when you're going from graphemes to phonemes, first you convert every letter in the Tibetan string to a Latin character, but then you need insertion rules to add the vowels. That's the first problem. So this is what's actually written here, and we want to say "da", so we want to add that "a" after the D; and the same thing here, there's no vowel written on this, so we want to add the "a". That's the easy bit.

Then there are special rules for the initial DB sequence. If you have DB at the beginning, the D is not pronounced, and the B can be lenited to W or deleted altogether. So this one is spelled D-B-U-S and it's pronounced "ü": no D, no B, no S, and the vowel isn't even the same as the vowel in the orthography. There's nothing left from the original spelling in the pronunciation.

Suffixes: there are four types, in that they can be silent or pronounced, and they can affect the preceding vowel or not. In this case we've got B, a suffix which is pronounced here and doesn't affect the preceding vowel; and here we've got L, a suffix which is pronounced and does affect the preceding vowel. So the special rules about suffixes fall into those categories.

Prefixes are typically not pronounced, but they can have effects on other sounds. So, a preceding 'a-chung; I don't know how you say this character. I think it was originally a voiced velar fricative, and there have been articles written about it. Fine, I can see that I'm doing Tibetan 101 here, and some people are just shaking their heads and going, what is this? In the very simplest case, this prefix changes the aspiration of the initial consonant: "pen" versus "phen". And here, this initial G is not pronounced, but it raises the tone: the vowel is originally low tone, and the silent G raises the tone on the vowel following the consonant. So: not pronounced, but having an effect on something further along in the string. Imagine, bearing in mind the CD rewrite, how you're going to do that: how do you jump over all the other letters and make sure that that tone is raised? But you can; it's perfectly possible.

Another interesting thing is consonant stacking: you can stack up to three consonant graphemes one on top of the other. So this stack is these consonants, and then E; the E is pronounced after the consonant stack, even though it's written on top of it. Don't ask me.

Here's a stack in action. In this word, the subscript, help me out, the Y at the bottom, changes the preceding sound from P to CH. I don't even know what to say about this; there's no rule, you can't really call this affrication; what does P have to do with CH anyway? So here you go from one to the other because of the subscript, am I right? I hope I'm getting this right. And then here, yeah, this is my favourite: ZLA, with another subscript, is pronounced "da". That's what goes on in Tibetan. That is what's happening in a literate Tibetan's mind when they see the sequence ZLA: it's "da". That's how it is.
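For what it's worth, over romanized (Wylie-style) spellings those two subscript effects are easy to state as rewrites. A toy Pynini sketch, using my own romanization rather than anything from the production system:

```python
import pynini
from pynini.lib import byte

sigma = pynini.closure(byte.BYTE)

# Toy subscript rules over Wylie-style romanizations (illustration only):
# the y subscript turns p into ch, and the zl- stack is pronounced d.
py_rule = pynini.cdrewrite(pynini.cross("py", "ch"), "", "", sigma)
zl_rule = pynini.cdrewrite(pynini.cross("zl", "d"), "", "", sigma)
subscripts = py_rule @ zl_rule

print(pynini.shortestpath(pynini.accep("zla") @ subscripts).string())  # -> da
print(pynini.shortestpath(pynini.accep("pya") @ subscripts).string())  # -> cha
```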
Superscripts are a bit like prefixes. The superscripts R, S and L raise the tone of vowels following nasals, but they're generally not pronounced themselves. So here, the superscript S raises the tone of the vowel following the nasal: if you have S stacked over a nasal, the vowel goes from low to high. And there's one special case, I think the only one, where this particular stack gives you something different again.

Finally, tone. There is no overt marking of tone: there's nothing in the orthography we can use directly to predict an acute accent or a grave accent in a Latin-script transcription. But we can work out how to insert it in a rule-based manner. Tibetan consonants are traditionally grouped into four columns based on their manner of articulation, and a vowel following a column 1 or column 2 consonant will generally have a high tone, whereas vowels following column 3 and column 4 consonants will generally have a low tone. But remember, those tones can also be changed by prefixes and superscripts.

So, all of that was just to say that this is an extremely complex case of the difference between an orthography and a phonemic transcription, but even here it's perfectly possible to get all the way from graphemes to phonemes with a rule-based system. And we're already relying on a lot of work done for the Tibetan script by the Unicode people and various others. For example, this stack is S-P-Y with the vowel I, written from top to bottom, but it's already rendered linearly in Unicode as the sequence S, P, Y, I, with the vowel following the consonants.

The real key to G2P for Tibetan is to classify the graphemes. The first pass just asks: is it independent or dependent? Is it appended to another character, or is it a standalone character? Then comes the linguistic knowledge: is it a consonant or a vowel? A prefix or a root? A suffix? A subscript or a superscript? And then you have many, many contextual rules to deal with all the sound changes, vowel changes, consonant changes.

Here's a very, very simplified, abstracted version of what's going on. You can imagine each of these steps as an FST, a bit like a context-dependent rewrite. You pass the string B-S-G-R-U-B-S to the first rule, and it tags all the letters for their statuses: this one's a prefix, this one's a superscript, that's the root, that's the subscript, and so on. Once we know the class of each character, you change all the characters to Latin script; that's the first step towards the phonemic transcription. Then you remove all the silent letters, change the quality of the plosive here from G to D, remove all the tags, apply the final devoicing, and add the tone marker, because we know the root is column 3. Each of these is an FST, and if you chain them all together, you get all the way from graphemes to phonemes. And what's very important to remember is that if you apply these rules, this and many others, to any Tibetan word, they will hopefully produce something like an accurate pronunciation for that word.
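Here's a toy Pynini version of that cascade over the romanized string, skipping the tagging step and the tone marker (a sketch of the idea, not the production grammar):

```python
import pynini
from pynini.lib import byte

sigma = pynini.closure(byte.BYTE)

# Toy cascade for romanized "bsgrubs" (illustration only; the real system
# first tags each grapheme's class, which these rules simply assume).
rules = [
    pynini.cdrewrite(pynini.cross("b", ""), "[BOS]", "", sigma),   # silent prefix b-
    pynini.cdrewrite(pynini.cross("s", ""), "[BOS]", "", sigma),   # silent superscript s-
    pynini.cdrewrite(pynini.cross("gr", "dr"), "", "", sigma),     # quality change: g -> d
    pynini.cdrewrite(pynini.cross("s", ""), "", "[EOS]", sigma),   # silent secondary suffix -s
    pynini.cdrewrite(pynini.cross("b", "p"), "", "[EOS]", sigma),  # final devoicing: b -> p
]

g2p = rules[0]
for rule in rules[1:]:
    g2p = g2p @ rule  # composition chains the stages into one FST

# Tone marking (column 3 -> low tone) is omitted here for brevity.
print(pynini.shortestpath(pynini.accep("bsgrubs") @ g2p).string())  # -> drup
```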
So, to summarise: I talked about how G2Ps fit within pronunciation models specifically, and ASR and TTS systems more generally; about how G2P can be done in a rule-based way for languages which have a systematic correspondence between the graphemes and the phonemes, the spelling and the pronunciation; about how you can write G2P rules in Thrax; and then we looked at the issues in Tibetan orthography and a simple G2P example for Tibetan.

So yeah, if you're interested, you can try this in your own time. Maybe it would be interesting to try to write a context-dependent rule for Tibetan, or any other language you're interested in.

And in the context of this discussion, I did want to say what this would be really useful for. If you can create this kind of FST for the language you're working on in your documentation project, then you don't necessarily need to produce phonemic transcriptions for your one hour of recordings by hand. All you need to do is get someone who knows the orthography to transcribe it orthographically. You'll probably need to do some kind of normalisation on that text to get it into a state where the G2P can parse all the words, but then you just feed the text into the G2P, and it produces all the phonemic transcriptions for you (there's a small sketch of this at the end). This is good in a context where you have lots and lots of recordings and all you have is an orthographic transcription by a native speaker, with no analysis: you can take the transcriptions produced by your G2P for that data, throw them into your model, and that will probably make your acoustic model a lot better, if you have lots of data like that. So that's a really important use case for this.

Just a couple of resources if you want to know more about FSTs: go to openfst.org, where you can find all the downloads and resources for producing Thrax grammars and context-dependent rewrites. And I found this website really useful when I was looking at Tibetan, so if you're interested, you can go there.

I'll say thugs rje che: thank you.
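And here is the small sketch of that workflow mentioned above, reusing the composed g2p FST from the toy cascade earlier (the file name is hypothetical):

```python
import pynini

def transcribe(line, g2p):
    # Run a composed G2P FST over each word of an orthographic transcription.
    return " ".join(
        pynini.shortestpath(pynini.accep(word) @ g2p).string()
        for word in line.split()
    )

# e.g. turning orthographic transcripts into phone-level training data:
# with open("transcript.txt") as f:   # hypothetical transcript file
#     for line in f:
#         print(transcribe(line.strip(), g2p))
```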