National education is an example: most of us think it's good for people to be able to read, but it has also been a force for cultural oppression and homogenization. For example, for a long stretch of its history in the country I currently live in, if a schoolchild spoke Welsh in school, a sign would be hung around their neck as a badge of shame; that was called the Welsh Not. Now I will look very briefly at the relationship of language to nation building, comparing 19th-century Europe with modern Asia, as context for fieldwork and language documentation. In 19th-century Europe, linguistic description was a cornerstone of nation building in at least three ways. First, philology of ancient varieties of the national vernaculars: think of the Germanic sagas, for example; one way to be a nationalist was to study the medieval poetry of your own vernacular. Second, dialectology: dialect grammars and dictionaries. Third, systematic dialect geography and dialect maps, driven by the notion, particularly in Germany and later in France, that to really appreciate who the German people are, you need to study German as it is spoken in each German-speaking village and understand its dialects and folk traditions. For whatever reason, these elements of nationalism do not seem to be repeating themselves in Asia; that's my observation. Philology of ancient varieties: everyone pays lip service to it, but only in China do you see major investment. In India, as much as the BJP loves Sanskrit, you don't see people poring over Sanskrit texts and producing editions of them as a manifestation of Hindu nationalism. As for dialect grammars and dictionaries, there's a little work here and there, but the modern states of Asia really don't seem to feel the need to express their national identity through linguistic fieldwork. And that's worrying, I think, because the pressures that decrease linguistic diversity and that come with nation states, like national education and increased travel, are there, but the countervailing pressures of documenting and celebrating linguistic diversity are not there the way they were in 19th-century Europe. Now let me look very briefly at how things are going here in the UK. Welsh is doing well, Scottish Gaelic could be doing better, Cornish died in 1777, Manx died in 1974, but there are revitalization efforts going on for Cornish and Manx. Something that can focus our minds is that the Cornish and the Manx have been able to attempt revitalization because those languages were well studied and well documented, whereas when, say, Minyak disappears, it's not clear that a community that later wanted to revive it could do so, because not enough will have been done. So our job as historical linguists, who by the nature of historical linguistics need to look at a variety of languages from a family,
and our job as descriptive linguists, looking at the range of languages in the world, is a very hard one, because of the quick pace of cultural homogenization backed by national governments and corporations, and because documentary linguistics has even fewer resources than it did in the 19th century; that is my perception, at least. In the 19th century you could go to a national government or an academy and say, give me loads of money to study dialect diversity, and they would; that's not so much the case anymore. So we need technology to help us. Let's look at some new technologies. To return to the theme I'm elaborating: technology brings improved standards of living, increased productivity if you like, but also new forms of oppression. I think this is very true of the very technologies I'll be talking about. Consider the role of digital communication in controversies such as the NSA surveillance revealed by Edward Snowden, or Cambridge Analytica's role in the Trump election in 2016. But if we're sophisticated users of technology, we can also resist these forms of oppression. There's the Tor browser, for instance, a browser designed to resist monitoring by governments. I use it a lot myself; it's not very good for making bookings at restaurants, I'll have you know. You can also have encrypted email accounts; ProtonMail is the one that I recommend. Now, moving from this general picture of how technology has shaped the modern world in terms of increased productivity, increased oppression, and the ability to resist that oppression, let's look at how the same tendencies are manifesting in academia. I'm going from big to small: society at large, then academia, then linguistics. I have been surprised in the past that some of these things are not known to PhD students, for example, so I think it's worth talking them through. I'm going to look at YouTube, Library Genesis, and Sci-Hub in particular. In the context of my ERC grant that just ended in August, we put a lot of the talks on YouTube, and I think that's hugely advantageous for scholarly dissemination. One of the advantages of the coronavirus situation we're seeing now is that academics, who tend to be very conservative, are figuring out how to use these technologies: you can do things like have people from all around the world attend a doctoral school. But I also think we should try to put as much of that as makes sense into public access. Just imagine if we were able to watch talks by Brugmann or de Saussure or the other greats of linguistic history. If we find that idea attractive, we owe it to the future to try to document some of our own presentations, now that it's technologically possible. So now I will talk about Library Genesis. This is a way to find books in particular, and it works a lot like a library catalog. You do have to check the URL, because depending on where you live or what's going on in the world, the URL can change. I put "Nathan Hill Tibetan" into the search window and, as in a library catalog, various things come up; one of them is my 2019 book. You click on it, you can get the PDF file, and there's the book.
Yeah, so you've just saved yourself 85 pounds and have a relatively new book. Especially in the coronavirus period, this is a way of accessing scholarly resources that might not be as accessible as they once were. Sci-Hub, which we look at next, is better for articles. Similarly, you may have to check what the URL currently is; you can basically just Google something like "where is Sci-Hub now". In this case we first go to the homepage of the article in question. I found one here, on the use of reflectance transformation imaging in recording and analysing Burmese Pyu inscriptions, and we look for the DOI; you see it at the bottom. The DOI is a unique identifier, like an ISBN for a book. You can copy that. Of course, if you have access to this journal you can just click through to it, but I'm presuming maybe you don't; they want to charge you something like 30 pounds for looking at this article. So you go to Sci-Hub, you put in the DOI, and it brings the article right up. So that is how to get books and articles respectively using new technological resources in the 21st century, although I should mention that whether or not it's legal is something you should consider for yourself in your own jurisdiction; in Burma it would be perfectly legal. That's it for generic technological resources for scholarship in general. Now I will look at something specific to language documentation: language archives, of which there are now quite a few. What is a language archive? When you document a language and produce sound files and video files, it's a place that can host those files in perpetuity in some kind of curated fashion. I'm going to look at them one at a time. The first is PARADISEC, which is based in Australia and is particularly strong in Australian languages. I'm not super familiar with it, but colleagues whose opinion I respect a great deal like it a lot; it doesn't charge money for deposits and it's quite accessible. Next I'll mention ELAR, the Endangered Languages Archive, which is based at SOAS. Personally, I find it quite difficult to use; it's hard to figure out whether they have a resource on the language you're interested in. I should explain that it is tied to a grant program, the Endangered Languages Documentation Programme: if you're given one of their grants, you're expected to deposit your data in the Endangered Languages Archive. Now, a lot of people, I would say as many as half of the people who get these grants, never end up depositing their data; they're simply bad about it. But the system still shows a deposit that's waiting for them to put their data in. Just as an example from Tibeto-Burman linguistics, Mark Post got a grant to work, I think, on Galo, and his deposit is listed, but there are no files. That makes you feel crazy: you see there's a deposit, you can't find any files, you think you've done something wrong, whereas actually the depositor hasn't put anything in. You also have to have an account to use it, so it's not my favorite, although in deference to my SOAS colleagues I should say that all of these decisions were made for reasons that do make sense in some circumstances.
Then there is DoBeS, which was associated with a grant program of the Volkswagen Foundation. That program is now closed, but they will still host new data. Without going into too many details, my complaints are similar: I find it hard to find things, a lot of material isn't accessible, and there are different levels of security. My personal favorite endangered-language archive, if you like, is the Pangloss archive, based at the CNRS in Paris, specifically at the LACITO lab. Why do I like it? Because it has an extremely flat structure. There's no "look here, look there, log in, get permissions"; it's just a set of pages, very flat, and what you see is what you get. In general you get a sound file that is glossed and translated. I'll stay on this one for a little while. It's a small enterprise and very French, which makes sense, since it's based at a French lab and paid for by French taxpayers. That does mean that the languages covered tend to be the ones studied by people who work at LACITO. That's good for me, because that lab has done a lot of work on Sino-Tibetan languages. It's worth noting that they will take deposits from the public, as long as you format your data in a way that's easy for them to ingest, and I would recommend that: if you're going to go to the trouble of really formatting your data for an archive anyhow, you might as well go with Pangloss. At the technical level, every deposit gets a DOI, which is not true, for example, at ELAR, and that DOI is resolvable down to the sentence level. So if you want to cite someone's data in a paper of yours, you can give that DOI and it will take the reader immediately to the exact sentence you're citing. That is the cutting edge of scholarly practice; pause on that thought. So I like the Pangloss archive, I encourage you to take a look at it, and if you are in the language documentation business or doing fieldwork yourself, consider giving your data to Pangloss. By way of transition, another thing I want to tell you about is Zenodo. It is a generic research data repository supported by the EU and run out of CERN, where they do the atom smashing. Most of the data in it, as you see here, is from the hard sciences. I think it's worth asking ourselves why we need special language archives: language data is just research data, so why can't we use a general research data repository like Zenodo? What's good about Zenodo is that it's totally free and extremely easy to use; there's no upper limit on the number of deposits, and a single deposit can be up to 50 gigabytes. It's just the most user-friendly thing out there, and each deposit generates a citation and a DOI, so it also has advantages for citation practices. What are the disadvantages? The only real one is that it doesn't provide a good interface for people to look at your data online. That has been an emphasis in the design of other language archives: you can play a video and look at the transcription while you play it, whereas in Zenodo it's just "here are my files"; you can download them and read them on your own computer if you want to. But I think that's not so bad. If we make it easier for researchers to share their data, then more researchers will share their data, and the more people share their data, the more we'll have a rigorous scientific enterprise where our findings are verifiable. Maybe too much effort has been spent worrying about lovely interfaces for language archives. That's my pitch for Zenodo.
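Just as an illustration of how low the barrier is, here is a rough sketch of depositing a dataset on Zenodo through its REST API. This is only my sketch of the workflow as I understand it from the Zenodo developer documentation; the access token, file name, and metadata are placeholders, and you should check the current documentation before relying on the exact endpoints.

# Rough sketch: deposit fieldwork data on Zenodo via its REST API.
# Endpoints and field names follow my reading of the Zenodo docs and
# should be double-checked; the token and file name are placeholders.
import requests

ZENODO = "https://zenodo.org/api"
TOKEN = "YOUR-ZENODO-ACCESS-TOKEN"  # created in your Zenodo account settings

# 1. Create an empty deposition.
r = requests.post(f"{ZENODO}/deposit/depositions",
                  params={"access_token": TOKEN}, json={})
r.raise_for_status()
deposition = r.json()

# 2. Upload a file into the deposition's file bucket.
bucket = deposition["links"]["bucket"]
with open("session01.wav", "rb") as fh:  # hypothetical recording
    requests.put(f"{bucket}/session01.wav",
                 data=fh, params={"access_token": TOKEN}).raise_for_status()

# 3. Add minimal metadata; a DOI is minted when the deposition is published.
metadata = {"metadata": {
    "title": "Fieldwork recordings, session 1",
    "upload_type": "dataset",
    "description": "Audio and transcription from one elicitation session.",
    "creators": [{"name": "Fieldworker, A."}],
}}
requests.put(f"{ZENODO}/deposit/depositions/{deposition['id']}",
             params={"access_token": TOKEN}, json=metadata).raise_for_status()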
Personally, when I get an article these days on a morpheme or something in a language, based on fieldwork, I ask to see the data, because I think that's good practice: we should all be sharing our data and making the relationship between our data and our research findings clear. Now I'm going to talk about auto-completion keyboards. I've put this here because it's more a technology that benefits communities that use under-studied languages than one that benefits researchers per se, but I think it's exciting and useful and I want to draw people's attention to it. One thing I've sometimes had trouble explaining is what an auto-completion keyboard even is; I'll get to that in a second, but you all use them all the time. Using these keyboards does, let's say, mean participating in the spying on us by major corporations, but we all do it when we use our own phones. My own feeling is that the benefit of participating in that spying regime is that you get better commercial products for your language, and that is something it would be good for minority languages to have a share of. There are privacy issues around the collection of the data, as I just mentioned, but let's look at Microsoft SwiftKey, which is the auto-completion keyboard that I use and that I've helped develop for some languages. This is a screenshot from my phone: you see there's Tibetan and there's Welsh, which are languages I have used, and then it suggests some other languages based on your location. This is me just proving that it works for a language, Atom, where a friend of mine gave us some data for making an Atom model and then couldn't figure out how to use it, so I was showing him how; I don't actually know Atom. The point is just that you type something and it says, maybe the next word you want is this one. Everyone probably uses auto-completion keyboards on their phones, but when I've tried to get linguists to give us data, they've sometimes asked: what's an auto-completion keyboard, and why should my language community want one? Well, when you type "hello" and your phone asks whether you want to say "how are you?", and you say, yes, okay, I'll say "how are you?", that's an auto-completion keyboard. And a lot of linguistic research goes into making them. Let's look at a mix of languages, mostly Sino-Tibetan and Philippine, to give you a sense of what's in Microsoft SwiftKey. I myself have been involved in Tibetan, Hmong, Crusoe, Atom, Hani, and Galo. You see the number of users and the vocabulary size. The point I want to make here is about the usefulness of documentary linguistics to real people: there are 5,311 people in Mizoram who are using this software on their phones to write in Mizo, so a little bit of work by a linguist has really helped those 5,000 people. In some cases, and this is a point to keep in mind, the numbers look good, like almost 3,000 Tibetan users, but there are 6 million Tibetans, so in commercial terms that's not good market penetration. But if you look down at Atom, there are 157 Atom users, and only about 2,000 people speak Atom, so I think 157 is not so bad.
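To make concrete what an auto-completion keyboard is doing under the hood, here is a toy sketch of next-word prediction with a simple bigram model. Real keyboards like SwiftKey use far more sophisticated models, so treat this only as an illustration of why a text corpus in the language is the essential ingredient.

# Toy next-word prediction: count which words follow which in a corpus,
# then suggest the most frequent followers of the word just typed.
from collections import Counter, defaultdict

def train_bigrams(sentences):
    """Count word-to-next-word transitions over tokenised sentences."""
    following = defaultdict(Counter)
    for sent in sentences:
        for prev, nxt in zip(sent, sent[1:]):
            following[prev][nxt] += 1
    return following

def suggest(following, prev_word, n=3):
    """Return the n most likely next words after prev_word."""
    return [w for w, _ in following[prev_word].most_common(n)]

corpus = [
    "hello how are you".split(),
    "hello how is your family".split(),
    "how are you doing".split(),
]
model = train_bigrams(corpus)
print(suggest(model, "how"))  # e.g. ['are', 'is']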
Now I'm done talking about auto-completion keyboards, but part of the reason I brought them up is a pitch: if you look at Microsoft SwiftKey and there's a language you would like to see there that isn't currently there, and you have some data on it, then talk to me, and I'll talk to my friend Julian, with whom I've worked on some of these models, and we can add it. I think that's a good opportunity for helping communities that use less-supported languages. So now I will get into the actual workflow of documentary linguistics, switching over at some point into comparative linguistics as we reach that point in the workflow. I'm going to break it down into three phases. The first is automatic speech recognition. The second is natural language processing tools writ large; there's a notion that's been kicking around since, I think, the late 90s, the "basic language resource kit", which I find useful, so I will discuss that. The third is computer-assisted language comparison, that is, the contribution of technological development to historical linguistics per se. Just to give you an overview perspective: the second one, NLP, is where most linguistic research in commercial enterprises happens. Microsoft, Facebook, and Google have lots of people doing NLP, and for them, automatic speech recognition is just one NLP task. But for us, I think it's good to separate it out, because the way you approach speech recognition for a language with a lot of resources and the way you approach it for a language with few resources are very different. So the extensive commercial research in NLP doesn't advantage us that much when it comes to automatic speech recognition, but it does for other tasks. As for computer-assisted language comparison, commercial enterprises have of course not been very interested in reconstructing proto-languages; no surprise, there's not a lot of money in it, so that doesn't really come under the rubric of NLP either. Okay, so now I will look at automatic speech recognition. Just to remind ourselves, our goal, if you like, is interlinear glossing: a little piece of a sound file corresponds to its transcription into the IPA; that transcription is broken up into words or morphemes; under each word or morpheme there is an analysis of it, which is the glossing; and then you might have a sentence-by-sentence translation or some notes. The process of turning raw recordings into interlinear glossed text is extremely laborious, and to some extent it is necessary if linguistic fieldwork is to be used in further research, but it's a real bottleneck in terms of people's time. I think basically everyone out there has far more recordings than they've been able to gloss. And let me just say that this practice of interlinear glossed text is not at all common in, for example, Indo-European studies, but it has been in use since around 1900. Here, for instance, is an example from Franz Boas, who did a lot of work on the native languages of North America; you see that down at the bottom he gives the forms with the glosses underneath them.
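Just to make the target representation concrete before we look at the tools, here is a minimal sketch of how one line of interlinear glossed text could be represented as a data structure. The example sentence, glosses, and file name are invented for illustration; real tools such as ELAN, FLEx, or Toolbox use richer XML formats.

# One line of interlinear glossed text as a simple data structure.
from dataclasses import dataclass, field

@dataclass
class IGTLine:
    audio_file: str          # path to the sound snippet
    transcription: str       # IPA or orthographic transcription
    morphemes: list = field(default_factory=list)  # segmented morphemes
    glosses: list = field(default_factory=list)    # one gloss per morpheme
    translation: str = ""    # free translation of the whole sentence
    notes: str = ""          # fieldworker's comments

line = IGTLine(
    audio_file="session01_0042.wav",  # hypothetical file name
    transcription="ŋa ɕiŋ la ɖo",     # invented example sentence
    morphemes=["ŋa", "ɕiŋ", "la", "ɖo"],
    glosses=["1SG", "field", "ALL", "go"],
    translation="I go to the field.",
)
assert len(line.morphemes) == len(line.glosses)  # glossing must stay aligned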
So let's look at recent developments. To some extent my whole discussion here is an extended advertisement for the work of Oliver Adams and Alexis Michaud. I have had a little involvement, but a very tangential one; I'm more of a fan. But I think this is a bandwagon to get on, so I commend it to you. Oliver Adams did his PhD at the University of Melbourne and now works at a commercial NLP company. Probably everyone has heard about neural networks; I noticed them about five years ago, and they have really taken over artificial intelligence research agendas. That's what he is using. The input is audio and transcription aligned at the sentence level, and you need to train a model for the language and, in fact, for the speaker that you work with. This is where it differs from the big languages: for English, you just buy yourself an iPhone, turn it on, start talking to it, and it writes down what you say, but that's because a huge amount of resources has already been invested there. We are starting from nothing and trying to make progress in a way that benefits us as researchers as quickly as possible. There's no one-size-fits-all answer; each language is different, each speaker is different, but generally speaking, about one hour of existing, very accurately transcribed and glossed text is needed to train a model, and if you have more than that, so much the better. Here is the training input, for example: time-aligned sound files with a transcription into the IPA. One reason this worked well for Oliver and Alexis is that Alexis is interested in phonetics, so he makes very accurate transcriptions: if someone pauses, if someone says "um", he writes it in. Many linguists, even without noticing it, edit the text as they go, and you can't do that here, because the computer is listening to the sound file and needs to compare its output to what the recording actually says, not to some idealized version of what it should say. So even at this stage, one recommendation: if you're doing fieldwork on a language and you might want to take advantage of this set of tools, please make your transcriptions as precise as possible, writing down what is actually said by the speaker, without any cleaning up. This slide just shows the preparation of a new transcription, and here it is in practice, line by line. I know you're not going to read this, because it's a bunch of IPA for a language you don't know, but for each line the first row, the reference, is the correct transcription, and the "hyp" is the transcription hypothesized by the machine purely from listening to the sound file. The point, for our purposes, is that it looks pretty good: look at any one spot and the IPA symbols on the top line look a lot like the IPA symbols on the second line. So this can radically increase the speed at which linguists prepare transcriptions. But it's worth noticing that this slide may overstate, experientially, the accuracy of the system, because it learns common words well (the machine is like a person in that way), whereas it learns less common words less well.
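As an aside, the usual way to put a number on this kind of accuracy is a phoneme error rate: the edit distance between the reference and the hypothesis divided by the length of the reference. Here is a small illustration of that computation; it is my own sketch, not code from Persephone itself, and the example symbol sequences are invented.

# Phoneme error rate: edit distance between reference and hypothesis,
# normalised by the reference length.
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences of phoneme symbols."""
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)]

def phoneme_error_rate(ref, hyp):
    return edit_distance(ref, hyp) / len(ref)

ref = "n a ˥ h ĩ ˩".split()   # linguist's reference transcription
hyp = "n a ˥ h i ˩".split()   # machine's hypothesis
print(phoneme_error_rate(ref, hyp))  # 1 substitution out of 6 symbols ≈ 0.17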
So here is an example of a four-syllable name that comes up in the texts. What you see is a set of different sentences in which the same name occurs; on the left are the IDs of the sentences, and at the very bottom you get the correct transcription: Alexis, as a linguist, has analyzed the phonological form of the name as having this shape. You notice that basically every time the system encounters the name, it does something different, because a four-syllable name is not something it saw enough examples of in the training to feel confident about. Still, I think correcting these incorrect transcriptions will be a lot easier than typing in the correct transcription in the first place, so even when it's making mistakes, it's saving you time. This next slide shows the wave file and the alignment of the sound with the transcription. What's happened here is that the speaker has repeated the word, and in the second occurrence the system has not identified it correctly. The point the slide makes is that the reason it hasn't identified it correctly is this glottalization, which is to say that even when the computer makes mistakes, they're not random mistakes; they're mistakes related to the phonetics it finds in the wave file. So even its mistakes can teach us things and provide linguistic insight. Here I'm just listing a few publications about this endeavor. It's quite new, about two years old, and making quick progress, so hopefully in the near future this set of tools can be incorporated into the ordinary workflow of linguists. And now, I hope this works, I want to show you a clip of a YouTube video where you can see this tool, which is called Persephone, in action. So I'm going to allow it. Can you see the YouTube, or do you still see the slide? The slide. Okay, then I need to stop sharing. And Christoph, you have to be sure that you can share the sound. Now can you see the YouTube? Yes? Okay. Even if the sound doesn't work, you can always try it yourself in the future, but let's give it a try and see what happens. The clip begins: "So that's all we need to do to install Persephone-ELAN. In that way, it's just like any other plugin, any other recognizer that's been developed for ELAN so far. Once it's installed, we can open ELAN and actually apply an existing Persephone phone recognition model." I should have said: this is a roughly half-hour talk about an extension for ELAN that uses this Persephone tool, and I'm skipping the whole first part, where he explains what Persephone is and how it fits into ELAN, to get close to the point where you see it in action, because that's the dramatic climax I want. But I'm starting a little earlier so that Christopher Cox can set up its use himself. I'll continue. "Now, to do that, we do need to tell Persephone, pardon me, Persephone-ELAN, a little bit about our Persephone model. Specifically, we need to let it know the folder where a Persephone model or experiment is, and how that model was configured."
He continues: "Specifically, which feature types we used, what phonetic features, what labels we used to provide the text, and where the original training data for that model are. Persephone-ELAN feeds that information back into Persephone behind the scenes to reboot that model, essentially, and then apply it to the new, unseen snippets of audio that we're getting from the ELAN transcript. Now, I'm hoping over time that we can make changes to the Persephone source code to actually save these settings inside the models themselves, so that the only thing we need to provide Persephone-ELAN is the path to our pre-trained model. But for now, this is information that we have to enter manually into the Persephone-ELAN interface. For myself, once I've trained up a model in Persephone, I usually just keep a small text file that has all of this information in it. In the example we'll see in a second, I've entered this information in the appropriate fields already, but again, these are things that you should be able to recover from your model training process fairly easily. So here's a short video showing what this looks like. Here we have the transcript, and this is, again, the Tsuut'ina language with Elder Bruce Starlight. You can see we have a number of empty annotations, textless annotations, on a main tier. What we want to do is provide that tier to Persephone's phone recognizer. So in the Recognizers tab we select the Persephone phone recognizer, and then we provide the settings I was just describing. In this case we trained our model using fbank for the phonetic features. The text that was provided to this model, all those text snippets, had this file extension; again, this will look a little bit different for your particular model. We've built in support for Tsuut'ina's orthography here, but in your case you'd most likely choose none, and what you'll get then are the actual phoneme strings that come out of Persephone, with no conversions happening behind the scenes. We want to provide this BRS tier; that's the tier that contains all of the empty annotations that Persephone is going to try to recognize. Next, we want to provide a reference to the directory where the original training data is, in this case for the Tsuut'ina model that we're using here, and lastly, excuse me, to the model itself, to the source experiment directory. There's a final field here as well for output recognized text; this is essentially just a junk file that we can't get away without producing, ELAN makes us do it, so don't worry too much about it. Once we hit start, Persephone-ELAN will start picking out all the individual clips from that tier, reload the Persephone model that we provided, and then actually ask Persephone to start transcribing each of the clips on that tier, so each annotation can be fed to it. When it's ready, it loads the corresponding tiers and we can listen to the results. You can see here that it has recognized not only the segments but also the tones, which are marked with diacritics. So as we mentioned before, there are a number of settings that you need to have..." Okay, that's the part I wanted to show you. So I'll stop sharing and go back to sharing my presentation. I don't know whether you thought that was cool or not, but I think it's pretty cool that the computer just filled in the IPA, tone marks and all, correctly.
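If your own transcriptions live in ELAN, the sentence-level pairs of audio span and text that a tool like this needs can be pulled out of the .eaf files programmatically. Here is a rough sketch using the pympi-ling library; the file and tier names are hypothetical, and the exact downstream format your ASR tool expects will differ, so take it only as a starting point.

# Rough sketch: extract time-aligned annotations from an ELAN file
# with pympi-ling, as a first step toward building training data.
import pympi

eaf = pympi.Elan.Eaf("session01.eaf")  # hypothetical ELAN file
print(eaf.get_tier_names())            # inspect which tiers are available

# Each annotation comes back as (start_ms, end_ms, transcription_text).
for start, end, text in eaf.get_annotation_data_for_tier("transcription"):
    if not text.strip():
        continue                       # skip empty, untranscribed annotations
    print(f"{start}-{end} ms: {text}")
    # From here you would cut the corresponding audio span out of the
    # session recording and write the text to a matching label file.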
Although it took him some training data, I think in this case about three hours of it, and some struggling with a computer to train the model, once that's done it keeps working, and he can do 10 hours, 20 hours, 50 hours, and make progress much faster. Also, as you saw in Linda's presentation yesterday, ELAN is a very standard piece of software in language documentation, so the point of this video and of the tool that Chris Cox has made is that you can incorporate this quite clunky, computer-experty thing, Persephone, into an environment that, whether or not it's more user-friendly, is already being used by a lot of documentary linguists. That's the end of the section of my presentation on automatic speech recognition. I would now move on to the next part, about NLP, but we're close to, I don't know, a few minutes away from the half-hour mark, so maybe it's better to pause here than to press ahead; I'll defer to what Christoph tells me. Well, let's let the majority decide: is another half hour too much, or would you rather have ten minutes and then close a bit earlier? What is your feeling? Why don't we do one more section: I'll do the section on NLP, then we'll maybe be a little past the half-hour mark, but not so late, and then we'll see where we are. Okay. So, in terms of this workflow, we've already gotten from having sound files to having some textual representation of our language, which might be IPA or might be some orthography. Now I turn to NLP, natural language processing, the kind of thing that's in your iPhone if you have one. I like to talk, somewhat facetiously, about what I call Maslow's hierarchy of NLP needs, referring to Maslow's hierarchy of needs, where first you worry about eating and then about whether you have a home; maybe some of you have heard of it. So first you need to worry about whether your script is in Unicode. If you're using IPA, no problem; if you're using Roman script, no problem; but Tangut only very recently got into Unicode, and Khitan is not yet in Unicode, so it is an issue for some languages. Then, once your script can be encoded on computers, you need to have some texts that are actually encoded: you can't do anything fancy if all of your texts are still on paper or in museums; you have to have some e-texts. The next question is whether you can divide words from each other. In English and most European languages we put spaces between the words, so texts usually come pre-divided, but in Vietnamese or Thai or Tibetan or Chinese, words are not divided, so it becomes an NLP task to split them apart; there's a small sketch of the simplest approach just below. Once you have words, you can analyze them, and in particular you can identify the part-of-speech category of each word; if you think of disambiguating homophones, for example, part-of-speech analysis can be part of that. From here we are just adding richer and richer layers of analysis, but the idea is always to do it automatically.
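Here is the small sketch of word division I just mentioned: greedy longest-match segmentation against a word list. This is only a baseline (real segmenters are statistical and handle unknown words), the mini-lexicon is invented, and the example uses Roman letters purely for readability, but it shows the shape of the task for a script without spaces.

# Toy word segmentation by greedy longest match against a lexicon.
def segment(text, lexicon, max_len=6):
    words, i = [], 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in lexicon:
                words.append(candidate)  # length 1 is the unknown-word fallback
                i += length
                break
    return words

lexicon = {"the", "king", "built", "a", "statue"}
print(segment("thekingbuiltastatue", lexicon))
# ['the', 'king', 'built', 'a', 'statue']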
After part-of-speech tagging come lemmatizers: for a word like sang, the part-of-speech level will tell you that sang is a past-tense verb, and the lemmatizer level will tell you that it is somehow the same word as sing, even though sing is present tense. Once you have that level, you can start doing syntax, whether noun-phrase chunking or dependency relationships between verbs and nouns. After that comes fancy stuff that we don't need to worry about. So I just want to take stock of some things that are available for English and that have become available for Tibetan. There's a company called Lexical Computing that makes some very useful software which, until 2022, is free for everyone in EU higher education establishments to use, so you might want to play around with it. They have corpora of a bunch of different languages, mostly European ones, with more or less detailed annotation. I'm going to stick with English, where we can count on all the bells and whistles. This is a word sketch, produced totally automatically, looking at the British National Corpus, and the verb is "chair". I use this verb because it's good for demonstrating the need for part-of-speech tagging: there's also a noun "chair" that's rather more famous than the verb. If we're interested in the verb "chair", the corpus has already been part-of-speech tagged, so we can find the verb specifically. Then we look at this sketch, which goes through the whole corpus and automatically tells us the typical objects, typical subjects, and other typical grammatical relations of the verb "chair". And I think it's pretty impressive: you basically get people's names and job titles as the things that chair, and the things that get chaired are subcommittees, meetings, committees, seminars. So there's a lot of semantics of a lexicographical kind captured by this sketch, and it has all been done totally automatically; the grammatical relations come from the corpus, the statistics come from the corpus. Even a little more impressive, I think, is the thesaurus function, which takes those statistical relationships of grammatical use (what else do people do to subcommittees, is one way to think about it) and ranks the statistical similarity between "chair" as a verb and other verbs: you get convene, attend, host, conclude, adjourn, organize. I find this really impressive, because it genuinely approximates semantic similarity, which is a notoriously hard thing to define and model, and it does so through a series of grammatical and syntactic analyses scaled up over a large corpus. These screenshots I probably took back in 2013, so this stuff is totally old hat for a language like English; it was done ages ago. But for a language like Tibetan, these things didn't exist: we had a project at SOAS from about 2012 to 2015, where we worked mostly on part-of-speech tagging and word segmentation, and we did a pretty good job of that. Here is the Tibetan word gyalpo, which means "king", and from a word sketch, these are the verbs that "king" tends to be the subject of, or let's say the agent, since Tibetan is an ergative language. It's a little small, and probably most of you can't read Tibetan anyhow, but they turn out to be largely speech verbs: request, say, another verb for request, another verb for request or say; lots of speech words.
But there are also some others: think, give, and then a verb used in a light-verb construction meaning "invite". These are not shocking things to see kings involved in, but I think it's cool, and it would be very useful if you were, for example, compiling a Tibetan dictionary. Now, when we turn to the thesaurus function, it's a little less impressive, and I think the reason is corpus size. This corpus is 80 million words, which may sound like a lot but is nothing compared to what they have for English. We have a more recent Tibetan corpus of about 150 million words, so I'm hopeful that once that's loaded in we'll get more impressive results. But this is what the system currently says are the nouns that behave most like "king", and I'll just tell you what they are. The top one means "god"; the second means "lama" or "guru"; so not bad, you have socially high-ranking animate entities. But then comes "son" or "boy", and then "person", which, well, a king is a type of person, but it's not stunningly impressive. The last one on the screen means "victor", and that is quite good: a victor and a king are semantically similar. For the purposes of this presentation, the point is that we have made substantial progress down the road of Tibetan NLP, and that work continues. But it needs to be done for every language on earth, and if a language has fewer resources than Tibetan, you have to start from wherever that language is, which might be automatic speech recognition. After this project ended in 2015, I was busy with my ERC grant and wasn't involved in the follow-up project, but there was one, on verb syntax, classifying Tibetan verbs according to their governance relationships. This is a screenshot from the corpus that that project has delivered, with what I personally think is a very beautiful interface. Let me talk you through it a little. The D is a demonstrative, the C is a case marker, the A is an adjective; then we have another case marker, then a noun, then a determiner, and then the verb, and there's an arrow that goes from the verb to the head of each noun phrase that it governs. In this case we have a sentence that says "there they erected, each, a statue of the great black one": der, the first syllable, is "at that place", then nagpo "black", chenpo "great", a genitive, then ku, which literally means "body" but here means "statue", re "each", and then zheng, "to set up". I don't want to leave you with the impression that any of this is particularly easy; it takes a lot of work, a lot of blood, sweat, and tears, and a lot of analytical work. But once it's up and running, it's scalable; that's the point. We now have a machine that will analyze Tibetan syntax, and it doesn't care how much Tibetan you throw at it, so we can build big data sets that we can use in our research.
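To close the circle on the thesaurus function I showed for "chair" and for "king": the underlying idea can be sketched very simply by representing each verb as the counts of the nouns it occurs with in a given grammatical relation, and ranking other verbs by the similarity of those profiles. The numbers below are invented and the real system's measure is more sophisticated, but the shape of the computation is this.

# Toy distributional thesaurus: verbs are similar if they take similar objects.
from math import sqrt

# verb -> {object noun: co-occurrence count} (invented figures)
profiles = {
    "chair":   {"meeting": 40, "committee": 25, "seminar": 10},
    "convene": {"meeting": 30, "committee": 20, "panel": 5},
    "adjourn": {"meeting": 25, "session": 15},
    "eat":     {"lunch": 30, "sandwich": 20},
}

def cosine(p, q):
    """Cosine similarity between two count profiles."""
    dot = sum(p[k] * q.get(k, 0) for k in p)
    norm = lambda v: sqrt(sum(x * x for x in v.values()))
    return dot / (norm(p) * norm(q))

target = "chair"
neighbours = sorted((v for v in profiles if v != target),
                    key=lambda v: cosine(profiles[target], profiles[v]),
                    reverse=True)
print(neighbours)  # ['convene', 'adjourn', 'eat']: 'eat' ranks last, as it should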