National education is an example: most of us think it's good for people to be able to read, but it's also a force for cultural oppression and homogenization. For example, for about 500 years in the country I currently live in, if a schoolchild spoke Welsh in school you would hang a sign around their neck as a badge of shame; that was called the Welsh Not. So now I will look very briefly at the relationship of language to nation building, comparing 19th-century Europe with modern Asia, as context for fieldwork and language documentation. In 19th-century Europe, linguistic description was a cornerstone of nation building in at least three ways. First, philology of ancient varieties of national vernaculars: think of the Germanic sagas, for example; one way you expressed nationalism was to seek out the medieval poetry in your vernacular. Second, there was work in dialectology: dialect grammars and dictionaries. Third, systematic dialect geography, with dialect maps: there was this notion, particularly in Germany and later in France, that to really appreciate who the German people are, you need to study German as it is spoken in each German-speaking village and understand its dialectology and folk traditions. For whatever reason, these elements of nationalism do not seem to be repeating themselves in Asia; that's my observation. Philology of ancient varieties: everyone pays lip service to it, but only in China do you see major investment. In India, as much as the BJP loves Sanskrit, you don't see people poring over Sanskrit texts and producing editions of them as a manifestation of Hindu nationalism.
As for dialect grammars and dictionaries, there's a little work here and there, but the modern states of Asia really don't seem to feel the need to express their national identity through linguistic fieldwork. That's worrying, I think, because it means the pressures that decrease linguistic diversity and come with nation states, like national education and increased travel, are there, but the countervailing pressures of documenting and celebrating linguistic diversity are not, as they were in 19th-century Europe. Now let's look briefly at how things are going here in the UK. Welsh is doing well; Scots Gaelic could be doing better; Cornish died in 1777 and Manx in 1974, but there are revitalization efforts for both. Something that can focus our minds is that the Cornish and the Manx have been able to attempt revitalization because those languages were well studied and well documented, whereas when, say, Minyak disappears, it's not clear that a community that later wanted to revive it could, because not enough would have been done. So our job, as historical linguists who need to look at a variety of languages from a family, and as descriptive linguists looking at the range of languages in the world, is a very hard one, because of the quick pace of cultural homogenization backed by national governments and corporations. And my perception is that documentary linguistics has even fewer resources than it did in the 19th century: back then you could go to a national government or an academy and say, give me money to study dialect diversity, and they would; that's not so much the case anymore.
So we need technology to help us; let's look at some new technologies. Just to continue this theme I'm elaborating, technology brings you improved standards of living, increased productivity if you like, but also new forms of oppression. I think this is very true of the very technologies I'll be talking about: consider the role of digital communication in controversies such as the NSA surveillance revealed by Edward Snowden, or Cambridge Analytica's role in the 2016 Trump election. But if we're sophisticated users of technology, we can also resist these forms of oppression. For instance, there's the Tor browser, which makes it very hard for governments to monitor your browsing; I use it a lot myself, though it's not very good for making restaurant bookings, I'll have you know. You can also have encrypted email accounts; ProtonMail is the one I recommend. Now, moving from this general picture of how technology has shaped the modern world, in terms of increased productivity, increased oppression, and the ability to resist that oppression, let's look at how those same tendencies manifest in academia. I'm going from big to small: society at large, then academia, then linguistics. I have been surprised in the past that some of these things were not known to PhD students, for example, so I think it's worth talking them through. I'm going to look at YouTube, Library Genesis and Sci-Hub in particular. In the context of my ERC grant that just ended in August, we put a lot of the talks on YouTube, and I think that's hugely advantageous for scholarly dissemination. One silver lining of the coronavirus situation we're seeing now is that academics, who tend to be very conservative, are figuring out how to use these technologies: you can do things like have people from all
around the world attend a doctoral school. I also think we should put as much of this as makes sense into public access. Just imagine if we were able to watch talks by Brugmann or de Saussure or the other greats of linguistic history; if we find that idea attractive, we owe it to the future to document some of our presentations, now that it's technologically possible. Now I will talk about Library Genesis. This is a way to find books in particular, and it works a lot like a library catalogue. You have to check the URLs, because depending on where you live or what's going on in the world the URL can change. I put "Nathan Hill Tibetan" into the search window and, as in a library catalogue, various things come up; one of them is my 2019 book. You click on it, you get the PDF file, and there's the book: you've just saved yourself 85 pounds and have a relatively new book. Especially in the coronavirus period, this is a way of accessing scholarly resources that might not be as accessible as they were in the past. Then there's Sci-Hub, which is better for articles. Similarly, you may have to check what the URL is; you can basically just google something like "where is Sci-Hub now". In this case we first go to the home page of the article in question; I found one called "Use of reflectance transformation imaging in recording and analyses of Burmese Pyu inscriptions", and we look for the DOI, which you see at the bottom. A DOI is a unique identifier, like an ISBN for a book, and you can copy it. Of course, if you have access to the journal you can just click through, but I'm presuming maybe you don't: they want to charge you 30 pounds or so for looking at this article. So then you go to Sci-Hub, put in the DOI, and it brings the
article right up. So that is how to get books and articles, respectively, using new technological resources in the 21st century, although whether it's legal is something you should consider for yourself in your jurisdiction; in Burma it would be perfectly legal. So that's it for generic technological resources of scholarship in general; now I will look at something specific to language documentation. There are now a bunch of language archives. What is a language archive? When you document a language you produce sound files and video files; an archive is a place that can host those files in perpetuity, in some kind of curated fashion. I'm going to look at them one at a time. The first is PARADISEC, which is based in Australia and is particularly strong in Australian languages. I'm not super familiar with it, but colleagues whose opinion I respect a lot like it a great deal; it doesn't charge money for deposits and it's quite accessible. Next I'll mention ELAR, the Endangered Languages Archive, which is based at SOAS. Personally I find it quite difficult to use: it's hard to figure out whether they have a resource on the language you're interested in. I should explain that it's tied to a grant programme, the Endangered Languages Documentation Programme: if you're given a grant by that programme, you're expected to deposit your data in the archive. Now, a lot of people, I would say as many as half of those who get these grants, do not end up depositing their data, which is bad of them, but the system still records a deposit waiting for them. Just for example, within Tibeto-Burman linguistics, Mark Post got a grant to work, I think, on Galo; his deposit is listed, but there are no
files, and then it makes you feel crazy: you see there's a deposit, you can't find any files, you think you've done something wrong, whereas actually the depositor hasn't put anything in. You also have to have an account to use it, so it's not my favourite, although, in deference to my SOAS colleagues, I should say that all the decisions that were made were made for reasons that do make sense in some circumstances. Then there is DoBeS, which was associated with a now-closed grant programme of the Volkswagen Foundation, though they will still host new data. Without going into too many details, my complaints are similar: I find it hard to find things, a lot of things aren't accessible, and there are different levels of security. My personal favourite endangered-language archive, if you like, is the Pangloss archive, based at the CNRS in Paris, specifically at the LACITO lab. Why do I like it? It has an extremely flat structure: there's no "look here, look there, log in, get permissions"; it's just a bunch of home pages, very flat, and what you see is what you get. In general you get a sound file that's glossed and translated. I'll stick with this one for a little while. It's a small enterprise and very French, which makes sense, as it's based at a French lab and paid for by the French taxpayer, but that does mean the languages covered tend to be ones studied by people who work at LACITO. That's good for me, because that lab has done a lot of work on Sino-Tibetan languages. It's worth noting that they will take deposits from the public, as long as you format your data in a way that's easy for them to ingest, and I would recommend that. If you're going to go to the
trouble of really formatting your data for an archive anyhow, you might as well go with Pangloss. At the technical level, every deposit gets a DOI, which is not true at ELAR, for example, and that DOI is resolvable down to the sentence level. So if you want to cite someone's data in a paper of yours, you can just give the DOI, which will take the reader immediately to the exact sentence you're citing; I think that is the cutting edge of scholarly practice. Just to pause on that thought: I like the Pangloss archive, I'd encourage you to take a look at it, and if you're in the language documentation business or doing some fieldwork yourself, consider giving your data to Pangloss. By way of transition, another thing I want to tell you about is Zenodo. It's a generic research data repository supported by the EU, and specifically by CERN, where they do the atom smashing. Most of the data in it, as you see here, is from the hard sciences, but I think it's worth asking ourselves why we need special language archives at all. Language data is just research data, so why can't we use a research data repository like Zenodo? What's good about Zenodo: it's totally free, extremely easy to use, there's no upper limit on the number of deposits, and a single deposit can be up to 50 gigabytes. It's just the most user-friendly thing out there, and each deposit generates a citation and a DOI, so it also gives a lot of advantages in terms of citation practice. What are the disadvantages? Really the only one is that it doesn't provide a good interface, a good user experience, for people to look at your data online, which has been an emphasis in the design of other language archives, where you can play a video and you can
look at the transcription while you play the video, whereas in Zenodo it's just: here are my files, you can download them and read them on your own computer if you want. But I think that's not so bad. If we make it easier for researchers to share their data, more researchers will share their data, and the more people share their data, the more we'll have a rigorous scientific enterprise in which our findings are verifiable; maybe too much effort has been spent on making lovely interfaces for language archives. So that's my pitch for Zenodo. Personally, when I get an article these days on, say, a morpheme in a language, based on fieldwork, I ask to see the data, because I think that's good practice: we should all be sharing our data and making the relationship between our data and our research findings clear. Now I'm going to talk about autocompletion keyboards, which I've moved here because they are more a technology that benefits communities using under-studied languages than one that benefits researchers per se, but I think it's exciting and useful and I want to draw people's attention to it. One thing I've sometimes had trouble explaining is what an autocompletion keyboard is; I'll get to that in a second, but you all use them all the time. Using these keyboards means participating in the spying on us by major corporations, but we all do it when we use our own phones, and my own feeling is that the benefit of participating in that regime is that you get better commercial products for your language, which would be a good thing for minority languages. So yes, there are privacy issues around the collection of data, but let's look at Microsoft SwiftKey.
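Under the hood, the simplest version of such a keyboard is a next-word frequency model. Real products use far more sophisticated models, but a toy bigram sketch conveys the idea; the training corpus here is invented for illustration:

```python
from collections import Counter, defaultdict

def train_bigrams(corpus):
    """Count, for each word, which words follow it in the training text."""
    following = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.split()
        for w1, w2 in zip(words, words[1:]):
            following[w1][w2] += 1
    return following

def suggest(following, word, k=3):
    """Suggest up to k of the most frequent continuations of `word`."""
    return [w for w, _ in following[word].most_common(k)]

corpus = [
    "hello how are you",
    "hello how is it going",
    "hello there",
]
model = train_bigrams(corpus)
print(suggest(model, "hello"))   # "how" is the most common continuation
```

Commercial keyboards like SwiftKey layer character-level prediction, smoothing and personalization on top of this basic idea, which is why they need language data in the first place.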
It's the autocompletion keyboard that I use and that I've helped develop. This is a screenshot from my phone: you see there's Tibetan and there's Welsh, languages I have used, and then it suggests some other languages based on your location. Here I'm proving that it works for the language Atom: a friend of mine gave us some data for making an Atom model and couldn't figure out how to use it, so I was showing him how, though I don't actually know Atom. The point is just that you type something and it says: maybe the next word you want is this one. I think everyone probably uses autocompletion keyboards on their phones, but when I've tried to get linguists to give us data, they've sometimes said: what's an autocompletion keyboard, and why should my language community want one? Well, when you type "hello" and your phone asks whether you want to say "how are you", and you say yes, okay, I'll say "how are you", that's an autocompletion keyboard, and a lot of linguistics research goes into making them. Let's look at a mix of mostly Sino-Tibetan and Philippine languages, to give you a sense of what's in Microsoft SwiftKey; I've been involved myself with Tibetan, Hmong, Crusoe, Atom, Hani and Galo. You see the number of users and the vocabulary size, and the point I want to make here is the usefulness of documentary linguistics to real people: there are 5,311 people in Mizoram using this software on their phones to write in Mizo, so a little bit of work by a linguist has really helped those five thousand people. In some cases, and this is a point to keep in mind, the numbers look good, like two thousand
Tibetans, almost three thousand Tibetans, but there are six million Tibetans, so in commercial terms the market penetration is not good; on the other hand, if you look down at Atom, there are 157 Atom users out of only about two thousand speakers, so I think 157 is not so bad. Now I'm done talking about autocompletion keyboards, but part of the reason I bring them up is a pitch: if you look at Microsoft SwiftKey and there's a language you would like to be there that currently isn't, and you have some data on it, talk to me and I'll talk to my friend Julian, with whom I've worked on some of these models, and we can add it. I think that's a good opportunity for helping communities that use less-supported languages. Now I will get into the actual workflow of documentary linguistics, switching over into comparative linguistics as we reach that point in the workflow. I'm going to break it down into three phases. The first is automatic speech recognition. The second is natural language processing tools writ large; there's also a notion that's been kicking around since, I think, the late 90s, the Basic Language Resource Kit, which I find useful and will discuss. The third is computer-assisted language comparison, that is, the contribution of technological development to historical linguistics per se. To give you an overview perspective: the second one, NLP, is where most linguistic research in commercial enterprises happens; Microsoft, Facebook and Google have lots of people doing NLP, and for them automatic speech recognition has been one NLP task. But for us I think it's good to separate it out, because the way you approach it if you're working on a language with a lot of resources and the way you approach it if you're working on
a language with few resources are very different. The extensive commercial research in NLP doesn't really advantage us much when it comes to automatic speech recognition, though it does for other tasks. As for computer-assisted language comparison, commercial enterprises have of course never been very interested in reconstructing proto-languages; no surprise, there's not a lot of money in it, so that doesn't really come under the rubric of NLP either. Okay, now I will look at automatic speech recognition. Let's remind ourselves that our goal, if you like, is interlinear glossing: you have a bit of sound file corresponding to its transcription into the IPA, broken up into words or morphemes; under each word or morpheme an analysis of it, which is the glossing; and then perhaps a sentence-by-sentence translation or some notes. This process of turning raw recordings into interlinear glossed text is extremely laborious, and to some extent necessary for using linguistic fieldwork in further research, but it's a real bottleneck in terms of people's time: basically everyone out there has far more recordings than they've been able to gloss. This practice of interlinear glossed text is not at all common in Indo-European studies, for example, but it has been in use since the 1900s; here for instance is an example from Franz Boas, who did a lot of work on the native languages of North America, and you see at the bottom he gives the forms with the glosses underneath. So let's look at recent developments. To some extent my whole discussion here is an extended advertisement for the work of Oliver Adams and Alexis Michaud; I have had a little involvement, but a very tangential one, I'm more of a fan, but I think this is a bandwagon to get on, so I
commend it to you. Oliver Adams did his PhD at the University of Melbourne and now works at a commercial NLP company. Probably everyone has heard about neural networks; I noticed them about five years ago, but they've really taken over artificial intelligence research agendas, and that's what he is using. The input is audio and transcription aligned at the sentence level, and you need to train a model for the language, and in fact for the specific speaker you work with. This is where it differs from the big languages: for English you just buy yourself an iPhone, turn it on, start talking to it, and it writes down what you say, but that's because a huge amount of resources has already been invested; we're starting from nothing and trying to make progress in a way that benefits us as researchers as quickly as possible. There's no one-size-fits-all answer, each language is different, each speaker is different, but generally speaking about one hour of existing, very accurate glossed text is necessary to train a model, and more is better. Here is the training input, for example: time-aligned sound files with a transcription into the IPA. One reason this worked well with Oliver and Alexis is that Alexis is interested in phonetics, so he makes very accurate transcriptions: if someone pauses or says "um", he writes it in. Many linguists edit the text as they go, even without noticing it, and you can't do that here, because the computer is listening to the sound file, so it needs to compare the transcription to what the sound file actually says, not to some idealized version of what it should say. So a recommendation even at this stage: if you're doing fieldwork on a language where you might want to take advantage of these tools, be careful when you do transcriptions.
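Since roughly an hour of accurate, time-aligned transcription is the entry ticket, a practical first step is checking how much aligned speech you already have. A minimal sketch, assuming your alignments have been exported as (start, end, text) tuples with times in milliseconds; this format is illustrative, not the actual input format of any particular tool:

```python
def transcribed_hours(segments):
    """Sum the durations of non-empty aligned segments, in hours."""
    total_ms = sum(end - start for start, end, text in segments if text.strip())
    return total_ms / (1000 * 60 * 60)

# Three toy segments: two transcribed, one still empty.
segments = [
    (0, 2500, "tʰi wa le"),
    (2500, 4000, ""),          # untranscribed gap: doesn't count
    (4000, 9400, "mv ʐe pi"),
]
print(round(transcribed_hours(segments), 6))
```

A check like this tells you quickly whether a corpus is anywhere near the amount of training material the approach needs.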
Please do them as precisely as possible, writing down what's actually said by the speaker, without any cleaning up. Now, this slide shows the preparation of a new transcription, and here it is in practice, line by line. I know you're not going to read this, since it's a bunch of IPA for a language you don't know, but for each line the first row is the reference, the correct transcription, and the "hyp" row is the transcription hypothesized by the machine purely from listening to the sound file. The point for our purposes is that it looks pretty good: look at any one spot and you'll say, yes, the IPA symbols on the top line look a lot like the IPA symbols on the second line. So this can radically increase the speed at which linguists prepare transcriptions. But it's worth noticing that this slide may overstate, experientially, the accuracy of the system: common words it learns well (the machine is like a person in that way), whereas less common words it learns less well. Here is an example of a four-syllable name that comes up in the texts, in a bunch of different sentences; on the left are the IDs of the sentences, and at the very bottom is the correct transcription, the phonological form of the name as Alexis, as a linguist, has analyzed it. You'll notice that basically every time the system sees the name it does something different, because a four-syllable name is not something it saw enough examples of in training to feel confident about. But still, correcting these incorrect transcriptions will be a lot easier than typing in the
correct transcription in the first place. So even when it's making mistakes, it's saving you time. This is just an example of the wave file and the alignment of the sound with the transcription. What's happened here is that the speaker has repeated the word, and in the second occurrence the system has not identified it; the point the slide is making is that the reason it hasn't identified it correctly is this glottalization. That is to say, even when the computer makes mistakes, they're not random: they're related to the phonetics it finds in the wave file, so even its mistakes can teach us things and provide linguistic insight. Here I'm listing a few publications about this endeavour; it's quite new, about two years old, and making quick progress, so hopefully in the near future these tools can be incorporated into the ordinary workflow of linguists. And now, I hope this works, I want to show you a clip of a YouTube video where you can see this tool, which is called Persephone, in action. Can you see the YouTube video, or still the slide? The slide? Okay, then I need to stop sharing. And Christoph, make sure the sound is shared. Now can you see the YouTube? Yes? Okay; even if the sound doesn't work, you can always try it yourself in the future, but let's give it a try and see what happens. [Video:] ...so that's all we need to do to install Persephone; in that way it's just like any other plugin, any other recognizer that's been developed for ELAN so far. Once it's installed we can open ELAN and actually apply an existing Persephone phone recognition model... [Me again:] I should have said: this is a kind of half-hour talk about an
extension for ELAN that uses this Persephone tool, and I'm skipping the whole first part, where he explains what Persephone is and how it fits into ELAN, to get close to where you see it in action, because that's the dramatic climax I want; but I'm starting a little earlier so that Christopher Cox can set up its use himself. I'll continue. [Video:] ...to a tier of our choosing in one of our transcripts. Now to do that we do need to tell Persephone, pardon me, Persephone-ELAN, a little about our Persephone model. Specifically, we need to let it know the folder where a Persephone model or experiment is; how that model was configured, specifically which feature types we use for phonetic features and what labels we use to provide the text; and where the original training data for that model are. Persephone-ELAN feeds that information back into Persephone behind the scenes, to reboot that model essentially, and then applies it to these new, unseen snippets of audio that we're getting from our ELAN transcript. I'm hoping over time that we can make changes to the Persephone source code to actually save these settings inside the models themselves, so that the only thing we need to provide Persephone-ELAN is the path to our pre-trained model; but for now this is information we have to enter manually into the Persephone-ELAN interface. For myself, once I've trained up a model in Persephone, I usually just keep a small text file with all of this information. In the example we'll see in a second I've entered this information into the appropriate fields already, but again, these are things you should be able to recover from your model training process fairly easily. So here's a short video showing what this looks like. Here we have the transcript, recorded in the language with Elder Bruce Starlight, and you can see we have a number of empty
annotations, or textless annotations, on a main tier. What we want to do is provide that tier to Persephone's phone recognizer, so in the recognizers tab we select the Persephone phone recognizer and then provide the settings I was just describing. In this case we trained our model using fbank for phonetic features; the text provided to this model, all those text snippets, had this file extension; again, this will look a little different for your particular model. We built in support for Tsuut'ina's orthography here, but in your case you'd most likely choose "none", and what you'll get then are the actual phoneme strings that come out of Persephone, with no conversions happening behind the scenes. We want to provide this BRS tier, excuse me, that's the tier that contains all of the empty annotations that Persephone is going to try to recognize. Lastly we want to provide a reference to the directory where the original training data is, in this case for the Tsuut'ina model we're using here, and to the model itself, the source experiment directory. There's a final field here as well for output recognized text; this is essentially just a junk file we can't get away without producing, ELAN makes us do this, don't worry too much about it. Once we hit start, Persephone-ELAN will start picking out all the individual clips from that tier, reload the Persephone model we provided, and then actually ask Persephone to start transcribing each of the clips on that tier. Each annotation is fed to it, and when it's ready it loads the corresponding tiers and we can listen to the results: you can see it's recognized not only the segments but also the tones, which are marked with diacritics. So as we mentioned before, there are a number of settings that you need to have... [Me again:] Okay, that's the part I wanted to show you, so I'll stop sharing and go back to my presentation. I don't know whether you
thought that was cool or not, but I think it's pretty cool that the computer just filled in the IPA, with the tone marks correct. Although it took him some training data, about three hours in this case, and some struggling with the computer to train a model, once that's done it works forever, and he can do 10 hours, 20 hours, 50 hours, and make progress much faster. Also, as you saw in Linda's presentation yesterday, ELAN is a very standard piece of software in language documentation, so the point of this video and of the tool Chris Cox has made is that it shows you can incorporate this rather clunky, computer-experty thing, Persephone, into an environment that is already being used by a lot of documentary linguists. That's the end of the section of my presentation on automatic speech recognition, and I would now move on to the next part, about NLP, but we're about eight minutes from the half-hour mark, so maybe it's better to pause here than to press ahead; I'll defer to what Christoph tells me. Well, let's let the majority decide: is half an hour too much, or would you rather have ten minutes and close a bit earlier? Let's say we do one more section: I'll do the section on NLP, we'll be a little past the half-hour mark but not so late, and then we'll see where we are. Okay, great. So, in terms of this workflow: we've already got from having sound files to having some textual representation of our language, and that might be IPA or some practical orthography.
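Before leaving speech recognition, it's worth making "it looks pretty good" precise. The standard measure for these systems is an error rate: the edit distance between the reference transcription and the machine's hypothesis, divided by the reference length. A minimal sketch; the transcriptions are invented, and a real evaluation would segment into phonemes rather than treating each character as a unit:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution
        prev = cur
    return prev[-1]

def error_rate(ref, hyp):
    """Edit distance normalized by reference length (CER/PER)."""
    return edit_distance(ref, hyp) / len(ref)

ref = list("tʰi le mv ʐe")
hyp = list("tʰi le mu ʐe")   # one substituted symbol
print(round(error_rate(ref, hyp), 3))   # → 0.083
```

An error rate like this, computed over held-out sentences, is how you decide whether correcting the machine's output really will be faster than transcribing from scratch.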
now I will turn to NLP, natural language processing, the kind of thing that's in your iPhone if you have one. I like to talk, somewhat facetiously, about what I call Maslow's hierarchy of NLP needs, referring to Maslow's hierarchy of needs, where first you worry about eating and then about whether you have a home; maybe some of you have heard of it. First you need to worry about whether your script is in Unicode. If you're using IPA, no problem; if you're using roman script, no problem; but Tangut only very recently got into Unicode, and Khitan is not yet in Unicode, so it is an issue for some languages. Then, after your script can be encoded on computers, you need to have some texts that are actually encoded: you can't do anything fancy if all of your texts are still on paper or in museums; you have to have e-texts. The next question is whether you can divide words from each other. In English and most European languages we put spaces between words, so texts come more or less pre-divided, but in Vietnamese or Thai or Tibetan or Chinese, words are not divided, so it becomes an NLP task to split the words apart. Once you have words, you can analyze them, and in particular you can identify the part-of-speech category of each word; if you think of disambiguating homophones, for example, part-of-speech analysis can be part of that. Now we're adding richer and richer layers of analysis, but the idea is to do it automatically. After part-of-speech tagging come lemmatizers: if we have a word like "sang", the part-of-speech level will say that "sang" is a past-tense verb, and the lemmatizer level will say that it is somehow the same word as "sing", even though "sing" would be a present tense. Once you have that level, you can start doing syntax, whether noun-phrase chunking or dependency relationships between verbs and nouns, and after that comes fancy stuff that we don't need to worry about. I just want to take stock of some things that are available for English and that have become available for Tibetan. There's a company called Lexical Computing that makes some very useful software, Sketch Engine, which until 2022 is free for everyone at EU higher-education establishments, so you might want to play around with it. They have corpora of a bunch of different languages, mostly European ones, with more or less detailed annotation. I'm going to stick with English, where we can count on all the bells and whistles. This is a word sketch, done totally automatically, looking at the British National Corpus for the verb "chair". I use this verb because it's good for demonstrating the need for part-of-speech tagging: there's also a noun "chair" that's a little more famous than the verb. Once we've part-of-speech tagged, we can find the verb "chair", and then we look at this sketch, which scans the whole corpus and automatically tells us the typical objects, the typical subjects, and the other typical grammatical interactions of this verb. I think it's pretty impressive already: the subjects are basically people's names and job titles, and the things that get chaired are subcommittees, meetings, committees, seminars. So there's a lot of semantics of a lexicographical kind captured by this sketch, and it's all been done totally automatically: the grammatical relations come from the corpus, and the statistics come from the corpus. Even a little more impressive, I think, is the
thesaurus function, where, looking at those statistical relationships of grammatical use (what other things do people do to subcommittees, is one way to think about it), it ranks the statistical similarity between "chair" as a verb and other verbs, and you get convene, attend, host, conclude, adjourn, organize. I think this is really impressive, because it really is approximating semantic similarity, which is a notoriously hard thing to define and model, and it has done so by this series of grammatical and syntactic analyses scaled over a large corpus. These screenshots I think I took back in 2013, so this stuff is totally old hat for a language like English; it was done ages ago. But for a language like Tibetan, these things didn't exist. We had a project at SOAS from about 2012 to 2015, where we worked mostly on part-of-speech tagging and word breaking, and we did a pretty good job of that. Here is the Tibetan word rgyal po, which means "king", and from a word sketch these are the verbs that "king" tends to be the subject of, or let's say the agent, because Tibetan is an ergative language. It's a little small, and probably most of you can't read Tibetan anyhow, but they turn out to be largely speech verbs: "request", "say", another verb for "request", another verb for "request" or "say". So lots of speech words, but there are also some others, "think" and "give", and then a verb used in a light-verb construction that means "invite". These are not shocking things to see kings involved in, but I think it's cool, and it would be very useful if you were, for example, compiling a Tibetan dictionary. When we turn to the thesaurus function, it's a little less impressive, and I think the reason why is the corpus size. This corpus is 80 million words, which may sound like a lot but is nothing compared to what they have for English. We have a more recent Tibetan corpus of about 150 million words, so I'm hopeful that once that's loaded in we'll get more impressive results. But this is what the system says are the nouns that behave most like "king", and I'll just tell you what they are. The top one means "god"; the second one means "lama" or "guru". So not so bad: socially high-ranking animate entities. But then comes "son" or "boy", and then "person", which, well, a king is a type of person, but it's not stunningly impressive. And the last one on the screen means "victor", and that is quite good, I think: a victor and a king are semantically similar. For the purposes of this presentation, the point is just that we have made substantial progress down the road of Tibetan NLP, and that work continues; but it needs to be done for each language on earth, and if a language has fewer resources than Tibetan, then you have to start from wherever that language is, which may be much further down the hierarchy. After this project ended in 2015 I was busy with my ERC grant, so I wasn't involved in the follow-up project, but there was a follow-up project on verb syntax, classifying Tibetan verbs according to their governance relationships, and this is a screenshot from the corpus that project has delivered. I personally think it is a very beautiful interface. I'll just talk you through it: the D is a demonstrative, the C is a case marker, the A is an adjective; then we have another case marker, then a noun, then a determiner, and then the verb; and there's an arrow that goes from the verb to the head of each noun phrase that it governs. In this case we have a sentence that says "there they erected each a statue of the great black one": der, the first syllable, is "at that place"; then nag po "black", chen po "great", and a genitive; sku means "body", actually, but here it means "statue"; re is "each"; and then bzhengs means "to set up". I don't want to leave you with the impression that any of this is particularly easy; it takes a lot of blood, sweat, and tears, and a lot of analytical work. But once it's up and running, it's scalable: now we have a machine that will analyze Tibetan syntax, and it doesn't care how much Tibetan you throw at it, so we can build big data sets that we can use in our research.
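
The word-division step mentioned earlier, splitting running text into words for scripts written without spaces like Chinese, Thai, or Tibetan, is classically bootstrapped with greedy maximum matching against a wordlist. Here is a minimal sketch of that baseline; the tiny lexicon is invented for illustration and stands in for a real dictionary.

```python
# Toy greedy maximum-matching word segmenter, the classic baseline for
# scripts written without spaces. At each position, take the longest
# substring found in the lexicon; fall back to a single character for
# out-of-vocabulary material.

def max_match(text, lexicon, max_len=4):
    """Segment text by greedily matching the longest lexicon entry."""
    words = []
    i = 0
    while i < len(text):
        # Try candidate lengths from longest to shortest.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in lexicon or j == i + 1:
                words.append(text[i:j])
                i = j
                break
    return words

# Hypothetical toy lexicon: "China" + "people" in Chinese.
lexicon = {"中国", "人民", "中", "国", "人", "民"}
print(max_match("中国人民", lexicon))  # → ['中国', '人民']
```

Real segmenters layer statistical or neural models on top, since greedy matching fails on genuinely ambiguous strings, but something like this is often the starting point when no tools yet exist for a language.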
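
The thesaurus function described above rests on a simple distributional idea: verbs that share grammatical collocates (the same typical objects and subjects) come out as statistically similar. A toy sketch of that idea follows; the (verb, relation, collocate) triples are invented for illustration, not real corpus data, and Sketch Engine's actual ranking uses more refined association scores over far larger corpora, which is why corpus size matters so much for the Tibetan results.

```python
# Rank verbs by cosine similarity of their (relation, collocate) count
# vectors, mimicking how a distributional thesaurus finds that "chair"
# behaves like "convene". Triples below are invented toy data.
from collections import Counter
from math import sqrt

triples = [  # (verb, grammatical relation, collocate)
    ("chair", "object", "meeting"), ("chair", "object", "committee"),
    ("chair", "subject", "professor"),
    ("convene", "object", "meeting"), ("convene", "object", "committee"),
    ("attend", "object", "meeting"), ("attend", "subject", "student"),
    ("eat", "object", "sandwich"), ("eat", "subject", "student"),
]

def vector(verb):
    """Count vector of a verb's grammatical contexts."""
    return Counter((rel, col) for v, rel, col in triples if v == verb)

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u)
    norm = lambda w: sqrt(sum(x * x for x in w.values()))
    return dot / (norm(u) * norm(v)) if u and v else 0.0

target = vector("chair")
ranked = sorted(["convene", "attend", "eat"],
                key=lambda v: cosine(target, vector(v)), reverse=True)
print(ranked)  # 'convene' shares the most contexts with 'chair'
```

Scaled from a handful of toy triples to millions of automatically parsed relations, this is the kind of computation behind the convene/attend/adjourn list for "chair" and the god/lama/victor list for rgyal po.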