Thank you very much, Nathan. Many thanks to you as, I understand, the main organiser, but also to everybody else who contributed to the organisation. This first presentation can serve as a form of introduction, and so I will try to place the Persephone tool in context as exemplifying the benefits of interdisciplinary dialogue between computer scientists and field workers. Linguistic fieldwork is probably very familiar to all of you, but I'll still recapitulate briefly what it's about. This is the core mission of the research centre where I work, Langues et civilisations à tradition orale. So this lab was founded over 40 years ago to document and analyse under-documented languages and cultures. And the languages on which fieldwork has been done so far are indicated by pins on the map. So the pins in red are where PhD students are currently active. The pins in white are where people have been but have not archived any data yet. And in orange is where some data are available from our open archive. For instance, my own fieldwork took place here. It has its little pin on the map. For people familiar with South-West China, here is a close-up map showing the position of Naxi. So the Naxi language, you may have heard about it because it's famous for its pictographic script. And the pictographic script area is shown on the map with little red dragons for the corresponding places. Na, which is the language I'm doing most fieldwork on, is spoken at the border between the Chinese provinces of Yunnan on the south and Sichuan on the north. And Myanmar, you can't see it, but Myanmar is just across the black border, and Tibet is to the north, northwest. So since 2006 I have conducted nine field trips in this area, specifically in this village, transcribing and analysing materials and learning to speak the language. So this is what it looks like, the plain of Yongning. Its inhabitants in festive costume.
Here are some water buffalos, which were introduced as a 21st century innovation for plowing. But now, less than two decades later, they have already been replaced by tractors. So this is at the same time a 21st century sight and already a thing of the past. So things are changing quickly. Here are some mountain streams, villages, and here is my teacher. And this is what her voice sounds like. So this is a Sino-Tibetan language, and when working on an unwritten language, one needs to work out the sound system in order to be able to transcribe. The knowledge of neighboring languages helps, although there are generally some discoveries to be made. This is where linguists come into play, and we use distributional analysis to work out what the vowels and consonants are, what the restrictions on the combinations are, and so on. And as soon as possible we begin recording texts. We transcribe them with the help of the consultant and we annotate them. So this is a passage from a story where the multilevel annotation allows one to spot the mismatch between tones at the sentence level and the underlying tones of individual words. So this is near the beginning of a story. It's said that there were two siblings, and you can see from looking at the word-by-word glosses that, for instance, the word for the sister, younger sister, doesn't have the same tones in the sentence, where it's nimi, halo, and in the word-by-word glosses, where it starts with a mid tone. And this is what I study as a linguist. We can also have any amount of lavish notes, translations into any languages we want. And now when we get to analyzing the combinatorial properties of tones, we can look for instance at "he went hunting the muntjac", so we know the tone of the noun, the animal, the kind of animal, the verb "to drive away", "to chase", and how they combine tonally. So this is the first step, where the object and verb have a certain rule that tells you what the tones are going to be like when you combine the two.
A bit like declensions in Latin or conjugations in languages that have it. In this case, when an object and a verb get combined, well, if you know the tones of both, you can know the tones of the output, but you have to know the paradigm. Then there is verb serialization, he went hunting, and some post-lexical processes. So this is what I did, doing the big tables, saying if you have an input verb that has high, mid, low, or mid-high tone, including the sum categories, and monosyllabic objects or disyllabic objects with this or that tonal pattern. This is what you get at what people say, how they combine, tonally. Now it's been a few years' work, more than 10 years, so all the resources are available online. It's open access, it's Creative Commons license, so everybody is welcome to use it for their own research or for some other purposes. And the excerpt that I showed you a moment ago is now shown in its online setting that anybody can look up without having to download any tool, it's all in the browser. Files can be downloaded, including the original audio and the full annotations in XML format. We have the metadata that you can also download in full. You have a reasonable sample laptop saying who's the speaker, who did the recording, the annotating, everything. And we have DOIs, that was a specific request by Nathan, who said the archive is good and it shall be usable when you have DOIs. So not really as a suggestion or a question about what you think about DOIs, and we didn't think very highly of them, but a request to implement them, and so we did it, an overlay of DOIs. And they do facilitate citation for people who want to refer to a specific sentence within the text. They can have a one-click link to that. And my book was proudly published in 2017, so that's dropped on, in a sense. But the problem here is now that we do have some of the files that are transcribed, those that have a little scroll as an icon to the left. 
And that's okay, most of them are, but as you scroll down the list, you get this whole list of stories that sound interesting about adopting pets, about dogs and cats, and how they're seen locally, and how long in living memory they have been around, and the fact that dogs are special friends, kind of totem animal, and lots of information, but you only have the audio and no transcription. And it's unlikely that I will ever do the hand transcription for all these recordings, because it takes so much time, maybe one hour for each minute of recording. So if the story about pets is nine minutes, that's nine hours, I can't do it completely on my own. I'd need to be with a consultant. And now, you know, like, aging people, I have to. I don't have to, but I chose to supervise PhD students and have other administrative responsibilities. And if I'm happily in the Himalayas transcribing all this and annotating it, somehow I'm not doing all the tasks I should be doing. So I could say, you know, posterity can manage. They've got the dictionary. They've got my book and my dictionaries. They should read my publications, linguists say that. And they can just do what I did, look at the gloss text and continue for themselves. But I see Mandana nodding her disapproval. And having limited access to these documents for now for my own research is a problem, because if I want to track the use of construction, that's not so common, but I do feel one can hear some instances. I would really like to navigate the whole corpus. Each time I recorded a story with kind of negotiation with the consultants, you know, if you've had all the stories I could tell and please. So then I couldn't have some stories again, a new version. And sometimes she would talk about an interesting topic and I said, ah, could you record this? And you say, okay, think about it, prepare it. Every time when it's in the box, it's something really very precious to me. 
And having access to those beyond just the WAV file is important, for me and for other interested users. So when we think about this situation, maybe it's all because of digital technology. If there were no digital technology, we wouldn't record so much material. So this is a text by Franz Boas, 1902. And this is the same method that we apply today. People will tell a story that they're confident narrating, be it about their own family or skills that they have, how they learned to make a crossbow, for instance, or traditional folk tales like this story. And the reason why he didn't record reams of text is that there were no audio recorders at the time. And so now we have these digital technologies and we can record basically as much as we want, even video, because we've got big hard drives. So it's a digital problem. So maybe digital tools could also help us solve the problem that they've created, by using natural language processing, you know, automatic glossing, automatic transcription, automatic translation. And somehow they owe it to us, you know, to clean up the mess that they're creating. A difficulty is that our method, we're happy with it. It's tried and tested, it's more than one century old. And so as linguists, we don't want to give up what we already know, what we already have, for some fanciful alternative that may work, you know, 67%. But what does it mean if it works 67%? We can't get students to graduate and we can't finish our grammars if it's just 67% accurate, right? And spending a lot of time wading through the errors and doing error analysis sounds like maybe we're not doing our job and we have lost track. So this is where the dialogue with natural language processing begins. And it is a pretty slow process. My own first attempt was in 2014. I was in Hanoi in an engineering centre. So that was easy.
There are lots of toolkits for doing NLP, including speech recognition, automatic transcription, and so a colleague in the same room said, you know, let's try the automatic processing of Yongning Na. We'll develop a light acoustic model, light because, you know, you have so few data for this target language. And then we will also try to complement it with heavyweight models from five national languages. They had Khmer, English, you know, Chinese, German, French, something like this. And at the time it was interesting, but very far from practical usefulness. There were lots of errors, so you could think about, oh, isn't it curious that when there's a... in the microphone it gets transcribed as a uvular, ko, as if it were the syllable ko, but that was not practically very interesting. Now, when it did become interesting was when Oliver Adams, who is fortunately here with us today, did a PhD at the University of Melbourne and tried various things, including transcription by means of a neural network with connectionist temporal classification, which is, you can date it from different periods, but mostly it's something from this decade, right? It's something that allowed for much progress in classification. So typically you have a sound, and which phoneme is that? How do you want to classify it, given that you have a finite set of phonemes, and in the language it has to be one of those? So the input to the tool was audio plus transcription aligned at the sentence level, and importantly the training was done from extant fieldwork data, so what I had collected that far. And that was data from just one speaker, about three hours of audio. And that was appealing, not having to work for the tool, saying you have to record different materials, read materials, I can't do that, the people don't read, or having a whole lexicon, I didn't have the lexicon ready at that stage.
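To make the idea of connectionist temporal classification a little more concrete, here is a minimal Python sketch of the decoding step only: the network labels every short audio frame (possibly with a "blank"), and the transcription is obtained by merging repeated labels and dropping blanks. The blank symbol, the function name and the toy frame sequence are all assumptions made for illustration; this is not Persephone's actual code.

```python
from itertools import groupby

BLANK = "_"  # hypothetical symbol standing for the CTC "blank" label


def ctc_collapse(frame_labels):
    """Turn one label per audio frame into a label sequence.

    Step 1: merge runs of identical consecutive labels.
    Step 2: drop the blank label.
    """
    merged = [label for label, _run in groupby(frame_labels)]
    return [label for label in merged if label != BLANK]


# Toy per-frame output for a single syllable: initial, rhyme, tone.
frames = ["_", "n", "n", "_", "i", "i", "i", "_", "˧", "_", "_"]
print(ctc_collapse(frames))
```

The blank is what lets the same label occur twice in a row in the output: two runs of a label separated by a blank stay distinct after collapsing.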
So it was just saying, given the output of your fieldwork, this can be fed as input to a tool for transcription. So this is the training set. What really matters is the audio and a sequence of target labels. So in this case just retaining the initials, the rhymes, so the consonants and vowels, because it's mostly a CV language, and the tones. And this was in 2017, late April, and Oliver had said, I could try something, but it won't have tone. And I thought that's a pity, because my book is about tone in Yongning Na, but anyway, you go ahead, you show me and I'll tell you if that's useful or not. And in the last few days he said, I found a hack, and now there's going to be tone. And what is it like? And so this was the audio. And in the first line there was just one error, which was the tone on the copula nie, which should be a low tone. But when you come to think of it, the last three syllables, they are mid low low, but that's realized as gradually going down. So the low tone in the middle is phonetically more like a mid tone than a low tone. Because in that language, when you get a sequence like mid low low, you don't rush to say mm, mm, mm. You get mm, mm, mm. It goes down gradually, and the way people process it, which is language specific, they get the tones right without having to, you know, really stick to a three-level scale with high, mid and low. So that sounded very, very promising and really useful practically for new transcription. So of course this is just phonemic transcription. It doesn't understand the words or chunk into words really. So you have to add that, the spelling, sorry, the punctuation, and comments saying that bi here is not "to go", but it's the adversative morpheme. And then I would check some things, like is it this tone or that tone, and make notes on those, I put those into comments, and then the translation I still do by hand.
And then you can compare, for a test set that was already transcribed and is not part of the training set, the output of the transcription with the linguist's transcription and, you know, do the error analysis. And some colleagues said, you know, that's too good to be true. There must have been a mistake, if you have 85% accuracy or something, which is on the order of, you know, a 13, 14, 15% error rate, especially for an off-the-shelf model that hadn't been worked on very long, just getting the thing to work and obtaining this result. There has to be something wrong, because when people speak, there's not 100% of the phonemes as neat segments in a string when they talk. Some words get garbled, and for instance, you know, Mrs is from Mistress, but now it has gone its own sweet way, because in that context it was used as a title and not a full noun. And some people today wouldn't know that Mrs, Mrs Smith, is from Mistress Smith, and that's what languages are like. So the first part of our answer to, you know, it's too good to be true, it can't work, is that we're doing what would today be called open science. It's documented. The code is available online. The data sets are available. And the results, so what we get, are available online too. So there's a GitHub repository where this is available, and it's open for anybody to look at those materials relating to Persephone. And that's the whole list of parallel texts for the test set. It's called story-fold cross-validation. I learned some technical words, that was fun. It means you take out a story, retrain the acoustic model on all the rest of the materials except that, then apply it to that story. So it counts as, you know, the tool doesn't have the answer. And then you can check, compare it with the linguist's transcription. So we have those sets here.
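As a rough illustration of the story-fold procedure and of what an error rate of 13 to 15% is counting, here is a minimal Python sketch. The corpus layout (story name mapped to a list of utterances), the function names and the toy transcriptions are assumptions made for the example; the error rate shown is a plain edit distance divided by the reference length, which may differ in detail from how the published figures were computed.

```python
def story_folds(corpus):
    """Yield (held_out_story, training_utterances, test_utterances).

    `corpus` maps each story name to its list of utterances. Each story is
    held out once; the model would be retrained on all the other stories.
    """
    for held_out in corpus:
        train = [utt for story, utts in corpus.items()
                 if story != held_out for utt in utts]
        yield held_out, train, corpus[held_out]


def edit_distance(ref, hyp):
    """Minimum number of substitutions, insertions and deletions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def label_error_rate(ref, hyp):
    """Edit distance normalised by the length of the reference."""
    return edit_distance(ref, hyp) / len(ref)


# Toy corpus: three stories with made-up utterance identifiers.
corpus = {"story1": ["u1", "u2"], "story2": ["u3"], "story3": ["u4", "u5"]}
for held_out, train, test in story_folds(corpus):
    print(held_out, "train:", train, "test:", test)

# One wrong tone label out of six labels.
reference = ["n", "i", "˧", "m", "i", "˥"]
hypothesis = ["n", "i", "˧", "m", "i", "˧"]
print(label_error_rate(reference, hypothesis))
```

Holding out whole stories, rather than random sentences, is what keeps the test honest: no sentence of the held-out story is ever seen in training.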
It's too small to read, but this is the parallel text, sentence by sentence, with the reference. Sometimes it's a bit generous, because there are errors in my transcriptions, where a word gets repeated and I only wrote it once. And then the algorithm gets a bad mark for not being up to the gold standard, which is sometimes not so golden. But anyway, the materials are online and people can check. And the second part of the answer to, you know, is it too good to be true, is it really working, is that other people very soon worked with Persephone. At the same time as there were tests with Na, there were tests on Chatino by Hilaria, who is also present today. Chris Cox, who will present tomorrow, worked on Dene, which used to be called Athabaskan: Tsuut'ina, it's a language of Canada. And only yesterday evening I got a note from Emily Prud'hommeaux, who said that they were working on Cayuga, which is an Iroquoian language, and they were using the same tool. So probably it is replicable. The requirement is more about exhaustive transcription. So as Nathan said, you need to have everything transcribed. If you correct the transcription because, you know, there are mistakes in it, then you should do it in such a way that the computer can still recover the spoken string. So if people insist on additions, you can place them between square brackets, for instance, and then that can be disregarded for speech recognition purposes, when training on the corpus, because these words are not spoken in the audio. And conversely, if they said things that they really want you to leave out, because it was a stray sentence that they began and never finished, then you can use angled brackets or some other explicit encoding. And then, you know, for purposes of learning what the phonemes sound like, they're perfectly fine. Even a word that is mispronounced would be OK. If you transcribe it phonemically, put it into the machine, probably it won't do any harm.
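The bracket convention just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: square brackets mark editorial additions that are not in the audio, angled brackets mark material that was spoken but should be left out of the edited text, brackets are not nested, and the function names and the placeholder sentence are invented for the example.

```python
import re


def for_training(transcription):
    """Recover the spoken string: drop [additions], keep <spoken material>."""
    text = re.sub(r"\[[^\]]*\]", " ", transcription)  # not in the audio
    text = re.sub(r"[<>]", "", text)                  # spoken: keep the content
    return " ".join(text.split())


def for_reading(transcription):
    """Produce the edited text: keep [additions], drop <stray material>."""
    text = re.sub(r"<[^>]*>", " ", transcription)     # speaker wants it left out
    text = re.sub(r"[\[\]]", "", text)                # addition: keep the content
    return " ".join(text.split())


# Placeholder tokens standing in for transcribed words.
line = "w1 <w2 w3> w4 [w5] w6"
print(for_training(line))
print(for_reading(line))
```

The point of the explicit encoding is that one annotated line serves both purposes: the acoustic model is trained on exactly what is in the recording, while readers get the version the speaker approved.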
OK, so there are some habits to adopt, and that can make a difference for some people, but that's feasible, I think. Another thing is that multi-speaker will be very hard to achieve, and I've learned all of this from Oliver, so I'm doing the text-to-speech but here's the brain behind that. But it's common knowledge that if you want a tool that will work for men and women and young and old and different speaking styles, then you need to have a very, very solid statistical basis, and sometimes it won't work, because you can't expect somebody talking into the microphone at SOAS to have text that conforms to your expectations based on whatever happens to be in the English Wikipedia or something. So sometimes it can be good to have overfitting. So overfitting means that the computer doesn't generalise a lot, the statistics don't generalise a lot, they just recognise patterns they've already seen, and departures from that would really confuse them. But it's OK if you're transcribing, as most of my data is, from one speaker, because it's a very complicated language tonally, maybe a little like Chatino and other Oto-Manguean languages. I don't have a replacement speaker. It sounds very awkward to say it, because I'm describing an idiolect. But all the community members I worked with, apart from my main consultant, they differ subtly in their own patterns. So I mostly recorded data from one speaker. So if I can have a tool that helps me work on the 75% of her recordings that I didn't get around to transcribing, that's good, that's an achievement in itself. So overfitting to that speaker is not necessarily a terrible evil. If I could just add in one thing, there's a distinction worth making between a multi-speaker system that can capture the speech of multiple speakers in one model and a model that can generalise to speakers that it hasn't seen in training.
It's perfectly possible to have a system that recognises multiple different speakers well if it's seen those multiple speakers in training. To get a multi-speaker system that generalises to speakers not seen training can also be done, but that just requires a lot more data. So there's two different multi-speaker cases worth distinguishing between. But when you say a lot more data, what does that mean? Maybe an order of magnitude more to get the same accuracy. So what you could, and this is all just very rough rules and they don't apply generally, but say what you could do with one hour in a single speaker case, you'd maybe need a few hours to do a multi-speak system and then maybe ten times that amount to do a multi-speak system generalise. So would you call that speaker independent by the time you don't need to bother whether they have training data for that person? It's a speaker independent ASR and multi-speaker ASR would be multiple people that you know and for whom you have data, right? Yes. It's a weird question, but why would you refer to your, why would you refer your event or the organisation, why is the organisation named after the Greek goddess of vegetation and the other world? Is there any reason for that? Yes, you can look it up on GitHub. It's open science. We have no dirty secrets or terrible hacks there. So the reason for choosing a Greek mythology is that this is where prestige is and the reason why we're happy to have this event at SOAS, which is a very prestigious institution, is that Nathan carries with him the prestige of ERC and many other wonderful achievements and we want prestige. So we wanted a Greek goddess. In Persephone and it sounds like, you know, phonetics and phonology and... We'd love to say it in French. Persephone. Which is probably much less truthful to the Greek. In fact, there's no, it has no relationship to that Greek root. 
It seems to be a goddess that was there before the Greeks moved in and they kind of preempted her into the local pantheon. So she would be goddess of nature and things like that. And so the phone is phony in there. It's not a real phone, but it did sound nice. And the real reason is that, you know, she got abducted into a very unpleasant place and so you have to think when you're doing error analysis, for instance, you're a linguist, you think you've been taken into a really dark place and you're committed to that because you somehow committed yourself to that at one point, but you're looking forward to the other half of the year when you will see the light of day and maybe computer scientists feel the same when they have to find out that materials which are theoretically conforming to a list of phonemes that the linguists provided. In fact, they get tons more and the linguists say, well, of course, there are Chinese can't you see? And, of course, they've got different phoneme inventories and you feel like, well, okay, maybe there should be a little more explicit about their methodology and if linguists claim to be scientists, somehow they still have a way to go on the front of, you know, explicitness and rigor, basically. So you want to be back to your paradise and out of the other person's hell. So, but it's good and somehow you can survive spending half the year and you can do it until time eternal. So we also hope that dialogue will continue for a long time and that there shall be plenty as befits the goddess of, you know, growth and so that's the story. Okay. So let me see if I can be back on track. I won't dwell on this. This is the kind of benefit for researchers to say, why would we want to do that apart from practical benefits? We get other benefits, such as having a kind of signal shaped view on our own data, which is different from our brain shaped view, because as we know when we learn the language, when we do fieldwork we have to learn the language. 
We should learn the language, and, of course, we adjust to many things, like allophonic variation, which we learn to overlook because we want to get the phonemes, and more generally we adjust and we don't hear differences like, for instance, words of four syllables in this language. They are a nightmare for Persephone, for the tool. They are wrong every time. For instance, this is the main protagonist in the story, Achiduma. Achiduma. And not one time out of, you know, more than ten does the name come out correctly in the transcription, the automatic transcription. And if you look at the spectrogram, it looks like Achiduma, Achiduma. So calling out to somebody, and when there's four syllables, but you're saying them at one go, because it's one person's name, then, of course, they get shortened, some of them get lengthened, the R and R that follow get together, and there's a lotlization at the same time. So it's difficult to retrieve them from the signal. So this is something I wouldn't have paid attention to as a linguist, that in this language every verb is monosyllabic. Most nouns are monosyllabic or disyllabic. And the family names, which come from Tibetan, and the given names, they're all four syllables. So it can be four or six. And of course, when people say them quickly, they sound pretty different. And here we have a statistical view on that, that's saying, yes, at some point, they're a subsystem. It's almost as if they were encoded differently, pronounced differently, and called for separate decoding. So this is something we presented, among other publications we did. One was at the latest International Congress of Phonetic Sciences. So with Oliver, also Chris Cox and Séverine Guillaume, who's an engineer at LACITO. "Phonetic lessons from automatic phonemic transcription: preliminary reflections on Na and Tsuut'ina". And that's interesting. So to summarize very briefly the finding that Chris made on Dene, he found that the orthography used to have, how many is that? Three or four vowels.
And there's one less, they think, in current speech. So people disagree. Some will say, we still distinguish the R and R, and others will say, no, no, it's gone. Listen to the Amis speakers. So he has orthographic transcription for Tsuut'ina, and the orthography makes the difference. And then he sent this in with the audio to train an acoustic model, and then applied it to new data. He has an ELAN plug-in, which has been online for a few weeks now. It's wonderful that you can look at a recording in ELAN and press a button and out comes the transcription. And he found that the patterns for the two vowels that people said had probably merged, and he thought they had merged, were still coherent in his main consultant's speech. And they matched the lexicon. So it seems like the consultant is probably making a difference. It doesn't prove things, like "artificial intelligence shows that". But it's one hint among others. And it's an encouraging finding for people who do orthography. So let's not just dump this vowel distinction, which goes back to Li Fang-Kuei, who did the first fieldwork, to make it simpler for the younger people to learn. No, maybe they should, since the people who still speak the language probably make the distinction, just keep it going and see to what extent they can deal with that. So there's a kind of dialogue between the linguistic fieldwork and the work with the consultants, about which Hilaria will say much more because she has a much stronger tie to the local community than I do despite my best efforts, between that and the NLP tasks. So basically this is most of what I want to say, most of what we have got to do for now. But I wanted to emphasize that it's an interesting process. And I think there's a side benefit for us as citizens, that we can get awareness of contemporary developments. So this is also something that I owe to Oliver, though it's on a personal note, like sending me a link to a fun video.
I think all of you should watch this wonderful video. It's a keynote speech by James Mickens from Harvard about you know the whole conference was about security, computer security. And he took some very creepy examples and said you know, if you are currently maybe overselling a little a new artificial intelligent device that will do this or that, think before deploying. And if you want even shorter advice, it's think first. And in one word it's don't. Whatever it was that you were intending to deploy, don't do it. Don't do it because we're kind of running headlong into a society where people get a yes or no on you know, mortgage loan or whether their kids can go to university or whether they should be allowed out of jail after some time in there. And it will be labeled as you know, a decision from expert system, artificial intelligence that has no human biases because you know it has nothing against you, it just says that you should stay in prison. And you can't easily call that into question because it's a black box essentially that's behind it. And people who know like several of his colleagues who know how things work, they know that of course, that they'll be biased. If I declare the wrong set of phonemes I'm not going to find the missing phoneme thanks to Persephone because what it will do is categorize all the sounds into what I say is the finite inventory. But maybe we could change the tool in ways that would allow it to point out the outliers, may find there's a pattern in the outliers and we could do lots of things and that's a way to think of those systems that we should know about those. 
I put in the book by Edward Snowden. That's kind of a personal interest, but I found it was a very impressive book as a piece of writing, not just, you know, the man did his thing and let's buy the book and support him or something, but because it's such an informative book for citizens in this day and age. Short of taking classes in computer science (I try to listen to online classes by Graeme Novig and they're really good, and there's lots of things online), doing some work with those algorithms is extremely enlightening. Because then you think, okay, if it works for Na, then it knows something about the phonemes, if it can recognize them. So how can we open the black box and see where the boundaries fall, learn about the acoustics, do some explicit modelling of the language's acoustics based on these statistics? There's no easy way to do it now. But because there has been progress, we can look forward to a future when other stages, such as translating from and into newly documented languages, could be a possibility, and try to make those possible, because technology will not just happen. Well, as Edward Snowden revealed, things like that do get deployed, but we don't get to learn lots about it. But we can spend some time working with computer science people and get, I think, multiple benefits from that. So yes, for field linguists there is the usefulness. We have some good prospects to report, just from a couple of days ago: the project on computational language documentation, funded by the DFG in Germany and the ANR in France, will be funded. So we can now have a 15-month contract for a programmer who can complete a good front end for Persephone and, hopefully, many other back ends, multiple back ends. We can continue the improvements to the back end and think of the links to other treatments: if you have a dictionary, then if you have the phonemes, look whether you could match them and make hypotheses about the words that are in that string of phonemes, et cetera. So I guess I should... Yeah, this is a small community we're trying to build, of people interested in computational approaches for documenting and analyzing oral languages. So the good news is that there are some computer scientists who have a long-term commitment to this strand of research, and some linguists, and with a sufficient community growing at the intersection many things can get done. And this is just final photographs, and waiting for your questions.