Okay, good evening. Welcome to the last talk of the devroom today. I will talk about Poio, some software for predictive text, but at the beginning I also want to approach this topic from a slightly different angle: this idea of coding for language communities. I'm one of the co-organizers of this devroom, and I'm giving a talk at an event addressed to software developers, but my background is actually linguistics. So I wanted to talk a little bit about language communities. We are here in these communities, open source communities, free software communities, and I want to talk about language communities and what needs they have in terms of software and technology, to maybe find a link for how those different communities could work together. This is a map of the endangered languages in Europe, and this is the area I come from: studying and documenting endangered languages. This is only Europe. The white ones are more or less okay, and the darker they get, the more seriously endangered they are; some are already extinct. If you zoom out from this map to the whole world, you will see that there is really a vast number of languages that have only a few speakers. Some of those languages might have only 50 speakers, others maybe a few hundred or a few thousand. These are completely different languages, and a completely different topic, from the ones we talked about in the talks before, like English, German and so on. People still speak these languages, but normally those speakers don't have the opportunity to use machine translation or even keyboards, but we will get to that. Here are some images of how we work. This is a language documentation project. You can see my friend Felix, who could not come today because he got sick yesterday.
This is how you collect data about endangered languages. You go there, you record video and audio, and then you work together with informants to find out how the language works. In the end you have an idea of how the language works and you have some data: normally some hours of video and audio, maybe 50 to 100 hours, and a transcription of those. This is the first step in gathering data. Here are some language documentation projects from the last 10 years, or even a bit longer, from a programme called DoBeS in Germany. Those languages got documented, and for those languages data was collected and archived. There are other projects that do similar things and have documented similar numbers of languages, but still, we can only document as much as there are researchers and funding. Who knows how many languages there are in the world currently? Does somebody have an idea? Roughly, I would say it's around 7,000 languages. Probably there were a lot more 200 or 300 years ago. We can only guess, because we don't have a lot of data about that, but one estimate is that there were maybe around 12,000 languages. Now more and more languages are dying out, and the idea is that we collect the data while they are still there, so that we can still study a language in 200 or 500 years even if the last speaker dies now. We have the data, but we cannot document hundreds of languages; we can document, well, about as many as you can see here. This was a bigger project with a lot of funding. Here is a quote from a speaker who was the last speaker of her language, about how much she feels hurt and how difficult it is. You can also see one of the reasons why this happens: she talks about clear-cutting on their land and so on. Often there are economic reasons and economic pressures behind it.
Sometimes it's really direct pressure; in other circumstances the speakers simply stop using the language because it makes economic sense for them to speak another, major language in the context where they live. So they give up the language, and some of them then also don't care, right? It's not always the story where they actually want to save the language and keep it. There are cases where there was economic pressure, they switched to another language, and that's okay for them: "what do we need our own language for?" That's more or less a quote from a speaker, too. I always like to connect this to the idea of permaculture. In permaculture there is this idea that diversity keeps things healthy, that diversity is actually good for living together and for building your surroundings and your environment. Even if you don't see any economic reason to save languages, you could say it helps us in a scientific sense: it would help us to understand how our mind works, for example. If you can only study 50 languages, you have a very restricted set of features in those languages, so you cannot understand the full range of what can happen in a human mind. If you can study 7,000 languages, you get the full bandwidth of ways to express things, and you might find out more about how cognitive reasoning works. Now we come to the technology part. I made this slide some years ago, so it has probably changed a bit since then: these are the numbers of speakers of the languages that are supported on the iPhone. You can see that if you support Mandarin, you already cover a lot of speakers who then get text input support, a keyboard on their device.
I think it was around 90 languages that were supported back then, which covers roughly 46% of the world population as native speakers. The situation has gotten a bit better on the iPhone since, and Google built Gboard on Android, which now supports really a lot of languages. So the numbers are probably higher now, but still, there are a lot of people who cannot use their native language on phones and probably never will. I also see this here in this community. There are new kinds of social networks coming up, new ways to communicate with each other, and the question comes up there again: on Mastodon, for example, how easy is it to use any language? What can you do there to actually promote the usage of languages other than English, and how can you make that possible? I don't know a lot about how people communicate there, but there are software projects where it makes sense to think about supporting more languages. In the case of Android, for example, it took really a lot of years until they realized that people have this need and started to build something. Here is another approach: you can build keyboards, hardware keyboards. A lot of people try to define how they could input their language. Even now, every day, people are working on inventing new scripts for their language, then getting them onto the screen somehow, or maybe even developing some hardware for input. There are a lot of languages that are not supported yet; they have no defined script, so of course they have no support for software keyboards, and then also not for hardware keyboards. Excuse me? "It's a language without a script and you give them a script. Aren't you destroying it?" Destroying, I would not say. "You're changing it." Not really. Normally this is not something where somebody gives them a script; normally they develop something themselves.
Even if you go somewhere really remote, people have usually already started to think about how they could write their language down. So this is not something that you give them; normally you take what is already there. And if there's no script, well, for your research you have to write it down somehow, but nobody has to use that outside of your research. Often there's already something, though: in a lot of cases there are already proposals from the language community for how they would write their language. That's normally the case. So here is where this idea of coding for language communities comes from. We started this in 2014 with a summer school in Portugal, where we worked on several open source projects together with mentors from those projects and groups of students. We spent one week developing in five different groups on different projects, and it was really interesting because the students could learn something about programming and natural language processing. There were linguists there, there were software developers there, there were the people from the open source projects there. It was a very interesting week. In one project, for example, we actually developed Icelandic speech recognition, so that we could see something happening. Of course it didn't work perfectly, but in one week we could learn how you would set up the system and build the pipeline for Icelandic speech recognition. And since then we have continued this: we had another devroom in 2016, and now another one, to find the links between software developers, linguists and the language communities, and to see if there is maybe something we could do together to support more languages and to bring those languages into electronic communication. Here's another project, by another co-organizer of this devroom, Kevin Scannell. It's called An Crúbadán; I'm not really sure if I spelled it correctly, it's Irish.
Let's have a look at what this is. This is a big corpus for a vast number of languages. He built a crawler that crawls the web over time and then outputs data for the different languages. You can see there are now 2,228 entries; that's the number of languages he supports. Each of those languages has a data file that you can download, and the zip file contains the links that were crawled, and then n-grams. It's not the full data; for copyright reasons you cannot just publish all the data that you crawled, but you can at least publish the links where the language was detected, and then the n-grams: word lists and the contexts of words, like two words together and three words together. And I will show later that these n-grams are also part of the text prediction that I wanted to present; you can already build technologies just based on those n-grams. Then there is another project, by another of the co-organizers, Eddie: Rising Voices. It tries to bring language activists together in an online medium. They do translations, for example, but also try to promote projects for language support, education, things like that. So Rising Voices is also one of the interesting projects in this context, and of course there are various others. There's Mozilla, for example; we heard about it already in some of the talks. Mozilla is doing a great job building things for a lot of languages and building technology for natural language processing. So there are a lot of projects doing interesting work in this area. Now, at the end of this talk, I want to quickly show one project: Poio. I built this initially for educational purposes, to show students how to develop a simple text prediction.
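As an aside, the n-gram files mentioned above are essentially frequency counts of short word sequences. A minimal sketch of how such word lists, bigrams and trigrams can be extracted from raw text (illustrative only, not An Crúbadán's actual tooling):

```python
from collections import Counter

def extract_ngrams(text, n):
    """Count every sequence of n consecutive words in a text."""
    words = text.lower().split()
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

sample = "the cat sat on the mat and the cat slept"
print(extract_ngrams(sample, 1).most_common(1))   # most frequent word: (('the',), 3)
print(extract_ngrams(sample, 2)[("the", "cat")])  # the bigram "the cat" occurs twice: 2
```

With n=1 this gives the word list, with n=2 and n=3 the two- and three-word contexts that the corpus files publish.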
It's based on some data collection, where you have to clean the data and prepare it, and then you build your language model to support something like predictive text. The idea is also that this is more or less language independent. As I said, we often have only small amounts of data for these languages; we don't have gigabytes or terabytes of language data available, so we have to find a way to build this with small data, too. Maybe we can have a quick look. It's a text prediction: you can choose the language here, for example. I choose Bavarian because I'm from Munich. Then you start to write, and it proposes the text completions in the lower line, where you can also select them. Nothing really sophisticated, but you can already see that the system works. And I wanted to show a little bit of how this works and what we implemented, so we can go to the GitHub page. The core of this is a library called pressagio. It's a port of presage; maybe some of you know it. Presage was used, for example, in Onboard, an on-screen keyboard, so the text completion there was powered by presage: you get the completions based on n-gram models. I ported this to Python to make it a bit more accessible to students. Just to show you a little bit of the API: what you can do is parse text files. This part here is all just argument parsing, but then what you do is feed the files into a tokenizer. You tokenize each file, this gives you an n-gram map, and you can store that to a database. At the moment we support SQLite, to have it all on-device, or you can use a Postgres database if you have large amounts of data. So you choose the database. Then, on the other side, to predict text you just load the model from some configuration; I will show the configuration in a moment.
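To make that flow concrete (tokenize text into an n-gram map, store it in SQLite, then query it for completions), here is a minimal self-contained sketch. It uses only the standard library and invented function names; it is not the actual pressagio API, just the same idea in miniature:

```python
import sqlite3
from collections import Counter

def build_model(texts, db_path=":memory:"):
    """Tokenize texts, count bigrams and persist them to a SQLite database."""
    counts = Counter()
    for text in texts:
        words = text.lower().split()
        counts.update(zip(words, words[1:]))  # bigram map
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS bigrams (w1 TEXT, w2 TEXT, n INTEGER)")
    conn.executemany("INSERT INTO bigrams VALUES (?, ?, ?)",
                     ((w1, w2, n) for (w1, w2), n in counts.items()))
    conn.commit()
    return conn

def predict(conn, word, limit=3):
    """Suggest the words that most frequently follow `word`."""
    rows = conn.execute(
        "SELECT w2 FROM bigrams WHERE w1 = ? ORDER BY n DESC LIMIT ?",
        (word, limit))
    return [w2 for (w2,) in rows]

conn = build_model(["the cat sat on the mat and the cat slept"])
print(predict(conn, "the"))  # ['cat', 'mat']
```

Swapping the `sqlite3.connect` call for a Postgres connection is essentially the database choice the talk describes.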
This config file says which SQLite database to use, or which Postgres database. Then you have a callback that gets called for each prediction that is generated, and then you start the prediction. So it's really easy to use: it's just a few lines of Python that you have to write, both to build the language model and to use it. And you have several options for configuring this. You can configure the database here, and there are several predictors available in pressagio that you can use, controlling how the different n-grams are combined and so on. For me this was a good way to show students how you could implement something like this; they can try out the different options and learn how the library works. What we also built is a pipeline, because we can build models from our own text files, but we wanted to make it easy to collect data, for example from the An Crúbadán database or from other data sources. We also used Wikipedia as a data source, as a lot of projects do. So we built a pipeline, called the Poio corpus, that does all the parsing of the data, manages the input, then calculates the language models that we want to use in Poio, and in the end writes them to a database or SQLite file so that we can use them in the front end. It's still this web-based front end where you enter text, and it connects to a back end that has those pressagio language models at its core. The pipeline, again, is not a lot of lines of code. We built it with Luigi, a nice data pipeline framework developed by Spotify. We have the different parts, those are the different tasks here: for Wikipedia, for text files and for other sources; then we have the task that builds the n-gram models; and in the end we use pressagio to store them.
In this whole pipeline you can then add the language codes that you want to support, and if it finds a Wikipedia for a language, it will download the Wikipedia dump and build the n-gram models based on that. If you have other data files, it will include those. What we are doing at the moment is also including the An Crúbadán database, so it would download the data from there. So you get a language model from the different sources. Yes, and I think that's it. Are there any questions? The website? Mm-hmm. "You also mentioned smartphone input at first?" Yes. For smartphone input with pressagio, at the moment we only have the web-based solution; it's not integrated into the smartphone. But the databases that we build in the end are compatible with the ones built by presage, which is a C++ library. So to run it on a smartphone, to really integrate it into the system keyboard, I would probably use the presage library to actually query the data. I can use the Python version to build the language model, but then to use it on the phone I would probably use presage as a library, because Python is still more difficult to run on a smartphone. But again, at the moment I see this more as an educational tool, because smartphones already have good text prediction support for a lot of languages. So I'm also looking for other use cases for this pipeline idea, other things we could build. Maybe we could build a little chatbot or something based on the data that we have, because that's not something that exists for Bavarian, Bavarian chatbots, or for other small languages. Yes, Gboard is also partly open source. Probably not the data; I didn't have a close look at it. The problem is always the data. There are a lot of solutions; there is also open source software for keyboards.
One I can actually show, because it might be an interesting project: it's called Keyman, as far as I remember. Let's see. Yes, they built something very similar. At the moment it's also mainly based on n-gram models, but they are also exploring other, more advanced models. This is all open source, and they already have integration with iOS, with Android, and also a web-based version. This is probably the most advanced approach for open source keyboards at the moment, I would say. More questions? "I was just going to comment on that. I think predictive text from open source is actually in quite high demand, at least in the minority language communities I've been working with. There is this sentiment among the younger language users: why would we use our language if I can't use it in the Twitter app or on Facebook? It would be a good thing to have, because of course for languages like Karelian and the others I've been working with, you don't have input support." Yeah, it's still a reputation thing. If your language is supported, it gives the language reputation, and you can say: okay, you can use it, and I'm proud that my language is there. Then people use it, some more, some less, because there are also social issues, right? You discuss online, and then somebody joins who doesn't really understand the language well, so everybody switches back to the major language, and so on. So it's not only about keyboards, but keyboards are one of the things that are needed. More questions? Yes, in this case it's really the simplest approach: you take the close ones, the direct neighbors, and to calculate the model you just count how often they appear. But with this you can already get quite far; the quality is actually quite good. Until recently, that was also the main model in most keyboards.
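That counting model from the answer, just tallying how often each neighbor follows a word and ranking by relative frequency, fits in a few lines (a toy illustration, not Poio's code):

```python
from collections import Counter

def continuation_probs(text, word):
    """Relative frequencies of the words that directly follow `word`."""
    words = text.lower().split()
    follows = Counter(w2 for w1, w2 in zip(words, words[1:]) if w1 == word)
    total = sum(follows.values())
    return {w: n / total for w, n in follows.items()}

probs = continuation_probs("i like tea . i like coffee . i drink tea", "i")
print(probs["like"])  # 2 of the 3 occurrences of "i" are followed by "like"
```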
Now people are switching to more sophisticated models that also take semantics into account, like the word embeddings that we saw in other talks. So those are more sophisticated. Other questions? Well, then thanks for coming. It was a good devroom, I think, and I hope we have the same next year. If there are still questions or anything else, I'm still here. Yes. Thanks a lot.