So, hi everybody. Today I am going to talk about a system we implemented in Firefox OS: a keyboard prediction algorithm that plugs into the existing input system. Before I go into the details of how we did it and what we actually did, I wanted to ask: how many of you know about Firefox OS? Wow, that is a revelation. In Firefox OS, all the apps and everything are based on JavaScript, and they run on the Gecko engine which sits behind everything; the whole stack is called Boot to Gecko. The problem starts with the fact that every app you create for it — not just the third-party apps but the native apps too — is JavaScript, and they all have to run on the device.

As some of you may know, a very low-cost device was launched in India, about 35 dollars, I think 2,000 rupees, called the Intex Cloud FX. Do any of you have it? And are you happy with its performance, especially the keyboard part? Okay, right.

So first, let us see what the problem is. When Firefox OS first hit the market, the target audience was low-cost devices running on very limited hardware. The problem was optimizing the apps, native or otherwise, for these systems; for us, that meant the keyboard. One of the plus points of Firefox OS is that we support a lot of localized languages: it has Bengali, Hindi, and many other keyboards inside it. The problem was the prediction part. Everyone who has an Android or iOS device knows this: when you start typing a localized word using English letters, it won't come up as a prediction at first, but if you keep writing the same words again and again in chat or wherever, the keyboard learns what you are writing and slowly starts giving you those predictions. That learning was absent, and is still absent, in Firefox OS. So that was our work: how can we keep predicting and keep learning?

So, why does a multilingual framework actually matter? When we started this, that was not the plan. We wanted to implement it for one very specific locale, initially Bengali, on top of how the prediction algorithm was already implemented. We could predict words in Bengali or in another language, but the predictions were very limited and there was no learning part involved. Once we started doing that, we realized that if we could make it modular, it would cover a lot of other languages, and we could actually make it a framework. That is how it began.

So, what do we have today in open source that we can take inspiration from? We have SwiftKey, Swype, and the Android keyboard from AOSP, plus others on other platforms. Among all these, the only source code available is the AOSP keyboard's, and we utilized some of it.

So, how does Firefox OS fit in? The problem we faced, like I told you, is that this is all JavaScript and it runs in Gaia. Gaia is the UI layer which runs on top of Gecko and shows you everything in Firefox OS. We will go into how we did it; one of the goals was that the learning part has to be there and it has to be language agnostic. That is especially important for Indian, or Indic, languages, where we have many more characters than the Roman alphabet.
Now, how have the localized builds with suggestions worked? We started with Bengali first and then Hindi, to see how it actually performs: first not on a device, just in a simulator, then on a real device, and then on low-cost devices, which have their own problems we will go into.

So, what was the plan? First, the keyboard has to learn new words; then it has to show them in the predictions; and it has to work with the prediction algorithms which were already implemented. Now, how does it work right now? If you have a Firefox OS device today, this is already there: there is a word frequency list, and from the word frequency list it predicts your word. So what is a word frequency list? We have a list of words built from a corpus which is already available; it goes through the whole corpus, sees which words are repeated most frequently in those documents, and assigns a weightage to each. That word frequency list is already on the device. Whenever you type something — for example, "welcome" — you type W-E-L-C and it finds, out of all the words it has, the highest-weighted word which starts with "welc". More precisely, it matches the prefix you have typed so far and only suggests the remaining characters to complete the word. The word frequency lists are mostly taken from AOSP: Android works in much the same way and has its word frequency lists ready, so we utilized most of them, and some we had to build ourselves. So this was there, but there was no way to make it learn: these lists are pre-made, pre-processed, and put into the device, so there is no way to make it learn a new word.

So, the plan. Since it has to integrate with this existing system — we cannot just introduce a whole new paradigm, that is just not possible — the first target was: for Bengali, create a word frequency list, and then make that list learn. We created a word frequency list from a corpus which was provided to us by the FUEL team in India, and we started experimenting with that. Performance was an issue, which we tackled later. So that was how we started, from creating the word frequency list.

This is how it is implemented right now in most of the devices you have, up to Firefox OS version 2.5, which is about to come out or is probably already there — you can flash your devices with it. What I am going to talk about is coming in version 3. Right now, there is a bloom filter in front of the dictionary. Just before this talk, you saw that when you type something in Google, it predicts words; that is a very complex mechanism done on the back end, but if you want a very simple representation of what goes on, this is how the prediction works: the word frequency list is stored as a weighted tree, with words at the nodes. It walks the tree along the word you have typed so far, checks which node it leads to and what weightage it has, and gives you the prediction. That is how it works right now.
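To make that mechanism concrete, here is a minimal JavaScript sketch of the prefix lookup just described. The word list, weights, and plain-object storage are illustrative only; the shipping Gaia keyboard keeps its dictionary in a packed binary format, not in JavaScript objects.

```javascript
// Minimal sketch: prefix-based prediction over a word frequency list,
// stored here as a plain { word: weight } object for illustration.
const wordFrequencyList = {
  welcome: 255,
  well: 240,
  welfare: 180,
  weld: 90,
};

function predict(prefix, list, maxSuggestions = 3) {
  return Object.entries(list)
    .filter(([word]) => word.startsWith(prefix)) // keep words matching the typed prefix
    .sort((a, b) => b[1] - a[1])                 // highest weightage first
    .slice(0, maxSuggestions)
    .map(([word]) => word);
}

console.log(predict('welc', wordFrequencyList)); // ['welcome']
```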
Now, how do we implement the change? So, we already have that implemented, and we have made a Bengali word frequency list. That list is pre-processed and put into the device. We create an alternate ternary tree structure on the device, which will be blank when you buy the device; there will not be anything in it. As you keep typing, words which are not present in the previous dictionary keep getting added to it. Keep in mind that we don't check the words: whether it is a real word, a non-word, or a typo does not matter; whatever you type gets added to the ternary tree structure and a weightage is assigned.

We have a very simple way of doing the weightage for this part. Using the same word again and again automatically increases its weightage, and the word you are using is compared with the ones we already have in the word frequency list. For example, take a possible typo. When somebody types "welcome", that is already present in the dictionary. When somebody types "well", that is present in the dictionary. But when somebody wants to write "Wells Fargo" — W-E-L-L-S — "wells" is not part of the dictionary. So first it learns: this is a new word, put it into the dictionary and assign it a weightage. The next time somebody tries to write "wells", it will go through the list and see that "well" has a higher weightage than "wells", so it still won't predict "wells" unless there are no other candidates available. But if the user keeps writing "wells" again and again, that weightage gets bumped until it trumps the entry that was already in the word frequency list, and then it becomes the main prediction for you. That is how we started tackling the problem.

Now, this works pretty well for new devices, and it also solves a new segment of the problem: it is specific to the user. Every user will have a different result with it. It is personalized, because we are not taking the data back to any server and training on it there. As you use it, your dictionary will become radically different from that of somebody else using the same build.
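A minimal sketch of this learning loop follows, with made-up weights and bump amounts; the real implementation stores the user dictionary in the on-device ternary tree rather than in a Map.

```javascript
// Hypothetical sketch of the learning behaviour described above: unknown
// words go into a per-user dictionary with a small starting weight, and
// every reuse bumps the weight until it can overtake the built-in entry.
const builtIn = new Map([['well', 240], ['welcome', 255]]);
const userDict = new Map(); // empty on a fresh device

function learn(word) {
  userDict.set(word, (userDict.get(word) || 0) + 10); // +10 per use (illustrative)
}

function weightOf(word) {
  return Math.max(builtIn.get(word) || 0, userDict.get(word) || 0);
}

learn('wells'); // first sighting: weight 10, far below 'well' at 240
for (let i = 0; i < 30; i++) learn('wells');
console.log(weightOf('wells') > weightOf('well')); // true once used often enough
```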
So, let's see a demo of what we have so far. But before that, let me show you how it works on a device right now — actually, I will show that later. This is exactly why we have all this. As you see, I will try a few words: see, this word is not available. Here I am trying the name of one of the contributors in Mozilla Bangladesh. As you can see, the predictions are coming up, but his name is not part of the prediction yet; the typed part matched, but not the remaining part. Now I will remove it, in case some of you missed what was going on. A few more words are still not there; I tried with my name. This was simulated on a computer, so the typing is pretty slow. "Catastrophic" is not present; I try again, still not present. So this is how it is implemented right now, but it has big, varied problems. As you can see, the learned word is now coming up. This is interesting: "Mozilla" was not there before, and now "Mozilla" is there. Anyway, you get the drift of what is going on. Let me show how it is right now on hardware: that is a Flame, a Firefox OS device, if you can see it.

I will try to write something on the keyboard — the same example, maybe. Interesting, okay, "Modi" is now there. Anyway, let us try "jsfoo"; that is probably not going to be there. And yes, "jsfoo" is not there: as you can see, there is nothing starting with "jsfoo". And no matter how many times I try here, it will not learn it. This is running a fairly advanced version, a 2.5 pre-release which is yet to ship, and the learning still has not been implemented there.

Okay, so that was the implementation part: this is how it learns, and so on. The problems really start when it becomes a multilingual prediction system. These are the problems we faced: phonetic mapping ambiguities, the loosely phonetic nature of our languages, multiple input variants, and so on. We will tackle a few of them, and I will show you how we handled them.

First, phonetic mapping ambiguities. This starts with how few Roman letters we can accommodate on a keyboard, and how we predict new words which are not covered by them. There are two ways to write in a language like Bengali or Hindi: either a fixed-layout keyboard, or a transliteration system like the Avro keyboard, where you write using Roman letters but it produces the Hindi or Bengali word. That applies to most of the Indian languages, and it creates a real problem. One example: in Hindi, the Roman letter D can represent several different letters — द, ड, ढ, ध — and we have to handle all of them.

The other problem we face is the loosely phonetic nature of what people type. What we write is often not strictly phonetic. For example, "bachpan": we write it as B-A-C-H-P-A-N, not as B-A-C-H-A-P-A-N, which would be the strictly phonetic form. We cannot force users to type in a specific way, so we have to compensate for that: our system has to learn how people actually write, and it must not produce false predictions, which would completely throw off the system.

The other one is multiple input variants for the same word: different inputs should still produce the same output. For example, "rashtrapati": we can write it in different ways, R-A-S-H-T-R-A-P-A-T-I, R-A-S-H-T-R-A-P-A-T-H-I, and so on. These should produce a single output instead of multiple wrong outputs, because every wrong output throws our system off a bit.

So, how do we deal with this? This work is still going on; some of it has been dealt with in the implementation, and some has not, because then come the engineering problems, which I'll touch on a little. First, how do we deal with multiple input variants and those things? We start with an alphabet mapping definition. For this, we use a very basic approach: regular expressions. For every Roman letter, we create a rule saying: if you type this letter, these are the outputs it can map to.
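As a rough illustration of the ambiguity — not the actual rule set, which is regular-expression-based and corpus-validated — a plain lookup table can stand in for the alphabet mapping definition: each Roman letter expands to its candidate native-script letters, and the corpus later prunes the impossible combinations.

```javascript
// Illustrative stand-in for the alphabet mapping definition: each Roman
// letter maps to its candidate Devanagari letters (table is a toy subset).
const romanToDevanagari = {
  d: ['द', 'ड', 'ढ', 'ध'],
  t: ['त', 'ट', 'ठ', 'थ'],
  a: ['अ', 'आ'],
};

// Expand a Roman prefix into every candidate native-script prefix.
function expand(prefix) {
  let candidates = [''];
  for (const ch of prefix.toLowerCase()) {
    const options = romanToDevanagari[ch] || [ch];
    candidates = candidates.flatMap(c => options.map(o => c + o));
  }
  return candidates;
}

console.log(expand('da')); // ['दअ', 'दआ', 'डअ', ...] — later filtered against the corpus
```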
Then we go back and do the same with multi-letter combinations like "bh" and so on: from a corpus, we work out which outputs those letter combinations can produce, and we create a mapping definition for all of them.

Next, we create training data. We already had a corpus — we have about an 11 GB corpus for Bengali — which has all the words used in literature. We run the regular-expression-based alphabet rules we created over that corpus to come up with the specific word snippets that are generally used, and we validate whether they are real, actually-used words or just something generated by a rule. Since these are regular-expression-based rules, they can be altered or adapted for other languages as well, but you have to do that creation work.

Once you have that, we create a different classifier for each set of training data: different alphabet mappings each produce a different set of outputs, and you have a different classifier for each of them. We also try to implement a portion where we check the nearby words, so that we know which words are most likely to appear alongside which others; if a word is unlikely to appear there, its weightage will be less. That is how we calculate the weightage. To give a sense of the cost: the actual training data, which has about 5 lakh (500,000) unique Bengali words, took about 30 to 40 minutes to train after the first two steps, and that was on a PC cluster we have.

So how do we actually make the decision — how do we predict after all this? We use a very basic decision tree model to go through all the data and the new weightages. This part does not run on the mobile device, because it would kill the device: the Intex model launched in India has 128 MB of RAM, and if you implement this in JavaScript and run it there, you can guess what might happen. So that is an engineering problem that is still ongoing.

Then there was a very interesting problem. When you type in Bengali or Hindi, you sometimes use English words in between. For example, "I'm going to the JSFoo convention center": in the Hindi sentence, "convention center" is not Hindi. But if the system thinks it is Hindi, it will try to match it against its rules — for C-O-N it will check what rules it has and whether they match any Hindi word. We cannot let it do that for English words, so to segregate them, we simply use a lookup table: we already have the word frequency list for the English dictionary, so we look the word up there. If the word is present, it won't try to transliterate it; only if it is absent does it transliterate. That is how we are handling this.
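A sketch of that English pass-through check, assuming a simple Set stands in for the shipped English word frequency list and `transliterate` is a placeholder for the phonetic mapping step:

```javascript
// Sketch of the English pass-through described above: before
// transliterating a token, look it up in the English word list that
// already ships with the keyboard; known English words stay untouched.
const englishWords = new Set(['convention', 'center', 'keyboard']);

function handleToken(token, transliterate) {
  return englishWords.has(token.toLowerCase())
    ? token                  // keep English words as-is
    : transliterate(token);  // otherwise run the phonetic mapping rules
}

// Example with a stand-in transliterator:
const fakeTransliterate = t => `<hi:${t}>`;
console.log(handleToken('convention', fakeTransliterate)); // 'convention'
console.log(handleToken('ghar', fakeTransliterate));       // '<hi:ghar>'
```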
So, this was specifically for Firefox OS, mostly in JavaScript, and the code is available on GitHub. It is not merged into mozilla-central yet, so you can find it in my GitHub and a few others'. One generalization idea that came after this is that, since it is a general framework, you can see how it goes, improve on it, or implement it in other things. For example, Wikimedia came up — not recently, about a year ago now — with an editor called VisualEditor, which is actually based on Etherpad and EtherEditor, and which is now used for most wiki edits; you can write in your own language and it handles the input for you. One idea we had was that, if they are interested, they could take this and implement it so that each editor gets their own personalized predictions, which might or might not be a good idea. Those are the general ideas we had around this.

So, this is mostly what I had, and this part I really need to do: I am grateful to these people. Mandar Mitra, Majumdar, and Amandavadhai gave me access to their corpus data, which was invaluable for creating the word frequency list and everything we did. Sankarshan Mukhopadhyay from Red Hat and Alolita Sharma, who was at Wikimedia while I was doing this and is now in Twitter engineering, helped with a lot of ideas on how we could do this; and obviously the Mozilla team, for all the valuable input. That's mostly what I have today. I'm open to questions, and if you have any suggestions on how we can improve it or how it can be done better, or if you want to contribute, that's awesome. I think he has a question.

Hello. You said that you collected data from 50,000 Bengali words or something like that. In India there are 18 national languages, you could say; how much time would it take to process all that data?

So, what we first did was actually on 5,000,000 words, part of an 11 GB corpus which was provided to me. It really depends. This is a pre-processing step — we are not going to do it on the mobile, of course — and when I was doing it, I had access to a pretty good system, a cluster of some 32 CPUs, and it took me about an hour every time I wanted to re-process with new rules or something like that. Just the pre-processing, creating the word frequency list, is not that hard: there is even Python code available for it, and if you want to use Weka or something, there is code available for that too. But the other parts — for example, creating the alphabet mapping rules — you have to do manually and then implement in your system. The processing itself is not that much; it takes time, but with the fairly powerful systems we have — I had a cluster of PCs — it was quite okay. But yes, for every language, Kannada, Tamil, whichever, you first have to build the alphabet mapping definition. That is the hard part, and you mostly cannot create it unless you know the language: for example, I only knew Bengali well and had Hindi mostly from hearing people.
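For reference, the core of that pre-processing step — counting word frequencies over a corpus — is small; here is an illustrative Node.js sketch. The file name and whitespace tokenization are assumptions, and the real pipeline ran on a cluster over an ~11 GB corpus.

```javascript
// Minimal offline sketch: count word frequencies in a corpus file.
const fs = require('fs');

function buildFrequencyList(path) {
  const counts = new Map();
  const text = fs.readFileSync(path, 'utf8');
  for (const word of text.split(/\s+/)) {
    if (word) counts.set(word, (counts.get(word) || 0) + 1);
  }
  return counts;
}

// const list = buildFrequencyList('bengali-corpus.txt');           // hypothetical file
// [...list].sort((a, b) => b[1] - a[1]).slice(0, 10);              // top 10 words
```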
Another question: you said you defined the rules manually. So for 18 languages, do we have to hire 18 people from different languages who can also code?

That is one of the plus points we have: we really have a very good and very vibrant Mozilla community around India. I know that in Bangalore and Hyderabad there are a lot of groups, and all the localization of all the Mozilla products is done by volunteers. Go to MDN, go to any Firefox product: all the localization you see there is done by volunteers. They are already doing that; you just need to sync up with somebody who can say "this is how the words should behave", and pair them with somebody who does the coding. So that is possible. Thank you.

Hi. You said you are using a word frequency list and a bloom filter. Are you assuming statistical independence between the words, or are you considering an n-gram language model or something like that?

Oh no — that would have been better if we had been able to do something else, but I was told we cannot touch what is already there, because that would break the old system. The shipped word frequency list is also a ternary search tree, but implemented by Google, so they have their specific way of doing it: they created the list in a dictionary format, essentially a binary data blob, and we just take it and use it. The other part I mentioned, where we create our own ternary search tree, is on the device and in JavaScript. There are plus and minus points to that; I didn't actually touch on the limitations. The limitations are that when I create a ternary search tree on the device, in JavaScript, in memory, performance is not impressive when I do operations on it, like lookups or creating new entries. And one problem we might have, which we really need to look at, is that if a user keeps using a device for one or two years, with a lot of new words, we currently have no way to know how it will impact the device, especially on devices with this little RAM. Those are things we need to take care of eventually.
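For readers unfamiliar with the structure under discussion, here is a bare-bones ternary search tree sketch in JavaScript — insert and membership only, no weights — which is enough to see why per-keystroke operations on a 128 MB device are a concern:

```javascript
// Bare-bones ternary search tree: each node keys one character and has
// lo/eq/hi children. Real entries would also carry a weightage.
class TSTNode {
  constructor(ch) { this.ch = ch; this.lo = this.eq = this.hi = null; this.end = false; }
}

function insert(node, word, i = 0) {
  const ch = word[i];
  if (!node) node = new TSTNode(ch);
  if (ch < node.ch) node.lo = insert(node.lo, word, i);
  else if (ch > node.ch) node.hi = insert(node.hi, word, i);
  else if (i < word.length - 1) node.eq = insert(node.eq, word, i + 1);
  else node.end = true; // last character: mark a complete word
  return node;
}

function contains(node, word, i = 0) {
  if (!node) return false;
  const ch = word[i];
  if (ch < node.ch) return contains(node.lo, word, i);
  if (ch > node.ch) return contains(node.hi, word, i);
  if (i < word.length - 1) return contains(node.eq, word, i + 1);
  return node.end;
}

let root = null;
['wells', 'well'].forEach(w => { root = insert(root, w); });
console.log(contains(root, 'wells')); // true
```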
I had one more question. You were discussing schwa deletion, taking "bachpan" as an example. Do you have a JavaScript implementation of schwa deletion which also takes care of an example like "dhadkan"?

Oh, you mean "bachpan" versus "bachapana", the difference between them. That is not in JavaScript; that was in my pre-processing part. We do have something there, and it was not implemented by us. Have you heard of OmicronLab? The Avro keyboard people — when they came up with their iBus and JavaScript implementations, they tried to do this, and I went back and talked with them, because what they have doesn't quite work either. So we took a step back: instead of trying to handle it on the device, we tune those cases out through the mapping rules we use. Since there are context-aware predictions coming — which are not in the demo I showed — we can tune out a lot of cases like that. So, for example, the "bachapana" case is taken care of there, but the "rashtrapati" part is not. Thanks.

Hey. Hi. You said you're happy to have contributors for this project; could you elaborate on how people can get into it?

That is a very good question. We have a mailing list available — I should have put it up here — and there are IRC channels in Mozilla. You can just sync up with me, and there is Tim Chien (timdream), who maintains the whole keyboard part, so we can hook you up; that's not a problem.

And about the GitHub repo — you would like to see where the code is? Everything I showed today, the demo, every part of it, is already there: if you go to my GitHub and look for the Gaia fork, you have the whole code base and how to implement it on your device. The other processing parts are there too, but they are somewhat fragmented, so you need to know how you are going to put things together. Everything is on GitHub, but it is not documented well enough that somebody can walk in and immediately see how to do everything; the separate portions are there and you can use them right now. Thank you.

I have a question. While demoing on your Firefox device, when you started typing "jsfoo", it started suggesting "usf-something". What is the logic behind that? Ideally it should not suggest anything starting from J, right?

Yes, that's what I thought initially, but remember how this is currently implemented on the device: even with the weightage, it matches the prefix of the whole word and tries to complete the last characters, so for "jsfoo" it matched on the typed prefix and then looked for completions. Apparently it couldn't find anything in the dictionary starting with "jsfoo"; ideally, in that case, it shouldn't have suggested anything, but it suggested things like "usf...", not starting with J. That is currently a problem. I haven't actually looked at why it suggests those, and I have since replaced the whole suggestion portion in my implementation, so I will check why it comes up with U.

Hi. Are the weights decided solely by the frequency of the words, or do you have other signals, like most recently used, something of that sort?

Not most recently used, no. Again, what is implemented today is based on the word frequency list and the ternary search tree. In the new one coming up, where we are still processing the data and then putting in the weights, even there we are not considering how recently you typed something. We are obviously considering how frequently you type something — that is already in — but along with that, we look at how those words appear alongside other words in the literature; that is what we take from the corpus. If you are typing a word like "her", what are the immediate words which come with "her"? A context kind of thing, which sometimes works and sometimes doesn't give a good result, but that is what we are doing.
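A toy sketch of that frequency-plus-context weighting; the tables, numbers, and boost factor are all made up for illustration, and the real system computes these offline from the corpus:

```javascript
// Illustrative scoring: base weight from frequency, plus a boost when the
// candidate often follows the previous word in the corpus.
const frequency = new Map([['heart', 120], ['her', 300]]);
const followsAfter = new Map([['my heart', 40], ['my her', 0]]);

function score(prev, candidate) {
  const base = frequency.get(candidate) || 0;
  const context = followsAfter.get(`${prev} ${candidate}`) || 0;
  return base + 2 * context; // context boost factor is illustrative
}

console.log(score('my', 'heart')); // 200: context lifts 'heart'...
console.log(score('my', 'her'));   // 300: ...but raw frequency still matters
```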
And does it make sense to have these weights defined per context? For example, an SMS app — some words we use specifically in one context. Would it be feasible and helpful to store them per context?

It all depends, once we generate the data, on how good or bad a prediction it gives. For example, the context feature I mentioned is hit-or-miss: it is still roughly 50-50, with false positives around 40 percent, and that is not good enough to actually ship. So we are trying to see how we can improve it without putting the processing in the mobile device itself. The other problem is that, because of the privacy policy, we cannot take whatever the user has typed on the device and send it back — which is what every other keyboard does: Swype or SwiftKey all have a portion in the cloud, enabled by default or when you opt in, where whatever you type becomes data it trains itself on. That is something we should not do. Thank you.

Hello, I just wanted to answer the question from over there, about why a word starting with J got suggestions starting with U. It is not only prediction; it is partly auto-correction as well. On the keyboard layout, J and U are quite close, so when you start a word with J, it may give you suggestions on the assumption that you meant U rather than J.

Yes — that was part of the fixed-layout keyboard, which was already implemented.

Hi, great talk, thanks. How do you test something like this?

Very badly. For the Bengali part at least, I reached out to the community and asked how many of them were ready to test a specific build for me — maybe not on their daily-driver device, but typing real things there every day, not just gibberish. For that, I enabled logging, so people could send me the logs and I could train on them. I worked on this until August last year, and by then I had about three months of data from 30 people in the Bangladesh community. And that exposes another problem which I haven't talked about: the corpus I used has words from both bn-IN and bn-BD Bengali, and there are certain words which are available in bn-BD but not in bn-IN and vice versa, so you have to be really careful about how you train that part. I think mine ended up somewhat biased that way; that was my case. Thanks.

Thank you. Thank you, Rabimba. Folks, we break for