Good afternoon, everyone. It's the last day. Seriously, I was very excited to be here, back at PyCon, with so many people, everyone working on Python. I work on Python too, and I learn something new every time I come to PyCon: we go back, experiment, come back the next year, meet new people. That's been happening for some time now. So today we are here to present IndicNLP. We have been working on this for a year now. I'll start with the story, because everyone has been asking me how we started this: why did you go with Indian-language NLP, what was the motivation? It's a funny story, actually. First I'll introduce my teammates. I'm Adam Shamsuddin, this is Selva Kumar, and this is Kamal Raj. We also have Soham, who isn't with us today because he's in Singapore doing his masters. Soham worked on Bangla, Kamal and I worked on Malayalam, and Selva worked on Tamil. So, let's go forward. Everyone has this dilemma, right? Should we do it or not? That's how this all began. All of us work at Saama Technologies. When we started, I was working on a project called Daliya, a chatbot platform, so we were using all the latest NLP techniques: transfer learning, ULMFiT, different architectures. And I was thinking: why can't we do the same thing in our own languages? There is so much out there, a lot of research papers, a lot of datasets, but why isn't there anything in our mother tongues, in Tamil or Malayalam or Hindi? There isn't much, right? So I went through a couple of papers to see how we could get a dataset. That's when I saw a paper from Salesforce, from Stephen Merity, and I didn't know him personally. So I thought maybe I'd tweet at him and ask how he got the dataset, and how I could do the same thing. This was back in August last year.
I was just thinking I'd tweet at him, but then I hesitated. Should I do this? He's a big name; will he respond? What will everyone think? I thought, okay, let's do this. So I sent him a tweet, and he responded: "Yeah, we have a script for that. Just send me an email or a message and I'll send it to you." Then we wanted to try ULMFiT, which had just come out. We were excited because it was transfer learning, and there hadn't really been transfer learning in text before. We wanted to do it for Malayalam, because we don't have enough data, right? I also tweeted at Sebastian Ruder and Jeremy Howard. If you're into NLP, you know Sebastian Ruder is really, really good. I tweeted at him and he also responded. That was a boost: these people are really big in NLP, and they're ready to help anyone, anywhere in the world, if you just have Twitter. Just tweet at them; they'll be happy to respond. So we got the data, because there were scripts to help with these things; we went and downloaded it. What next? That's where the story of the banana cake comes in. It's a funny story again. I had the dataset but didn't know what to do with it, because I only knew a little NLP. But our team has very good NLP people, like Selva and Kamal. So I asked Kamal, "Man, can we do ULMFiT on this? Are you interested?" He said, "Yeah, we'll do it next week. Maybe if you buy me a banana cake, I'll do it today." So I went out and bought a banana cake. He had two slices and started coding, and I told him, "When you finish it, you can have the whole cake."
So he got excited, because of the cake, and he started doing it, and we were really excited that we got a very good perplexity on the language model we trained. Then we were thinking: what should we do next? We have a language model, and a language model predicts the next word; what next? That's when we realized there aren't any datasets for any task in Malayalam. So we went to a popular Malayalam website, downloaded some newspaper articles, categorized them into five classes, like sports, business, and entertainment, did text classification on top of that, and got a very good accuracy of 92 percent. That's when we realized we could actually apply a lot of deep learning techniques in our languages. When we were done with that, I told Selva we had something good. He said, "Oh, you guys did it? I also wanted to do it, for Tamil." So he got excited and did the same for Tamil, and then Soham joined and did it for Bangla. That's how it all started. When we went in deep, we realized there isn't much out there: no publicly available datasets in Malayalam, Tamil, or other Indian languages, and even where datasets exist, they are with academia, with universities in India, and they are not willing to give them to outsiders. You might see a lot of research publications in IEEE saying they got a good accuracy on some dataset using some model, that they have a good classifier, but none of these things are publicly available. So people like us can't actually work on them, improve them, and maybe bring in new models. And then one day we found an Indian dataset that was publicly available, and we were so excited. I don't remember the name exactly.
It was FIRE, I guess. We downloaded it and realized it's a zip file, but password protected, and they were not willing to share the password with us. So these are the issues we went through. Next I'll call Selva, and he'll explain why language is hard and why we need an open data platform.

Okay, hi, I'm Selva. Why open data, and why language? I have a 14-year-old friend who develops Android applications, and I have a 40-year-old friend who does data science. I used Arduino to create robots, and people use Raspberry Pi and BeagleBone to create drones, and there are entire hardware movements to create mesh networks, like an alternative Internet. All of this is possible due to many factors, but one of the primary ones is the free software movement: having access to the code and to how the software works. The software we use, Vim, Emacs, Linux, Firefox, and of course Python, is very famous and has stood the test of time. And they all have something in common with the bicycle: they are easy to modify. You can disassemble and reassemble a bicycle in a matter of half a day or a day. And why does that matter? One of the things with free software is that it is easily customized; it can be tailored to run on anything from smartwatches to supercomputers. How did this happen? There are tech-savvy people like Linus Torvalds and Stallman who have worked on the
harder code bases, but there has also been community effort from the common population. If communities have access to the inner workings of the things they use, civilization moves forward smoothly. But there is a problem coming up, because software is changing, in the sense that the inner workings of software are not just code anymore. The code doesn't give you a full picture of what the software does, because ML and AI systems are not just hand-coded; they learn from data. Even if you have access to the entire codebase, without access to the data it learned from, it is very hard to modify it to suit our needs. That's why open data is essential if the revolution that happened with free software is to happen again in machine learning and AI.

Now the other thing: why language? I'll come to that now. People thought solving chess, checkers, and Go would lead to AI, because intelligence was thought to be rational and calculating. We have solved chess, Go, and checkers almost completely, but we still don't have what we meant by AI. The definition of intelligence is a little different, and it's evolving. One definition I have, as a guy who works in NLP, is: if a system understands language, I can safely say it is pretty smart, at least smarter than a monkey. So why language? Everybody speaks a language; how hard could it be? Let's see. In French medieval times they had a game: there's a pot you pool your money into, and if you hit a chicken with a stone, you get all the money. This is where the term "pool" comes from, because chicken in French is "poule". That's where the memory pool and thread pool we all use get their name. Another thing: why "cat"? CAT refers to an animal, but why does it refer to the animal that meows and not the dog that barks? What is it about the letters C, A, and T
in that order that makes up a cat? Is the meaning of a word composed from its individual letters, or is the relationship arbitrary? That's at the word level. When you go to the sentence level, what do you think of this: "Let's talk about rights and lefts. You're right, so I left." If you only hear it, there are too many ways to cut it up: "rights and" or "right sand", "our right" or "outright". Chunking sentences into word pieces is a very hard thing to do. For us it's easy, but you can't just feed text to a tokenizer and expect it to work. Another thing is how words follow an order and create meaning out of the individual words. "How are you" is a meaningful sentence in English, but "our how you" is not. That is the order of words: the syntax. Then there is this: "The green marble went to sleep furiously last year." It's a valid English sentence in terms of noun and verb order, but it doesn't make any sense, right? That is a semantic problem: we need knowledge of the world we live in to understand the meaning of a sentence, and for a sentence to be valid it has to follow some worldly laws, like physical laws. One more thing: when your wife yells, "Look at the mosquitoes!", she does not admire the mosquitoes. She's telling you the windows are open and asking you to shut them. That's why language is a very hard problem. And if your system understands language, we can impart knowledge to it just by talking, which we all hope to do within at least the next 50 years. I'll hand over to Kamal, who will talk about the technical aspects of the IndicNLP platform.

I will explain what we have done so far. We collected our initial data from Wikipedia, as unstructured text. That's what we created our language model from, with a perplexity of around 36. After developing the language model, we didn't know how to evaluate it; we needed a downstream task to evaluate the performance of the language model.
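As an aside on the perplexity figure just mentioned: perplexity is simply the exponential of the average negative log-probability a language model assigns to each held-out token, so a perplexity of 36 means the model is, on average, as uncertain as if it were choosing uniformly among about 36 words. A minimal sketch (the per-token probabilities below are invented for illustration, not from the actual model):

```python
import math

def perplexity(token_probs):
    """Perplexity = exp of the mean negative log-probability
    that the model assigned to each token of a held-out text."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))

# Hypothetical probabilities a model gave to four tokens:
probs = [0.2, 0.1, 0.05, 0.4]
print(round(perplexity(probs), 2))  # lower is better
```

Equivalently, perplexity is the inverse geometric mean of the token probabilities, which is why a perfect model (probability 1 on every token) has perplexity 1.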
So we collected data from news articles in different categories: entertainment, sports, business, and so on. That's how we got data for our supervised classification models in Tamil, Malayalam, and Bangla. As our first deep learning approach we created a word2vec model, using the open source tool Gensim. We have created word2vec models for Malayalam, Bangla, and Tamil; we have already open sourced the code and models, and we also have a UI on top of them. After that, we tried transfer learning in NLP, using ULMFiT, and currently we are working on BERT. The first task in most NLP problems is how to represent text in a way a machine understands. Initially there were statistical models like bag-of-words, where we just take the count of each word, so the words don't carry any meaning. Here, instead, we represent words in a continuous vector space, where semantically similar words are mapped to nearby points in the geometric space. So we can click on a word and get all the similar words associated with it; that word-level meaning is very important when we come to NLP problems. Word2vec has two models: continuous bag-of-words (CBOW) and skip-gram. The CBOW model takes the surrounding context words as input and tries to predict the current word. The skip-gram model is exactly the opposite: it takes the current word as input and outputs the context words around it. We trained the skip-gram model for Malayalam and Tamil, because we don't have a lot of data, and skip-gram works better with less data. And for evaluating the unsupervised model, again, we don't have any benchmark dataset.
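To make the skip-gram versus CBOW distinction concrete, here is how skip-gram training pairs are built: each center word is paired with every word inside its context window. This is a toy sketch in plain Python, not the team's actual training code; in Gensim, passing `sg=1` to `Word2Vec` selects skip-gram:

```python
def skipgram_pairs(tokens, window=2):
    """Skip-gram training pairs: (center word -> one context word).
    CBOW would instead group all the context words as the input
    and make the center word the prediction target."""
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

sentence = ["the", "cat", "sat", "down"]
print(skipgram_pairs(sentence, window=1))
# [('the', 'cat'), ('cat', 'the'), ('cat', 'sat'),
#  ('sat', 'cat'), ('sat', 'down'), ('down', 'sat')]
```

Because every center/context pair becomes its own training example, skip-gram extracts more supervision from a small corpus, which matches the speakers' observation that it works better when data is scarce.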
So what we actually did is load our word representations into the TensorFlow Embedding Projector and manually analyze the word clusters. That's how we tuned the hyperparameters, and we manually chose the size of our vocabulary; most of our word2vec vocabularies are around 10,000 words. Then, after creating the word2vec models, we tried normal LSTM classifiers. Okay, first I'll show a demo of how word2vec works. This is Bangla: we click on a word and see its neighbors. In Tamil, there is a word for "rat"; when we click on it, we get all the animals, like rat and cheetah. In Malayalam, when we click on the word for "January", we get all the months related to January: November, December, and so on. Okay, so after creating the word2vec model, we created a normal LSTM-based classifier. As the first layer we use the pretrained embeddings we already created with Gensim. The second layer is a bidirectional LSTM; a bidirectional LSTM captures word meaning from both directions, since in a sentence a word can have dependencies in both directions. The final output of the LSTM layer goes into a fully connected dense layer followed by a softmax activation. With this configuration we got around 89 percent accuracy on our news classification dataset, without any fine-tuning of the hyperparameters. It took around one hour to train. After doing that, we were excited to try transfer learning in NLP.
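The "click a word, see similar words" demo described above boils down to nearest-neighbor lookup by cosine similarity in the embedding space (this is also what Gensim's `most_similar` does). A toy sketch with invented 2-D vectors; real models use 100 or more dimensions:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def most_similar(word, vectors, topn=2):
    """Rank all other words by cosine similarity to `word`."""
    scores = [(other, cosine(vectors[word], vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda x: -x[1])[:topn]

# Invented toy embeddings: months cluster together, animals elsewhere.
vectors = {
    "january":  [0.9, 0.1],
    "december": [0.8, 0.2],
    "rat":      [0.1, 0.9],
    "cheetah":  [0.2, 0.8],
}
print(most_similar("january", vectors, topn=1))
```

With well-trained embeddings, clicking "January" surfaces the other months for exactly this reason: they end up nearby in the vector space.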
So, transfer learning is an approach where you pretrain a large model on a huge amount of data, then transfer that domain knowledge to a similar, simpler task where you don't have much data. Transfer learning is very common in computer vision; ULMFiT was the first paper to do it for text, and the implementation is from fast.ai. What we do for ULMFiT is this. First, as in the first figure, we train a language model: during language model training we feed in two or three words and try to predict the next word. That's how the language model learns, and in the process it learns semantics and the relationships between words. The second step is to fine-tune the language model on our target dataset: the language model was trained on Wikipedia, and it is fine-tuned on our news classification data. Finally, in the third step, we remove the final language model layer and add a softmax layer, as in the LSTM classifier, to classify the news. Here, with around 10k samples, we got around 95 percent accuracy. Currently we are working on BERT for Indian languages. BERT has a slightly different architecture from the other transfer learning models: it is built on the Transformer, the model from Google's "Attention Is All You Need" paper. It is also a little different from other language models: it is a masked language model, where instead of the next word we predict masked words, and there is one more task, next sentence prediction, so it trains jointly on pairs of sentences. Even in English, BERT has improved results substantially.
It gave around a five percent improvement across NLP tasks, so currently we are working on it. Even though the code is open sourced by Google, the tokenizer is not compatible with our Indian languages, so we are working on the tokenizer part, and as soon as we finish we will open source that too. All of our models and code are open, so you can check out our IndicNLP repositories. Now Adam will talk more about our platform.

So, whatever we have done, we put on GitHub, and we found a lot of people interested in the project. We wanted to bring everyone together, and that's when we launched the website, indicnlp.org. We put all the information about the project there: what we have done so far, whatever tools we have, all the GitHub projects. There is no data tagging platform for Indian languages, maybe not for any language, that is publicly available. We wanted to do a translation task, so I built a small Django app for translation, but it didn't work that well. That's when we found a recently released project called doccano. Right now we have added a dataset for named entity recognition, so we can select spans: here I'm selecting "Bangladesh", which is a location, and "Tajuddin Muhammad", which is a person's name. This is a news article, and we are tagging it for named entity recognition. Similarly, in Tamil we have a dataset uploaded and we have just started tagging. So now we come to the big question: how can you help us? How can we do this as a project together?
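For reference, span annotations like the tagging demo just described are usually stored as character offsets plus a label. This is a sketch of reading one doccano-style JSONL record; the exact export schema differs between doccano versions, and this record is invented for illustration:

```python
import json

# One doccano-style record: the raw text plus [start, end, label] spans.
record = json.loads(
    '{"text": "Tajuddin Muhammad visited Bangladesh.",'
    ' "labels": [[0, 17, "PERSON"], [26, 36, "LOCATION"]]}'
)

def extract_entities(rec):
    """Resolve each character-offset span back to its surface text."""
    return [(rec["text"][start:end], label)
            for start, end, label in rec["labels"]]

print(extract_entities(record))
```

Keeping offsets against the raw text, rather than storing pre-tokenized words, means the annotations stay valid no matter which tokenizer is applied later, which matters here given the tokenization problems discussed above.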
So, if you know how to tag data, you can just come to our platform and help tag. If you know Tamil or Malayalam or any other Indian language, you can work with us and help extend IndicNLP to all Indian languages. If you know both English and an Indian language, we can use your help for translation; there is no publicly available translation dataset for Indian languages. Similarly, if you are a linguist, you can help us frame good tasks that are different from the English ones; there will be tasks specific to Malayalam or Tamil that we don't know about yet. The way NLP in a language improves is that someone creates a dataset and people try to build state-of-the-art models on top of it. So if we can create good datasets in Indian languages, then lots of different people will compete to create the best state-of-the-art models, and we will end up with good code and good datasets publicly available. That's what we are aiming for right now. These are some of the other contributors who have helped us in building the UI, the websites, everything, even the crawlers: we have written crawlers that go to Indian websites and fetch news data, and you can also contribute to those.

Can you talk about your volunteer community, especially the non-technical people?

So, after coming here I spoke to a lot of people and we have started a Telegram group; you can join it, and I'll be providing the link to the Telegram group on the website. There is also a Google developers group you can sign up for.
You will get emails there about all the different languages. There's also a Facebook group you can be a part of. So yeah, I would suggest joining the Google group, because you'll be getting the emails, and if you're working on anything you can also share it with everyone. Yeah, our gang will also join, sure.

Hello. I'm asking what kind of tokenization method you used for Malayalam or Bangla.

Currently we are tokenizing only on spaces and punctuation. That's actually not correct, but the model learns to cope with it anyway. We are also working on byte-pair encoding and WordPiece-vocabulary-based tokenization.

You can use byte pairs with English, and I'm working on the same thing; how can we use byte pairs for Hindi or other Indian languages? Can you share something?

Yeah, there's already a library by Google called SentencePiece, if you know it; you can train it on a large corpus. We already created the WordPiece vocabulary for Malayalam for BERT training. Okay, I'll share the vocabulary with you. Okay. Thank you.

Most of the approaches you mentioned, crawling or training a word2vec model, are unsupervised and can be applied to any language. So what did you have to do specific to Indian languages? How is training NLP models for Indian languages different from training them for any other language, say English? Most of the stuff in your talk was generic.

Technically we didn't do anything different. The one thing we're trying to do is create a tokenizer; that's the problem we face right now. But this is one of the very few attempts to do language modeling for these languages and release the models and data publicly.

Are there any challenges you faced with Indian languages that you wouldn't face with English?
One primary thing is tokenization, because even with byte-pair encoding, the Unicode representation of our Indic languages is a little different. What looks like a single glyph, if you know Tamil, can be a consonant plus a dot on top, and those are different Unicode code points. This creates a problem when you use byte-pair encoding, and we are working on solving that. We are also seeking help from people who have knowledge of linguistics, who understand the language and its grammatical structure, because we are not linguistically savvy people. That's one challenge we face. The other thing is the lack of datasets. We can crawl newspapers, that's easy to do, but creating named entity recognition or part-of-speech datasets is harder. For named entity recognition, anyone can tag the data: you understand what's a name and what's a place. But when it comes to POS tagging, part of speech, whether a word is a noun or a verb or something more complex, that is not easy for ordinary people to do. So we need linguistically savvy people to help with the annotation.

Do you have anything on transliteration: people writing Tamil in English letters, or Hindi in English letters? That's what people do when they don't have a native keyboard.

For Tamil there is one project, Open-Tamil; I think they have a transliteration package. For Malayalam there is also a publicly available transliteration tool. We are actually thinking about applying deep learning on top of transliterated text, because applying BERT and these libraries on top of transliterated text would be easier. We haven't tried that approach yet, but we are looking forward to it. Also, data is available in English: if you do named entity recognition, a lot of data is available in English, and if you can convert it into your language, it becomes easier in some ways.

Okay, but there's a lot of early content, even in government.
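The code-point issue Selva describes can be seen directly in Python: the Tamil syllable க் renders as one glyph (the consonant க with the puḷḷi dot on top), but it is stored as two Unicode code points, which naive character- or byte-level tokenization can split apart.

```python
import unicodedata

glyph = "\u0b95\u0bcd"  # Tamil க + the pulli (virama) dot

print(len(glyph))                                # 2 code points, 1 visible glyph
print([hex(ord(c)) for c in glyph])              # ['0xb95', '0xbcd']
print([unicodedata.name(c) for c in glyph])      # KA + VIRAMA
```

A tokenizer that segments on grapheme clusters (the user-perceived characters defined in Unicode's text segmentation rules) keeps such sequences together, which is one way to make subword methods like BPE behave sensibly for Indic scripts.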
They don't use Unicode much. In Tamil, for example, it's in Anjal; and if you look at the Railways, they are not using Unicode even now, as far as I understand. For older content you have Murasu Anjal and a lot of other encodings from Tamil Nadu; we had a lot of different standards earlier. I did something on this about 20 years back, so I'm looking at whether you can cover that part.

One thing I want to mention: you talked about translating datasets, right? There is the SQuAD dataset, a question answering dataset: you are given a passage and some questions, and you're supposed to answer those questions within the context of the passage. We tried translating it, and most of the sentences were gibberish. To give an example, there is a Tamil phrase which means "there is an old lady in the moon who is cooking", but the translation we got was "there is an old lady burning in the moon". So it is a little hard to just trust the translation, even when it is from Google. And I think the translation of "cockroach" into Malayalam came out as something I don't think we can say in public.

Hello, I had a slightly different question. You talked about tokenizers; can we apply techniques like CTC loss on text? For example, if I'm doing handwriting recognition and I want to run an NLP task on what I get from it, like using CTC loss to fix the output, or word n-grams to predict missed words or sentences. Can you talk a little about those kinds of applications in Indic languages?

If I understand correctly, you are applying OCR, right? I don't have personal experience with OCR, but we trained a speech-to-text model that also uses CTC loss; still, I don't understand CTC completely enough to answer your question.

What about skip-grams or n-grams?
So if I'm trying to classify, and I'm missing a word in the middle, trying to predict that word based on the previous words and the words after it?

Yeah, you could use a skip-gram model, or any language model.

So have they worked properly with Indic languages in your experience, or have you faced any problems? That's my question.

You could try our ULMFiT model; that's our state of the art now, at around 36 perplexity. We actually tried generating samples with that language model, and it was pretty good; grammatically it was correct. We generated from scratch, so we didn't try just filling in some words, but I can assure you it works. Okay. Thank you.