This is a lot more people than I expected on a Sunday morning; I wouldn't be here. So thanks for coming. I'm Aadash, and I'm from India. The title sure sounds like clickbait, but I'll try to deliver. We're going to look at different applications of NLP and how you could develop them, maybe to propel your career and to have an impact on strategy. Everyone here already knows what NLP is, so I'm not going to spend a lot of time on this: natural language processing is the part of AI in which you teach machines to recognize text or speech data, take it as input, and produce some form of text or speech as output. The best examples are the most advanced NLP tools out there, Google Now or Siri. It takes a lot of research, in a lot of languages, to develop tools like that. So we're going to look at how you go from nothing to that level. Let's start with the problem. This is data from eight years back. Here you can see that English leads the internet, then there are these other major languages, and then there's the rest. But things have already changed: this bar, I think, has already overtaken English, and in the coming years it's going to grow exponentially. The thing is, these major languages are heavily supported. They have a lot of research, a lot of tools, and a lot of people working on them. But this part right here doesn't. It doesn't have many people working on it, much research, or many tools. And it covers billions of people who use these languages and prefer them over English. So here we'll discuss how to get into these languages and take them on par with languages like English, and what kind of tools and research can be done in them to get there. Before going into that, we have to discuss what resource-rich languages are.
We just saw English and a few other languages. You could call them resource-rich, because there are a lot of tools, a lot of research, and a lot of corpora for them. But around 95% of languages are resource-poor, and just in India, this is the number of people who speak these so-called resource-poor languages. The problem is that there's no solid foundation for developing tools for them. So if you're a developer or a machine learning engineer who wants to build tools for these languages, there's a lot of potential, but you don't have a solid foundation to build on. I'm only talking about India here, but I'm pretty sure you can apply this to other languages too. Currently there are 234 million Indian-language users who prefer Indian languages over English, and that number is obviously going to increase. Look at the last point: 9 out of 10 internet users coming online between 2016 and 2021 will prefer their own native language over English. If not many people work on this, those users will have to switch to English, and if you think about it, that's an inconvenience, isn't it? If I speak English and I'm good at English, then I'd definitely like my tools to be in English, or Chinese, or whatever language I'm used to. So what do those people do? They can't do anything unless they're developers. That's why I wanted to give this talk: how to take these resource-poor languages and give them a solid foundation of resources, research, corpora, and tools. All languages can be classified based on their resources. There's English at the top; every tool comes to English first. Think of Google Now: it was released for English first, and you can currently use it in, I think, Spanish, French, Danish, and a few languages like that. Then we have the next ten or so languages.
Then we have all the Indian languages, some popular languages in Africa and, I think, Southeast Asia, and a few others. Japanese also falls into the top ten languages after English. Our goal should be to take the level-four languages to level three. That alone is a humongous task: it takes a lot of research and, like I said, a lot of development time. It's not something one person can do alone, but it is something one person can start. You can start a revolution to propel these languages from level four to level three. Let's see how. How many of you here have used Grammarly? OK, that's good enough reason. I'm pretty sure that if it weren't for Grammarly, I wouldn't be here. It's such an amazing tool; it corrects your grammar. Now suppose a commercial company wanted to develop an application like Grammarly for some other language, say Hindi, which, as we saw, has on the order of 400 million users. Even if only 0.1% of those people decided to use it, that would be about 400,000 users. That alone is huge potential for a startup. But they can't build it. Why? Because there aren't enough resources, there aren't enough tools, there's basically nothing there, and there aren't many people working on it. They would have to start from scratch, which is a lot of work. So now we'll discuss how we can give such commercial companies and startups a proper foundation. It starts with a corpus. What is a corpus? A corpus is basically a fancy word for data, textual data. Here we see a couple of random sentences, for example "He opened the window." This is a morphologically tagged corpus. What that means is that "I" is tagged as a pronoun, "hope" as a verb, and so on; it has tagged the parts of speech. "Window" here is a noun. From this, you can understand that "window" is an object.
But how do you make machines understand that "window" here is an object? I'll give you an example: "I fish a fish." We know that the second "fish" is a thing, a noun, and the first "fish" is an action, a verb: I'm going to a pond, I'm fishing for a fish. Now suppose I wanted to translate this into French. I don't remember the French exactly, so I'll write it down, and I'm sorry in advance for offending any French speakers: "Je pêche un poisson." In English the two words are the same, but in French they're completely different. So how do you tell the machine that this one is a verb and this one is a noun? That's why a tagged corpus is important. It tells the machine that this is an action word and this is a thing, and that makes everything easier, from text-to-speech to translation. Everything needs a corpus. So the start of developing any commercial tool is having solid data: a tagged textual corpus. And a few lines won't do anything; it needs thousands, maybe a hundred thousand, maybe a million lines. That's where a POS tagger comes in. A POS tagger is the foundation of almost any natural language processing task; we'll see how. These are a few applications of a POS tagger. Now let's look at this: a tool that I made for the Malayalam language. I'm giving it a random sentence. I don't know what it means; I can't even read it. I developed this tool, and I didn't need to know the language to develop it. That's the best thing: you don't need to know the language, you just need a proper corpus, which is to say, data. So I've entered the text. This tagger was trained on data, a pre-tagged corpus, like the one we saw on the previous slides. Now let's see what a POS tagger does. We got back a tagged sentence: we can see this is a noun, this is a verb, and so on.
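To make the corpus-to-tagger step concrete, here is a minimal sketch in Python of a frequency-based tagger trained on a tiny hand-tagged corpus. Everything in it is invented for illustration: the three-sentence corpus, the tag names, the fallback rule. The real Malayalam tagger in the demo was trained on a far larger pre-tagged corpus.

```python
from collections import Counter, defaultdict

# Toy tagged corpus: each sentence is a list of (word, part-of-speech) pairs.
# A real corpus would need thousands to millions of tagged lines.
tagged_corpus = [
    [("I", "PRON"), ("fish", "VERB"), ("a", "DET"), ("fish", "NOUN")],
    [("He", "PRON"), ("opened", "VERB"), ("the", "DET"), ("window", "NOUN")],
    [("She", "PRON"), ("caught", "VERB"), ("a", "DET"), ("fish", "NOUN")],
]

def train_bigram_tagger(corpus):
    """Count how often each word carries each tag, conditioned on the
    previous tag, so 'fish' after a pronoun can differ from 'fish'
    after a determiner."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        prev = "<S>"  # sentence-start marker
        for word, tag_ in sentence:
            counts[(prev, word.lower())][tag_] += 1
            prev = tag_
    return counts

def tag(counts, sentence):
    """Tag a sentence by picking the most frequent tag seen for each
    (previous tag, word) pair, falling back to NOUN for unseen pairs."""
    result, prev = [], "<S>"
    for word in sentence:
        options = counts.get((prev, word.lower()))
        best = options.most_common(1)[0][0] if options else "NOUN"
        result.append((word, best))
        prev = best
    return result

model = train_bigram_tagger(tagged_corpus)
print(tag(model, ["I", "fish", "a", "fish"]))
# → [('I', 'PRON'), ('fish', 'VERB'), ('a', 'DET'), ('fish', 'NOUN')]
```

Because the counts are conditioned on the previous tag, the same word "fish" comes out as a verb after a pronoun and as a noun after a determiner, which is exactly the "I fish a fish" ambiguity above.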
So that's what a POS tagger does, and these are a few applications of having a proper one. This is how we build the foundation for a language: having a good corpus. It helps in classifying and clustering data, extracting topics, and every next level of tool you have to build. So why is not having research, corpora, and tools bad? This is one of the reasons: commercial organizations don't have the motivation to build these tools, and having them helps boost the economy. As I mentioned with Grammarly, they were able to create their tool because there was already a lot of research and a lot of resources in that field. If a language doesn't have this, no commercial organization or developer can even think of ideas, because they'd be starting from scratch. So we need all these tools; language support by commercial organizations basically boosts the economy. Now, ELSNET was an organization based, I think, in the Netherlands, and they proposed a concept called BLARK, which stands for Basic LAnguage Resource Kit. Its core idea was that you have a set of ideals, a roadmap, for processing a language: whatever language you're working in, first you start with the corpus, then you move on to a POS tagger, and so on. It revolved around having an organization to take in a language, and if you manage to move one language on to the next level of resources and support, you can do the same for other languages; you can apply the same set of ideas to many languages. Unfortunately, it was basically abandoned in 2003. It was proposed for European Union languages, but it could be applied to the whole world. And they left it; I don't know why. I tried pinging them, but I didn't get a response.
So basically they abandoned it. What will happen in the future if we don't do this? There are a lot of languages out there; what happens if we don't provide support, resources, and tools for them? There are three outcomes, and it depends on your philosophy. Maybe you're happy with having a few languages that are fully developed, so that fifty years from now there may be only three languages, say English, Chinese, and Russian. If you're OK with that, then it's OK. But out of these three outcomes, the last one is the most promising, because language and speech technology will be used to ensure participation from all around the world. If I could choose between the three, I would choose the third one, not because my language is one of the resource-poor languages, but because it would be helpful for the whole world. This is what we talked about, the Basic Language Resource Kit: the minimal set of language resources that is necessary to do any pre-competitive research and education at all. Commercial organizations are built on these resources, research, and tools. These are some of the components of BLARK. It starts with the corpora, which is basically data: textual, tagged data; all of these are data and collections. Then we have taggers, tools, text-to-speech. These applications have to be built on proper written and spoken language corpora. Suppose you wanted to develop Google Now for Hindi: you would have to have good spoken and written language corpora. Here Google is ahead, because they already have a lot of data, but other languages aren't so lucky. If you wanted to develop some tools, you could start with these: first a POS tagger, then a lemma extractor; we'll look at some of these applications after this. And then named entity recognition. What does it do? Everyone uses Netflix, yeah? Probably. And everyone listens to the news and reads the news.
When you go to a news site, articles are grouped into world news, regional news, sports news. When you have thousands and thousands of articles per day, how do you group them? Manually, you can't; it would need a huge amount of human resources. That's why we need named entity recognition software. What it does is detect entities: if it detects a person, it knows the name; it knows the location, the organization, what the context is about, so it can help group articles into similar topics. It's a hugely useful tool, no matter the language, and a very important core resource for any language. So that's what named entity recognition does: it identifies the things in text, locations, organizations, persons, and helps websites group content according to context. Taking the example of Netflix, recommendation systems can work wonders for a media company. So if you develop named entity recognition software for one of these languages, a private organization might approach you and try to buy it from you, or even support you, because it's an amazing tool. These are a few end-user tools you could build. As I said, you need a corpus and a lot of research first; that's what we covered in the first part, building toward these end-user tools. Now let's look at coreference resolution. There's a popular meme about natural language processing: "What do we want? Natural language processing! When do we want it?" Here we are talking about natural language processing, but on the next line, the machine doesn't know what we're talking about. When do we want what? It doesn't know we were talking about natural language processing a few sentences back. That's what coreference resolution does: it remembers what earlier references point to. So where are we now? We're on level one. We don't have anything.
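As a rough sketch of the grouping idea, here is a toy named entity recognizer in Python that looks tokens up in small hand-built entity lists (a gazetteer). The names, headlines, and entity lists are all made up for illustration; a production system would learn entities from a large annotated corpus rather than rely on fixed lists.

```python
# Gazetteer-based named entity recognition: look each token up in small
# hand-built lists of known people, places, and organizations.
# Every name and headline here is invented for illustration.
GAZETTEER = {
    "PERSON": {"Modi", "Macron"},
    "LOCATION": {"India", "Paris", "Kerala"},
    "ORGANIZATION": {"Netflix", "Google"},
}

def extract_entities(text):
    """Return {entity_type: [tokens]} for every known token in the text."""
    found = {}
    for token in text.replace(",", " ").split():
        for etype, names in GAZETTEER.items():
            if token in names:
                found.setdefault(etype, []).append(token)
    return found

def group_headlines(headlines):
    """Group headlines by the entity types they mention, the way a news
    site might sort thousands of articles without human editors."""
    groups = {}
    for h in headlines:
        for etype in extract_entities(h):
            groups.setdefault(etype, []).append(h)
    return groups

headlines = [
    "Netflix expands streaming in India",
    "Macron visits Paris schools",
    "Heavy rain expected in Kerala",
]
print(group_headlines(headlines))
```

A news site could use the resulting groups to route each incoming article to a section automatically, which is the use case described above; a real system would add statistical disambiguation on top, since fixed lists cannot cover unseen names.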
And it takes a lot of research and a lot of manpower. But for one person willing to go out in front and start all this, there's a lot of potential. It's a very clean playing field, because there are no first movers at all. So if you're interested, you could try to start; a lot of organizations would be happy to provide funding for doing such things, but nobody is going into it. I think I'm going to wrap up my talk here. This is the project I showed you earlier, this is my Twitter, and this is my email, so if you have any questions or want to take this forward, you can always ping me. Thank you. Does anyone have any questions? Yeah, sure.

So I'm wondering, and I don't know so much about NLP, but I've worked in AI for many years now. My question is: a lot of your methodology in NLP is very inductive. Meaning you have data, you train on the data, you classify the data, as you just said. This is AI that's a century old; it's not that modern. So the question is, is there any other way of building NLP systems, whether that's grammar checking or something else, that's not just induction-based? Because induction always implies you have to have a lot of data, and languages, like French or whatever, are very different; through language evolution they've become very, very different. Inductive methods are linear and incremental. That's the reason Google Translate, even now, is a disaster if you try to translate something into Mandarin, despite the fact that it's Google. I'm not actually sure what they use, but if they use this kind of method, that would easily explain why it only works well between similar languages, like English and German. So the question is: is there anything that's not just this induction-based approach of data, classifiers, and separators, but something more deductive or abductive?
There's a lot of AI that could help, but the thing is, there aren't a lot of people working in this field. Unless developers who know these modern methodologies work on it, we won't know what's there and what's not, what we can use and what we cannot. I tend to think nobody has any idea, because nobody has ever tried to apply these things to this field. One thing I could do, as you mentioned, was use the 20th-century inductive method, because one person had already done it for the Tamil language. I took that and applied it to the Malayalam language, because I didn't have any other resources or tools, and it worked. Well, kind of worked.

Let me give an example. There's something called a generative adversarial network, a GAN. A few weeks ago there was a project, I forget the name, where you give ten words to the platform and it generates a video for you, and everything, even the video, looks credible enough for you to think a human made it. You give it ten words, say cat, girl, fish, and it generates a video like that. So to me, GANs, self-play, and deep reinforcement learning types of methodologies could potentially be used for this, because what they have in common is that they don't require data. They are very context-specific, a lot of them are deductive, a lot of them use play: instead of a network training on data, you have networks training on networks. That spares you the time and everything; you don't need the data, because if you train on one dataset and try to extrapolate to another, well, languages are usually dramatically different. There are similar languages, obviously, but when you try to do that kind of processing across them, it's very, very difficult. But the context part is the biggest missing piece right now, because these systems don't understand context. Yeah.
You said "I fish a fish," right? So how does it know which one is the subject and which one is the object? It only knows that from the context of English. So for this, you probably have to look at more advanced AI, not induction-based AI.

That's actually my next step. I only started this project about six months ago, and I used a few older methods, so over the next six months I'll probably move on to GANs and try new things with it.

Maybe check that work out, because there's already progress related to languages; it's not NLP exactly, I don't know, but the fact that they can generate video from random words should be a good direction. Yeah, thank you. That's it. So I'll see you around; have a nice day.