Our project is a smart learning module for automated tagging of documents, so our problem is to automate the process of tagging documents. What exactly is tagging? Given a document, we need to generate tags. And what are tags? Tags are concrete classes that describe a document. For example, for a document about Lionel Messi, the system should generate tags such as sports and football, and similarly for other documents. Tags can also be looked at the other way around: if you search using a tag, the relevant documents should come up as results.

For those who have a basic overview of machine learning, this might seem like a simple classification problem, and we have many ways to solve classification problems, such as logistic classifiers and neural networks; all you need is a data set to train on. But that is exactly the problem: we do not have an ample amount of labelled data for this task, so we need to look further. Our solution is LDA, which does not require an explicitly labelled data set. Now Aditya will explain how LDA works.

Alright, so before jumping right into LDA, I would first like to give a basic overview of what we are trying to do, for people who are not well versed in machine learning and natural language processing. We humans are really good at interpreting a text or a document; we can easily say that this text is related to this topic. This comes easily to us because, from childhood, we have been given a lot of information from different fields and topics, so when we encounter a new document we can quickly figure out which topic it belongs to. We are trying to implement the same ability in a machine. To do that, similar to a human being, we need to feed it a lot of data, and that data needs to come from a variety of different fields. For that purpose, the data set we used is Wikipedia, because Wikipedia contains articles about almost everything. The other important point is that the words in the Wikipedia articles are essentially the main features of our model.

Like any machine learning project, there are basically four steps involved, pre-processing, training, prediction and evaluation, and I will talk about each of them in detail. The first step is pre-processing the words so that we get a quality output; in our case, pre-processing took around 15 hours on our machines. This was followed by training. Training was the hardest part of the project because we had two options: batch LDA or online LDA. We finally went with online LDA because it converges faster. In batch LDA, every iteration uses the entire corpus, which in our case was 5 million documents, to update the parameters, whereas in online LDA we only need a chunk of, say, 10,000 documents at a time; we keep updating the parameters chunk by chunk and obtain the final parameters in a single pass. Another difficulty was that training requires the number of topics to be specified in advance, and we did not know a good value beforehand.
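A minimal sketch of this online training step, assuming Gensim's LdaModel and a pre-processed corpus and dictionary (built in the pre-processing sketch further below); the variable names are illustrative rather than taken from our code.

```python
from gensim.models import LdaModel

# tfidf_corpus: the pre-processed, TF-IDF-weighted corpus
# dictionary: the word <-> id mapping
# (both are built in the pre-processing sketch shown further below)
lda = LdaModel(
    corpus=tfidf_corpus,
    id2word=dictionary,
    num_topics=100,      # has to be fixed in advance
    chunksize=10000,     # documents used for each online parameter update
    update_every=1,      # update after every chunk, i.e. online LDA
    passes=1,            # a single pass over the ~5 million articles
)
```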
So the hardest part of the project was getting the right attributes: we could use different numbers of topics, and we could also vary parameters such as the chunk size, that is, the number of documents fed in for each update. Training even once took 10 to 12 hours on our laptops, so every time we changed the attributes it took another 10 to 12 hours to retrain, and it took a long time to arrive at the right set of attributes. We finally settled on a chunk size of 10,000 and 100 topics, which gave the best output. Training is followed by prediction, and after that we evaluate the performance. The first three parts we accomplished using the Gensim library, which is available in Python.

Starting off with pre-processing: as I said, Wikipedia has around 5 million articles, and we discarded articles with fewer than 200 characters. Apart from that, we used stop lists to keep useless words out. We ignored the words with the highest frequencies, since they are most likely useless words such as "an", "of" and "the", and we also discarded the words with very low frequency, because they are unlikely to contribute to a majority of documents. After this we can also apply techniques like stemming and lemmatization, so that different forms of a word map to a single word; for example, "am", "are" and "is" can all be reduced to "be".

The next step after pre-processing was training. The input we gave to our model, which is LDA in our case, is the final list of words, that is, the vocabulary minus the stop list, and the TF-IDF matrix, which basically contains information about the frequency of words in each document.
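A minimal sketch of how this input could be prepared with Gensim; the thresholds and variable names below are illustrative, not the exact values used in the project.

```python
from gensim.corpora import Dictionary
from gensim.models import TfidfModel
from gensim.parsing.preprocessing import STOPWORDS
from gensim.utils import simple_preprocess

# raw_docs: an iterable of raw article texts (illustrative name).
# Articles shorter than 200 characters are discarded and stop words removed.
docs = [
    [w for w in simple_preprocess(text) if w not in STOPWORDS]
    for text in raw_docs
    if len(text) >= 200
]

dictionary = Dictionary(docs)
# Drop very rare and very common words; these cut-offs are illustrative.
dictionary.filter_extremes(no_below=20, no_above=0.5)

bow_corpus = [dictionary.doc2bow(doc) for doc in docs]
tfidf_corpus = TfidfModel(bow_corpus)[bow_corpus]   # per-document word-frequency weights
```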
Now coming to LDA itself. LDA is a bag-of-words model, and the main features of the model are the words. What it basically does is assign a set of words to each topic: we specify the number of topics beforehand, and the model assigns a bag of words to each topic, with each word having some probability of belonging to that topic. Once we have the list of topics, each with its bag of words, we can take a new document and tell what its composition is in terms of those topics.

I will explain two more parameters of LDA, alpha and beta, which are the hyperparameters of the model. Alpha is used to vary the per-document topic distribution: a high alpha value trains the model in such a way that each document corresponds to a larger number of topics. Similarly, beta is used to vary the per-topic word distribution: a higher beta means each topic is trained as a collection of a larger number of words.

I would also like to give an intuitive idea of how LDA works. LDA assumes that a document is created by the following process. Say we have three topics; LDA assumes each topic has a certain bag of words associated with it, and that there is a sort of recipe for each document. Suppose the document we are trying to generate is 50% topic 1, 30% topic 2 and 20% topic 3: the process would essentially take 50% of the words from topic 1, 30% from topic 2 and 20% from topic 3. Of course, it would also take into account the probabilities associated with each word. This is what LDA assumes in order to generate a document. In reality it does the opposite: it is given a document, with all its syntax and so on, and it performs the reverse process, trying to learn a model, that is, a bag of words associated with each topic, such that for some recipe this document could have been generated. That is how the training of LDA takes place. Kanika will now explain the results of the model.

Starting off with the results. For testing purposes we input a particular set of documents, and in our case we got the same results again and again, no matter how many times we passed in the same input. After extracting the files using MakeWiki and training the model, we got the result: it prints the topic number and the bag of words associated with that topic, along with the probabilities, basically the weights. In the first line we can see words like galaxy, planet and earth, so this topic is probably related to astronomy, although we have not labelled it. We can look at the other topics in the same way: in the second line we can see apps, football, championship and team, so this topic is related to football, or sports in general.

After the model has been trained comes the final step: when we pass a document to the trained model, it should give us a result in which the bags of words related to that document are printed. Basically, only the bags of words of those topics whose weights are greater than a threshold value are printed, and the threshold value is chosen by the user. In this case, when we pass a document about Shah Rukh Khan, it prints words like actor, actress, director, episode and story, which are quite related to it. The same applies to an article about the FIFA World Cup 2018: it prints words like football, goals, club, league and so on.
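A minimal sketch of this tagging step, reusing the trained `lda` model and `dictionary` from the earlier sketches; the 0.2 threshold and the `article_text` variable are illustrative, since in our setup the threshold is chosen by the user.

```python
from gensim.utils import simple_preprocess

def tag_document(text, threshold=0.2):
    """Return tag words for `text`: the top words of every topic whose
    weight in the document exceeds `threshold`."""
    bow = dictionary.doc2bow(simple_preprocess(text))
    topic_weights = lda.get_document_topics(bow, minimum_probability=threshold)
    tags = []
    for topic_id, weight in topic_weights:
        # the top words of each sufficiently weighted topic serve as tags
        tags.extend(word for word, _ in lda.show_topic(topic_id, topn=5))
    return tags

# article_text would hold, e.g., the FIFA World Cup 2018 article;
# the expected tags then include words like football, goals, club, league.
tags = tag_document(article_text)
```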
Then comes the conclusion. In this project we have investigated the use of LDA; on passing the same articles again and again we should get the same results, which is what we observed. This matters because the data set cannot realistically be tagged manually, since that would be quite tedious and time consuming, so a model of this kind is necessary. As future work, this could be extended to audio and video tagging: if a video file is converted into an audio file, and the audio file is converted into a text document, the same procedure we used for automatic tagging of text documents can be applied. So if we know how to convert an audio file or a video file into text, this approach can be used to tag audio and video files as well. Thank you so much.

How difficult or easy was the use of these libraries per se? The reason I am asking is that this should be important for everyone, independent of whether you are actually going to become experts in data analytics or not, because some amount of analytics, some amount of machine learning, will have to be done by each one of you, no matter what activity you are doing. Going forward, the idea would be to use, as effectively as possible, the whole range of ML libraries that are becoming available, and therefore to develop the skill to pick them up quickly. This team got a chance to do that, but others should also try the same thing out, which is why I hope the uploaded reports are available for everyone to see. You might not have documented the technical details, more specifically the process of using these libraries, but your Python code is there. How much code did you have to write to use those libraries? Not a lot. As a number of lines? About 150. So you see, just 150 lines of code can enable and empower you to use these things, and I believe that is quite important. I would like all of you to look at these 150 lines of code and repeat this experiment at your own places; I think you can do it on your own machines, right? You can fire something off and let it run; the machine, unlike us, does not sleep unless you make it go to sleep.

The second point I would like to make is something they mentioned but apparently could not include, because it is a hard problem: context. How do you know the context of a document? The entire information about the context may be available completely outside the document; the document itself may say nothing about it. For example, suppose there is a document describing a football match. When that match occurred, where it occurred and so on may not be in the document at all. So how do you capture the context? This brings me to another point: the name of a document is sometimes indicative of the context, if you use a proper name. But more important than that, every document should have metadata. Unfortunately the only metadata we tend to think of are the keywords, but you do not have to specify those; there are tools that will extract keywords from the document anyway. The creation of metadata describing the context, something other than the document itself, is a habit all of us will have to form. This is unfortunately not part of our training yet, but believe me, much of the useful information about the context is lost once a document is written and published somewhere; that context was only in the mind of the creator of the document. So, how are you catering to the context in your work? We are not doing anything about the context explicitly; the model does all of it.

The last question I wanted to ask is: how would you vary the values of alpha and beta, and how would that impact the classification you do? The model takes all the parameters when you call it; it has default values, but you can also give them before you create it, for example alpha, beta and the total number of topics. Correct, so the parameters you feed in affect the resolution you finally get. Would it not be interesting to test this with different values of alpha and beta, to find out what you get in terms of the tags? Sir, we did that with three parameters: batch size, chunk size and the number of topics. Okay, but every time you change a parameter, you have to wait another 10 to 12 hours; that is a question of time.
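As an illustrative sketch only (not part of the project code), such an experiment could look like a small grid over alpha and eta, which is Gensim's name for the beta hyperparameter; each setting retrains the model from scratch.

```python
from gensim.models import LdaModel

# Illustrative sweep over alpha (per-document topic prior) and eta/beta
# (per-topic word prior). Every combination retrains the model, so each
# cell of the grid costs another multi-hour run on a laptop.
for alpha in (0.1, 0.5, 1.0):
    for eta in (0.01, 0.1):
        model = LdaModel(
            corpus=tfidf_corpus,
            id2word=dictionary,
            num_topics=100,
            chunksize=10000,
            alpha=alpha,   # higher alpha: documents spread over more topics
            eta=eta,       # higher eta: topics spread over more words
        )
        # ...inspect or evaluate the resulting tags for this setting...
```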
Now you understand why a high-performance computing framework is essential for doing this. Thank you so much.