Okay, so to close out the lightning talks, we have B, who will be talking about a very interesting way of applying machine learning to community insights. So take it away.

Hi, I'm B. I work with Fedora and I'm currently an Outreach intern with Mozilla. I mostly work on community-related stuff in Fedora, and I'm very passionate about machine learning and NLP, because I studied computer science and I want to use it to understand the community. This is sort of a pet project for me, because I saw that whenever I wanted to do any wiki gardening task, it took me a lot of time. Fedora has a very big wiki with too much redundant information, and it was hard for me to find the right pages, to figure out how to do wiki gardening or what I could do, and if I wanted some information, to tell which page was the correct one, which one had the updated information, and so on. So I have identified three main categories where I could use big data and NLP to aid such wiki gardening tasks. The slides are very concise, but you can find some code I have written at the link here. I'm still developing it, because the Fedora wiki is very large and I'm thinking of using some parallel processing to access the pages simultaneously, but I have not worked on that yet.

The first problem, which I think is very common, is that if you have a large wiki for your project, it just grows on and on and there is a lot of redundant information in it. There may be two pages which have the same information, and one of the wiki gardening tasks for contributors is to identify such pages so that they can be merged. For that, what I suggest is that we can use text similarity from NLP. I hope you all know what NLP is and I don't have to explain it; can you just raise your hands if you don't know what NLP is? Okay, so NLP is natural language processing: you try to process textual information, and it can work in two ways. One way doesn't actually understand the information at all, it just processes the text: you take the words and build something like a bag-of-words model, where you simply match the words, or the characters, against each other. The second is semantic similarity, which is a bit harder, where you try to understand what the words actually mean and then get information out of that. I haven't done anything related to semantic similarity, so I do not try to grasp what the text actually means, but where there is redundant information I have seen a pattern: mostly the same common words are used everywhere. So if you use a text similarity function from, say, Gensim in Python and you compare two pages, then for highly redundant information between the two pages you will get a high text similarity score. In the Fedora wiki I found that for some pages the similarity score was as high as 85%, and I was literally shocked. So you can do this, and you can start by implementing it at a per-category level first, so it is a bit easier to run and the program doesn't keep running for too long. You don't need to write the text similarity functions on your own; you can use built-in functions from Python libraries. Gensim is one, NLTK is another, or you can use any other language if you prefer; I mostly work with Python, so I'm suggesting those.
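As a rough illustration of that pairwise-comparison idea, here is a minimal sketch using Gensim's bag-of-words and TF-IDF utilities. The page titles and texts are placeholders, and the 0.75 cut-off is the filter value mentioned later in the Q&A; this is not the speaker's actual code.

```python
# A minimal sketch of the redundant-page check described above, assuming the
# page texts have already been fetched; the titles and texts are placeholders.
from gensim import corpora, models, similarities
from gensim.utils import simple_preprocess

pages = {
    "Wiki_gardening": "how to clean up and merge redundant wiki pages ...",
    "Wiki_cleanup":   "how to merge and clean up redundant pages on the wiki ...",
}

titles = list(pages)
tokenized = [simple_preprocess(text) for text in pages.values()]

dictionary = corpora.Dictionary(tokenized)
bow = [dictionary.doc2bow(tokens) for tokens in tokenized]
tfidf = models.TfidfModel(bow)
index = similarities.MatrixSimilarity(tfidf[bow], num_features=len(dictionary))

# Pairwise similarity matrix; flag pairs above the 0.75 threshold from the talk.
sims = index[tfidf[bow]]
for i in range(len(titles)):
    for j in range(i + 1, len(titles)):
        if sims[i][j] >= 0.75:
            print(f"{titles[i]} <-> {titles[j]}: {sims[i][j]:.2f}")
```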
The second one is identifying old pages which need to be updated or which are out of date. Every wiki has an edit history of sorts, so it is pretty easy to write a script just to check when a page was last updated, and if a page has not been updated in, say, the past X months, then maybe we need to give it a closer look. In the case of Fedora I set the threshold X at 12 months, so that would be a year, and I checked whether the pages had been updated in that time (there's a rough sketch of this check below). Obviously this also depends on the category of the pages you are looking at: if it's an event and you are just using the wiki as an archive for those events, then maybe you do not need to check that category, because it's just an archive; but if it is a page about, say, some meeting, some project information, or some team, then yes, you need to check whether it is really old and has out-of-date information. So if you just put a filter on the edit history and write a script for that, you can easily get a list of those pages.

The third thing is how you can categorize these pages automatically. Categorization is a very big NLP and machine learning problem, so I cannot say there is one algorithm you should use, because everybody tries different ones. What I am trying to say is that maybe you could find a balance between humans and the big data side: you can use some classification algorithm from machine learning, giving it a training set based on what types of pages are in a particular category, and then it can suggest a category automatically for a new page, and a person can verify it (there's a small sketch of that below as well). It would be easier; I mean, it's better than having no category at all for the page.

So these are some common problems and how you could approach them. If you want to see the detailed code, it's at the link there, and you can develop it for your own wiki. It's not really that tough to write; it just requires some knowledge of Python, and NLP is not really tough either, since the libraries already have built-in functions you can use. Gensim has some pretty good documentation and NLTK has great documentation too. If you have any questions on how to develop this, I am ready to answer them; I mean, you wanted this to be a Q&A session so that I could answer your queries. Thank you so much.
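For the edit-history idea, here is a rough sketch of a staleness check against the standard MediaWiki revisions API. The API URL is an assumption for the Fedora wiki, and the helper names and page titles are illustrative; only the 12-month threshold comes from the talk.

```python
# A rough sketch of the "pages not edited in the last 12 months" check,
# using the standard MediaWiki revisions API. The API URL is an assumption
# for the Fedora wiki; the helper names are illustrative, not the talk's code.
from datetime import datetime, timedelta, timezone
import requests

API_URL = "https://fedoraproject.org/w/api.php"   # adjust for your wiki
STALE_AFTER = timedelta(days=365)                  # the 12-month threshold

def last_edit(title):
    """Return the timestamp of a page's most recent revision."""
    params = {
        "action": "query",
        "prop": "revisions",
        "titles": title,
        "rvprop": "timestamp",
        "rvlimit": 1,
        "format": "json",
    }
    data = requests.get(API_URL, params=params).json()
    page = next(iter(data["query"]["pages"].values()))
    ts = page["revisions"][0]["timestamp"]          # e.g. "2014-08-06T10:00:00Z"
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)

def stale_pages(titles):
    """Yield page titles whose last edit is older than the threshold."""
    cutoff = datetime.now(timezone.utc) - STALE_AFTER
    for title in titles:
        if last_edit(title) < cutoff:
            yield title
```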
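And for the category-suggestion idea, a small sketch of a supervised text classifier. The talk doesn't name a library for this part, so scikit-learn is used here purely as an illustration, and the example pages and category names are made up.

```python
# A sketch of the "suggest a category, let a human verify it" idea.
# scikit-learn, the example pages, and the category names are illustrative
# assumptions; the talk does not prescribe a specific classifier.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Training set: text of pages that already carry a category.
train_texts = [
    "release party event report photos attendees",
    "ambassadors meeting agenda minutes action items",
    "infrastructure team SOP server configuration",
]
train_categories = ["Events", "Meetings", "Teams"]

classifier = make_pipeline(TfidfVectorizer(), MultinomialNB())
classifier.fit(train_texts, train_categories)

# Suggest a category for an uncategorized page; a person then verifies it.
suggested = classifier.predict(["minutes of the docs team meeting last week"])[0]
print(suggested)
```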
We actually have about 5 minutes left, so any questions?

What sort of results have you seen from running this on the real wiki?

Yes, so I ran this on different categories. I ran it category-wise first, and I saw that in some particular categories there were pairs of pages with a very high similarity score between them, so I could identify those pages easily. I put a filter: if a pair had a similarity score above 0.75, I wanted a list of all those pairs of pages, so I could get the URLs and then see that, yes, these have the same common information, and then merge them manually.

So how much compression of the wiki, or how much savings, were you able to achieve with that? Did you find it was very common or not common at all?

No, it was very common, at least in the Fedora wiki, because the Fedora wiki is really, really large; it has millions of pages indexed, and it's really hard to find any new information or whatever I'm looking for in it. Even if I just search for something, it gives me a long list of pages that isn't necessarily the page I'm looking for, and I have to go through everything every time. So it really compressed it a lot, but there's still a lot of work to be done. I cannot give you an exact figure because I did not go on merging everything. Thanks.

Anything else? Is there a lot of work to get rid of the wiki markup to just get the text?

For the Fedora wiki, no. You're asking just about the wiki markup? No, you can use Beautiful Soup and other libraries to get the text directly out of it; they have a function for that, so it's really easy. There wasn't really much work for me to get the text. Beautiful Soup is really good, and I would recommend it just to get the text out (there's a short sketch of that appended below).

Alright, and I guess we need to move people in and out, so thank you so much once again. If anybody wants to ask any other questions, you can try it out; I'll be at the Fedora booth, so if you want to ask anything, I'll be there and I can answer your questions then.
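For reference, a minimal sketch of the Beautiful Soup approach mentioned in that last answer: fetch the rendered HTML of a page and strip the markup. The URL and the `mw-content-text` id are assumptions about a standard MediaWiki page layout, not something stated in the talk.

```python
# A sketch of pulling plain text out of a rendered wiki page with Beautiful Soup.
# The URL and the MediaWiki content div id are assumptions for illustration.
import requests
from bs4 import BeautifulSoup

url = "https://fedoraproject.org/wiki/Wiki_gardening"   # example page
html = requests.get(url).text

soup = BeautifulSoup(html, "html.parser")
body = soup.find(id="mw-content-text") or soup          # main content div, if present
text = body.get_text(separator=" ", strip=True)
print(text[:200])
```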