Good afternoon, everyone. We are going to present our project, a data analytics pipeline for the AVU system. Our team members are Aiyush Agarwal, Devya Patel, Mayank Agarwal, Sakshi Sharma, and myself, Sanskriti Agarwal.

What is our project about? The aim is to segregate the operational and non-operational data found in AVU videos and chat messages. Operational data corresponds to problems in the communication between student and teacher, for example "your video is not visible" or "your audio is not audible". Non-operational data is what the students actually ask, for example their doubts. We are working on AVU, an e-learning platform that helps instructors interact with and teach students, and we are segregating the operational data out of it.

How are we doing this? The pipeline has three inputs: a YouTube link, a chat file, and a video. Selenium is used to extract the YouTube live chats from the YouTube link, and we have worked with several speech-to-text tools to convert the videos into text. The result is a list of sentences, which are passed to a question classifier to extract the questions. Finally, a Naive Bayes classifier predicts whether each question is operational or non-operational.

Since AVU records videos as well, we first convert each video to its transcript and then find the operational and non-operational questions in it. The approach: we extract the audio from the video file, and the audio is then converted to its transcript. There are specific configurations the audio file must satisfy for this conversion to succeed. A speech recognition system typically accepts a file in WAV format. Why WAV? Because it is lossless and uncompressed, and it preserves a much wider frequency range than MP3 files.
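The three-channel flow just described can be sketched end to end. Every function below is a stub standing in for the real component named in its comment (ffmpeg, DeepSpeech, DeepSegment, the question classifier, Naive Bayes), with made-up return values, so this only shows how the stages chain together:

```python
# Hypothetical end-to-end sketch of the pipeline; each stage is a stub
# standing in for the real component named in the comment.
def extract_audio(video):            # ffmpeg: video -> 16 kHz mono WAV
    return video
def transcribe(audio):               # DeepSpeech: audio -> raw transcript
    return "am i audible what is recursion"
def segment(transcript):             # DeepSegment: transcript -> sentences
    return ["am i audible", "what is recursion"]
def is_question(sentence):           # question classifier (BiLSTM-based)
    return True
def classify(question):              # Naive Bayes: 'O' or 'N'
    return "O" if "audible" in question else "N"

def pipeline(video):
    sentences = segment(transcribe(extract_audio(video)))
    return {q: classify(q) for q in sentences if is_question(q)}

print(pipeline("lecture.mp4"))
# → {'am i audible': 'O', 'what is recursion': 'N'}
```

The YouTube-chat and CSV channels would feed their sentences into the same question-classifier step, skipping the transcription stages.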
Also, the sampling rate must be 16,000 Hz and the audio must be mono (single channel). The technologies used to fetch and modify the audio files are pydub, ffmpeg, and youtube-dl. youtube-dl directly downloads a YouTube video; the AVU videos available on IITB studio are downloaded under these configurations by youtube-dl itself.

Now that we have the audio, we want to convert it to its transcript, and there are many services for that. Some very good proprietary options for speech-to-text are the Google Speech Recognition API and the Gnani API; they are free up to a certain limit, but beyond that they charge for conversion. There are also open-source options, such as CMU Sphinx and DeepSpeech.

So what is the problem with converting speech to text? All the good speech-to-text tools are proprietary. The free, open-source services are not trained on Indian-accent data; they are trained on foreign accents, and most videos on AVU are by Indian professors, so we need a tool that transcribes Indian-accented speech. Another bottleneck is the length of the videos: AVU runs faculty development programs whose recordings exceed one to two hours, and converting an hour or two of video to a transcript is a heavy task.

The solution to this problem is DeepSpeech, an open-source model. What is DeepSpeech?
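The required audio configuration can be captured in a small helper that builds the ffmpeg command. The file names here are placeholders, and in practice youtube-dl or pydub can wrap the same conversion:

```python
# Sketch of the conversion step: ffmpeg re-encodes the downloaded audio to
# what the speech recogniser expects: 16 kHz, mono, WAV (lossless and
# uncompressed, unlike MP3). File names are placeholders.
def ffmpeg_to_wav(src, dst="lecture.wav", rate=16000):
    """Build the ffmpeg argument list for a 16 kHz mono WAV file."""
    return [
        "ffmpeg", "-i", src,
        "-ar", str(rate),  # audio sampling rate: 16,000 Hz
        "-ac", "1",        # audio channels: mono
        dst,
    ]

cmd = ffmpeg_to_wav("lecture.mp4")
print(" ".join(cmd))
# → ffmpeg -i lecture.mp4 -ar 16000 -ac 1 lecture.wav
# subprocess.run(cmd, check=True)  # run it once ffmpeg is installed
```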
It is an open-source model, and a pre-trained model is available, trained on foreign accents, which we can adapt to our goal. It is written in Python, the go-to language these days for machine learning. It is based on TensorFlow, and it has huge community support, since Mozilla maintains DeepSpeech.

Now, how do we adapt this model to support Indian accents? We apply transfer learning. What is transfer learning? If a model has been trained to perform a task A, we can tweak certain parts of it to make it perform a task B. For intuition, suppose a model is trained to identify the letter E, which, say, has four edges, and the model has learned to identify all of them. If task B is to identify the letter F, we do not have to retrain on all the edges; we just have to forget the bottom edge of the E. That is what transfer learning is.

How does transfer learning apply to DeepSpeech? DeepSpeech is a deep learning model, a neural network with several layers. We do not have to train every layer: we fix (freeze) some layers and train only the remaining ones. The last layers are retrained on Indian-accent data. The data we use for that comes from the Common Voice dataset.
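As a toy numeric sketch of freezing layers (this is not the DeepSpeech architecture): a fixed feature map plays the role of the frozen pretrained layers, and only a final logistic layer is retrained, on made-up "new domain" data:

```python
import math

# Toy sketch of transfer learning. The "pretrained" lower layers are a
# fixed (frozen) feature map; only the last layer's weights are retrained,
# just as DeepSpeech's last layers are retrained on Indian-accent data.
def frozen_features(x):
    # Hypothetical frozen layers: never updated during fine-tuning.
    return [x[0] + x[1], x[0] - x[1]]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_top_layer(data, epochs=300, lr=0.5):
    """Gradient descent on the final layer only; lower layers stay frozen."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, y in data:
            f = frozen_features(x)
            err = sigmoid(w[0] * f[0] + w[1] * f[1] + b) - y  # log-loss grad
            w[0] -= lr * err * f[0]
            w[1] -= lr * err * f[1]
            b -= lr * err
    return w, b

# Made-up "new domain" data: label 1 when the inputs sum to a positive value.
data = [([1, 1], 1), ([2, 0], 1), ([-1, -1], 0), ([-2, 1], 0)]
w, b = train_top_layer(data)

def predict(x):
    f = frozen_features(x)
    return sigmoid(w[0] * f[0] + w[1] * f[1] + b)
```

The point of the sketch is only that adaptation touches a small part of the model: the frozen map is reused as-is, and training cost is paid only for the last layer.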
Common Voice contains recordings in many accents, so we extract the Indian-accent portion and train our model on that. Now we have a transcript generated from the model, but there is a big problem with it: since we are looking for operational and non-operational sentences, we must first know where the sentences are, and the transcript has no punctuation at all. So we perform text segmentation, first identifying the sentences, and only then can the rest of the pipeline decide whether each sentence is operational or non-operational.

For that we use DeepSegment, a natural-language-processing technique for text segmentation. It is based on word embeddings: it uses GloVe vectors, which come from Stanford, together with a sequence-to-sequence model built on bidirectional LSTMs to segment the text.

One data source for our pipeline is the transcript we get from the video. Another source is the YouTube live chats. How did we extract the YouTube live chats? The simple way: go to the website, copy the messages we want for the analysis, and paste them into a file. That is easy for one or two videos, but you cannot do it for hundreds or thousands of videos; you write a script for it. This is where Selenium comes to the rescue. Selenium is a tool usually used to automate web browsing: primarily it was built to automate the testing of web applications, but nowadays people also use it for web scraping. Selenium is popular across programming languages.
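A crude, rule-based stand-in shows the shape of the segmentation task: take an unpunctuated transcript and cut it into sentences. DeepSegment learns these boundaries with GloVe embeddings and a BiLSTM; the hand-picked cue words below are only for illustration:

```python
# Crude stand-in for DeepSegment: split an unpunctuated transcript at a
# hand-picked set of sentence-starter cue words. DeepSegment learns these
# boundaries instead; this only illustrates the input/output of the task.
STARTERS = {"so", "now", "am", "is", "what", "why", "please", "sir"}

def naive_segment(transcript):
    sentences, current = [], []
    for word in transcript.lower().split():
        if word in STARTERS and current:
            sentences.append(" ".join(current))
            current = []
        current.append(word)
    if current:
        sentences.append(" ".join(current))
    return sentences

print(naive_segment("am i audible please check the slide now we move on"))
# → ['am i audible', 'please check the slide', 'now we move on']
```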
Browser automation with Selenium can be done in Python, PHP, Java, C#, Ruby, and so on. Let us look at the different types of web scraping. First, static web scraping: static websites are those where we do not interact with the client or the server, so it is pretty easy to scrape all the data from them. There are other libraries, such as Beautiful Soup and Scrapy, that can scrape data just like Selenium, but what makes Selenium different is that it can also scrape dynamic content; among these tools, it is effectively the only one that can. With it you can go to a website, automate form filling and submission, scrape the YouTube live chat comments, or even collect all the photos of a user on Instagram, though you have to allow some time for the data to load.

How did we use Selenium in our project? The whole process, from starting a video to scraping the relevant messages, is fully automated. Running the Selenium script on around 10 videos takes around 20 minutes, because of the ads on YouTube. When a video starts, the live chats at the beginning are not loaded, so you have to seek to the end of the video; only then does the live chat load. Then you have to switch from the main YouTube iframe to the live-chat sub-iframe; only then are you able to extract the data from the live chat. Once you have done that, you can use ID and class references to get the messages and the authors' names. This is the result we obtained from scraping the data.

Now, you saw in the flowchart that there were three input channels. The third one was simply the AVU chat files, which are in CSV format and can be read using Python.
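To make the extraction step above concrete: once Selenium has switched into the live-chat iframe, pulling messages by class or ID reduces to walking the DOM and pairing each author with their message. The sketch below does that with Python's standard html.parser on a made-up fragment; the class names "author" and "message" are placeholders, since the real YouTube DOM uses its own identifiers:

```python
from html.parser import HTMLParser

# Minimal extractor pairing author text with message text. The class names
# "author" and "message" are hypothetical placeholders for the real DOM.
class ChatParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.field = None
        self.rows = []

    def handle_starttag(self, tag, attrs):
        classes = dict(attrs).get("class") or ""
        if "author" in classes:
            self.field = "author"
        elif "message" in classes:
            self.field = "message"
        else:
            self.field = None

    def handle_data(self, data):
        if self.field == "author":
            self.rows.append([data.strip(), ""])
        elif self.field == "message":
            self.rows[-1][1] = data.strip()
        self.field = None

html = ('<div><span class="author">Riya</span>'
        '<span class="message">audio is not audible sir</span></div>')
p = ChatParser()
p.feed(html)
print(p.rows)  # → [['Riya', 'audio is not audible sir']]
```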
That is no problem. So now we have these three channels, and from them a list of sentences. What are we going to do with it? We have the data; now we have to process it. We pass it through something called a question classifier. What does a question classifier do? It takes a sentence and tells you the probability of that sentence being a question; if you imagine it as a black box, that is what it does.

The first intuitive approach you might think of is: if the sentence has a question mark at the end, it is a question. The problem is the sentences that do not have a question mark at the end. We are doing DeepSegment, we are doing scraping, and none of those guarantees sentences that end with question marks, so we have to do something to overcome that, and we take the help of deep learning.

If you expand the question classifier, it looks something like this inside: some pre-processing and then some layers. Let us go through them one by one. Pre-processing does what you do when you hear a sentence: you ignore the words you do not care about, take the keywords, and conclude "okay, that is what was meant". It is processing done before the main processing, which is why it is called pre-processing.

Now we have a sentence, but we cannot feed a sentence to a computer; it cannot do anything with it. Give it numbers, though, and it can do wonders. So we have to find some way to convert the sentence into numbers, and we do this in a very ingenious way, using GloVe vectors. A GloVe vector maps a word to a vector in a high-dimensional space, and by high-dimensional I mean 100 or 200 dimensions; the usual picture can be quite misleading because it is only in 2D, and imagining 100 dimensions is beyond our capacity.
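A hand-made miniature shows the property being relied on: words get vectors, and similar words get nearby vectors. The three components and their values below are invented; real GloVe vectors have 50 to 300 learned dimensions:

```python
import math

# Hand-made 3-component "embeddings" standing in for GloVe vectors.
# The components and values are invented, just to show that similar
# words end up with a high cosine similarity.
vecs = {
    "audio":   [0.9, 0.1, 0.0],
    "sound":   [0.8, 0.2, 0.1],
    "fever":   [0.0, 0.9, 0.3],
    "visible": [0.1, 0.0, 0.9],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# "audio" sits much closer to "sound" than to "fever".
print(cosine(vecs["audio"], vecs["sound"]),
      cosine(vecs["audio"], vecs["fever"]))
```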
But GloVe vectors are so beautiful that you can take the vector for "king", subtract "man", add "woman", and you get "queen". You can see the power they have to encode semantic meaning: similar words occupy nearby positions in the vector space. That is great work done by Socher and his team at Stanford. So that was the first layer.

The next layer is where the actual learning happens. I do not have time to go into detail, but what it basically does is learn from its mistakes. This analogy has been used many times, so I will just state it: it is how humans learn. If you are shown an unknown picture, the first time you will guess something random; someone will say "no, this is actually something else", and you will correct yourself. That is what is happening here. But unlike a normal neural network, this one has an additional feature: it also learns from its context, from what is near each word.

The last two neurons are just for converting these numbers into probabilities. The first one gives the probability of the sentence being a question; the second gives the probability of it not being one.

Now that we have extracted the questions, we use Naive Bayes to segregate the operational and non-operational ones. Naive Bayes is a probabilistic classification technique. It is "naive" because we assume that each word in a sentence is independent of the others. There are three types of Naive Bayes: Gaussian, multinomial, and Bernoulli. In our project we used multinomial Naive Bayes, because the features we use for prediction take discrete values and it works very well for text classification.
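A from-scratch sketch of the multinomial model on a made-up chat sample follows. The actual pipeline uses scikit-learn's CountVectorizer, TfidfTransformer, and MultinomialNB, so this only shows the probability computation behind the O/N decision:

```python
import math
from collections import Counter

# Minimal multinomial Naive Bayes with Laplace smoothing on a tiny,
# made-up chat sample ('O' = operational, 'N' = non-operational).
train = [
    ("am i visible", "O"),
    ("your audio is not audible", "O"),
    ("video is lagging sir", "O"),
    ("how to determine symptoms of fever", "N"),
    ("please explain the previous slide", "N"),
    ("what is the formula for variance", "N"),
]

counts = {"O": Counter(), "N": Counter()}   # per-class word counts
docs = Counter()                            # per-class document counts
for text, label in train:
    docs[label] += 1
    counts[label].update(text.split())
vocab = {w for c in counts.values() for w in c}

def predict(sentence):
    best, best_lp = None, -math.inf
    for label in counts:
        lp = math.log(docs[label] / len(train))     # class prior
        total = sum(counts[label].values())
        for w in sentence.split():
            # Laplace-smoothed word likelihood (naive independence)
            lp += math.log((counts[label][w] + 1) / (total + len(vocab)))
        if lp > best_lp:
            best, best_lp = label, lp
    return best

print(predict("is my audio audible"))  # → O
```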
Now, coming to the implementation of Naive Bayes. As input we give a CSV file containing eight or nine columns, of which only two are important. The first is the sentences column, which holds the chat between the instructor and the students; the second is the label column, where N stands for non-operational and O stands for operational. For pre-processing of the data we used CountVectorizer and TfidfTransformer: CountVectorizer tokenizes the words to convert them into numerical form, and TfidfTransformer reduces the impact of words that occur very frequently in the data.

Here are the results. For the first example, "How to determine symptoms of fever", the Naive Bayes classifier labels it non-operational; the second, "Am I visible", is labelled operational. The accuracy of this model is 86%.

For future work: with more data on Indian accents, we can get better results from DeepSpeech. Since we have segregated the operational and non-operational queries, by knowing how many operational problems a centre is facing we can direct more equipment maintenance to that particular centre. And finally, we worked on the AVU chats, where the log and chat files were being overwritten rather than saved, so we had very little data; by saving them we can get ample data and better analysis.

Examiner: I think you should have put in a lot of examples; the examples you have do not convey anything about the project.

Presenter: Which part do you want? I can just give...

Examiner: No, you do not have to give it now. When you are explaining, say, the GloVe vectors, your live example should be the chat that you analysed and the result you got from it. We cannot visualise a hundred dimensions anyway, and you are actually giving the same examples that are covered in many, many websites and books, the same techniques people have heard thousands of times.

Presenter: Yes, sir.
Presenter: The problem is we do not have the resources to...

Examiner: How much time will you take for the demo?

Presenter: One minute. This is just opening up the server. Then we go to localhost. As we said, we have three channels: the first is for YouTube live-chat scraping, the second is for the AVU chat files, and this one is for converting video to audio and audio to transcript. We display a pie chart, and the same for the others, giving the ratio of non-operational to operational data. This is done over a whole dataset: one whole chat file, or everything scraped from one whole recording.

Examiner: What are the actual sentences? You want to say that new equipment is required at a remote centre, and so on. How will you show that?

Presenter: We can see the chat transcript. This is the result from the chat transcript. For example, "Sir, will you please explain what should be money" is classified as non-operational, because it is a legitimate doubt.

Examiner: Is this what is displayed for the user? AVU currently has remote centres all over India, but I am not talking about that. When you scrape and read the data, you know exactly which the sentences are, right? You are able to read the sentences and feed them to your model. That means you know which 10% is operational and which 90% is non-operational. Can you display those operational sentences?
Presenter: Yes, of course; this is what he is asking. That is just an implementation issue, if it is what the client requires. But usually the client does not want to see that this particular sentence is classified as that; he wants a pie graph saying these many are operational, which tells him, okay, we need to put more resources here because 90% of this data is operational.

Examiner: Okay. Do you have a database?

Presenter: That was the problem: the log files were not being stored until recently, so we took a small amount of data from the log files.

Examiner: You process this text, you identify which is operational and which is not operational, and then what have you done with this data? You simply made the pie chart?

Presenter: Yes; the rest is future scope.

Examiner: But if you have already identified them, you can simply put "operational" in front of each text; it does not need extra work, you have already done it. If "operational" is written on the text file, then the one who is managing these things, rather than simply knowing that 18 percent is operational, can search for "operational"; otherwise he has to do it manually.

Presenter: The manager here is looking over all the remote centres. Would he rather see, at one remote centre, each text marked operational, or see each remote centre as one pie graph telling him what percentage is operational?

Examiner: I do not think just knowing that some percentage is operational matters to me. What matters to me is who needs what, and what to do now.
Presenter: Yes, the overall figure only.

Examiner: Okay, so let us say I come to know it is 18 percent. What do I do after that, once you deploy this? A person needs to know what the problems are; without that, what you have done is not useful.

Presenter: Let me give the motivation for this project. For example, there is one remote centre at Kolkata, and there are many others. A manager sits here who is supposed to tell them, "your audio devices are not working, go and check them". This data will help him: if 90% of a centre's data is operational, that is a huge amount; that centre is having far more audio-video problems than the others, and the service team should be deployed there.

Examiner: You are repeating the same thing again and again.

Presenter: Yes, sir.

Examiner: Fine, thank you. One question: in the initial part of your slides you said you convert audio to a transcript using Sphinx or the Google API, and after that you do something to punctuate it. What was that, and how is it connected?

Presenter: The audio is one complete channel; we convert the whole thing to a transcript. There is no way to directly get it with pauses, as there would be if someone were speaking with punctuation.

Examiner: Then how does this chat reply come into the picture?

Presenter: There are multiple input channels. One of them is video, and from videos we also get sentences, which are passed through the same pipeline.

Examiner: You showed a sample of one of the live-chat replies; that is easy to take, because you can scrape it. But you did not show any sample of the text that was converted with Sphinx and then punctuated with the other machine-learning tool.

Presenter: The format in which the data is present after passing through every channel is the same.

Examiner: You showed a whole block of text without punctuation.
Examiner: I was really interested in seeing how you punctuated it and what the result was. Anyway, that is okay.

Presenter: The model that was used was DeepSegment; it segments the text.

Examiner: We are not interested in the model; we are interested in your work.