Good morning to one and all present here. The topic of our project is a REST-enabled question-video analysis service using machine learning techniques. Let me introduce our team members: myself, Aishwarya; she is Dimple; and he is Neil. Let me give you a brief introduction of our project. Basically, an individual sends a question explaining his query in video format, either through a Raspberry Pi device or through a mobile application. Our API then generates information tags from the content of the video. The tags we decided to extract from the video content were gender, age, emotion, keywords and topic. Gender, because sometimes it is appropriate to route the question to a respondent of the same gender, based on the questioner's comfort. Age gives an approximate idea of the knowledge of the individual seeking the answer. Emotion helps us identify the real intent of the questioner. Keywords give us a condensed summary of the question, and the topic helps us map the question to the appropriate respondent. The tags generated by our API are sent in JSON format to the requesting application and are also stored in a persistent database. As Tushank rightly explained in the beginning, there are three parts to the Drupal project, and we are dealing with the video-question analysis part. Our project is further subdivided into three parts: first, facial feature extraction; next, textual extraction, that is, getting the topics from the audio of the video; and third, a REST API we built to make our services available to anyone who wishes to use them.

So I'll be talking about the facial feature extraction. We have heavily used convolutional neural networks, so I will give a quick walkthrough of what CNNs are and why they are used. What is a convolutional neural network? It is a deep learning algorithm which assigns weights, and thus importance, to different pixels of an image, and it is mainly used to extract the different kinds of patterns that exist in an image. In earlier days, people used handcrafted image processing techniques, but with the advent of CNNs we no longer need to do this: the patterns are learned automatically by defining a loss function and updating the weights, which are then used as filters. That was the definition; here is a very basic mathematical explanation, just to give an intuition. The green part is the image. Each pixel of the image is represented by a number; that is how a computer sees the image. The yellow sliding window is called the filter. The filter carries out an element-wise product with the corresponding part of the image, and the results are summed so that each position gives a single number, represented in the pink part. Multiple such products therefore produce another matrix: the green part is the image, the yellow part is the filter, and the pink part is known as the feature map.
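To make that green/yellow/pink picture concrete, here is a minimal NumPy sketch of the sliding-window computation, assuming a plain "valid" convolution with no padding or stride; this is only an illustration of the idea, not the code our models actually use:

```python
import numpy as np

def feature_map(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Slide `kernel` (the yellow window) over `image` (the green matrix):
    at each position, take the element-wise product with the patch underneath,
    sum it to a single number, and collect the numbers into the pink feature map."""
    ih, iw = image.shape
    kh, kw = kernel.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + kh, j:j + kw]
            out[i, j] = np.sum(patch * kernel)   # element-wise product, then sum
    return out

# A 5x5 "image" and a 3x3 filter give a 3x3 feature map.
img = np.array([[1, 1, 1, 0, 0],
                [0, 1, 1, 1, 0],
                [0, 0, 1, 1, 1],
                [0, 0, 1, 1, 0],
                [0, 1, 1, 0, 0]], dtype=float)
kern = np.array([[1, 0, 1],
                 [0, 1, 0],
                 [1, 0, 1]], dtype=float)
print(feature_map(img, kern))
```

Stacking many such filters, and learning their values by minimising a loss function, is essentially what a CNN does.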
When such filters are stacked multiple times, they form a convolutional architecture, and different architectures are used for different models. So what do CNNs actually learn? Here I have an image of a snake. Low-level filters are the filters used at the beginning of the architecture, and high-level filters are the ones at the top of the stack. The low-level filters generally learn very generic details, like edges and shapes, whereas the high-level filters learn very specific details. As you can see in this case, the high-level filters are learning the scales in the image, so the network recognises a snake based on its scales. After the dot products, what remains is a feature map, and as you can see, it has correctly picked up both the shape and the scales of the snake. So that was a very broad overview of how CNNs work.

Our task was to extract relevant information from the visual part of the video, and the tags we could think of were age, gender and emotion. Our general algorithm for extracting these tags was: first, pre-process the frame; then apply a face detection algorithm; crop the face with a bounding box; and finally feed it to a CNN (a rough sketch of this pipeline is shown at the end of this part). Training a CNN model is very difficult: you require a lot of computational power and, apart from that, a very clean dataset. A lot of work has already been done in this field, so we did a literature survey to see what existing models were available and evaluated them.

The first model we looked at was "Age and Gender Classification Using Convolutional Neural Networks". Its architecture had five layers: three convolutional layers and two fully connected layers. The issue with this paper was that it treated age estimation as a classification problem rather than a regression problem; what I mean is that, instead of predicting a single number, it produced a range of ages to which the image belongs. That was the issue, and that is why we had to drop it. The second model we tried was "Easy Real-Time Gender/Age Prediction from Webcam Video Using Keras". It was trained on the IMDB-Wiki dataset, which has around 400,000 images of the top 100,000 Hollywood actors. The issue here was a strong bias towards Western faces, so it did not work well on Indian faces; the accuracy reported was measured on that dataset. It also had a mean absolute error of 4.63 years, which is not very accurate: if my age is 21, it will produce an output somewhere between roughly 17 and 26, which is a very wide range. This is a demonstration of the models we tried. Here the gender prediction is wrong and the age is off by five years; in the second one it is not even detecting my face. We had to take that into account as well, because at certain oblique angles, and because of occlusions, the face was not detected. The third model we tried was "Real-Time Convolutional Neural Networks for Emotion and Gender Classification". This was also trained on the IMDB-Wiki dataset; for the emotions, however, it used the FER-2013 dataset, and it produces the following emotion tags: angry, disgust, fear, happy, sad, surprise and neutral. This was indeed a good model, and that is why we incorporated it into our project.

Now, why did we discard the idea of estimating age? In a world of constant learning, where everyone's knowledge is augmented every day, an age estimate that is off by four to five years would be counterproductive, because, as Aishwarya mentioned, the main aim of detecting age was to get an approximate idea of the depth of knowledge the person might have. If it detects my age as 29, that would imply I have finished my BTech, a Masters and so on. So it is not a very accurate measure, and if you think about it, humans themselves are not able to judge age well from facial features, so it is very difficult for a model to do so either. That is why we had to discard the idea of estimating age.
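As promised above, here is a rough per-frame sketch of that pipeline (pre-process, detect the face, crop it, feed it to a CNN). It assumes OpenCV's bundled frontal-face Haar cascade and a hypothetical pre-trained Keras gender model saved as gender_model.h5 with a 64x64 grayscale input; the model file, input size and label order are assumptions, not the exact models we shipped:

```python
import cv2
import numpy as np
from tensorflow.keras.models import load_model

# Haar cascade shipped with OpenCV; gender_model.h5 is a placeholder for a trained CNN.
face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
gender_model = load_model("gender_model.h5")  # hypothetical pre-trained model

def predict_gender(frame_bgr):
    """Detect the largest face in one video frame, crop it and classify it."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.3, minNeighbors=5)
    if len(faces) == 0:
        return None                              # e.g. oblique angle or occlusion
    x, y, w, h = max(faces, key=lambda f: f[2] * f[3])
    face = cv2.resize(gray[y:y + h, x:x + w], (64, 64)) / 255.0
    probs = gender_model.predict(face.reshape(1, 64, 64, 1))[0]
    return ["female", "male"][int(np.argmax(probs))]  # label order is an assumption
```

In the full service the same kind of call is repeated over many frames, and the per-frame predictions are then combined into a single tag for the video, which is the evaluation-metric challenge described next.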
The challenges in the implementation were as follows. The first was correction of the video orientation, which was a basic one: the video can be taken in either landscape or portrait mode, so we had to read the metadata of the video and orient the frames accordingly. The second was the evaluation metric for obtaining tags. A video is nothing but a collection of frames, so the model's prediction can fluctuate: for example, on the 42nd frame it might predict male, and if, say, the 57th frame is occluded it might give the wrong tag, female. But we had to produce only one output, so we had to come up with a metric that produces a single output for the entire video. The third was improving the face detection algorithm. It used a Haar cascade, but as we saw it could only detect frontal faces; any obliqueness or tilt in the video was not detected, so we had to take care of that. And finally face alignment: the face alignment algorithm is pretty simple. We detect the eyes, join the centres of the two eyes with a line, and based on the slope of that line we rotate the face so that the line is at zero degrees. And this is a demo video of how our model finally worked.

I'll cover this part rather quickly. Another major part was topic classification, which had three phases: first, video to text; next, extracting keywords from the text we obtained; and last, topic classification to classify the topic of the question. Extracting text from the video was again done in three steps: audio extraction from the video, making the necessary configuration changes to the audio file, and audio-to-text conversion. The required file format was a Waveform Audio File (WAV), mono channel, with a sampling rate of 16,000 Hz. The audio extraction and the configuration changes were carried out using three Python libraries: MoviePy, PyDub and FFmpeg (a rough sketch of this step is shown at the end of this part). The obvious question is: what is the need for these changes? Obviously the user will not take the pain to record the video in the format we require, so we had to make these changes ourselves so that the audio is suited to the next step.

Converting audio to text was a great uphill task. The options we explored were the Nani API, PocketSphinx, DeepSpeech and the Google WebSpeech API. The issue we faced was that most of the speech-to-text solutions that gave the best results were proprietary, so we had to go for the open-source options, which are not well suited to Indian-accent English: they give good results for foreign English accents but not for Indian ones, because they have not been trained on them. We tried proceeding with the DeepSpeech speech recognition model provided by Mozilla researchers. Again, it works well for foreign accents; it can be trained for other languages and accents, but that requires a lot of data, training time and computational power.
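Here is the promised sketch of the audio preparation step, using MoviePy to pull the audio track out of the video and PyDub (which relies on FFmpeg under the hood) to force a mono channel and a 16,000 Hz sampling rate. The file names are placeholders, not our actual paths:

```python
from moviepy.editor import VideoFileClip
from pydub import AudioSegment

def prepare_audio(video_path: str, wav_path: str = "question.wav") -> str:
    """Extract the audio from a question video and convert it to the format
    the speech-to-text engine expects: WAV, mono channel, 16 kHz sampling rate."""
    clip = VideoFileClip(video_path)
    clip.audio.write_audiofile("raw_audio.wav")      # extract the audio track
    clip.close()

    audio = AudioSegment.from_wav("raw_audio.wav")
    audio = audio.set_channels(1)                    # mono channel
    audio = audio.set_frame_rate(16000)              # 16,000 Hz sampling rate
    audio.export(wav_path, format="wav")
    return wav_path

# prepare_audio("question_video.mp4") would yield question.wav, ready for speech-to-text.
```

The resulting file is what we then feed to the speech-to-text step.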
The solution we came up with was to combine Project DeepSpeech with Project Common Voice: we applied transfer learning to the pre-trained DeepSpeech model. Transfer learning can be explained in short like this: suppose a professor is an expert in some topic and wants to teach it to someone else; he gives that person a concise overview of the topic so that the person does not need to start learning from scratch. It is the same with models: instead of training a model from scratch, we took the pre-trained model and applied this technique. The dataset we used was the Common Voice dataset, of which about 4% is Indian-accent data. With the same constraints as before, we got better accuracy, but not the desired accuracy, so this still could not be used. As a temporary solution we used the Google WebSpeech API, which gave very good results. Until we find an open-source solution for speech-to-text, we can use the free services, which offer up to 60 minutes of conversion per day, and that can serve us well.

For keyword extraction we used the RAKE algorithm, which determines key phrases by analysing the frequency of word appearance, the co-occurrence of each word with other words in the text (its degree), and the ratio of degree to frequency. On that basis it determines the keywords, and the results were pretty good for our purpose. For topic classification, this is what a machine learning algorithm generally does: it learns features from labelled text and then predicts the tags for any unknown text. For our project we used Multinomial Naive Bayes, which is based on Bayes' theorem. The dataset we used for training the model was the Yahoo question-answers dataset. The data pre-processing was performed using CountVectorizer and TF-IDF, since text cannot be given directly to a machine learning algorithm; it first has to be converted into some numerical form (a small sketch of this pipeline appears a little further below). The question classification classes that we support are these 10 classes. The classifier worked pretty well, as can be seen from its confusion matrix and its precision, recall and F1 scores; the overall accuracy of our classifier was 67%.

So now let me explain the REST API. REST stands for Representational State Transfer, and there are six architectural constraints of a RESTful API: first, uniform interface; second, stateless; third, cacheable; fourth, client-server; fifth, layered system; and sixth, code on demand. The uniform interface is the key constraint that differentiates a REST API from a non-REST API. The technologies we used for the REST API are Spring Boot and MongoDB. We used MongoDB along with Spring Boot because Spring Boot has a powerful MongoDB connector, and the data is stored in MongoDB in JSON form. Now, the video analysis API that we have built: its main functionality is to accept a request containing a video with its ID, run the analysing Python scripts, and return the JSON tags to the requesting application; it also provides an add-on functionality, which is to track the status of the video. Our video analysis service is accessible at the following endpoints. The first is a POST request to /drupal/upload, where we need to supply the keys video and id. The second is a GET request to /drupal/video/status/{id}, with which we can track the status of the video using its ID; the status can be in one of three states: queued, processing or finished. The third is a GET request to /drupal/video/{id}, which returns all the details of the video.
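Before moving on to the service internals, here is the promised sketch of the topic-classification pipeline (CountVectorizer and TF-IDF to turn text into numbers, then Multinomial Naive Bayes), assuming scikit-learn. The tiny inline training set and class names are placeholders standing in for the Yahoo question-answers dataset and our 10 classes:

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

# Placeholder training data; the real model was trained on the Yahoo Q&A dataset.
questions = [
    "How do I make an Android app?",
    "What is the derivative of x squared?",
    "Why do plants need sunlight?",
    "How do I centre a div in CSS?",
]
topics = ["Computers", "Mathematics", "Science", "Computers"]

classifier = Pipeline([
    ("counts", CountVectorizer()),       # text -> word-count vectors
    ("tfidf", TfidfTransformer()),       # re-weight counts by TF-IDF
    ("nb", MultinomialNB()),             # Naive Bayes on the weighted vectors
])
classifier.fit(questions, topics)

print(classifier.predict(["How can I build a mobile application?"]))
```

In our service, the trained classifier is applied to the transcript produced by the speech-to-text step, and the predicted class becomes the topic tag.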
This is the directory structure of our project, in which the test video-analysis API directory is the root of the project. And this is the overview of our REST API. VideoService is a class which contains the upload function; it downloads the video received from the application and runs the analysing Python scripts through the Analyzer class. The Analyzer class runs the Python scripts for both the face recognition and the keyword extraction, and the PathManager is used to maintain the directory structure and to avoid hard-coding paths. The challenges we faced: the first was accepting multiple concurrent requests. For this we first tried Spring Boot threads with Future and CompletableFuture, but this led to race conditions, sometimes returning an empty response. That is why we switched to Tomcat threads: since Spring Boot has an embedded Tomcat servlet container, we could easily use the Tomcat thread pool, which supports a maximum of 200 threads. The second challenge was to pipeline the Python and Java code: the analysing scripts were written in Python and the REST API was made in Java, so the main task was to integrate the two. What we did was run the Python scripts as child processes of the parent Java process. Now the deployment: we maintained a separate directory structure for each Drupal project team. As you can see here, this is the directory structure, and the deployment steps are very simple: connect to the remote server using a secure shell (SSH) connection, install the required software, and then transfer the files to the remote server.

At last, I would like to conclude with the future work of our project. The first item is that it would be interesting to incorporate the age tag if an accurate model is found or developed. The second is to extend the existing emotion classes by adding classes like curious, interested, frivolous, nervous, et cetera. The third is to find an open-source option for speech-to-text, which our project currently lacks. And the last is to classify the question on the basis of its complexity, which would serve as a means of sending the question to the right professor. That's all, thank you.
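As a usage illustration of the endpoints listed earlier, a client could interact with the service roughly as sketched below. Only the paths and the video/id keys come from the description above; the base URL, the response handling and the polling loop are assumptions made for the sketch:

```python
import time
import requests

BASE = "http://localhost:8080"          # assumed host/port for the Spring Boot service
video_id = "q-0042"                     # hypothetical question-video ID

# 1. POST the question video together with its ID.
with open("question_video.mp4", "rb") as f:
    requests.post(f"{BASE}/drupal/upload",
                  files={"video": f},
                  data={"id": video_id})

# 2. Poll the status endpoint until the analysis is finished (queued/processing/finished).
while True:
    status = requests.get(f"{BASE}/drupal/video/status/{video_id}").text
    if "finished" in status.lower():
        break
    time.sleep(5)

# 3. Fetch the generated tags (gender, emotion, keywords, topic) as JSON.
tags = requests.get(f"{BASE}/drupal/video/{video_id}").json()
print(tags)
```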
So do you think that you have done the analysis here? I don't think there is analysis, because your focus is only on the emotions, the feelings and the age, and that is not related to an educational institute's questions and answers. So, the analysis we dealt with was analysing the question videos themselves. As everyone mentioned in their hypothesis, a question may be answered differently when you see it in video format, so we wanted to present that analysis; the topic classification we did was again for the labelling part mentioned by the previous team, and the emotions, age and gender are being recognised. If a video comes from that Raspberry Pi device, we do not know who the sender is, we do not have a profile; if it comes from the app we can use the profile, but if it comes from the device we cannot. So in order to give the answerer a better view of who is asking the question, we provide those details without, as everyone said, having to see or download the video. With those details, he can choose whether or not to answer the question. I still don't think this is the analysis; you are just justifying it with the questions. So we are analysing the videos basically to get the data out of them, like analysing them to get the gender and so on.

Why do you need that? When you recognise the face, the model is wrong here; when you recognise the face, you can fill up the entries. There is no real analysis for the age and the feelings. Why do you need the feelings for the questions? So, till now it is a basic model, but as future work, suppose you are going to answer some questions that have been raised in video format: before answering, you may want to find out which age group has asked these questions and what their feelings are. So is it not important, I think? No, the whole point of adding emotions was that if someone is joking and putting out very frivolous behaviour, we can discard those videos, just to avoid wasting time. There are certain questions which are very lame and there is no need to answer them, so getting the emotion of the student would help us with that. Here I can only see that you are converting videos into text format through the audio, and that is good, but using the age group and feelings is unnecessary. Then I would need to register as a user and just fill in the details. But we wanted everything to be automated; why necessarily make the user go through that? Somewhere you save the profile of the user, and if I am saving my profile, I will give my age myself; why do you need to get that through the software? So we may get gender and age from a profile, but we won't be able to get the topic or the keywords. And as I mentioned in the first presentation, the model is question-specific, not user-specific. We do not only have that app; we have also planned to integrate this analysis with the Raspberry Pi device, from which we do not get profiles that the user can fill up manually. So for that we need this analysis: from the Raspberry Pi, in the first video demonstration, there was no profile created, so the user cannot fill in his gender or age or anything. And as everyone suggested, we can show just the thumbnail, without downloading the video, together with the metadata mentioned in the last presentation; for that metadata, we need all of this. So I understand that you have done it according to that part. Yes, sir; we have made it a general service to integrate with all the devices, for the app, for the Raspberry Pi devices, for everything, not keeping in mind just one device or one app from which we would automatically get the data.

So, based on that first presentation you have done this, but what are the results of the analysis? There are no results. So we are returning the analysis results to them; the analysis results are returned to the app. Analysis means that when you have some data, you need to analyse it. We are returning things like gender, emotions, topic and keywords; it is not a large-scale analysis, but it is a small analysis of each video we receive, and we return these tags. I understand, but what is the use of this gender and the age here? As I said, if you are asking questions... Yes, sir. My concern is, if I need to answer, I don't think I would first look at what the age of the student is or what the feeling of the student is. So actually, the use of age is, let's say a six-year-old asks a basic question. It doesn't matter if it's a six-year-old or a sixty-year-old; it doesn't matter.
The answer will be more intricate if someone who is old enough and already has knowledge about the topic asks the question, compared to a six-year-old. Okay, but the percentage of such cases is very small if you analyse it. So that is our hypothesis: we are assuming that the answer will vary according to these things. For example, if a six-year-old child asks how to make an Android app, the respondent will give... I don't think a six-year-old would. If he is asking just for the sake of getting an answer to how to make an Android app, then the professor would cover only the basics of that question, whereas if an older student asks how to make an Android app, he would explain it in a better way, how to improve the app and all the details. That is why we are using the age tag. Okay, anyway, thank you. Maybe that six-year-old boy knows some programming; nowadays kids are like that, so it is quite possible that the same question could come and the same answer could be given to a six-year-old as well as a sixty-year-old. Possible; I mean, we don't know. You are assuming it, see, that's the thing, and that is why you have gone in this direction. And one more thing: was that your goal right from the beginning, the neutral thing, the emotions? These are the classes under emotions, like neutral, happy. No, was that your initial goal, or did you find it along the way? We found it along the way; I saw it in one of the papers. Something else might have come up; I think you could have included that also. Yes, whatever emotions our detector finds will come as a list; it is not only one thing that is returned, whatever has been detected comes as a list.

Sir, actually I feel, you answered in a previous session that they may be handling Indian languages, right? That is why you do not want only English. But I have not seen them trying out any other languages as output. Have you tried different languages? All these are... Indian languages? Yeah, all these questions have been asked and answered. No, I am asking, sir. He asked about Indian accents, so he is asking about Indian languages. Sir said it is very difficult to type in English, but according to me, if you give me a choice between a video and a textual format, I will prefer the textual format, say, if I can choose which answers I want, right? Correct, correct. Different people can have different views about something, and I may be biased towards a left or a right view. Correct. Or I cannot be neutral all the time. Correct, correct. Okay, so when a number of different videos are there, how will I know what I want to see? If you give me a textual format, I can choose what I want to see. Correct. And I would not want to listen to a 20-minute video which is a reply. Yes. I want a small video, depending on what I want. So where is it written? Yes, yes, Indian languages were not there.

So let me answer your question. Yeah, please. As you might have seen, they are saying that none of the open-source models return accurate text. Okay, so this is the problem. Why are they not returning accurate text? Because there is no data available to train the models. See, models are available, but we do not have the data. Where was the data taken from? The IMDB repository, where there were so many faces and so much textual information; the faces came from there, and then there are several websites which allow you to donate your voice. Correct.
So there are so many such efforts, but unfortunately we are not participating in any of them. Correct. So let us say I want to see Hindi text written here, or Marathi text, rather than the English one. Correct. In order to do that, what do we need? We need voices of people asking questions in Hindi and asking questions in Marathi, which we do not have. So we have the models, we have the technology, but we cannot train them. Correct. So one purpose of making this system is also to collect data, to collect samples, so that we can train our models, because right now there is no other way: everybody only allows you to ask and answer in textual format, so even in 10 years we would not be able to train a model to recognise Indian languages. With this system, if it is implemented properly, I believe that within a year we will have enough data to train our models in Hindi, Marathi, Bengali, Tamil, Telugu, Kannada and other Indian languages. The other thing is that, because we do not have that data, we cannot train on the text information either. So yes, obviously I also like text; even when I make videos for students, I read from books only, I do not watch videos, because I do not prefer that. It is my personal choice, but according to learning science there are six learning behaviours; there are behaviour changes, multimedia theories and various other factors. So here it is not just a matter of asking a question in text or in video. For some people it is difficult: if you go to the rural side, even if you insist that they ask a question, forget about writing, even if you insist that they ask a question you can answer, they will not ask. But if there is a device there, and there is nobody around that device, they will not feel shy; they might ask questions anonymously, without entering any details or anything like that. So I believe that after some time it will be useful. As of now it is definitely in the research process, so I cannot say with 100% proof that this will work. It might not work, it might fail, or it may work and be a successful thing, but right now I do not have the answer. Once this is ready we will start getting data, then we will conduct research, and then we will come to some conclusions. Right now it is not there.