We are going to start with the Akash Lab projects. The first group is edX Analytics. The mentors are Mr. Praveen Pal, Ms. Mitali Nayak, and Mr. Pushpak Burange. The main purpose of this project is to analyze edX student data to gain meaningful insights. The work involves a thorough study of a database containing all the demographic and activity information of each user. From this data, one can infer which categories of students are most likely to be interested in a course, the drop-out rate, et cetera. The team is here to present.

Good afternoon, everyone. The edX Analytics team of Akash Lab is here to present what we have done over these two months. I am very thankful to Mr. Pushpak, Ms. Mitali Nayak, and Mr. Praveen for guiding us through it. Let me introduce my team members: I am Pallavi, and these are Akansha, Raunak, Shubham, Sachin, and Oshin. Now, let's start the presentation.

First, what is edX? As Aayushi already mentioned this morning, it is a MOOC platform, a massive open online course site, where students from all over the world can register for free courses and learn many new things.

Coming to the second part, data analytics: it is the discovery of meaningful patterns in data. The main aim of our project is to find features, or relevant information, from which we can figure out whether a student is gaming the system or not. Since I have used the phrase "gaming the system", let me explain what it means. A student is gaming the system when, while using an ITS (an intelligent tutoring system) such as the edX course site, he exploits certain features of the system itself to reach the final answer without actually learning. For example, consider hint abuse. On the Khan Academy website, each question comes with a hint option. If a student is not interested in working out the answer, he can click the hint option, jump directly to the answer, and submit it. The system records that the student has completed the question, so in the end he still gets the certificate. What we want to do is filter such students out so that, in the future, interventions can be introduced to slow down their gaming, so that these students also learn as well as the regular students.

In parallel, we have also worked on data visualization, plotting statistics such as demographic data, gender, and education level in Highcharts as pie charts, line graphs, et cetera. This is good feedback for the course designers and the tutors: they can see how popular their course is and how well it is running, and based on that they can make improvements.

This is the flow diagram of the log parsing. On the edX course site, whenever a student clicks a button or performs any activity, such as logging in, an event is generated and recorded in the tracking.log file on the edX server. Our job is mainly to parse this tracking.log file, extract the events, and classify them into certain tables which we store in our own database. We then write queries over these tables to extract features indicating that a student is gaming, and we store the results in a CSV file.
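To make the parsing step concrete, here is a minimal sketch of reading tracking.log line by line. It assumes the Jackson JSON library; the field names follow the edX tracking-log format (one JSON object per line), and the routing into tables is only indicated in comments:

    import com.fasterxml.jackson.databind.JsonNode;
    import com.fasterxml.jackson.databind.ObjectMapper;

    import java.io.BufferedReader;
    import java.io.FileReader;

    public class TrackingLogParser {
        public static void main(String[] args) throws Exception {
            ObjectMapper mapper = new ObjectMapper();
            try (BufferedReader in = new BufferedReader(new FileReader("tracking.log"))) {
                String line;
                while ((line = in.readLine()) != null) {
                    // Each line of tracking.log is one JSON event.
                    JsonNode event = mapper.readTree(line);
                    String eventType = event.path("event_type").asText();
                    String username  = event.path("username").asText();
                    String time      = event.path("time").asText();
                    // Common attributes go to the log table; event-specific
                    // attributes (e.g. old/new speed for speed_change) would
                    // be routed to that event's own table here.
                    System.out.printf("%s %s %s%n", time, username, eventType);
                }
            }
        }
    }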
And as we are using Weka, a machine learning tool, to classify whether a student is gaming or not, we have to convert this into a .arff file, which is the input format for Weka. Weka will train on this data set and give us the output of whether a student is gaming or not. Now I would like to call Sachin to continue with log parsing.

Thank you, Pallavi. I will continue with parsing the log files. Log files are files in which a series of events occurring in a system or piece of software is recorded. When an event is generated, the server creates a log entry, and these log entries are appended to the log file. When the log file reaches a certain maximum size, it is archived, the archived file is stored, and a new log file is created. The edX ITS also uses log files to record history and keep track of whatever activities a student has performed. For example, if a student clicks the play video button, a log entry with the event type play_video is generated and stored in the tracking.log file in JSON format.

This is the simple schema for storing the data from tracking.log in the Hive database. The first table is the log table, which stores the general attributes present in every log entry, for example the IP address and the host page. For each particular event type, such as load_video, there are special attributes associated with it; for example, a speed_change event has attributes like old speed and new speed. Those attributes are stored in the corresponding event-specific table.

We need to keep our database synchronized with the tracking.log file. For that purpose, we use a status table. The status table contains entries such as lines read and size: lines read is the number of lines we have already parsed from the current tracking.log file, and size is the size of the file we have parsed. When our program runs, there are three possible cases. First, when the current size is greater than the old size, new log entries have been added to tracking.log; the program reads the number of already-parsed lines from the status table, skips those lines in the current tracking.log, and parses only the newly added entries. Second, when the current size is equal to the old size, the file has not been modified, so the program does nothing. Third, when the current size is less than the old size, the file has been archived, so the program parses the new file from the beginning. (A sketch of these three cases appears just after this part.) Oshin will continue.

Thank you, Sachin. As we know, we are working with a large amount of data, and to handle such a large amount of data we use big data tools. I will give you a brief introduction to the tools we are using. The first is Hadoop, a framework for storing large amounts of data in a distributed computing environment. In our project, we have set up a cluster of five to six nodes and stored the generated data in it. The second is Hive, a data warehousing infrastructure built on top of Hadoop; it works with queries in a manner similar to SQL. The last is Sqoop, which is used to transfer data between an ordinary relational database and Hive. The next part will be explained by Akansha. Thank you, Oshin.
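Here is a minimal sketch of the three status-table cases Sachin described. Only the size comparisons come from the talk; the class and method names are hypothetical, and the status-table reads and writes are stubbed out:

    import java.io.BufferedReader;
    import java.io.File;
    import java.io.FileReader;

    public class LogSync {
        // oldSize and linesRead come from the status table; how they are
        // stored and written back is elided here.
        public static void sync(File trackingLog, long oldSize, long linesRead) throws Exception {
            long currentSize = trackingLog.length();
            long skip;
            if (currentSize == oldSize) {
                return;            // Case 2: file unchanged, nothing to do.
            } else if (currentSize > oldSize) {
                skip = linesRead;  // Case 1: new entries appended; skip lines already parsed.
            } else {
                skip = 0;          // Case 3: file archived and restarted; parse from the top.
            }
            try (BufferedReader in = new BufferedReader(new FileReader(trackingLog))) {
                String line;
                long n = 0;
                while ((line = in.readLine()) != null) {
                    if (n++ < skip) continue;
                    // parse(line) -- classify into the log table and the
                    // event-specific tables, then update the status table.
                }
            }
        }
    }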
Now I will continue with the connection of Hive with Java. Since we are using a Java program to run our Hive queries, our first task is to connect our Java program to the Hive server. For this, Hive provides a specific service called the Hive server. We need to run the Hive server on a particular port and mention that port number in our Java program, so that our JDBC driver can connect to the Hive server, run our Hive queries on that server, and return the query output to our Java program. So the sequence is: Java program, JDBC driver, Hive server.

What is stored in Hadoop? What data is stored in Hadoop? Yes, sir: different tables with different types of data. Those tables are sitting in Hadoop? Yes, those tables are sitting in Hadoop. We have extracted all the useful information into our log table and the other tables that Sachin explained previously. So our next task is to extract some really useful information from those tables, so that it is of use to the course creator, to the student, or as features for machine learning. We have basically derived four types of features. The first feature is how active a user is on a per-day basis.

Before that, I always insist on requirements. Yes, sir. You started with the project objective, which is to catch gaming. So, before you proceed further: two things I understood. Your objective is to catch gaming. The second thing I understood is that you are going to catch gaming based on eight events. Yes, sir. You have got exactly eight events. Yes, sir. Where is the requirement document which says gaming is equal to what? Where is this? So, we are using machine learning to classify whether a student is gaming or not. You can use anything; that machine learning tool will have an interface to define gaming. What is gaming according to you? That should come here. What is gaming based on the eight events? So, first we have derived the features, and based on those features we have mapped them into the machine. All right, features. Let's see what you mean by a feature.

So, the first feature is how active a user is. It is based on two types of activities: first, the videos played by a particular user on each day of the course, and second, the number of questions attempted by that user on each day. Joining these two tables, we get an approximate overview of how active a user is on each day of the course; a combination of the two gives the activity level. One way to compute the video half of this feature is sketched below.
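This is a minimal sketch, assuming HiveServer2 and its standard JDBC driver on the default port 10000; the play_video table and its columns are invented for illustration, since the talk does not give the exact schema:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class ActivityFeature {
        public static void main(String[] args) throws Exception {
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://localhost:10000/default", "", "");
                 Statement stmt = conn.createStatement()) {
                // Videos played per user per day: one half of the activity feature.
                ResultSet rs = stmt.executeQuery(
                    "SELECT username, to_date(event_time) AS day, COUNT(*) AS plays " +
                    "FROM play_video GROUP BY username, to_date(event_time)");
                while (rs.next()) {
                    System.out.printf("%s %s %d%n",
                        rs.getString("username"), rs.getString("day"), rs.getLong("plays"));
                }
            }
        }
    }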
Second, coming to the difficulty level of a question. We cannot judge a student only by how many questions he has solved and in how many attempts he solved them; we need a difficulty level associated with each question. So we have assigned a difficulty level to every question, based on a formula: for each question, we record how many users have answered it, and in how many attempts each of them answered it.

I always get very confused when somebody gives mathematics. Now, I don't understand: first, where are you getting the difficulty level of a question? It is in none of the events. No, sir, it is in none of the events. So he has generated the data? No, sir: from the log table we have a record of how many students have solved each question, and from the log files we get that a given user solved a given question in a given number of attempts. I understand; everybody gets the same difficulty. Anyway, I am not sure about your formula, okay? I don't know why you are generating it. So, let us go to the next one, all right? Don't give me mathematical formulas without justification; they are all junk, according to me, if they are not tested, okay? Next.

This is the fourth feature, where we calculate the seek time of a video. Suppose the duration of the video is 100 seconds and you have seeked from second 20 to second 50. In the seek_video table, which we parsed out of the log entries, we have two columns, new time and old time; we take the difference of the two for each seek and sum them to get the total seek time. Machine learning will then judge the activity level from the features we have derived; these are the values it learns from. Now let me explain the seek formula. I divide the duration of the video by the duration plus the total seek time, and scale the result to a value from 1 to 10: score = 10 x D / (D + seek), where D is the duration. If a user has not seeked at all, the seek entry will be 0, or at most just 1, 2, or 3 seconds, so D / (D + seek) is a fraction close to 1, towards 0.9 or 1. Multiplying by 10 gives a value of around 9 to 10, and a value near 10 shows that he is a regular student.

Okay, next. Let me see your Weka. Next, next. So, this is the input .arff file, with one row per user containing the values calculated for features 1, 2, and 3. First we have to train the machine, so in the training data we have labelled each user as gaming or not, 0 or 1; this is taken as input into Weka. Running logistic regression in Weka on our .arff file, we got 71.4 percent correctly classified instances and 28 percent incorrectly classified. (A code sketch of this step appears at the end of this section.) We did this just yesterday, so we could not find time to take it further, and we did not have much prior idea about it.

The purpose of the internship is for you to learn, and you have done very useful work in classifying the events. But getting Weka to run end-to-end is, I think, what is most important: whether you lost your way or not, you know how to start from somewhere and end somewhere. That is quite good.
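For reference, the Weka step described above might look like this with Weka's Java API. This is a minimal sketch: the file name is hypothetical, and the talk confirms only that logistic regression was run on a .arff input:

    import java.util.Random;

    import weka.classifiers.Evaluation;
    import weka.classifiers.functions.Logistic;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class GamingClassifier {
        public static void main(String[] args) throws Exception {
            // Load the .arff built from the extracted features; the last
            // attribute is assumed to be the nominal gaming label {0,1}.
            Instances data = new DataSource("gaming_features.arff").getDataSet();
            data.setClassIndex(data.numAttributes() - 1);

            Logistic model = new Logistic();   // logistic regression, as in the talk
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(model, data, 10, new Random(1));
            System.out.printf("Correctly classified: %.1f%%%n", eval.pctCorrect());
        }
    }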