So, coming up next is the IIT Bombay X data analytics group, under the mentorship of Shukla ma'am. This project is about developing a big-data-based analytics system for IIT Bombay X, inspired by Open edX Insights.

Good afternoon. I am Jay Bothra from the data analytics group. We are designing a data analytics system for IIT Bombay X that is inspired by the edX Insights system, and we are trying to follow the same architecture. First I will explain how edX Insights works, and then the changes we have made in our own pipeline where they were appropriate or required. Open edX Insights is a platform where the data collected from the Open edX platform is stored and visualized, so that the course coordinators and course creators can go through it and see how the course is running, how people are learning, and various other statistics. There are three analytical models in edX Insights: course enrolment, course engagement and course performance. Course enrolment gives the enrolment statistics over time, broken down by parameters such as age, gender and education level. Course engagement tells how a particular user is engaging with the various resources of the course, such as videos, quizzes and problems. Course performance tells how the students are performing: whether a particular question is tough, how it is answered across the different quizzes, what the answer distribution is, and what the grade distribution is. These are the basic models they have implemented.

Now let us talk about the pipeline, how the system actually works. There are three parts to the system: the LMS, the pipeline and the application. The LMS is the actual edX platform, which acts as the source of the data. There are three different sources of data here. The first is the summary data, which describes the state of the course: how many students are in a particular course, what the different quizzes are, how they are answered, the authentication details and so on. This summary data is stored in a MySQL database, the edxapp database and its tables, while the details of the course forums and course discussions are stored in a MongoDB database called the modulestore. The most important part of the whole system is the log data. The logs record every user event as each user interacts with the system, at millisecond granularity; every footprint of the user is recorded as a JSON log entry. What the people at edX do is store these event logs in Amazon S3, to which an Elastic MapReduce cluster has access. Every time they want to process data for a particular model, or to take in newly arrived data, they fetch it from S3 into their pipeline and process it with Hadoop MapReduce programs. After the MapReduce processing they end up with a set of tables that are used directly for visualization through the data API.
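As an illustration, a minimal Python sketch of reading one of these JSON event logs could look like the following; the field names used here (event_type, username, time, context.course_id) are typical of edX tracking logs but are assumptions to be checked against the actual data.

    import json

    def parse_log_file(path):
        """Read one tracking-log file with one JSON event per line."""
        events = []
        with open(path) as f:
            for line in f:
                try:
                    event = json.loads(line)
                except ValueError:
                    continue  # skip malformed lines
                events.append({
                    "username": event.get("username"),
                    "event_type": event.get("event_type"),
                    "time": event.get("time"),
                    "course_id": event.get("context", {}).get("course_id"),
                })
        return events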
This edX analytics data API can also be used by other systems to build further visualizations on top of the same results store, and the same API is used by the edX Insights dashboard to present the data for the different models we discussed. The pipeline itself runs as a set of Python Luigi tasks. Luigi is a task scheduler, used so that the tasks can be scheduled in parallel with each other and on different machines. If we look at the life cycle for a particular model, say the course engagement or course performance model, the workflow is as follows: first it fetches the log data and the summary data from the S3 cloud, then it processes the data using Hadoop MapReduce and puts the results into Hive tables. Apart from this, some of the data that needs no processing is transferred directly into Hive through Sqoop. Then, combining both sets of data through simple SQL queries, it loads the data into the analytics database, from where the data API can fetch it.

The major difference between our pipeline and theirs is that, for every batch run of a model, they scan and clean the data again. What we are doing instead is cleaning the data only once: as new data comes into the system it is cleaned, extracted and stored so that it can be accessed easily later. Because no repeated cleaning is required, we do not need Amazon S3 storage at all, which is a saving. This is the structure of the results store: the different models have their corresponding tables, such as the authentication tables, the course enrolment analysis tables for the different parameters, the course activity tables, the course performance analysis tables and the video analysis tables. At present edX Insights has implemented no model for video analysis; they have only defined the schema. This results store is what the data API reads.

Now, this is our pipeline. Instead of fetching the data from the Amazon S3 cloud, we access it from our local servers and run three Python tasks for the three sources of data: one for MongoDB, one for the log data and one for the edxapp database. Instead of routing the data that needs no processing through the Hive tables and then into the analytics database, we load it directly from the SQL database into the analytics database, and instead of Hadoop MapReduce we use Spark map-reduce operations and Spark SQL, which has made our processing about ten times faster than Hadoop. As for the Python cleaning code: the code was first written in Java and did not classify all the events, and there are many events in the logs. So we first classified all the events, around 150 to 200 of them, and then my colleagues converted that Java code into Python, so that it could be written as a Python Luigi task and run directly as a batch process. This entire code is now in Python and writes the data directly into the Hive tables.
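To make the Luigi workflow concrete, here is a minimal sketch of two dependent Luigi tasks, a cleaning step and a daily aggregation step; the task names, parameters and output paths are illustrative, not the actual pipeline's.

    import luigi

    class CleanLogs(luigi.Task):
        """Clean one day's raw logs into a tab-separated file."""
        date = luigi.DateParameter()

        def output(self):
            return luigi.LocalTarget("cleaned/{}.tsv".format(self.date))

        def run(self):
            with self.output().open("w") as out:
                out.write("")  # the actual cleaning/extraction logic goes here

    class DailyEnrollment(luigi.Task):
        """Aggregate enrolment events for one day from the cleaned logs."""
        date = luigi.DateParameter()

        def requires(self):
            return CleanLogs(self.date)

        def output(self):
            return luigi.LocalTarget("enrollment/{}.tsv".format(self.date))

        def run(self):
            with self.input().open() as f, self.output().open("w") as out:
                out.write(f.read())  # the actual aggregation logic goes here

Because each task declares its requirements and outputs, the scheduler can run independent tasks in parallel and skip any task whose output already exists.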
So there is no intermediate storage. The database created in Hive contains three subsets. There is a single table for the log data; there are tables for the summary data, namely city, state, course, category, course chapter and session, which hold the relevant summary information; and there are the course forum and course discussion tables, which hold the details of the threads and discussions that are going on, the upvote counts and so on, all properly extracted and stored. As an example of the log conversion: in the testing stage we first converted the logs into a MySQL database to check whether all the fields were being populated properly, and then we used the same schema for the Hive tables. In the log table a tuple has 58 fields, but it is not necessary that all 58 are populated for every event; only the fields appropriate to that event are filled in. For example, for a video event we will have the current seek time, the current video time, the seek type, the current video speed and the old video time; for a problem event we might have the max grade, the number of attempts, whether a hint was available and whether a hint was used. So, for each event, the appropriate fields are there and can be used for further processing. These are the events we have successfully classified.

Now, the advantages over Insights. As I said, we have eliminated the use of the Amazon S3 cloud. The second advantage is that we are using Spark instead of Hadoop, which has sped up our processing several times. And we have eliminated the intermediate MySQL step: we transfer the data directly into Hive, saving that time. The basic theme of the whole of edX Insights is its models: the analysis models take the data from the Hive tables, process it according to the different parameters and then store it in the analytics database. What we intended to do was to replicate the models available in edX Insights and add some of our own indigenous models. We were successful in implementing the course enrolment model, the answer distribution model and the course activity model, and apart from these we have also added a user navigation model and a video model, which will be explained later. Here you can see the entire process for the course enrolment model, from the log tables to the different tables in the analytics database that can be used directly for the visualizations. The process begins from the log data: we select only those events that pertain to enrolment activated, deactivated and mode change, then we submit that data to Spark and map it on the user ID, the course ID and the date, so that for each date we get the enrolment count and can see how it increases and what the pattern is. This enrolment pattern covers daily enrolment and is broken down by birth year, education level and gender. The other models are implemented in the same way. Now Anurag will explain the different technologies used and the problems we faced in the installation phase.
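Before moving on to the installation details, here is a minimal sketch of the flattening described earlier, where an event is mapped onto the wide log-table row and only the fields relevant to its type are filled in; the column and payload names are illustrative stand-ins for the actual 58-field schema.

    import json

    LOG_COLUMNS = ["username", "course_id", "event_type", "old_video_time",
                   "new_video_time", "video_speed", "max_grade", "attempts"]

    def to_row(event):
        row = dict.fromkeys(LOG_COLUMNS)              # every field NULL by default
        row["username"] = event.get("username")
        row["course_id"] = event.get("context", {}).get("course_id")
        row["event_type"] = event.get("event_type")
        payload = event.get("event") or {}
        if isinstance(payload, str):                  # some events carry a JSON string
            payload = json.loads(payload)
        if row["event_type"] in ("play_video", "pause_video", "seek_video"):
            row["old_video_time"] = payload.get("old_time")
            row["new_video_time"] = payload.get("new_time")
        elif row["event_type"] == "problem_check":
            row["max_grade"] = payload.get("max_grade")
            row["attempts"] = payload.get("attempts")
        return row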
Hello everyone, I am Anurag, and I will be talking about some of the problems we faced while installing big data tools such as Spark, Hive and Hadoop. First, the Hadoop cluster ID mismatch: we found a mismatch between the cluster IDs of the name node and the data node. What we did was copy the cluster ID from the name node's VERSION file under the Hadoop directory and paste it into the data node's current VERSION file, and then restart the service. Alternatively, one can simply format the name node and restart the server. The next problem we faced was that a node would sometimes go into an unhealthy state; the solution was to remove the Hadoop temporary files, remove the node manager's local directory under the Hadoop directory, and then restart the node. For Hive, we configured a MySQL database as the metastore instead of Derby, and we also configured the Thrift metastore URI so that Hive and its metastore can be accessed by Spark. In Spark, we found a way to configure the executor memory, which is only 256 MB by default, and we increased it to 5 GB. We also integrated Spark with Hive using the MySQL-backed metastore, and added the path of the MySQL Java connector to Spark's classpath. I would like Kashwini to take it forward.

Hello everyone, now I will talk about the analytics pipeline. What does the analytics pipeline do? It is basically there to fetch the log files and store the results in the MySQL tables. It was earlier configured for Amazon S3 buckets; we have now configured it for HDFS, so the files are fetched from HDFS. We did this by configuring a file named override.cfg, and we have documented it well for future work. This eliminated the Amazon S3 buckets. Secondly, this pipeline runs on Luigi tasks; these Luigi tasks run over the logs and give us a very good way to automate the tasks we want. Another feature of the Luigi tasks is that they create the tables if they do not exist. The next module is the data API, which Sager will talk about.

Good evening everyone. The data API is responsible for fetching the data from the Hive tables and giving it to the dashboard for analytics. It takes the data that the analytics pipeline has stored in the Hive tables and returns it as a JSON string, which is accessed by the dashboard and displayed for the analytics. We had to configure this data API, because while installing the Open edX data API client we found that it was not properly configured, so we had to configure it so that it could fetch data from the Hive tables. Next, Devan will talk.

As you can see here, this was one of our major achievements: getting the data from the analytics pipeline into the data API. This was a task that had not yet been done successfully, because right now there is very little documentation available about edX Insights. One of the most trusted sources is Stanford, who have made it work locally on their back-end pipeline, and one of our minor contributions is that we found an error in their documentation, in the way they have presented it.
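Returning to the Spark set-up Anurag described, a minimal PySpark sketch of that configuration, raising the executor memory and pointing Spark at the shared Hive metastore, could look like this; it uses the current SparkSession API (the project itself used an older Spark release where HiveContext played this role), and the metastore host and table name are illustrative.

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("iitbx-analytics")
             .config("spark.executor.memory", "5g")                     # raised from the small default
             .config("hive.metastore.uris", "thrift://localhost:9083")  # shared, MySQL-backed metastore
             .enableHiveSupport()
             .getOrCreate())

    # Example query against a Hive table written by the pipeline.
    spark.sql("SELECT event_type, COUNT(*) AS n FROM log_table GROUP BY event_type").show()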
The thing is that the database schema created by the Luigi tasks and the one expected by the data API had a mismatch; because of that the tables were not being populated, and this was an error we were stuck on for a very long time, but finally we got it sorted out. And then, there is currently no documentation at all on running this analytics pipeline locally behind a proxy server, on the settings that need to be changed and the other extra things that need to be done. We have documented all of this properly, successfully integrated the analytics pipeline, and brought the data into the data API.

Now I will talk about the problem with the dashboard. To authorize the dashboard you have to log in; it goes to the LMS server and then returns to the dashboard. The problem was that the authorization model was previously not enabled in the dashboard, so we had to configure it and set it to true so that authorization is enabled. After we enabled it, we found that the dashboard does not work like this over plain HTTP: it expects an HTTPS server, but we had an HTTP server, so we had to configure that as well so that HTTPS works with edX. The problem we are facing now is a certificate error: when we come back from the LMS to the dashboard it requires a certificate, which we are currently not providing, so we get an error. So we have prepared our own dashboard, and once that problem is solved we can move our R scripts from our R dashboard to the dashboard provided by Open edX. To talk about the R scripts, Ankit will come.

Hello everyone, I am Ankit Kumar from IIT Patna, and I will talk about our dashboard. What our dashboard does is data representation and visualization of the analytics pipeline output. We plot graphs and tables for each analysis, the graphs are customizable, and we have done some statistical and quantitative data analysis. The technologies used are the R programming language, which is a statistical data analysis tool, the Python Django framework at the back end, googleVis for visualization and RMySQL for the queries. Why R? R is a very popular statistical modelling language, and it is easy to use and fast. Now Zirana will explain googleVis. googleVis is an interface between R and Google's chart API. We have used it for data visualization; Flash Player is required for displaying the animated charts. We collected the data into a summary table and, using R, ran queries on it to produce charts for the different analysis parameters; we have produced different types of charts such as donut charts, line charts and so on. We will show you a demo of this.

This is the dashboard we have finally created. As a user logs in, he gets the list of courses he has access to; for example, we select the course CS 101 from the list of courses offered by that faculty member. Currently the authentication system is not working, so the demo is shown without the authentication system. The modules which are ready are in the enrolment module, which contains activity, demographics and geography. This is the activity module: on the x-axis you can see the date.
On the y-axis we have the number of students. We have plotted two things here: the verified count and the honor count, that is, for each day, how many students are enrolled in the verified mode and how many in the honor mode. As you hover over the graph, for example here, 11 February 2015, you can see that the number of students in the honor mode is 12,601 and there are no students in the verified mode. We also have a table here showing the number of users enrolled. These are the actual log files, taken from CS101.

After a particular date the enrolment curve flattens, simply because there were no further changes in the enrolment. It is a cumulative count, so how can it drop? For every date, the count drops by the number of students who get deactivated and rises by the number who get activated. After a certain point there was no change in activation or deactivation, which means the number of enrolled students remained constant after that date; that is why the graph is stable after that point. The graph shows that on that day these many students unenrolled themselves, and then there was no change; this we got from the log data itself. It is a recorded video, so we cannot read off the exact date, but it is approximately the 7th of March. We were also surprised to see such a drop, but since this was real data, we plotted it as it was. One bottleneck we had was that the log data was not complete: for certain quizzes the data was present in the summary table but the pertinent logs were not there. The logs run up to around the 25th of June; that is why the curve drops and then stays constant, since it is plotted according to the log table.

Next we have the demographics module, which gives the age-wise distribution of the enrolled students. The summary data does not have the information about when a particular user was activated or deactivated; it only records that the user enrolled on a given date. Activation and deactivation are present only in the log data. When a user enrols for the course, an enrolment-activated event is recorded, that is, the administration activates that particular user for the course; when the user unsubscribes from the course, a deactivated event is recorded. So we get activated and deactivated events for each enrolment, and if some log entry is missing, the correct totals are still available from the summary table. The reason for not using the summary data for the activity chart was that, as Dr. Fartek said this morning, we want the variation over time; from the summary table we would get only a single enrolment figure, not the intermediate activity, and we need to see the load on the system over time, otherwise the chart has no meaning. Next we have the age-wise distribution of the number of students: on the x-axis you can see the age and on the y-axis the number of students.
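Going back to the activity chart for a moment, a minimal sketch of the computation behind it, counting +1 for each activation and -1 for each deactivation per day and taking a running total, could look like this; the event-type names follow the usual edX conventions but should be checked against the actual logs.

    from collections import Counter
    from itertools import accumulate

    def cumulative_enrollment(events):
        daily = Counter()
        for e in events:
            day = e["time"][:10]                       # YYYY-MM-DD
            if e["event_type"] == "edx.course.enrollment.activated":
                daily[day] += 1
            elif e["event_type"] == "edx.course.enrollment.deactivated":
                daily[day] -= 1
        days = sorted(daily)
        # Running total per day: constant once activations/deactivations stop.
        return list(zip(days, accumulate(daily[d] for d in days)))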
We can also customize the chart of the number of students on the basis of age. This is taken from the edxapp table, that is the summary table, not from the log table; in fact, for everything in the enrolment module except the activity chart, the data is taken from the edxapp table, so we do not have to process the log files. We can even customize this age-distribution chart: for example, to convert it into a donut chart I click on the donut option, and you can see the age-wise distribution. The next model we have is the education-wise analysis of the students, once again taken from the edxapp table. On the x-axis you can see the educational qualification of the students, and on the y-axis we have the number of students; on the right-hand side you can see a table giving the number of students and their percentage for the different qualifications. The next model is the gender-wise distribution of the students. Next we have the map, the geography-wise distribution of the number of students. The intensity on this map shows the number of students enrolled for the course CS 101; if you hover over any state, you get the number of students from that state who are enrolled for the course. From Chhattisgarh, for example, there is nobody; white means we have no data for that state. We can also edit this chart to get a very smooth visualization.

Next we have the engagement module. On the x-axis we have time in seconds, calculated for a day, and on the y-axis we have taken six event types into account: course, navigation, problem, video, discussion and enrolment. For a particular user, the radius of the circle depicts the time spent by that user on that event. For example, the user Aditi Gandhi, on the event type problem (event number 120), has spent about 180 seconds. Throughout the day we can see how the user navigates through the course. Here we have taken the top three users into account: we take all the users, segregate them, and pick the top three, because they are the people navigating through the course the most, so we show their analysis. An option to choose other users can be added, since we have the data; we processed it and then took the top three. If we plotted, say, 10,000 users, the chart would become cluttered and complex. The editing option is provided as in the previous charts.

We have to wrap up shortly. Next we have the student performance module, with an answer distribution model. What can a faculty member do? He can select the assignment type; for CS 101 we found there were quizzes and a final examination. On the basis of the assignment type, he gets a list of all the assignments of that type; if he selects quiz, he gets a list of all the quizzes for that course: quiz one, quiz two, quiz three.
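A minimal Spark SQL sketch of the per-problem answer distribution behind this selection could look like the following; the table name and the quiz_id, problem_id and success columns are illustrative placeholders for the actual log-table schema.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    # Correct vs. incorrect submissions per problem, for one selected quiz.
    answer_distribution = spark.sql("""
        SELECT problem_id,
               SUM(CASE WHEN success = 'correct'   THEN 1 ELSE 0 END) AS correct,
               SUM(CASE WHEN success = 'incorrect' THEN 1 ELSE 0 END) AS incorrect
        FROM log_table
        WHERE event_type = 'problem_check' AND quiz_id = 'quiz_1'
        GROUP BY problem_id
    """)
    answer_distribution.show()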
As he selects that quiz, he gets a graph which shows all the problems in that quiz and the number of correct and incorrect submissions for each problem, and for the quiz as a whole. That is the answer distribution graph we were showing; we will come back to it a bit later.

The chart being shown here is the user activity chart, but the data it is reflecting is wrong. We have corrected it, but this video was already recorded, so we will put the corrected graph in the slides; we have re-run it through the R script after changing the code, and this is the corrected chart. What this chart shows: we have taken four things into account, the number of active users, the number of users who actually watched a video, the number of students who actually attempted a problem, and the number of students who are active on the forum. All four line curves are computed for each course. The x-axis is a weekly analysis: for a particular course, the number of students doing each of these activities week by week; on the y-axis we have the number of students. So for the CS 101 course, in a given week, we see how many students were active, watched a video, attempted a problem and were active on the forum. Why is there so much activity in the middle of the course? We have taken the data up to the 17th of March, and the maximum point is around the 17th of February. That is the exam time, so the graph is correct: people are most active around the exams.

We have some more modules, but due to lack of time I would like to invite Arundh to speak about his module. Thank you.

Good afternoon everyone. My name is Arundh Jaabashak, and I have been working on the video module, as part of edX analytics, for determining difficulty regions in the various videos. The problem we considered was this: students taking the course may not understand parts of the videos, so they come back to those regions and watch them repeatedly. This may be because they do not understand that part clearly, or because the video itself is not clear or lucid enough.
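Going back to the weekly activity chart just discussed, a minimal sketch of the counts behind it, one set of distinct users per week for overall activity, video, problem and forum events, could look like this; the event-type prefixes used for the grouping are illustrative.

    from collections import defaultdict
    from datetime import datetime

    def weekly_activity(events):
        weeks = defaultdict(lambda: {"active": set(), "video": set(),
                                     "problem": set(), "forum": set()})
        for e in events:
            week = datetime.strptime(e["time"][:10], "%Y-%m-%d").strftime("%Y-%W")
            user, etype = e["username"], e["event_type"]
            weeks[week]["active"].add(user)
            if "video" in etype:
                weeks[week]["video"].add(user)
            elif etype.startswith("problem"):
                weeks[week]["problem"].add(user)
            elif "forum" in etype or "discussion" in etype:
                weeks[week]["forum"].add(user)
        return {w: {k: len(users) for k, users in d.items()}
                for w, d in sorted(weeks.items())}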
This is the kind of information we would like to provide to the instructors so that they can appropriately change and improve the content. What we came up with is this: we divided each video into time frames of 4 seconds each, and took two features. The x-axis gives the index of the time frame, like 0 to 4 seconds, 4 to 8 seconds, and so on. On the y-axis there are two series: the columns are the number of times a particular video frame was accessed, that is, the number of times all the users together have covered it, and the red line is the time spent in each of the time frames.

We went through the logs and processed, for every single video, every single user who has watched that video. This is stored in a Hive table, and for analytics purposes we can generate multiple different visualizations from it; this is one example visualization. We can also show, for example, what a single user's behaviour is on a single video. And we will be doing a cluster analysis on this data, which will automate this process. We will clarify the explanation of this chart and provide an additional PDF for it. What we are going to do with this is make it a prediction, or an alert: a notification to the professor that this is the difficult part of the video. The detailed chart will be available in addition if he wants it, but mainly he will get a notification; that is in the future work section. This is what I have basically worked on, and we have developed a prototype of it.

Even for the log data, we had to process the logs and determine the sequence of the events, because they have to be read sequentially to determine the different regions of the video the user has watched. This is an idealized sequence of how a user would watch a video, and on the basis of this we work out which part he is watching, when he is coming back, and whether he is fast-forwarding a particular part. The curve shown here is from the prototype stage, generated for a real video with real data. The final implementation was done in Spark, to speed the whole process up and to keep it in line with the rest of the edX Insights project. Ultimately, what we want to do is extend edX Insights to help the instructors make the courses better and to enrich the experience for the students. Thank you.
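A minimal sketch of the frame counting described above, replaying one user's time-ordered play/pause/seek events for one video and crediting every 4-second frame inside a watched segment, could look like this; the event and field names (current_time, old_time, new_time) are illustrative.

    FRAME = 4  # frame size in seconds

    def frame_counts(events, video_length):
        counts = [0] * (int(video_length) // FRAME + 1)
        play_start = None
        for e in sorted(events, key=lambda e: e["time"]):
            if e["event_type"] == "play_video":
                play_start = e.get("current_time", 0)
            elif e["event_type"] in ("pause_video", "stop_video", "seek_video"):
                if play_start is not None:
                    end = e.get("old_time", e.get("current_time", play_start))
                    for frame in range(int(play_start) // FRAME, int(end) // FRAME + 1):
                        if frame < len(counts):
                            counts[frame] += 1   # this frame was watched once more
                    play_start = None
                if e["event_type"] == "seek_video":
                    play_start = e.get("new_time")   # playback resumes at the seek target
        return counts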
On the authentication problem of the edX Insights dashboard: our own dashboard is ready, and if we are given access to the authentication system, we will integrate it. Because of their authentication error, we are getting the SSL certificate error I told you about; the documentation for that is there. Whatever we showed is on our own dashboard, with our own R scripts and our own googleVis implementation; we have not taken anything from edX. Because we could not authenticate, we were forced to build it on our own. The edX Insights dashboard itself is not ready. There is a link: the dashboard is linked directly from the IIT Bombay instructor dashboard, yes, we have created a link there, but the problem is that there is no server certificate. This is all on our own machines, on a local installation of edX; it has to work on our local machines first, and only then can we put it on the server, because it is still not ready and we are getting authorization errors. Did you ask Mr. Abhilahar to give you a development server and a test server? Otherwise you are not ready. We have talked to him.

In the back end, in the Python Django framework, we are calling the R scripts as sub-processes: we import subprocess in Django, and that sub-process calls the R script on the terminal, just as you would normally call an R script. In Django, under the view sections, we have included the calls to our own R scripts, which are consistent with the modules that are available. We have written the complete back end, and the analysis itself is pure R script. Thank you.
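A minimal sketch of such a Django view calling an R script as a sub-process, as just described, could look like this; the view name, script path and the assumption that the script prints JSON to stdout are illustrative.

    import json
    import subprocess

    from django.http import JsonResponse

    def enrollment_chart(request):
        course_id = request.GET.get("course_id", "CS101")
        # Run the R script exactly as it would be run on the terminal.
        result = subprocess.run(["Rscript", "scripts/enrollment.R", course_id],
                                capture_output=True, text=True, check=True)
        return JsonResponse(json.loads(result.stdout), safe=False)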