Good morning to all the interested and curious minds present here. We are the Open edX analytics team, and our goal is to enhance the insights provided to course teams so they can improve their courses, resources and quizzes. The team members are myself, Abdul Satar Mapparam, along with Omkar, Chaitya and Ananthalakshmi.

Today's presentation begins with an introduction to the project, after which we will dig deeper into our implementation and our ideas. Coming to the aim and objective: the main aim of the project was to enhance the insights given to course content developers based on the performance of their candidates. This required changes to the data analytics pipeline, the data analytics API and the dashboard. The technologies we used include the Apache Hadoop ecosystem, MySQL, Luigi, Python, Webpack and Django internationalization.

The project involved two parts: video quizzes and resource usage. In the video quiz part, a video contains one reflection spot or none. A reflection spot is a point where the instructor asks the candidate to pause the video and poses a question; the candidate is expected to stop and answer a few questions displayed at the bottom of the video. Each answer can be either correct or incorrect, and based on the responses received we analyze the performance of the candidate. In the resource usage part, there are external resources associated with a course, and counting the number of candidates who accessed or viewed each resource makes the insights more meaningful.

Over to Ananthalakshmi for the Open edX architecture. Open edX is a web-based platform for creating, delivering and analyzing online courses. Its key components are the learning management system (LMS), Studio and analytics. The LMS is the most visible part of the Open edX platform; learners take their courses through the LMS itself.
The LMS also provides an instructor dashboard; users with an admin or staff role can access it by selecting the Instructor tab. Studio is used mainly by the course content team for creating and updating the courses that the LMS delivers. Next, analytics: the events capturing learner behavior are collected by the analytics pipeline, stored in JSON format in log files and processed using Hadoop. The aggregated results are stored in a MySQL database and then made available through REST APIs. Next, Omkar will explain the analytics pipeline.

Okay, so I will be talking about the work done on the Open edX analytics pipeline. The first task was installation. Open edX provides a single script for installing the entire pipeline locally. We used that script and ran into several errors, so we noted them down, fixed the script accordingly and put it on GitHub, so that anyone who wants to install the pipeline can refer to it. I will explain the other three tasks over the following slides.

First, a brief overview of the analytics pipeline architecture. The data generated by the Open edX site is stored in the data stores shown on the left. This data is copied over to an S3 instance using Sqoop or rsync, depending on which data store it comes from. The data in S3 is then transferred to the Hadoop file system (HDFS) and processed using Hadoop. The logic for all the tasks in the Open edX analytics pipeline is written using Luigi, a workflow management system developed at Spotify that helps in building complex pipelines of batch jobs. After the processing happens on the files stored in HDFS, the final result is stored in the MySQL database, which the data API then queries.
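Before any Luigi task runs, each raw tracking-log line has to be parsed out of its JSON form. A minimal sketch of that first step follows; the field names (`username`, `event_type`) reflect the common edX tracking-log layout, but treat the exact shape here as an assumption for illustration:

```python
import json

def parse_tracking_line(line):
    """Parse one tracking-log line into a (username, event_type) pair.

    Real log lines carry many more fields; these two are enough to
    illustrate the parse-then-aggregate flow of the pipeline.
    """
    try:
        event = json.loads(line)
    except ValueError:
        return None  # skip malformed lines rather than crash the batch job
    return event.get("username"), event.get("event_type")

# Example line, shortened and invented for illustration:
sample = '{"username": "learner1", "event_type": "play_video", "time": "2016-01-01T00:00:00"}'
```

Skipping unparseable lines instead of raising keeps a single corrupt record from aborting an entire Hadoop batch.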
Okay, so weekly resource usage, which was the first task, had three steps: changing the pipeline for user activity, creating a data API (which Chaitya will explain later) and displaying it on the front end. For each course, the instructor can provide additional resources on a topic; these links are provided as external material. What we wanted to do was analyze the usage of these resources per week. The result looks like this: a tab added at the end, and one more graph displayed. The first four items were already provided by Open edX; we added the last one, the visited-resources tab. There was no separate event type in the logs for resource events, so we processed the log file for events corresponding to clicks on a resource.

Then, moving on to resource usage counts: the previous task provided a weekly count, and in this task we had to count the total number of accesses for each resource in a course. The task involved writing a separate Luigi task containing the MapReduce logic, and the code was kept compliant with the conventions of the Open edX codebase so it can be easily merged. We then created a data API, which Chaitya will explain later.

The next task was generation of the answer distribution. The answer distribution table stores the data for a question and its responses: the display name of the question, the responses to the question, how many users attempted it, and so on. Open edX had already implemented a pipeline of this kind. There was a first group of tasks, call it A, which processed each line from the raw log file and produced an intermediate output containing the course ID and the data of the problem; this intermediate output was stored in HDFS. A second group of tasks, B, took the output of A as its input and stored it in a MySQL table called answer distribution.
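The MapReduce logic of the per-resource counting task described above can be sketched outside the Luigi/Hadoop harness as a plain mapper and reducer, with a small driver simulating Hadoop's shuffle phase. The event shape and function names here are our illustration, not the actual pipeline code:

```python
from collections import defaultdict

def mapper(event):
    """Emit ((course_id, resource_url), 1) for each resource-click event."""
    # Assumed event shape; the real task reads parsed tracking-log lines.
    yield (event["course_id"], event["resource_url"]), 1

def reducer(key, values):
    """Sum the click counts for one (course_id, resource_url) key."""
    yield key, sum(values)

def run_job(events):
    """Simulate the shuffle that Hadoop performs between map and reduce."""
    grouped = defaultdict(list)
    for event in events:
        for key, value in mapper(event):
            grouped[key].append(value)
    results = {}
    for key, values in grouped.items():
        for k, total in reducer(key, values):
            results[k] = total
    return results
```

In the real pipeline the same mapper/reducer pair lives inside a Luigi `MapReduceJobTask`-style class, and Hadoop handles the grouping that `run_job` fakes here.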
The problem we were facing was that the answer value field of the intermediate output, when parsed by a task in B, raised an error due to a hex string appearing in it. So what we did was create a new task corresponding to B which took the already-completed HDFS-processed lines, manually processed each string whenever an error was encountered, and added some exception-handling cases, finally storing the processed line in the MySQL table.

Moving on to the data API, Chaitya will explain the details. Once all the data has been processed from the Hadoop files by the Luigi workflow system and stored in MySQL, it is time to display it on the dashboard, but the only way the dashboard communicates with the server is via the data API. A simple HTTP GET request, in our case, is sent to the server, and it responds with JSON, which is eventually used to draw graphs on the dashboard.

The first part we did was displaying the reflection spot. The reflection spot is a newly added concept for videos. Since it is a very basic idea, not widely used yet and still at an experimental stage, we did not have any data about reflection spots in the log files, and there is no correlation between a video and its reflection spot available today. So we had to manually look at the videos and build a new table of reflection spots. In future there could be a task that extracts the reflection spots for us automatically. The reflection spot basically makes the video more interactive and engaging for the viewer, and apart from this there are a few quizzes provided below the video to check whether the user has understood it. So this was the situation earlier: the videos part and the quizzes part were completely separate.
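The defensive handling described above, cleaning an answer value that may contain undecodable hex bytes before it reaches MySQL, might look roughly like this sketch. The helper names are ours, not the pipeline's:

```python
def clean_answer_value(raw):
    """Return a safely decoded answer string, replacing undecodable bytes.

    Using errors='replace' turns invalid byte sequences into U+FFFD
    instead of raising UnicodeDecodeError mid-batch.
    """
    if isinstance(raw, bytes):
        return raw.decode("utf-8", errors="replace")
    return str(raw)

def load_rows(rows, insert):
    """Insert rows one by one, skipping (and counting) the ones that fail."""
    skipped = 0
    for row in rows:
        try:
            insert({**row, "answer_value": clean_answer_value(row["answer_value"])})
        except (KeyError, TypeError):
            skipped += 1  # exception handling: record the failure, move on
    return skipped
```

The design choice mirrors what the talk describes: one malformed record should be repaired or skipped, never allowed to abort the whole load into the answer distribution table.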
There was no relation between the two in Open edX until now, but IIT Bombay X started a course which had this reflection spot concept, with quizzes related to the videos. So we had to come up with a whole new table, created manually, holding the reflection spot segment, course ID, video ID and the quiz IDs, as shown. The API response then had to draw on three separate tables: the video timeline table, which gives information about the videos; the answer distribution table, which gives the quiz questions and the responses given by users; and the video-and-quiz table created by us, which holds the reflection spot information.

This is the user interface showing the different URLs that Open edX supports; the one highlighted in red was created by us, and this is the response of our API, showing the reflection spot segment. The whole video is cut into five-second segments, and for each segment the API reports how many users viewed it. The segment number shown next to the reflection spot is 2, which indicates that the reflection spot occurred between five and ten seconds. Apart from that, there were two quizzes provided below the video whose responses were recorded; the total number of answers and the total correct answers are given out by the API. The third thing the API provides is the number of users and the number of views per segment, which you can see in the last three parts of the response.

The second task we did was counting clicks on the target URLs. The instructor provides separate external links which can give deeper knowledge or extra reading for the student, so how many people actually go and click those URLs is something of real importance to the instructor.
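Given the five-second segmentation described above, with segment 2 covering five to ten seconds, mapping a playback timestamp to its segment number is a one-liner. This sketch assumes that 1-based numbering, which is what the example response appeared to use:

```python
SEGMENT_LENGTH = 5  # seconds, per the video timeline segmentation

def segment_number(timestamp):
    """Map a playback timestamp (in seconds) to its five-second segment.

    Numbering starts at 1, so segment 2 covers 5-10 seconds, matching
    the reflection-spot example in the API response.
    """
    return int(timestamp // SEGMENT_LENGTH) + 1
```

A reflection spot stored at, say, 7 seconds therefore lands in segment 2, which is exactly how the response correlated the spot with the per-segment view counts.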
There were a few activity counts already present in Open edX, as you can see in the first four tabs, and what we added was the fifth: the total number of users who actually visit the external URLs. The third part was to provide course-wise clicks for every target URL. The resource usage URL you can see highlighted there was the third thing created by us; it is completely new and has not yet been deployed due to shortage of time. The API response shown gives the course ID, the URL and the count of people who visited that URL. Right now this is dummy data, because the task could not be run due to shortage of time, but it can obviously be done in the near future.

Now that the data API is set up, all this data has to be transferred to the dashboard, to be displayed as graphs and charts from which the instructor can derive insights. My friend Abdul will tell you more about the dashboard.

Having created the data API, the question remains: who uses it? Obviously, the edX analytics dashboard. But the dashboard does not extract information through the data API directly; the data API client acts as an interface between the data API and the dashboard. The purpose of the data API client, as you can understand, is to transfer data from the back end to the front end: it supports calling the APIs and brings the data from the Open edX data warehouse, via the data API, to the dashboard for display. We modified it to fetch quiz responses and resource usage from the data API.

Coming to the edX dashboard: its purpose is crucial, because it is the interface between all the back-end processing and the course instructor. The implementation required us to install the dashboard locally on our systems and then do some setup; we did the setup both with and without authentication.
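A dashboard-side call of the kind the data API client makes can be sketched as a plain GET plus JSON parsing. The endpoint path, port and field names below are assumptions for illustration, not the actual Open edX API:

```python
import json
from urllib.request import urlopen

API_ROOT = "http://localhost:9001/api/v0"  # assumed local data API address

def fetch_resource_usage(course_id):
    """GET the per-URL click counts for one course from the data API."""
    with urlopen(f"{API_ROOT}/courses/{course_id}/resource_usage/") as resp:
        return parse_resource_usage(resp.read().decode("utf-8"))

def parse_resource_usage(payload):
    """Turn the JSON payload into (url, count) pairs ready for charting."""
    return [(row["target_url"], row["count"]) for row in json.loads(payload)]
```

The real client also attaches an authentication token to each request; that is omitted here to keep the request/parse flow visible.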
The work also involved displaying the video timeline, quiz responses and resource usage. The installation was quite a hectic, very long process. We attempted it with LMS authentication, but due to some errors in creating the different virtual machines we were not able to complete it that way. We were, however, successful in doing it without LMS authentication, for development purposes, although certain tasks were still not available to run, so the dashboard was never fully up and running.

The next part was our implementation. We modified the data received from the data API client so it could be displayed on the dashboard, and the front end was also changed to display the additional things, namely the quiz responses and resource usage.

There were quite a lot of tools and technologies involved in this project, but a few stood out and attracted us. One of them was Webpack: it minimizes the HTTP requests to a server, and since huge responses can cause server overloading, this attracted me a lot. Coming to internationalization, I knew Nelson Mandela's line: if you speak to a person in a language he understands, it goes to his head; but if you speak to a person in his own language, it goes to his heart. The beauty behind these lines made me understand the meaning and importance of internationalization.

This video is a capture of the dashboard, showing how the instructor can view different aspects of the course and the insights into it. The video was captured for the ED701X course. This graph shows the number of resources visited, with the extra tab and graph we created this time, and the table shows the weekly usage of every activity.
If you then select a particular week and the video you want to view, you get this kind of graph, which shows the reflection spot, the number of users per segment, and the number of people who attempted the quiz correctly. So that's all; any questions?

Given that you were handling a rather different kind of project from most others, how troublesome was the experience of dealing with the edX analytics pipeline, and what did you learn in the process?

The first thing we came across was the installation, which actually took us around two weeks to get over.

Let me pause here; this is something I would like all of you to understand and appreciate. Increasingly, as professionals, you will spend a significant part of your life in what is emerging as a very critical activity in any team, called DevOps. I think I mentioned that development and operations are going to be combined: you won't have the luxury of saying "I'm a programmer, I'll only write code." These implementation and installation aspects have to be integrated into your mindset and your skillset. It is no longer feasible to say there is some sysadmin who will do these things; you yourselves will have to do them. Now this team has had first-hand experience of all kinds of troubles one can face, I guess. I hope you have learned something useful from it.

Yeah, we have never faced so many problems installing anything as we experienced here.

And for all of you, life is only going to become more complicated, because each of you in your future career will be handling systems with something like a thousand components. You will always need a team in which individual members are familiar with some of these things, but you will have to work together to resolve the issues. And it is better that you learn to do these things by dirtying your hands while you are in college.
In your own colleges you would have something called a software lab, where you get such exercises, but you may not have experiences of this kind. So go back and suggest a lab exercise, or an exercise done over, say, two weekends by a group of friends: pick out some large system of this kind and try to see what kinds of installation issues you face and how you resolve them. I think it is a very important and useful learning experience. And what else did you learn in the process, in terms of new technologies or anything like that?

It was completely new, as we had never done anything like this before. The first thing was the Hadoop system; we learned to write MapReduce tasks. We would never otherwise have had the chance, and we had never before come across working with log data at this scale, but it was a good learning experience. It was quite an opportunity, actually, to develop the analytics for a course that ran just a week before we joined.

From the installation point of view, I learned that software should be developed in a backward-compatible way, because that was a great problem during the installation.

Yes, that always happens; compatibility always remains an issue. And for changing any part of a course, a great deal of care is needed. Okay. Good.