Let's enter into the open world. edX offers online courses and classes, pertaining to a variety of domains, from the best universities of the world, free of cost. Open edX is the open-source platform that powers edX; all edX code is freely available to the developer community. IIT BombayX is an initiative of IIT Bombay to put all the educational material from its undergraduate and graduate level courses online, freely and openly available to anyone, anywhere in India. Through this project, Prof. Phatak's group has tried to make enhancements to the Open edX platform, under the guidance of Shukla Nagmam.

A very good evening to one and all. I'm Neha Gadgil from VNIT Nagpur. This group of 13 interns working under Prof. Phatak and I will be presenting our work on the project "Enhancement of IIT BombayX (Open edX)".

We all use various web applications every day; however, little do we know that we are leaving behind a long trail of our activities on the platform. Applications like edX analyze all our activities and draw meaningful insights from them. In our project we have done the same thing: we have taken the event logs, parsed them, tried to get some meaningful analysis out of them, and also tried to answer some questions that had not been answered before.

Now, big data. The event logs are captured in the order of millions every day; hence they culminate into big data. What is this big data? As the name suggests, it is a huge collection of data, characterized by three Vs: the extreme volume, the wide variety, and the velocity at which it must be processed. Big data analytics is often associated with cloud computing, since analyzing large data sets in real time requires a platform like Hadoop to store the large data on a distributed cluster, and MapReduce to coordinate, combine and process the data from multiple sources. Now, what falls under big data?
It is broadly classified into three types: structured, semi-structured and unstructured data. This is just a list of the technologies we have used, and now I would like to hand it over to the parser group, who have parsed the event logs.

Good evening to one and all. I'm Aditya, and we have worked on event-log parsing. An event is captured whenever someone clicks on the IIT BombayX site. When an event is generated, it is in the form of a JSON object, and the corresponding log record is stored in a file. What we need to do is read the log file and extract the necessary information.

What does Open edX do? Let us say there is a task which requires processing a certain set of log data. It gets the data from a centralized parser, which separates out the particular log values and sends them to the task. If there are multiple tasks operating on the same set of data, then the data must be parsed multiple times, and this increases processing time. So we came up with a new model: we parse once and store the result in a table in a database.

But there is a problem: the existing code is completely hard-coded for each event. What we have done is generalize the parser. Since the existing parser is hard-coded, we need to know all the event types beforehand; whenever a new event occurs,
it is difficult to add, as we need to change the code again. So we have generalized the parser; that is, it is suitable for all kinds of events. We have stored all the event types and their attributes in a database, and we search for the event-type field of a JSON object against the database entries we have. Let us say the event-type name is "discussion": we check the JSON object against the check string, and if it matches, the corresponding event type and name are noted and the record is processed further. These are some of the event attributes, and this is the flow we have described.

Whenever a new, unknown event occurs, we output the line number and the kind of attribute or event to a particular file. Here it indicates the log file: first the log file name, then that at a particular line, 159, corresponding to the edX course-enrollment event, a user-ID attribute was found, and at line 996 we found that a change-email event occurred. This is how we identify new events and enter them into the database.

Open edX does not include an in-built parser for discussion events; we have not found them in Insights. But our parser can identify all discussion events and parse them accordingly. These are some of the discussion events, generated whenever someone opens a forum or views a particular thread in a discussion. A problem exists with some log records: some events do not carry the necessary information we expect; for example, an upload event does not say which discussion it is being uploaded to.

Okay, I'll hand over to the next group. Good evening, everyone.
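The generalized parser described in the section above can be sketched in a few lines of Python. This is a minimal sketch, not the project's actual code: the dictionary stands in for the database table of event types and attributes, and the sample log lines and event names are made up for illustration; real edX tracking-log records carry many more fields.

```python
import json

# Stand-in for the database table of known event types and their attributes.
KNOWN_EVENTS = {
    "edx.course.enrollment.activated": ["user_id", "course_id"],
    "edx.forum.thread.viewed": ["thread_id", "course_id"],
}

def process_log(lines, log_name="tracking.log"):
    """Match each record's event_type against the table; report unknown ones."""
    parsed, unknown = [], []
    for lineno, line in enumerate(lines, start=1):
        record = json.loads(line)
        etype = record.get("event_type")
        if etype in KNOWN_EVENTS:
            # Keep only the attributes registered for this event type.
            attrs = {k: record.get("event", {}).get(k) for k in KNOWN_EVENTS[etype]}
            parsed.append((etype, attrs))
        else:
            # New event type: note the file and line so it can be added later.
            unknown.append(f"{log_name}:{lineno}: unknown event {etype}")
    return parsed, unknown

lines = [
    '{"event_type": "edx.course.enrollment.activated", "event": {"user_id": 7, "course_id": "CS101"}}',
    '{"event_type": "edx.email.changed", "event": {}}',
]
parsed, unknown = process_log(lines)
print(unknown)  # ['tracking.log:2: unknown event edx.email.changed']
```

Because unknown event types are reported with a file name and line number rather than raising an error, new events can be added to the table without touching the parsing code, which is the point of the generalization.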
My name is Nikhil, and I'm going to explain how the existing analytics of the edX platform works. This is a flow diagram of the whole edX analytics pipeline. Whenever a student takes a course, he interacts with the LMS, and every student activity is recorded as a log, as a JSON object, in the log files. These log files are stored in the Amazon S3 service; for our own purposes we have used HDFS to store all the logs. The log files are then fetched by the edX analytics pipeline, where most of the computation is done using Hadoop MapReduce jobs. After the analytics has been done, the results are stored in a MySQL database, and to show the instructor the analysis of the course, we bring it onto the edX Insights server using the edX analytics API. What the analytics API does is run simple queries against the result-store database and present the results to edX Insights.

edX Insights is the place where an instructor can see the course analytics. It is divided into three parts: enrollment, engagement and performance. Enrollment shows how many students have enrolled, along with every activity related to enrollment, and similarly for engagement and performance. How edX Insights helps is that an instructor can modify the course according to the analytics.

There are a few tasks defined in the edX analytics pipeline that have to be run periodically so that the data on Insights stays up to date. These are the various tasks: the answer-distribution workflow, which has to run every night and updates the performance content of the course analytics; ImportEnrollmentsIntoMysql, which is for enrollment; and the course-activity weekly task.
It is for the engagement activity.

To do the analytics on our own local server, we first built the edX analytics pipeline on a remote server, and we noted down all the errors that we faced while building and running the system. This was the command used for running the pipeline for ImportEnrollmentsIntoMysql, which covers all the activities related to importing enrollments, and this is a table that was generated after that. One of the tables, course_enrollment_daily, shows how many students were enrolled in a particular course on a daily basis. For example, here we have taken CS101: on 2 February 2015, according to the logs, 20 students were enrolled, and these are the students that were enrolled on each following day. Now I will hand over to Vedant.

The task assigned to us was to convert the Hadoop MapReduce tasks to Luigi Spark tasks. The current edX pipeline uses Hadoop MapReduce, which is lower in time efficiency than Apache Spark. Spark uses RDDs, which store data transparently in RAM rather than writing it to disk, so we save the cost of the read/write operations. These are the advantages: speed was the major factor, and the benchmark set for that was that it runs up to a hundred times faster than Hadoop MapReduce. The only disadvantages are that it uses more RAM and that, being in the early stages of development, it is still recovering from bugs.

What we did was successfully convert two of the edX pipeline tasks into Luigi Spark tasks. The first one was course enrollment, which calculates the net change in enrolled users per date per course. For this, all the event logs corresponding to user enrollment are processed. Here a question was raised from the audience: can we not get this without analytics?
The number of users logged into the course changes, and the pipeline calculates that change per day, so that would be a huge amount of work; that is why big data techniques are required, to compute it fast every day. It also tells more information about the enrollment: enrollment according to demographics, showing in which city more students are enrolled, and enrollment according to age group; it shows all that data.

At this point a discussion arose with the audience. The question was: you are finding out enrollment from a log event, but the enrollment event also generates a MySQL record, so I can get it from MySQL very easily; why do I have to do all this? The answer given was that the data is very large. The questioner pressed further: this is per-day enrollment, and that is an SQL query, is it not? For per-day enrollment, if the enrollment time is there in SQL, which it should be, then that is all that is needed. The presenters replied that there are many courses running, and a user can enroll and unenroll on a given date many times, so the data would be huge. The questioner summed up: the enrollment event has two impacts. One is an entry into the MySQL database, which I presume should have a timestamp (I do not know whether it does); the other is an event log, which is one among millions. Why should I process one of millions of log lines when a single SQL database query will give me the time that event occurred, if it has a timestamp? If it has a timestamp, I can write a select query, right?
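The select query the questioner has in mind can be sketched like this, using an in-memory SQLite table as a stand-in for the platform's MySQL enrollment store; the table name, columns and rows here are illustrative, not the real schema.

```python
import sqlite3

# Throwaway stand-in for the platform's enrollment table, with a timestamp column.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE enrollment (user_id INTEGER, course_id TEXT, created TEXT)")
con.executemany(
    "INSERT INTO enrollment VALUES (?, ?, ?)",
    [
        (1, "CS101", "2015-02-02 09:00:00"),
        (2, "CS101", "2015-02-02 11:30:00"),
        (3, "CS101", "2015-02-03 08:15:00"),
    ],
)

# One query gives per-day enrollment counts directly from the timestamps.
rows = con.execute(
    "SELECT date(created) AS day, COUNT(*) FROM enrollment "
    "WHERE course_id = ? GROUP BY day ORDER BY day",
    ("CS101",),
).fetchall()
print(rows)  # [('2015-02-02', 2), ('2015-02-03', 1)]
```

This captures the questioner's point: if the enrollment table carries a timestamp, a single GROUP BY over it yields per-day counts without touching the event logs, though it only works for events that also produce a database row.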
So, whenever a user enrolls or unenrolls, we get a log event corresponding to that action, with the event type indicating whether they enrolled or unenrolled, and we load all of those events into an RDD. We get a key/value pair consisting of the course ID, the user ID, the timestamp at which the event occurred, and an action value of one or minus one corresponding to activation or deactivation. These values are then passed to a mapper, where we sort the timestamps and take the last value, that is, the status of the user at the end of the day, so that we can calculate the net change. Summing up those values gives us the net change in enrollment per day per course, over all users.
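The RDD computation just described can be sketched in plain Python, without a Spark dependency: each (course, user, day) key keeps only that user's last action of the day, and summing the surviving actions gives the net change per course per day. The event values below are made up for illustration.

```python
from collections import defaultdict

# (course_id, user_id, timestamp, action): +1 = activated, -1 = deactivated.
events = [
    ("CS101", 1, "2015-02-02T09:00", +1),
    ("CS101", 1, "2015-02-02T10:00", -1),  # same user flip-flops within a day
    ("CS101", 1, "2015-02-02T11:00", +1),  # the last action of the day wins
    ("CS101", 2, "2015-02-02T12:00", +1),
]

def net_enrollment_change(events):
    """Net enrollment change per (course, day), keeping each user's last daily action."""
    last = {}
    for course, user, ts, action in events:
        day = ts.split("T")[0]
        key = (course, user, day)
        # Keep the action with the latest timestamp for this user and day.
        if key not in last or ts > last[key][0]:
            last[key] = (ts, action)
    change = defaultdict(int)
    for (course, user, day), (_, action) in last.items():
        change[(course, day)] += action
    return dict(change)

print(net_enrollment_change(events))  # {('CS101', '2015-02-02'): 2}
```

In the actual Spark version the same shape would appear as a map to ((course, user, day), (timestamp, action)) pairs followed by a reduceByKey keeping the latest timestamp, then a second reduce summing actions per (course, day).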