A very warm welcome and good morning to one and all present here. Today I stand in front of you with all sorts of mixed emotions streaming through my mind, as we have come to the end of the Eklavya Summer Internship 2014. I still can't believe that this internship has finally come to its conclusion. Being in the habit of getting nostalgic every now and then, I flash back to the very first day I came here, when my learnings and lessons started, and now, at the end of it, I can proudly say that this internship has given me some pivotal qualities that I will carry for the whole of my life, qualities that will give me the intense desire to attempt, achieve, conquer, maybe fail sometimes, but still stand back up and fight till the end. I'm sure you all agree with me. IIT Bombay has given us all more than what we expected; we have all come here and we have all found our own achievements, guides and paths for life. I'm Aayushi Garg, an intern with the EkShiksha team, and I'll be your host for the following two days.

Now, without wasting much time, let us start with what we are here for. Let me introduce the first group, the Fundamental Research Group, which is managed by Mr. Nagesh Karmali. Let me call the first team, Data Analytics Workbench for Educational Data, mentored by Ms. Sukla Nag, Ms. Firuza Aibara and Mr. Nagesh Karmali. edX is a massive open online course (MOOC) destination site and online learning platform that is open source. The edX platform generates tremendous amounts of data related to students, instructors, staff and courses, and analysis of edX data enables us to answer fundamental questions about how students learn. Now I'd like to request the team to come up and proceed with their presentation.

Good morning everyone. I'm really happy to be starting off today's presentations with the Fundamental Research Group's presentation on the Data Analytics Workbench for Educational Data. Before we go ahead, let's start with an introduction. As we all know, edX is an open source MOOC platform; unlike Coursera, Udacity and other such MOOC platforms, it is open source. We all know that it generates tremendous amounts of data: there are lots of logs being generated and a lot of backend processing going on. However, what we wondered when we first came to this project was: how tremendous is this data? Is it actually enough to call this a big data project?

So what we did when we started off was some simple number crunching, based on the test which all of us interns took around the 14th of May. We received that data, so we were able to do some computations on it. There were around 80 students, a 2 to 3 hour test, 120 questions, and about one and a half GB of logs; overall that amounts to around 0.02 GB of data per student. Then we checked how many students, on average, generally participate in an edX course, and we were surprised to find that it is around 40,000 students. When we started calculating with that number, assuming 2 to 3 hours of activity per student per week, and keeping in mind that students interact with many more things than the problems we interacted with in the test, we ended up with a total of around 750 GB per week, per course, and if you take this over a 3 month course, we can all clearly see that it is at the level of big data.
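For reference, the arithmetic behind this estimate can be reproduced in a few lines of Python, using the figures quoted above (all of them the presenters' rough estimates rather than measured values):

```python
# Back-of-the-envelope estimate of per-course log volume, using the figures
# quoted in the talk (rough estimates, not measured values).
test_log_size_gb = 1.5            # logs produced by the ~80-student intern test
students_in_test = 80
gb_per_student_session = test_log_size_gb / students_in_test   # ~0.019 GB per 2-3 hour session

students_per_course = 40_000      # typical edX course enrollment cited above
sessions_per_week = 1             # assume one 2-3 hour session per student per week

weekly_volume_gb = students_per_course * sessions_per_week * gb_per_student_session
print(f"~{weekly_volume_gb:.0f} GB of raw logs per course per week")   # ~750 GB
```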
So let's get to the main objective of our project: a data analytics workbench for educational data. When we got this project, the main aim was to create a workbench such that, when we receive a lot of edX logs and data, there is a complete process by which the workbench automatically loads that data and keeps it ready to be analyzed, not only by staff and other people, but also by researchers who can use it to get important results. Apart from that, we went a step further and actually visualized the data, so that it provides a first useful user interface through which people can see what is going on in the system.

Next come the tools for big data. I'm not going to spend much time here; we have simply listed the tools we used, both the big data technologies and the front end technologies used to visualize and demonstrate how our platform works.

The edX platform delivers two kinds of data: log data and database data. edX uses MySQL to store its relational tables and MongoDB as its NoSQL database, and all datetimes are stored in UTC format. The event tracking data contains all the log data that a student generates when interacting with the LMS platform, and these are the kinds of events into which the log events can be categorized. The database data can be divided into eight types, of which around seven are in SQL table format while the forum data is stored in the MongoDB database. This is basically the mind map of the SQL tables. We categorized the useful tables in three ways: the certificate data contains the generated certificate information, the user data contains the student demographics, and the courseware progress data contains the progress of a student in a particular course.

This is the ETL part of the data analytics system: first we extract the data, then we transform it, and then we load it into HDFS. The first part of the ETL platform was to provide a GUI, so that not only a programmer or a person who knows the system internals can use it, but also any end user who wants to start working on his research can directly start using it. That's why our first step was to provide an ETL GUI where you just have to give simple options, such as what type of data you are using and the basic path to your data. The GUI then automatically starts working through a Django backend, which we will come to later.

The next step of the ETL is shown by this block diagram, which is our data organizer. Once we receive the input, and as you can see there are different types of inputs here, it is sent into the data organizer, which first segregates the data according to its different types. Here we are talking about two formats of data: one is edX data packages and the second is locally generated data. The edX data package data is already provided in a fairly easy to use format, so what we have to do is just segregate it into its different types and prepare it in TSV format so that it can be directly uploaded onto HDFS, ready for the next loading and transforming phases.
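As a rough illustration of the data-organizer step just described, the sketch below walks a data package, buckets files by type, and stages them for upload to HDFS. The directory names, file extensions and HDFS path are assumptions for illustration only, not the team's real layout.

```python
# Toy sketch of the data-organizer step: walk an edX data package, bucket files
# by type, and stage them for upload to HDFS (illustrative layout, not the
# actual package structure).
import os
import shutil

BUCKETS = {".sql": "sql_dumps", ".json": "tracking_logs", ".mongo": "forum_data"}

def organize(package_dir, staging_dir):
    for root, _dirs, files in os.walk(package_dir):
        for name in files:
            ext = os.path.splitext(name)[1].lower()
            dest = os.path.join(staging_dir, BUCKETS.get(ext, "other"))
            os.makedirs(dest, exist_ok=True)
            shutil.copy(os.path.join(root, name), dest)

organize("edx_data_package", "staging")
# The staged directories can then be pushed to HDFS, for example:
#   hdfs dfs -put staging/tracking_logs /user/edx/raw/logs
```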
The next part, the locally generated data, was a little more difficult to handle, since it has not been organized beforehand. What we had to do was not only a recursive search to segregate the data, but also handle the SQL source files, which were handled using a big data tool known as Sqoop that imports them directly onto HDFS. The Mongo data, on the other hand, first had to be queried out of the MongoDB collections and converted into TSV format so that it is again ready for loading and transforming, and finally the local log data was again segregated and arranged on HDFS.

Then comes step three, loading the data. For the edX data packages, the SQL files at least could be uploaded directly into Hive tables, and the same goes for the locally generated SQL tables, since they follow exactly the same format. However, the biggest problem we faced was with the logs, the reason being that there are around 5 to 10 different types of events, and each of these events has a number of sub-events. If we had concatenated all the possible logs into one table, it would have come to more than 50 or 60 columns. For this reason, we arranged the logs according to the different events given here: as you can see, there are common events, student interaction events and instructor events, and everything had to be organized according to the type of event it was. All this was done using serializers such as json_tuple; apart from that we also required get_json_object, and a number of such tools were required for the MongoDB part, as I explained earlier.

This flowchart simply explains how it works. First, as I explained, the data organizer organizes the data; after that it is uploaded into your Mongo database; once it is there, a PyMongo connector is used to query this data and obtain the useful part, because there are a lot of fields which are not required right now. We obtain the useful part and provide it in CSV or TSV format, which can be directly uploaded into Hive tables. Finally, these are all the blocks we have discussed so far, connected together: it starts with the GUI, goes into the data organizer, the data organizer uploads the data onto HDFS, then after the initial loading it has to be pre-processed and a lot of cleaning has to be done internally, and finally it is sent into one common Hive platform which can be queried using Hive or Shark. Now starts the visualization part, which my teammate will handle.
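Before moving on, here is a rough sketch of the log-loading idea just described: each raw tracking-log line is a JSON string, and json_tuple and get_json_object pull out the fields for one event family (problem_check here) into its own Hive table. The table name raw_tracking_logs, its single line column, and the extracted fields are assumptions for illustration, not the team's actual schema.

```python
# Parse one family of tracking-log events into its own Hive table using
# json_tuple / get_json_object, run through the hive CLI (illustrative schema).
import subprocess

HIVE_QUERY = """
CREATE TABLE IF NOT EXISTS problem_check_events AS
SELECT username, ts,
       get_json_object(event, '$.problem_id') AS problem_id,
       get_json_object(event, '$.success')    AS success
FROM (
  SELECT v.event_type, v.username, v.ts, v.event
  FROM raw_tracking_logs
  LATERAL VIEW json_tuple(line, 'event_type', 'username', 'time', 'event')
               v AS event_type, username, ts, event
) parsed
WHERE event_type = 'problem_check';
"""

subprocess.run(["hive", "-e", HIVE_QUERY], check=True)
```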
For the data visualization part we used D3 and dimple.js. These are two JavaScript libraries, with dimple itself built on top of D3. First, the output of the Hive queries is transferred into TSV and CSV files, and then these files are used for plotting the graphs.

These are some of the queries that we implemented on the summer intern test data. The data was around 1.4 GB, and 37 queries were formulated and implemented; some of the important ones are mentioned below. We categorized the queries into parts such as student analytics, enrollment analytics, problem analytics and questionnaire analytics. Let me walk you through some of the queries and graphs we implemented. This graph shows the age distribution, that is, the age distribution of the users who have registered for a particular course. The next category is course analytics; one of its graphs is active users per day, that is, the number of users active per day in a particular course. Next is video analytics; one example of its queries is the number of users who used the transcript for a video, which is basically the number of users who used subtitles to understand a video in a particular course. Then we have problem analytics, with this graph: the response time of students solving quiz questions, by course. It is a stacked bar chart, and each stack shows a question and the response time on that question for a particular user. Next is enrollment analytics, shown here as a world map: it shows the number of users registered for a particular course from different countries. Then we have basic statistics. This is one of the most important graphs, because it will be used further for predictions: it shows the sequence followed by different users for completing or learning a course. The graph has nodes such as enrollment, navigation and problem, which are different types of events, and for each sequence that users follow, the intensity of the arrow shows how many users have gone through that path. Let me show a few more charts. This one is the correct responses for problems; it is based on the quiz data we were given, and for each question it shows the correct and incorrect responses for that problem in a particular quiz.

Since we had done the visualization using D3, we also tried doing it using Google Charts. Google Charts provides a large number of dynamic and more complex charts, where you have the flexibility of choosing what you want on the X axis and the Y axis, and the labels for them. Google Charts uses three JavaScript pieces: the JSAPI loader, the visualization library, and the library for the chart itself. To proceed with the charts, you have to prepare the data in the form of data tables and views; then you can customize it, instantiate it, and draw the charts, for which there are two functions. There are three ways to populate the data: you can populate it manually, from a CSV file such as the one we got as output of the Hive queries, or from Google Spreadsheets, which can also be queried.
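As a small illustration of the hand-off from a Hive query result to the chart layer, the sketch below reads a TSV produced by a query and emits the kind of row array a page could pass to google.visualization.arrayToDataTable(). The file name and the simple numeric coercion are illustrative assumptions, not the team's actual code.

```python
# Convert a Hive TSV result into a row array suitable for a Google Charts
# DataTable on the client side (illustrative file name and columns).
import csv
import json

def tsv_to_chart_rows(path):
    with open(path, newline="") as f:
        rows = list(csv.reader(f, delimiter="\t"))
    header, body = rows[0], rows[1:]

    def coerce(value):
        # Numbers should stay numbers so the chart scales them correctly.
        try:
            return float(value)
        except ValueError:
            return value

    return [header] + [[coerce(v) for v in row] for row in body]

print(json.dumps(tsv_to_chart_rows("response_time_by_course.tsv")))
```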
This is the interface that we developed to visualize using Google Charts, and this is the home page. On the next page, as you can see, you can select the data set and the chart type, and then choose whether you want to query or not. Once you submit, you are directed to the next page. We tried two kinds of data sets: one was the air passengers data set, which was a test data set, and the other was the response time data set, that is, the data from the quiz conducted at the beginning of this internship. Without querying, you get the bubble chart for the response time data set shown above, and with querying you get this one.

So we have done the querying part using Google Charts and we have loaded the data. To give the user an interface where he can query the data himself and see the output, we used Django. Django is used to integrate both data loading and data visualization in one web application. It is a Python based web framework, it can easily execute Python, R and Hive queries, and it is also capable of queueing processes: if a process takes a large amount of time we can put it in a queue, and when the previous process has completed its execution, the next process can start.

This shows the overall data loading part. First of all we get the input from the user, and then the workers of the queue, for which we have used a Redis RQ queue, put the job in the queue and check whether the queue is free or not. Depending on whether the queue is free, further processing is done: if the queue is free, it starts working as explained previously and loads the data, but if the queue is not free, then after some time gap it checks again whether the queue is free.

Then we have the data visualization part. On getting the input arguments, Django first checks whether the arguments are correct, and after that it decides which query is to be run. Once it decides that the input is correct and which function has to be called to generate the graph, two processes run simultaneously: one is the querying, which queries the database and generates the CSV or TSV file, and the other generates a template; it then combines both parts, the CSV file and the template, to give the final output. Next is the overall processing, which first decides whether a request is data visualization or data loading and then completes it.

This is just a screenshot of a graph we generated for student mark statistics. On the right-hand side you can see the input form, where the user can choose which course to look at, when the course started, and, in the last drop-down, which query to run. This is the message that gets shown when there is no data for a particular query.

For the optimization part, we used Shark and Spark, which reduce the query time considerably compared to Hive: for this particular query Hive takes around 48.9 seconds while Shark takes only 9 seconds, and for a simple query Shark takes 3 seconds while Hive takes 21 seconds. These are the results that we have got.
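A minimal sketch of the queueing arrangement described above, assuming Django together with the rq library ("Redis RQ") and a Redis backend; the view and job-function names are illustrative, not the team's actual code.

```python
# Enqueue a long-running data-loading job from a Django view using rq + Redis,
# so the web request returns immediately and jobs run one after another.
from redis import Redis
from rq import Queue
from django.http import JsonResponse

queue = Queue(connection=Redis())        # rq workers run separately: `rq worker`

def load_dataset(path, data_type):
    """Long-running ETL job: organize the files and load them into Hive."""
    ...

def start_load(request):
    # Enqueue instead of running inline; the next job starts only after the
    # previous one has finished.
    job = queue.enqueue(load_dataset, request.GET["path"], request.GET["type"])
    return JsonResponse({"job_id": job.get_id(), "status": job.get_status()})
```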
The technologies we used are the big data tools and technologies we have shown, and for the front end, Django, D3, dimple and Google Charts. My teammate has already explained the architecture of our system. We have implemented it on a single node cluster as well as a multi-node cluster, along with the ETL automator script: what the script does is automate all the data loading and data visualization processes in one user interface, for admin purposes as well as for end users like teachers and professors.

These are some of the future works that we have decided on. Sequential data mining, to find the sequence in which a student learns, so that student behavior can be predicted. Detecting undesirable student behaviors: for example, if someone gives all the answers in a quiz within a very short duration, there may be some sort of cheating, which we can decide after analyzing the data. Latent knowledge estimation, which is about estimating the knowledge pattern of the students. Detecting the possibility of student dropouts: if a student watches videos for a very short duration and is not interested in the course, then after the data analysis we can detect such situations. Using the MongoDB data of the edX part: MongoDB is basically a NoSQL database containing the forum data and the courseware related data, so we can use that data and apply natural language processing to find out which students are active on the forum as well as in the course. And finally, integrating the Django front end with the multi-node cluster: we have already set up a multi-node cluster, and we want the Django front end to work for the whole cluster. As a team, we are five members.

So you have used standalone data generated during the quizzes that were taken. Has any attempt been made to integrate it with the actual IIT BombayX edX instance?

Actually, the first thing we did was find out which types of data we are going to get as input, IIT BombayX and edX data. We spoke to Sukla ma'am; she told us that that data is going to be generated internally in IIT BombayX, and that a second source of data would be the Amazon data packages, that is, edX data packages downloaded from the Amazon EC2 servers. So there are two forms of data input, as I discussed earlier. For the Amazon data packages, I had shown you that block diagram earlier, and for the locally generated data which I spoke about, IIT BombayX specifically, we have created that system, which is a little more complex, because Amazon provides the data in a much more parsed format, whereas IIT BombayX data is in the form of SQL dumps and MongoDB dumps, which first have to be transformed and parsed by us.

I don't understand. There are thousands of courses running on edX, and a lot of big data is being generated. What analysis does edX itself provide on its data?
edX itself provides very simple analysis right now. It provides basic statistics, like the ones I showed you, which we are also providing; it provides basic demographic data, such as the ratio of females to males, and basic data about degree distributions. Basically, all the analysis it provides comes only from the SQL tables; there is no analysis they provide from their logs, and they don't generate any kind of visualization or analytics from their logs. At least until now, almost all the data analytics being performed is performed only on their SQL data. They are improving their framework right now, but it is still going to take a long time.

Okay, so where is the edX data available?

Basically, as I explained, there are two types of data. One is your locally generated data, which is your blended MOOCs data. Regular edX provides most of its data from Amazon EC2 servers, which they have online; you can download their data, but they only provide it to data czars, who are representatives of other colleges. Otherwise, any college which is itself hosting edX, such as IIT Bombay will now be hosting and MIT has been hosting, has its own locally generated data, which is actually the more important source for data analysis.

Sorry, the locally generated data you talk about is quizzes; these are MySQL tables. This is not large data.

No, in fact locally generated data is much larger, because locally generated data also has your logs in addition to your MySQL tables.

Then, if that data is large, what is the additional data in this Amazon source you talked about?

The Amazon data is basically the data of the international edX platform, in which people from different countries participate.

We are only concerned with edX. What does Amazon have to do with edX?

For your own data, which you generate from your own courses, everything which is your own connects directly with your own locally generated data. Amazon only provides the data which is generated in other courses.

They are providing data on their own courses; why analyze that?

It's optional; that's why we provided an option for both. There are researchers who may be interested in analyzing that data as well.

Forget that. You may be a fundamental research group, but I am a very practical person; I don't do research. So I should only integrate the portion of your software that handles what you call local data, and for that I don't have to beg anybody?

No, you don't have to. You get the data directly; it's got everything, it's got the logs and it's got the MongoDB dumps as well.

Even the edX data logs are in MySQL, right?

No, they are in JSON format. They have to be cleaned and transformed; there is a lot of processing which has to be done.

Which you have done already?

Which we've done. It's all integrated into an end-to-end system and it's there in the lab; we've integrated everything and it's ready.

My other question: there was testing done on IIT BombayX in two phases. Why did you not attempt to integrate your work during the testing phase? Or at least, my problem would have been...

Because, sir, the testing phase was when we had just come in here; that was when the test happened.

No, no, sorry. Testing just happened two weeks back.

We were not made aware of it.

We have a product to release on July 27. That product release needs some testing; testing is probably happening right now as we speak.

We were not aware of that.
The second thing I need to know is: how do I know you have covered all types of data?

As I said, if you had seen our future goals, we missed out just one type of data, and that too only for the Amazon server data, which again, as I said...

Amazon server... as far as I know, IIT Bombay has got... I don't know whether, if I release the product, some data will go into Amazon.

No, it won't; it will be generated locally.

What are the challenges? The challenges which you faced in the process.

The first challenge was that there are two types of data; again, I won't go into those specifics. More importantly, the most difficult challenge we faced, as I explained earlier, was with the logs, because there were around 5 to 8 different categories of logs, and each of those categories has different types of events, subcategories you could say. Generally the notion would be that you put all your logs into one table and then you do the querying on that, but the problem we faced was that if we had put everything into one table, we would have come up with a table with more than 50 or 60 columns, and for each individual log around 20 to 25 of those columns would have remained null. So instead of doing that, we segregated the logs into different categories, so that we store the maximum amount of data rather than null values filling most of the database. That was one of the main challenges. The second was the MongoDB data. The MongoDB data does not come in any kind of TSV or CSV format; it is a NoSQL format, as we all know. So what we had to do was apply a Python Mongo connector and query out the important part of the data, because it has all the logs and dumps in it, and we only require certain things, such as which quiz it was, which week that quiz was held in, what the due date was, and what the start date of that quiz was. We had to extract that information and put it into useful tables. So our end product, in the case of MongoDB, was that we put it into a TSV format which can be directly loaded onto Hive. Thanks.
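As a closing illustration of the MongoDB step described in that last answer, here is a rough PyMongo sketch that queries out a handful of fields and writes them as TSV ready for Hive. The database, collection and field names are assumptions for illustration; the real modulestore schema varies by installation.

```python
# Query only the fields of interest from a MongoDB dump and write them as TSV
# for loading into Hive (illustrative database, collection and field names).
import csv
from pymongo import MongoClient

collection = MongoClient("localhost", 27017)["edxapp"]["modulestore"]

with open("problems.tsv", "w", newline="") as out:
    writer = csv.writer(out, delimiter="\t")
    writer.writerow(["problem_id", "display_name", "start", "due"])
    for doc in collection.find({"_id.category": "problem"}):
        meta = doc.get("metadata", {})
        writer.writerow([str(doc["_id"]), meta.get("display_name", ""),
                         meta.get("start", ""), meta.get("due", "")])
```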