Thank you for joining us here this afternoon. I am SVJ Lakshmi, and I am here today to talk to you about our project: a comparison of behavioral patterns based on gender, education, age and location. As the saying goes, talent wins games, but teamwork and intelligence win championships. So let me introduce my team members: Tanusha, Reshma and Somansh. Let's walk through the content overview. First, the objective. The famous philanthropist Warren Buffett has a great investment rule: if he understands a business, he will invest in it; if he doesn't understand it, he doesn't touch it. So one should understand when, where and how to adjust one's business, apart from starting or setting it up. Considering a massive application such as a MOOC, where millions of learners undertake a course, it is not the case that all users or learners take the course at the same pace. This is the reason why MOOC instructors and educational data mining technologies are interested in comparing the behavioral patterns of different categories of participants. So the objective of our project is to enhance the insights of IIT Bombay X by analyzing course activity based on gender, education, age and location. Moving on to the introduction to Open edX and IIT Bombay X: since you have all heard of Open edX and IIT Bombay X, we are skipping this slide in the interest of time. The technologies used in the project are Apache Hadoop, Hive, MySQL and Luigi. We are showing this demo in the development version; we couldn't fully deploy it in the production version because of some issues with the VM. We'll now see the architecture of IIT Bombay X, which will be presented by Somansh. Right now we don't have the image up, but basically the IIT Bombay X analytics architecture is composed of three main components: the LMS, the pipeline, and the applications. LMS stands for learning management system.
So basically the LMS is a platform where learners can view courses. It can also be used by the course team members, who can manage student enrollment and look at, for example, the discussion forums. Every time a learner interacts with the LMS (by, say, watching a video, enrolling for a course or attempting a quiz), each of these actions generates an event, and this event is stored in a log file. Now, this log file tracks events for every user, every single day, so the size is massive and you cannot store it in a normal file system. Instead, you store it in the Hadoop Distributed File System (HDFS); normally it is stored on a cloud, but for our purposes we store it in a local HDFS. All the data in it is then processed using Hadoop's MapReduce, and the results of the MapReduce jobs are stored in a relational database such as MySQL. Now that the data is in the database, we have to get it to the dashboard, which is the UI part; that's where the instructors can see all the insights. To get the data from the database to the dashboard, we use a data API based on the Django REST framework. We'll explain each of these components later, starting with the pipeline. Let's have a demo first, so that you understand what's happening, and then look at each component. Coming to the IIT Bombay X analytics pipeline, let me first tell you about the installation. Open edX provides a single script for installing analytics locally, but we did face some problems while installing. We fixed the script and completed the installation accordingly. We have implemented four tasks, as mentioned, and I'll go into detail in the later slides. Let us now see the architecture of the IIT Bombay X analytics pipeline.
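Concretely, each line of such a tracking log is one JSON object describing one event. A minimal sketch of parsing one line follows; the field names are in the style of edX tracking logs, but the sample values here are made up for illustration:

```python
import json

# A made-up sample line in the style of an edX tracking-log entry.
sample_line = ('{"username": "learner42", "event_type": "play_video", '
               '"time": "2016-03-01T10:15:00", "ip": "127.0.0.1"}')

def parse_event(line):
    """Parse one tracking-log line and pull out the fields the pipeline cares about."""
    event = json.loads(line)
    return event.get("username"), event.get("event_type"), event.get("time")

username, event_type, timestamp = parse_event(sample_line)
print(username, event_type)  # learner42 play_video
```

In the real pipeline, millions of such lines per day are read from HDFS and aggregated by MapReduce rather than parsed one at a time like this.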
The entire data generated by the IIT Bombay X platform is stored in the data stores shown on the left side. The data present in these data stores is copied into the Hadoop cluster along with the tracking log files. These tracking log files contain the events generated by the learners when they interact with the LMS; basically, every click of a learner on the LMS is considered an event. Once the data is available in the Hadoop cluster, we process it using MapReduce programs, and the resulting data is stored in Hive. Once we have the data in Hive, we import it into the MySQL database, which is then used by the data API to send the data to the dashboard to display the results. The logic behind the tasks we have written uses Luigi, which is a workflow management system used to run complex pipelines of batch processes. Now coming to the tasks we have implemented: initially, IIT Bombay X had a course activity task which gives the total number of learners for each category of activity for each week. What is this activity? Depending on the events a learner generates while interacting with the LMS, we have categorized the activity into five types: active, attempted problem, played videos, posted forum and resource usage. The course activity task that was initially present gives the total number of learners based on these categories of activity. We have broken this course activity down into four different tasks. The first task is gender, which gives the total number of learners by gender; we have categorized gender into male, female, others and unknown. Once this task is run, we get a MySQL table as described. The next task is course activity education level, which gives the learner count for each category of activity for each week based on the education level.
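As an illustration of the categorization step, the mapping from raw event types to the five activity categories can be sketched roughly as follows. The event names and rules here are assumptions for illustration only; the pipeline's actual mapping may differ (resource usage, for instance, is omitted for brevity):

```python
# Hypothetical mapping from raw event types to activity categories;
# the real pipeline's rules may differ.
EVENT_CATEGORY = {
    "play_video": "played_video",
    "problem_check": "attempted_problem",
    "edx.forum.thread.created": "posted_forum",
    "edx.forum.response.created": "posted_forum",
}

def categorize(event_type):
    """Return the activity category for one event.

    Any event at all counts as 'active'; specific event types
    additionally map to one of the finer-grained categories.
    """
    return EVENT_CATEGORY.get(event_type, "active")

print(categorize("play_video"))  # played_video
print(categorize("page_view"))   # active
```

The weekly counts are then produced by grouping these categorized events per learner per week.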
We took some education levels such as master's, bachelor's, doctorate, etc. in order to categorize the learners. Similarly, the third task is birth year, where we categorize the learners based on their birth year; when we run this task, the MySQL table as shown will be created. The fourth task is location, where we get the total number of learners based on the location they belong to, the city. In order to achieve this, we first need to run a task which obtains the country, country code and city by using a Python module known as pygeoip, to which we give the IP addresses of the learners, which we get from the tracking log files; given an IP address, it gives us the corresponding country, country code and city. Once we get the data from this initial task into HDFS, we run the course activity location task to get the final results. The further explanation will be given by Somansh. After these Luigi tasks are completed, as described, everything is stored in the MySQL database, and to get it from the MySQL database to the dashboard we use a data API based on the Django REST framework. Since that was already mentioned, I'll skip it and show you the work we have done. These are the six APIs that we have written. The four main ones are gender, education, age and country. We also have two helper APIs to list all the countries and all the locations within each country. Let me show an example: this is our age API. Basically, it gives the age of a user in four buckets: above 40, between 25 and 40, less than 25, and, in case we don't know the age, unknown. By default this API returns the usage for the most recent week, but you can provide start-date and end-date parameters if you want a custom API response. The result of this API is sent to the dashboard, which will be explained next. Dashboard.
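The bucketing behind the age API is simple; here is a minimal sketch, assuming ages are derived from the reported birth year. The bucket boundaries mirror the ones just described, but the function name and labels are ours, not the API's actual schema:

```python
def age_bucket(age):
    """Place a learner's age into one of the four buckets used by the age API."""
    if age is None:           # no birth year reported
        return "unknown"
    if age > 40:
        return "above_40"
    if age >= 25:
        return "between_25_and_40"
    return "below_25"

print(age_bucket(52))    # above_40
print(age_bucket(30))    # between_25_and_40
print(age_bucket(19))    # below_25
print(age_bucket(None))  # unknown
```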
Open edX Insights makes information about the courses available; let us see what the dashboard consists of. The dashboard consists of enrollment, engagement and performance. Enrollment consists of activity, demographics and geography; engagement consists of content and videos. Our task was to segregate the content by gender, age, education and location to view the behavioral patterns based on activities like playing a video, being active, attempting a problem, participating in the discussion forums and resource usage. The changes we incorporated in the dashboard: originally, the engagement content had only the summary view of the activities, and now the content is broken down into gender, age, education and location along with the summary. These are called the tertiary navigation items under the secondary navigation item, content. For each tertiary navigation item there are five quaternary navigation items: active students, played a video, attempted a problem, posted forum and resource usage. Behind the scenes, the whole idea is to render the data into a template, the form which is visible to the user viewing the page. But where do we get the data from? As Somansh mentioned regarding the data API, a call to the data API is made along with the required parameters. But the dashboard cannot directly extract the data from the data API; the data API client acts as the interface between the data API and the dashboard. The purpose of the data API client is to transfer the data from the back end to the front end, and it supports calling the APIs. The analytics client fetches the data from the data API and returns it in the form of JSON. This data is further processed for display: the first metric is the weekly student metrics, which shows the required trends in a graphical representation, and the second metric is the student activity metrics, which shows the recent week's information.
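As a rough illustration of that processing step, suppose the client has returned rows of weekly counts as JSON; turning them into the per-activity breakdown shown in the dashboard might look like the following. The field names and response shape here are assumptions for illustration, not the actual API schema:

```python
import json
from collections import defaultdict

# Made-up JSON in the shape of weekly activity rows; the real API's
# field names and structure may differ.
api_response = json.loads("""[
    {"week_ending": "2016-02-28", "activity_type": "played_video", "count": 120},
    {"week_ending": "2016-02-28", "activity_type": "attempted_problem", "count": 80},
    {"week_ending": "2016-03-06", "activity_type": "played_video", "count": 150}
]""")

def counts_by_activity(rows):
    """Aggregate weekly rows into total learner counts per activity type."""
    totals = defaultdict(int)
    for row in rows:
        totals[row["activity_type"]] += row["count"]
    return dict(totals)

print(counts_by_activity(api_response))
# {'played_video': 270, 'attempted_problem': 80}
```

The same rows, grouped by week instead of activity, feed the weekly trend graph.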
And the third metric is the content engagement breakdown, which shows the required counts in the form of a table. With these, MOOC instructors and educational data mining technologies can compare the behavioral patterns of different categories of participants. Future work in the pipeline: the current geolocation supports only country and city. We could obtain the state of the learner from the IP address by using databases available on the internet; a new field would then be added to the course activity location table, and this extension of the pipeline could be reflected in the dashboard graph. Right now we are using a graphical representation to view the behavioral patterns; instead of that, a geographical heat map could be used. Machine learning techniques could also be applied to reach more meaningful conclusions. Thank you. We'll quickly see the demo of the project, bypassing the authentication since this is the developer version. This is one of the courses available. Our task was to segregate the content. This is the education view, in which you can see labels like bachelor's, master's and so on. For all of the activities, like active students, played a video and posted forum, the values are shown; as you can see, the number is huge for master's. It is the same for age and gender. The location view is a little different, since we have to select three locations, and the comparison between the three locations can be seen from the graph. Selecting three locations from a country, Tanzania, you can see the number of active students in a particular location, the week-wise trends, the recent week summary and the content engagement breakdown. The rest is the same; it is similar for age and gender. Thank you. Okay, that is actually some good work, but you're not displaying what is important. So that's a problem. There is some good work here, but there is one question: your demo has got everything, but your slides have nothing.
That's what I'm trying to tell you. Basically, we thought of starting with the demo, so that once you have an idea of what's happening, you can look at each component one by one. So why do you have the same tertiary-level nomenclature under each of those secondary levels? You mean the labels, right? The activity? Yeah, the labels. So that's a form of filter: you want to look at a particular activity. So if I'm getting this by just visiting education, do I get the same thing? No, no. Suppose you visit education; then what is the tertiary level? Resource usage. And then I visit gender and I go to resource usage. Is it the same? Okay, as an example, let's say 500 students played a video. If you open gender, it may say 200 males, 150 females and the rest unknown. But under education, the same 500 will now be segregated into maybe 10 categories according to education level. So that is true across the board: the same 500 total is the same, but based on the filter you want, you classify according to education, gender or location. It gives you more insight into the type of students that you have. Yeah, but I think it is too complicated; maybe you could have figured out some other method of displaying it. Okay, we can take this off. So these labels are the same for each category: location, age, gender, whatever you're taking is the upper-level category, and then at the lower levels you're segregating depending on what activities they're doing. So if you take video activity, then for different locations, how the video is accessed is what the graph is showing. For location I can understand, because you're filtering it out; the other things have no filters. No, they're also filtered: male and female are filtered under gender. Just as location is filtered, gender is filtered, and education is also filtered. Yeah, everything is filtered.
No, can I see where you get the total count? Can I see it, or is it not there? Maybe on the laptop or something. Okay, thanks.