Hello everybody. I am Shukla Nang from the data analytics group. Today I am here to explain the projects we have shortlisted for you; these projects are about data analytics. We have two projects. One is chat and log analysis of the A-VIEW system; I will come to what the A-VIEW system is in a moment. The second is a comparison of the behavioural patterns of different categories of IIT BombayX MOOC students, where the categories are based on location, gender, education and performance. I think you know what a MOOC is: a massive open online course. We use the Open edX MOOC platform, which was developed at MIT, and our IIT BombayX is built on that Open edX system. We have many courses running on IIT BombayX, so we will take the data from there and do this analysis. Now I will explain each project separately.

First is the chat and log analysis of the A-VIEW system. A-VIEW is an advanced multi-modal, multi-platform collaborative e-learning solution used for distance learning. When the teacher teaches in a class, the live classroom is streamed as audio and video to the remote centres, and the number of remote centres is normally huge. IIT Bombay uses this to conduct T10KT, that is Train 10,000 Teachers, which is an MHRD project. The A-VIEW system has a video conferencing component and a chat, and we will have the data from all of those.

What data do we get from the A-VIEW system? First, the SQL tables: all the chat messages typed in the chat window end up in the SQL tables. Second, the log files, that is the application log files: whenever there is a problem or any notable event, the application logs it. Third, the videos, which come in video formats. We will not go into the video formats; we will mainly deal with the SQL tables and the log files.

What are our aims of analysis? The first is to categorize the remote centres on the basis of the number of queries they have raised to get problems solved. When they set up the A-VIEW system in the remote centres, they face a lot of operational problems, and they use the chat system to get them fixed. So we will analyse how many queries have come in about operational problems and categorize the remote centres by their number of queries; that gives us the operational performance of each remote centre. The second aim is to dig into the chats that are not related to operational problems. Chats can come either from the operators, who set up the system, or from the students, who can ask the teacher questions. We have to segregate these two kinds of chats, take the second kind, find out which remote centres and which subjects they belong to, and then do a sentiment analysis on them. That tells us whether the students are having difficulties with the subject or are appreciating something, and we will categorize the chats accordingly. The technologies to be used for this are Python, Pandas, NumPy, TextBlob, NLTK, scikit-learn and Matplotlib.
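To give you a rough feel of the kind of analysis involved, here is a minimal sketch using Pandas and TextBlob. The export file, the column names (remote_centre, message, is_operator) and the idea of a simple operator flag are assumptions made purely for illustration; the actual A-VIEW chat schema will have to be checked against the SQL tables.

    import pandas as pd
    from textblob import TextBlob

    # Assume the chat messages have been exported from the A-VIEW SQL
    # tables to a CSV file; the file name and columns are hypothetical.
    chats = pd.read_csv("aview_chats.csv")

    # Aim 1: rank remote centres by the number of operational queries,
    # pretending an is_operator flag marks chats raised by operators.
    operational = chats[chats["is_operator"] == 1]
    query_counts = operational.groupby("remote_centre")["message"].count()
    print(query_counts.sort_values(ascending=False))

    # Aim 2: sentiment of the remaining, subject-related chats.
    subject_chats = chats[chats["is_operator"] == 0].copy()
    subject_chats["polarity"] = subject_chats["message"].apply(
        lambda text: TextBlob(str(text)).sentiment.polarity
    )
    # Positive polarity suggests appreciation, negative suggests difficulty.
    print(subject_chats.groupby("remote_centre")["polarity"].mean())

In the real project you would replace the simple flag with a proper classification of operator versus student messages, but the overall flow of load, segregate, aggregate and score stays the same.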
Have you used these? How many of you have heard of them and used them? Quite a number. So have you used all of them, or only a few? Which ones have you used? Matplotlib and Pandas, you have used. So we will have some background then; that is good. For the benefit of the others: Pandas is a data analysis package, and NumPy is a scientific computing package; Pandas uses NumPy's N-dimensional array (ndarray) underneath. TextBlob is a text mining library with a lot of utilities that are useful for beginners doing data analysis. NLTK is a natural language processing toolkit, which also has many utilities that help you develop models. And scikit-learn goes a bit further than NLTK and helps in building many kinds of models for the analysis.

Now we come to the comparison study of the behavioural patterns of different categories of IIT BombayX MOOC students. In data analytics we are most concerned about where the data come from and which data we are going to use. Here we get the data from several sources. One is the user data, that is everything related to the user, such as gender, education and age; it is available in the MySQL database. Next is the course data, the course content, which is available in MongoDB; we will not be using the course content itself, only the user's interaction with the course. Then there is the user and course enrollment data, which also sits in MySQL, that is, which user has enrolled in which course; that will be useful for us. Then there is the user interaction summary data: Open edX generates some summary data of the user interactions, and sometimes we will use that. Next is the user performance data, that is, what grade the user got after completing the course; if they appear in the exam they are graded, and that grade is available in the MySQL database. Lastly, there is the user interaction detail data, which are the application log files generated on the server. These are the main data sources.

Next we come to the components of the IIT BombayX analytics system, because we have to understand exactly how this data is generated, so you should know the basics of the Open edX analytics system. It has four modules. The first is the LMS/CMS: LMS is the learning management system and CMS is the content management system, and they are normally deployed on one server. The LMS/CMS stands on its own and is used by the learners studying the courses; the other three modules are used only by the data analytics side. The second is the Open edX data analytics pipeline, which is actually a Hadoop cluster with an HDFS file system. The log files are huge in number, so we cannot analyse them from MySQL tables, because MySQL does not have that kind of capacity; all the log files come to the Hadoop file system and the analysis is done there. Basically, MapReduce tasks are run on this Hadoop file system, the summary data of the analytics is generated, and those summaries are written to MySQL tables. The third is the Open edX data analytics API, which is another layer, used mainly for security purposes. And the last layer is the edX data analytics dashboard, which is the visualization.
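Before we go into the dashboard in detail, let me make the pipeline part a little more concrete. The MapReduce tasks are conceptually simple; here is a toy Hadoop Streaming style mapper and reducer in Python that counts events per user per day from the tracking logs. This is only an illustration, assuming JSON events with username and time fields; the real pipeline tasks are considerably more involved.

    # mapper.py -- toy Hadoop Streaming mapper (illustration only).
    import json
    import sys

    for line in sys.stdin:
        try:
            event = json.loads(line)             # one tracking-log event per line
        except ValueError:
            continue                             # skip malformed lines
        username = event.get("username") or "anonymous"
        day = str(event.get("time", ""))[:10]    # assumes an ISO timestamp
        print("%s\t%s\t1" % (username, day))

    # reducer.py -- sums the counts emitted by the mapper.
    import sys
    from collections import defaultdict

    counts = defaultdict(int)
    for line in sys.stdin:
        username, day, value = line.rstrip("\n").split("\t")
        counts[(username, day)] += int(value)

    for (username, day), total in sorted(counts.items()):
        # These per-user, per-day totals would then be loaded into a
        # MySQL summary table for the API and dashboard to read.
        print("%s\t%s\t%d" % (username, day, total))

You would run these with something like hadoop jar hadoop-streaming.jar -input <logs> -output <out> -mapper mapper.py -reducer reducer.py; the actual pipeline wraps this kind of logic in its own task framework and scheduler.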
So the dashboard is where you visualize whatever results you get from the analysis; it shows all the graphs and tables for the data you have. Now, this data comes from the Open edX data analytics pipeline results, but the dashboard cannot access them directly. There is a layer in between which uses the REST framework, and the dashboard has to call that layer, the data analytics API, which gets the data from the pipeline and delivers it to the dashboard. That is the basic flow, and if you look at the diagram you will understand it better. This is the learning management system: it has event tracking, which produces the log files, and its state is in the SQL tables, which give information such as the user course list and user course enrollment. All of that data comes to the edX pipeline and is kept in intermediate storage, the HDFS file system. Then the scheduler runs the tasks, and those tasks create the summary results in MySQL tables; sometimes it keeps them in Elasticsearch as well, depending on the kind of data being generated. After that, the data is used by the edX analytics API; only that server can read it. edX Insights then gets the data through the edX analytics API using the REST framework. So this diagram shows what I just told you: it reads the user and enrollment data from the server and the user interaction data from the log files, and it runs the MapReduce tasks which create the analytics data.

Now, the present analytics dashboard, that is the visualization, has four modules: enrollment, engagement, performance and learner activity. Enrollment shows the day-wise enrollment, that is, how many students enrolled on which day. Demographics shows education-wise, age-wise and gender-wise enrollment, and geography shows country-wise enrollment, that is, how many people have enrolled from which country. Then there is engagement, which shows two kinds of engagement, content and video, all with a drill-down facility that starts from the course, goes to the course topics, from the topics to a particular article, and for that article, which video was used, and for that video, how many viewers completed it and how many did not. All of that analysis is shown on the dashboard. Then performance, also with a drill-down facility: it shows the grading policy, because every course has a grading policy, meaning the weekly or in-between tests carry some marks and the final exam carries some marks. After that it shows, course-wise and question-wise, the total number of correct and incorrect answers. This also starts from the course, goes down to the chapters, then to how many problems each chapter has, and for each problem, how many correct and how many incorrect answers have come. Those are all shown on the dashboard.
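Since these demographic and performance views are built from the same MySQL data that you will be slicing students by in this project, here is a rough sketch of pulling the category data with Pandas. The table and column names used here (auth_userprofile with gender, level_of_education and country, and student_courseenrollment) are what a typical Open edX installation has, but treat them as assumptions and verify them against the actual schema.

    import pandas as pd
    from sqlalchemy import create_engine

    # Placeholder connection string; use the credentials of your own
    # Open edX MySQL database here.
    engine = create_engine("mysql+pymysql://user:password@localhost/edxapp")

    # User categories: gender, education level, country (location).
    profiles = pd.read_sql(
        "SELECT user_id, gender, level_of_education, country "
        "FROM auth_userprofile",
        engine,
    )

    # Which user is enrolled in which course.
    enrollments = pd.read_sql(
        "SELECT user_id, course_id FROM student_courseenrollment "
        "WHERE is_active = 1",
        engine,
    )

    # Enrollment broken down by gender and education level for each course.
    merged = enrollments.merge(profiles, on="user_id", how="left")
    print(merged.groupby(["course_id", "gender", "level_of_education"]).size())

The same kind of join, with the grade data added, is what the per-category performance comparison would start from.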
The last module is the learner activity. Everything shown so far is course-wise summary data; the learner activity module is per individual user, that is, how much time each user is spending on which kind of resource, and they use Elasticsearch for this particular module.

Then, what are the things we have to do in this project? First, you have to change or create the pipeline MapReduce programs, along with the related utilities they use. You have to get the location, gender, education and performance data of the users from MySQL, use it in the MapReduce tasks, and put the final results into MySQL tables. Then you have to write REST-framework-based APIs for the data analytics API module, so that the analytics dashboard can call those APIs to get the corresponding data. And finally you have to change the dashboard as well, because now we have new data and new things to show. Those are the tasks for this project.

What technologies are required? Common to all modules are Python and the Django framework. For the pipeline you write MapReduce programs; for the data analytics API, the REST framework; and for the dashboard they have used quite a lot of technologies, so you will have to dig into the dashboard to see where they have used which technology and follow accordingly. You have to follow the Open edX framework throughout. So that is all. If you have any questions, you can ask me now. Do you want to know anything in a little more detail, or is there no question? Okay, so we will see you in the project then.
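One small pointer on the REST framework part before you go: the API layer you will write is standard Django REST Framework. Below is a minimal sketch of a read-only endpoint; the model, serializer, field names and URL pattern are hypothetical, and the real analytics API module has its own conventions which you should follow.

    # views.py -- illustrative sketch of a read-only analytics API endpoint.
    from rest_framework import generics, serializers

    from .models import CategoryEnrollmentSummary   # hypothetical summary table


    class CategoryEnrollmentSerializer(serializers.ModelSerializer):
        class Meta:
            model = CategoryEnrollmentSummary
            fields = ("course_id", "gender", "education", "location", "num_students")


    class CategoryEnrollmentList(generics.ListAPIView):
        """Per-category enrollment counts for one course."""
        serializer_class = CategoryEnrollmentSerializer

        def get_queryset(self):
            # Assumes a URL pattern that captures course_id, for example
            # /api/v0/courses/<course_id>/category_enrollment/
            course_id = self.kwargs["course_id"]
            return CategoryEnrollmentSummary.objects.filter(course_id=course_id)

The dashboard would then call such an endpoint over HTTP and render the JSON it returns.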