Welcome to the Learning Analytics Tools course. In this lecture we will talk about data collection in different learning environments. You might have seen that this course is offered in the learner-centric MOOC model. So, what is the learner-centric MOOC model? The video lectures you are watching now are called learning dialogues. That is the only place where we interact with you so that you can understand the concept, and each learning dialogue will have a reflection spot. We will pause the video in between and ask you a question, so that you can think about the answers and write them down on paper; then we discuss the generic answers that learners usually come up with. So, when we ask you to pause the video, please pause it and think about the question; that will help you learn better. Then there are learning by doing (LBD) activities: we will give you some problems or questions that you have to solve by applying the knowledge you gained in the learning dialogues. These are usually not graded, but we highly recommend that you do them, because they help you understand and apply what you learned in a different environment or on a different data set. The most important thing is that we learn not only by attending lectures but also by interacting with our peers, our friends in the class. To make the same environment possible here, we want you to interact with your peers in the forum. To help guide you to start discussions in the forum, we will post focus questions; please respond to others' comments, answer your peers' questions, like them and comment on them. So, we highly recommend that you use the discussion forums so that you interact with and learn from the other learners. That is called LXI, learner experience interaction. We also have learning extension trajectories, which are a very important part.
For example, we are in the 21st century, and we do not need to teach the way teachers did in the 20th century. In the 20th century the teacher was the source of knowledge. The teacher had access to books in the library, so the teacher had all the knowledge, and the students who sat in the class learned only from the teacher. Whatever the teacher said was taken as true; students could not go and verify it because they did not have access to the books or other sources. In the 21st century, however, students have as much access to knowledge and materials as teachers do, sometimes more: they can check the internet, watch videos and do a lot of other things. So, there is no need for the teacher to teach everything in the course. In fact, the teacher just has to guide the students and motivate them to learn a particular topic, so that the students can go and watch videos from other lecturers. There are also very good lectures and videos on YouTube or in other MOOCs that explain the same concept in a very beautiful way, which students appreciate and understand. Then why are we offering this course, when there are a lot of courses on data analytics? This course, learning analytics, is meant to make you think about how to collect data in learning environments and how to use tools on the data you collected. The other courses that exist online may not talk about how to collect data, or their domain is not education. So, in order to apply data analytics in the education domain, we are teaching this course. I said that there are a lot of other videos, which means that here we will give the basic motivation and the videos you need to understand what learning analytics is. For advanced users, or for novice users who want to understand more about a topic, we recommend that you go and watch other videos. That is called LXT.
We will provide resource links to those videos, and we will also have an assessment question based on them, which is not going to be graded. But I recommend that interested users go and watch the LXTs to understand the concept better and learn more about the topic we discussed in the lecture. So, this is about LCM, the learner-centric MOOC model, and this course will be offered in the LCM model. So, when you see LXI, LXT and LBD, please participate actively; just watching the videos and going for the exam will not help you learn the course better. So, let us start with an activity. Last week we assumed that you had access to data from a classroom environment, such as performance and attendance, and we used that data to describe the types of learning analytics. Now consider that you are a researcher or a teacher (you might be one already), and you are going to collect data from the classroom environment. What data will you collect, and how will you collect it? Think about it, pause this video, write down your answers and then resume to continue. Please consider both what data you collect and how you collect it. Suppose you want to collect students' marks in the mid-sem; you will collect them by conducting an exam. Similarly, list all the data that you can collect in the classroom environment and write down how you would collect each item. You might have answered performance, because that is the most useful data, the data you most want to predict, and the most meaningful data for understanding the students' knowledge. So, performance is the most common data; everybody thinks about it. How do you collect performance? We can collect it from the students' mid-sem or mid-term exam, where we have questions and the responses to each question.
If you can classify each question under a particular topic, that is, you map question numbers 1 and 2 to concept 1 and question number 3 to concept 2, you get richer data and can understand which concepts the student understood. Similarly, you will have a semester exam, that is, the end-sem exam for the course, or tests on subtopics of the course. For example, you break down your whole course or a whole chapter into multiple subtopics, and you conduct a test on each subtopic to understand the students' knowledge. Based on that, you can consider reteaching something. You might also have given projects to the students, and performance can be collected from the projects. Or you might ask them to present some topic in the class, and you might create a rubric to assess the students' communication skills and research skills: whether they did a proper literature review, and whether they were able to identify the gap in the literature. You can score them on these dimensions and use the scores as performance data. You can do an open-book assessment, where you ask them to use books and solve a complex problem in a timed manner, and assess their knowledge and how they apply what they learned in the class. Or you can conduct a quiz, a surprise quiz or something like that. In all these ways you can collect performance data. But that is not the only data we can collect from classrooms. We can collect the students' attendance. That is simple: you collect it by marking their attendance. You might also have the students' profile and background information, such as which year they are in, which department they belong to, what kind of school they come from, or their family background. You might collect this information from the admin department, or you can collect it from the students themselves. Students might also have a corresponding lab activity. That data can also be useful for predicting the students' performance.
So, you can collect data from the students' lab activities. You can also collect the students' engagement in the class by observing them and coding their engagement manually on a sheet. If you use a web camera to record the students' engagement in the class, then after the class you can code the recording manually or use some software to code it. Then there are the students' activities in Moodle or in other online environments, such as an LMS or a library system. This you can collect from the log data of the system. From the log data you can see what the activities are, how frequently students log in to Moodle, how many times they download or access a particular course material, and so on. You can also collect students' motivation and affect using human observation. Affect here means affective states, learner-centric emotions such as boredom, confusion and frustration. Human observers can code these live in a real class, or you can record the students' facial expressions using a web camera and sit down and code them after the class. The camera will be tricky because in a large classroom you may not be able to capture all the students' facial expressions, so it is better to use human observation in a live classroom. There are also co-curricular activities, not extra-curricular but co-curricular: students might be participating in some events related to the course, or they might be taking an extra course in a MOOC. You can use that data as well to understand the students' learning process and to improve your teaching process. So, we said that we can collect a lot of data. The question is, why do we have to collect this data from the classroom? It is good that we have access to a lot of data. We can create a nice database of all this information about all the students in the class, but why do we have to do it?
The main purpose goes back to the first week of this course, where we talked about descriptive, diagnostic and predictive analytics. You have to understand why you are collecting this data and how it can be used to predict something so that you can improve your teaching-learning process. Do you want to predict the students' performance in the final exam using their behaviours in the class, the behaviours we discussed on the last slide? Or do you want to predict which students will do better in the mid-sem exam? Or which students will do well in the quizzes? Or do you want to understand which student is struggling in the class, and on which topic, so that you can teach him or her better? The research goal is up to you. So, you first have to set your research questions, and then you collect data in order to answer them. So, let us move on to the other type of learning environment, MOOCs. MOOC stands for massive open online course. SWAYAM or NPTEL, on which you are taking this course, is actually a MOOC platform. Here students can access the course content from anywhere. So, now you know what a MOOC is. Now consider that you are a course administrator in a MOOC: you have access to the MOOC software, you know how to collect data, and you have a team of people to collect whatever data you want. As a course administrator, what data would you want to collect from the students participating in your course? That is, what kind of data would you collect from the students' interaction with the MOOC in your course? Please pause this video and write down the answers. Also write down how you would collect this data, not just what data: for example, how you would collect it from the log file of the MOOC. After writing it down, resume the video to continue.
The basic and very important data in any learning environment, be it a classroom, a MOOC or any other type, is the timestamp. In a classroom it is not possible to record timestamps at a very accurate level; recording at least the day, the time and the class session is good enough. But in an online environment like a MOOC or a technology-enhanced learning environment (TELE), we should record the timestamp of each action or activity the student does. Other than that, you also collect the learner ID, session ID and IP address. Each student will have a unique learner ID. The session ID is needed because, in the same MOOC, a student logs in multiple times: the course runs for 8 or 12 weeks and the student has to log in several times every week, so we have to know which session an action belongs to. The IP address is useful for knowing the location from which the student is accessing the course, which might be useful for adaptation or for providing feedback to the students. Let us say you want to understand the students' page-view behaviour, that is, which pages the students viewed in the course. Suppose you have a MOOC with a lot of content in PDFs, a lot of videos and a discussion forum. Which PDF content is the student reading, and on which pages or menus is he spending more time? So, we need to know which pages were viewed. How much time the student spent on each page can be obtained from the timestamp data you collected. Suppose the data says that the student viewed page 1 from 10:00 am to 10:02 am; then you know that the student spent 2 minutes on page 1. So, we need the timestamp and the page viewed in order to generate that data. The discussion forum also has very useful data. For example, the student commented, deleted a comment, replied to a comment, supported a comment or created a comment.
He started a thread, deleted, unfollowed, replied, updated: there are a lot of activities within a thread. Also, from forum search you can see whether the student is following some user or replying to the same user multiple times. This kind of information can be obtained from the forum data. Then there is the navigation information. For example, which page-to-page moves the student makes most frequently; whether, most of the time, the student goes to answer the assignment questions immediately after watching the videos; or whether, after the assignment questions, he goes back to a video, and which minutes of it he watches. So, is he watching videos in order to answer the questions? All this information can be captured in the MOOC; that is called navigation. Then there is video behaviour: are they playing or pausing the video, are they seeking from one place in the video to another, are they changing the speed, for example watching the video at 1.5x, and are they looking at the transcript or not? That kind of information can also be captured from video-watching behaviour. So, all this information can be captured in a MOOC. Simply put, the idea is to collect all of the learner's interaction with the system; that is called clickstream data. Wherever the student clicks a button with the mouse or types on the keyboard, just capture it all. Then you come up with the features that can be derived from this data, so that you can predict the students' performance, or predict which students are going to drop out, or whatever your research question is. So, simply collect the learner's interaction data using clickstream data capture. This is one type of data format, which is used for edX courses, but the raw data will be in a different format. Look at this raw data. It has a username, which we have hidden, and a browser, and the action name is seek video. I told you what seek video is.
Seek video means moving the video from one time position to another. Say he was watching the video at the first minute and now seeks to the third minute. So, the time he was at and the new time should both be recorded, that is, the old time and the new time, and the event type is seek video. This information can be captured from this log data. This is one type of format for log data, and a commonly used one. Let us look at this data. Take a minute, pause the video and try to identify what this log record means and what action the student is doing. Here the student name is again hidden. The event name is textbook PDF page scroll. So, the student is scrolling a PDF page. Which PDF? There is a PDF called pre-MLIAT Bombay PDF, and the direction is upward: with your mouse you can scroll upwards. So, he is actually reading the page. Depending on whether the student uses Mac OS or Windows, you can tell from the scroll direction whether the student is moving to the next page or going back to the previous page. So, this information can be captured from this log data. This is another type of data we are capturing. It is not strictly a click event: every action a student does, such as scrolling within a page view, is recorded, and we call it trace data. So, there are two types of data, clickstream data and trace data. In general, all the platforms that host MOOCs will help you collect this data. Unfortunately, NPTEL does not record all this user information, because the number of users in NPTEL is really huge and we do not have the space to keep all the data on the server. We might come up with new projects to collect all the data. But the courses offered on edX or Coursera may collect all this data. So, I was giving you examples to show that we can collect all this information. From this information, I want you to create log features in one specific format.
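As a rough illustration of the two records discussed above (the video seek and the PDF scroll), here is a minimal Python sketch that turns raw log lines into flat action records. The JSON layout, the field names (`time`, `session`, `event_type`, `old_time`, `new_time`, `direction`) and all the values are assumptions made for this example; they are not the actual schema shown on the slide or used by any particular platform.

```python
import json

# Hypothetical raw records, loosely modelled on the two examples in the lecture.
# Field names and values are assumptions, not a real platform's schema.
RAW_LINES = [
    '{"username": "hidden", "time": "2019-08-01T10:00:05", "session": "s1", '
    '"event_type": "seek_video", "event": {"old_time": 60, "new_time": 180}}',
    '{"username": "hidden", "time": "2019-08-01T10:03:40", "session": "s1", '
    '"event_type": "pdf_page_scroll", "event": {"page": 4, "direction": "up"}}',
]


def to_action(raw_line):
    """Turn one raw JSON log line into a flat action record."""
    record = json.loads(raw_line)
    return {
        "timestamp": record["time"],
        "user_id": record["username"],
        "session_id": record["session"],
        "action": record["event_type"],
        "context": record["event"],  # action-specific details
    }


actions = [to_action(line) for line in RAW_LINES]

# For a seek, the jump is the new position minus the old position (in seconds).
seek = actions[0]
seek_jump = seek["context"]["new_time"] - seek["context"]["old_time"]
print(seek["action"], "jumped", seek_jump, "seconds")
```

In this made-up record the learner seeks from the first minute to the third minute, so the jump is 120 seconds. A negative jump would mean the learner went back to re-watch an earlier part.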
The process is as follows. You collect the raw data. Then you convert the raw data into actions or events by writing a script, for example a Python script. So, the raw data you saw on the previous slides has to be converted into particular actions or events. What are these actions or events? Similar to the data we listed for the classroom environment, we need to identify features from this raw data. For example, you want to know the number of pages viewed in the last 10 minutes. How do you do that? You have to write a script over the log data you captured that counts all the page views from time x minus 10 minutes to time x; that count is a feature. Why would you want to know the number of pages viewed? That is domain expertise: you have to decide which features will be useful for predicting the students' knowledge or performance. Another example is the average time a student spent on page number 3. You have to capture every time the student reads page number 3, across all sessions, and compute the average time from that. Or, when a student is reading a page, you might classify the read as read long or read short. Why? Say I opened page number 1 and spent only one second on it. Do you consider that a read? Probably not. You would expect a student to spend at least a certain time, say 5 seconds, to read at least one line on that page. So, you can come up with thresholds to classify reads as long or short. For example, if the student stays on the page for less than 5 seconds, you can classify it as read short. If the student stays on the page for between 5 seconds and 1 minute, you can consider it a read; this is an actual read of that page.
If it is greater than 1 minute, that is 60 seconds, then you might say it is a read long: the student is reading this particular page for a long time. This threshold is based on your knowledge of the content; if the content has a lot of pictures or mathematical notation, reading it might take more than 10 minutes. So, your own knowledge is applied here. And if the student stays on the page for more than, say, 5 minutes, you can ignore that record: the student probably opened the page and then moved to another tab to do some other activity, such as watching a video, before coming back to the page. You should ignore that. Also, if the student spends less than, say, 2 seconds on the page, you can probably ignore that record as well. So, if you have the timestamp information and you know what the student is viewing, you can capture this kind of information, read short and read long. Why is this data useful? There may be some students who are not reading at all; they attempt the quiz or assignment questions and are not able to solve them. Now you know the reason: they do not read. Other students read for a long time, reading and reading, but do not take any assessment; you can send them a message asking, why don't you go and take a quiz? Some students are doing well: they read long, they take the assessments and they can pass the exam. So, you know the students' behaviour from the log data. You can also compute the number of comments in the forum; that is another piece of information you can capture. So, there are a lot of features you can come up with. How do you come up with these kinds of features? That is called domain expertise. Feature construction is not just a matter of capturing the raw file and using that raw file directly to predict something.
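The read short / read / read long idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the page-view log is invented, the time on a page is taken as the gap until the next page view in the same session, and the thresholds (5 seconds, 60 seconds, 5 minutes) are the illustrative values from the lecture, not recommended cut-offs. The optional rule for very short visits is left out to keep the sketch small.

```python
from datetime import datetime

# Thresholds in seconds, using the lecture's illustrative values.
# Real cut-offs depend on the content and should come from domain knowledge.
SHORT_MAX = 5       # below this: "read_short"
READ_MAX = 60       # from SHORT_MAX up to this: "read"
IGNORE_ABOVE = 300  # above this: learner probably left the page open


def classify_read(seconds):
    """Label one page view by how long the learner stayed on the page."""
    if seconds > IGNORE_ABOVE:
        return "ignore"
    if seconds > READ_MAX:
        return "read_long"
    if seconds >= SHORT_MAX:
        return "read"
    return "read_short"


# Hypothetical page-view log for one learner in one session: (timestamp, page).
views = [
    ("2019-08-01 10:00:00", 1),
    ("2019-08-01 10:02:00", 2),
    ("2019-08-01 10:02:03", 3),
    ("2019-08-01 10:02:33", 4),
    ("2019-08-01 10:12:33", 5),
]

FMT = "%Y-%m-%d %H:%M:%S"
labels = []
# Time on a page = timestamp of the next view minus timestamp of this view.
for (t_now, page), (t_next, _) in zip(views, views[1:]):
    start = datetime.strptime(t_now, FMT)
    end = datetime.strptime(t_next, FMT)
    seconds = (end - start).total_seconds()
    labels.append((page, seconds, classify_read(seconds)))

for row in labels:
    print(row)
```

With this invented log, page 1 (120 seconds) is a read long, page 2 (3 seconds) a read short, page 3 (30 seconds) a read, and page 4 (600 seconds) is ignored. The last view has no following timestamp, so its duration is unknown and it gets no label; in practice you would need a session-end event or a timeout to close it.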
Instead, it is about applying your domain knowledge and your expertise; that is why you need domain expertise in education or from teaching experience. Apply that knowledge to come up with the list of features. To get these features from the raw file, you might need to know how to write a script, for example in Python; that is why I said you might need a small bit of programming knowledge. But for this course we will give you all the features already extracted from the raw file. There is also a tool that is used very heavily in industry for feature engineering, called Featuretools. Please check that tool. This course does not focus on explaining it, because we extract the features based on our domain expertise. Featuretools might help you construct more features if you have knowledge of the domain; if you do not have any domain knowledge, the tool will not help you create useful features either. So, I mentioned that there is a specific format in which your data should be stored. The format is timestamp, user ID, session ID, action name. What is the action name? The action name may be reading, watching a video, taking a quiz, something like that. Each action might have a context. For reading, the context may be which page or page number is being read; for video, which video is being watched, whether the student is seeking or playing, at what time position and at what speed. All this information can be captured in the context of the action. The action names will come from your domain. For example, if I take a MOOC as the domain, I know that there are four major actions in a MOOC. One is video-watching behaviour, that is, play, pause, seek and similar behaviour while watching a video. Another is interaction in the forum, where commenting or creating a thread are the actions.
Then there is reading behaviour, where they might be reading a PDF or something like that, and navigation, where they move from one menu to another or from one tab to another. So, these are the four major actions; I might have four or five. You need to come up with those actions and combine them with time to create long or short actions. Actions can also be repeated multiple times, in which case you might add a suffix such as multi to the action name. Then you have the context of the action, that is, where it was done: which page number he is reading, which video he is watching, at what speed. That kind of information makes the collected data meaningful, and it can be used to predict something. What you want to predict is your research question: the students' performance in the classroom, their performance in a particular course, who will drop out next week, and so on. You also have to learn about pre-processing from other courses, such as machine learning or data mining courses; that is your LXT, external course content that you can go and watch. However, for educational data I would say you do not need to do much. Instead, you need to understand, if you have missing values, how to replace them. Some suggest replacing missing values with 0, and some suggest replacing them with the mean of the other values, but it depends on the data that is missing and on your domain knowledge. So, first try to understand why the values are missing, and then apply logic to decide what should replace them. I also recommend that you normalize all the data to the range 0 to 1. For example, the performance score is measured on a scale of 0 to 100, but the number of upvotes is measured on a scale of maybe 0 to 10, with values like 2 or 3. How do you compare these two scales in a single comparison? The machine learning algorithm may work better if you normalize everything to 0 to 1.
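The two pre-processing steps just mentioned, replacing missing values and rescaling to the range 0 to 1, can be sketched as follows. This is a minimal plain-Python illustration with made-up numbers; mean imputation is used only because it is one of the options named above, and whether it is appropriate depends on why the values are missing. Min-max scaling maps each value v to (v - min) / (max - min).

```python
def fill_missing_with_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


def min_max(values):
    """Rescale values linearly so the smallest becomes 0 and the largest 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: no spread to rescale, so return zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical data for five learners.
scores = [40, 55, None, 100, 85]  # exam score out of 100, one value missing
upvotes = [0, 2, 3, 10, 5]        # forum upvotes, a much smaller scale

scores_filled = fill_missing_with_mean(scores)
scores_scaled = min_max(scores_filled)
upvotes_scaled = min_max(upvotes)

print(scores_filled)
print(scores_scaled)
print(upvotes_scaled)
```

Here the missing score is replaced by 70, the mean of the four observed scores, and after scaling both features lie between 0 and 1, so a score of 100 and an upvote count of 10 both map to 1.0.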
Some suggest standardization instead. So, think about normalization versus standardization, and then apply one based on your requirement. So, in this video we talked about data collection in a classroom environment and in a MOOC environment. I also talked about how to extract features from the log file. We will discuss in detail what the features are and how to extract them, and I will show examples in the next video. Thank you.