This week, we will start with a prediction algorithm called linear regression. Then we will have a demo of a machine learning tool called Weka, followed by a small course project and a description of its data. In the last weeks, we saw what predictive analytics is. Now let us look at one predictive analytics algorithm, linear regression.
Let us start with an activity. You have access to data such as the attendance, engagement and performance of the students in a class. So, for a class of 60, you have each student's attendance over a period of time, their engagement in the class or in Moodle, and their performance in tests. Now, you would like to predict the students' performance in an upcoming exam based on their attendance and engagement. So, what data do you need, and how do you predict the students' performance in the upcoming exam? Please think about it and write down your answer. After you finish writing, resume the video to continue.
As you saw last week, to predict performance in this scenario we first need to identify and collect historical data, which in our case is attendance, engagement and performance. From the historical data, we compute the correlation between attendance and performance, and between engagement and performance. We consider a variable only if its correlation with performance is medium to high, say around 0.5 or more in magnitude, whether positive or negative; otherwise we may not consider it. So, the first step is to compute the correlation between pairs of variables such as attendance versus performance and engagement versus performance.
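The correlation check described above can be sketched in a few lines of Python. This is only an illustration: the six attendance and mark values are made up, the function name `pearson_r` is a hypothetical helper, and the 0.5 cut-off is the rough threshold mentioned in the lecture rather than a fixed rule.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Co-variation of x and y around their means
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Variation of each variable around its own mean
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical historical data for six students (not real records)
attendance = [20, 40, 50, 60, 80, 90]   # attendance percentage
marks = [30, 45, 50, 55, 75, 80]        # marks out of 100

THRESHOLD = 0.5  # assumed cut-off for "medium to high" correlation

r = pearson_r(attendance, marks)
# Keep the variable if the correlation is strong in either direction
keep_attendance = abs(r) >= THRESHOLD

print(f"correlation between attendance and marks: {r:.3f}")
print("use attendance as a predictor:", keep_attendance)
```

The same function can be called with engagement values in place of attendance to decide whether engagement is worth including.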
We do this correlation check as part of descriptive and diagnostic analytics, to see how the data actually looks and whether there is any relation between the variables. Only when we can state that there is a relationship between attendance and performance, or between engagement and performance, do we go to the next level, which is predictive analytics. You might also have very fine-grained data: not just the overall performance, but the students' performance on each question. For that, you might have data on the difficulty level of each question or the topic it covers. Are you asking questions on a topic you covered in class? Are you reusing questions? Students might have obtained the questions from their seniors or from last year's exam papers, prepared for them, and so do better even though they do not understand the concept. You can consider all these factors to predict the students' performance, or we can simply start with engagement or attendance in the class. From this data we develop a model, and then we extrapolate the model to predict future events. That is predictive analytics.
Let us talk about one such predictive analytics algorithm, an ML algorithm for predicting future events. The most basic and simple one is linear regression. Linear regression gives you an intuition of how performance is related to attendance or engagement. It is easy to understand, and it gives you an intuitive weight relating the variables. We will talk about the variables and the performance in detail in this video. We are given a data set of x and y, where x can be attendance and y can be performance. So, we want to predict the students' performance from their attendance.
If we have a single independent variable, the independent variable here is attendance and the dependent variable is performance. It is called dependent because performance might change based on the other variable; it is also called the target variable. So, taking attendance and performance, we have to come up with a linear regression model and identify the relationship between these two variables. The basic assumption is that there is a linear relationship between the independent variable and the dependent variable, that is, between attendance and performance. How do you check this basic assumption? You simply plot the data. If the points in the plot roughly follow a straight line, there is a correlation and you can say there is a linear relationship. If the plot looks like fully scattered dots, there is no correlation and the data may not be suitable for linear regression; you might need to use some other algorithm. Then we analyze the relationship between the dependent and independent variables to create the linear regression model.
Let us start with the simple linear regression model. In simple linear regression, we consider one independent variable and one dependent variable. The generic formula for simple linear regression is y = wx + c. It is the same y = mx + c that we used to draw a line on a graph in class 6 or class 7. Here y is the performance we are predicting, x is the attendance, w is the weight associated with x, which is the slope of the line, and c is the intercept. Suppose we have data for 6 students: their attendance and their performance. We can write the data as pairs (xi, yi): for student 1, what is the attendance percentage, say 80 percent or 60 percent, and what is that student's performance? Similarly, for 6 students you have 6 pairs of data.
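To make the roles of w and c in y = wx + c concrete, here is a minimal sketch. The weight 0.5 and intercept 10 are arbitrary values chosen only for illustration, not values from the lecture's data, and `predict_performance` is a hypothetical name.

```python
def predict_performance(attendance, w, c):
    """Simple linear regression model: performance = w * attendance + c."""
    return w * attendance + c


# Assumed example values, not fitted to any real data
w = 0.5    # slope: marks gained per extra percentage point of attendance
c = 10.0   # intercept: predicted marks at 0 percent attendance

at_zero = predict_performance(0, w, c)       # equals the intercept c
at_eighty = predict_performance(80, w, c)
# Increasing attendance by one point changes the prediction by exactly w
gain_per_point = predict_performance(81, w, c) - at_eighty

print("prediction at 0% attendance:", at_zero)
print("prediction at 80% attendance:", at_eighty)
print("change per extra attendance point:", gain_per_point)
```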
In linear regression, the goal is to find the line that best fits this data. Let us look at the data. We have data for 6 students: for an attendance percentage of 20, the mark out of 100 is 30; for an attendance percentage of 40, the mark is 45. Similarly, for the other students we have the mark for each attendance percentage. Just by plotting the data in a scatter plot, we can see that the students who attended class regularly scored well in the exam. You can see there is a relationship: when the attendance percentage increases, the marks also increase. If you can see a linear relationship like this, you can apply the linear regression model to this data. Let us try to fit a line to this data. Say there is a line that passes close to almost all the points, with an intercept of, say, 15 here and some slope, and this line looks good. So, is this linear model correct? How do you verify it? How do you validate it? Please think about it and write your answer on paper. After completing the task, please resume the video to continue.
So, which model is correct? Here we have two lines fitted to this data, the red one and the blue one. Which one is correct? Let us do the linear regression calculation. To identify the best-fit line, we use the least squares method. The least squares method is very simple: it is nothing but comparing the value predicted by the fitted line with the actual observed value. For example, I have drawn this particular line, and here are the six data points. If you look at this line at an attendance percentage of 20, the mark is almost equal to 35. So, according to this linear regression model, a student with an attendance percentage of 20 is expected to get a mark of 35.
However, from the data we know that for an attendance percentage of 20 the mark out of 100 is only 30, while this model predicts 35. So, we take the predicted mark, 35, minus the actual mark, 30; the difference is 5, and we square it. We square it because sometimes the predicted mark will be lower than the actual mark and sometimes higher; squaring makes every difference positive, so errors in opposite directions do not cancel out. The squared difference here is 25. If you sum this squared error over every point, x1, x2, x3, x4, x5 and x6, you get the sum of squared errors for the line. Choosing the line with the smallest sum is the least squares method. If there are 2 or 3 candidate lines for this data, you select the line with the least error. For example, for this particular line the error at x1 is 25; similarly, you compute the error at x2, x3, x4, x5 and x6 and add everything up to get the error value for this particular linear regression model. You can then draw other lines and compute the error value for each of them. However, in practice the best line is not found by drawing lines like this. There is a detailed mathematical description of how to compute the slope and the intercept directly, using quantities such as the standard deviation, but we are not going to discuss it in this course. The basic idea is that to find the best-fit line you need to find the line with the least squared error. This is the basic concept of linear regression.
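The least squares comparison described above can be sketched as follows. Only the first two data points (20, 30) and (40, 45) come from the lecture; the other four are made up. Line A uses the intercept of 15 from the lecture with an assumed slope of 1.0, chosen so that it predicts 35 at an attendance of 20; line B is an arbitrary second candidate. The function `fit_line` uses the standard closed-form formulas for the least squares slope and intercept, which the lecture does not cover, and all function names are hypothetical.

```python
def predict(x, w, c):
    """Value of the line y = w * x + c at x."""
    return w * x + c


def squared_error(x, y, w, c):
    """Squared difference between the predicted and the actual value."""
    return (predict(x, w, c) - y) ** 2


def sum_squared_errors(xs, ys, w, c):
    """Total squared error of one candidate line over all data points."""
    return sum(squared_error(x, y, w, c) for x, y in zip(xs, ys))


def fit_line(xs, ys):
    """Closed-form least squares slope and intercept for one variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    w = sxy / sxx
    c = mean_y - w * mean_x
    return w, c


# First two pairs are from the lecture; the rest are hypothetical
attendance = [20, 40, 50, 60, 80, 90]
marks = [30, 45, 50, 55, 75, 80]

# Candidate lines as (slope, intercept); slopes are assumptions
candidates = {"line_a": (1.0, 15.0), "line_b": (0.75, 12.0)}

errors = {
    name: sum_squared_errors(attendance, marks, w, c)
    for name, (w, c) in candidates.items()
}
best_candidate = min(errors, key=errors.get)

# The least squares line can do no worse than any hand-drawn candidate
w_fit, c_fit = fit_line(attendance, marks)
fit_error = sum_squared_errors(attendance, marks, w_fit, c_fit)

for name, err in errors.items():
    print(f"{name}: sum of squared errors = {err}")
print("better candidate:", best_candidate)
print(f"least squares line: y = {w_fit:.4f}x + {c_fit:.4f}, error = {fit_error:.4f}")
```

Because the data here is partly invented, the fitted slope and intercept will not match the values the lecture obtains from its own data.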
You can compute a linear regression in existing software such as Microsoft Excel. We used the same data and obtained the linear regression model y = 0.7494x + 12.402. Here 0.7494 is the weight, the slope of the line, and 12.402 is the intercept: if you extend the line back to x = 0, it meets the y-axis at 12.402. The intercept decides where the line starts from. The coefficient reported for this fit is 0.9515, which indicates a very strong correlation between attendance and performance. Note that this is not actual data; we populated it just to demonstrate linear regression. In one of the LBDs this week, we might give you other student data and ask you to use Excel to create a linear regression model, so that you learn how to use Excel for linear regression.
The next topic is multiple linear regression. Sometimes the dependent variable, performance, depends on multiple independent variables. For example, performance depends on engagement in the class as well as on attendance. The engagement can be collected from the students' engagement in the class or in an LMS such as Moodle or Blackboard. So, we have multiple independent variables: performance is the dependent variable, and engagement and attendance are the independent variables. The generic equation for the multiple linear regression model is performance = w1 × x1 + w2 × x2 + c, where x1 is the attendance value, x2 is the engagement value, and c is the intercept. Each weight plays the role that the slope played in the simple regression model. We train the model to learn these weights from the existing data: both the weight values and the intercept value are learned from the historical data.
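Learning the weights and the intercept from historical data can be sketched as below. This is one possible approach, solving the least squares normal equations directly in plain Python; it is not necessarily how Excel or any other tool does it. The data is hypothetical: the marks were generated from assumed weights 0.5 and 0.3 and an intercept of 10, so that we can check the fit recovers them. The function names are also hypothetical.

```python
def solve_linear_system(a, b):
    """Solve a * x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b], copied so the inputs are not modified
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("system is singular; weights are not unique")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= factor * m[col][k]
    # Back substitution
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_multiple(rows, targets):
    """Least squares fit of y = w1*x1 + ... + wn*xn + c.

    Returns (weights, intercept).
    """
    # Append a constant 1 to each row so the intercept is learned too
    design = [list(r) + [1.0] for r in rows]
    p = len(design[0])
    # Normal equations: (X^T X) coeffs = X^T y
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(p)] for i in range(p)]
    xty = [sum(d[i] * t for d, t in zip(design, targets)) for i in range(p)]
    coeffs = solve_linear_system(xtx, xty)
    return coeffs[:-1], coeffs[-1]


def predict(row, weights, intercept):
    """Predicted performance for one student's (attendance, engagement)."""
    return sum(w * x for w, x in zip(weights, row)) + intercept


# Hypothetical historical data: (attendance %, engagement score)
history = [(20, 30), (40, 20), (50, 60), (60, 40), (80, 90), (90, 70)]
# Marks generated as 0.5 * attendance + 0.3 * engagement + 10 (assumed)
marks = [29, 36, 53, 52, 77, 76]

weights, intercept = fit_multiple(history, marks)
new_student = (70, 50)
predicted = predict(new_student, weights, intercept)

print("learned weights (attendance, engagement):", weights)
print("learned intercept:", intercept)
print("predicted mark for", new_student, "=", predicted)
```

With real data the marks would not lie exactly on a plane, so the learned weights would be the best compromise in the least squares sense rather than an exact recovery.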
We then apply the trained model to predict the students' future performance in the next class or the next exam. Performance need not depend only on attendance and engagement; it can depend on many variables xi, with i from 1 to n. These could be the students' assignment submission rate, the students' number of upvotes in the discussion forum, or their performance in the midterm exams. A lot of factors can be involved. Each weight wi indicates the strength of that independent variable's effect on the dependent variable. This is a very beautiful feature of linear regression, and it makes the method very intuitive: when we have multiple independent variables, such as attendance and engagement, we can see which one has more strength on the dependent variable, performance. For example, w1 indicates the strength of attendance on performance. However, linear regression has drawbacks: it assumes that the relationship between the variables is linear, and we cannot use it if the number of independent variables is too large, in particular more than the number of samples we have. So, if the number of independent variables is really high, you might need to consider some other algorithm. Thank you.