Assalamu alaikum. Welcome to lecture number 15 of the course on statistics and probability. Students, you will recall that in lecture number 3 I began with you the discussion of the various ways of summarizing and describing data. For the most part our discussion pertained to the analysis of a univariate situation. That is, we talked about the central tendency, the dispersion, the skewness and the kurtosis of one single variable. But students, I am sure you will agree that in many situations our interest is not in just one single variable; we are interested in studying situations where a number of variables are related to each other. For example, if you consider the yield of a crop, you will agree that the yield depends on many factors: the fertility of the soil, the amount of rainfall, the quantity of fertilizer used and other variables. So, in this regard, the topic that we will be discussing today is regression and correlation. This is a very important area of statistics, and we will be discussing the simplest situation, that of simple linear regression and correlation, where we establish the relationship between two variables only. Alright, let us begin this interesting topic by picking up an example. An important concern for any pharmaceutical company producing drugs is to determine how a particular drug will affect one's perception or general awareness. Suppose one such company wants to establish a relationship between the percentage of a particular drug in the bloodstream and the length of time it takes to respond to a stimulus. As you know, there are many drugs that have side effects, so there is a concern about a drug that creates some drowsiness, and this study is very interesting. For example, if you prick the patient, how much time does it take to respond to this prick? That is why the data that you see on the screen regarding the reaction time is in milliseconds.
Suppose that the company administers this drug to five particular patients and obtains the following information. Subjects A, B, C, D and E have 1%, 2%, 3%, 4% and 5% of the drug in the bloodstream, and against these the reaction times to respond to a particular stimulus are 1, 1, 2, 2 and 4 milliseconds. Students, as you will agree, in this example the reaction time is the variable which depends on the percentage of drug in the bloodstream. So, we say that reaction time is the dependent variable and we denote it by y, and percentage of drug in the bloodstream is the independent variable and we denote it by x. In order to determine the nature of the relationship between the dependent variable y and the independent variable x, students, the very first step is to draw the scatter diagram. It is a very simple diagram. You take the independent variable x along the x-axis and the dependent variable y along the y-axis, and you simply plot all the ordered pairs that you have as points on the graph paper. So, going back to the example that I just discussed, the scatter diagram will be of the form that you now see on the screen. The point to understand is that for subject A the ordered pair (x, y) is equal to (1, 1), and hence our first point on the graph paper will have abscissa equal to 1 and ordinate also equal to 1. For subject B the point is (2, 1), for subject C it is (3, 2), for subject D we have (4, 2) and, last but not least, for subject E the point is (5, 4). Students, the scatter diagram is a very useful tool for judging the nature of the relationship between x and y. In the diagram that you just saw, you will have noted that there is an upward trend. What do I mean by an upward trend? It means that as x increases, y also increases, and this is exactly what you just saw.
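The plotting step described above can be sketched in a few lines of Python. This is only an illustration: the function name `text_scatter` and the text-based layout are assumptions made here, not something shown in the lecture, and a real scatter diagram would normally be drawn on graph paper or with a plotting tool.

```python
# Data from the lecture's drug example.
x = [1, 2, 3, 4, 5]   # percentage of drug in the bloodstream (independent variable)
y = [1, 1, 2, 2, 4]   # reaction time in milliseconds (dependent variable)

# Ordered pairs (x, y), one per subject A to E.
points = list(zip(x, y))

def text_scatter(points):
    """Return a rough text scatter diagram: y runs up the side, x along the bottom."""
    max_x = max(px for px, _ in points)
    max_y = max(py for _, py in points)
    rows = []
    for level in range(max_y, 0, -1):   # the top row is the largest y value
        cells = ["*" if (col, level) in points else "." for col in range(1, max_x + 1)]
        rows.append(f"{level} | " + " ".join(cells))
    rows.append("    " + " ".join(str(col) for col in range(1, max_x + 1)))
    return "\n".join(rows)

print(text_scatter(points))
```

Each `*` marks one subject, and the stars climb from the lower left to the upper right, which is the upward trend mentioned above.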
Of course, you will say that the points go up and down, but if you look carefully you can see a linear trend in this data, and this fact is depicted in the figure that you now see on the screen. Students, why is it that the points that you saw on the graph are not all lying on the straight line? This is a very fundamental point. You must realize that in psychology, sociology, economics and many other social sciences, you will 95 or 99 percent of the time not have an exact linear relationship, but you will be able to judge that overall your data follows a linear pattern. Many times the pattern will be linear, sometimes it may be parabolic, like a curve, and sometimes you will have some other pattern. But why is it that the points are not all lying on that mathematical curve or line? I would like to explain this to you with the help of an example. Consider the case of a few firms who are spending a certain amount on research and development (R and D). If you take the data on R and D expenditure and profit margin, you will find that even where the R and D expenditure is exactly the same, the profit margins will still be different. These firms are operating under different conditions. The efficiency of the firm, the goods being produced, the firm's share in the market: all these things affect the profit margin of the firm. So, although the R and D expenditure was exactly the same for these few firms, their profit margins are different. This is exactly the point that I was conveying earlier: for one single value of x, you will have different values of y. Even if your x is fixed, the y variable is a random variable. And what I have just told you points to a very important mathematical point with reference to regression analysis, and that is that your x variable should be a non-random variable, whereas your y variable will be a random variable.
Let us go back to the same example, which you saw on the screen a little while ago. Students, if we are saying that this diagram follows an overall linear pattern, then obviously our objective will be to find the equation of the line that passes through this diagram. I am sure that all of you know the basic equation of a straight line, and that is y is equal to m x plus c, where m represents the slope of the line and c represents the y intercept. If we rename c and m as a and b, the equation becomes y is equal to a plus b x. By the slope of the line, we mean the tangent of the angle theta, where theta is the angle between the line and the horizontal axis. And by the y intercept, we mean the distance between the origin and the point where the line intersects the y axis. Students, there is now a very important point to note, and that is that through the same scatter diagram we can have many lines. As you now see on the screen, for the one scatter diagram, which had only five points in the example that we considered in the beginning, you can have one, two, three and many more lines passing through it, and all of them more or less reflect the linear relationship that exists between x and y. So the question arises: which of these lines should we select as the one which best represents this particular data set? Students, for this purpose we use a method called the method of least squares. The line that we obtain by this method is called the regression line of y on x, and this entire procedure is called simple linear regression. Now, what is meant by the method of least squares? In this regard, the first thing to note is that we would like to draw a line which passes through the points, that is, some points should be above the line and some below it; it should be a kind of average line. Only then will we feel intuitively that it is a good representative of this data.
Consider this point with reference to the distances that exist between any individual data point and the line. As you now see on the screen, some points are above the line and some below, and you have vertical distances between all these points and the corresponding points on the line. The important point is that if my line passes through the point (x bar, y bar), that is, if my line passes through the mean point of the data, then the sum of the positive and negative deviations will come out to be 0. But the problem is that, as I indicated earlier, there can be many lines for which the sum of the deviations of the actual y values from the corresponding y values obtained from the line is 0. So this criterion is not sufficient for deciding which line to pick. There are many lines that all pass through (x bar, y bar) at slightly different angles, but they all satisfy the criterion that the sum of the deviations of the observations from the fitted line is 0. So we need another criterion to get the best fitting line and, as I said earlier, this is the method of least squares. What we do is square each and every one of these deviations before we find their sum. Students, what we find is that the sum of the squares of the deviations of the actual data points from the line will be a minimum for one particular line, and that line is called the best fitting line. This will become clear from the diagrams that you now see on the screen. In the first diagram, you note that for the four points A, B, C and D of the scatter diagram, the line that passes through them in the direction that you now see generates two squares which are quite large, whereas two are not that large. But when we find the sum of these four squares, obviously that will be quite a large quantity.
Now if I rotate the line around the point (x bar, y bar), I get a new position, and in this new position, students, you find that the areas of the squares for A, B, C and D are not as large as before. Students, from this explanation it will be clear to you that although the points A, B, C and D remain exactly where they were, if you rotate your line there will be one position where the sum of the squares is a minimum, and this is exactly the principle of least squares. So now the question arises: how far do we rotate the line to reach the position where the squares become very small, that is, where the minimum sum is achieved? How will we find exactly that particular angle, students? For this, of course, we have to resort to mathematical techniques, and we take the derivative of a certain quantity in order to arrive at the particular equations that yield this best fitting line. But in this course I will not be discussing any of the derivations. I will simply give you the equations which are the result of these derivations. As you now see on the screen, the two equations which enable you to obtain the best fitting line are sigma y is equal to n a plus b sigma x, and sigma x y is equal to a sigma x plus b sigma x square. Students, these two equations are called the normal equations. But please remember that the word "normal" that I have used here has nothing to do with the normal distribution that I mentioned in an earlier lecture and which I will discuss in detail when we do continuous probability distributions. We will solve these two normal equations simultaneously in order to find the values of a and b, and these are exactly the two values that you need in order to determine the line which is the best fitting line to your data. The point to understand is that in these two equations the only two unknowns are a and b. All the other quantities will be available to you.
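The rotation argument above can be checked numerically on the five data points of the drug example. The sketch below is an illustration only: the three trial slopes (0.5, 0.7 and 0.9) and the helper names `deviations` and `sum_of_squares` are assumptions chosen here, not values or names from the lecture.

```python
# Data from the lecture's drug example.
x = [1, 2, 3, 4, 5]   # percentage of drug in the bloodstream
y = [1, 1, 2, 2, 4]   # reaction time in milliseconds

x_bar = sum(x) / len(x)   # mean of x
y_bar = sum(y) / len(y)   # mean of y

def deviations(slope):
    """Vertical deviations y - y_hat for the line through (x_bar, y_bar) with this slope."""
    return [yi - (y_bar + slope * (xi - x_bar)) for xi, yi in zip(x, y)]

def sum_of_squares(slope):
    """Sum of the squared vertical deviations for the same line."""
    return sum(d * d for d in deviations(slope))

# Each trial slope is one position of the line as it rotates about (x_bar, y_bar).
for slope in (0.5, 0.7, 0.9):
    print(slope, round(sum(deviations(slope)), 10), round(sum_of_squares(slope), 4))
```

For every trial slope the plain sum of deviations is zero up to rounding, so that criterion cannot separate the lines, while the sum of squares does change with the slope and, among these three positions, is smallest at 0.7.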
When you have a set of bivariate data, you already have a column of x and a column of y. You can multiply the x column by the y column to obtain the x y column, and you can square the x column to obtain the x square column, and once you have all these columns you will be able to find sigma x, sigma y, sigma x square and sigma x y. So, in these equations the only two unknown quantities are a and b, and when we solve the equations simultaneously and obtain a and b, we will have determined the line which is the best fitting line to our data. Let us apply all these points to the same example that we were discussing earlier. As you see on the screen, sigma x, the sum of the x column, is 15, sigma y is equal to 10, sigma x square comes out to be 55 and sigma x y is equal to 37. Substituting these values in the two normal equations, we obtain 10 is equal to 5 a plus 15 b, and 37 is equal to 15 a plus 55 b. You will have noticed that I wrote 5 a because n is equal to 5; n represents the total number of ordered pairs that you have, and in this particular example, as you know, we are talking about 5 patients to whom we administered various percentages of the drug. Solving the two equations simultaneously, b comes out to be 0.7 and a is equal to minus 0.1. Hence, the equation of our best fitting line comes out to be y hat is equal to minus 0.1 plus 0.7 x. You will have noted here that a hat has been placed on top of the y. Students, this is to differentiate between the actual y values pertaining to your data points and the y values that you obtain from the line corresponding to exactly the same x values. But I am sure that you must be wondering what the purpose is of fitting this line and going through all these lengthy calculations. Students, this line plays a very important role in estimation and prediction.
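The calculation above can be reproduced as a minimal sketch in Python. Solving the two normal equations by the determinant method (Cramer's rule) is a choice made here for brevity; the lecture only says that the equations are solved simultaneously, and the variable names are assumptions.

```python
# Data from the lecture's drug example.
x = [1, 2, 3, 4, 5]   # percentage of drug in the bloodstream
y = [1, 1, 2, 2, 4]   # reaction time in milliseconds

n = len(x)                                     # number of ordered pairs (5)
sum_x = sum(x)                                 # sigma x     (15)
sum_y = sum(y)                                 # sigma y     (10)
sum_x2 = sum(xi * xi for xi in x)              # sigma x^2   (55)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))  # sigma x y   (37)

# Normal equations:
#   sum_y  = n * a     + b * sum_x
#   sum_xy = a * sum_x + b * sum_x2
# Solved here with determinants (Cramer's rule).
det = n * sum_x2 - sum_x ** 2
a = (sum_y * sum_x2 - sum_x * sum_xy) / det    # intercept
b = (n * sum_xy - sum_x * sum_y) / det         # slope

print(f"y_hat = {a:.1f} + {b:.1f} x")
```

With the five data points of the example this gives a slope of 0.7 and an intercept of minus 0.1, matching the values quoted in the lecture.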
For example, in the example that we are considering, suppose that we are interested in determining the reaction time of a patient who has 4.33% of the drug in his bloodstream. As you now see on the screen, all we have to do is put x equal to 4.33 in our equation of the line, and doing so y hat comes out to be 2.931. So, you will have noticed that through this process of regression we have been able to estimate that a patient having 4.33% of a particular drug in his bloodstream will take 2.931 milliseconds to react to that particular stimulus. Students, in this entire regression analysis and in the estimation procedure that I have just shared with you, there is a very important point. One should be careful in the estimation process that the x value for which you want to find the estimated y value lies within the range of the x values which are available to you. In the example that we are considering at this point, you know that the available x values are 1, 2, 3, 4 and 5. If I want to estimate y for x equal to 4.33, that is alright. If I estimate y against x equal to 5.5, that too is acceptable. But students, if I estimate y corresponding to x equal to 10 in this example, that will not be very wise. What is the reason? The reason is that 10 is so far away from the highest x value available to me that there is no guarantee that the linear pattern which we have established in this example for the values 1 to 5 will prevail all the way up to x equal to 10.
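The estimation step and the caution about the range of x can be sketched as follows. The helper names `predict` and `within_observed_range` are hypothetical, introduced here only for illustration; the coefficients are the ones obtained in the lecture.

```python
# Fitted line from the lecture: y_hat = -0.1 + 0.7 x
a, b = -0.1, 0.7

# Range of the x values that were actually observed (1% to 5%).
x_min, x_max = 1, 5

def predict(x_new):
    """Estimated reaction time (ms) for a given percentage of drug."""
    return a + b * x_new

def within_observed_range(x_new):
    """True when x_new lies inside the observed x values, so no extrapolation is involved."""
    return x_min <= x_new <= x_max

for x_new in (4.33, 5.5, 10):
    note = "inside observed range" if within_observed_range(x_new) else "extrapolation"
    print(x_new, round(predict(x_new), 3), note)
```

The check only flags whether a value falls outside 1 to 5. Judging that 5.5 is still acceptable while 10 is not, as the lecture does, remains a matter of how far beyond the data one is willing to trust the linear pattern.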
As you now see on the screen, it is possible that the pattern becomes a curve rather than a straight line, and if you use the equation of the straight line to estimate the y value for an x value which is considerably far from the available x values, then your estimated value may be quite incorrect. Alright, now that you have understood the basic technique of regression, students, I want to draw your attention to another point. In all the discussion so far, we assumed that y is the dependent variable and x is the independent variable. That is, when we want to estimate y from x, we apply the equations that you just saw. But in some situations we may be interested in estimating the x variable from y. You may think that we can utilize this same equation to find x corresponding to any particular value of y, because you can take x to one side and y to the other and obtain an equation in which x is equal to something in terms of y; put the value of y in it and you get x. But students, this is not the thing to go for. The very important point is that if you want to estimate x from y, then you should regress x on y, and in this situation you will interchange the roles of x and y in your normal equations. I would like to encourage you to work on this on your own and to practice with one or two numerical problems.
Students, at this point I would like to discuss with you another important concept, and that is the concept of the standard error of estimate. The point is that we have fitted a line, and some points are above the line and some points are below it. So we are also interested in the degree to which they are scattered around the line. Are our points very close to the line?
Or are they scattered away from it? To measure this, we would like to compute what is called the standard deviation of regression or the standard error of estimate. As you now see on the screen, the standard error of estimate is defined as the positive square root of sigma of (y minus y hat) whole square divided by n minus 2, where y denotes an observed value and y hat is the corresponding value obtained from the least squares line. Students, from the structure of this formula you will have recognized that the basic concept here is the same as that of the standard deviation. We are summing the squares of the deviations of the y values from the estimated y values, the ones that we obtain from the line. But this formula is a bit cumbersome to apply, because in order to apply it we first have to find the y hat values corresponding to all the x values that are available to us. So, as you now see on the screen, there is an alternative formula: the standard error of estimate, denoted by s y dot x, is equal to the square root of the quantity sigma y square minus a sigma y minus b sigma x y, the whole divided by n minus 2. This notation s y dot x, students, is not very difficult to understand: if we are regressing y on x we write s y dot x, and if we are regressing x on y we write s x dot y. Let us now apply this concept to the same example that we have been considering. As you now see on the screen, the column of y square has been added to the four columns that we already had, and the sum of the y square column is 26. The values of a and b have already been found, and substituting all these quantities in the formula for s y dot x, the standard error of estimate for this particular problem comes out to be 0.61. Now the question arises: how will we interpret this value of s y dot x? The point here, students, is that s y dot x lies between 0 and s y, where s y represents the standard deviation of the y values.
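Both versions of the formula can be checked on the example data with a short sketch. One assumption is made here and labelled in the code: the ordinary standard deviation s y is computed with divisor n, which is the convention that reproduces the value 1.10 quoted in the lecture; the variable names are also choices made for this illustration.

```python
import math

# Data and fitted coefficients from the lecture's drug example.
x = [1, 2, 3, 4, 5]
y = [1, 1, 2, 2, 4]
a, b = -0.1, 0.7
n = len(x)

# Definition: square root of sigma (y - y_hat)^2 / (n - 2).
y_hat = [a + b * xi for xi in x]
s_yx_definition = math.sqrt(sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat)) / (n - 2))

# Shortcut: square root of (sigma y^2 - a sigma y - b sigma x y) / (n - 2).
sum_y = sum(y)
sum_y2 = sum(yi * yi for yi in y)               # 26
sum_xy = sum(xi * yi for xi, yi in zip(x, y))   # 37
s_yx_shortcut = math.sqrt((sum_y2 - a * sum_y - b * sum_xy) / (n - 2))

# Ordinary standard deviation of y. ASSUMPTION: divisor n, which matches the lecture's 1.10.
s_y = math.sqrt(sum_y2 / n - (sum_y / n) ** 2)

print(round(s_yx_definition, 2), round(s_yx_shortcut, 2), round(s_y, 2))
```

The two formulas agree, as they should for a least squares line, and the shortcut avoids computing the individual y hat values.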
Now, if our s y dot x is close to 0, it means that the data points are not deviating very far from the line, the dispersion is small, and we can say that our line is a good representative of those data points. But if our s y dot x is large and close to the upper limit s y, then it means that the data points are deviating from the line and the line is not very reliable for estimation purposes. In this example s y, the ordinary standard deviation of our y values, comes out to be 1.10. If we compare the s y dot x value of 0.61 with the s y value of 1.10, we find that 0.61 is not extremely small compared with this upper limit. Hence we conclude that in this particular example there is a certain amount of dispersion of the values around the line, that this dispersion is not negligible, and that the line is therefore not extremely reliable for estimation purposes.
We have talked about regression in considerable detail. The next concept, which is very important and closely related to regression, is the concept of correlation. As you now see on the screen, correlation is a measure of the strength or the degree of the relationship that exists between two random variables. You will have noted that I said it is the strength of the relationship between two random variables. This is a mathematical point: whereas in regression analysis y is a random variable but x is a non-random variable, in correlation analysis they are both random variables. For example, if you want to determine the strength of the relationship between the heights and weights of young children, you will realize that both height and weight will be regarded as random variables. Obviously, we would like to have a numerical formula to measure the strength of the relationship that exists between two variables. As you now see on the screen, the correlation coefficient is defined as the covariance of x and y divided by the standard deviation of x into the standard deviation of y.
The covariance of x and y is defined as sigma of (x minus x bar) into (y minus y bar) over n. It is a measure of the degree to which two variables vary together, and that is why we have the term covariance. The shortcut formula for Pearson's coefficient of correlation is sigma x y minus (sigma x into sigma y) over n, and this entire quantity divided by the square root of the product of two terms: sigma x square minus (sigma x) whole square over n, and sigma y square minus (sigma y) whole square over n. Let us apply this concept to an example. Suppose that the principal of a school is interested in ascertaining the correlation between marks in mathematics and marks in statistics, and he collects data for 9 students. The marks in mathematics are 5, 12, 14 and so on, whereas the marks in statistics for the same students are 11, 16, 15 and so on. In order to compute the correlation coefficient, we construct the columns of x square, y square and x y and obtain all the relevant sums, and substituting them in the formula for r, our answer comes out to be 0.88. To interpret this result, students, the first thing to understand is that r is a quantity that lies between minus 1 and 1. In the case of a direct or positive correlation between two variables, r lies between 0 and 1, but in the case of a negative or indirect correlation between two variables, it lies between 0 and minus 1. When I say indirect or negative relationship, I mean that as one variable increases the other decreases. For example, as the temperature of the city rises, the sale of woolen sweaters declines. In such a situation r will lie between 0 and minus 1. In the case of a direct relationship, on the other hand, the two variables vary in the same direction. For example, as the temperature of the city rises, the sale of ice cream rises. In such a situation r will lie between 0 and 1. The stronger the relationship between x and y, the closer r will be to 1 or minus 1. Graphically, the closer my points are to an upward going line, the closer my r will be to 1.
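The definition of r and its shortcut formula can be compared numerically. Because the full marks data for the nine students is not given in the lecture, the sketch below uses the five data points of the earlier drug example instead; that substitution, and the variable names, are assumptions made purely for illustration.

```python
import math

# ASSUMPTION: the drug example data is reused here, since the full marks data is not given.
x = [1, 2, 3, 4, 5]
y = [1, 1, 2, 2, 4]
n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Definition: r = covariance(x, y) / (s_x * s_y), all with divisor n.
cov_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / n
s_x = math.sqrt(sum((xi - x_bar) ** 2 for xi in x) / n)
s_y = math.sqrt(sum((yi - y_bar) ** 2 for yi in y) / n)
r_definition = cov_xy / (s_x * s_y)

# Shortcut formula using only the column totals.
sum_x, sum_y = sum(x), sum(y)
sum_x2 = sum(xi * xi for xi in x)
sum_y2 = sum(yi * yi for yi in y)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))
numerator = sum_xy - sum_x * sum_y / n
denominator = math.sqrt((sum_x2 - sum_x ** 2 / n) * (sum_y2 - sum_y ** 2 / n))
r_shortcut = numerator / denominator

print(round(r_definition, 4), round(r_shortcut, 4))
```

Both routes give the same value, roughly 0.90 for this data, which is a positive r and is consistent with the upward trend seen in the scatter diagram of the drug example.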
So, in this example, if r is equal to 0.88, or approximately 0.9, students, you will appreciate that this means there is indeed a very strong positive linear correlation between the marks in mathematics and the marks in statistics. I would like to encourage you to study this on your own and to go on to study not just simple correlation and regression, but also a little about multiple correlation and regression, where we try to study the relationship between not just two, but three or more variables. Today, we have completed the discussion of the first part of this course, and that is descriptive statistics. In all the lectures that we have had during the past weeks, you will have noted that we have learnt different ways of describing the data that you have collected on a sample basis. The next 15 lectures will be devoted to probability theory, a most important theory which will be the foundation for statistical inference. Until next time, my best wishes to you, and Allah Hafiz.