So, let us get going with today's discussion on models where we look for relationships between variables. Yesterday we discussed the essence of a hypothesis test: how you work out a null hypothesis statement, and how you figure out the best rigorous, systematic approach towards establishing the truth or falsehood of that statement. Today I am switching gears, because I want to talk about something most researchers are likely to encounter as part of their work, which is how to collect data, model it, and look for relationships between variables in the data. This involves the topics of regression analysis: how to build a linear regression model, for example, and also how to evaluate whether two variables are related closely enough and, if they are, how to figure out which variable is the cause of an effect. In today's talk I am going to limit myself to the basics of regression, and when I come back tomorrow I will complete this discussion with a focus on causality, in other words what is the cause and what is the effect when you study a collection of variables. The focus overall is on how variables are associated with each other, and ultimately where science goes with this is that, given a group of variables, we would like to know which variables cause or influence the values of other variables; that is causality. It turns out that this is actually quite a hard problem, one which has troubled philosophers over centuries, and it also turns out that mathematics is not well equipped to deal with the notion of causality in particular.

So, I am going to start by assuming that we are working with two random variables x and y; that is the simplest scenario we can have. These are variables of interest in whatever phenomenon you are looking at, you have the ability to go out and collect measurements of these two variables, and the problem you are going to encounter is to prove that the variable x is related to the variable y, in other words that there is an association between x and y. How do we even express such a relationship in mathematical notation? We will do this first today, and then we will discuss practical problems that most people have when they try to perform the act of simple regression: fitting a straight line to a collection of data points. In tomorrow's lecture I will come back and talk about the differences between fitting a straight line and making serious, systematic comments about what is the cause and what is the effect in a given phenomenon.

So, let us go back to simply working out relationships between variables y and x, and you will immediately realize something quite curious: if you take an equation y = Ax, this equation gives you exactly the same amount of information as an equation x = By, because the constants in there, the model parameters, can be written in terms of each other: B = 1/A. That tells you that the two statements y = Ax and x = By are equivalent, and yet from your school days you have been told to put the variable x on the x axis and the variable y on the y axis, with the interpretation that x causes y.
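As a one-line check of that equivalence (assuming A ≠ 0, which the argument implicitly requires):

```latex
y = A x \;\Longleftrightarrow\; x = \tfrac{1}{A}\, y = B y, \qquad B = \tfrac{1}{A}.
```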
We are told that there is typically an independent variable in an analysis and a dependent variable, and you are told to put the dependent variable on the y axis and then ask how it changes as you vary along the x axis. But if you go back to my equations y = Ax and x = By, the amount of information contained in those equations is the same regardless of which variable you plot on which axis. So surely a relationship between x and y is not demonstrated by either of these equations. All these two equations are doing is indicating that there is a connection between the value of x and the value of y, regardless of whether you put x on the left hand side or the right hand side, and regardless of whether you plot x on the x axis or the y axis. This is a pretty standard question I ask whenever we interview students who approach us for graduate positions, just to see how they cope with visualizing data: given a plot of a y versus x data set, the student is asked whether they can invert the axes and figure out what the non-linear curve now looks like in an x versus y plot. You will quickly realize that for most phenomena you have been conditioned to think of a certain shape of curve by all the learning you have done in school and by all the textbooks you have seen. So a skill that most people seem to lose over the years is the ability to visualize data in any coordinate system they are given; you learn to appreciate plots according to the specific curves you have seen in a book.

All of this actually stems from the problem that equations have an equality sign in the middle. Mathematics, and algebra in particular, deals with working out relationships between variables, and when you talk about a relationship between y and x, this relationship is demonstrated by an equality sign in the middle of your equation. And of course, as I showed you on the previous slide, if I just interchange the terms on the left and right hand sides, the information contained in an equality relationship remains the same. So it is not clear why the y variable has to be on one side, or on one axis, the y axis; if you are going to demonstrate a relationship between two variables, there should be no dependence on axes. You will even appreciate that what we really want in science is, rather than the equality symbol, an arrow symbol in the middle of an equation to indicate which variable works as a cause and produces the effect you see on the other side of the equation. In other words, you would like to have a y → x kind of equation which tells you that y causes x, and similarly an x → y equation which tells you that x causes y, and of course, because of the direction of the arrow, the two equations cannot be the same. So where the equality equation did not depend on where you wrote the terms on the left hand side or the right hand side, if I were to have an arrow in my equation, that would significantly influence its interpretation. And our problem is that in algebra we do not have the ability to manipulate an arrow symbol in an equation. So whatever inference we have to make about what is causing what, for example whether smoking causes cancer, we have to make using equality relationships that show relationships between a pair of variables.
And curiously enough, from there what we really wanted to do was to imply a causative relationship, meaning we wanted to prove that a variable x caused a variable y and not the other way around. In other words: how do we use equality statements and yet make comments about causality? That is basically what I will come back to tomorrow, but today I am going to focus on the equality statement itself and try to work out different ways in which you can denote relationships between pairs of variables.

The only thing we have in mathematics which helps us in dealing with relationships, particularly with respect to time, is a conditionality statement, and that is something which comes to us from conditional probability. So let us go to Bayes' theorem, which is a theorem in conditional probability. This theorem deals with asking what happens if events x and y both happen, and whether the probability of x taking on a particular value given y is related to the probability of y taking on a particular value given x. Bayes' theorem starts with the observation that in general these are different. How do I read the equation on the slide? It is the probability that the variable x (lowercase x) takes on the value X (uppercase X), given that y takes on the value Y. That probability cannot simply be inverted into the probability of y taking on a particular value given that x has taken on a value. Notice the use of the conditionality bar, that vertical line: on the left hand side you are looking at the probability of x given y, and on the right hand side the probability of y given x, and in general the two probabilities are not the same. But that bar is practically the only notation we have which says that something comes, or happens, before something else. When we say the probability of x given y, that implies that we already know the value of y somehow, and we are then asking what is the probability of x taking on a particular value given that y has taken on a value. So there is a hint of an arrow there, but it is actually this conditionality bar, this vertical bar, that we have to take advantage of in evaluating relationships between variables.

Now, Bayes' theorem really has to do with interchanging x and y. If you look at the formal form of the theorem, it says that the probability of both events x and y happening, the probability of x intersection y, can be written as the probability of y intersection x, and this in turn can be written as the probability of x happening first, P(x), followed by the probability of y given x, that is, the probability of y happening given that x has happened. So events x and y both happen if x happens first and then y follows, or if y happens first and then x follows. This equivalence between P(y | x) P(x) and P(x | y) P(y) sets up the possibility of a test of a relationship between a pair of variables: variables x and y are not related to each other, or more precisely are independent of each other, if it turns out that the probability of x and y is simply the product of the probability of x and the probability of y.
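Putting the statements on the slide into symbols:

```latex
P(x \cap y) \;=\; P(y \mid x)\,P(x) \;=\; P(x \mid y)\,P(y),
\qquad\text{and independence: } P(x \cap y) = P(x)\,P(y) \iff P(y \mid x) = P(y).
```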
In other words, the probability of y given a particular value of x does not seem to depend on x at all; y does not depend on x, so the probability of y given a particular value of x is simply the probability of y, and the condition is irrelevant. In that case we have the ability to claim independence of the variables. Of course, in probability theory a lot of emphasis is on independence of variables, but actually when you do your research and we look at your data, you are not worried about independence; in fact you want the opposite, you are trying to prove dependence, because you want to show that y depends on x or vice versa.

So independence implies that the probability of x and y is the product of the two probabilities, and it can be shown that this in turn implies that the expected value of x times y, the average value of the product xy, is simply the average value of x, the expectation of x, times the average value of y, the expectation of y. That follows from the probability statement above. So if I look at two variables x and y and I have for some reason the belief that they are independent, the lower equation suggests a procedure: keep recording the value of x times y and work out the average value of xy; separately work out the average value of x and the average value of y; then ask whether the first quantity turns out to be equal to the product of the average values of x and y. If the two sets of values match, then my variables are independent. But again, remember that what we are actually trying to do in building a model is not to establish independence; we are more focused on proving dependence. In other words, we are interested in asking what is the variation in x and the variation in y, and more precisely, as we build a model, we get interested in questions like: if there is a variation in x, if I cause a change in x, is that going to result in a change in y, and vice versa.

To talk of variation in x and y, you first need to go back to the definition of a variance. I am going to use the notation μx and μy, where μ is the population mean. So the population mean value of x is μx, the average of all the measurements of x that I have, and the average of all the y measurements is μy. The expected value of x is μx and the expected value of y is μy, in which case the variance of the random variable x is a measure of how far individual measurements deviate from μx: it is the expected value of (x − μx)², and it turns out this can be simplified by algebra to the expected value of x², the average value of x² over all the values x can take, minus the square of the mean, μx². This describes the variation of the random variable x, but we have two variables in the analysis, x and y, so if I can describe the variation of x I can also describe the variation of y: the variance of y is the expected value of (y − μy)², and once again this can be simplified to the average value of y² minus the square of the mean of y.
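Here is a minimal sketch of that procedure in code; the distributions and their parameters are illustrative choices, not from the lecture, and sample averages only approximate expectations, so we look for approximate equality:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Two independently generated variables: E[xy] should match E[x]*E[y],
# up to sampling noise.
x = rng.normal(2.0, 1.0, n)
y = rng.normal(-1.0, 3.0, n)

print(np.mean(x * y))            # approximates E[xy]
print(np.mean(x) * np.mean(y))   # approximates E[x] * E[y]; close to the above

# Variance computed both ways: E[(x - mu_x)^2] and E[x^2] - mu_x^2 agree.
mu_x = np.mean(x)
print(np.mean((x - mu_x) ** 2))
print(np.mean(x ** 2) - mu_x ** 2)
```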
So, these are standard formulae in your statistics books; ultimately they help you describe the variation in your collection of measurements. You will see these equations in different forms, probably as summations, but here I have deliberately used the expectation notation to keep the link back to probability theory. What we are really interested in is not the extent to which x changes on its own, or y on its own. When you are talking about relationships between x and y, you have to focus on the question: if I were to change x, for example deliberately in some experiment, what is then going to happen to y? This measure of a relationship between a change in x and the corresponding change in y is called a covariance; we are interested in how x and y covary.

So the covariance of x and y is defined by looking, for each measurement of x that we have, at how far it deviates from the average value of x, μx; you separately take the values of y that you have and ask how far y deviates from the average value of y; and now you ask, for each (x, y) pair of data, whether the deviation of x from its mean was accompanied, at the same time, by a corresponding deviation of y from its mean. So the covariance is a reflection of a deliberate change in either x or y causing a resultant change in the other. You can see that when we are interested in building models, we are actually testing to see whether x and y covary. Fundamentally, where this goes is: if I am ever interested in building a straight line model, in showing that x and y are related by a straight line, that implies that they must covary; as you change x, y must change, which means the covariance of x and y becomes for us an important measure of a relationship between x and y.

Now, it turns out, once again as on the previous slide, that I can simplify the term inside the square brackets on the right, and this becomes the expected or average value of the product xy minus the average value of x times the average value of y. So the covariance of x and y can be written as the expected value of xy minus the product of the average value of x and the average value of y. Now go back to what we said about independence: if variables x and y are independent, then the average value of xy is simply the average value of x times the average value of y, and you immediately notice that this affects what we have on the right hand side of the covariance expression. So if x and y are independent, it turns out that they are also uncorrelated: there is no covariation between them, and that follows from the relationship on the previous slide. So again here we have a measure of independence, or lack of correlation, but remember that when we are looking at x and y we are actually investigating whether there is indeed a correlation. And if I connect back to yesterday's lecture on hypothesis tests, you will recall that we said we really want to prove a null hypothesis wrong, or false. So here is how you connect building a straight line to a hypothesis test: if I want to prove that x and y are related, I actually set up a null hypothesis that x and y are not related, and then I try to prove that statement wrong.
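Collecting the covariance identity just described and the null hypothesis it leads into:

```latex
\operatorname{Cov}(x, y) \;=\; E\big[(x - \mu_x)(y - \mu_y)\big] \;=\; E[xy] - \mu_x\,\mu_y,
\qquad
H_0:\ \operatorname{Cov}(x, y) = 0.
```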
So, if x and y are not related, what should happen? If x and y are not related, then the covariance between x and y should be equal to 0, and this is convenient because, if you remember, we said that a null hypothesis must be written as an equality statement, so that we can go and test it in a very focused manner and then try to shoot it down. So the way to prove that x and y are related is to prove that the covariance between them is non-zero, and the way we set that up as a hypothesis test is to say that the null hypothesis is that the covariance between x and y is 0; if that turns out to be proven false, we have indirectly shown that x and y are actually related.

So here is now an interesting problem. On the previous slide I showed you that if two variables x and y are independent, then they are uncorrelated. But one can ask the question the other way around: if two variables x and y are uncorrelated, does that imply that they are also independent? What I am doing here, in this very simple sketch, is putting up four points: two points along the x axis and two points along the y axis. By inspection, the x coordinates of the four points average out to 0, because there are points on either side of the origin, by symmetry alone; similarly, the average of the y coordinates also turns out to be 0, again because of how the points are spaced symmetrically around the origin. Now, that plot also immediately tells you that the product xy is 0 for every point, because wherever the y coordinate is not 0 the x coordinate is 0, and wherever the x coordinate is not 0 the y coordinate is 0; one coordinate or the other always ends up as 0, so x times y must always be 0. And if xy is 0 for every point in my plot, then the average value of xy is also 0 for the plot.

So we now have a curious result. The covariance between x and y depends on the average value of x, the average value of y, and the average value of xy; we just said the average of xy is 0, and before that we said the average of x is 0 and the average of y is 0, because of how the points are symmetric around the origin. Therefore the covariance is 0, and a covariance of 0 tells us that the two variables x and y are not correlated. So, with x and y uncorrelated, I go back to the question: does that mean x and y are independent? Remember, previously I said that if x and y are independent then they are also uncorrelated; now we are asking the reverse question. My four points give me a case where x and y are uncorrelated, but does that also mean that x and y are independent? The answer, if you think about it, is no, because the moment x is not 0, y has to be 0, and vice versa. Because one coordinate has to be 0 whenever the other is not, there is a relationship between the two variables, a dependence between the two variables. So you actually have here a counterexample which lets you prove that if two variables are uncorrelated, that does not mean they are independent.
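Here is the four-point sketch worked out numerically, a minimal check that zero covariance does not rule out dependence (the unit coordinates are an illustrative choice; any symmetric placement works):

```python
import numpy as np

# Four points on the axes, symmetric about the origin.
x = np.array([1.0, -1.0, 0.0, 0.0])
y = np.array([0.0, 0.0, 1.0, -1.0])

print(np.mean(x), np.mean(y))  # both 0, by symmetry
print(np.mean(x * y))          # 0: one coordinate of every point is 0

cov = np.mean(x * y) - np.mean(x) * np.mean(y)
print(cov)                     # 0, so x and y are uncorrelated

# Yet the variables are dependent: whenever x != 0, y is forced to be 0,
# and vice versa, so knowing one constrains the other completely.
```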
So, this is actually important in research. If you fit a straight line and it turns out this line is flat, then x and y would, at first sight, seem not to be related: your linear model for the moment tells you that x and y are unrelated because your line is a horizontal, flat line. But the possibility exists that there is a dependence between x and y, and you need to go and study your data set a bit more closely to check for it. So in general, uncorrelated variables need not be independent; there could be some dependence. For the most part, wherever we see uncorrelated variables they do turn out to also be independent, but what I have shown you with this one example is that there can always be a counterexample, and in general there is a need for caution. So the summary of this point so far is: uncorrelated variables need not be independent, but the other way around, independent variables are always uncorrelated.

With that, I am going to jump into a discussion of the straight line model as we normally apply it. This is an activity which you carry out in practically every single undergraduate lab. The lab starts with some straight line model with some parameter in it; typically you are not interested in the intercept, you are interested in showing that the slope takes on a particular value, though of course that depends on the domain you are doing the experiment in. In a previous talk I talked about proving V = RI in electrical engineering, but take any phenomenon you are interested in where there is a linear relationship that you expect between x and y, and you are now interested in asking what the values of the parameters in your linear relationship are. Now, again I want to go back to something I said yesterday, and for that matter something Debupo said yesterday about statistics, which is that the scientific significance of what you are doing is not necessarily relevant to the statistician. When you are doing statistics it is not necessarily clear to us why you are looking, for example, at a straight line model rather than, let us say, a quadratic equation. So I am going to assume that you have a good reason to try to connect y and x using a straight line equation. We have, at this point, some reason to investigate whether x and y are related according to a linear equation, in which case the general way for us to write a linear model is y = α + βx + ε.

Let me try to explain the terms in this linear model. Remember again what we said about the acceleration due to gravity: F = mg is of the form y = α + βx, but when I told you about that example of a gravity measurement, we said that regardless of who does the experiment, in whichever corner of the world, the value of g has to be a fundamental constant. There is a population value for g, which is a true, universal, fundamental constant, but as you set about doing your experiments and trying to work out the value of g, it will turn out that you have experimental error, that your measurements are not precise, and that you therefore have, at best, an estimate of g.
What you do not have is a precise measurement of g, and of course, if you have an approximate estimate of g, what that forces us to do is to come up with an interval around the estimate within which we think the true value lies. So what I am showing you here in the first equation is what, ideally, I would have loved to obtain after my experimentation: I have come up with some theory which says y depends on x according to α + βx, and that would have been a precise combination of values α and β, true regardless of who did the experiment, where they did it, or when they did it. So α and β in the fundamental relationship should have been constants, model parameters to be precise, and if there is any error to be allowed in my model, that error is in the form of some deviation term. This ε term, the last term in my equation: if there is any error to be seen in my measurements, ideally I would like the ε term to describe it. We always see scatter whenever we replicate our measurements; if I try to measure the same thing again and again, I am unlikely to see the identical measurement every time. That variation cannot be explained using α and β, because α and β are the true global constants, population parameters which cannot change. If there is anything that can change, it is this ε term, and so any variation that I see must be crammed into the ε term. The ε term is therefore actually referring to possible error in our measurements.

So any time we look at a measurement of y at a particular value of x, I allow for the possibility that I will see a measurement which does not actually fall on the straight line. If you look at the curves drawn tilted sideways along the straight line, what I am acknowledging is that for a given value of x I ideally should have obtained a value of y falling on the line, but in reality I will see a range of values on either side of that point on the line. In other words, I will see some scatter in my measurements, and that scatter is probably well described by a probability distribution function; most of you will recognize that what I have actually drawn is a Gaussian distribution tilted on its side. That Gaussian distribution is basically saying that, relative to the value you should have expected, which is the point on the line, you may see values lower than that or higher than that. So for any value of x, you should ideally see the point on the line, but you may see points away from the line with a certain probability, and that implies we are looking at a Gaussian with a certain variance σ². The variance σ² in turn gets built into the error term in the top equation: ε is a measure of the variation, of the extent to which I can expect variation in my measurements as I build a straight line model.
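A small simulation of this picture may help; the values α = 1, β = 2, σ = 0.5 here are purely illustrative choices for the sketch:

```python
import numpy as np

rng = np.random.default_rng(1)

alpha, beta, sigma = 1.0, 2.0, 0.5        # illustrative "population" values
x = np.linspace(0.0, 10.0, 50)
epsilon = rng.normal(0.0, sigma, x.size)  # Gaussian scatter with variance sigma^2
y = alpha + beta * x + epsilon            # measurement: point on the line plus error

# Replicating the measurement many times at one fixed x scatters about the
# point on the line, and the average drifts back to alpha + beta * x:
y_rep = alpha + beta * 4.0 + rng.normal(0.0, sigma, 10_000)
print(y_rep.mean())                       # close to alpha + beta*4.0 = 9.0
```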
So, fundamentally, simple linear regression is about trying to obtain, once and for all, good estimates of α and β, given that our measurements will have variation; the extent of variation about a mean value of y for any given x is defined by the term σ², the variance of this distribution tilted on its side. Now, if I were to average all my measurements at a given value of x, I should end up back at that point on the line. So here is an interpretation of a point on the line: it is the expected value of y at a particular value of x, given that I measure y several times for that value of x. The expected value of y at a given value of x, on the left hand side, turns out to be the point on the line, α + βx. So it turns out that when you are fitting a straight line, what you are hoping for is a fit to this model: the moment we choose a particular value of x, we would like to know what value of y results, and that value should be the point on the line, α + βx.

Now, our problem is that, because we do not generate an infinite number of measurements and because the measurements are full of error, we end up with only estimates of α and β; we do not have the precise, perfect population values. We end up with a + bx where what we really wanted was α + βx, and now the question comes up: is a close enough to α, is b close enough to β? If gravity has the model F = mg, then F is the y, m is the x, and g is the equivalent of β; there is no α in that model. This implies that the perfect model describing gravity involves choosing α = 0 and β = g, where g is 9.8 meters per second squared. Now, if I were to collect y measurements, that is the force F, and x measurements, that is the mass m, and try to work out the relationship between force and mass, I may end up seeing that the value of a is not precisely 0 and the value of b is not precisely 9.8. But that is basically a reflection of the fact that there is error in my approach and in my limited collection of samples.

So the question immediately comes up: what do we do if our measurements have error, and how do we know that we have obtained good estimates of whatever it is we are trying to find, α and β? In particular, as I said, we are interested in the value of b, and the reason is that if it turns out that b is 0, what you are basically saying is that y is independent of x, and you are probably not doing an experiment to prove that y is independent of x; you are probably doing the experiment because you already believe that y and x are related. So b has to be interpreted as the sample estimate of the true underlying value β. If you do the force versus mass experiment at one center, you will get a value of g which you write down as the value b; that is that center's estimate of g, and if you go to other centers and do the experiment in slightly different ways, you will end up with slightly different estimates of g.
So, we end up in a situation where we know that the a and b we are going to get are possibly not the precise values α and β, in which case it becomes critical that we try to minimize the mismatch between the model we are trying to build and the observations we have seen. In simple linear regression this becomes an optimization problem, and the way it goes is: you take the deviation of each measured point from the line, dropping a vertical from each point onto the line; you square each of these verticals, each of these small line segments; you add them up; and you look for the line that gives the minimum sum of squared deviations of points from the line. This is what is called least squares, and most of you may be familiar with the fact that a linear model is built using a least squares approach. The idea is to find the line that passes best through the points, such that the deviations between points and line become as small as possible.

So if you take a particular data set, with obviously significant scatter showing up in it, then what you are looking for is the line which passes best through the data set. This line will give you a slope b, which is the best estimate we can come up with of the true slope β; we are not going to see the true slope β because we have not collected an infinite number of points. Now, if there is an error in our estimate of β, then we need to know how much error can exist, and that takes us back to asking: what is the interval around the line within which other lines might also exist which could also explain the data we are seeing? What we are basically saying is that the line we just built has some error in it; we cannot be absolutely confident that we have found the perfect line, because there could be other lines very close by, deviating slightly from the line we first found, in which case there is actually a band of possible lines which could explain the linear model we are trying to prove and develop. So we need an interval estimate.

For example, at a given value of x I need to know what the resultant value of y will be. If I just fit a straight line, the line would tell me that, but we have just said that we are not absolutely sure we got the perfect line fit. Therefore the y value I expect is what I would get according to the straight line I found, plus or minus something, and that plus or minus something is a reflection of the fact that we are not fully certain about the fit of the line; it becomes an uncertainty band about the straight line. That band tells us within what interval we might expect to find a particular measurement of y for a given value of x if we repeated the experiment of fitting a straight line to a new data set. On the left I have an interval estimate for the average value of y, in other words the point on the line itself is in doubt, and on the right, if I try to predict where my next measurement might show up for a given value of x, it turns out to have even more uncertainty.
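Here is a sketch of the least squares recipe on simulated gravity data, with g = 9.8 playing the role of β and a noise level chosen arbitrarily; the estimates use the standard closed-form least squares formulas b = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)² and a = ȳ − b x̄:

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated gravity experiment: F = m*g with measurement noise.
m = np.linspace(0.5, 5.0, 30)                 # masses (the x variable)
F = 9.8 * m + rng.normal(0.0, 0.3, m.size)    # noisy force measurements (the y variable)

# Least squares: minimize the sum of squared vertical deviations from the line.
m_bar, F_bar = m.mean(), F.mean()
b = np.sum((m - m_bar) * (F - F_bar)) / np.sum((m - m_bar) ** 2)  # slope estimate
a = F_bar - b * m_bar                                             # intercept estimate

print(a)  # close to 0, but not exactly 0
print(b)  # close to 9.8, but not exactly 9.8
```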
So, if I ask you where a single value might be, you will have a lot of uncertainty; if I ask you where an average value is, you will have less uncertainty, and that is a consequence of the fact that, as we said, any time you work with a sampled variable your uncertainty goes down as σ²/n. It turns out that this is what most of us do when we fit a straight line: most of us go to some software, either a spreadsheet or MATLAB, or increasingly Scilab, which you have been recommended to learn for fitting straight lines using the least squares approach. And it turns out that practically every implementation of a straight line fit has actually made several assumptions.
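For instance, SciPy's `linregress` is one such implementation, shown here only as an example of what such routines report; the data are illustrative:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
x = np.linspace(0.0, 10.0, 40)
y = 1.0 + 2.0 * x + rng.normal(0.0, 0.5, x.size)  # illustrative noisy line

fit = stats.linregress(x, y)
print(fit.slope, fit.stderr)                # b and the uncertainty in b
print(fit.intercept, fit.intercept_stderr)  # a and the uncertainty in a
# These standard errors shrink as more points are collected (the sigma^2/n
# effect above), but they are trustworthy only if the fit's underlying
# assumptions hold, which is where the next lecture picks up.
```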