So, let me quickly run through these assumptions, because if it turns out that in your experiment these assumptions do not hold, then you must be open to the possibility that your straight-line fit is not a good fit, in other words an incorrect fit to your data. So what kind of assumptions do we make? The assumptions that go into fitting a straight line by a least squares approach are, first, that your measurements are independent of each other. Now, invariably, particularly in chemistry for example, if you are measuring the concentration of a compound, you are probably putting that compound into some device, for example a spectrophotometer, and measuring the colour generated by the amount of compound in your sample. When you go from one sample to the next, could traces of the previous sample have influenced the measurement of the next sample? If some of your reagent has not been properly cleaned out and remains behind in the next measurement, then the measurements of sequential samples are not independent of each other, in which case you have a problem: one measurement now depends on another, and a bias can start creeping into your fit of the straight line. So it is critical for you to figure out how to make your measurements independent of each other, and not be affected by errors which creep in because of the precise order in which you carry out your measurements. Second, it is also assumed that your errors are normally distributed. When we looked at the distribution describing the scatter of data at a given value of x, we took a Gaussian distribution, turned it on its side, and tried to demonstrate that so much scatter could be expected, with the most probable value being a point on the line. But in several situations the errors may not be normally distributed. What you need to do, therefore, is collect your measurements, plot them on either side of the line you are proposing, and ask: is there a skew, in the sense that there are more measurements falling on one side at a given value of x than on the other? If there is a skew, you are probably not justified in using a least squares approach to fit a straight line (a small sketch of this check follows this paragraph). A third problem is that you have actually assumed that the variable x is measured without any error, and in fact all along we have been concerned only about error in y. Why is that? The reason is that whenever you fit a straight line, you are asking whether there is a relationship between y and x, given that x is being moved along a certain scale on the x axis. So what are the reasons that y changes as x changes? If you look at the plot, there are two reasons which can explain why the value of y changes relative to the value of x. One is that there is indeed a relationship between y and x. In other words, if I move x from the left-hand side to the right-hand side of a plot where the slope is positive, then the y value should increase. There is a genuine relationship: the value of the variable y increases as the value of x is increased.
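To make the skew check above concrete, here is a minimal sketch in Python (the workshop itself uses Scilab, but the idea carries over directly), on made-up data: fit a line, compute the residuals, and ask whether they fall evenly on either side of the line and look roughly normal.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 + 1.5 * x + rng.normal(0, 1.0, size=x.size)   # illustrative data only

b, a = np.polyfit(x, y, 1)             # slope b, intercept a
residuals = y - (a + b * x)

# Are the points balanced on either side of the line, and is there a skew?
print("above line:", np.sum(residuals > 0), "below line:", np.sum(residuals < 0))
print("skew:", stats.skew(residuals))
stat, p = stats.shapiro(residuals)     # a small p-value flags non-normal errors
print("Shapiro-Wilk p-value:", p)
```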
But the other reason the value of y changes is error in your measurement, and that is reflected by the random scatter of points around the line. So fundamentally, in regression, we are playing off two possibilities: the value of y changes because of error in your measurements, versus the value of y changing because it systematically depends on the value of x. What we really want to show is that the dependency on the value of x is much larger than any variation due to measurement error; in other words, that the slope is significantly large relative to the scatter of points around the line. Now, in all of this, the measurement of this deviation is done only along the y axis. The interpretation is that when we designed the experiment we chose the values of x carefully and then proceeded to make measurements of y, and now we are asking whether the value of y deviates systematically as a function of the change in x. We are not even asking whether there could have been an error in the implementation of a value of x in the first place. If the set point along the x axis is itself not precise, maybe because your instrumentation is not precise, maybe because there is a least count to your device which you cannot improve upon, then fundamentally there is an error along x on top of the error along y, in which case what you really should have been looking at is something like this: deviations along x and along y, and both of these deviations can influence the shape of the line. So if I want to find the equation of a line, the question comes up: should I only account for, and try to minimize, the vertical residuals, which reflect my concern that y is changing on me randomly? Or should I also account for the fact that x can change a little randomly, which is the horizontal component of any deviation of a point from the line? It turns out we are invariably guilty of not accounting for the fact that x can also vary randomly: we do not have precise measurements of x, and we worry only about whether y is precise. One simple way to appreciate this is to take a straight-line model like the one on the left between y and x, exchange the variables, and plot x on the y axis and y on the x axis. The question will come up: will you get the same straight line? The answer turns out to be no. The reason you do not get the same straight line when you interchange x and y according to the procedure on the left is that the procedure only ever looks at vertical residuals. In the first case, where you plotted y versus x, you minimized errors along the variable y; if you then interchange x and y, you are now minimizing errors along x and not along y. So fundamentally, there is a strong dependency in this approach to fitting a straight line on what you call the y variable and what you call the x variable. Whereas, as we said at the outset, any relationship between y and x can be rearranged about the equality sign, and so the equation of a line should not depend too much on what you plot on the y axis and what you plot on the x axis.
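One quick way to see the asymmetry between vertical and horizontal residuals is to regress y on x and then x on y on the same synthetic data; a minimal sketch, assuming nothing beyond NumPy:

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 40)
y = 1.0 + 0.8 * x + rng.normal(0, 1.5, size=x.size)

b_yx, a_yx = np.polyfit(x, y, 1)   # y regressed on x: vertical residuals
b_xy, a_xy = np.polyfit(y, x, 1)   # x regressed on y: horizontal residuals
r = np.corrcoef(x, y)[0, 1]

# The two slopes are not reciprocals of each other unless the fit is perfect;
# their product equals r squared, so the lines coincide only when r**2 == 1.
print(b_yx, 1.0 / b_xy)
print(b_yx * b_xy, r**2)
```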
So, in principle you need to ask: if I interchange my axes, will the equation of my line change significantly? If it does, then you have too much influence of error along x, which you can no longer ignore in your model building. Finally, there is another assumption that critically impacts your fit. We have assumed that for any value of x, when you look at the scatter of y about the corresponding point on the line, the amount of scatter is more or less the same at any cross-section of x. In other words, we have assumed that the error term in my original model has a constant distribution, and therefore a constant variance, associated with it. This, it turns out, is not always true. For example, if I give you a data set like this, clearly the error in the data is increasing as x increases. So I have to be very cautious about how I fit a line through this, because the points at the right end of the plot, which deviate further and further from a possible middle line, start to strongly influence the shape of the line, while the points on the left-hand side, which are tightly bunched around the line, hardly matter and hardly influence the line at all. So if your data spreads out as a function of x, then the amount of deviation is itself a function of x, which it should not have been, and this in turn means you probably need to be more systematic and go to more elaborate methods of fitting a line (a sketch of one such method follows this paragraph). And finally, there is the question of whether each of your measurements is relevant or not. Invariably, in our experimentation we are going to get some outliers. It is not too clear on this plot, but hopefully you can see some red points below the lines. Pay attention first to the line with the smaller slope: that line was generated using all the data points on the slide. But I can then systematically ask, of each and every point in my collection of data points, does it deserve to be present in my data set? For example, is it an outlier which deviates far too much from the other points? I can systematically test for the existence of outliers, or, as they are called, influential points: influential because these outliers end up strongly influencing the equation of the line. So we ask systematically of each point whether it deserves to be present or not, and there is an approach which allows you to filter out such points as points which do not conform to your linear model. If I throw out the points towards the bottom right of this plot, I end up with the line at the top, which, by inspection, is probably a more reasonable fit to the data than the first line I showed you. So you cannot simply take a collection of data points into a spreadsheet or some software, fit a straight line, and act as if that is the end of the whole exercise of fitting a model. There is a lot of exploration to be done: you have to ask whether each data point deserves to belong in the analysis in the first place, and you have to ask whether the final model is, at the end of it all, not just statistically significant but scientifically significant, and therefore, in this case by inspection alone, a relevant fit.
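As a sketch of one such more elaborate method for data whose scatter grows with x: NumPy's polyfit accepts weights, so noisy points can be down-weighted, for example with weights 1/sigma when the error at each point is roughly sigma. The data and the error model below are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(1, 10, 60)
sigma = 0.2 * x                                 # scatter grows with x
y = 3.0 + 0.5 * x + rng.normal(0, sigma)

b_ols, a_ols = np.polyfit(x, y, 1)              # ordinary least squares
b_wls, a_wls = np.polyfit(x, y, 1, w=1/sigma)   # weighted least squares
print("OLS :", a_ols, b_ols)
print("WLS :", a_wls, b_wls)                    # typically closer to 3.0 + 0.5 x
```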
We have talked, therefore, about fitting a straight line as a way of proving that y and x are related. Remember, you have looked at independence between variables and covariation between variables, and now the straight line tells us that the variables y and x are related. How does it tell us that? If the slope is not 0, then I know that the variables x and y are related. So here is what we typically do as a systematic way of proving a relationship between x and y. It turns out there is a standard measure that people report when looking at a relationship between x and y, and that is the correlation coefficient. If you look at the numerator of the correlation coefficient, it is basically the covariance that we defined earlier, and the denominator, inside the square root, contains the variance of x and the variance of y, which is what those two squared summation terms imply. Normally we work with r squared, and r squared will therefore be a value between 0 and 1. If r squared is close to 1, that implies the values of x are co-varying strongly as y is changed, or vice versa. On the other hand, if x and y are independent, then as y is changed x may not change, or vice versa, in which case the numerator turns out to be close to 0, and therefore r is close to 0 and r squared is close to 0. So the correlation coefficient r squared is a good indicator of the relationship between x and y, and it is something to be routinely reported in any straight-line modelling activity. Now, there is another reason why you must pay attention to the r squared, which becomes obvious if I go back to the previous slide. Here, to show you a relationship between x and y, I have fit a straight line, and you are seeing a line which seems, at least to your eye, to indicate a relationship between y and x. But now think about this: if I take my y axis and multiply every single measurement along it by a factor of 100, and then plot 100 times y versus x on a suitably rescaled axis, the data can be made to look like a practically flat, nearly horizontal line, without much apparent dependency on x. So clearly the appearance of a straight line is a strong function of the units you use to plot along x and along y, and the problem arises that in research it is possible for somebody to change the units in which the data is plotted and give the appearance of a relationship between variables, or, for that matter, the other way around: to claim that there is no relationship between x and y by showcasing data with a practically flat line passing through it. So the magnitudes of your x and y scales have a strong influence on the equation of the line and on the appearance of your plot, and therefore on the apparent extent to which y and x are related, and ideally that should not interfere with an investigation of whether x and y are truly related. The scaling should not influence your analysis. And it turns out that the correlation coefficient, because it is a fraction, and if you look carefully at it, is a dimensionless constant: it does not depend on the units of x and it does not depend on the units of y.
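You can verify the unit-invariance claim directly: rescaling y changes the slope a hundred-fold but leaves r untouched. A minimal check on simulated data:

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(0, 5, 30)
y = 1.0 + 2.0 * x + rng.normal(0, 0.5, size=x.size)

r = np.corrcoef(x, y)[0, 1]
r_scaled = np.corrcoef(x, 100 * y)[0, 1]   # multiply every y by 100
b = np.polyfit(x, y, 1)[0]
b_scaled = np.polyfit(x, 100 * y, 1)[0]

print(r, r_scaled)    # identical: r is dimensionless
print(b, b_scaled)    # the slope scales with the units
```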
So r is a better measure of a relationship between x and y, because you cannot fool r by changing your units. The problem with r, however, has two aspects to it. One problem is that the moment you talk of x and y being related according to a calculated value of r squared, it only tells you that there is a relationship between x and y; it does not tell you what the precise relationship is. For the precise relationship you need the straight line on the previous plot, that equation a + bx, for example to help you predict what will happen at a new value of x. The other problem with the correlation coefficient is that occasionally it can be misleading, and I am showing you here what is actually a famous collection of data sets: four subplots, four different data sets, which all have the same large correlation coefficient. Remember, we just said that a correlation coefficient close to 1 is an indicator of a relationship between x and y, and here is a collection of data sets where all four plots have the same magnitude of correlation coefficient. At first sight they all seem to claim a strong relationship between x and y, and yet by inspection you can see there are problems with at least three of them. If you look at the top-right plot, the fundamental problem is that a straight line is being fit to data which is not linear. Remember, the correlation coefficient is a measure of a linear relationship between x and y. There is some relationship between x and y, all right, but in the top-right plot that relationship is more likely to be quadratic than linear. In other words, it is an application of a straight-line model where you should not have been fitting a straight-line model, and it can therefore give you the misleading result that the straight line fit through this data is a good fit. On the bottom left, the problem is that there is one bad measurement, an outlier, towards the top of the plot, this point here. This one bad measurement ends up strongly influencing the shape of the line, and it turns out that if the researchers had taken care to pull this one point out, they would get a much better fit of a line through the remaining points. So you have an inferior fit as a consequence of not dealing with outliers. In the bottom-right plot, you have a case where the experimenters have not systematically varied the value of x across its entire range. Only two values of x have been chosen, and, I am not sure you can make it out on your slide, there is only one measurement at the top-right corner here, while several replicates have been done at a single value of x towards the left of this subplot. This is an example of very bad experiment design, because fundamentally you have no information at the extreme end here or in the middle, and, what is worse, the equation of the line is very strongly defined by just this one point: you can see that if I move this one point up and down, the equation of the line will change tremendously.
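The famous collection being described is, in all likelihood, Anscombe's quartet; using its standard published values, you can confirm that all four data sets share essentially the same r squared and nearly the same fitted line, even though only one of them is an honest linear relationship:

```python
import numpy as np

x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I":   (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II":  (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV":  ([8]*7 + [19] + [8]*3,
            [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}
for name, (x, y) in quartet.items():
    r2 = np.corrcoef(x, y)[0, 1] ** 2
    b, a = np.polyfit(x, y, 1)
    print(f"set {name}: r^2 = {r2:.2f}, line y = {a:.2f} + {b:.2f} x")
# All four report r^2 of about 0.67 and roughly the same line y = 3.00 + 0.50 x.
```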
So, this is an example where a straight line is being claimed, and technically the correlation coefficient seems good, a value close to one, but the fact remains that it is an improperly designed regression experiment, and therefore the line, and the equation of the line, are not to be trusted. Of these four cases, the first subplot is probably the best application of a straight line to a data set, because you can see by inspection alone that there is genuine scatter of the measurements about the line, that the x value has been systematically changed along the entire scale of x, and that consequently a corresponding collection of y values has been obtained along the y axis. So it turns out, from what we have discussed so far, that we now have several ways to say that x and y are related. If you go back to what we said at the beginning, we really want to talk about dependence, and x and y are dependent if the expected value of the product xy is not equal to the expected value of x times the expected value of y. That is one way to show that x and y are dependent on each other. Now, this is not something we routinely use when implementing a model-testing procedure; instead, we are more likely to evaluate the covariance, and if you want x and y to be related, then the covariance between x and y, the average value of the deviation of x from its mean times the deviation of y from its mean, should end up not being equal to 0. The next approach was the correlation coefficient, that value r squared, and the correlation coefficient, we said, should not be equal to 0: if it is 0, that is the same as the covariance being 0, which in turn implies that there is no relationship between x and y. And finally, the other way to say that x and y are related is to say that the slope of the relationship is not 0. Now, each of these has its pros and cons when you are trying to prove that x and y are related. The first statement of dependence, in terms of expectations of x and y, is invariably a little harder to work with and is not a graphical approach in the first place. The second and third approaches are actually related, because the correlation coefficient, if you want to think of it this way, is a non-dimensional version of the covariance, scaled by the variances along each axis. And the slope being non-zero, as I just pointed out, is something which can easily be fooled by changing the scaling of your axes: by cramming your axes or stretching them out, you can give the appearance of there being a relationship, or of there not being one. So the preferred way of proving that x and y are related is to compute a correlation coefficient. But I repeat something I said a little earlier: it is not enough to simply say that there is a relationship between x and y. Invariably, we want to know what the precise relationship between x and y is, which means asking what the linear model is, which takes us back to asking what the slope and the intercept are, in other words, what is the regression model a + bx which relates y and x. So each of these measures has its uses. Now I will very quickly try to give you a hint of what might happen for a non-linear problem. Up till now we have been talking about fitting straight-line models of the form y = a + bx.
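Here is how the four measures just listed line up on one synthetic data set; each print statement corresponds to one criterion:

```python
import numpy as np

rng = np.random.default_rng(4)
x = np.linspace(0, 10, 200)
y = 2.0 + 0.7 * x + rng.normal(0, 1.0, size=x.size)

# 1. Dependence: E[xy] differs from E[x] * E[y]
print(np.mean(x * y), np.mean(x) * np.mean(y))
# 2. Covariance: mean deviation product, nonzero when related
print(np.mean((x - x.mean()) * (y - y.mean())))
# 3. Correlation coefficient squared: dimensionless, nonzero when related
print(np.corrcoef(x, y)[0, 1] ** 2)
# 4. Slope of the fitted line: nonzero, but sensitive to the units used
print(np.polyfit(x, y, 1)[0])
```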
But you are also very likely, in your research, to encounter non-linear equations, and then what are you going to do when trying to fit a model to your data? In this case I take something out of biochemistry, but it also applies in several other domains. This is an equation which tells you that the variable v is related to the variable c in a non-linear form, and it is clearly non-linear because the variable c occurs in both the numerator and the denominator on the right-hand side. The way I have written this equation, v takes on a maximal value v max, and the remaining term on the right-hand side, c over (K plus c), is a fraction; K we assume to be positive, in which case c over (K plus c) is a fraction between 0 and 1. When c is very large, c over (K plus c) is practically equal to c over c, which is 1, so the fraction on the right-hand side ranges between 0 and 1, and v max acts as a scaling factor that scales that fraction up to the maximum value. Now, if you are interested in finding the values of v max and K, which are the model parameters, what most books will recommend is that you transform your equation so that you are plotting 1 over v versus 1 over c, and the immediate appeal of doing this is that 1 over v versus 1 over c seems to follow a straight line, and you have just studied how to fit a straight line. So a non-linear model, if it can be linearized, seems to offer the immediate advantage that you can then take advantage of simple linear regression approaches. Plotting 1 over v versus 1 over c gives you an intercept of 1 over v max, which gives you one of the constants of your model, and a slope of K over v max, and because you already know v max from the intercept, you can back-calculate the value of K from the slope. That is the recommended procedure in most books (a sketch of it follows this paragraph). So let us look at this from a graphical perspective. The suggestion is: plot 1 over v versus 1 over c if you have a non-linear equation of this form, and my argument will ultimately hold true for most such transformations. Here is a data set which tells you that you actually have to be very careful when you do such things. The first subplot is a plot of v versus c: on the y axis it is v, on the x axis it is c, and you can see this is the kind of curve which saturates, which levels off at a maximum. We had that v max value, and what we are basically saying is that v increases roughly linearly from the left, then seems to slow down, saturate, and level off at v max. This data, if you then transform it as 1 over v versus 1 over c, now looks like this. We are plotting 1 over v versus 1 over c, in which case the question comes up: can we fit a straight line? Of course, we have figured out how to fit a simple straight line, and this straight line will seem to pass through the points. And here is the curious thing: if I have fit the straight line, that implies I have managed to back-calculate the values of v max and K, and if I have values for v max and K, I should be able to go back and ask what the equation looks like. So on the right-hand side I plot back this dotted line in red, the lower of the two curves, which is the curve predicted according to the linearized equation.
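A minimal sketch of the textbook recipe just described; the variable names v, c, Vmax, K follow the lecture, and the data is simulated for illustration:

```python
import numpy as np

rng = np.random.default_rng(5)
c = np.array([0.2, 0.5, 1.0, 2.0, 4.0, 8.0, 16.0])
v = 10.0 * c / (2.0 + c) * (1 + rng.normal(0, 0.05, size=c.size))  # true Vmax=10, K=2

# Linearize: 1/v = (K/Vmax) * (1/c) + 1/Vmax, then fit a straight line.
slope, intercept = np.polyfit(1.0 / c, 1.0 / v, 1)
Vmax = 1.0 / intercept          # intercept gives 1/Vmax
K = slope * Vmax                # slope is K/Vmax
print("Vmax ~", Vmax, "  K ~", K)
```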
In other words, when I linearized on the left-hand side, found my model constants, and then went back and asked what the model curve would look like, it turned out to be this lower red curve, whereas by sheer inspection alone the blue curve on top is a superior fit to my data set. So this suggests that the approach of inverting a non-linear equation, in other words transforming it, finding the model constants or model parameters, and then plugging the parameters back into the original equation, actually introduces some type of error. So where does the error come in with this kind of approach? It turns out that in this kind of process, where you are looking at v versus c, the error in data collection is largest at low values of c. This is a practical observation from this particular data set: the error in the measurements was maximal at low values of c, and so it sits at the lower end of the top subplot on the left. But the moment we inverted the variables and plotted 1 over v versus 1 over c, low values of c became high values of 1 over c, and so we now have a large error at high values of 1 over c: my error is large at this end of my straight line. Any error down at the low-c end of the curve becomes, after inverting c, a large error at this end of the straight line, and you can quickly see that I have only two points at the right end of my straight line, and these two points end up strongly influencing the slope of my line. So the practical problem is that the slope of my line is inaccurate because two data points strongly influence the slope, and these two data points happen to be the very same points where I had the maximum error in the collection of the data in the first place. The net result is that the model I predict from the straight-line calculation turns out to be a bad model compared with simple, straightforward non-linear regression. So the moral of this is: do not simply transform variables and then apply a straight-line model just because the straight-line model is more convenient for you. Use instead non-linear optimization and non-linear regression tools; there are plenty of these available, and so one of the things going ahead is that you will have to figure out how to use non-linear optimization in Scilab to carry out regression not just of straight lines, but also of non-linear models, so that you avoid the errors which creep in as you transform your variables. To summarize, then: when it comes to showing relationships between x and y, what I recommend is that you compute a correlation coefficient r squared and show that x and y are related through a large value of r squared. But I have given you one counter-example, involving four subplots, where the r squared is large and yet there are issues with the way the data is collected and presented, so you still have the duty of inspecting your data to see whether the fit is indeed a good fit. And you need a model which gives you the ability to predict y for future values of x.
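And here is the recommended alternative, direct non-linear regression on the untransformed data. The lecture points to Scilab's optimization tools; this sketch shows the same idea with SciPy's curve_fit on the same simulated data as above:

```python
import numpy as np
from scipy.optimize import curve_fit

def mm(c, Vmax, K):
    # saturating model from the lecture: v = Vmax * c / (K + c)
    return Vmax * c / (K + c)

rng = np.random.default_rng(5)
c = np.array([0.2, 0.5, 1.0, 2.0, 4.0, 8.0, 16.0])
v = mm(c, 10.0, 2.0) * (1 + rng.normal(0, 0.05, size=c.size))

# Fit v versus c directly; no transformation, so no error amplification.
params, cov = curve_fit(mm, c, v, p0=[v.max(), 1.0])
print("Vmax ~", params[0], "  K ~", params[1])
```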
So, therefore, you have got to go ahead and fit a straight-line model and find the parameters of your model a + bx; that is normally done by simple regression approaches such as least squares, and Scilab can do that for you. So find the straight-line model, and then be careful if you are interested in fitting non-linear models to your data: especially, avoid transforming your models unnecessarily; instead, use the powerful non-linear regression tools available to you, and use them directly on your data, without any transformation and introduction of further error. So, at this point, what I have discussed today is how to show relationships between y and x. What I am going to do now is respond to some of your feedback from the chat, and tomorrow I will continue this discussion of relationships between variables: I will start talking about how to prove that x causes y as opposed to y causing x, and what aspects of causality influence a relationship between x and y. So, with that, let me take a look at some of the comments from you folks. There is a question from E Road. Sir, we have a question. We have a situation where we have a highly non-linear system like a power plant. We are measuring operational parameters of a boiler, typically 5 to 10 important variables, and when we are measuring, we are not in a position to identify the noise from the system. Is there any way to find out the noise in measured data by scientific methods? So the question basically is that, in a process where he is monitoring several variables, he has a collection of measurements which involve a lot of noise, and the query is about how to identify and filter out noise. It basically boils down to what I had discussed as outliers. I am going to assume that you understand the process that you are modelling, and for which you are collecting data, well enough to know what is a true trend and what is noise. If not, you have got to be cautious: what you are calling noise may actually be very informative measurements in the first place. So now, how do I identify a point as being noise or, for that matter, an outlier? I see a number of comments in the chat about what an outlier is and how to identify one. Fundamentally there are two approaches. One is to ask, after fitting a line through all the measurements that you have: are some points turning out to be at least 2 to 3 standard deviations away from the line? A measure of a point being far away, an outlier, is actually to look at the distribution that I showed you initially: I drew a straight line and showed you a Gaussian distribution on its side; that Gaussian distribution had a variance, and that variance is a measure of how much scatter can be expected at a given value of x. If the deviation of an individual point is too large, say more than 2 or 3 times the standard deviation, and that tolerance is something you need to define for your own process, then you formally call it an outlier. The other way is to ask: if that point were to be removed from the data set, would the slope of my line change significantly?
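Both checks can be sketched in a few lines; the 2-sigma cut-off and the slope-change threshold below are illustrative choices, not universal rules, and the data is again made up:

```python
import numpy as np

rng = np.random.default_rng(6)
x = np.linspace(0, 10, 50)
y = 1.0 + 0.5 * x + rng.normal(0, 0.3, size=x.size)
y[-1] += 3.0                                  # plant one bad measurement

# Check 1: flag points more than 2 standard deviations from the fitted line.
b, a = np.polyfit(x, y, 1)
res = y - (a + b * x)
print("2-sigma flags:", np.where(np.abs(res) > 2 * np.std(res))[0])

# Check 2: flag influential points by refitting with each point left out
# and watching how much the slope moves.
for i in range(x.size):
    b_i = np.polyfit(np.delete(x, i), np.delete(y, i), 1)[0]
    if abs(b_i - b) > 0.02:                   # threshold is process-specific
        print(f"point {i}: slope {b:.3f} -> {b_i:.3f} when removed")
```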
If it is a point which changes the slope of your line significantly, then at the very least it is an influential point, what I called an influential point: a point whose presence strongly influences the model that you are about to build. It is for you then to analyze and interpret that point, either as an unusual, and therefore unreal or wrong, measurement of the process, or possibly as an indicator of a problem developing in your process, as a consequence of which your measurements are starting to move away from the trend. You seem to be talking in the context of an online process where you are collecting measurements, and possibly some of these measurements are deviating from what is expected. So, fundamentally, an outlier is something a couple of standard deviations away from the straight line, but it could also end up being an influential point, in which case it needs more systematic investigation, usually of the form of eliminating that point from the data set and asking whether the model fit changes as a consequence of that point going missing. So, over to you. Sir, thank you very much. K. K. Wagnashik: We want to know the importance of the error plot, the interaction plot and the normal probability plot, and the second bit is: what is the significance of the r square and adjusted r square values? Okay. So there are several things you are asking in one question. I will stick to the topics that I have actually discussed so far, because I am sure most of the other participants have not heard about interaction plots and so on. So I will stick to this discussion about r squared. The r squared, again, I will repeat, is a measure of the extent of correlation between two variables x and y. The assumption behind it is that the model you have developed, or are investigating, is a linear model, and for a linear model the question is to what extent a change along x influences and results in a change along y. The r squared is to be interpreted as a value reflecting the non-dimensional covariance, and the larger the r squared, the usual interpretation is that you have a good relationship between x and y. The danger with this is that, as I pointed out, on occasion you can have non-linear behaviour in your data which gives the impression of a high r squared but is actually not confirming the presence of a linear model at all. As to the use of interaction plots and the other descriptors of data that you asked about, we can take some of these up offline; these are concepts which have not been introduced to the overall audience at this workshop at this time. So, thank you for your question. M.B. Patil college: Sir, what is the difference between the correlation coefficient small r and the capital R that you have described in slide number 21? There is no difference. The question was whether there is a difference between the correlation coefficient, which I presume the user has seen in some book, and the definition of r squared which I provided in one of my slides; there is no difference. Patil college, you are connected, go ahead with your question. I want to ask how to represent time series data using transformations, such as network traffic. So the question is about how to represent time series data and then work out relationships, I presume, as a function of time.
It turns out there are several ways you can do this, but fundamentally the thing to be done with time series data is to work out whether there is a relationship between measurements as a function of time. We have not discussed the term autocorrelation. Basically, it involves trying to estimate the extent to which there is an autocorrelation within the collection of data. There are different ways to do this. One way is to pick a certain time offset and look to see whether there is a relationship between the data measured at pairs of time points separated by that offset; in other words, to work it out in a difference form, or in what is called a state-space form, as a relationship between a pair of variables. There are other approaches where you try to fit sinusoidal types of models, for example using transforms, to see whether a relationship as a function of time can be converted into some other domain, a Fourier series for example, and then you look to see whether a relationship shows up in that different domain. We have not had time to cover the basics of time series analysis here, but I can take this forward offline and give you a couple of pointers to reading material, and this is the sort of thing we can do in a more elaborate way in a subsequent workshop. Thank you for your question.
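As a pointer for the questioner, here is a minimal sketch of the offset idea just described: estimate the autocorrelation at a few lags on a simulated series that has built-in memory (the AR-style series and the lags are illustrative choices, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(8)
n = 500
ts = np.zeros(n)
for t in range(1, n):                  # series where each point remembers the last
    ts[t] = 0.8 * ts[t - 1] + rng.normal()

def autocorr(series, lag):
    # correlation between the series and itself offset by `lag` points
    return np.corrcoef(series[:-lag], series[lag:])[0, 1]

for lag in (1, 5, 20):
    print(f"lag {lag}: autocorrelation {autocorr(ts, lag):.3f}")
```

So, we now break for tea. Thank you.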