have been better to talk about clustering after regression, or before regression. Or maybe it doesn't make that much of a difference.

Even though the problem seems very well defined, when you do regression analysis you're not just trying to fit a curve through points. What you have to realize is that what you're really doing is working with an abstract, theoretical, idealized model. So it's actually statistical modeling. Even something as simple as a linear regression is statistical modeling: it tries to figure out how nature behaves in principle and then asks, how is your data a representation of that process? So there is some idea of an idealized process that underlies how these data were generated in the first place, and that is what you're really interested in: the idealized model, the idealized parameters of the model.

So once again, the question is, what is this scatterplot? It's got x values and y values. And what's interesting about it? What's obviously different here from what you would expect simply from a uniform random distribution of points in this region? What do you notice? Yes, there's a direction in it. The direction arises from a kind of correlation. Correlation means that if a point's coordinate on the x-axis is small, there's a rather high probability that the coordinate on the y-axis is also going to be small. And if the coordinate on the x-axis is large, again there's a rather high probability that the coordinate on the y-axis is also going to be large. So the two measurements we have here are apparently not completely independent; they're somewhat dependent on each other. If one comes out a certain way, the other will turn out in a somewhat predictable fashion, not independently.

Now, if the dependence were 100%, all the information in your data could be represented just by the points in one dimension, the x- or the y-axis; the other dimension could be perfectly calculated. In that case the points would all fall on a straight line. But that's not the case here. There's some variability. And the question is: given this variability, how do we go about fitting this?

Here are some examples I just came across. These are the impacts of SNPs, of so-called tall alleles. Apparently there are a number of alleles associated with tall body size. What the researchers did was take a set of alleles associated with differences in body size and look at how the number of tall alleles in that set of SNPs correlated with the approximate height difference between people who carry these alleles. And it turns out it's a very, very linear relationship. If you know the number of tall alleles, you can predict the height difference very well; if you know the height difference, you can figure out the number of tall alleles. Which, for alleles and SNP distributions, is a surprisingly tight relationship. Nature Genetics, May 2008.

Here's another example of correlation. People were looking at immune responses. They were asking whether they could make antibodies focused on particular epitopes of HRV1 by boosting the immune system first with the entire protein and then only with particular epitopes, and actually get neutralizing antibodies. The evidence that this immune-refocusing approach works is in this kind of plot: as you add more antibody, you get a higher percentage of neutralizing activity, which is an indication that the antibody actually neutralizes the virus.
And in this other case here, it doesn't.

So, once again, like this morning, I'd like you to think about this. Because when you do regression analysis of points in plots, and try to figure out whether something in there is significant or not significant, you have to have some idea of what to expect in the first place. So stick your heads together and define a scenario, as a thought experiment, where you have data in which you measure one thing and you expect that one parameter to influence another parameter. How many samples do you have? What's the domain, i.e. what are the possible values that the one parameter can take? What's the range, i.e. what is the possible set of values the other can take? What could uncertainty in the independent variable look like, i.e. if I use x as my independent variable and I can't measure it very accurately, what would that look like? And finally, of course, what is the question that we would like to answer from the data? I would like to have at least one thoughtful example of some kind of data where you have some dependence between two variables that you're trying to figure out through linear regression analysis. So think about what you would contribute: number of data points, variability. I have one example that we'll then go through, but basically take this as a motivation to be able to come up with your own test examples and simulate your own data. Okay? Five minutes.

All right. So, were you all able to come up with some interesting example? I hope so. You want to share yours? Come on.

So the problem that we wanted to set up is actually a really important one for all of us. It's about Louis Blanche tonight. The question that we wanted to ask is: what predicts the success of a particular exhibit? And the independent variable we thought might be interesting to look at is the distance of the exhibit from a public transport stop. So if the exhibit is a long way from a streetcar stop or from a train stop, maybe that is a disincentive for people to go and visit. And the thing we'd like to measure, the dependent variable, is the number of visits per exhibit.

Is that all right? Perfect example, yes. So you're trying to find out what the correlation is, how tightly these two quantities are associated.

So the range, we thought, is obviously the number of visits, zero to some very large number? We don't really know what the range is.

Well, no. The range is... so basically, what is the function defined for? It's defined from zero, if the exhibit happens to sit right at the subway stop, up to the distance at which the next subway stop would be closer. Did I just get this wrong? No, sorry, I've just described the domain. The domain is what the function is defined over. The range is the kind of numbers that you get from applying your function to those values.

The question I have, though, is whether the popularity of the exhibit really is proportional to the distance, whether the model holds. Exactly, and that would be an interesting question here, because this kind of linear analysis would allow you to see that. Do we have random variation overlaid on that, or is there something more to it? For instance, do we have outliers only for venues that are far away from subway stations, or only for ones that are close by, and so on?
So that's the kind of exploratory analysis that you could add on top of that. And if you do have a relationship, could you use it to predict something?

Well, my answer is, you know, even if you had a crappy exhibit, you might get a lot of visits if you were very, very close to public transport. So the team of curators would make sure that the exhibits they're not as excited about are at least placed close to the subway stations.

Okay, that's a great, great example. Note that we're not measuring the quality of the exhibit but the number of visitors, something that you can actually measure. Or, if you have a metric that defines the quality of an exhibit in a way other than the number of visitors, it would be interesting to see: if we have something that's very, very good, does that information spread through the population? Does the number of visitors rise during the night? Many different types of analysis are possible.

So, as I said, these regression analyses are types of modeling. You start by specifying some model. In the case we've just heard about, the model is that there's some relationship between proximity to a public transport station and the number of visitors that you can get, which seems a reasonable assumption to me. Then you estimate parameters for that model and you ask: does the model, given those parameters, describe my observed data well? Is it adequate? If yes, we can use the model, for instance to make predictions in the future, or to define outliers and look at those more closely. If no, we need to specify a different model.

Now, when statisticians speak about a model, it's a little different from how, for instance, computer scientists speak about a model. When computer scientists model something, they'd like to line up the numbers so that their algorithm is a representation of what actually goes on in nature. Molecular modeling is a good example of that: you have these virtual atoms that bounce around in your computer's memory and try to behave as if they were real molecules, and from their behavior you can make predictions. A statistician's model is actually merely a device to describe data. Linear regression is maybe one of the better examples where it is possible, but it's not obvious that a good model needs to have anything at all to do with physical, biological or chemical reality. It's not designed to be explanatory. It's simply designed to represent the mathematical reality behind the distribution of the data. So a statistician's model is not necessarily explanatory.

Linear regression assumes a very particular model. It says that the y-value of observation i is a linear function of the x-value: some parameter alpha, which defines the intercept, plus some parameter beta, which defines how shallow or how steep the slope is, times x, plus an error value epsilon that is in some way associated with each data point, the reason why your data don't actually all lie on one straight line. In this terminology we call x the independent variable, the one that you choose, and y the dependent variable, the one that you get after taking x, multiplying it by beta, adding alpha, and adding the error term. In different contexts these terms are also called differently, even though they all mean the same thing. Here's a little selection.
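Written out as a formula, the model just described is simply

$$ y_i = \alpha + \beta\, x_i + \varepsilon_i, \qquad i = 1, \dots, n $$

where alpha is the intercept, beta is the slope, and epsilon_i is the error term associated with observation i.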
The independent variable can also be called the predictor, the regressor, the control variable, the manipulated variable, the explanatory variable, the exposure variable, the input variable. It's all the same thing.

Now, these epsilons are errors, but not errors in the sense that they're wrong. If you measure your data precisely and well, they are as much a reflection of physical reality as the ideal value that you would expect under a linear model. So these epsilons don't have to be measurement errors, for instance. It can be that there are other factors contributing to your data points that are simply not in your model, and since they're not explained by the model, we pack them all into the error terms and say, well, we're not speaking about them; they're not in our linear model, so we call them errors. But there might actually be something interesting in these so-called errors. There might be interesting questions to ask about why the errors come out the way they do. They might be simply random, which probably means we have small differences in measurements. They might simply be Gaussian distributed, which would suggest they are the sum of many small random influences.

Actually, do you realize why the Gaussian distribution is so often observed for errors in measurements, for so-called random variation in measurements? Who here has ever heard of the central limit theorem? The central limit theorem. Now, that's something that you can really easily try out in R at some point. The central limit theorem states that if you start from a certain value and you add and subtract randomly chosen small increments of positive or negative sign, essentially a random walk starting from one point and going left, right, right, left, left, right and so on for a long time, and you then measure the point where you end up, the resulting probability distribution is going to be the normal distribution, centered on the starting point. So the normal distribution is the distribution you expect when a large number of small random variations influence your measurement. And since that is a situation that very often happens in real biological processes, you often get the normal distribution as a good approximation of the underlying reality. It all goes back to the central limit theorem.

Simulating that in R is really trivial. You just set up a loop: you define one point, and to that point you randomly add or subtract an increment, say 100 times. Then you put the result into a vector, repeat, plot the vector, and you'll see that it has a Gaussian distribution. Something you can write in a minute or two, and it's really fun to do. A nice homework exercise (there's a short sketch of it below).

Okay. So simple linear regression is useful when there are actually only two variables that you're interested in; if there are more than two variables, you need multiple regression. One variable is a response and one is a predictor; it doesn't matter which one is which, you can formulate it this way or that. And you don't need an adjustment for confounding or other between-subject variation, i.e. the samples are truly independent.

There are some assumptions. One assumption is that the linear model is actually right. That's a non-trivial assumption, because especially in noisy data you might have the bottom part of an exponential function which you don't realize is an exponential function, because it looks very much like a linear function to begin with.
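As a sketch of that homework exercise (the variable names and the choice of 100 steps and 10,000 repeated walks are my own, not from the course material):

    n.walks <- 10000                   # how many random walks to simulate
    n.steps <- 100                     # increments per walk, as suggested above
    endpoints <- numeric(n.walks)
    for (i in 1:n.walks) {
        steps <- sample(c(-1, 1), n.steps, replace = TRUE)   # random +/- increments
        endpoints[i] <- sum(steps)                           # where this walk ends up
    }
    hist(endpoints, breaks = 40)       # approximately Gaussian, centered on the start (0)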
And the data doesn't really tell you that it's one or the other; it could be a polynomial function of higher degree, or a superposition of an exponential plus a linear function, and so on. So sometimes it's not easy to show that something is truly linear, because the data is noisy. But linearity is the simplest, most economical explanation, and so, according to Occam's razor, the one that you should go with unless your data forces you to do otherwise, for instance because the residuals look systematically different. We'll get to that in a minute.

The variance should be independent of x. So the variance should not depend on whether your data values are large or small: not small variance for small data and large variance for large data. The error terms should be independent, and for proper statistical inference, especially p-values and prediction, the error values should actually be normally distributed. Because for prediction (again, we'll get to that in a moment), if you're trying to make an inference about how probable it is that a data point is an outlier or part of the model, you have to make an assumption about how the errors are distributed, and in standard cases the assumption is that they're normally distributed.

So there are really two parts to the analysis: the estimation of the parameters, which is easy to do, and the characterization of how good our model is anyway. And then there's a third part, which is using it for prediction.

Parameter estimation is simply the question: how do we choose the parameters, in this case the intercept and the slope parameter, in the best possible way? And that really depends on what "best possible" means. There are many possible interpretations of what best possible is. In the normal statistical sense, the criterion for the best possible choice is to minimize the sum of squared errors. But that's not the only criterion; in special circumstances you can use other criteria of quality of fit. We'll usually gloss over this here, but the sum of squared errors is a good, universally used metric that we can use to distinguish good from poor estimates.

So in the general sense, what does this mean? Simply that for a sample of points x1, y1, x2, y2, and so on, we take the observed response coordinate, the observed y-value, and we subtract from that the model applied to the x-value. So we take the x-value, we apply our model, that gives the predicted y-value, and we subtract the predicted from the observed y-value. That deviation, that difference, is a characterization of how well the model has predicted our observations. We take that difference, square it, and sum over all points, and that is the sum of squared errors, SSE, an abbreviation that's used a lot.

It's the same formulation as before, but written in the explicit terms of the linear model: the sum of (yi minus a minus b times xi) squared, as a function of the parameters a and b. And this is nice because this method of least squares has an analytic solution, a closed-form solution, for the linear case. What does that mean, an analytic solution or closed-form solution? It means you can derive a formula into which you can plug your data and immediately get the answer. So an analytical solution is one for which you can write down a formula and get the right answer.
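For reference, the quantity being minimized and its standard closed-form solution for the linear case (written in the same a, b notation) are

$$ \mathrm{SSE}(a, b) = \sum_{i} \left( y_i - a - b\,x_i \right)^2 $$

$$ \hat{b} = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}, \qquad \hat{a} = \bar{y} - \hat{b}\,\bar{x} $$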
The opposite of that is a numerical solution. With a numerical solution you have to optimize things numerically: come up with guesses, improve your guesses in an iterative fashion, and so on. So analytic solutions are fast and numeric solutions are slow, generally speaking. But this one has an analytic solution: you just plug in the data, you immediately get the parameters, you can draw a curve, and then you can define so-called residuals that describe how different our observed values are from the predicted values. I'll get to the residual part a little further down the line; I just want to introduce it here.

Okay, how does that work in R? Again, let's work with a synthetic data scenario. Of course, we could take any data set, apply linear modeling, linear regression analysis, to it and get a number, but that's not so interesting. What's more interesting is: if we come up with a synthetic model, can that kind of analysis actually recapture the parameters which we put in? Because, as I said before, if we can't do that, we can't expect to get meaningful results from real data either.

So I'm defining a function here which I call makedata, and I'll just go through the syntax. What I'm trying to do is to model the correlation between height and weight of humans with reasonable parameters, nothing too sophisticated, just in a linear way. I say there's going to be a vector x and a vector y, to which I assign an empty set simply to initialize them. So x and y are initialized empty, and I define a matrix.

For our height and weight model, first of all I take random samples of heights by generating numbers between 1.5 and 2.3. So the model here is that height in meters is between 1.5 meters and 2.3 meters, which is pretty tall at the upper end. Simply multiplying that by 40 and adding 1 to it, as a linear model, gives the prediction for weight. This is why I was talking about domain and range before: this is our domain, we're looking at human beings from 1.5 to 2.3 meters. And the range that I'm looking at here starts at a little over 60 kilos and comes out to about 92 kilos, maybe a little more than that. So the ratio could actually be a little steeper.

What's the 1 here, in the call? Do it once, exactly: give me just one sample, because I have it in a loop. I could also have generated them all beforehand and then just used this index here to make it a little more explicit, but I have this in a loop and I just want one single sample from the uniform distribution between these two boundaries. Don't you need to use a seed? I could use a seed, but I don't need to. Since I'm not using a seed, this will turn out slightly different every time I run it.

Once I have this, if I were to plot it, of course, everything would be on a perfect line. No point in analyzing that. But these are the numbers to remember, the parameters of the model: the intercept a is 1 and the slope b is 40. And then I add some uncertainty to this. In fact, I'm violating one of the assumptions of linear regression analysis, because here I'm saying my standard deviation should actually depend on the data: the standard deviation is not an absolute number, but 1% of the height. Something like: I can measure height to 1% accuracy.
So I'm going to make, you know, on the order of a 2 cm error for the really tall people and more like 1.5 cm for the short ones. And here, again, I'm violating the assumption, because I'm also saying my standard error for the weight is not going to be independent of the weight value. We'll see what influence this has on the data, if any at all. So my model is violating one of the assumptions of a pure linear regression analysis. And, you know, this is something that I might be interested in, because I have a good reason to believe that people who are taller have a much broader variation in body weight than people who are shorter. If you're only 1.5 meters tall, where can you go? You can be somewhere between 40 and 80 kilos, right? That's a difference of 40 kilograms. But if you're a 2-meter guy, it's no problem to weigh 200 kilos, and the really lanky ones might only bring 90; that's a 110-kilo difference if you're 2 meters tall. So that variance is larger. So this dependence here is biologically motivated, even though it violates my assumptions.

And that's kind of important. Many code examples of how to do things draw samples from the same function, or the same type of function, that they then use to analyze the data. I always think that's a bit of cheating, because if you use the same function to generate and to analyze your data, there's no question that it will turn out well, since the same assumptions hold. The interesting question is really: what happens if the assumptions are violated? How sensitive is my analysis going to be to these violations?

Okay, so I put that into a file, I call the file makedata.R, and I type the command source("makedata.R"), or I could just copy and paste it into the R console window; that's the same thing. Remember, what that file does is associate the name makedata with a function of n. n is a positive parameter, the loop variable here; I go through the loop and spit out that number of data points with these properties.

So here I have my data. I run makedata with the parameter 40, so I get 40 x and y coordinates, and I plot them here. That's what the data looks like. You can kind of see the relationship here, sort of something like that. The variance, yeah, you might be able to see that it's somewhat smaller here than there. The values start at around 70 kilos here and go up to about 110 kilos there. It's a simple model; it doesn't capture all of human variability, obviously.

Now, estimating the linear model: you see this call, lm, which says get me a linear model that supposes that my data column two... now, this is a tilde, not a minus. It's the squiggly character that on my Mac keyboard sits way over on the left. The tilde in R syntax means "data column two is modeled by data column one". So data column one plus a model generates data column two.

And if I do that, I get the following information. The first thing I get is this line: the formula, data column two modeled by data column one. So in this line R only summarizes what I told it to do, and I can check that what R did was actually what I thought it should be doing: that I passed it the right set of parameters, and that the computer didn't just do what I said, but did what I meant. And it gives me two coefficients: an intercept, which is the value of the linear model where x is zero, and a slope.
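A minimal sketch of what such a makedata function could look like, reconstructed from the description above; the noise constant for the weight (5% here) is my own guess, since only the roughly 1% height accuracy was stated:

    makedata <- function(n) {
        x <- c()                                   # heights in meters
        y <- c()                                   # weights in kilograms
        for (i in 1:n) {
            h <- runif(1, 1.5, 2.3)                # one random height between 1.5 m and 2.3 m
            w <- 1 + 40 * h                        # idealized linear model: intercept 1, slope 40
            x[i] <- h + rnorm(1, sd = 0.01 * h)    # height "measured" to about 1% accuracy
            y[i] <- w + rnorm(1, sd = 0.05 * w)    # weight noise that also scales with the value
        }
        cbind(x, y)                                # return a two-column matrix
    }

    data <- makedata(40)                           # 40 simulated (height, weight) pairs
    plot(data[, 1], data[, 2])                     # the scatterplot discussed above
    lm(data[, 2] ~ data[, 1])                      # column two modeled by column one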
And do you remember what the parameters were that we put into the model? Yes: 40 and 1. 40 and 1, right? And that's what I get back: 39.7 and 1.1. For noisy data like this, that's not too bad.

Why does it turn out so well? Well, because even though the data is noisy, to a very large degree the noise that we added was just randomly distributed. There was no trend in the noise, and the linearity of the model is rigorously correct: there's no quadratic term that we've added here, no exponential or hyperbolic function that is really underneath the data. This is a pure linear model, and if we analyze it in that way, we get decent estimates back for the parameters.

Now, we can take these two values, the intercept and slope, and plot a line, an abline, basically a line with intercept a and slope b, and overlay that curve on the plot. But these two values of intercept and slope are also the return values of the linear modeling function. So when the linear modeling function returns, it basically spits out these two values, which means we can put the lm expression directly inside abline, open parenthesis, run the linear model, and it will do the same thing (there's a short code sketch of this below). So you don't need to type out the parameters to get this line.

Now, what's the significance of the line? Well, the line is basically the line that would give us the right y-value, the ideal y-value, for every x-value that we give it. So this is the idealized model that we suppose is behind these data. And there are a number of statistical parameters to characterize that. I think Raffler mentioned a little bit about the coefficient of correlation and regression, and whether the parameters are significant or not. The standard error of the parameters, for instance, is interesting to look at. The slope has an estimated value of 39.7 and a standard error of 6.78. So the data says, you know, you could fit somewhat different slopes and the fit wouldn't be too bad; it's just that this value is the best one that we can have. But within slopes of, say, plus or minus five, we would still get reasonable fits.

Actually, I think this is a good time to have a break, because next we're going to look at whether the parameter estimates are good and look at confidence limits. And before we do that, let's have some coffee. When do you want to be back? Oh, right, it's exactly three now. So, 3:30? Okay. Yeah.
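In code, the overlay and the parameter summary discussed just before the break would look something like this (continuing from the simulated data above; a sketch, not the original slides):

    fit <- lm(data[, 2] ~ data[, 1])   # fit the linear model once and keep it
    abline(fit)                        # overlay the fitted line on the existing scatterplot
    # or equivalently, without storing the fit first:  abline(lm(data[, 2] ~ data[, 1]))
    coef(fit)                          # the two coefficients: intercept and slope
    summary(fit)                       # standard errors, t-values, p-values, R-squared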