I would like to move on to discussing the concept of regression in more detail, i.e. what it means if knowledge of one aspect of our data allows us to make predictions about another aspect. Let's move to another module. I don't need to give you detailed instructions anymore: in RStudio, please make a new project from the web link, using our EDA regression repository as the URL. The new project should then load something like this. You can't copy it from the screen because I've just clicked it away, but it is basically the same URL as you used before, except that instead of "introduction" we use "regression".

What I'd like to do as we discuss regression is: understand the principles of what correlations in data even mean; understand linear and nonlinear regression analysis; take you through examples of actually computing linear regressions, i.e. computing the degree to which particular aspects of the data are dependent on, or independent of, each other; and show you a new tool, the maximal information coefficient, as an alternative to the Pearson correlation coefficient that we usually use in data mining.

So here's the scenario. When we measure more than one variable for each member of a population and we make a scatter plot of the x's and y's, we may notice that the values are not completely independent: there seems to be a trend for one variable to increase as the other increases. This is crucial as we try to move from a purely descriptive treatment of the data to understanding what's going on, understanding mechanisms. A pure description would be happy to say: we have numbers, this is how the numbers are distributed, we look at a histogram, and there is a peak at some particular set of values. But that doesn't tell us why this is so, and understanding the why, the mechanisms, is a lot of what we're interested in in biological analysis in the first place.

For example, going back to our CD4 and CD8 cells: if we have a large population of cells that are positive for both CD4 and CD8, we are called upon to explain why that would be the case. And if we see that this isn't random, that at least in a subset of cells CD8 tends to be high whenever CD4 is high, this points to a possible underlying causality. There may be something that co-regulates CD4 and CD8, and that allows us to form a biological hypothesis about why these cell surface antigens might be expressed at the same time and how that comes about. So this kind of correlation analysis, looking at the relationships between different features of the objects we study, is hugely important in trying to figure out what the underlying structures and mechanisms are.

Now fundamentally, there are three reasons why we could find two things to be correlated. One reason, and our brains are pretty much hardwired to make that inference, is that one thing causes the other; we usually think in terms of causality when we observe correlations. But there are two alternative explanations that we also need to keep in mind. The first is random chance: it just happens that one data set has a trend in one direction and another data set has some trend as well, and when we link up the data we find that there appears to be a relationship, when in fact there is none.
A funny example: if you look at the population of storks in central Europe, you find that it is decreasing. If you look at birth rates in central Europe, you find that over the same period birth rates are also decreasing. So the old tale that the stork brings the babies seems to be borne out by the data. However, even though we can discard a causal relationship (as biologists you would presumably all agree there should be none), this doesn't mean the two quantities are actually unrelated. If you think about the data more deeply, there is a reason why both data sets should behave this way, and that is industrialization. Industrialization leads to a loss of habitat for the storks, and it also leads to different family-planning decisions: families have children later, and there are fewer children overall. This is the third reason, and it is called a confounding factor: something that affects both variables, while the two variables themselves have no direct interaction with each other. So there may well be a deeper relationship even if you can exclude a causality between your two factors. All of these, the random association, the direct causality, and the influence via confounding factors, are discovered through regression analysis. In the end, however, it's going to be your job to figure out what the result of the regression analysis really means.

Now, when we analyze correlations quantitatively, this is one example of something called statistical modeling. And a statistical model is quite different from what you might normally think it is. A statistical model says nothing about how our observations come about. It is simply a functional description that reproduces the features of our data as closely as possible. Whether this model actually describes the underlying mechanism or not is not relevant; the important point is only that the model should reproduce the data. In a sense it is merely a device to explain features of our data, not necessarily a representation of a particular physical or biological mechanism.

So what we do in statistical modeling is: we specify some model, from first principles or from a principle of economy, Occam's razor, something simple that could describe our data; we estimate the parameters of the model (for a linear relationship, for example, the slope, the intercept, and the variance); and we then ask whether the parameters seem adequate to explain our data. If yes, we can use the model, for example, to make predictions about future observations that we don't actually have. If the parameters are inadequate, even though we've optimized them as best we possibly can, we have to go back and reconsider what our model really should be. That's a good question; this is something we do with confidence intervals. We'll get to that when we look at the linear models.

So linear regression assumes a particular model: there is an independent variable, which we choose from a range of possibilities, and a dependent variable, which is determined by the value of the independent variable times some multiplicative factor beta, plus some additive factor alpha, plus errors, or noise.
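Written out (this is the standard textbook form of the model just described, not copied from the course slides), in LaTeX notation:

    y_i = \alpha + \beta x_i + \epsilon_i , \qquad \epsilon_i \sim \mathcal{N}(0, \sigma^2)

where x_i is the independent variable, beta the regression coefficient (the slope), alpha the intercept, and epsilon_i the noise term.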
There are essentially two parameters here: the regression coefficient beta, which characterizes the slope, and the intercept alpha, which characterizes where the line is situated. That's the simple case of linear regression. It comes with a number of assumptions: only two variables are of interest; one variable is the response and one the predictor; we need no adjustment for confounding variation or between-subject variation; the relationship is linear; the standard deviations are constant, and in particular independent of the individual values, which is certainly not always true; and the errors are independent of each other. Yes? If there are two predictors, then you need mixed-effects analysis. There are special tools for that; you need to regress over both at the same time.

Okay. When we estimate our parameters, we try to choose parameters that are as close as possible to the true values. Now, how do we distinguish good from poor estimates? We try to optimize our parameters, and one possibility is to minimize the sum of squared errors. Minimizing the sum of squared errors has some desirable properties; one of them is that the problem is analytically solvable, so this is a very fast method. One could certainly negotiate about what actually comprises a good choice of parameter estimates, and about the trade-offs between errors of one type and another. But minimizing the sum of squared errors is the standard way of optimizing parameters in statistics. It is not the only way, but it is the standard way.

So what we minimize is the sum, over all observations, of the squared difference between the observed value y and the value predicted from the independent value x using the regression coefficient and the intercept. That sum should be as small as possible; when it is, we say this is the best choice of parameters. The result is usually summarized in the coefficient of correlation, which lets us quantify, for a data set, the degree to which the linear model is applicable.

And I'd actually like to illustrate that a little bit in code (a sketch follows below); this is our code file. What does a particular correlation coefficient mean? Let's start by simply looking at 50 normally distributed random values. We take these as our independent variable and construct dependent variables with different degrees of dependency: y values where one fraction is simply what we observe in x and the other fraction is additive noise. In the first case, my y values are constructed such that to 99% they are identical to the x values and to 1% they are random noise. If I plot this and calculate the coefficient of correlation, I get 0.999973. If we do the same thing with 80% signal and 20% random noise, we get a somewhat rougher plot, more noise is added, but the correlation coefficient is still almost 1: 0.98. With 40% noise the plot looks quite rough and the points lie all over the place, but this still gives us a correlation coefficient on the order of 0.5.
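Here is the sketch I mentioned, a minimal reconstruction of this mixing experiment. The mixing scheme (and the function mixCor) is my own, so the numbers will not match the session's output exactly:

    # Mix a signal fraction q of x with a noise fraction (1 - q),
    # plot the result, and return the Pearson correlation.
    set.seed(112358)
    x <- rnorm(50)                      # 50 normally distributed values

    mixCor <- function(x, q) {
      y <- q * x + (1 - q) * rnorm(length(x))
      plot(x, y, main = sprintf("signal fraction: %.2f", q))
      cor(x, y)
    }

    mixCor(x, 0.99)   # almost perfectly correlated
    mixCor(x, 0.80)   # rougher, but cor still close to 1
    mixCor(x, 0.60)   # noticeably noisier
    mixCor(x, 0.01)   # essentially pure noise; cor near zero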
Now, just looking at the data and eyeballing it, you can see that there's a bit of a problem here. If I removed these two points, I would probably get something with a very small coefficient of correlation, in which I would no longer recognize the internal structure that I imposed on the data. So at a correlation coefficient around 0.5, things become sensitive to individual data points; whereas at 80% signal it's clear that something is going on when you look at it, that becomes much less clear in these situations. Conversely, if I say 99% of my data is going to be random noise, this is what a completely random data set looks like, and the correlation coefficient I get in this case is slightly negative. So much for developing an intuition about what the correlation coefficient that's often reported with linear regressions means in practice.

Now, what happens if the relationship is not linear to begin with? Let's look at something else. We start with uniform random values between -1 and 1 and do essentially the same thing, but now with an underlying periodic relationship, a cosine. My y values retain the same degree of dependency as before, the mixture that gave us a correlation coefficient of 0.98, but now I fold the x values through a cosine relationship. What does the correlation coefficient do? It drops to 0.2. If you look at the data, you immediately see there is a lot of structure, but it's not a linear relationship, so the correlation coefficient fails: it tells us that the data is uncorrelated, which is not true. The data is just not correlated along a straight line. The same happens with a polynomial relationship, x squared: we could fit a parabola to this very well, but if we try to fit a linear relation we again get essentially nothing. Similarly for an exponential relationship, although exponential relationships are less sensitive to this, because they usually have long stretches in which they are approximately linear; here we get a somewhat reassuring correlation coefficient of 0.6. That value, however, is misleading, because I've used the wrong statistical model and the wrong underlying assumptions, which I probably can't detect unless I actually look at the data. So again: it is important to look at the data. And if I put everything into a circular relationship, I again get a very weak, almost insignificant coefficient of correlation.

So the caveat is: it's easy to calculate coefficients of correlation; it requires only a single R statement, cor(x, y). But you have to be cautious in interpreting the result. You have to be sure that you are actually looking at a linear relationship, otherwise the coefficient of correlation is not meaningful. We'll revisit the maximum information coefficient as an alternative at the end of this module; a small sketch of these nonlinear cases follows below.
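A minimal sketch of these nonlinear cases; the cosine frequency and noise levels are my own choices, not the session's exact values:

    # Structured but nonlinear data: Pearson's r collapses even though
    # the relationship is obvious to the eye.
    set.seed(112358)
    x <- runif(50, -1, 1)

    y <- cos(2 * pi * x) + 0.2 * rnorm(50)   # periodic relationship
    plot(x, y)                               # clearly structured ...
    cor(x, y)                                # ... but r is near zero

    y2 <- x^2 + 0.1 * rnorm(50)              # polynomial relationship
    cor(x, y2)                               # near zero; a parabola would fit well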
But now let's actually look at how to do regression analysis in practice in R. We start from synthetic data: completely synthetic, generated observations that could come from a cohort of people whose height and weight you measured. We generate random heights in an interval, calculate hypothetical weights according to a simple linear equation, and then add some errors. This is the function I use for that. To keep things reproducible, I set the seed of the random number generator, and I set two parameters for the minimum and maximum height. I have some pretty tall guys in here, up to 2.3 meters. I generate a column of numbers in this interval, and then generate a second column with a linear model: take the first column, whatever n numbers are in it, multiply by 40, and add 1. That is my linear relationship. Then I add noise, normally distributed with mean 0 and standard deviation 15, and return the result. So this is my function to generate height and weight samples, and, for example, I can generate 50 of these and plot them. This is my population: on average people around 1.8 meters and around 80 kilos or so. When we calculate the correlation of the two columns of the data set, we get about 0.5. So there's something there, but it's not very strong.

Regression analysis with simple linear models is provided by the function lm(). The simple thing to do is to run a linear model of our dependent data versus our independent data. Note that the operator here is not a minus sign but a tilde: the squiggly line, usually on the key at the top left of the keyboard, typed with Shift. In an R formula, the tilde specifies that we model one variable as depending on the other. So all it takes is one keystroke. The output summarizes the call, column 2 against column 1, and the coefficients: an intercept of -2.9 and a slope of 42.

Now, what do these numbers mean, relative to what our input was? Do you remember the input of our data? How do -2.9 and 42 relate to our question of analyzing the model that underlies our height and weight relationship? Where did my numbers come from in the first place? They came from this function, which generates synthetic height and weight samples in a range between the minimum and maximum height, with uniform random numbers for the independent variable and transformed numbers, plus some error, for the dependent variable. And the transformed numbers were calculated by taking the independent variable, multiplying it by 40, adding 1, and then adding some noise. This "multiply by 40 and add 1" is the key of our model: these are the parameters that generate a prediction of weight given a certain height, and that's what we would like to retrieve from our data.

So the question is: did lm(), did this R function, achieve this? Did it give me back the numbers I expected if my model is correct? What do you think, is it any good? Do I get back 40 and 1? I get back 42 and -2.9. I would say that's pretty close. Given that the sample was as noisy as it was, that seems to be pretty good. We can try this with a somewhat less noisy sample. So, is it 6 kilograms? Right; as the height goes to nominal zero, there's still a little bit left. If we make this less noisy, say we add random noise with a standard deviation of 5, we get a much tighter correlation, no longer really realistic: the correlation coefficient is 0.88, the intercept becomes smaller, and the slope comes quite a lot closer to 40. The random noise of course adds error, but by and large we've pretty much achieved our objective of recovering the underlying parameters that informed our model in the first place.
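For reference, here is a minimal sketch of such a generator and fit. The names, the seed, and the defaults are my own; the course file will differ in its details:

    # Synthetic height/weight data: true model is weight = 40 * height + 1,
    # plus normally distributed noise.
    synthHW <- function(n, hMin = 1.5, hMax = 2.3, noiseSD = 15) {
      h <- runif(n, hMin, hMax)                 # heights in [hMin, hMax]
      w <- 40 * h + 1 + rnorm(n, 0, noiseSD)    # linear model plus noise
      cbind(height = h, weight = w)
    }

    set.seed(112358)
    dat <- synthHW(50)
    plot(dat)
    cor(dat[ , 1], dat[ , 2])        # roughly 0.5 at this noise level

    fit <- lm(dat[ , 2] ~ dat[ , 1]) # the tilde: weight modeled by height
    summary(fit)                     # intercept and slope estimates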
Okay. Now, the intercept and the slope characterize the regression line, which I'm plotting here. What this means is that the idealized model, the one whose parameters give the least-squares error for predictions, looks like this regression line. Remember, we used the criterion of minimizing the squared error to find the parameters in the first place. So the parameters slope and intercept are the ones that minimize the prediction error. Put differently: if we take, say, a height of 1.6 meters and observe this weight, while we would expect a weight down here instead, then the difference between the observation and the expectation is an error that we have to account for, and we are trying to find the slope and intercept of the line that make this residual error as small as possible.

We've come across abline() before. What you see here is that we can take the output of the linear model, which I plotted previously, and feed it directly as input into abline(). The output of lm() is a little more complicated than a plain slope and intercept, but abline() knows how to work with this kind of object and draws the line with the proper slope.

Okay, so now let's have a look at the residuals, i.e. the part of the data that is not accommodated in the model: the degree to which the points deviate from the regression line. We obtain the residuals with the function resid() applied to our linear model object, and we can obtain the idealized values with the function fitted(). So the residuals are these differences here, and the fitted values are all the y values, all the weight values in this case, that we would get by simply applying this slope and this intercept. We can draw these differences as segments: these are the line segments of the differences, and essentially we are minimizing the sum of the squares of their lengths. segments() draws lines overlaying my plot; I specify one point where each line should start and another point where it should stop. For every segment, the starting point is the actual observation, which I find in columns 1 and 2 of my data, and the ending point has the x value of my observation and the y value of my fit, which I calculated above. This draws a line from each observation to the regression line.

Now let's plot and analyze the residuals. We plot the fitted values against the residuals, and this is what that looks like: for each fitted value, we see what the residual error is. And if we calculate the correlation of the two, we get a value which is extremely small, numerically indistinguishable from zero. This is actually how the method works: the regression minimizes this and brings it as close to zero as at all possible, so as to leave no unexplained linear variation in the data. Consequently, if we do see a linear trend, or some other trend, in the residuals, then we know that our linear model is not adequate. Maybe I should quickly code that up as an example and put it in the task-solutions file, just to illustrate the point.
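Put together, the residual analysis for the linear case looks something like this sketch, reusing fit and dat from the sketch above:

    # Residual diagnostics for the height/weight fit.
    res  <- resid(fit)                 # residuals: observed minus fitted
    fits <- fitted(fit)                # fitted (idealized) weight values

    plot(dat)                          # observations ...
    abline(fit, col = "red")           # ... the regression line ...
    segments(dat[ , 1], dat[ , 2],     # ... and the residual segments,
             dat[ , 1], fits)          #     from observation to the line

    plot(fits, res)                    # residuals against fitted values
    cor(fits, res)                     # numerically indistinguishable from zero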
Let's take our example from before: the random values that actually correspond to a different functional form. We calculate our exponential values according to the expression here, and run the same linear model analysis; this is what the regression line looks like. This regression line minimizes the sum of squared errors, but as you can see, it doesn't explain the data very well. Okay, now what do the residuals look like? Very simply, we repeat the residual analysis from before: for the residuals we need our fitted values and our residuals, which I just take, and we can plot the differences to make sure we're doing the right thing. It's easy to mix up whether to specify the dependent or the independent variable first, so it's a good idea to actually plot this at some point; and that looks correct. Once we have fitted values and residuals, this is the plot of fitted against residuals. Remember, before, this was basically a random point cloud; now it is certainly not random. If we do this kind of residual analysis, plot fitted against residuals, and see something like this, that immediately tells us there is information that is not accounted for by the linear model. The linear model is not an adequate explanation of the data; we have to do something else.

Okay, now let me reproduce my plot here. There's more we can do, and it has to do with the question: how good is our model, and what does it explain? For that, we calculate prediction limits and confidence limits. The prediction limits give boundaries for future observations: they characterize how well the model is expected to accommodate new data. If I plot prediction limits, they tell me within which bounds I can reasonably expect future observations to lie; conversely, they tell me which points could potentially be considered outliers and might warrant further scrutiny of whether the measurement was actually correct or whether something else is going on. The confidence limits, in contrast, tell me something about my model: how much variation in the model itself can be accommodated. Given the noise in the data, how confident can I be that the slope and the intercept are what they are? If I calculate these with the predict() function, I get my fitted values along with lower and upper bounds for each point. For the first point here, the fitted value is around 60, so this is this first one here, with a lower bound of 51 and an upper bound of 69. I can plot these by computing them in sorted order and then drawing them as lines. So I get these lines here. The red dotted lines characterize the bounds within which, I believe with 0.95 probability, I can expect additional new points to fall, given the observed distribution and variance of my data. The grey lines characterize my model: with the same confidence, a model calculated from data with this noise would lie within these grey bounds.

So essentially, prediction intervals and confidence intervals are additional information that is good to have when we plot a linear model: they show us not only the regression line, but also how sure we can be that the regression line is actually correct and that it explains the data we observe. Most of the data points are in there; only one or two are outliers, maybe these two guys here. So this is the linear case. Before I move on to nonlinear least-squares fits, I'd actually like to consolidate this and give you a task.
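For reference, before the task: a minimal sketch of how these limits can be computed and drawn with predict(), reusing fit and dat from the sketch above (the session's plotting code may differ):

    # Prediction limits: bounds for future observations.
    # Confidence limits: bounds for the regression line itself.
    # (R warns that predictions on the training data refer to future responses.)
    pp <- predict(fit, interval = "prediction")   # fit, lwr, upr per point
    pc <- predict(fit, interval = "confidence")

    o <- order(dat[ , 1])                         # sort by height for clean lines
    plot(dat)
    abline(fit, col = "red")
    matlines(dat[o, 1], pp[o, c("lwr", "upr")], col = "red",  lty = 3)
    matlines(dat[o, 1], pc[o, c("lwr", "upr")], col = "grey", lty = 1)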
Do we still have our LPS data loaded? No, we don't, but I've provided it for you. Rather than reproduce and recalculate our LPS data, I saved the LPS data object from our previous session with the command save(LPS, file = "LPS.RData"). That's a really convenient function to have. You will often have complex, important, and hard-to-reproduce data sets, data sets that would take a long time to recompute, and R provides this save() command to store them. If you save an R object like that, you get a compressed binary file, usually very compact, from which exactly the same object you had before you left your session can be recreated. And it can conveniently be recreated with load(). So in line 325, you can simply execute load("LPS.RData"), and my object reappears. Note that I simply execute the load; I don't need to assign the contents of the file to anything. Nothing is being recalculated: the object is recreated under exactly the same name and in exactly the same state that we had previously. I can also do this for multiple objects: I can give the save() command a list of objects that will all be put into the same R data file and can then be extracted again from it. Since I've put this file in the project, you can simply load it, and then you have the data back.

And now you can calculate a correlation. So what I'd like you to do is... well, what should we do? Correlate what, one cell type against another? Let's keep it simple: let's look at how reproducible control measurements are between two cell types. Take, for example, monocytes and macrophages, calculate the coefficient of correlation between the monocyte and macrophage control populations, and also plot that as a scatterplot.

Okay. To save myself large amounts of typing, I simply take the values of the monocyte control and the macrophage control and assign them to x and y. If I calculate the coefficient of correlation, I get something like 0.66, so it's pretty high. What does it look like if I plot it? I think most of you got something that looks very similar. But as usual, we need to think about what this actually means. On our x-axis we have log expression values for monocytes, on the y-axis for macrophages, and every single point corresponds to a gene. A point lies near the diagonal if the expression values for monocytes and macrophages are the same; a point lies higher up if the expression value for that gene is larger in macrophages than in monocytes. That's the interpretation. If we plot an abline() of the naive expectation that one is exactly the same as the other, that line would have an intercept of 0 and a slope of 1, and this is pretty much what the data looks like. There may be a little bit of an excess at the top here. So let's quantify how much that is. We simply calculate the linear model: y ~ x gives me an intercept of -2.0, which is somewhere up here, and a slope of 0.78, which is close to 1. Let's plot that and see how close it is. That's what it looks like. There are slight differences, and probably, if you run the same analysis of prediction and confidence intervals, you'll find that the red line and the blue line lie within the same confidence bounds of the model.
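A hedged sketch of save(), load(), and the task itself; note that the column names "MO.ctrl" and "MAC.ctrl" are placeholders I made up for whatever the control columns are actually called in the LPS object:

    # Persist and restore an R object, then run the correlation task.
    save(LPS, file = "LPS.RData")   # write a compressed binary snapshot
    rm(LPS)                         # gone ...
    load("LPS.RData")               # ... and back: same name, same state

    x <- LPS[ , "MO.ctrl"]          # monocyte controls   (placeholder name)
    y <- LPS[ , "MAC.ctrl"]         # macrophage controls (placeholder name)
    cor(x, y)                       # about 0.66 in the session above
    plot(x, y)
    abline(0, 1, col = "blue")      # naive "identical expression" expectation
    abline(lm(y ~ x), col = "red")  # the actual fit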
So, within the variation we see in our experiments here, overall, by and large, the resting states of monocytes and macrophages seem to be pretty much identical. Of course, they can't be completely identical, because cell identity is determined by differing expression levels of particular genes, which we could now identify as outliers, for example. If we calculated the prediction intervals, we would have access to the characteristic genes in which macrophages and monocytes differ. So far, so good; that's the linear case.

Now, of course, not everything in life is linear. Many things are not, and I'd like to take you through an example of a nonlinear least-squares fit to demonstrate how this is done. For a nonlinear least-squares fit, you have to specify the formula you would like to fit. The formula I use in this case is a logistic function, essentially something that gives you a sigmoidal curve, and it would be appropriate for modeling, say, disease onset. So let's simulate values for the risk of contracting a disease at a certain age, according to this logistic function, centered on 50 years, with onset spread approximately between 0 and 100 years.

In this case we employ a strategy that can be used to sample from any arbitrary target distribution. We generate a uniformly distributed random variate x in the interval we are interested in, simply pick a random number, and we calculate the corresponding function value. Then we essentially roll the dice on whether to accept or reject the variate: if the roll of the dice is smaller than the function value, we accept x as a sample of our distribution. If we do this many, many, many times over, the result is a distribution of values that corresponds exactly to the function we used to define it. I've shown you examples of using the rnorm() function to draw normally distributed values, and we've also used runif() for uniformly distributed values; this technique, rejection sampling, allows you to draw values distributed according to any particular function you specify, completely customized.

Okay, so this is the function that actually does the calculation. As its input, it requires a number that defines how many values I want. I start by specifying an empty vector, and while the vector is not yet filled, I generate a hypothetical age from a uniform distribution between 1 and 100. Then I sample in the range of my function, a logistic with e to the power of 0.1 times (age minus 50), and if my sentinel value is smaller than the function value, I add this age to the vector and increment the count of my successes, until I reach the number of samples I want. If it's larger, I reject it and try again until I find something that's smaller. In the end, I return my vector of values.

So this is the logistic function that's essentially being simulated here: a sigmoidal function, with the midpoint risk of onset here, where the risk gets larger the longer the person lives. Of course, once somebody has the disease, they can't contract it again, so the curve actually has to saturate at some point. Okay, so let's calculate 10,000 different ages of onset for our disease. And this is what these ages of onset look like.
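The sampler just described could look like this minimal sketch (the function name rOnsetAge is my own):

    # Rejection sampling: accept a candidate age with probability equal
    # to the logistic risk at that age (midpoint 50, scale 0.1).
    rOnsetAge <- function(n) {
      ages <- numeric(0)
      while (length(ages) < n) {
        age <- runif(1, 1, 100)                    # candidate age
        p   <- 1 / (1 + exp(-0.1 * (age - 50)))    # logistic risk value
        if (runif(1) < p) {                        # accept with probability p
          ages <- c(ages, age)
        }
      }
      return(ages)
    }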
Our probands, our synthetic population here, are all pretty long-lived, so we get a number of high ages. These are the ages of onset that we've tabulated from our function: different samples give different counts of these 10,000 draws. Underlying them is this logistic function, and the goal of our nonlinear least-squares regression analysis is to recover its parameters. For this, we do a nonlinear least-squares fit of a self-defined function: this logistic function here, with parameters t, S, tm, and B, which appear in the formula. t is the time; S is a scale factor; tm is the median age of onset, which we set to 50; and B is the scale of the rise, which we set to 0.1, i.e. something that characterizes the risk of contracting the disease. Let's try some reasonable starting parameters, S = 180, tm = 48, B = 0.2, and see what that looks like. We have to define the function first, of course. Okay, so this dotted line here is wrong: it's just something we made up on the fly, and it does not correspond to a fit against the data. But it is our starting point.

If our starting parameters for a nonlinear least-squares fit are completely off, completely wild, it could be that the fit doesn't converge. Whereas linear regression is an analytically solvable problem, i.e. we can write down a formula that tells us how to calculate the best parameters, a nonlinear least-squares fit in general has to be optimized numerically. That is, we try different values and accept parameters that are better. There are intelligent ways of proposing new variants of the values so that the estimates get progressively better and better, until they are so good that they cannot be further improved upon, at which point we say the curve fit has converged. It won't converge if the starting parameters are too far off. A particular problem arises if a parameter has to end up negative but we start with a positive value: the optimization then has to take it through zero, and since divisions are often involved, you get numerical instabilities near zero and it simply doesn't converge. So starting from a reasonable estimate is often key. A reasonable estimate might come from a somewhat simpler model with fewer parameters, or from constraining some parameters. Sometimes you really need to play around.

Okay. In a similar way as we've done previously, we can now simply run this nonlinear least-squares fit of the counts we calculated from our synthetic data set against the formula, and see what we get. The fit has converged, and it gives me parameters where the median age of onset is close to 50, the parameter we're trying to recover, and the risk is 0.1 or close to 0.1, which we also recover. So the nonlinear least-squares analysis has converged. Now we can extract the coefficients S, tm, and B and plot the curve with the fitted parameters as an overlay. Now you see the resulting optimized curve, which does indeed describe our values really, really well.

Now, if we look at confidence and prediction intervals, that's not as easy; it is very different for nonlinear least-squares fits. Here's a rather quick-and-dirty solution to get a better idea of how the choice of parameters could influence the underlying curve; the sketch below includes both the fit and this overlay.
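A minimal sketch of this fit, using the parameter names S, tm, and B from the lecture; everything else (the function names, the seed, and the ensemble overlay at the end, which anticipates the quick-and-dirty idea described next) is my own scaffolding:

    # Logistic model and binned onset counts from the sampler sketch above.
    myLogistic <- function(t, S, tm, B) S / (1 + exp(-B * (t - tm)))

    set.seed(112358)
    ages   <- rOnsetAge(10000)
    counts <- tabulate(round(ages), nbins = 100)  # onset counts per year of age
    t      <- 1:100

    fit <- nls(counts ~ myLogistic(t, S, tm, B),
               start = list(S = 180, tm = 48, B = 0.2))
    coef(fit)                                     # tm near 50, B near 0.1

    plot(t, counts)
    lines(t, predict(fit), col = "red", lwd = 2)  # the converged curve

    # Quick-and-dirty uncertainty overlay: jitter each parameter within
    # one standard error and draw the resulting family of curves.
    se <- summary(fit)$coefficients[ , "Std. Error"]
    cf <- coef(fit)
    for (i in 1:100) {
      p <- rnorm(3, mean = cf, sd = se)           # order: S, tm, B
      lines(t, myLogistic(t, p[1], p[2], p[3]), col = rgb(0, 0, 1, 0.05))
    }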
So we simply calculate the mean and standard deviation of all three parameters and then simulate a distribution of curves whose parameters all lie within one standard deviation of the estimates. Instead of one thin red line that completely defines a single curve, we now have an overlay of a large number of faint blue lines that are all compatible with the model within one standard deviation of the randomly varied parameters. So this is an example of a nonlinear least-squares fit. I hope the comments and the code are explicit enough that it should be easy, hopefully, to extend and apply it to problems you might encounter in your own work.

But now, simply as an alternative to Pearson correlation, I'd really like to introduce you to the concept of the maximal information coefficient. The maximal information coefficient was designed to solve the following problem: all of these distributions here are essentially uninformative under the Pearson correlation coefficient. If we have periodic functions, or functions like this, they all give us a Pearson correlation coefficient of zero. Obviously there is information there; it's just that Pearson doesn't capture it. Calculating the maximum information coefficient works in a somewhat different way. Essentially, you subdivide a scatterplot of the data into a grid of intervals, and then you ask whether there is association within these subintervals. You try to optimize this in such a way that you use as few subintervals as possible while capturing as much of the dependency as possible. This here is a good solution, a partition that contains much of the data; this here is a poor way of subdividing the data, because most of it lies outside.

Now, if you apply this strategy, what you find is: for random data, you get maximum information coefficients that are small, and the Pearson and Spearman correlations, mutual information and so on, which are alternatives to the Pearson correlation coefficient, are likewise all around zero. If you have a linear correlation, the maximum information coefficient picks it up just as all the others do. But if you have cubes, exponentials, sinusoids and so on, the maximum information coefficient still picks these up, while the Pearson correlation coefficient gets progressively worse, and some of the other measures also perform poorly on these cases. So this seems to be a very generally useful approach for data mining.

So, for example, let's apply this to our LPS data and calculate something between B-cells and macrophages: this is the activation profile of B-cells against the activation profile of macrophages, and it has a correlation coefficient of 0.46. Now, 0.46 seems quite large. But the question is: what would we actually expect by random chance? Is this significant or not? To answer that, we can simply shuffle the data and compile the distribution of correlation coefficients from random comparisons of genes. If we shuffle the data, we are no longer comparing the expression of a gene in one cell type with the same gene in the other cell type, but with a different gene, and that correlation should be completely random. In this way we can determine, from the same data set, whether the structure we think we observe is really due to the gene-wise correlation, or merely due to the overall structure of the data.
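In code, the shuffle test for the Pearson case can look like this minimal sketch, where x and y now stand for the B-cell and macrophage activation profiles:

    # Null distribution of Pearson correlations under shuffling:
    # sample(y) breaks the gene pairing while keeping the value distribution.
    set.seed(112358)
    nullCor <- replicate(10000, cor(x, sample(y)))
    hist(nullCor, breaks = 50, main = "Correlations of shuffled profiles")
    abline(v = cor(x, y), col = "red", lwd = 2)   # the observed value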
For example, if all our values in macrophages were 25 and all our values in B-cells were 27, we would see what looks like a perfect correspondence. But that would not mean the cell types are closely correlated gene by gene; it would just mean we've only ever observed one value. Maybe our instrument was broken, or something like that. The shuffled version would then tell us: the values are all the same, but that doesn't actually depend on the genes at all. So this kind of thing can be calculated. I'm computing 10,000 random shufflings here and looking at a histogram of the resulting correlation coefficients. The coefficients of correlation for my shuffled data are in this histogram: this is what I expect if there were no correlation. What I actually observe is this red line. So my correlation of 0.46 is very, very different from what I would expect from a purely random association.

Now let's try the same thing with the maximum information coefficient, installing the package first, then running the MIC on these profiles. The MIC here is 0.14. That's a number; as we've seen, the maximum information coefficient doesn't translate directly into a Pearson correlation coefficient. So again the question is: what does that mean? How does it compare to our shuffled data? Since the MIC takes a much longer time to calculate, we'll run this for 1,000 trials only, but we do the same thing. So it's running, working, heating the room; this little icon here, I could click on it and stop the calculation. Yes, it's done. This was 1,000 random trials of calculating the maximum information coefficient for the shuffled data, and once again we see that the distribution for the shuffled data is sharply concentrated: from a random association we would apparently expect a maximum information coefficient around 0.06, but the one we observe, 0.14, is very much larger than that.

So this null distribution is important. It tells us something about our expectations; again, we learn what we should be expecting, and the number we get from this analysis really only makes sense in comparison with that expectation. If I now fitted a normal distribution to it, and presumably a normal distribution captures it pretty well, then from that fit I could calculate how many standard deviations out our observed value lies; from that, the probability of this result; and from that, I could then answer the question: how significant is the difference between what I've observed here and my random, shuffled observations?

Okay. I think it's time for a coffee break, our last coffee break for today: half an hour, until 3:30. And I think this concludes everything I wanted to say this afternoon about regression analysis. If you have more questions, see me, ask about it, ask your peers. Once again, try to play with it. Try to calculate correlations; try to calculate linear and nonlinear regressions for problems that you might have. That's how you consolidate this material.
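To get you started with that playing, here is a hedged sketch of the MIC shuffle test from this module. I am assuming the minerva package, one common R implementation of the maximal information coefficient; the package installed in class may be a different one, and x and y again stand for the two activation profiles:

    # MIC observed vs. MIC under shuffling (slower than cor(), so fewer trials).
    # install.packages("minerva")
    library(minerva)

    micObs  <- mine(x, y)$MIC                            # observed MIC
    micNull <- replicate(1000, mine(x, sample(y))$MIC)   # shuffled null
    hist(micNull, breaks = 50, main = "MIC of shuffled profiles")
    abline(v = micObs, col = "red", lwd = 2)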