Something I've just learned is that I can actually detach these windows if I drag them into the file area. That's nice, because now I can switch between windows and have one of them very large, which makes it easier to look at plots. I've also made a few small updates, so if you use the version control icon and pull, you will have the latest version of the EDA introduction. What did we learn about quantile plots? What interesting things did you find? I had a number of discussions about what these quantile plots actually mean — how can we understand what we're doing here? If this were a quick post-prandial quiz, could you write down the meaning of a quantile plot and get away with it? Essentially, what we're doing with these quantile plots is comparing the shapes of distributions. It's relatively easy to compare means, variances, standard deviations and ranges quantitatively, but there's more subtle information encoded in the shape of the histogram, in the shape of the distribution, and that's what we get at with the quantile plot. Essentially we build a histogram along the quantiles and then compare how these histograms match up against each other. We get rid of differences that are simply due to shifts, we get rid of differences that are simply due to scaling, and we look at the shape. If the shapes are very similar, the quantile plot will fall on a straight line; if the shapes are different, the quantile plot will deviate from it. I think what many of you who compared controls against the LPS-stimulated data have seen is that there's a break in the data: the parts on the left-hand side are roughly linear, and then around a value of minus 10 it shifts — it drops down, or there's a curve, or things look different. The way I would interpret this is that for the small enrichment values, the control and the stimulated distributions are essentially the same. There may be localized differences, but the differences between individual genes essentially correspond to noise; there's no big difference there. But there are significant differences in the higher enrichment range, and that's where things actually happen. So whenever you have distributions, or plots, or lines, or regressions that seem to have two lines, you always have to think: there is probably more than one population in my sample. In this case, the two populations that come to mind are the regulated and the non-regulated genes — the ones that respond to an induction with LPS, and the ones that do not respond to such an induction. Now, if we know that, and we can say that overall there seems to be a breakpoint between the regulated and the non-regulated genes around a value of minus 10, we can use this information for subsetting the data. We could pull out all the genes whose values fall below minus 10 — or don't — and thus increase the contrast between signal and noise in our analysis of the data set. There's nothing we knew about this before we looked at the data. So this is a completely unbiased exploration of the data: looking at some of the distributional properties, i.e. essentially the shape of the histogram, figuring out from that that there are probably two subsets of data in there, and working out how we can pull them apart and use that for further analysis.
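To make that concrete, here is a minimal sketch of the two steps — comparing shapes with a quantile-quantile plot, and then subsetting at the apparent breakpoint. This is not the actual course code; the object name lpsData, the column names, and the direction of the subset are placeholders.

```R
# Compare the shape of the control distribution against the LPS-stimulated one.
qqplot(lpsData[ , "ctrl"], lpsData[ , "LPS"],
       xlab = "control quantiles", ylab = "LPS quantiles")
abline(0, 1, col = "firebrick")   # reference only; similar shapes give a straight line

# Subset at the breakpoint seen in the plot, e.g. keep the higher-enrichment genes
sel <- lpsData[ , "ctrl"] > -10
highGenes <- lpsData[sel, ]
```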
So this is why QQ plots are interesting. It will probably take a little while to wrap your mind around what that means. Again, the best thing to do is remember that we're basically looking at the shapes of distributions here. Think of how you could use something like that on your own data at home, for example by comparing distributions from your own data against normal distributions with qqnorm(), or between controls and experiments. Ultimately it will click and it will work, and you'll be as confused as now, but on a higher level, as we say. Now I'd like to shift gears, still staying with exploration of data, and look at a few scatter plots with flow cytometry data. Not to go too deep into scatter plots — there are loads of examples in the plotting reference — but I'd like to create some visuals before we move on to completely different things. This is a table, gvhd.txt, in the package. It's a rather large table of tab-separated values, including a header row. I got this many years ago from Sohrab Shah, I believe. This is graft-versus-host disease — flow cytometry of graft-versus-host disease. This time we read it with read.table(), with header = TRUE because there is a header line, and that's what it looks like. Now, I'm not entirely sure what all of these channels are: this column here is FITC, labelling the CD4 surface antigen; this column is CD8, probably labelled with phycoerythrin; this would be CD3, labelled with PerCP; and a fourth channel at a different wavelength — probably allophycocyanin. Yes, exactly — different fluorescent dyes. How do they colour the cells? How do we get the dye specifically onto the cells, so that the cells have different fluorescent colours? Antibodies. We chemically conjugate the dye to an antibody which specifically recognizes CD8 or CD3 or CD4. Note that we have two columns for CD8 here. What can we do with that? Exactly — we should expect that these two channels correlate very well with each other: if a value is high in one channel, it should also be high in the other. That's a good internal control for dye effects during labelling. OK, so let's play with that data. For example, we could extract only the CD3-positive cells. If we look at a histogram, what would we consider to be CD3-positive? There's a distribution of labelling events; for the purposes of the tutorial we've always used a value of 280 — that's about here, breaking it around the middle. Then we subset our data and say: the CD3-positive subset is a data frame of our original data where column five of the original data is greater than 280, and of that we want columns 3 to 6. So now we have the four dye-labelled channels in columns 1, 2, 3 and 4, and only some of the rows are kept. How many rows do we still have? How many cells are still in this data set? How can we tell? Can we do dim()? I don't know — let's try. What does this mean? dim() gives the dimensions of our object. So this is a two-dimensional object, and it has 2,089 in one dimension and 4 in the other. We've already seen that we have four columns, so that would be the columns, and 2,089 rows. Any other way? Can we not click somewhere? Let's click somewhere: the data is here, and if we click on it and look at it, it tells us 2,089. Is there another way? We could call str() on the data frame. Any other way? Why am I harping on this?
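In code, the steps just described look roughly like this; the threshold of 280 and the column positions come from the narrative, while the object names and the path are mine.

```R
gvhd <- read.table("gvhd.txt", header = TRUE, sep = "\t")
head(gvhd)

# CD3-positive cells: rows where the CD3 channel (column 5) exceeds 280;
# keep only the four dye-labelled channels, columns 3 to 6
gvhdCD3p <- gvhd[gvhd[ , 5] > 280, 3:6]
dim(gvhdCD3p)   # two numbers: rows (cells) and columns (channels), e.g. 2089 and 4
```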
I keep asking because, while dim() is completely correct, and looking it up in the object browser or using str() gives you that information as a side effect, if I want to write it into code I would like to make explicit what I'm actually doing. And so I would use the command nrow() to tell me that. Why do I need that? Well, sometimes I need to iterate over the data row by row, and then I often write for (i in 1:nrow(...)) for whatever the object is. I think that's the most explicit way to document what that number is, what it does, and where it comes from. There's a cognate ncol() that does the same thing for columns. OK, so let's plot it. Here we plot all the rows and two columns. So what's the x and what's the y in this? If you pass two columns of the data frame, the first column is implicitly used as the x value and the second column as the y value, and that's our scatter plot of graft-versus-host disease, CD4 versus CD8. Incidentally, we just mentioned that CD8 against CD8 would be a good internal control and should be highly correlated — that would be column 2 against column 4. Let's have a quick look. This is different: this is columns 2, 3 and 4. OK, well, since we have it now: if we give the plot function more than two columns of a data frame, it shows us all of the pairwise combination plots. I mistyped here — instead of 2 and 4, i.e. c(2, 4), I typed 2:4, which expands to 2, 3, 4. So now I'm seeing scatter plots of column 2 against column 3, 2 against 4, 3 against 2, 3 against 4, and so on. The one I was looking for is 2 and 4 — this would be this one or that one: this is 2 against 4, and this is 4 against 2. Same difference. And what do we notice? Are they well correlated? Not really, right? And it's kind of weird — it's not just noise. Apparently some subsets of these cells are easily labelled by both fluorophores, and some subsets are not, especially at the middle intensities. There's something odd going on: as the signal for CD8 APC increases, the signal for CD8 PE doesn't seem to increase as much. So the truth about the number of receptors lies somewhere between these two extremes, and that's what you have to keep in mind when you look at data like this. Interesting. Now, there's a lot of overlap in these plots, and I would just like to demonstrate a few ways to work with data that is very dense, to find out what the underlying distribution of elements is like. There's so much overlap that we can't really appreciate the fine structure. One way to look at this is hexbin. Hexbin is something like a two-dimensional analogue of a histogram: instead of lining the bins up along one axis, it puts them into the plane, and each bin has a hexagonal shape, so that the plane is completely tiled and the elements that fall into one bin are all approximately equally distant from each other. There's a function hexbin() which we can feed our graft-versus-host data; we can specify how many bins we want and the colours to be used. hexbin() basically creates a data object, and the plot function then knows what to do with that object based on the class that is assigned to it. So this is a hexbin plot — useful for very dense data.
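Roughly in code, using the gvhdCD3p subset from before; the number of bins is an arbitrary choice of mine, and the hexbin() call needs the hexbin package.

```R
nrow(gvhdCD3p)                        # explicit: the number of cells (rows)
for (i in 1:nrow(gvhdCD3p)) { }       # the idiom for iterating row by row
ncol(gvhdCD3p)                        # the cognate function for columns

plot(gvhdCD3p[ , c(1, 2)])            # first column as x, second as y
plot(gvhdCD3p[ , c(2, 4)])            # the two CD8 channels against each other
plot(gvhdCD3p[ , 2:4])                # three columns: all pairwise scatter plots

# hexbin: a two-dimensional histogram with hexagonal bins
library(hexbin)
hb <- hexbin(gvhdCD3p[ , 1], gvhdCD3p[ , 2], xbins = 20)
plot(hb)                              # plot() dispatches on the "hexbin" class
```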
So now, looking at this, I think we can appreciate very clearly that there seem to be at least four different populations of cells in our graft-versus-host data that we probably need to consider separately. It's the same as when we looked at our QQ plot and decided that there's probably more than one population among the genes and we should analyze them separately — that's basically what's demonstrated here with the hexbins. Yes? [Question about the colours.] Right. There's a lot of information on colours in the plotting reference, but essentially what happens is that the function colorRampPalette() returns a function as its value, and that function generates the colours. Let me show you what I'm talking about. This is a very bright yellow, this is a kind of dark blue, and these are some intermediate colours. So this specifies fixed points on a possible gradient of colours, and colorRampPalette() creates a function that can be assigned to a variable. If I assign the output of colorRampPalette() to the variable fCol, I can then use fCol to create colour values that lie on that ramp. If I need five colour values, it gives me five, starting from the first colour I specified, going to the last one, with something in between — notice I only specified four fixed points, but I can get five colours out. If I need more, say 20 colour values, fCol gives me 20 — again starting at the beginning, ending at the end, and varying smoothly in between. That's how we end up with as many smoothly varying colours as we want. If I had changed one of the fixed points to something more purple, I would get a different colour ramp, so I can make the ramp very smooth or very contrasty. The use of colour in plotting needs some care and thought. We can use colour in a quantitative sense — one extreme of values light, the other extreme dark. Often we use something like heat colours: black for low values, going via red and yellow all the way up to white for very high values, to emulate the emission spectrum of glowing iron or the like. But we can also use colours for categorical differences: if we have apples and oranges, we colour them green and orange, because that's a good mnemonic for identifying categories of elements in our plots. So these are different ways of using colour; it shouldn't be arbitrary. There's a fair bit of information and some examples in the plotting reference script. OK, so hexbin is one way of doing this. Another way of dealing with overlap is smoothScatter(). This basically smooths the points and gives us a cloudy, low-resolution appearance, reflecting the fact that there's error associated with every single one of these data values, and giving us dark colours where there's a lot of overlap. Somewhat similar is varying the colours of the points themselves by density: if a data point falls into a very dense region, we give it a darker colour, and if it falls into a sparse region — if its nearest neighbours on the scatter plot are all far away — we give it a lighter colour. This is done by using the densCols() function on the two data columns as the col argument of plot(). pch is the plotting character that we use.
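Roughly in code; the specific colour values, and the use of densCols() with a filled plotting character, are my own illustration of what was just described.

```R
# colorRampPalette() returns a function; that function then produces any number
# of colours interpolated between the fixed points we gave it.
fCol <- colorRampPalette(c("#FFFF00", "#CC0000", "#00CC00", "#000066"))  # four example fixed points
fCol(5)    # five colours, from the first fixed point to the last
fCol(20)   # twenty colours, smoothly interpolated along the same ramp

# smoothScatter(): a smoothed density representation of a dense scatter plot
smoothScatter(gvhdCD3p[ , 1], gvhdCD3p[ , 2])

# densCols(): one colour per point, darker where the local density is high
plot(gvhdCD3p[ , 1], gvhdCD3p[ , 2],
     col = densCols(gvhdCD3p[ , 1], gvhdCD3p[ , 2]),
     pch = 16,      # filled circle, so the colour fills the symbol
     cex = 0.5)     # small character expansion: small dots reveal structure
```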
cex is the character expansion. The plotting characters we use can be circles, squares, triangles, daggers, stars and other symbols, and these are identified by individual numbers — again, refer to the plotting reference. pch = 16 plots a symbol that is filled with the specified colour, and cex, the character expansion, specifies whether the plotted symbol is large or small. If we plot this, it looks like the scatter plot, but more defined; if we make cex smaller, we get smaller dots, and then we can also appreciate the internal structure of data sets that overlap. And the trellis plot that is implicit in this kind of lattice graphics is something we accidentally stumbled into before. So that's what I basically wanted to cover as a first introduction to looking into data. We've talked about plotting data in various ways, about defining some statistical measures — statistics like mean values and variances — to characterize our data, and about how to use quantile analysis to analyze the distributions of our data points and compare them. But to get a little more quantitative than that: one of the most frequent types of analysis we do is regression analysis, so let's spend some time on regression. We'll talk a bit about correlations — we've mentioned the word correlation before, but let's have a closer look at what correlations mean. We'll talk a little about linear and nonlinear regression; unless there's huge demand, I'm not going to explicitly go over nonlinear regression, but the principles should be clear from the linear case, and there's code for how to specify functions and fit them with nonlinear least squares. We'll talk about that later. And of course, we need to be able to compute these things — or to know when not to. So here's a scenario. These are points on a scatter plot, and it's a very noisy scatter plot. But even though it's noisy, we can see that there's a trend: if values are small on the x-axis, they tend to be small on the y-axis as well, and conversely, if they're large on the x-axis, they also tend to be large on the y-axis. So some of the information that we store in x and y is actually redundant, because some of the information in x can be used to predict values of y. This is what we mean when we say data is correlated: a part of the data depends on another dimension of the data. Now assume, just for example, that the x values are the heights of people, measured in metres — we have short ones and tall ones — and the y values are their weights in kilograms, so we have short, rather light people and tall, more stout people, and everything in between. The question is: how do we quantify to what degree these are correlated? If we quantify that, what does it even mean? And how do we get the parameters that define this correspondence? That's where we get into correlations. Note that correlation doesn't imply causality — that's a separate question. In R, the function cor() computes a correlation coefficient; the values can go from minus 1 to 1, and 0 means no correlation. These correlations are part of models, and linear regression is one of the possible statistical models that we can apply in data analysis. Now, for a statistician, a model is not what you might think it is.
"Model" does not mean we have a quantitative understanding of nature such that we can plug some values into that understanding and it will spit out the correct results. A statistician's model is simply a statistical function that creates data whose properties come as close as possible to the properties of the observed data. It's a black box, but it is supposed to have the same statistical properties as the observed data. So what we do in modeling is: we specify a model — in this case, that a linear model applies; we estimate the parameters of that model — for a linear model, the parameters are the slope and the intercept on the y-axis; and then we ask: is the model any good? Does it actually explain our data well? If it is not good, we either re-estimate the parameters or we specify a different model. If it is adequate, we can use the model, for example, for prediction. Now, linear regression assumes a particular model: that there is some offset, plus some proportional factor — basically a multiplicative factor — plus some error. The different components of this model go by different names in the literature: the independent variable, also called the predictor variable, regressor, exposure variable, or input variable; and the result, which is the dependent variable. There are two parameters that you can independently fit: one is the regression coefficient, i.e. something that characterizes the slope, and one is the intercept. The implicit assumption here is that our whole system has only two variables of interest — of course there's also error, or noise, but we would like to exclude that — one variable is the response and one is the predictor, and we don't need to adjust for confounding or other between-subject variation. Confounding means we have two variables that change in the same direction due to a common cause. One well-known example of confounding — actually, I don't even know: do we have storks in North America? I've never actually noticed in Canada. Anyway. In Germany there used to be lots of storks, and as you know, Germany is also one of the first-world countries with declining fertility rates. So fertility rates are declining, and it also appears that the number of storks is declining. So there's a very clear correlation between the number of storks and the number of babies appearing in households. Does this mean we're validating the idea that babies are brought by storks? Not necessarily. So is it just spurious? Not necessarily either. The underlying realization is that both of these effects are caused by industrialization, or affluence. As people become richer, perhaps better embedded in a social network, with old-age security, it becomes less important to have many, many children to feed you porridge once your teeth fall out. At the same time, industrialization reduces the wet meadows, which produce lots of frogs, which are good food for storks — and thus the habitat of the stork is reduced. So industrialization is a confounding factor which acts both on the number of storks and on the fertility rate of society, and both effects go in the same direction. If we analyze one against the other, we will find a correlation, but no causality is implied.
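To make the stork example concrete — and as a preview of the adjustment that is discussed next — here is a toy sketch: a made-up confounder drives both variables, and regressing it out makes their apparent correlation disappear. All names and numbers here are invented for illustration.

```R
set.seed(112)
industry <- 1:50 + rnorm(50)                     # the confounder
storks   <- 100 - 1.5 * industry + rnorm(50, sd = 3)
births   <-  80 - 1.0 * industry + rnorm(50, sd = 3)

cor(storks, births)                              # strongly positive: both decline together

resStorks <- residuals(lm(storks ~ industry))    # remove the confounder's effect ...
resBirths <- residuals(lm(births ~ industry))
cor(resStorks, resBirths)                        # ... and the correlation essentially vanishes
```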
However, there is a relationship, and it is determined through that so-called confounding factor. We could adjust for it by measuring how much of an effect the confounding factor has, specifically removing that effect from our original data, and then asking whether the variables are still correlated — and usually we will find that they are no longer correlated. That is an adjustment for confounding factors which we can do with regression analysis beforehand; it's part of cleaning up your data before you analyze it. And that's the assumption: any cleaning up of the data that needs to be done has already been done before we throw our data into the linear regression. Another assumption is that the relationship actually is linear — not quadratic, not sinusoidal, not exponential, but linear. If it's not linear, it can still be fit — anything can be fit — but the fit will not be meaningful. The variance should be constant, and in particular it should be independent of the size of the values: we shouldn't have larger variance where the values are large. The errors should be independent of each other, and for proper statistical inference the errors — the noise — should actually be normally distributed. So even a simple regression like linear regression comes with a number of implicit assumptions. Fortunately, most of these assumptions are not bad, so we can live with them. Now, once we do a linear regression, we estimate the parameters, and then we need to ask ourselves: how good is the model in the first place? Can we believe our parameters? How much of our data would be covered by a prediction? One of the reasons why linear regression is so important is that it has an analytic solution. What does that mean, an analytic solution, and what's the opposite? Have you ever heard that term? It comes up a lot in computation. Well-behaved problems may have analytic solutions; messy problems have numeric solutions, if they have solutions at all. Analytic solutions can be written in closed form, as an algebraic formula. Numeric solutions have no closed form, so we need to try things, or simulate, or run an optimization algorithm that slowly iterates its way toward a solution. These procedures are often slow, they often get stuck in local minima, and there are a number of other issues. Nonlinear least-squares fits — fitting fundamentally different models, like logistic or trigonometric functions — have these properties. Linear regression, on the other hand, is well behaved: there's an analytical solution which we can compute, which means it's fast and efficient, we can apply it to very large data sets, and it's a good assumption for a first look. So let's study this in code. Mohamed? [Question: is the error term something we measure?] There are newer statistical methods that try to work with quantitative estimates of the error terms and get some interesting results. For our practical purposes, what we do is estimate the slope and the intercept and then treat the residuals — the differences between the predicted and the actually observed values — as the errors.
But that these are residuals and not true effects is, again, an assumption that can perhaps be improved upon. OK, so this is a different R project. The project is R_EDA-Regression. You download it in the usual way with the URL of the repository — github.com/hyginn/R_EDA-Regression: R, underscore, EDA in capitals, hyphen, Regression — and once that is loaded, type init(). Then we have some material to work with. Anybody stuck at that point? Wave your hands, put up a red post-it. No? Everybody got it? Wonderful. Now, in any kind of modeling analysis it's really important to be able to generate synthetic data. You're going to try to estimate the parameters of an unknown distribution according to some model — and how would you know that your estimation is correct? How would you know that the functions and procedures you're using actually have the capability to find the right parameters? What you do is generate synthetic data for which you already know the parameters, and then try to recover them. If you can get something close to the known parameters, you may have some confidence that your workflow is valid. If you omit that step, it's just a wild guess: it may be right for the wrong reason, it may be entirely wrong, and if you publish results based on it, you're well on your way to a conversation with the dean about your recent retraction. Uncomfortable. [Question: is that required?] Unfortunately not, which is pretty terrible. But what people do more and more frequently is take all of the code they've used in the analysis of their data and post it on GitHub, so that everybody who is interested can reproduce the analysis. That's the modern way to work: make your reviewers confident that they can trust your data because you applied the utmost care in generating it. Now, here's a small function that generates synthetic height and weight samples according to a relationship that I've essentially just made up. It takes a parameter n, the number of random samples we would like to generate, just like rnorm() has such a parameter. It has a parameter for the minimum height, 1.5 metres, just to constrain things, and one for the maximum height, 2.3 metres — I think that's reasonable. It has a scaling factor of 40, which seemed to give sensible numbers. We set a seed value to make things reproducible; if I were writing a function that I would reuse more often, I would allow the seed to be turned off or passed in as a parameter, so as not to always get the same values. Then I make a matrix — something like a two-dimensional vector — with n rows and two columns, and I generate a column of heights in the interval defined by the minimum and maximum, using the function runif(). Let's try that: runif() of, say, seven values between 20 and 23 gives me values — and where the values from the normal distribution were distributed along the bell curve, the values of the uniform distribution are just uniform within the specified interval. So basically: random numbers in an interval. Then I generate a column of weights with a linear model, which is simply the scaling factor — this number 40 — times what we find in the column of heights, plus one.
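A sketch of what such a generator might look like; the function name, the seed, and the amount of noise (which is added in the next step of the narrative) are my own choices — only the range 1.5–2.3 m and the factor 40 come from the description.

```R
synthHW <- function(n = 50, hMin = 1.5, hMax = 2.3, q = 40) {
  set.seed(83)                                   # reproducible; could be an argument instead
  hw <- matrix(numeric(n * 2), nrow = n, ncol = 2)
  colnames(hw) <- c("height", "weight")
  hw[ , "height"] <- runif(n, min = hMin, max = hMax)      # uniform random heights
  hw[ , "weight"] <- q * hw[ , "height"] + 1               # the linear model ...
  hw[ , "weight"] <- hw[ , "weight"] + rnorm(n, sd = 4)    # ... plus noise, as described next
  return(hw)
}

hw <- synthHW(50)   # 50 samples with the default parameters
plot(hw)            # height on x, weight on y
```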
And after we've computed that, we also add noise to the data — otherwise we would have a perfect linear correlation; adding noise gives us some variability. Then we create an object simply by asking this function to return 50 values with the default parameters, and we can plot that the same way we used scatter plots before. If you have a very good visual memory, you will notice that this is strikingly similar to the plot I showed on the PowerPoint slides, for reasons that are probably self-explanatory. OK. Now, what about the coefficient of correlation? That's easy to calculate: it's just the function cor() with the first column and the second column. That calculates the correlation between height and weight — bingo, here we are, 0.54. Yes. What does that mean? Good, bad, high, low? What should we expect? Not sure? I don't know either — this is just a number. We often have procedures that give us numbers, and we need some frame of reference to interpret them. What's a high correlation? What's a very high correlation? What does that look like? So let's build some mental images. Let's ask what kind of correlation coefficient we get from a perfect, or a less-than-perfect, correlation. For this, I take 50 random values from the normal distribution, and I set a little mixing constant. Then I define the y values that correspond to the x values as this constant times x, plus one minus the constant times another set of random values. What this means is that in the first case, 99% of my data has a perfect linear correlation and 1% is random noise; in the next case, 80% of my data is perfectly correlated and 20% is random noise; and in the last case, 1% of my data is correlated and 99% is random noise. We can plot these and see what the correlation coefficients are. This is the first case: the noise is only 1%, and the correlation coefficient is 0.9999577 — very close to 1. If we have a correlation coefficient like that, this is what the data looks like: very nearly everything falling on a single line. Here it gets scattered out a little more: this is a correlation coefficient of 0.97 — already significant scatter, 20% randomness, but still an r value of 0.97. With 60% noise, the r value drops down to 0.57 — that's the kind of data cloud you get. This is not our height-and-weight data, even though it looks similar; but remember, our correlation was 0.54, this correlation is 0.57, and they do look similar. So correlation coefficients of 0.5, or even 0.4, are still considered quite substantial; matters become a lot less trustworthy as the correlation gets smaller. And in the case of just 1% signal — this is essentially an uncorrelated cloud — the coefficient is minus 0.04, very close to zero. So that's the kind of intuition we can form about the value of the correlation coefficient and what degree of scatter, or noise, it represents: down to around 0.5 it's still pretty good.
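This is the kind of experiment being described — my own minimal version; the exact mixing fractions and the seed are arbitrary, but they reproduce the pattern of correlation values discussed above.

```R
set.seed(101)
x <- rnorm(50)

for (r in c(0.99, 0.80, 0.40, 0.01)) {           # fraction of signal; the rest is noise
  y <- r * x + (1 - r) * rnorm(50)
  plot(x, y, main = sprintf("signal fraction %.2f,  cor = %.3f", r, cor(x, y)))
}
```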
Now remember, we said the relationship is not necessarily linear. In biological data we often have sigmoidal functions, or exponential functions, or distributions that are roughly normal but, because the values can never be negative, are censored, so there's skewness in the tail. So what happens if the relationship is not linear? Let's look at the correlation coefficients of relationships that are not linear. I'm doing this with data from a uniform distribution between minus 1 and 1, and the mixing factor that I used above — 80% or 40% — I'm now setting to 0.9, so the relationship is 90% functional and 10% noise. In the linear case it looks like this — that's the linear case at that noise level, and our correlation coefficient is 0.98. What happens if it's a periodic relationship — say we're looking at correlations of time series over cell cycles? This goes through one cycle, periodic, again 90% functional and 10% noise — and now the correlation coefficient is virtually zero. The coefficient of correlation tells us this data is not correlated; the scatter plot tells us that's not actually true. It's not linearly correlated, but finding the right function would allow us to make a very good prediction of the data. What about a polynomial? Here I'm using an x-squared function, a parabola — same problem: a very nice fit to a parabola, a very poor coefficient of correlation. Exponential: the values get very high, so the noise gets suppressed, and here we actually get a reasonable coefficient of correlation. Why? Well, if we draw a regression line, we could sort of fit it in here, and the outlier values wouldn't contribute as strongly — so looking only at the number, we might believe there's something there. But of course taking exponential data and fitting it with a linear function is wrong. Circular — data on a circle — again gives a coefficient of virtually zero. So that's the caveat: the correlation coefficient picks out linear correlations quite well, but if your underlying model is not a linear correlation, the correlation coefficient may be entirely meaningless. We have to be careful. This is why EDA is so important: just running canned procedures and calculating numbers can be very, very misleading. It's really important to look at your data, look at the scatter plots, and try to determine whether your statistical model is actually appropriate. That said, calculating a linear regression in R is absolutely trivial. All we need is the tilde operator — note that this is a tilde, the character at the top left of your keyboard, not a hyphen, so look carefully. A linear model of the height-weight data, column 2 against column 1, gives us the parameters: an intercept of minus 2.8 and a slope of 42. Is that any good? Remember, we worked with synthetic data — what did we put into the data? 40: that was the slope. And the intercept? Plus 1. So is it good? I think it's pretty good. Given the degree of scatter we have here, getting the parameters back that reliably is actually quite remarkable. So we re-plot the data, and now let's plot a regression line. We use abline() again — only this time we don't specify vertical or horizontal, we feed it the regression model, and abline() is able to take the parameters from the regression model and plot the correct regression line. Here it is.
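Two quick sketches to go with this: first the point about nonlinear relationships (a parabola, where cor() is near zero although the relationship is strong), then the lm() and abline() calls on the synthetic height-weight data from before. The noise level and the seed are mine.

```R
# A strong but nonlinear relationship that cor() misses almost completely
set.seed(53)
x <- runif(50, -1, 1)
y <- 0.9 * x^2 + 0.1 * rnorm(50)   # 90% parabola, 10% noise
plot(x, y)
cor(x, y)                          # close to zero despite the obvious pattern

# The linear fit itself: note the tilde, not a hyphen
myFit <- lm(hw[ , 2] ~ hw[ , 1])
myFit                              # intercept and slope; the true values were 1 and 40
plot(hw)
abline(myFit, col = "firebrick")   # abline() can plot the line from an lm object
```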
So what is this line? What does it represent? Intuitively we say it characterizes where the data would lie in the abstract, ideal case. But mathematically, why do we put the line into this particular spot? [The sum of the distances is minimized?] Almost. [The square of the sum?] Not the square of the sum — the sum of the squares. This line minimizes the number you get when you take the distance of each point from the line, square it, and add all of those together. That is why it's called a least-squares fit: it's a fit that finds the line for which that sum of squared distances is minimized. Now, the residuals are these difference values, and we can easily calculate them to look at our plots. We can also easily calculate the idealized values with the function fitted(): fitted() takes the linear model object, knows how to extract the fitted values from it, and puts them into a vector. We can then use what we calculated for the residuals and the fits to plot the differences, and we do that by plotting them as segments — unconnected line segments, another way to put additional information onto a plot. A segment needs four coordinates: x1, y1, x2, y2. Here, x1 is the height value — the first column of our data — and y1 is the observed value in the second column; together they are the coordinates of one of the data points. The segment then goes to x2, the same point on the x-axis, and y2, the point on the y-axis that we find in the fitted values, i.e. the idealized value. Taken together, this draws a line from each data value down to the corresponding point on the regression line. So this is a plot that shows us the residuals — and the sum of the squared lengths of these red lines is what's being minimized. Then we can plot fit against residuals. What's that good for? This is the fitted value, this is the residual value, and if we look at the correlation, we see that the correlation of fit against residual is extremely close to zero. That's not surprising, because the model was built in such a way that this should be as close to zero as numerically possible — and that's successful. Ideally, the correlation between fit and residual should be zero. What would it mean if there were remaining correlation between fit and residual? It would mean we could come up with a linear model that is better than the one we have — but by definition, the one we have is already the best one. We could then adjust our linear model to further reduce the correlation of fit against residual, which contradicts the mathematics behind this, since the fit is already optimal. But there is another way in which fit against residual can come out systematically different, and that's if the linear model is no good: if the linear model is not able to remove some systematic effect from the residuals, it means we really have to change the model. So that's one way to see whether the linear model is adequate: we plot fit against residual, and only if the linear model is adequate will this look like a random, uncorrelated data set. Then the linear model is good. [Isn't that the same thing? Isn't fit plus residual just the observed value?] Yes — fit plus residual is the observed value, so it should come out the same. We can try that in the coffee break, but I think there's no different information there, because fit plus residual gives the observed value; there's no independent observation.
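A sketch of the residual plots, continuing from the myFit object above; the colour choice is mine.

```R
res <- residuals(myFit)     # observed minus fitted values
fit <- fitted(myFit)        # the idealized values on the regression line

plot(hw)
abline(myFit, col = "firebrick")
# one segment per point: from the observed value straight to the regression line
segments(hw[ , 1], hw[ , 2], hw[ , 1], fit, col = "#AA000055")

plot(fit, res)              # should look like structureless noise ...
cor(fit, res)               # ... and the correlation should be essentially zero
```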
OK, now, in order to further characterize whether this model is any good and what we can do with the parameters, there are two aspects of our linear model that we need to discuss: one is prediction limits, and the other is confidence limits. Prediction limits give us boundaries on future observations: they say how well the model is expected to accommodate new data. If we draw prediction boundaries on our plot, we can tell where new data points that are not in our observed sample could plausibly fall. Confidence limits, on the other hand, give us boundaries on the model itself: they characterize how well the model is expected to accommodate the data. Going back to the plot: the confidence limits tell us the possible range of slopes and intercepts of that red line, whereas the prediction interval tells us the possible range of where new points could fall. Both are calculated with predict() — one with interval = "prediction", the other with interval = "confidence" — and the result gives us the fitted values we've seen before, plus a lower and an upper bound. So for the first observation in our data set we have the fitted value — I can tell it's 60.57, so it's probably this one here, or this one — and a lower and an upper bound for the model at that spot: the lower bound says the fitted point could lie around 51, i.e. down here, and the upper bound says it could lie around 69, i.e. up here. These are the bounds on the fitted values, which tell us where the line could lie. Now, all we need to do to plot these boundaries is take the numbers we find in the lower and upper columns and draw them on our plot. But that's not as easy as just calling lines() on these columns, because the values are not ordered: if we simply draw lines from point to point to point, the lines will jump all over the place, and that's not what we want. The lines should go from the smallest x to the largest, and in order to plot them that way we have to use order() again, as we've done before — i.e. we sort our data in ascending order, recompute pp and pc in that sorted order, and then plot. First we plot the data; it's now plotted in sorted order, which we can't tell — it's the same data — but now it starts with this point, then that, then that, and so on. Then we use the function matlines() to plot the upper and lower bounds. These are the upper and lower bounds of the fit, i.e. the confidence interval, which by default is at the 95% level — something that is still conventionally significant. A line that lies anywhere within this band would explain the data about as well as this line here; so our model lies somewhere within these bounds, the mean of where it could lie is this grey line, but there's a lot of noise, and other lines could explain the data nearly as well. And for the prediction interval we have this band here, which says that, within our confidence limits, a new data point should fall between these two lines. We would be surprised if it fell outside — that would violate our assumptions about how well the model performs — although with large numbers of data points we are almost certain to find some that lie outside some of the time. There is a warning in the output: predictions on current data refer to future responses. So these are not statements about the current data; it's a prediction about something that hasn't been observed.
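Assembled into one hedged sketch, reusing the synthetic hw data; the interval argument values are those of predict.lm(), the colour and line-type choices are mine.

```R
o    <- order(hw[ , 1])                        # index vector: smallest x first
hws  <- hw[o, ]                                # data re-sorted along the x-axis
sFit <- lm(hws[ , 2] ~ hws[ , 1])

pc <- predict(sFit, interval = "confidence")   # bounds on the model (the line)
pp <- predict(sFit, interval = "prediction")   # bounds on future observations

plot(hws)
abline(sFit, col = "firebrick")
matlines(hws[ , 1], pc[ , c("lwr", "upr")], col = "slategrey", lty = 2)  # confidence band
matlines(hws[ , 1], pp[ , c("lwr", "upr")], col = "slategrey", lty = 3)  # prediction band
```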
Why is that there as a warning? Honestly, I don't know — maybe it's just a reminder of how the statistics work in principle, but it's kind of odd; I should look it up. OK. Now, this is what I would consider a complete regression plot. You show the data, so we can actually see what the data looks like — here it's not a whole lot of data, but if it were, we've looked at scatter plots with density colours or with hexbin, any of which you can use to show a lot of data and its density in an informative way. You show the regression line — that's the result, our best belief about what the correct model is. You show the confidence intervals for the model, which will be broad or narrow depending on how well the data is actually explained by the model. And you show the prediction limits — if we get new data, where would we expect it to lie? With a plot like this we're not just showing our analysis of the data, we're documenting the quality of the analysis. The difference between this and simply showing a regression line is the same as the difference between a bar plot with and without the error bars on top that show the uncertainty of the values — think of this as the regression-analysis equivalent of such a bar plot. So, we have about 10 minutes before the coffee break. How about you take our data and produce a good regression plot of this kind? Take any of the cell lines, take the controls, compare them to the LPS stimulus, produce a scatter plot from that, and put a plot of this type on top of it. Basically, all you need to do is copy some of this code, throw out the comments because you don't need them, and change the names of the R objects we're using here and the column numbers. If you're really lazy — or smart, which in computing is the same thing — you overwrite column 1 of HW with the control values and column 2 with the LPS values; then you don't even need to change the code at all. You don't understand what I just proposed? Then you can simply reproduce it — that's the wonderful thing about the script. But yes, indeed, that's what I'm suggesting. And once again: if it doesn't work, if you're stuck, if I've been confusing you all afternoon and you don't know what to do, show us the red post-it. [Sorry to interrupt, I might have missed this — what is the significance of ordering a vector before passing it to predict?] Remember: if the values are unordered and we draw lines, we start with the first point, go to the second, go to the third, and the lines jump all over — that's not going to work. If we order them, the lines are drawn correctly. [Sorry, maybe that was obvious.] No, it wasn't obvious. OK — shall we look at a solution? Just very briefly, my solution would be this. Since we have a new project, we need to reload the data, so we load it from that path: the two dots mean go one level up to the parent directory, we specify the directory we want to descend into, which is the R_EDA-Introduction project, and from that we pull the RData file for the LPS data. Then we make a matrix with the number of rows equal to the number of rows in the LPS data and two columns, and we place the two columns that we're interested in — in this case I'm using the macrophage control against the macrophage lipopolysaccharide treatment — into column one and column two of HW.
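My reading of that solution outline in code; the file name and the column names are assumptions — substitute whatever the introduction project actually saved.

```R
load("../R_EDA-Introduction/lpsData.RData")    # "../" = go up to the parent directory

hw <- matrix(numeric(nrow(lpsData) * 2), nrow = nrow(lpsData), ncol = 2)
hw[ , 1] <- lpsData[ , "mf.ctrl"]              # macrophage control
hw[ , 2] <- lpsData[ , "mf.LPS"]               # macrophage + lipopolysaccharide
```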
I'm using HW, as I said before, because I'm lazy: I can then just reuse the code I used previously to do everything. Now, plotting the prediction and confidence limits first requires us to order the data, i.e. create an ordering vector — smallest first, next smallest second, next smallest third, and so on — and then rearrange the data in HW according to that ordering vector. Now all the values go nicely from left to right, not all over the place, and since they go from left to right we can consistently draw a curve through them. Then we compute pp and pc in that sorted order, i.e. based on the re-sorted values, with the prediction and confidence intervals, and we plot. You can use a standard scatter plot; I've used one of the density plots here, just to integrate something we did previously with the density colours, and it shows me that most values in this data set are very small, with only a small number of values getting larger and larger. I specify the label of the x-axis, mf.control, the label of the y-axis, mf.lps, and the y-limits to span the range of my prediction interval. If I didn't do that, the y-axis limits would be bounded by the highest value of the actual points, but the bands I want to plot can go beyond the highest point; so if I want to include them on my plot and not have them truncated at the top or the bottom, I need to redefine the y-axis limits with an expression like this — the range between the highest and lowest values in my prediction intervals. The plotting character is simply a filled circle, the character expansion is one, i.e. keep the circles relatively small, and that gives me this plot. Then I had to change nothing in these last lines: it's exactly the same command for the prediction boundaries and the model boundaries. And you see that in the relationship between induced and non-induced data, the linear correlation is very well defined, but there's also a lot of noise, and there are quite a number of points that fall significantly away from the line. These would potentially be points of interest when looking for actual regulation in response to the stimulus. One final question, maybe: why do we see such structure here? What does that structure mean? Where does it come from? Why does it look gridded? Any idea? Whenever you see structure in data, you know you'll have to explain it — if not to yourself, then to your supervisor; worst of all if your supervisor is yourself. Well, what these gaps in the data mean is that there are values here and here, but nothing in between. How can that be? Because of the limited numerical precision of our data: many, many of these data points, simply because of the precision with which they were recorded, do not use the entire range of the x-axis — nor of the y-axis. That's all it means.
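For completeness, this is roughly how those steps fit together, continuing from the loading sketch above — the same regression code as before, applied to the control and LPS columns; colours and sizes are placeholders.

```R
o  <- order(hw[ , 1])                       # ordering vector: smallest x first
hw <- hw[o, ]
myFit <- lm(hw[ , 2] ~ hw[ , 1])
pc <- predict(myFit, interval = "confidence")
pp <- predict(myFit, interval = "prediction")

plot(hw[ , 1], hw[ , 2],
     col  = densCols(hw[ , 1], hw[ , 2]),   # density colours: most values are very small
     pch  = 16, cex = 1,
     xlab = "mf.control", ylab = "mf.lps",
     ylim = range(pp))                      # widen the limits so the bands are not truncated
abline(myFit, col = "firebrick")
matlines(hw[ , 1], pc[ , c("lwr", "upr")], col = "slategrey", lty = 2)   # model boundaries
matlines(hw[ , 1], pp[ , c("lwr", "upr")], col = "slategrey", lty = 3)   # prediction boundaries
```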