If your head is spinning by now, I know exactly how you feel; coffee keeps me upright. Now, Warren was asking more about that violin plot again, and Lauren was so kind as to show us how it actually works. I've put this into the code snippets. You need two libraries, which some of you might have installed already: ggplot2 and reshape. The box plot of the data looks like that, and we've looked at the histograms before; they're multimodal distributions. To put the data into a format that ggplot2 likes, we use the melt() function from the reshape package. It basically takes all of the columns, defines each column as a factor, and stacks the columns one after another into a single column. From that, ggplot2 can create an abstract plot, we can add the violin aesthetics, and that's what it looks like. So this is a violin plot, and you can appreciate that it is conceptually similar to the box plot, but richer in data. I've put that into the code snippets too.

We're moving on to the second big topic of the workshop. The first one was an introduction to exploratory data analysis; we'll get a little more quantitative now, with regression analysis. The concept of regression is essentially to quantitatively explore correlations in the data. We've looked at scatter plots before, somewhat vaguely, and seen that some of the data is correlated, and we can say that this is meaningful in a sense: if we see correlations in biological data, it often means that one thing causes the other, or that both are caused independently by something which acts on both, or it can mean that the correlation is just random chance. Sometimes the important task is to figure out which of these it is, but guilt by association, seeing that two things vary in the same way, often tells us something useful, or creates useful hypotheses about the mechanisms that underlie our data in the first place.

All of you should have updated and pulled the latest version from the repository, to make sure that what you have is actually the latest file of the R-EDA-Regression unit. If you find R-EDA-Regression in your Files pane and click on it, it will put you here. We're talking about regression, and before we talk about regression we need to think about correlation. What does correlation mean? When is data correlated? How do we calculate correlations, and what do the different correlation values mean when we see them?

So let's play with some data. We'll take 50 random deviates from a normal distribution with a mean of zero and a standard deviation of one, which is rnorm(50), and assign that to x; same thing for y. If we plot the two against each other, this is something that by definition should be random: we've taken random values for x and random values for y, so there should be no significant influence of x on y, because they were drawn independently from a (pseudo-)random distribution. The way to calculate the correlation, the degree to which one set of deviates depends on the other, is simply with cor(). This calculates the so-called Pearson correlation of two variables, and as we see, the Pearson correlation between x and y is a very small number. If, on the other hand, we now reassign y to actually be x, so that y holds the same values as x, what does the plot look like? We plot that as a scatter plot, and everything lies on a diagonal.
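A minimal sketch of that sequence (no seed is set here, so your numbers will differ from run to run):

```R
# independent pseudo-random deviates should show ~zero correlation
x <- rnorm(50)             # 50 deviates, mean 0, sd 1
y <- rnorm(50)
plot(x, y)                 # no visible structure
cor(x, y)                  # Pearson correlation: a small value near 0

y <- x                     # now y is identical to x
plot(x, y)                 # all points lie on the diagonal
cor(x, y)                  # exactly 1
```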
Whatever the x value is, the y value is the same thing: the x value is still random, but the y value is the same value as x. So what's the correlation of that? It is plus one, exactly one. If we instead say y is minus x and plot that, the distribution looks like this, and the correlation is minus one. So the Pearson correlation can take values between minus one, if the variables are anti-correlated, i.e. an increase in one value means a decrease in the other and vice versa; through zero, if there is no correlation and a change in one variable has nothing to do with a change in the other; up to a maximum of plus one, which means they are perfectly correlated, and an increase in the value of x definitely means an increase in the value of y by some linearly scaled factor.

Now, what we're trying to figure out when we look at correlations is: is there a significant influence? As we go from minus one to zero to plus one, at some point our belief that the correlation is meaningful is going to deteriorate. At first we say: yeah, this looks good. Then: this doesn't look as good. And finally: this is really bad, these two values have nothing to do with each other. But where do we draw the line, and what does that even look like? Let's try to develop a little bit of intuition here. I'll write a little function, plotCorrelation, which takes some values x and a mixing fraction r: we compute y as, to r parts, simply x, and to (1 - r) parts, random noise of the same length as x, and then plot the two together (see the sketch below). So we're looking at the intermediate states between the uncorrelated and the correlated, in small steps.

Initially we start with 50 normally distributed values of x and run this. At r = 0.99 you see that there's a slight deviation: 99% of what we see is y = x, and 1% noise has been added. Something like that is almost indistinguishable from one; it's 0.99999, a super-high correlation. If we go to 90%, it gets a little more spread out, but the correlation coefficient is still crystal clear: 0.996, very similar to one. At 80%, so adding 20% of noise, we still have 0.97; it doesn't start deteriorating very rapidly, this is still an extremely high correlation, and looking at it you would say that these two sets of deviates are obviously varying in the same sense. At 40% of x value and 60% of noise, it gets a lot more spread out, and it turns out that we have a correlation coefficient of about 0.5. That is still quite good: if you find correlation coefficients of 0.5 in your data analysis, you can pat yourself on the back, because you've probably found something interesting. And as the signal gets smaller and smaller, the data becomes almost uncorrelated: at 99% noise the correlation coefficient is a small value, in this case minus 0.13; repeat it, minus 0.10; repeat again, minus 0.16. So there's still something going on, but correlation coefficients of around 0.1 or 0.2 are really indistinguishable from random situations.
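A sketch of that function; the transcript's actual implementation may differ in detail, but the blend is the one described:

```R
plotCorrelation <- function(x, r) {
  noise <- rnorm(length(x))          # random noise, same length as x
  y <- r * x + (1 - r) * noise       # r parts signal, (1 - r) parts noise
  plot(x, y, main = sprintf("r = %.2f   cor = %.3f", r, cor(x, y)))
}

x <- rnorm(50)
plotCorrelation(x, 0.99)   # almost indistinguishable from y = x
plotCorrelation(x, 0.80)   # still about 0.97
plotCorrelation(x, 0.40)   # about 0.5
plotCorrelation(x, 0.01)   # indistinguishable from random
```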
Something we need to note, however, is that the Pearson correlation is defined and calculated with reference to a linear relationship. If our values are not linearly related, we can have very low correlations for very strong dependencies. Remember: at 10% noise we get correlation coefficients around 0.99. So let's look at 10% noise again, but now with a trigonometric function, a cosine, instead of a linear one. You see it goes down, it goes up, it goes down, it goes up. It's the same situation, a super-high dependency of one value on the other, but with a different functional form. What does the correlation coefficient do? Boom, it drops to essentially nothing: 0.07. If you look at the data, you'll immediately realize that you've discovered something; but if you only see the correlation coefficient, you might think there's nothing there. There is indeed no linear correlation here, but that does not mean the data are not mutually dependent.

Polynomials: with a quadratic function we again go down into a range of non-significance. With an exponential, the correlation is a little higher, because you can imagine that we can still more or less fit a line through it, with lots of residuals but still a somewhat meaningful description; 0.69 is higher than nothing, but it's far from the 0.99 correlation we would otherwise find. And if we have something circular, the correlation again drops to nothing. In all of these cases there is a clear functional relationship, one that should have yielded an association around 0.99, but it is not a linear relationship, and the linear correlation turns out to be much lower. So you have to be a little careful about how your data is structured.
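A sketch of the cosine case; the exact functional form and noise level here are my assumptions, but the effect is the one described:

```R
# strong nonlinear dependence, near-zero Pearson correlation
x <- rnorm(50)
y <- 0.9 * cos(pi * x) + 0.1 * rnorm(50)   # cosine relationship, 10% noise
plot(x, y)     # a clear, strong functional dependence ...
cor(x, y)      # ... but the Pearson correlation is close to zero
```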
Now let's stay with linear modeling for a moment and try to figure out: if we have a scatter plot of data, and the data is linearly correlated, how can we determine the parameters of that functional dependence? The task here is to take data, fit a line to it, and calculate the slope and intercept of that line to characterize the trend in our data. Of course, we could just take that function and apply it to one of the data sets we've seen, pull out something from the graft-versus-host-disease data or the lipopolysaccharide-stimulated data, but that would actually be an error. When we do quantitative analysis on data, with simple or with more sophisticated tools, the golden rule is: always do this on synthetic data first. You have to synthesize your data in a way that is similar to what your measurement looks like, but also in a way where you know the parameters you are trying to find. Then you apply your analysis, and then you check: did you recover the parameters that you put into your model? If you apply an analysis to synthetic data and you are not able to recover the parameters with which you synthesized that data, there is no way that the same analysis will somehow magically give you correct results on real data. That's really important: you have to tune and hone your analysis by first applying it to synthetic data.

So here is a very simple piece of code that generates a model of the height and weight of humans. We all know that as we humans grow taller, we also get heavier. Sometimes we get heavier without growing taller, but that's for different reasons, and there's also a big spread. We'll take 50 values; we'll say our minimum height is 1.50 m, not a very tall person, and our maximum height is 2.30 m, a person who is probably playing for the Raptors; and we build a data frame with 50 rows of heights and weights. We generate the column of heights as numbers in that interval. Here we don't use rnorm() but runif(); that's not "run if", it's r-unif, random deviates from a uniform distribution, so simply random numbers somewhere in the interval of heights. Then we generate a column of weights correlated with the heights through a linear model, where we say the weight of a human being is 40 times the height plus one, plus some random noise. So if somebody is 2 meters tall, they would weigh 80 kilos plus one, 81 kilos in this model, plus some random noise, depending on whether they work out or not (actually, both of those would make them heavier, right?). So we have heights with weights, and we can plot what we get, and you can see the trend. If you reproduce this, it should look exactly the same, even though we've used random numbers. Why? Because of set.seed().

A short digression into set.seed() and why it is really important when you simulate data. Random numbers in computers are usually not really random, unless we have special hardware that creates true random numbers. Random numbers in computers are so-called pseudo-random numbers: they are generated with a clever algorithm that spits out data according to some distribution, or just uniform random data, which is as unpredictable as possible and as close as possible in its statistical properties to true randomness. But it is an algorithm: if I repeat the algorithm with the same starting value, I will get the same pseudo-random numbers. That is important to realize, because it allows us to get random numbers that are the same every time we repeat something; and if we develop an algorithm that depends on some kind of randomness, being able to reproduce it is essential for troubleshooting.

Let me illustrate that briefly (see the sketch below). sample(1:5) will permute the vector 1 to 5: 1 3 2 5 4; 1 4 3 2 5; 4 1 5 2 3. I get a different permutation every single time, depending on the state of the random number generator. Now, if I set the seed to some number and sample 1:5, at first I don't notice anything different: 4 5 1 3 2, then 5 4 2 3 1. But if I reset the seed to that same number, I get 4 5 1 3 2 and 5 4 2 3 1 again; and again 4 5 1 3 2, 5 4 2 3 1. So it's always the same. It is still random from that point on, the random number generator just does its thing, but when I reset it with set.seed(), it goes back into the same state over and over again.

Does the value of the seed matter? Not really; you can use any integer up to the largest representable integer. I end up using 112358 a lot, Fibonacci numbers, I kind of like that; sometimes I use 314159. You can use 1. There is actually, if you care to look for it, an analysis of random number seeds on GitHub: somebody web-scraped GitHub, downloaded the R code posted there, parsed it to find wherever people set random seeds, and tallied what those numbers are. Don't say some people just have too much time; it's actually a nice exercise in large-scale data analysis. They found that 123 is very common, 1 is very common, 42 was very common; not so many people use 112358. If you don't set a seed, the generator is seeded when you start R with something that guarantees some randomness. Where that randomness comes from is operating-system dependent; there is a machine-dependent source of randomness which, I think, is used as a seed, simply to make sure that if you don't set the seed, you really do get different results.
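In code, the illustration looks something like this (the seed value is arbitrary; your permutations will differ from the ones quoted above):

```R
sample(1:5)        # some permutation, different on every call
sample(1:5)        # another one

set.seed(112358)   # fix the state of the random number generator
sample(1:5)        # a permutation ...
sample(1:5)        # ... and another

set.seed(112358)   # reset to the same state ...
sample(1:5)        # ... and the same two permutations
sample(1:5)        #     come back, in the same order
```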
I don't know how it works on every operating system, but at least on Linux systems the machine collects something called entropy, a pool of values derived from what's displayed on the screen, how you use your mouse, the timing of system interrupts, and so on. We actually had an absolutely fascinating case at one point where an algorithm started to fail on a server because the random number generator didn't work. After much troubleshooting, we found the reason: the server was just running, and we had disconnected the mouse and the keyboard, because, you know, it's a server. Since there was no more mouse and keyboard input, the machine had literally run out of entropy: the entropy pool on the machine, required to generate the random numbers that some system library needed, had been exhausted, and that system function didn't run anymore because there was no more entropy in the system. Great kudos to my assistant at the time, who was able to troubleshoot this and find out what had happened. I'm still fascinated: a computer that runs out of entropy, who would have thought?

(Question: have we found the optimal random number generator yet?) The optimal random number generator is true randomness: a little hardware device that actually generates true random numbers. You can do this with radioactive decay, or with quantum randomness and a very sensitive device. These things exist; you can buy a little USB device to stick into your computer that generates true random numbers. I always thought that at some point I'd like to geek out and get one of those. That said, the pseudo-random numbers that we use in R and in other applications are very, very good, really quite indistinguishable from true randomness. One thing you can't do with a true random number generator, however, is repeat your randomness when you want to repeat your analysis. In the code that I use here, I actually need repeatability, because for some things I just need the same thing to appear on your screen as on mine, and whenever I use any kind of sample() or rnorm() or whatever, we need to start from the same point.

All right, back from randomness and set.seed(). This is the plot, and all our plots should look pretty much the same. Let's calculate the correlation here: 0.54, which tells me there is indeed probably something significant. As the height increases, the weight increases as well, which could be a causal correlation, or it could be due to a confounding factor. Now, it is very simple to calculate a so-called linear model, i.e. to calculate which line would pass through that data and best explain it. This is just lm(), linear model, where the function call joins weights and heights with a tilde: weights ~ heights. That is not a minus sign; it's the squiggly character that lives at the top left of your keyboard, the tilde. (How do you call it? Does anybody call it anything other than tilde? Good.) So be careful: don't read this as a minus sign. If we calculate that, this is what we get: it returns the actual call of the formula, lm(formula = weights ~ heights), and the coefficients: an intercept of minus 3, and a heights coefficient of 42.9.
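Pulled together, the synthesis and the fit might look like this sketch; the seed value and the noise magnitude are my assumptions, so the exact numbers will differ from the ones quoted here:

```R
set.seed(112358)                              # arbitrary seed, for reproducibility
n <- 50
heights <- runif(n, min = 1.5, max = 2.3)     # heights in metres
weights <- 40 * heights + 1 + rnorm(n, 0, 5)  # weight in kg; noise sd assumed
dat <- data.frame(heights, weights)

plot(dat$heights, dat$weights)                # the trend is visible
cor(dat$heights, dat$weights)

fit <- lm(weights ~ heights, data = dat)
fit               # coefficients should come out near the true values 1 and 40
```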
What do these numbers mean? How do they relate to our request; what did we even ask? We created a synthetic data set in a particular way, as a linear model: take an x value, multiply it by 40, add one, add some noise. That means "multiplied by 40" and "add one" should come out of our analysis. Did it? Yeah, kind of: we got 42.9, and with the amount of noise that's in there, that is actually pretty good. If we run this many times with different seeds, this turns out to be one of the better examples; it just came out that way, I didn't actually fake it, even though I used 83 as a seed and not 112358. And minus 3 is not very different from 1 in the general scope of things, if we look at the intercept. So the numbers relate to our question because they are the parameters of the linear model which we used to generate the data in the first place. Is the estimate any good? Yeah, pretty good.

Let's plot a regression line. There is a function that we use frequently to draw lines on plots: abline(). It can be called in different ways. abline() is often called by just giving a parameter a for the intercept and a parameter b for the slope. You can also use it to put horizontal or vertical lines on a plot: for a horizontal line, abline(h = ...); for a vertical line, abline(v = ...). You'll recognize that I use that in code from time to time. But in the general case here, it can take the output of a linear model directly, and it plots the so-called regression line.

So what is this line? It's a nice line that we draw through the data, but what does it mean that this line is "nice"? There is a mathematical interpretation of why the line is fitted in exactly this way. The points on the line predict where the values would lie if this were a perfect correlation; but there are so-called residuals: the actually observed values differ from the predicted values, and the line is constructed in such a way that the sum of the squared residuals becomes as small as possible. This is why it is called a linear least-squares fit: it finds the line that makes the squares of the residuals as small as possible. To calculate the residuals, we use the function resid() and store the result; to calculate the idealized values, we use the function fitted(); and then we can use segments() to plot the differences. Each segment connects the predicted value on the line with the observed value, so the length of the segment is the size of the residual.

Now, if the fit is any good, and this is actually interesting, the correlation of the residuals should be zero, or close to zero. If the correlation of the residuals is close to zero, it means there is no extra information left in the residuals that we would need to take into account. In a case where a linear model does not explain our data well, for example because the data follows a sine wave, a quadratic function, or an exponential, there would be remaining correlation in the residuals; we can't remove all of the correlation with a single straight line. So if we plot fitted values against residuals, we get this. Is this random? Well, the coefficient of correlation is now about -1 × 10^-16, which is very, very close to zero.
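A sketch of visualizing the fit and its residuals, continuing from the lm() fit above:

```R
plot(dat$heights, dat$weights)
abline(fit, col = "red")         # the regression line

pred <- fitted(fit)              # idealized values, on the line
res  <- resid(fit)               # observed minus predicted values

# red segments from each predicted value to the observed value
segments(dat$heights, pred, dat$heights, dat$weights, col = "red")

# if the line explains the trend, no structure remains in the residuals
plot(pred, res)
cor(pred, res)                   # practically zero
```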
There are two other concepts that apply here: prediction limits and confidence limits. Prediction limits give boundaries on future observations; they characterize how well the model is expected to accommodate new data. Basically, the prediction limits tell us where we would expect a new point, one that we haven't observed yet, to fall. The confidence limits give us bounds on the model itself: since our data is noisy, we might expect that our linear model could also lie slightly differently, moved up or down, or twisted one way or the other, and the confidence limits are the bounds within which the model could lie.

We'll illustrate this. These points give us the fitted value and the lower and upper bounds of the confidence limits, in the order of the x values in the data; the values are something like 60, 67, 92, 76, so they basically jump up and down. In order to plot them as a line, we first need to sort them, and for that we use order() (you'll see this in the sketch a little further down).

To recap the residual: if we take an x value, multiply it by 42-point-something, and subtract 2.9, that is the prediction: where the value would fall if the linear model were exactly true. But our observation actually lies somewhere different, and the difference between the predicted value, i.e. the point on the regression line, and the actually observed value, that is the residual. It is the length of those red segments. Now, if all of the functional relationship between our data points has been explained well enough by the linear model, then the only thing the model does not explain is random noise, and random noise should have a correlation around zero. So we expect the residuals to have a correlation of zero, and as we saw, that is actually the case. If the correlation is not zero, and some residual correlation remains, it means we have explained only part of the correlation in the data with the linear model. We might need a second model to explain more; for example, if the data is exponential, there will be one best line, then a second-best line, and you can add more and more lines and the remainder will become more and more random, but you will never explain all of the trend away.

(Question: kind of related to that. Here you start with the assumption of a linear model, and you check whether you have minimized the residuals. But would you, alternatively or additionally, try for example an exponential model, and then look at the residuals there to see whether your assumption of a linear regression was correct, as opposed to, say, the exponential being the better choice if it gives reduced residuals?) Well, that's really very difficult, because the linear model is basically the smallest possible model that we can apply, and Occam's razor tells us that that is a good idea. As for alternative functional forms: there are arbitrarily many we could use. We can use hypergeometric functions, exponentials, polynomial functions, trigonometric functions; there are very large books that contain all the possible functions. If we tried all of them, we would find one with an almost perfect correlation, simply because there are so many. So how would that one then be the correct one? Exactly. So how do you decide when to stop? If there is an obvious trend in the scatter plot, something you might try, like an exponential, then you try that, and if it fits very well, you can argue that it is a valid model.
You don't know whether it is mechanistically correct, but at least it explains your data. In a sense, that is what statisticians mean by modeling data: not knowing anything about the mechanism, but explaining the data with a functional form, such that the residuals against the fit of that functional form become very small. On the other hand, you might have an idea about the mechanism. You might know that this is, say, gene expression, which you can model according to an impulse model for which there is published data on what that impulse model should look like; or that this is some kind of logistic function that varies in a sigmoidal fashion between zero and one. If you have an idea about the mechanism, or about how the data should present itself, then by all means do a nonlinear least-squares fit and try to represent the data that way; we'll look briefly at nonlinear least-squares fits later on. In practice, what we do in principle is use some of the standard models, like linear or exponential, where we have reason to believe that the data actually behaves in that way because of the biological context that we know our data comes from.

Given the script that we have, there is something that is very easy for you to try: just change the value in the script and run this with 5 or with 500 data points. Fundamentally, what you'll see is that if the number of data points becomes very small, our fit becomes very uncertain; as it becomes very large, the fit becomes more and more accurate. With something like a thousand points, I would virtually guarantee that we find the correct parameters.

So, we were looking at prediction intervals and confidence intervals. Usually, when you see scatter plots published, you only see the red line, the regression line; but the information in these dotted interval lines is really important as well. If you do it right and add them to your regression plot, your referees will be very grateful, because they will know they are dealing with an author who is careful about their data and gives readers an idea of how convincing the results are. Essentially, the line is the regression line; the inner boundaries are roughly where the regression line could move to (I think it's 95%: in 95% of realizations of the randomness, the line would lie within this region); and the outer lines are the boundaries within which we would expect new, as yet unobserved points, added to the data set, to lie, given the parameters we have already seen. This is the proper way to plot a linear regression: the line, plus the possible range of the model. And yes, it is the 95% confidence interval that we're plotting here; predict() takes a parameter for which confidence level you want to choose, and the default is 95%.
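A sketch of adding both sets of limits to the plot, continuing from the fit above. Because predict() returns one row per observation, in the original order of the data, we sort by the x values before drawing lines:

```R
o  <- order(dat$heights)                      # index that sorts the x values
pc <- predict(fit, interval = "confidence")   # where the line could lie
pp <- predict(fit, interval = "prediction")   # where new points could fall
# (R warns that prediction intervals computed on the training data refer
#  to future responses; that is exactly how we use them here)

plot(dat$heights, dat$weights)
abline(fit, col = "red")                      # the regression line
lines(dat$heights[o], pc[o, "lwr"], lty = 2)  # inner, confidence bounds
lines(dat$heights[o], pc[o, "upr"], lty = 2)
lines(dat$heights[o], pp[o, "lwr"], lty = 3)  # outer, prediction bounds
lines(dat$heights[o], pp[o, "upr"], lty = 3)

# predict(..., level = 0.99) would give 99% limits instead of the default 95%
```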
Now, I don't want to go into nonlinear least-squares fits immediately. There are about a thousand lines of code here, by the way, so we'll only get through about 800 of them before 5:00. Instead, I would like you to try some regression tasks, so we'll write up a task: I'd like you to try a scatter plot and a linear least-squares fit of expression data from our LPS analysis. Which cell types would you expect to behave similarly in response to LPS? We need a pair of cell types that we expect to do roughly the same thing in response to the stimulation. So let's formulate a hypothesis. In our LPS data, what were the macrophages called, anyway? Do we have macrophages? MF. Yes, MF. Hypothesis: MF (macrophages) and MO (monocytes) ought to respond similarly to LPS. If so, the LPS-minus-control data should be highly correlated. Is that the case? Plot the scatter plot for this, and assess whether there is a linear correlation. For 20 marks. (Do you still wake up from time to time and think: oh my god, I haven't done my homework?)

All right, so what do we do about that? We don't just start writing code in a little task like this; the first step is to understand what we are doing and to break it down, step by step by step. So how do we break this down? What is the first step? Okay: calculate the stimulation values and assign them to some variable; plot; calculate the correlation; calculate the linear model; plot an abline() of the linear model; and add the confidence and prediction intervals to make the plot nice. It's pretty much all laid out here. The thing with ordering the values might not be so obvious, but you should be able to copy the code, plug in the variables you have, and arrive at the result.

And then, this is written way down at the bottom of the whiteboard, which you can't see because it's hidden behind the screen and I'll be standing in front of it, so you'll have to engrave it into your heart instead: interpret. Before you interpret your data, your analysis is not complete. Just spitting out a few numbers without writing into your journal what these numbers mean: you are not done. If you don't interpret your data, you are performing cargo-cult bioinformatics, not the real thing; and cargo-cult science can't lead to anything good, because it's just numbers without thinking about what the numbers mean. So: interpret.

All right, write some code for that. If you're done, put up a blue Post-it. If you don't even know how to begin, put up a red Post-it right now; and if you know how to begin and then get stuck, put up a red Post-it at that point. Good, I see a few red Post-its already. That's great; that's what we are here for.
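If you want a starting point, here is a sketch of the task. It assumes the LPS data lives in a data frame with one stimulated and one control column per cell type; the object name LPSdat and all column names here are hypothetical placeholders for whatever the actual data set uses:

```R
dMF <- LPSdat$MF.LPS - LPSdat$MF.ctrl   # stimulation values, macrophages
dMO <- LPSdat$MO.LPS - LPSdat$MO.ctrl   # stimulation values, monocytes

plot(dMF, dMO)                          # one point per gene
cor(dMF, dMO)                           # is there a linear correlation?

fit <- lm(dMO ~ dMF)
abline(fit, col = "red")                # the regression line
```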
[One-on-one discussion while the class works on the task.]

Okay, Tracy just raised an excellent point: what's in a name? That which we call a rose by any other name would smell as sweet. How this applies here: with one call we're calculating confidence limits, and with the other we're calculating prediction limits. If I label them PP and PC the wrong way around, that doesn't mean the values are somehow different; it just means that I messed up labeling them. So this one should be PC and this one should be PP. Thank you, Tracy. It doesn't really matter what you call them; you might just confuse yourself when you plot them. So: confidence, and prediction.

If you're done with that and you want to explore some more, my resident expert suggests comparing NK cells against dendritic cells, because they have quite different functions. NK and pDCs: see whether that looks the same or different.

[Further one-on-one help; the recurring point is to compute stimulated minus control for each cell type first, and then plot one difference against the other.]
What I've seen across the room is that you're doing really well. Most of you were able to do the plot, calculate the regression line, and calculate correlations. The correlation is around 0.6 for the macrophages and monocytes. I've seen correlations around 0.57 for the NK cells and pDCs; that is slightly less, but not by much, and I don't know whether it is significantly less. It would be expected to be slightly less if the cells are different. So what does that mean? How do you interpret this? There is a strong linear correlation. Sarah calculated the residuals, and the correlation of the residuals is almost perfectly zero: a very strong linear correlation which explains the relationship between the stimulation values of these different cell types very well. So what does that mean? Biology now, not mathematics. The two different types of cells are responding similarly to the change in condition. And what does the response mean, in principle? Let's break that down. A single row, a single data point, means what? A change in gene expression, for a single gene. So: we have a similar global expression response. Exactly, a similar global expression response across the roughly 1,300 genes that we're looking at here, for stimulation with LPS.

Is that something we'd expect? (Student: I don't work in this field, but they all seem to have the same receptor, a TLR, TLR4 or something like that.) I don't know whether that's the case. But remember the distributions we saw of what these stimulation values look like in the first place: most of them sit around a mean, with simply stochastic variation. That means most of these genes don't respond to LPS at all, and that stays the same whatever the cell types are. So I would actually expect differential effects only in the small subset of genes that are, in fact, differentially expressed. I'm not very surprised that we see a good global correlation, though I didn't know in advance how this would turn out, because these are all just human cells and they behave in similar ways. But there might be differences if I could find out which genes are actually significantly differentially expressed; perhaps I would see more differences there. So that's where I would look next in this interpretation. Daniel, am I saying something wrong? You're looking critical, and if I have a biology expert in the room, I'm always super careful about what I say. (Daniel: we are not commenting on how many genes are expressed; we're just comparing whether the same genes are over- and under-expressed in the two cell types, right?) Right. It doesn't necessarily mean that the absolute values of the expression changes are the same; it just means that if I pick a random gene, it responds in the same way in macrophages as in monocytes. There could also be global, non-specific expression-change effects at play, and those would show up here too. So that would be my next question to the data: am I just seeing general biology here, where
the correlation is high because they're all human cells, all blood cells? Or does it actually mean they respond to LPS in the same way? If I want to ask whether they respond to LPS in the same way, I need to stratify my data: distinguish the genes that actually show expression changes from the ones that don't. (Student: so it doesn't by itself mean that they respond to LPS in the same way; a correlation of about 0.6 across all genes is consistent with the idea that most genes, on average, do not respond to LPS at all.) Exactly; the fact that we see correlations around 0.6 between all the combinations is consistent with that. (Student: so would we get to the point where we ask which genes actually are responding?) In principle, you already have the tools for that. A subset analysis would pick out the subsets of genes that respond strongly. You would take, say, the strongest responders in one cell type, and the strongest responders in the other cell type, form the union, and then do the same kind of analysis on that subset separately; there we would expect to see a very significant difference. That is where I would go next: you see something in your data, and you ask, is this in some way significant? Are the genes we're looking at even responding at all?

[Further one-on-one discussion while students explore the data.]

So, that's nice. We've seen that we can apply statistical models, like linear correlations, to data, and that we can start exploring data with these models. What I want to briefly discuss now is the idea that we can use such models as descriptors of our data.
Rather than looking at raw expression values, we can start correlating the data with a model, and then start subsetting our data by how well each profile fits a particular model. I'll take you through a little bit of the code. This takes us toward nonlinear least-squares fits, and the code is based on yeast gene expression profiles from cell-cycle expression data. The data set we're working with here is derived from GSE3635, from the GEO database, published in 2006: a Saccharomyces cerevisiae alpha-factor cell-cycle experiment. Cells were synchronized with alpha factor and then sampled every 10 minutes across two cell cycles; a total of 13 samples were analyzed (Pramila et al.). So this is a yeast cell-cycle expression data set across two cell cycles, and of course, if you produce a cell-cycle expression data set, you'd like to find genes that are actually cyclically expressed, i.e. genes that correspond in some way to the cell cycle.

There are two data objects, which I have pre-prepared from the GEO data and from gene data that I downloaded from SGD, the Saccharomyces Genome Database. One is YGprofiles, the yeast gene expression profiles; the other is YGdata, the yeast gene annotations. YGprofiles is derived from the actual expression values. There are 6,228 rows in that data, in principle one for every yeast gene, and 25 columns: T0, T5 minutes, T10 minutes, and so on, for the entire time course. So these are not categorically different expression values like the ones we had in our LPS data set; these are expression profiles along a timeline, and each row corresponds to the variation of expression of one gene across two cell cycles.

YGdata is a data set with annotations. What I have put in there is the SGD ID for the gene; the systematic name, which is important because it is what we can cross-reference with YGprofiles (the profiles come down from the GEO database with systematic names); the standard name for the gene; the alias; and a description. When I see a yeast systematic name, it tells me nothing; a yeast gene name sometimes tells me a little more. So for every one of the systematic names, we hope to have an explanation in our YGdata file. FUN14, for example, has an alias of MCP3; it is an integral mitochondrial outer membrane protein, a dosage suppressor of an MDM10 null, and so on; there's a lengthy description. These things are supremely useful. If you do exploratory data analysis and you find clusters or dots or whatever, the next thing you always ask yourself is: well, what are these? If you're analyzing genes and your analysis just spits out row numbers of a matrix, that's not going to tell you a whole lot. You need some kind of annotation prepared that you can correlate with your results, so you can figure out that, say, seven of these ten genes are mitochondrial genes; ah, that's something significant. The annotation part is important, and this is why we were harping on data annotation so much in the first part of this workshop.

So, this is one of these expression data sets. I'm randomly picking out one of the profiles; this is its expression profile, and its row name is YBL005W. I have arranged the two objects so that the row names correspond to each other; you don't need to cross-reference them by the actual name or identifier.
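Concretely, the row matching means lookups are just indexing; a tiny sketch (the row index is arbitrary):

```R
YGprofiles[123, ]   # expression values T0 ... T120 for one gene
YGdata[123, ]       # annotation for the same gene: SGD ID, systematic
                    # name, standard name, alias, description
```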
The gene for which we have an expression profile in row 123 of YGprofiles is the gene for which we have annotations in row 123 of YGdata. Now, in this case, using correlation analysis to explore a data set, being interested in genes with a certain behavior, for example cyclical expression, how could we find them? We could define a model for our expression profiles and then search for genes that correlate with this model. So we take all these expression profiles, and rather than doing something purely mathematical, we say: we have a biological model. These are two cell cycles; if a gene is cyclically expressed, it should correlate with a cosine (or sine) function of two waves. And that is something we can just compute. If we build t as a sequence covering our time points, going from 0 to 120 by 5 minutes, we get the 25 time-point values, and then we can calculate a model, for example the cosine of t divided by 60, times 2π, and look at what that looks like. This is a completely synthetic model built from first principles, where we say: if this is a cyclically expressed gene, I would expect it to vary in this way. Can we use this as a fishing hook to pull out genes from our soup that actually look like that?

Now we can easily calculate the correlation of a real expression profile with our synthetic profile. For example, take row 815: this is what row 815 looks like, and we can calculate the correlation, which is minus 0.9. Ooh, that's a very strong correlation; a very strong anti-correlation, I should say. Or we look at row 5571, plot that, and calculate the correlation: plus 0.86, very highly correlated. So with this simple correlation analysis, I've discovered two cyclically expressed genes without doing anything else, simply by defining a model that is able to filter out the profiles that look like it. Note that the amplitude of our synthetic profile does not matter, because the correlation does not depend on the scale factor of the linear relationship.

If we want to calculate correlations for all profiles, we build ourselves a little vector for the correlations, with space for the value we derive from every single row, and then we write a little for loop: for every one of the 6,000-odd rows, we calculate the coefficient of correlation and put it into the vector. Done. Quite fast. No, actually, I hadn't even run it yet; here, there we go. Still done. So what do we get? This is the histogram of correlations: the bulk of them is around 0, but quite a few are very significantly correlated, and others very significantly anti-correlated. If we want to plot the 10 highest correlations and the 10 strongest anti-correlations, that's very easy to do: we build two selection vectors, ordering the correlations in a decreasing and in an increasing sense, as we've done before, pull out the 10 largest values of either kind, and list the genes we get. These are the 10 most highly correlated ones. I don't actually recognize any of them; SPO12 would be a sporulation gene, perhaps something we could expect somewhere in the cell cycle. Among the strongest anti-correlations, mannosyltransferases have functions in building cell walls, so again, that's something I would expect to be switched on and off during the cell cycle.
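A sketch of that model-based fishing, under the assumption that YGprofiles holds one 25-point profile per row:

```R
t <- seq(0, 120, by = 5)               # 25 time points, in minutes
myModel <- cos(t / 60 * 2 * pi)        # two full waves over two cell cycles
plot(t, myModel, type = "l")

# correlate one real profile with the synthetic model
cor(as.numeric(YGprofiles[815, ]), myModel)

# correlate every profile with the model
myCors <- numeric(nrow(YGprofiles))
for (i in seq_len(nrow(YGprofiles))) {
  myCors[i] <- cor(as.numeric(YGprofiles[i, ]), myModel)
}
hist(myCors)                           # bulk near 0, with strongly
                                       # correlated and anti-correlated tails

sel1 <- order(myCors, decreasing = TRUE)[1:10]    # 10 highest correlations
sel2 <- order(myCors, decreasing = FALSE)[1:10]   # 10 strongest anti-correlations
```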
And we can plot the expression profiles and see that indeed, well, as expected: we find genes that are cyclically expressed by fishing them out with that approach. So model-based correlations are easy to compute, and of course they find the profiles that correlate with the model. However, we would not find genes that are phase-shifted, because those have near-zero correlations with our model. If one of our hits is a transcription factor, we wouldn't find that transcription factor's targets, because they are expressed after the peaks of the factor itself. So let's shift the model by 15 minutes, for example, and plot both; these are the shifted model and the original model, and this is one expression profile. If we calculate the correlations, it's actually the other way around from what you might guess: one of the correlations is 0.18, and the other is minus 0.03, i.e. virtually nothing. So the result depends on phase, which means we could build a family of such shifted models and search over all of them. But we could also do something different: we could take every row of expression profiles and fit the parameters of a model to it. If we're looking for cyclically expressed genes, let's take a cosine model with parameters for amplitude, phase... yes?

(Question: in a way, you're going into this analysis with a certain assumption or bias; you postulate a model and then fish for genes that fit your model. If I went into this with no clue that there was a certain cyclicality to these genes, how would I analyze the data to figure out what patterns exist? Would I have to go through all of the thousands of rows one by one?) When we talk about dimension reduction tomorrow, we'll revisit that. In principle, with principal component analysis we can find patterns that correspond to the strongest variances in a data set, so you can pull out the prototypically varying genes in some sense, then just look at the curves and see what functional forms you have. You'd probably see curves that are just generally decaying or increasing in expression, and other curves that are actually cyclic. What we're doing here is essentially picking the low-hanging fruit: we go into the analysis with the question, will we find cyclically expressed genes? A cyclically expressed gene, by definition, should be similar to a sine or cosine wave, so we can use that as a model and fish with it. Does that mean we'll find everything that's interesting? No.

(Question: didn't you also fix a certain periodicity?) Yes; because the description of the GSE data set tells me these are two cell cycles, I subdivided the time axis such that I have two waves. But once again, that is something I can also fit: with a nonlinear least-squares fit, I can fit amplitude, phase, and frequency. If I add these three parameters to a model description, I can go into every single profile and get the best fit for amplitude, phase, and frequency, store those values, and that gives me a new description of my data: these parameters.
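A sketch of such a fit for a single profile with nls(); the exact parameterization and the start values are my assumptions, and nls() will fail to converge if the start values are too far from the truth:

```R
y   <- as.numeric(YGprofiles[815, ])   # one expression profile
fit <- nls(y ~ A * cos(2 * pi * f * t / 120 + phi),
           start = list(A = 1, f = 2, phi = 0))
coef(fit)   # estimated amplitude, frequency (cycles per 120 min), and phase
```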
Then I can ask: which genes have frequencies close to 2? Which have frequencies close to 1? Are there any that skip a cell cycle? Which genes have phases such that they are high initially and low later? Which genes have phases such that they switch on at some particular point in the cell cycle? At that point, I'm not looking at the raw data anymore; I'm looking at a higher level, at an interpretation of the data, and I'm doing my exploratory data analysis on the parameters of a curve, parameters that I get from nonlinear least squares. Yes?

(Question: are you also able to do this with factor variables? Say you don't have a time course, but two conditions, and you want all the genes that are most correlated with one condition versus the other.) Yeah, of course. I can use categorical variables as well, and I could include them in such a model. In the something like 900 lines of code that follow here, that's exactly what we do. In the final two minutes of today, I'm not going to go through 900 lines of code; I'll just leave this as something for you to explore. I think it should be reasonably self-explanatory from the code that I've supplied, so just roll with it and try to understand what's going on at some point in time. I've explained the principle, though: if we explore our data with regression analysis, we're not confined to particular models. We can generalize the whole idea of regression analysis and say: for any particular model that we can define, let's derive how well each profile fits it and what parameters we get, and then do our filtering and selection on the model parameters, rather than on the raw data. Basically, we're building a computational telescope: we transform the data in a way that makes the things we're looking for stand out more clearly.