In the general flow of things, we started by looking at the data itself, then we plotted the data along many different dimensions and in different visualizations, then we analyzed features in our data with dimension reduction, and finally we used similar techniques to cluster our data into groups. The final unit for today concerns something that is really no longer exploratory data analysis, but moves into the field of statistics that asks about significance, i.e. hypothesis testing. Once we make a hypothesis about our data, the question becomes: what is the likelihood that the hypothesis is actually true, that it makes sense? Or in other words, that something we think we've discovered about our data is not simply due to random chance, to random fluctuations in measurements, to spurious correlations that are there sometimes and sometimes are not, and we just happened to be there on a day when the gods of statistics were rolling the dice in a particular way. That's the domain of hypothesis testing. The Greek vase theme for this unit is Oedipus pondering the riddle of the Sphinx. So what I'd like to discuss is the principal idea behind a statistical test. What is a statistical test in the first place? What does it mean? What are we trying to define here? What do the concepts of true and false positives and true and false negatives even mean? What's a p-value? What is significance? Then we'll start applying simple parametric and non-parametric tests to the data we've been looking at and talk about how to interpret the results. And we'll also have to talk a little about multiple testing, because in the age of high-throughput biology we don't run a test once. If we make a significance test on results from a high-throughput gene analysis, in effect we are repeating the test 20,000 times, or however many genes are represented. We'll have to talk about what to do about such multiple-testing problems in the context of expression analysis. So once we have a statistical model that describes our data, we can explore data points with reference to the model. We typically ask questions such as: is a particular sample part of the distribution, or is it an outlier? Or: could two sets of samples have been drawn from the same distribution, or did they come from different distributions? That's the typical question we would have if we have two clusters. Are these two clusters actually sets that contain the same class of elements from our underlying population? Or are they truly different? And in what way are they different? And how different are they in the first place? What does a difference mean? If we just calculate a number, say the average distance between the cluster centroids, and that number turns out to be 25.7, that really doesn't tell us anything. We have to have a way of systematically, rigorously comparing the number that we get with the number that we would expect. So this is confirmatory data analysis, in contrast to exploratory data analysis. Fundamental to all of this is the idea of a null hypothesis and an alternative hypothesis. The null hypothesis, often called H0, simply states that nothing of consequence is apparent in the data distribution. Nothing happens. All of our data elements are drawn from the same distribution.
The distribution might not be completely uniform, but there's no reason why we would claim that one element should be treated preferentially over another. This entirely corresponds to our expectation. Boring. Nothing goes on. We learn nothing new. The alternative hypothesis states that some effect is apparent and the data is different: we expected one thing and saw something else instead. That's cool. That's what we're looking for in science. We don't try to confirm the same old thing over and over and over again. We try to find new things that we can publish in Nature or Cell or the Journal of Irreproducible Results. So we need to account for something new. Not in all cases will this result in a new model, but a new model always begins with the observation that the old model is inadequate. But you see, here's a catch. The alternative hypothesis as a model may be wrong, so we don't actually know what the correct alternative hypothesis is, while the statistical test can tell us very confidently that the null hypothesis must be false. Simply finding that the null hypothesis is not true doesn't really tell us anything about what the correct alternative hypothesis actually is. This is mitigated a little by the fact that a statistical model is not what we normally think of as a model. As long as we have a function that seems to have the same statistical properties in its distribution as the data we have, we're happy with it as a statistical model. It doesn't need to describe a generative mechanism that produces the data. Statistical models don't need to be generative; they just need to reproduce the data. But of course you can imagine that a large class of different functions can give rise to virtually the same distribution, given some set of parameters. So that's something we can't figure out at that point. What we can figure out is whether the null hypothesis is true or not, and whether it's even worthwhile to keep looking at the data, working with it and trying to figure out what's really going on there. Just as the variety of hypotheses is large, the number of possible tests for those hypotheses is large too. The proper application of tests can be confusing, and it's easy to make mistakes and come up with misleading results. Fortunately there's a way out of that, which is often a good choice for non-statisticians, and we'll talk about it a little later on. Common types of tests include one-sample tests, where we have a single sample and we compare it to a population, for example to the mean of the population. How much of an outlier is a single sample? From that we can derive a probability with respect to our null hypothesis. So assume that we know our population is normally distributed, and we observe a single value at a deviation of two standard deviations from the mean. If we pick a data point at random from that population, the probability of a deviation from the mean that large or larger is only about 5% (two-sided; roughly 2% if we only consider one tail). That's very easy to state. It means that, if the null hypothesis were true, an observation this extreme would be expected only about 5% of the time. It doesn't tell us anything about what the alternative hypothesis actually is.
But it says: well, if we observe something like this, the probability of seeing such data under the null hypothesis becomes small. Now, 5% is not actually very small: in 20 trials we would expect one such result by chance alone. That is why the conventional cutoff below which we take such a p-value to be significant is set at that level, and often lower. Two-sample tests compare two samples with each other. We pick two samples, randomly or not, from a population and ask: are these different, i.e. were they taken from the same population or not? Compare gene A to gene B: are they co-regulated in the same way, or are they doing something different? And paired-sample tests compare pairs of observations with each other; typically we ask whether their difference is significant. A paired-sample design would be, for example, a study with patients where you try to find pairs of patients, one of whom is, say, a smoker and one a non-smoker, but they should come from the same socioeconomic background, they should be equally lean or stout, they should have approximately the same weight, and so on. The better you can match them across a variety of variables, the better it is, because then you're really only looking at one variable and you can discern its effect on whatever you're measuring with more confidence. So these are paired-sample tests. Common types of tests, as you would find them in a statistics textbook, are Z-tests, t-tests, non-parametric tests, chi-square tests, F-tests, and so on. The Z-test is the first one I mentioned: it compares a sample mean with a normal distribution. Z essentially means: how many standard deviations away from the mean are we? Normal distributions are often good in the world of biophysics, often not so good in biology. We often have much higher correlations, or edge effects, or extreme-value effects in biology, because the properties of our genes are not random: they're correlated with each other, either because they're functionally related or because they're related to each other via homology. One of the assumptions that typically leads to normal distributions is that the elements are independent, but for almost everything we look at in biology, the assumption that individual data points are independent of each other is usually not a very good one. Relaxing the requirements on the sample, we come to the t-test, which is usually the workhorse of our statistics. It compares the sample means against a t-distribution, which essentially tells us how far the averages of two populations are allowed to differ from each other before we have to conclude that one population is different from the other. It does not require us to know the true variance of the populations, and it accommodates moderately non-normal data more easily, especially if we have a large number of measurements. That's why it's such a popular test. Now, if we have no reasonable model from which we can derive a distribution for the null hypothesis, we resort to so-called non-parametric tests. Non-parametric tests essentially take the numeric values in our distribution and transform them into ranks. So instead of the values -1, 0, 1 we say: ranked lowest, ranked next lowest, ranked highest. If we then compare the properties of ranked elements among each other, this becomes independent of the actual sizes, of whether the values bunch up at some point, of whether there's a fat tail on the left-hand side and a sharp cutoff on the right-hand side or vice versa, and all of that.
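To make the one-sample example above concrete, here is a minimal R sketch; the numbers are purely illustrative and not taken from our course data. It computes the tail probability behind the two-standard-deviation example and shows the rank transformation that the non-parametric tests are built on.

```r
# Tail probability of a single observation two standard deviations above the mean,
# under a standard normal null; the numbers are illustrative only
p_one_sided <- pnorm(2, lower.tail = FALSE)   # ~0.023
p_two_sided <- 2 * p_one_sided                # ~0.046

# The rank transformation that non-parametric tests operate on
x <- c(-1, 0, 1, 100)   # hypothetical values; 100 is an extreme outlier
rank(x)                 # 1 2 3 4 - the outlier's magnitude no longer matters
```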
So these non-parametric tests are very versatile in that sense, making fewer assumptions about the data. The drawback of non-parametric tests is that you lose a little statistical power: as you remove the requirements on the behavior of the data, because those requirements might not be valid in the specific case, the data can vary in more creative ways, and thus you lose a little power, i.e. the effects need to be larger before you can conclude that the null hypothesis should be rejected. And then there are chi-square tests and ANOVA tests for other situations. Now, if we talk about hypothesis testing, we really ought to think about what hypothesis testing means. We test a hypothesis, i.e. we have some observation, we have a model of the data, and we ask about the probability that the model of the data would contain our observation. Unfortunately that's often not what people do. Often, people went to graduate school, were taught to apply t-tests, and so they keep on applying t-tests; they don't really know why, but it gives them a number, and it prints out a p-value if we do it with R, and we can plunk that into a paper and be very, very wrong about it. I've included some reading about what can go wrong in the project folder, and in particular two papers. One is by Sander Nieuwenhuis and Eric-Jan Wagenmakers: an analysis of erroneous analyses of interactions in neuroscience, a problem of significance. The situation in neuroscience papers is often this: we have one population, we look at an effect, and it's significant. Yay. Then we have another population, a control population, and under the same metric the effect is not significant. So we conclude that the control population and the test population are different, i.e. that our treatment actually has an effect. That's very compelling. You report that one effect is statistically significant whereas the other is not. Although this is superficially compelling, this type of statistical reasoning is erroneous. It's the wrong conclusion, because the difference between "significant" and "not significant" is not necessarily itself statistically significant. Assume the conventional cutoff for p-values to be considered significant is 0.05. You calculate the p-value for your controls and it comes out as 0.051: not significant, good. Then you do your experimental group and it comes out as 0.049: significant, that's good. But the difference between control and experiment is tiny. The conclusion really has to be that the data show no difference between control and experiment; it just happened that one value turned out to be slightly smaller and the other slightly larger than the p-value cutoff. Nevertheless, this very often gets reported in the literature as if it supported the conclusion that the null hypothesis is rejected. What you really have to do, when making a comparison between two effects, is report the statistical significance of the difference, and not the statistical significance of each one individually. That's nicely explained there.
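Here is a small simulated sketch of that fallacy; the group names, sample sizes and effect sizes are invented for illustration. Depending on the random seed, the first two p-values can land on opposite sides of 0.05 even though the direct comparison of the two groups is nowhere near significant, which is exactly the trap.

```r
# Hypothetical illustration: two effects measured against the same baseline,
# one crossing the 0.05 line, one not - yet the effects themselves barely differ
set.seed(42)
baseline <- rnorm(20, mean = 0.0, sd = 1)
groupA   <- rnorm(20, mean = 0.7, sd = 1)   # effect often strong enough to cross 0.05
groupB   <- rnorm(20, mean = 0.5, sd = 1)   # similar effect, often misses 0.05

t.test(groupA, baseline)$p.value   # may come out below 0.05
t.test(groupB, baseline)$p.value   # may come out above 0.05
t.test(groupA, groupB)$p.value     # the only legitimate comparison of the two effects;
                                   # typically far from significant
```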
The problem is that it's not just nicely explained: they then went and applied this to a large number of articles, did a literature analysis, and found that in a distressingly large number of papers that error was actually made, in some cases published in very high-impact journals and leading to erroneous conclusions. Mind you, an erroneous conclusion based on something like that doesn't necessarily mean the effect doesn't exist, but it means that if the effect exists, it didn't show itself in the data that was presented. So this is one of the two papers, about how to properly do these analyses of interactions. The other paper is a very unusual thing, in that the American Statistical Association made an explicit statement on the use and abuse of p-values. In this statement the ASA advises researchers to avoid drawing scientific conclusions or making policy decisions purely on the basis of p-values; that can be very misleading. The problem is that we often misrepresent a p-value as supporting an alternative hypothesis, and that's just not correct. A p-value says nothing at all about the alternative hypothesis. The only thing a p-value does is say something about how probable data like ours would be if the null hypothesis were true. So a p-value of 0.05 does not mean that there is a 95% chance that a given hypothesis is correct. Instead, it signifies that if the null hypothesis is true, and all other assumptions made are actually valid (which is another caveat), then there is a 5% chance of obtaining a result at least as extreme as the one observed. It doesn't say that the alternative hypothesis is true; it only says that the chance of observing something like this under the null hypothesis is not very large, and it certainly doesn't mean that the chance is zero. And since we are making and publishing a large number of observations, there is a very significant chance that some fraction of the time we'll just be plain wrong due to a fluke in the way the data turned out, even though it will look very convincing. There's more background in the statement on how to interpret p-values, and in particular on when and how to base policy decisions or scientific conclusions on p-values or on something else. Importantly, the recommendations say that one should always make the data accessible and make the entire analysis accessible. This is why we shouldn't work with closed statistical software and our little desk calculators on penciled spreadsheets that get inundated in the next coffee break, but with scripts, posted on GitHub or somewhere else, or at least made available for download, where everything can be reproduced. This is one of the important contributions of software like R: it makes it relatively easy to post your entire analytic pipeline in a way that everybody can reproduce. Is there any commonplace forum where the community comments on the appropriateness of the analysis? I don't know how commonplace it is, but if risk is the product of probability and damage, and the damage in this case is the conversation with the dean about having to retract the paper, I think the risk is very significant regardless of whether it's commonplace or not. It's very, very good practice too. Is this new? I wouldn't say it's very new. Greg, Lauren, what would you say?
I think it's become more and more prominent in, say, the last five years or so. I guess what I was trying to say is that recently this preprint server called bioRxiv has come out, and it has this massive number of manuscripts on there with a few comments, which is really nice; I just don't know if GitHub is also doing something like that. Well, you're not expecting comments on GitHub. But if you post your analytic code on GitHub, that means everybody can download it and reproduce the results; that's the important part. If you post something on a preprint server, you're still only explaining your methods, and you don't actually have to make your code and your data available. And if you've ever tried to reproduce somebody else's experiments from the published methods, you know about the limitations that simply posting methods brings with it. All right. So I've used the word a lot now, but what actually is a p-value, except that it's a number? We can say a p-value is: a measure of how much evidence we have against the alternative hypothesis; or the probability of making an error; or something that biologists want to be below 0.05; or the probability of observing a value as extreme or more extreme by chance alone; or all of the above. Right. You can put a star on that bar, and if it's less than 0.01 you're allowed to put two stars there, or more, and then things become totally convincing. The more stars you have, the less likely somebody is going to challenge your interpretation, until they find out that it was just your keyboard that was slightly stuck. In principle all of these statements are true, but each of them means something slightly different. A measure of how much evidence we have against the alternative hypothesis: as I said, that only holds in a world where either the null hypothesis or the alternative hypothesis must be true, because what we actually say something about is the null hypothesis. The probability of making an error: that depends on the type of error we make. There are error types that relate to what the truth is and what decision we make. The null hypothesis could be true, or the alternative hypothesis could be true out in the real world, and our decision could be to accept the null hypothesis or to reject it. If we accept the null hypothesis and the null hypothesis is actually true, all is almost well; maybe we're not going to have anything interesting to report in our paper, but at least we are speaking the truth. If the alternative hypothesis is true, the null hypothesis is not, and we reject the null hypothesis, all is also well, and we might be able to report something of interest to the community. But we can also accept the null hypothesis and be wrong, or reject the null hypothesis and be wrong. Very often you will find statisticians calling these type 1 and type 2 errors. Oh, if we could only eradicate this; it gives rise to so much confusion. I don't know if it's just my very small brain, but I can never remember which is the type 1 error and which is the type 2 error, and whenever I encounter them I have to go and look it up (for the record: a type 1 error is a false positive, a type 2 error a false negative), and I'm very frustrated that people force me to do so. It's a meaningless label, and we shouldn't be applying meaningless labels when there are perfectly clear alternatives, like false positive and false negative. So if I can plead for anything, it is: use the words false positive and false negative, not type 1 and type 2 error.
And what does that mean? Well, a false negative means our experiment gave a negative result, but we were wrong about it: it should have been a positive result. False negative is very clear. False positive means our experiment gave a positive result, but again we were wrong about it: the positive result was our interpretation, but in reality the null hypothesis is true. Simply by looking at the words, you immediately know what we're talking about. Use your brain for intelligent things and not to remember trivia, unless you are at a trivia challenge night. Now, something that biologists want to be below 0.05: where does that come from? Where does the 0.05 come from? This is a cutoff for significance. A p-value is something that we can calculate from our data: we can look at the distribution of the data and, making some assumptions, calculate the probability that a particular observation was part of the population that gave rise to that distribution. That's very straightforward: calculating a p-value is calculating a probability. We cannot calculate significance, because significance is not a meaningful mathematical concept; it's a cultural concept. Significance is the cutoff we set to say that, overall, on the face of all the evidence, we are going to accept or reject the null hypothesis. If we accept the null hypothesis, then something is not significant. By convention in the biomedical literature that cutoff is often 0.05. There's absolutely no reason at all why it shouldn't be 0.037893 or 0.049217; all of these numbers are just arbitrary cutoffs. And one could really argue that in the times of high-throughput experiments, with large numbers of experiments and data points, 0.05, i.e. allowing one error in 20 experiments, is actually a pretty large error rate to accept; we should be more stringent than that. Just because we're allowed to put a star on the bar doesn't mean we should be happy with one star and go home. But often we are. So the 0.05 significance level is a cultural convention, and it's essentially meaningless outside of the context of whatever group of peers you are communicating with about how strong or how weak your results are. And the last one, the probability of observing a value as extreme or more extreme by chance alone: that's actually a really important concept that is often not correctly understood. I think I have a diagram here with which I can illustrate it; it's meant for something else, but this is what it's about. Assume we have an underlying population that is normally distributed, and now we observe a value of 2. How does that value of 2 relate to a p-value? Does the p-value mean the probability of observing 2? Assume this is a normal distribution: what is the probability of observing exactly 2? Can you estimate it from here? Zero. Any dissenting voice? Shouldn't it be something like 5 or 10% for 2 or greater, and so on? 5%? So Muhammad is right: the probability of observing exactly 2 is 0. And that's actually interesting, because didn't we learn at some point that a probability of 0 means something is impossible? So does that mean it's impossible to observe 2? And yet we do observe it. The issue here is that any exact number has a vanishingly small probability of being observed, i.e. one exact number has no length on the number line.
And the probability of observing something from a distribution is related to the area under that distribution in the range which we are observing. The area above a single number is 0, and therefore the probability of observing exactly that number is also 0 in the limit. So the probability of observing 0 and of observing 2 is the same? The probability of observing exactly 0 and exactly 2 is the same in the sense that 0 equals 0. The probability of observing something in a small range around 0 and in a small range around 2, that is of course very different. But the probability of observing one particular number is in fact 0, and that's what we have to keep in mind. A single observation of 2 cannot reasonably be interpreted as a probability; we mean something else when we look at a p-value. We say: well, the probability of anything larger than our observation is even smaller, so we will take this observation as the lower bound of extreme observations, and say that the probability we associate with observing a value of 2 is the area of our distribution above 2, divided by the area of the whole thing. The area of the normal distribution above two standard deviations is about 2.3% of the whole, so the p-value associated with the value of 2 is about 2.3% (or roughly 5% if we count both tails). That's important to keep in mind: it doesn't make sense to speak of the probability of a single observation here, or even of a group of observations. What we're always talking about is the probability of that value or something more extreme, relative to the rest of the distribution. Now, if that's the case, it actually gives us a very convenient way of looking at statistical tests, because what a statistical test really does is take that idea and formulate it in terms of an analytical solution, i.e. a mathematical formula that will give us a p-value if we plug in a series of values and observations. Naturally, if we build such formulas, we need to make a number of assumptions: for example, that the underlying distribution is normal, or an extreme-value distribution, or left- or right-censored, or that we're looking at differences of means versus individual differences from the mean, and so on. Each of these leads to another type of measure, and another word for such a measure is a statistic. It leads to a different statistic that we then apply in order to say: what is the distribution, and where are we with respect to that distribution? An alternative way to think about this, however, is not to use analytical approaches at all but to use numerical approaches, i.e. if we can build a simulation that generates data points according to this distribution, then instead of integrating over a formula analytically, we can simply count how many of our outcomes lie above and below the boundary we're interested in. Such simulations are quite possible. Here's one way to build such a simulation, i.e. a distribution of points that corresponds to an underlying function. We define the probability density function we are interested in, and then there is an easy procedure to draw from this distribution in an arbitrary way. If this is the normal distribution, a procedure like this will give us results that look like the output of rnorm(), i.e. random deviates from the normal distribution. What we do is randomly throw in points from a uniform distribution on the y-axis and on the x-axis, and then ask: do these points lie inside the curve or outside the curve?
If they're outside the curve, we reject them. If they're inside the curve, we accept them. If we do that a lot, we get a histogram that corresponds to that distribution, but even if we do it only once, we come up with a point that is a valid random sample from whatever generating function we have here. Now, there are a number of probability distributions already built in which we can use, the binomial and the normal distribution and the uniform distribution and the t-distribution and so on, but this is a completely general way: as soon as we can define something in the form of a probability density, we can use this algorithmic approach to generate random samples from it. And we can then use approaches like these for counting statistics, i.e. simply simulate and then ask, how often does it happen that one of the green points we drew is larger or smaller than two? Simply by counting. We're not constrained to have to integrate over the area under a curve; we might not even know very well what that curve looks like, because the curve itself might not be a formula in the common sense but the result of a complicated decision tree, i.e. it's a high value if we have a particular mutation in the mitochondrion, and environmental stress, and the sample was taken from South America. Whatever. Things that become extremely difficult to formulate in terms of a mathematical theorem become easy to simulate. This is perfectly rigorous; this is perfectly valid statistics. It just doesn't work with analytical methods, but it works from the domain expertise that you have as researchers. And that's the key here. In order to do the correct statistical test, you need a lot of statistical knowledge; anything but the most trivial tests I would advise you not to try at home, but to talk to your resident statistician, because the ways you can go wrong are just too numerous. The simulation approach is a simple alternative that you can use, but it also has its problems: how do we build a valid simulation? How do we build a little function that constructs such a distribution, such a cutoff, in a reasonable, valid way? In order to do that, you need domain expertise. You need to know something about the biology. The programming and the statistics behind it are absolutely trivial; it's just counting statistics after your simulation runs and does what it is supposed to do. But the way the simulation is built needs to be done correctly in view of, and in the context of, what the biology is. And that's something that you've grown up with and that you've learned to evaluate and do right. You know a lot about getting the biology right, and you may not have, or may not want to have, the intricate knowledge of the statistical background and the manifold assumptions that such tests make and that have to be valid. Using simulation tests instead of fixed, packaged tests is a bit of a game-changer, and I think it's a good way to remove some of the excuses people make about testing the validity of their data. So the take-home message is: if you rely on the number from a statistical test, make sure that you understand how that number was generated, and ideally try to come up with a simulation that produces the same number based on assumptions about your data that you know to be true. And then I mentioned multiple testing at the beginning. So what is the problem with multiple testing?
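Before we get to that, here is a minimal sketch of the rejection-sampling and counting approach just described, assuming a standard normal target density and a threshold of 2; in practice the density would be replaced by whatever simulation encodes your domain knowledge.

```r
# Minimal sketch of rejection sampling from an arbitrary density, followed by a
# counting-based p-value for observing a value >= 2; dnorm and the threshold
# are placeholders
set.seed(112358)
dens <- dnorm
xlim <- c(-4, 4)
ymax <- dens(0)                        # upper bound of the density on this interval

sample_one <- function() {
  repeat {
    x <- runif(1, xlim[1], xlim[2])    # throw a point uniformly in x
    y <- runif(1, 0, ymax)             # ... and uniformly in y
    if (y <= dens(x)) return(x)        # keep it only if it falls under the curve
  }
}

samples <- replicate(10000, sample_one())
hist(samples, breaks = 50)             # approximates the target distribution
mean(samples >= 2)                     # counting estimate of P(X >= 2), ~0.023 here
```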
So what we normally do when we evaluate the significance of a result is fix the false positive error rate, for example at 0.05. We say that anything smaller than 0.05 is considered significant, accepting that we will have a certain false positive rate, and then we try to minimize the false negatives, i.e. to maximize the sensitivity of our test. This is traditional testing. But what happens if we perform many tests at once? Does this affect the false positive rate? The answer is yes, indeed. We usually look at a very large number of decisions, and this creates the multiple-testing problem, because we become more and more likely to make an error somewhere. What we really have to do in response is adjust our significance threshold. Typically this is done by controlling the so-called family-wise error rate, classically with the Bonferroni multiple-testing adjustment, where the adjusted p-value is simply n times the old p-value, or equivalently, we divide the significance level by the number of tests. If we have a cutoff of 0.05 and we do 10 tests, we now require a cutoff of 0.005. The problem is that the number of tests or experiments we can run is very large. This is rigorously correct, but the more data we collect, the harder it is for any single observation to appear significant. So now you are in a situation where you have data that supports, at say a 1% level, that there's an effect, and that's good and true, it really is an effect. But you do more and more testing of other data that's not even related, and then, according to the required testing correction, your 1% level is no longer adequate and you are effectively forced to throw out the actually correct observation. That's a paradox: the more testing you do, the more difficult it becomes for any of these results to be significant. We need approaches to address that paradox, and what Benjamini and Hochberg published in 1995 was a solution where they said: let's not worry about false positives across the whole family at all; let's look at something else, the false discovery rate. Basically that means we accept that we will make mistakes, but we can live with a certain fraction of mistakes. We don't claim to be absolutely correct, but once we give our results to our graduate students to follow up on, with RNA-seq experiments or quantitative PCR, we would like no more than, say, half of these candidate genes to turn out to be red herrings. That's the false discovery rate. The way this works is that we order the p-values from smallest to largest and find the largest rank k for which the k-th p-value is still below k/n times our allowable error rate alpha; then the false discovery rate for the genes 1 to k is controlled at alpha. Typically we get one to almost two orders of magnitude more accepted observations this way, without dramatically increasing the number of false positives among them. A catch with this is that the hypotheses need to be independent, and as I've mentioned several times now, that is often a poor assumption. Still, this FDR correction, the false discovery rate correction, is the standard of what we do in microarray experiments, or any kind of high-throughput experiment, these days. Now, let's try some of these tests to figure out whether they're actually easy to use and what they look like in practice.
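Before we do, a quick aside: both corrections just discussed are available directly in base R through p.adjust(); the raw p-values below are invented purely for illustration.

```r
# Both adjustments are available through p.adjust() in base R;
# these raw p-values are made up for illustration
p_raw <- c(0.0004, 0.0019, 0.006, 0.011, 0.041, 0.049, 0.20, 0.63)

p.adjust(p_raw, method = "bonferroni")  # family-wise error rate: multiply by n, cap at 1
p.adjust(p_raw, method = "BH")          # Benjamini-Hochberg false discovery rate
```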
So if you could perhaps load this last REDA working unit, the hypothesis-testing unit, from GitHub. Since this is now a new project, all of your previous data is gone, and since we're actually working with the same data set, GSE26922, we need to reproduce it. So if you execute this code: reload the libraries that you've already installed, reload the REDA clustering GSE26922 RData file, and then go through the steps to recreate the top table and attach the NCBI platform annotations. That's all very quick. Can I ask you to save the project? No. Is there an error? Wait. Just the URL you put up on the screen has .R on the end. Oh, the project is not .R; .R is the file that we're working with. Okay, if that doesn't work for you, post it and we'll fix it. So let's look at simple t-tests, two-sample t-tests. In a two-sample t-test we take multiple observations of two samples and ask: are the means of the observations different from each other? In terms of our genes here, the question would be: if we have two conditions and a gene looks different in terms of its expression values, is it actually different, or do we just have random fluctuations of individual expression values that create this appearance? So here we do a t-test of a single gene for differential expression, and remember that these are triplicates: the first sample is characterized by the three replicates at t0, and the second sample comprises all the other columns lumped together. That's differential expression, i.e. our control values are the t0 values and what we compare them against are all the non-t0 values: is there a difference between the blocked state, the control state, and everything else that happens? Do we need to go through all of these earlier steps first, or do you just need the data? Yeah, just go through all of these steps until line 95; if you select it all, it should run through without error. At least it does for me; I hope it does the same for you. So the top gene in the top table has the ID 8117598, and we can find which row it is in: this is row 25803. The gene at position 250 in the top table has the ID 84946, and it's found in row 15059. So these are the rows of the two genes at the top and at position 250 of the top table. With that we can retrieve the actual log expression values and plot them. That's what they look like. This one here is the top differentially expressed gene: it has the largest difference between the control values and the actual measured values according to the analysis. And this is the lower-ranked differentially expressed gene: it has a smaller difference but is still deemed significant; if we order them, it sits at rank 250. Now, to do a sample-against-sample t-test of the later values against the t0 values, we do a t-test of the expression values of our gene in columns 1 to 3 versus columns 4 to 18, and we store that object in g1. If we do the same for the 250th gene, we store it in g250. The result object tells us what this actually did: it's the Welch two-sample t-test; our first sample is the control, our second sample is the remaining expression values. The t-statistic comes out at about minus 6, which corresponds to a p-value of about one in 100,000, so a very small p-value. The sample estimates are that the mean here is 9.3 and the mean there is 11.5. Of course the maximum is higher, and this test knows nothing about time series; it just treats all of the points in the same way.
It just forms two groups: one is the control group, and one is everything else lumped together. On the other hand, if we do this with the 250th gene, the t-statistic is about 2 and the p-value lands just on the wrong side of what we would normally call significant, and the means here are 9.44 and 9.12. Now, if we look into what's in this result object: the result of applying the t-test is a list with nine elements, the statistic, the parameter, the p-value, the confidence interval, the estimates, which alternative was used (two-sided or one-sided), what the data name was, and so on. With that we can pull the p-value out of the list: g1$p.value is the actual number we want. Using this, we can compute the p-values for all 32,321 rows of the expression matrix and see how these values are distributed overall, applying this two-sample t-test and calculating the p-value individually for every gene. We do that by first generating an empty numeric vector, tAll, as long as the number of rows of the expression matrix. Then we write a for loop: for i in 1 to the number of rows of the expression matrix, we do the t-test of the expression values in row i, columns 1 to 3, i.e. the t0 values, against the expression values in row i, columns 4 to 18, the treatment values. That gives us such a result object, to which we apply the dollar operator to get $p.value, which is that single number, and we assign it into slot i of tAll. When that runs we've done 32,000 t-tests; it takes a few seconds, and now it's done. tAll now contains all of the p-values, and that's great: we can look at a histogram. We can ask how the p-values are distributed, which ones are small and which ones are large, and that's the distribution here. Here's our abline at 0.05, and our range goes from a p-value as small as 2.55 × 10^-13 to a p-value that is almost exactly 1; we have everything in that range. Now, who can interpret the histogram? What does one of the bars mean? What are we looking at? The number of genes that have that p-value when applying this two-sample t-test comparing the t0 columns to all the others. So how many do we seem to have that are significant at that level? About 7,000. I think 7,000 is a good guess; we can of course count them, and we'll do that in a second. But once again: we did 32,000 t-tests on differential expression, and simply applying a significance level of 0.05 tells us that approximately a quarter of these genes are significantly differentially expressed. I think it's a bit of a stretch to imagine that that's actually true. So how many rows have a p-value of less than 0.05? We can do this here and come up with the number 7,716. Ooh, that's crafty code. How does this even work? What are we doing here? The sum of "tAll less than 0.05": why does that give us a reasonable number? This is a comparison; what's the result of that comparison? True or false. So this gives us a vector with 32,000 elements of TRUE or FALSE, so why can we sum over that? Exactly: it counts the TRUEs. And why does it do so? The reason is coercion. Sum operates on numbers; what we have here is a logical vector, and logical values can be coerced into integers or floating-point numbers, with the convention that TRUE becomes 1 and FALSE becomes 0.
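For reference, here is a sketch approximating the loop just described; the matrix name exprMat and the vector name tAll are stand-ins, and the actual course script may spell things differently.

```r
# Sketch of the per-gene t-tests described above; exprMat stands in for the
# log-expression matrix (32,321 rows), columns 1:3 = t0, columns 4:18 = the rest
tAll <- numeric(nrow(exprMat))
for (i in 1:nrow(exprMat)) {
  tAll[i] <- t.test(exprMat[i, 1:3], exprMat[i, 4:18])$p.value
}

hist(tAll, breaks = 50)        # distribution of p-values across all genes
abline(v = 0.05, col = "red")  # the conventional significance cutoff
sum(tAll < 0.05)               # counts TRUEs: logicals are coerced to 1 and 0
```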
So what the sum does is take the vector of TRUE and FALSE and convert it into a vector of 1s and 0s, as we've just seen. Without knowing about coercion it would be very hard to understand what's happening here and why this is even correct; I hope that you know it now. It's something we do frequently, so while I would consider it crafty code, it still passes the smell test: it's relatively simple and doesn't need too many assumptions, so it works quite well. So which gene had the smallest p-value? We find it by asking which element of tAll is equal to the minimum of tAll and assigning that to gpMin, and the maximum to gpMax. gpMin is in row 17766, gpMax is in row 18166. So let's plot them: the minimum in black and the maximum in red. In this plot, the gene with the smallest p-value, the ten-to-the-minus-something one, is this black line here. The green lines are the ones from the top table that we plotted before, the one at the top and the one at position 250 of the top table, and the red one is the gene with the p-value closest to 1. But the black one is actually the gene that is most significantly differentially expressed. I always find that a very interesting slide to ponder, because it tells us a lot about what actually happens when we calculate such p-values. We're not really looking at the size of the difference alone; we're also looking at the variability of the individual measurements. If the standard deviation of these points is very small, and the standard deviation of those points is very small, and there's a slight difference between them, then the fact that they have very small standard deviations makes the probability that they come from two different populations very large. The mathematics says: well, this is a very narrow distribution, the other one is also a very narrow distribution, and there's extremely little overlap between them. And we have to keep in mind that one of the problems is that each of these distributions is estimated from only three individual points. If one of these individual points were slightly larger, maybe just 5% larger, that statistical significance would completely evaporate. So one has to be very careful. We throw a t-test at the data and get a fantastically low p-value, but if we actually look at the data and ponder why that p-value is small, we might not be convinced that this is actually differential expression. As you can see, the difference between this being the least significant and this being the most significant really just lies in how these three points behave. So once again: we have to look at the data. Now I'd like to look a little at non-parametric tests, because they can be somewhat obscure. Let's look at two random data samples with slightly different means to explain how they work. What I'm doing here is drawing 50 samples and putting them into a matrix: the first 25 are taken from a normal distribution with a mean of 10 and a standard deviation of 1, and the second 25 are taken from a normal distribution with a mean of 11 and a standard deviation of 1. If I run that and then do a box plot, that's the difference we have here.
So this one has a mean of 10 and this one has a mean of 11, and it turns out that there's a lot of overlap between the samples; they're only somewhat different. Now the question becomes: are these two different distributions? I don't know what you would say if your student brought you this data, whether they've actually discovered an effect. If we plot the values by their index and color them, so these are the first 25 and these are the second 25, we can kind of convince ourselves that these tend to be lower than those, but just from looking at the original values it's really hard to tell. And if we sample these measurements randomly and mix them, it becomes impossible to distinguish the two distributions just by looking at the data. Now, what the non-parametric test does is order them by rank. So this is exactly the same data as in the previous plots, except that we have ordered the index differently on the x-axis, namely in ascending order: the lowest values first, then higher and higher. If we look at it that way, it becomes much more intuitive: we can actually see that the red values bunch up at this level and the black values sit more down here. There's an excess of red values here and an excess of black values there, and that's the comparison that the non-parametric Wilcoxon test makes. Basically, the Wilcoxon test asks: how many samples are smaller than the one I'm currently looking at, and which group do they belong to? If I do an actual Wilcoxon test on these two samples, this is what the output looks like, a Wilcoxon rank sum test, and I get a p-value of 0.035, and yay, indeed that's significant, so we can go out and publish a paper about it and give it a star. The purpose of this was really to illustrate how such a non-parametric test works in the first place. The point is that the other tests require explicit assumptions about the underlying distribution; the non-parametric tests require no such assumptions at all. Whether things are logarithmic, or normally distributed, or distributed according to some extreme-value distribution, it makes no difference: it all gets flattened out, because we only look at ranks, and the rank comparison then tells us how significant the difference is. Okay. Now there's more code in the unit about multiple testing, about repeating this a number of times, and so on.
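For reference, a minimal reconstruction of the simulation just discussed; the seed is arbitrary, so the exact p-values will not match the figures from the lecture.

```r
# Two samples of 25 values each, overlapping distributions with means 10 and 11
set.seed(123)
m <- cbind(rnorm(25, mean = 10, sd = 1),
           rnorm(25, mean = 11, sd = 1))
boxplot(m)                      # heavily overlapping boxes

# Plot all 50 values in ascending order, coloured by group, to show what
# the rank-based comparison "sees"
v   <- c(m[, 1], m[, 2])
grp <- rep(c("black", "red"), each = 25)
o   <- order(v)
plot(v[o], col = grp[o], pch = 19)

wilcox.test(m[, 1], m[, 2])     # rank-sum test: no distributional assumptions
t.test(m[, 1], m[, 2])          # parametric counterpart, for comparison
```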
I think it's late in the day and we've talked about a lot, so I'll close it off at this point. The scripts are online; you can play with them in whatever way you want. If you get stuck, email me. If you have suggestions about improving things, or explaining things in a slightly less confusing way than I did, I'd love to hear from you so we can improve. Some of your suggestions can and should go on the survey that Anne has prepared, which is really important: all of the CBW faculty meet once a year to discuss what worked well and what didn't work so well, and the courses do change and adapt, so that's good. And you can do both: you can complain about us in the survey, but you can also contact us directly with suggestions for improvements and clarifications. If you are unhappy with the way your experiment behaves, or need suggestions on how to poke it in different ways to be successful, then by all means contact us too. So, it's been a long two or three days. I enjoyed it; I thought you were very attentive, I thought there was good energy in the room, and I know that many of you had amazing a-ha moments on one topic or another. That's good, but it's only good as a beginning. If you don't keep it up, if you don't think of projects to apply this to, and simple things to code, and try out and fail, and fail again, and look up how to do things differently on Stack Overflow or with Google, it will become lost very quickly. It's a new language; you've learned the first few words, and now you'll have to go out and speak it in the real world. Sometimes, when you order something in the diner, you'll get a crocodile instead of what you thought you were ordering, but if you have enough tests and validations, that won't impact your progress. So for now: thank you, and let's stay in touch.