Getting you guys started off on R and XCMS, and hopefully we'll be able to use some of that work you've done. If not, we've got some other data sets for you to play with. So, the standard slides here. We're going to do a background on statistical methods, and this one, based on feedback we've had for a number of years, is one of the modules I think people appreciate a lot. Of course there are always ways of making it better, but it is a way of starting to understand statistics, because most of us haven't really had a lot of formal training in statistics. This is just a reminder of our schedule. We're on day two. We've already done the LC-MS analysis. We'll do the background and statistics, and then we'll jump into MetaboAnalyst. That's a bit of a walk-through lab, although I'll be trying to shepherd people along for it, so it's a little different type of lab. And then we'll talk about some of the applications in the last module. So this is really a review of statistics, and it's trying to cover essentially two years of university statistics in an hour and a half. But we'll try to get you acquainted with a few things and give you, I think, a more intuitive understanding of statistics. That's important, because otherwise it's just crunching numbers, and the numbers will quickly become meaningless. So I think the critical thing with statistics is learning about distributions and measures of significance. Then we're going to move to what are standardly called univariate, or single-variable, statistics. We'll also talk about correlation, which is tied to that, and clustering. And then we'll move to the last part, which is multivariate statistics, which is what people traditionally use a lot in metabolomics. But all of these, univariate statistics, correlation analysis, clustering, and multivariate statistics, are all parts of the same thing, and you'll see that when you do some work in MetaboAnalyst. In terms of statistics, there are some great quotes I've seen. The best one of all is attributed to Benjamin Disraeli. He was a British prime minister who was trying to deal with data coming in from various sources in the 1800s, and his comment was: there are three kinds of lies: lies, damned lies, and, worst of all, statistics. Another one I think is also pretty legitimate: 98% of all statistics are made up, which is itself made up. And then this other one by Aaron Levenstein: statistics are like bikinis; what they reveal is suggestive, but what they conceal is vital. And that has often been the real problem: not explaining how the statistical methods work, or what's being used, or the assumptions built into them. My view, having played around with statistics for a long time, is that statistics is essentially a formal, or formalized, way of describing impressions. Many of us are able to see and gather impressions, and we have intuition: this is right, this is wrong, this is different, this is bigger, this is smaller. But that's a qualitative view, and statistics makes it quantitative. It formalizes it. Usually your impression, your gut, is right. But if you just put "my gut says this is the way it is" in a grant or a paper, they're not going to accept it. They want "my gut, with a p-value of 0.05, says this is right," and then they will accept it. So that's the premise. We'll dive into distributions and significance, because this underlies all of statistics.
So the first thing that you typically have to deal with is a distribution over a population. It might be a single variable that you're looking at in this population, a bunch of people, 30 or so, and we could choose a variable like height. If we were looking at the height of this population, we would be doing univariate statistics. We could also measure weight, that's another single variable, or scores they got on some kind of test, their intelligence. Each of those is a variable, and each of them is assigned to an individual. That is the essence of univariate statistics: measuring a single variable over a population. And if you measure that variable, let's say the height of those people, over that subset or over the entire population of 7 billion, and plot the frequency of that variable, you will get a distribution, and the distribution will look like this. That's a bell curve, or a normal distribution, or a Gaussian distribution. It's a pretty remarkable thing, and it was characterized by Carl Friedrich Gauss: almost every biological measurement, every physical measurement we make, if we collect it over a large enough sample, will follow this same distribution. There will be something trending toward the middle, there will be outliers, and it has this rather smooth shape, especially if the sample set is large enough. So the normal or Gaussian distribution is symmetric. It has an average value, and it has a width. The width can vary, depending on the type of measurement, and that width is called the standard deviation, or sigma. As I said, it is the most common distribution known, because we've seen it in essentially every biological or physical measurement that's been done, or most of them. And the more measurements we take, the more normal or smooth that curve will be. If we take a very small number of measurements, say the heights of the people in this room, we'd find it's not terribly normal-looking, because we're dealing with 18 or 19 people. To get a good distribution, the rule of thumb is that you need about 30 or 40 samples. That's partly intuition, but mostly observation, and it's something that is used, and should be used, when you're thinking about the kinds of experiments you want to do. It's also one of the reasons why classes in elementary and junior high school typically have about 30-odd students, and why we try to aim for enrollments near 30 to 40: it gives us enough of a sampling to know whether we'll get a reasonably normal distribution, something that's not skewed. Many clinical experiments use the same sort of numbers, 30 controls and 30 cases; again, that's the issue of getting enough samples for a reasonably normal or Gaussian distribution. It was Carl Friedrich Gauss who actually came up with the equation that describes the normal or Gaussian distribution. So this is an x-y plot of f(x) = (1 / (sigma * sqrt(2 * pi))) * e^(-(x - mu)^2 / (2 * sigma^2)); at its essence, it's e to the minus x squared. We're plotting x on the horizontal axis against the probability, p(x), on the vertical. You can see the standard deviation, that's sigma; you can see mu, that's the mean; and the variance, which is sigma squared, is also up there. That value, 1 over sigma root 2 pi, is a normalization constant, a scaling factor.
And you'll hear me use the word normalization, normal scaling sort of interchangeably, and it's an unfortunate case in English language that these are used, probably to mean some very different things. But anyways, there's a scaling factor there. What's marked off are segments in terms of the normal distribution, the width. And so plus or minus 1 sigma covers 68% of the area under the curve. Plus or minus 2 sigma covers 95%, that's the curve. Plus or minus 3 sigma covers about 99% or more. At the center of the curve, the height corresponds to the mean. And the mean is also something we all know how to calculate anyways. Take the average height, or calculate all the heights, divide by the number of heights you've measured, come up with an average or mean height. The variation or variance is more measured of how different, how much does the height vary if someone's in this room is 5 foot 2, another person in this other room is 6 foot 2. Our variance is not simply 5 foot 2 to 6 foot 2, it's a variance which is again measured in inches, and it wouldn't be 12 inches, it would be more like let's say 6 inches. Then the square root of that variance, which is about 2.5, is our standard deviation. So this again reiterates what I was just saying about the amount of the area that's covered when we measure these standard deviation groups, the amount of the curve or area under the curve going from 1 sigma, 1 standard deviation, 2 sigmas, 3 sigmas, 4 sigmas all the way up to 5 sigmas, which is very extreme. But the 2 sigma one is significant, and I'll talk about that a little later. So that 1 standard deviation or 1 sigma probably that something is more than 1 standard deviation away from the mean. So if our average height for people in this room or the population I was showing there is let's say it's 5 foot 7. So we've got men and women, so it kind of comes up there. Then if our standard deviation was 3 inches, then about a third of the population would be above 1 sigma, so above 5 foot 10, then a third of the population would be below 5 foot 4. The entire curve, yes. Now if we're looking at 2 sigmas, 2 standard deviations, if the average height was 5 foot 7 and our standard deviation was 3 inches, then that's 6 inches, so that's how many people are above 6 foot 1, that would be 5% of the group, then how many people below 5 foot 1 a game there would be, or 2.5% to 2.5% of the group. 3 sigma, almost no one would be above or below the 3 sigma, especially with this population here because 0.3% is 1 in 30, or we're actually 1 in 300, and we don't have a population of 300. In a way, I think it's probably a little easier to remember. All of us have gone through school, university some of us have been in very large classes, and when you have the very large class, we all have to follow, at least the profs, have to follow this normal distribution, and our deans and chairs tell us the average grade that we're going to hand out in a class of 400 has to be C, so that means if you're within that group, roughly 68% under that curve, you're going to get either a C or C minus, C plus maybe a little bit higher, a little bit lower. If people score one standard deviation above that average score for a test overall, then we're supposed to give them a B, and if they get two standard deviations above, we're supposed to give them an A, likewise one standard deviation below the mean, or average, they're supposed to get a D, two standard, it's an F. 
In a way, I think grading is probably the easiest way to remember this. All of us have gone through school and university, and some of us have been in very large classes. In a very large class, the profs have to follow this normal distribution: our deans and chairs tell us that the average grade we hand out in a class of 400 has to be a C. That means if you're within that middle group, roughly the 68% under the curve, you're going to get some kind of C: C minus, C plus, maybe a little bit higher or lower. If people score one standard deviation above the class average on a test, we're supposed to give them a B; two standard deviations above, an A. Likewise, one standard deviation below the mean is a D, and two below is an F. So that's classic grading on the curve, and in terms of standard deviations, if you're getting an A, you're roughly in the top 2.5 to 5%. That concept of standard deviations, and essentially the 2 sigma cutoff, is also the basis of what's called the p-value. This is the probability of getting a particular score or measurement, or one more extreme, if you want. So 0.05 is essentially a 5% probability that this particular measurement could be achieved by chance. In statistics, we either accept or reject a null hypothesis. We reject the null hypothesis when the p-value is less than the threshold, 0.05 being the value we usually use, sometimes 0.01. If the null hypothesis is rejected, then we can say the result is significant, statistically significant. The choice of an alpha is actually pretty arbitrary. If significance is absolutely critical, a life-or-death situation, you might want an alpha of 0.0001. If it's only the difference between a good day and a bad day, then a significance level of 0.2 is okay. And there are lots of people who will die on their sword because the p-value of their t-test came out at 0.06: the whole thing is insignificant, let's toss it all out, the experiment's a failure. Which is absurd. As I say, this is an arbitrary value, chosen almost entirely by convention. That's an important point that a lot of people don't really acknowledge or even accept: the value of 0.05 is convention, and it can be changed to suit your circumstances. So, back to the example of height in a population of 30, 40, or 50 people, men and women: typically the average height, taking both men and women, is about five foot seven, and let's say the standard deviation in that group is five inches. You can use the probabilities from the shape of the normal distribution curve, together with a significance level, an alpha, of 0.05, and ask: what's the chance of finding someone who is more than six foot ten? You can construct a null hypothesis that anyone over six foot ten isn't actually human; they're a giant or something like that. So given an alpha of 0.05, is someone who is six foot eleven a member of the human species? At that alpha, you probably wouldn't conclude that they are; but if you chose an alpha of 0.01, you could. So by choosing a different cutoff, you can change the assertion about whether they are a member of the human species or not. The same sort of thing works with coin flips. If you flip a coin, assuming it's not unbalanced or tricked or fixed, you can start asking: if I flipped a coin and got heads 70% of the time, 14 out of 20 flips, is the coin fair? We can calculate the odds of that happening, and they come out to about 0.058. If we choose an alpha of 0.05, we conclude the coin is fair; but if we use an alpha of 0.1, we conclude the coin is not fair. So again, the threshold you choose can change the conclusion, and it's one you have to be explicit about and sometimes justify. It goes back to this point that statistics is the math, the mathematics, of impressions.
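If you want to try the coin example yourself, here's a minimal sketch in R using the built-in exact binomial test (the one-sided version reproduces the ~0.058 figure quoted above):

    # Is a coin that comes up heads 14 times out of 20 flips fair?
    binom.test(14, 20, p = 0.5, alternative = "greater")
    # one-sided p ~ 0.058: fair at alpha = 0.05, not fair at alpha = 0.1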
So we'll still use the value of 0.05 for the rest of the discussion, but remember it can be different for different circumstances. The other thing that we have to think about is how distributions change. The Gaussian or normal distribution is a symmetric distribution, but many distributions are not, and when a distribution isn't symmetric, we have to use other descriptions for its center or middle point, and this is where the terms mean, median, and mode come in. In a Gaussian distribution, the mean, median, and mode are all the same, which is why most of us only know the definition of the mean and don't know what a median or a mode is. If you have a skewed distribution, which is not uncommon, you have to use all three terms. A mean can still be calculated in a skewed distribution; again, it's just the sum of all values divided by the total number. But if you have some extreme values, something very small or very large, they can mess up that average. The mode is the most common value; that would be the top of the distribution, the column or segment holding the largest number of values. And the median is the middlemost value, the one with half the measurements above it and half below; in a skewed distribution it usually falls somewhere between the mean and the mode. Some distributions will have just a single peak; other distributions will have two or three peaks. A single-peak distribution is called unimodal; one with two peaks, bimodal; and you can get trimodal distributions. You don't want to see bimodal or trimodal distributions: they say there may be something inherently unique or different about your population, or about how you're measuring things. In particular, if you're trying to do statistics, almost every statistical test is built on working with a unimodal distribution. The Gaussian or normal distribution, as I said, is the most common one, but there are others you will encounter. Coin flipping actually follows a binomial distribution, and coin flipping taken to the extreme follows a Gaussian distribution. Statistics involving relatively rare events, things like mutations along a DNA strand, follow the Poisson distribution; in the classic application, it was the number of soldiers dying from horse kicks per year in the Prussian army that followed an almost exact Poisson distribution. In BLAST scoring for sequence alignment, scores follow the extreme value distribution, which is typically a skewed, semi-exponential distribution. The binomial distribution, as I said, generates essentially a Gaussian distribution when taken to the extreme. But for small numbers, when you're dealing with two-state events like a coin flip, you can calculate exactly what the distribution will be. All you need to do is take the probabilities of the two outcomes, p and q, raise (p + q) to the power n, where n is the number of trials you're going to sample, and read off the coefficients of that polynomial. So (p + q)^0 is by definition 1, so the coefficient is 1. (p + q)^1 gives you p + q; the coefficient of p is 1, and the coefficient of q is 1. (p + q)^2 gives you p^2 + 2pq + q^2, so the coefficients are 1, 2, 1. And (p + q)^3 gives you p^3 + 3p^2q + 3pq^2 + q^3; again, the coefficients are 1, 3, 3, 1.
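Those coefficients are just the binomial coefficients, and you can generate them directly; a one-line sketch in R:

    # Coefficients of (p + q)^n for n = 0..3: Pascal's triangle
    for (n in 0:3) print(choose(n, 0:n))
    # 1
    # 1 1
    # 1 2 1
    # 1 3 3 1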
Plotting the coefficients gives you this distribution, and as n gets larger and larger, the plot of the coefficients looks more and more like a Gaussian. If we went up to n of about 30, you would see that it looks very much like a Gaussian curve, which is where that rule of 30 comes from, and taking it to the infinite limit produces exactly the Gaussian equation, the e to the minus x squared. The Poisson distribution, as I say, is essentially the statistics of rare events. When the value of mu, which is the expectation value, gets sufficiently large, more than about 5 or 10, the Poisson distribution becomes the Gaussian distribution. But when mu is very small, it becomes essentially an exponentially decaying function, and in between it takes intermediate shapes. So it is used for the statistics of relatively rare events, and its shape changes depending on the size of mu. The extreme value distribution is the one used to assess scores in BLAST, or in MASCOT when you're doing proteomic data analysis. It's also the distribution that should be applied to university classes, but it isn't. The point is that universities are supposed to take the top 20 or 30% of high school students and put them into first year, and that's sort of what we're doing here. The assumption is that if you take the top 20 or 30% of a high school class, they should still follow a normal distribution. They won't. They'll follow this distribution, which is called an extreme value distribution. It's effectively skewed; in the case of universities, it's skewed to the right. If you're choosing jockeys for horse racing, you'd skew it to the left, toward people who are short. Either way, you've created a skewed distribution, and with these you will typically find some very, very extreme values. In the case of universities, you'll find super geniuses like the ones on The Big Bang Theory, people with IQs that are off the chart. Those are the extreme outliers you'll typically get, and trying to assess them is sometimes challenging; this is why extreme value statistics were developed. Now, you can fix skewed distributions, one way by sampling more, broadening them out, but the other thing you can do is a mathematical fix, which is to apply a log transformation. If you talk to statisticians, that's their favorite thing to do. They always talk in log scores; everything is quickly converted to a log; they can do logs in their head. For them, the log transformation takes what used to be messy distributions and converts them into normal distributions. The reason you want to convert things into a normal distribution is that essentially all of statistics was designed to handle normal distributions, and on anything that's not a normal distribution, most statistics falls apart. That's important to remember. So on the left is an example of a skewed distribution. This is typical of intensity data from microarray experiments: linear intensity scales and photomultiplier tubes tend to produce this kind of skew. It's also roughly the way our eyes work, and our ears too.
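Before we look at the real data on the next slide, here's a simulated version of that fix, a sketch in R assuming log-normally distributed, intensity-like values:

    set.seed(1)
    x <- rlnorm(1000, meanlog = 5, sdlog = 1)   # skewed, intensity-like data
    hist(x)          # long right tail, like the microarray intensities
    hist(log(x))     # roughly Gaussian after the log transform
    shapiro.test(log(x))   # a formal normality check, if you want one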
If you take the log of those values, you convert this skewed distribution, which looked something like an extreme value distribution, and boom, it's a normal distribution. Now you can do statistics on it. Try doing t-tests and other things on the skewed distribution and it won't work so well. Here's an example with some real data, taken, I believe, from a microarray experiment, and here are the log transformations. They're not perfect Gaussians, but you can see that they're much more like a bell curve on the right than the ones on the left, before taking the log. In real-life situations, that's often as close as you can get to a true Gaussian distribution. So that's univariate statistics: distributions, means, and so on. Why do we want to talk about that? We do it because we want to be able to distinguish between two or more populations. Almost every metabolomics experiment you will do, or attempt to do, or are currently doing, involves trying to distinguish between two populations: treated versus not treated, sick versus healthy, grown under this condition versus that condition. So in this case, we're looking at two populations: normal people, and leprechauns in green. If you've ever read about leprechauns, they're supposed to be short. So the question is, is there a difference between the heights of the normals and the leprechauns? We can plot out each population's heights as a curve, the normal one in light blue and the leprechauns in green on the left. The question we ask statistically is: are they different? Intuitively, everyone would say yes, but can you give me a quantitative measure? Can you say how confident you are in that? Let's make it a little harder. In southern Ireland, there's a large collection of tall leprechauns. If we want to distinguish between the two populations now, are they different? If we plot them out, we see the curves actually overlap, uncomfortably more than we'd like. So the question is: are they different? How many people think they're different? How many think they aren't? This is where you'd have a difference of opinion, and the question is how you settle it and what sort of rules you use. This is where statistics comes in, with what's called the Student's t-test. It was developed more than 100 years ago; most people just call it the t-test, and it was developed to determine whether two populations are different. The null hypothesis is that the two sample means are the same, and the test gives you the probability of seeing your data if that's true. If you reject the null hypothesis, then the populations are different. I'm not going to show you how the t-statistic is calculated, because you can look it up on Google or do it on any calculator, so it's not worth going through. But if you calculate the t-statistic and get a p-value of 0.4, and your alpha, your threshold, is 0.05, the t-test says the two populations are the same. If you do the t-test and get a p of 0.04, with the cutoff you've chosen by convention at 0.05, then you can say the populations are different. You could have chosen an alpha of 0.01 or 0.06, and depending on your choice, you could say they're different or they're not. So this is where the arbitrariness of the alpha comes in.
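Here's what that comparison looks like in practice, a minimal sketch in R with simulated heights (the means and standard deviations are invented for illustration):

    set.seed(42)
    normals     <- rnorm(40, mean = 67, sd = 3)   # heights in inches
    leprechauns <- rnorm(40, mean = 64, sd = 3)   # the shorter group
    t.test(normals, leprechauns)   # Welch two-sample t-test
    # a p-value below your alpha means you reject "the means are the same"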
Now, there are both paired t-tests and unpaired t-tests. Most of the things you'll be doing call for unpaired t-tests. However, if you're doing a temporal series, before-and-after measurements on the same subjects, then you're supposed to do a paired t-test; that's the rule. You can also use a t-test to look at clusters of data. If you're familiar with PCA plots or something like that, here are two clusters, and in essence they follow normal, or reasonably normal, distributions. So you can use a t-test to quantitatively measure how different, or how close, these two populations are; it's a matter of combining the variables appropriately. What if the distributions aren't normal? As I said, the t-test, and essentially all of classical statistics, was developed under the assumption that distributions are normal. So here are some messy-looking distributions that are almost bimodal. What do you do? The trick here is called the Mann-Whitney U test; it's also called the Wilcoxon rank-sum test. Has anyone heard of these before? A few? Okay. If the populations are not normally distributed, then you can use this. It's technically a more robust test, and in this case it's not the means but the medians you're comparing; that's the hypothesis you're testing. And it's a U statistic you calculate, not a t statistic. If the U test gives you a p of 0.4 and your alpha is 0.05, the two populations are the same. If the U test gives you 0.04 and your chosen alpha is 0.05, the populations are different. Same concept as the t-test, but it's a different statistic, and it's used when the distributions are kind of messy. Okay, so that's for two populations. What if you've got three populations, the normals, the leprechauns, and the elves, and you want to compare them? Are they different? Again, intuitively you can say they're all different; there's a clear height difference. Then we go to a different collection: normals, leprechauns, and a very tall population of elves, and we now get something that's again merged. What can we use to distinguish between them? The answer in that case isn't the t-test; it's analysis of variance, ANOVA. That's a generalized version of the t-test, for three or more populations; it could be five populations, a hundred populations. It's general. We look at the within-group and between-group variation, the group standard deviations, and determine whether the means are equal or not; that's the null hypothesis again. Formally, it's based on an F statistic. There are one-way, two-way, three-way, ten-way ANOVAs. The one-way ANOVA is the most common one: we're just trying to figure out whether any of those three or more populations is different. It doesn't necessarily identify which pair differs; we're just trying to find the outlier population. That's the one-way ANOVA. You can also use ANOVA in clustering analysis, to determine whether any of three clusters are sufficiently different, as long as the clusters follow reasonably normal distributions. So there is a formal structure you can use to distinguish among clusters or groups.
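Both tests are one-liners in R; a sketch, reusing the simulated normals and leprechauns heights from the t-test example above and adding an invented third group for the ANOVA:

    # Non-normal data: compare medians with the Mann-Whitney U /
    # Wilcoxon rank-sum test
    wilcox.test(normals, leprechauns)

    # Three or more groups: one-way ANOVA
    elves  <- rnorm(40, mean = 72, sd = 3)
    height <- c(normals, leprechauns, elves)
    group  <- factor(rep(c("normal", "leprechaun", "elf"), each = 40))
    summary(aov(height ~ group))   # F test: is any group mean different?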
Now, this next bit is a little different from what you might have in your slides, but we've had a number of people ask over the years about this issue of how to deal with multiple tests. You may have just two populations, but multiple variables, so you end up doing lots of pairwise t-tests. And that's not uncommon. You might have two populations and 100 different metabolites that you're measuring, so you're doing 100 different t-tests, using a p-value threshold, an alpha, of 0.05. What are the odds that some of the things you find will be false? It's almost certain that something at a p-value of 0.05 will be false; it could be several things. To make sure you don't mess up, there's a thing called the Bonferroni correction, which says you have to take your alpha and divide it by the number of t-tests you've performed. So with 100 metabolites measured, for a hit to be significant, you have to find a metabolite that differs with a p-value of less than 0.0005. To put that in more concrete terms: if we were talking about this class and we wanted to identify someone who was taller than normal, we'd have to find someone in this room who's 8 feet tall before we could say so. And that is pretty extreme; in fact, I've never seen an 8-foot person in my entire life, and I've probably seen hundreds of thousands of individuals. So in many cases people believe the Bonferroni correction is far too extreme, and in large part it's been discarded as a technique. What people tend to use instead is a thing called a false discovery rate, or FDR, method. The slide here is an example of applying a Bonferroni correction to weather forecasts: if you summed most of the probabilities shown, you'd conclude that something will surely happen tomorrow. [Audience question, paraphrased: if I've done a thousand tests, isn't the Bonferroni threshold crushingly strict? And how do you decide, beyond gut feeling, whether a handful of hits at some threshold is too few to support a conclusion, and how do you justify that?] I agree with you, and to some extent you have to use some human intuition. Formally, the way the false discovery rate method works is that you look at the distribution of your p-values over all 1000 tests. In this case there were just 3 at 0.05, and you might have had, who knows, about 10 at 0.02, and a bunch at smaller and smaller values. Or you may have had something with most of the p-values sitting near 1; that's a very unusual distribution, and a lot of FDR procedures would choke on something like that. The normal occurrence is a histogram that falls off as you spread out from 0.05 toward 1.0, settling onto a roughly flat background. The essence of an FDR procedure is to draw a line at that flat background level, and that is the correction: the hits down at the background level are the ones likely to be false, and anything above the line is likely to be real. So if there were three hits and the background line sits at about one, then two of them are probably real. You can also have situations where the histogram is completely flat; in that case the hits are all false. The falling-off shape is the more typical case; the flat one is the case where what you're measuring is mostly nonsense. For instance, if you're trying to understand the TCA cycle and the things you're measuring are all metals...
...metals have nothing to do with TCA metabolism, so you'd probably get something like that flat distribution: you're not measuring the right thing. But if you're measuring something relevant, metabolites in the TCA cycle, and you see that some of them are clearly significant and others just aren't, then it might say: okay, this is where the perturbation is, this is what's relevant. This is where intuition matters. If you're looking for your lost keys and you go to the Antarctic because you've never been there, that's a silly thing to do; intuition says to look for your lost keys where you have been. That's where intuition is important. And yes, the FDR cutoff is kind of an inflection point; it's like having a log curve and trying to choose the point where it turns over. That's roughly what's done. Going back to the Bonferroni correction: it's something you could have used on that weather slide, where there are 14 weather predictions. The only one that's significant at 0.05 divided by 14, so at a threshold of less than about 0.004, is the eclipse. Using that rule, the only statistically significant prediction is the eclipse. Again, that's intended to show the extremeness of the Bonferroni correction. An FDR measure, in a simplified, back-of-the-envelope form, goes like this: say you did 100 different t-tests, and 20 of them had p-values below 0.05. You would expect some of those to be false positives, and you can estimate how many would typically show up in a null situation. To correct for that, rather than dividing alpha by 100, you divide by only about 20. The new FDR-corrected or adjusted threshold, related to what's also called the q-value, would be 0.05 divided by 20, which is not as bad as 0.05 divided by 100. That's why the FDR is a softer criterion, and also a more reasonable one. Again, this is the approximate, back-of-the-envelope version, not the way it's done mathematically; R has algorithms that do it more accurately, by looking at the distribution of all of the p-values and working it out. Okay, so that was a bit of a segue, a sideways discussion, on p-values and significance with multiple t-tests.
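In R, both corrections are available through p.adjust; a minimal sketch on simulated null data (no real effects, so any hits are false positives):

    set.seed(7)
    p <- runif(100)    # 100 p-values when nothing is actually different
    sum(p < 0.05)                                    # ~5 "hits" by chance alone
    sum(p.adjust(p, method = "bonferroni") < 0.05)   # Bonferroni: usually 0
    sum(p.adjust(p, method = "BH") < 0.05)           # Benjamini-Hochberg FDR: also 0 here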
Now I'm going to talk a little bit about scaling and normalizing. As I say, these terms have an unfortunate dual use: some people will say, can you normalize your distribution, meaning can you make it into a normal distribution; other people will say, can you scale your distribution, meaning can you shift or multiply things by a scaling factor. So it's a little confusing. Anyway, suppose we had two populations, and one of our rulers was miscalibrated. We got a bad ruler, and things were off, and initially we see a clear separation: even though to my eye these look like identical populations, because we messed up our ruler, our scaling, one population looks a lot taller than the other. If we fix our ruler, realizing it's off by 10%, we get a more reasonable distribution. So this is correcting for, if you want, a systematic error. An example of a systematic error in metabolomics is a dilution effect. This is common in urine analysis, but it could also arise in any tissue preparation where you've diluted the sample, or didn't weigh things out, or didn't account for hydration or water content in the sample when you calculated the mass. So this, quote, normalizing or rescaling is obviously critical. As I said, normalizing and scaling can have multiple meanings. When I say normalize a distribution, it means make it into a Gaussian distribution. When I say rescale or normalize in the other sense, it sometimes just means making the total sum to one, or adjusting appropriately for dilution. For normalizing a distribution, I've already mentioned the log transformation, which makes things look as close to a Gaussian distribution as possible. We'll see this in MetaboAnalyst, which uses normalizing and scaling in the same way: in some cases it means log transformations toward normal distributions, and in other cases it means scaling to adjust for dilution effects.
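As a concrete sketch of the dilution-correction sense of the word, here's one simple approach in R, total-sum normalization followed by a log transform (this is one of several options that tools like MetaboAnalyst offer; the matrix here is invented):

    set.seed(5)
    mat <- matrix(abs(rnorm(40 * 10, mean = 100, sd = 30)),
                  nrow = 40)         # 40 samples x 10 metabolite intensities
    mat_norm <- mat / rowSums(mat)   # each sample now sums to 1 (dilution fix)
    mat_log  <- log(mat_norm)        # then log-transform toward normality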
Now we're going to switch over to another part of statistics: data comparisons and dependencies. This is very common in metabolomics and every other omics study; it's very common in modeling and in computational data analysis. We want to know something about before and after a treatment or an intervention; we want to measure how one variable depends on another; we want to know whether an observed property matches a predicted property. In all these cases, we're usually working with many, many samples: dozens, 30 at a minimum, more typically hundreds. When we do that kind of dependency measurement, the standard way to look at it is through a scatter plot. Here's a scatter plot of the relationship between a husband's age and a wife's age, and the question is: do men and women of about the same age typically marry? Intuitively, we'd say that's usually the case, and the graph, to our eyes, confirms it; but how can we quantify that? This is where statistics is useful. The way we measure this relationship, predicted versus observed, before versus after, whatever it is, is by calculating the correlation. The graph of husband's age against wife's age is positively correlated. If we tried to correlate husband's age with distance to the equator, we might find something completely uncorrelated. And if one variable ran from positive to negative while the other ran from negative to positive, we'd get a negative correlation; the variables are still correlated, it just happens that the slope is negative instead of positive. I think most of us can recognize when something is highly correlated: the data in the scatter plot are tight. A perfect correlation gives a perfectly straight line; you should never see that in physics or biology or chemistry. And then there are low correlations, with a fair bit of scatter; those are what you typically get in psychology measurements. Correlation is quantified through the correlation coefficient, the Pearson correlation coefficient. There's a formula for calculating it, and it's denoted by the letter r. With it, we can assign numbers: the plot on the left has a correlation coefficient of 0.85, which is pretty good; the one in the middle has a correlation of 0.4, which is not so good; and a perfect correlation coefficient is 1. It's essentially a measure of the variation around the expected straight line. So there's the coefficient of linear correlation, the Pearson product-moment correlation; that's the most common one. There's another called the Spearman correlation coefficient, which is rank-based and less commonly used, but also quantitative. This is the way many, many people compare predictions, simulations, and all kinds of data dependencies. There is an issue, though. If you use Microsoft Excel, which most of us do, and you plot scattered data and add a trendline, it will calculate not the correlation coefficient but the so-called coefficient of determination: it presents r squared. I think of it as a bug, but they've never fixed it, and unfortunately it's become common use. So people quote r squared and call it the correlation coefficient, which it is not. Pearson intended people to use the correlation coefficient, r; that's what you should quote, so take the square root and report that. Now, there's also the issue of when a correlation is significant. Here's a case where we've got maybe 200 data points and a correlation of 0.85: is that significant? Here's a case with three data points and a great correlation coefficient, 0.99: could you publish that? In some cases you then collect more data; here, with the three data points that looked great, you do two more experiments, shown as the points in red, and your correlation went from perfect to abysmal. So this is an example of why collecting more than about 30 points is, again, critical. And there are a few tricks people have pulled. One is to just show three data points and say, look, my model is perfect, my data are wonderful. Another trick is collecting data only at the extreme ends: a lot of data at one point, then a few data points at an extreme situation, and, oh, now I get a fantastic correlation of 0.95. But you've left out all the data in between, which would normally reduce your correlation coefficient. Some people are also very selective in the data they keep: I don't like that point, I don't like that point; they toss out 197 points, keep the three they like, and get a perfect correlation. That's also incorrect. So you can actually use a t-test to assess the statistical significance of a correlation. A correlation can be quoted with a p-value, from essentially a t-test of whether the slope is statistically different from 0. That's another way of quantitatively assessing whether you've got a really decent correlation, and whether you've got enough data to make a statement about it.
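Here's the whole story in a few lines of R, a sketch with simulated ages (the numbers are invented): the correlation coefficient, the t-test of its significance, and the rank-based alternative:

    set.seed(3)
    husband <- rnorm(50, mean = 40, sd = 10)
    wife    <- husband + rnorm(50, sd = 4)       # correlated ages
    cor(husband, wife)                       # Pearson r: quote this, not r^2
    cor.test(husband, wife)                  # t-test: is the slope different from 0?
    cor(husband, wife, method = "spearman")  # Spearman rank correlation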
There are other things you can see in correlation plots. Sometimes you'll get a wonderful-looking correlation, and then you'll see outliers. Some of those outliers are a measurement error or a typographical error, but it depends on the experiment and the situation; sometimes those outliers are in fact very important. They may represent a substantial change in physiology, or gene expression, or whatever you happen to be measuring. In the case of microarray experiments, the outliers are usually exactly what you're looking for: things that are different from the rest. So, as I say, outliers can be good, and sometimes they can be bad. Obviously, if you're doing things like modeling, an outlier means your model is kind of messed up, and looking for outliers is also a way of checking whether you've messed up in your typing, measurement, recording, or whatever else. If you're looking at metabolite concentrations between one group and another, one of these outliers could be an indicator, a biomarker for something. So outliers can be useful when you're looking at significance. Now, we've looked at correlations; here's a scatterplot again. I think most of us realize there's a correlation between height and weight. This might be for a collection of rodents, and we can see a rough relationship between the height and weight of these bipedal rodents. You can calculate a correlation coefficient, it's pretty decent, and it tells you something about physiology. But you can also do something else with that data: you can cluster it. And if you knew what you were measuring, you might see that all the ones in this cluster, at least the ones visible to our eyes, were the males, and the pink ones were mostly females. So rather than correlating, and yes, height and weight are reasonably correlated, clustering would have let us identify something probably more useful: these rodents have very distinct subpopulations, a sexual dimorphism between males and females. So clustering is another technique, in addition to correlation, that we use in statistics, in bioinformatics, and in cheminformatics. It's widely used in metabolomics, in microarrays, in proteomics; we use it in phylogenetic and evolutionary analysis, in classifying protein structures, in grouping sequence families. It's ubiquitous; it's essential to essentially all parts of biological statistics. Clustering is a way of grouping things that are logically similar. If we're sorting socks from the laundry, we're clustering. It's a little different from classification. In classification, things are labeled: someone has gone through and labeled each sock, saying this sock goes with that sock, and that's what you're matching against. Sorting socks from the laundry, by contrast, you're judging by shape, color, and age whether they're supposed to go together. Clustering can help inform whether you should classify things and how they should be classified, but they are distinct activities. The clustering process requires a measure of similarity: similarity or dissimilarity coefficients that you can calculate. You also need a threshold to say, this passes my test, therefore I cluster these together. And with those dissimilarity coefficients, you need to be able to measure the distance between clusters. If you're sorting laundry and pairing socks, you usually have to start with one sock first: you start with a seed, and that starts the whole process. There are three different types of clustering algorithms in common use. One is called K-means, which is clustering by partitioning. Then there's hierarchical clustering, which is more frequently used in biology.
With hierarchical clustering, you're building a progressive nesting of clusters or groups. And the last type is more computationally based: self-organizing feature maps, which is clustering through training, machine learning if you want. The K-means, or partitioning, algorithm goes like this: first, grab an object and make it the center of the first cluster (with one object, the center is just that object, the centroid). Then grab another object and calculate its similarity to that center. If it passes the similarity threshold, you join the two, and once you've joined them, you calculate a new center from the two objects. Then you go back, grab another object, and see whether it should join a cluster or not; if it doesn't pass the threshold, it can't join that group, and you repeat. Here's how you might do K-means with different colored balls. You choose one, and its color defines the center. Then you choose another and decide, based on a rule, say, whether its absorbance wavelength is within 50 nanometers of the cluster center's, whether to join them. In this case it passed, so you join those two bluish turquoise balls together. You try it for the others, and the other three probably wouldn't fit, so you end up with essentially four clusters: one with two objects and three with one object each. Hierarchical clustering is a little different. It's a process of finding and merging until everything has been merged, so in the end you've got one cluster, but everything in it is ranked. You don't just randomly grab one object; you do a complete pairwise comparison of everything against everything. Only after you've done all those pairwise comparisons do you join the closest pair, then the next closest pair, and the next one, eventually assembling a large tree, which in this case is the hierarchical cluster. This is how heat maps are often generated, and how evolutionary or phylogenetic trees can be generated. In the end, with hierarchical clustering, you've joined everything into one massive single tree with many branches, and the branch lengths tell you how close or how distant things are. Hierarchical clustering is generally the preferred route for biological clustering problems.
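Both flavors are built into R; a sketch on invented height/weight data for two subpopulations, like the male and female rodents from a few slides back:

    set.seed(9)
    males   <- cbind(height = rnorm(20, 70, 2), weight = rnorm(20, 180, 15))
    females <- cbind(height = rnorm(20, 64, 2), weight = rnorm(20, 140, 15))
    xy <- rbind(males, females)

    kmeans(scale(xy), centers = 2)$cluster   # K-means: partition into 2 clusters
    plot(hclust(dist(scale(xy))))            # hierarchical tree (dendrogram)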
So that's clustering. Now we've done clustering, we've done distributions, we've done ANOVA, analysis of variance, and we've talked about correlation. These are all pieces of the puzzle that allow us to do the big thing, which is multivariate statistics. Just like univariate statistics, we can start with a population, but instead of measuring one variable, height, we're going to measure many variables in this population: their height, their weight, their hair color, their clothing color, their eye color, and everything else. In the case of metabolomics, we could be measuring all the metabolites in their blood: 50 to 100 different compounds. Because we're measuring more than just height, or more than just one metabolite, we are now doing multivariate statistics, and as part of that we do a fair bit of what's called dimensional reduction analysis. On top of that, when we do metabolomics experiments, we not only look at a population, in this case a bunch of people or a bunch of rats; we may also make repeated technical measurements, which we call technical replicates. The intent when we have these populations, say 30 cases and 30 controls, is that the 30 cases serve as biological replicates. They're supposed to be somewhat similar to each other, we hope; healthy people should be roughly metabolically similar. But there's variation. We'd also like our equipment to reproduce exactly the same measurement over and over again, and I think we all know that doesn't happen either. So in many cases we design our experiments to have both biological replicates, which is the point of having 30, 40, 50 cases, and technical replicates, which might mean collecting a couple of samples from each person. That's a part we'll touch on again later in the course; it's a bit of a digression here. So whether it's a few metabolites or hundreds of metabolites, whether it's multiple instruments or one instrument, we're basically dealing with multiple variables. To make multivariate statistics work, to make it feasible, we want to reduce the data from multivariate to mostly univariate. Once you can convert things to univariate data, you can apply the t-tests and the ANOVA and all the other things we've just learned about. To convert multi-dimensional data to one-, two-, or three-dimensional data, we use a thing called dimension reduction. In mathematical terms, we talk about variables as dimensions: x, y, z is three dimensions; A through Z is 26 dimensions; 100 metabolites is 100 dimensions. The technique for doing dimension reduction is called principal component analysis, and most people have probably heard of PCA. It is a mathematical process that converts a set of generally correlated variables into a smaller number of uncorrelated variables. Height and weight are correlated, so technically you could fold them into a single variable. Done correctly, a PCA can take literally thousands of variables and reduce them to two or three features. In metabolomics, those features won't represent single metabolites; they'll represent weighted sums of metabolites. We've actually seen this before, but this is a more detailed example. Here are three spectra, from three different rat samples, each with hundreds of peaks. We might have maybe 30 different samples, in this case controls plus four or five animals in each of the two treatment groups, called here the PAP and ANIT groups. We can take all of that spectral data and see that there are three clusters. The other thing you should be able to do, just with your eyes, without being an NMR expert or a mass spec expert: you can see simply by looking at the spectra that these three look very different. What PCA is doing is what you can almost intuitively, naturally do with your eyes, and generally PCA captures what you should be able to detect visually. If it gets really complicated, maybe you can't see the differences. But this is NMR data, and you don't have to be an NMR wizard to say these three spectra are very different.
And if you looked at 40 of the PAP spectra together, or 40 of the ANIT spectra together, they'd probably all look pretty much the same, so you would be able to cluster them quite naturally, just through intuition. All PCA is doing is making that mathematically formal. If your eyes can't see any difference, if all the spectra look identical to the PAP spectrum, PCA is not going to help; all it will show is one giant cluster. So how can we think about principal component analysis? One way is the idea of projecting shadows. Take a three-dimensional object, a bagel or a donut, hanging by a wire in a dark little box, and take two flashlights; what we're working with here is projections. Shine one flashlight on the bagel from one direction and it casts a shadow that looks like an O. Shine the other flashlight from another direction and you get a projection on the other wall, again two-dimensional, that looks like a sausage or a wiener. One projection is more informative than the other. If someone showed me the O-shaped shadow and asked what kind of pastry this is, I would guess donut or bagel. If someone showed me the sausage-shaped shadow and asked what kind of pastry it is, I might say it's not a pastry, it's a hot dog or a hot dog bun, and I'd be wrong. So one projection is more informative than the other. The most informative projection is called principal component one; it's the one with the largest eigenvalue, the most information in it. The less informative one is principal component two. It still gives you some information, for instance that this donut, this bagel, actually has some depth to it. So that's taking a three-dimensional object and projecting it into two dimensions. We could have a five-dimensional object and project it into two dimensions; you can do that mathematically, though it's not so easy to do with flashlights and boxes. But that's the concept, the essence of principal component analysis. [Audience question, paraphrased: does the assignment of PC1 and PC2 to particular axes mean anything?] No, which axis is which is just however they've chosen to plot it. By definition, the principal component with the largest eigenvalue, the largest weight, is PC1, the most important one, and usually the most they'll show is three principal components. When the algorithm runs, each eigenvector gets a weight, which is essentially the magnitude of its eigenvalue. You're trying to explain very complex data with a minimal set of components, and the size of the eigenvalue determines which components matter most. If you've done a good job, and the data are fairly clean, there are only about three or four significant eigenvalues in most data sets; in some cases just two, in some cases just one. And if only one eigenvalue is significant, you should have been able to see the difference in the data right away. [Are the eigenvalues scaled?] Yes, they're usually scaled and reported as a proportion of the total, so you can say, for instance, that the first component accounts for 50% of the total of all the eigenvalues.
But you'll find that the eigenvalues fall off precipitously after the first few; the rest are tiny. So formally, what you're doing with PCA is transforming the data to a new coordinate system: rotating it in two dimensions, or three, or four, so as to maximize the variance of the data along the new coordinates, with the maximal variance on the first coordinate and the second greatest variance on the second coordinate. A two-dimensional PCA plot plots the two most significant principal components; a three-dimensional PCA plot plots the three most significant ones. Mathematically, what you're doing is called a singular value decomposition. That's a linear algebra trick, and it's not easy to do by hand. It takes our data matrix, which might be 30 or 40 human samples by 100 or so metabolites, a 30 by 100 matrix, and decomposes it into a set of score vectors, the T's, and a set of loadings, the P's. The data are related to the scores and loadings through the equation X = T P^T. So the t1 score is the first loading times, say, the concentration of alanine, plus the second loading, p2, times the concentration of citrate or whatever it might be, and so on: t1 = p1·(alanine) + p2·(citrate) + ... This is how we convert our data into a collection of what we call scores and loadings. This next slide is something you guys don't have, but I thought it would be helpful. The singular value decomposition takes your data matrix on the left and breaks it down into three other matrices; the one in the middle, S or sigma, holds the singular values, essentially the eigenvalues. A more realistic, more understandable example than raw matrices: take Phil Mickelson, Tiger Woods, and Vijay Singh as the golfers, and here are their golf scores on the first nine holes of Sunday's PGA round. You can, in essence, predict what their scores should be based on the difficulty of each hole, the player's skill, and a scaling factor. If we take those three tables, three matrices, and multiply them out, we get their scores; we can, in essence, predict their scores. What we're doing in singular value decomposition is the reverse: taking their hole scores and decomposing them into those three matrices. There's obviously information in those three matrices, and those are what we're after in PCA analysis.
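You can see the scores/loadings/eigenvalues correspondence directly in R; a sketch reusing the height/weight matrix from the clustering example above (prcomp is R's PCA, svd the raw decomposition):

    X  <- scale(xy, center = TRUE, scale = FALSE)   # centered data matrix
    pc <- prcomp(X)
    sv <- svd(X)

    pc$x[1:3, 1]                    # scores on PC1 for the first 3 samples...
    (sv$u %*% diag(sv$d))[1:3, 1]   # ...match U * D from the SVD (up to sign)
    pc$rotation                     # the loadings, P (= V from the SVD)
    summary(pc)                     # proportion of variance per component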
Another way of understanding PCA is to take an example that Roy Goodacre gave a number of years ago, which I think is still one of the best ones. For some reason, someone gave him airport data. This was data that included the latitude, longitude, and altitude of every airport in the United States; there are actually about 5,000 airports. And this included the airports in Alaska and Hawaii and then the continental US. I'm not sure why they wanted to do this, but they said, can you do some cluster analysis? Can you do a PCA analysis to see if there's some interesting relationship between airports and their altitude and maybe their latitude? Okay, so you take the data and you crunch through it, and this is the PCA plot. If you look at it closely, you can see that it represents a cluster, and the cluster is the continental United States. There's Florida. This is California. This is Texas. This is Puerto Rico. This is Hawaii. And that's Alaska. It doesn't capture every airport in Alaska, but you can see the general shape of Alaska. The PCA analysis doesn't produce a shape of the US the way it's standardly drawn, but that's what the first and second principal components gave, and those are latitude and longitude. The PCs didn't come out as some odd combination, latitude plus longitude, or latitude plus altitude minus whatever; it's just that those are the two most obvious things. And then in the corner is this relationship between altitude and, I think, longitude, and it produces something bizarre; I'm not really sure what it is. But the point is that what PCA is, and this has been proven mathematically, is K-means clustering. I've already mentioned what K-means clustering is: that grouping of balls and deciding how things should be joined. So that's what PCA fundamentally is: it's a clustering algorithm, and it's fundamentally K-means clustering. And in this case the PCs didn't come up with some linear combination; they just came out as the direct variables, latitude and longitude, primarily. Once you've been able to do this dimensional reduction, you will end up with clusters of data. So we saw clusters of the continental US, clusters of Alaska, clusters of Hawaii in PCA space. And you can use things like t-tests and ANOVA if you want to determine whether these clusters are significant. We usually use visual cues to say whether these clusters correspond to things, and most of us still continue to do that; most of us just draw circles and say, this is a cluster here. But you can be a little more statistically rigorous by using things like ANOVA to determine whether those clusters are significantly grouped together or not. Then there was a question about how to map all this back onto a metabolomics table: if the data matrix holds the concentrations of, say, alanine and citrate across the samples, which of the decomposed matrices is the scores and which is the loadings, and whether there's an analogue of hole difficulty and player ability, and how the loadings tell you which feature is driving the separation. Yeah, so we'll get to that just in a second; we'll get to that very shortly. Basically, this is this idea of scores and loadings. The cluster plot that we saw showing, say, the US is a scores plot, and it replots the data along the different principal components.
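As a sketch of that airports idea in R, with made-up coordinates standing in for the real airport data:

```r
# Three variables per airport (latitude, longitude, altitude);
# the numbers below are invented, not real airport data.
set.seed(7)
airports <- rbind(
  cbind(lat = rnorm(200, 38, 5), lon = rnorm(200, -95, 10), alt = rexp(200, 1/300)),  # continental US
  cbind(lat = rnorm(30, 63, 3),  lon = rnorm(30, -150, 5),  alt = rexp(30, 1/100)),   # Alaska
  cbind(lat = rnorm(15, 21, 1),  lon = rnorm(15, -157, 1),  alt = rexp(15, 1/50))     # Hawaii
)

pca <- prcomp(airports, scale. = TRUE)
plot(pca$x[, 1:2])   # the "map" falls out of PC1 and PC2

# PCA followed by k-means: the clusters are the geographic groups
km <- kmeans(pca$x[, 1:2], centers = 3)
table(km$cluster)
```

The three geographic groups fall out as clusters in the first two score coordinates, mirroring the continental US, Alaska, and Hawaii picture.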
In the Tiger Woods, Vijay Singh example, it would be the weights of the eigenvectors, and the eigenvectors themselves would probably be a combination of hole difficulty and player ability, if you want. The other thing that's produced, and that isn't as frequently shown in papers, is called the loadings plot. That's showing you how much each of the variables, the metabolites, contributed to the different principal components. So these are all the metabolites: glutamine, citric acid, leucine, serine. They'll mostly all be labeled, though some of them aren't. And if we were looking at our scores plot and we saw that things were separated, let's say, just for the sake of argument, there's one cluster here and another cluster way up in the top corner. The things that were driving the separation of those two clusters are identifiable here. In this case it's phosphate and malate, which are out at the extreme; they're driving one cluster. And then oxoproline and maybe asparagine are driving the other one, up in the corner. So by almost superimposing your scores plot with your loadings plot, you can actually figure out which metabolites were driving or pulling the clusters apart. It's a visual way for you to reinterpret some of that data, and there are other ways, which you'll see in MetaboAnalyst, that also identify which variables are driving the separation or moving those clusters apart. Someone asked which components you would normally plot. You would normally do one and two, and if we had a three-dimensional PCA plot, like the one here that's 3D, then you're kind of stuck trying to do one projection and then another projection. But normally most people are just happy with two dimensions, and yes, it would normally be principal component one and principal component two.
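A quick way to superimpose the scores and loadings in R is a biplot; this sketch uses invented data, with the metabolite names from the example above:

```r
# Two groups of samples across four metabolites; the shifts below
# create the kind of separation a loadings plot would explain.
set.seed(3)
healthy <- matrix(rnorm(20 * 4, mean = 5), nrow = 20)
sick    <- matrix(rnorm(20 * 4, mean = 5), nrow = 20)
sick[, 1]    <- sick[, 1] + 3      # phosphate elevated in the sick group
healthy[, 3] <- healthy[, 3] + 2   # oxoproline elevated in the healthy group
colnames(healthy) <- colnames(sick) <-
  c("phosphate", "malate", "oxoproline", "asparagine")

pca <- prcomp(rbind(healthy, sick), scale. = TRUE)

# biplot() overlays the scores (points = samples) with the loadings
# (arrows = metabolites): arrows pointing toward a cluster mark the
# metabolites pulling that cluster away from the rest.
biplot(pca)
```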
There was also a question about the singular value decomposition itself. The notation there is a little more complicated: you get these three matrices, and it's not always obvious which one ends up labeled as the scores and which one as the loadings, because different textbooks use different conventions and different pictures. But at some level it's exactly equivalent to the PCA decomposition, and the picture I showed for the SVD is the one that's best explained, so that's why I default to talking about the SVD; they are equivalent. I still think the SVD picture gives the most intuitive understanding. You could, in principle, calculate PCA manually, but in practice the software does it with an SVD because it's faster, which is another reason I say they're equivalent. In the last few minutes here, we'll just try to wrap up, and we'll break for lunch. These are tough concepts, and I think good questions, but part of it is just that we're trying to get a conceptual understanding, the intuition behind it: okay, here's my plot, how do I interpret it, what's it doing? I think most people find two or three slides the most useful. One is the bagel projection, which is often the most intuitive way of understanding why and how we do dimensional reduction. The other one that most people find most useful is the PCA plot of airports, and how things essentially simplify. So again, PCA is in many ways a fairly intuitive process; it's just mathematically formalized. In many cases our brains can do it, but again, it's mostly intuition. Generally, people will use PCA as an initial foray, to see if their data is useful and is able to give them some separation. Some people look at the first and second components, the second and third, the third and fourth, and by the time they're going to the 11th and 12th principal component and seeing some kind of weak separation, basically all is lost at that point. If you can't distinguish with a PCA plot, then it's usually a hint that nothing you'll be doing could distinguish between those samples. Hitting it with more statistics, trying to look at more obscure principal components, again, is not something you should do. Now, sometimes PCA will give you a hint, not a strong hint, but still a hint, and this is when people will start using another technique called partial least squares discriminant analysis, PLS-DA. This is not a clustering technique; it's a classification technique. It requires that the data be labeled, and what it is effectively doing is taking what you might have been able to get from a PCA and further maximizing that separation. It's using a bit more information, because you've labeled things, saying healthy, sick. PCA says there's not much of a difference, but now you've labeled it, and so in essence the program can say: I'm supposed to distinguish between healthy and sick, so I'll play around with the variables a little more and stretch things out. You rotate the coordinates even further. But PLS-DA is a prediction, whereas PCA is simply clustering, and when you make predictions you can actually overtrain: you're telling it, here's the answer, now make it look good. If you give someone the answer key for a test and have them take the test, they should do very well. Then you give them a new test without the answer key, and they may not do so well. That's the essence of overtraining. So there are different ways of assessing whether your PLS-DA has actually been cheating too much, or overtrained. One way is to use what are called R-squared and Q-squared assessments; the other is called permutation testing. R-squared is that same thing, the correlation index, not the correlation coefficient; it refers to the goodness of fit and ranges from 0 to 1. Q-squared is the predicted variation, or the quality of prediction, and it also ranges from 0 to 1.
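As a rough sketch of fitting a PLS model and reading off R-squared and Q-squared in R, assuming the pls package (one choice among several; this is not necessarily what MetaboAnalyst uses internally), with invented data:

```r
# PLS-DA sketch: encode the class labels as 0/1 and fit a PLS regression.
library(pls)

set.seed(11)
X <- matrix(rnorm(40 * 10), nrow = 40)
X[1:20, 1:2] <- X[1:20, 1:2] + 1.5   # weak group difference in two variables
y <- rep(c(0, 1), each = 20)          # labels: 0 = healthy, 1 = sick

d <- data.frame(y = y, X = I(X))
fit <- plsr(y ~ X, ncomp = 2, data = d, validation = "LOO")

# "train" gives the goodness of fit (R-squared); "CV" gives the
# cross-validated version, i.e. Q-squared. The two should roughly agree.
R2(fit, estimate = "train")
R2(fit, estimate = "CV")
```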
Basically, R-squared and Q-squared should match reasonably well with each other. Just like with correlation coefficients or correlation indices, an R-squared of 0.2 or 0.3 is pretty crummy, and an R-squared of 0.7 or 0.8 is pretty good. Same thing with Q-squared: if it's 0.2 or 0.3, that's pretty crummy; if it's more than 0.5, typically that's pretty good; 0.9 you'll almost never see. Q-squared is a cross-validation assessment, and you can also assess significance through permutation testing. MetaboAnalyst calculates R-squared and Q-squared, but my preference is actually just to do permutation testing; that's the most robust way of doing it. R-squared and Q-squared are kind of inventions of chemometrics, classical standard statistical stuff, but permutation testing is ubiquitous in the field of machine learning and now, more and more, in statistics. So what is permutation testing? This is the other key slide. Let's say you've got a PCA analysis; things are unlabeled, and initially it looks like there's no real separation. One option is to quit. The other option is to look at the labeled data, say blue and red. Once the data is labeled, you can see a bias. It's not a perfect separation, and maybe if we were plotting this in three dimensions it would separate more, but you can see that the red is starting to separate: the blue is higher up and the red is somewhat crossing below, though it still required the data to be labeled. If we run PLS-DA with that data, it has just a little bit more information, so now it's going to rotate things a bit further, and after PLS-DA we see this nice separation. But before we celebrate, what we want to do next is test whether that separation was obtained by cheating. So what the computation does is re-label all of the data. This top one used to be blue, now you say it's red; this bottom one used to be red, now you say it's blue; everything is switched. You run PLS-DA on that re-labeled data, and it shows no real separation. Then you re-label again, a different shuffling, run PLS-DA again, and again, still no separation. Each time, you track a score for how well things are separated. The first run, with the true labels, gives a great separation; the next ones, the second, third, fourth re-labelings, give poor separations. So you can rank the separations, and if it turns out that the true labeling, the one up here, is better separated than everything else you've done, and you've got hundreds and hundreds of re-labelings that all look worse, then you can say the separation is significant. And based on the number of times you've permuted: if you did it a thousand times and the true one is still out there on its own, you can say it's significant at a p-value of one over a thousand, 0.001. Someone asked which point represents the true labels; yeah, it's the one out at the extreme, and the others are the re-labeled runs. So you'll do this many hundreds of times, and of course none of us would do this by hand; the permutation testing software does it for you. It generates, in essence, synthetic experiments to determine whether the separation you're seeing is real. That's right, it's the most robust and intuitive way of doing it.
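Here is a bare-bones permutation test in R to show the mechanics; the "separation score" is just a difference in group means rather than a full PLS-DA refit, and all the data are invented:

```r
set.seed(99)
value  <- c(rnorm(20, mean = 0), rnorm(20, mean = 1))   # invented measurements
labels <- rep(c("healthy", "sick"), each = 20)

# Separation score: absolute difference in group means
sep_score <- function(v, lab) abs(diff(tapply(v, lab, mean)))

observed <- sep_score(value, labels)
perm <- replicate(1000, sep_score(value, sample(labels)))  # shuffled labels

# Empirical p-value: how often a random relabeling separates as well as
# (or better than) the true labels. With n permutations the smallest
# attainable p-value is 1 / (n + 1), so pick n to match your alpha.
p_value <- (sum(perm >= observed) + 1) / (length(perm) + 1)
p_value
```

With a thousand shuffles and the true labeling beating all of them, that gives the p-value of about one in a thousand mentioned above.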
I like permutation testing as a rule. It's very calculable, and it's a very robust test. You can permute as much as you want: if you want a p-value of 0.00001, you can permute a million times; for a p-value of 0.05, you can permute twenty or thirty times. So you can tune it to your level of significance, your alpha. And this is the sort of thing you'll see, it's not unheard of, where the PCA just doesn't give you quite what you expect, but then going to that next step, PLS-DA does seem to give you something a little better. And obviously there are other techniques you can use beyond PLS-DA: you can go to support vector machines, random forests, naive Bayes, neural networks. These are all classification techniques that are commonly used, and they're also available through MetaboAnalyst, as you guys will see. If you're trying to attack your data and break through that wall, there's a progression: start with the simplest tool, which is PCA, and hopefully that works. If it doesn't, then you bring in the heavier machinery, PLS-DA and the other supervised methods, and as you saw in the previous slide, sometimes that does a pretty good job; the machine learning methods are a little smarter about separating things or identifying features, and sometimes that will be sufficient. So there's this progression. First, see if there are natural clusters that form; if so, great, you've probably got great data to work with. If the clustering isn't that obvious, then you can bring in some of these supervised machine learning methods: classification, PLS-DA, neural nets, other things. But once you are using those supervised methods, you have to assess whether the results are significant, and that's where permutation testing, or R-squared and Q-squared, is important. The supervised methods like PLS-DA, SVMs, and neural nets are very powerful, and people get enamored with them, so much so that they actually skip the PCA, and they shouldn't; that's when people start making errors. And in many cases people do PLS-DA straight off and won't do the permutation testing; that's another major error. In the end, if you can't see separations from PCA, or by eyeballing some of the data, just looking at it before going straight to a computer, then no matter what you do, it's going to essentially give you insignificant results. So again, it goes back to almost the first slide: statistics is the mathematics of impressions, or the mathematics of intuition. Always look at your data. The trends should be reasonably obvious; if they aren't to your eyes, then the statistics aren't going to fish them out for you. They rarely do. So that's it for statistics, and I think it's lunchtime now.