All right, so we're on lecture number five in your books. These are the usual slides, under Creative Commons. What we're going to be doing this morning is a background on statistics. Some people who have taken a lot of statistics, or are statisticians themselves, can fall asleep here. But for those of you who aren't that familiar, haven't taken a course in a while, or are puzzled, this will hopefully cover some areas. After this background on statistics, we're going to spend the next two sections making use of some software that Jeff has written called MetaboAnalyst. It will have you go through a variety of metabolomic data, and you'll use some of the ideas and concepts that we talked about yesterday, but primarily today. So we're getting into the meat of data analysis here as it refers to metabolomics. But it's also very, very similar to what you would do with microarray data and with proteomic data, so it's one size fits all.

So we're now on day two, statistical methods. As I say, these two areas are going to be essentially connected. We'll give you a lecture on MetaboAnalyst, and then there's a bunch of questions. Essentially, this is an opportunity for you to answer those questions from the lecture, but also to explore some of the software, because there's a lot to it. Then we'll close off at the end of the day with a broader perspective, discussions on a couple of other areas of metabolomics. It's only a two-day workshop, so we're not going to be able to cover things like fluxomics and modeling; those are easily full-day or multi-day discussions. So this is just to scratch the surface, and then we'll wrap up at the end.

OK, so this morning is about statistics. There are lots of perspectives and views about statistics, and I think these are some really useful quotes. One that I like is that there are three kinds of lies: lies, damned lies, and statistics. That quote is often attributed to Mark Twain, but it's actually from Benjamin Disraeli, one of the more prominent prime ministers of England back in the Victorian age, and I think it was a comment he was making about particular budgets. Another one is that 98% of all statistics are made up. I think that's true. And then there's one from Aaron Levenstein about what statistics reveal and conceal: what they reveal is suggestive, but what they conceal is often vital. Anyway, I've been involved in dealing with statistics and crunching numbers for a long time, and my view is that statistics is trying to formalize, or provide a mathematical framework for, an impression. If my gut says this is correct, that doesn't make a strong scientific statement. But if I can say that a t-test supports my gut feeling with a p-value of 0.05, then people will tend to believe it. It makes the impression quantitative, and it's framed in some kind of mathematical structure. That's not to diminish statistics; it's vitally important, and it's the framework on which so much is based in modern science, in economics, and in business. To really understand statistics, you almost have to start at the beginning, and that is the issue of distributions and statistical significance. So the first part we'll deal with is univariate statistics, and towards the end of this lecture we'll talk about multivariate statistics. Univariate statistics is often about populations.
One of the best ways of thinking about univariate distributions is to think of a collection of people where you're measuring a single dimension. In this case, we're going to be measuring height. You can see there's a collection of people of different heights, and also different weights, but height is what we're going to be looking at. Univariate sounds fancy, but it just means a single variable. If we're looking at a population, we could be looking at people, at objects, at test scores, whatever, and we're measuring a single variable. Today it's height; we could have taken weight, or a test score, or IQ, whatever we want from this collection of objects. If you take that variable, in this case height, measure it over this population or a much larger one such as the entire population of Toronto or Canada or North America, and plot the frequency of those heights, partitioned into one- or two-inch intervals, you get this. I think everyone has seen this: we call it the bell curve. It's also called a Gaussian curve or a normal distribution. So we've partitioned everyone's height into, say, one- or two-inch intervals, going from the shortest to the tallest, and added up the number of people in each height range. So this could be four feet, four foot two, four foot four, and so on.

The characteristic of this bell curve is that it looks like a bell; that's what everyone learns about it. But it's remarkable because almost every variable we measure in nature and in living systems, even in physical systems, tends to follow it. It seems to be a fundamental property of many, many things, and it's quite striking. The normal distribution is fundamentally symmetric, meaning it's shaped evenly on either side, and that symmetry lies about the mean. Every normal distribution has a mean, or average, which you can calculate, and it has a width, and that width varies. That's the unique thing about the distribution: it always has this general shape, but it can be broad or narrow, and the width of the curve is described by the standard deviation, or sigma. As I said, it is the most common distribution known, and it seems to fit all kinds of things, particularly in the life and physical sciences. So that's the normal distribution.

The other thing people have noticed is that the more samples you collect, the better the curve looks. If you're only collecting from a population of 16, as here, and you try to plot the height distribution, you might find something a little strange: there might be a gap, or there might be a bimodal distribution. But if instead of 16 we had 1,600, that curve would be very smooth and very nicely shaped. This is something that people have known for a long time, and it's one of the reasons why classrooms evolved to have an average of about 30 students: there was a general observation that 30 was sufficient to give you a good normal distribution. Obviously if you get more, great, but eventually it gets too hard for the teachers.
So a rule of thumb that's been around for more than a century is that if you want a good sampling, 30 to 40 seems to be enough to get a good normal distribution. That's why a lot of pilot studies have 30 normals and 30 controls, or sometimes 15 normals and 15 controls, and that's sometimes sufficient to get the statistics you need for a normal distribution. The normal distribution can actually be defined by an equation, developed by Carl Friedrich Gauss, which is why it's called the Gaussian distribution. This equation has been around for a couple of hundred years now. You can see that e to the minus x squared is the general form, that mu is the mean, and that sigma is the standard deviation. You can see a partitioning marked off here in terms of sigma values, and you can integrate the area under the curve between these intervals. You'll find that this area here occupies 68% of the curve, this one 95%, and this one 99.7%. So the mean plus or minus 1 sigma is one standard deviation. From that there are also some formal definitions: the mean, which everyone knows how to calculate, and the variance, which is the square of the standard deviation; it's the sum of the squared differences between each value and the mean, divided by the number of observations. The standard deviation is the square root of the variance, and it's the one we tend to use more often. Those are the two quantities that are formally defined for normal distributions: the standard deviation is the width of the distribution, and the mean, as we know, is the average.

So plus or minus 1 sigma covers 68% of the area, plus or minus 2 sigma covers 95%, and plus or minus 3 sigma covers 99.7%. When we grade on a curve, as we do in universities, people in the middle interval, which makes up the largest group, typically aim for a grade of C. The next interval up is a B, and above that is an A; below the middle is a D, and below that is an F. That's grading on a curve, and that's how we partition things to identify what's typical and what kind of performance is exceptional.

There are also some numbers here for the extremes, which are often what we're more interested in when we're looking at large groups or populations. We want to know whether an observation is exceptional or not; it's like wanting to identify an A-plus student. These are essentially probabilities of how frequently exceptional observations might be seen. So the likelihood of finding someone above this height, which might be 6 foot 5, is about 15%. The likelihood of finding someone above this one, maybe 6 foot 10, is about 2%. And the likelihood of finding someone up here, above 7 feet 2 inches, is about 0.1%. Same sort of thing on the other side: what's the likelihood of finding someone under 4 feet, or under 3 feet? That's the distribution, and we can use those numbers to figure out whether an observation is extremely rare, or what the likelihood of making that observation is. That's an important thing to know, and that gets into the issue of significance. If something is more than one standard deviation away, either larger or smaller, the likelihood of seeing it, say someone above 6 foot 5 or someone below 5 feet, is about 32%.
The likelihood of finding someone who's above 6 foot 10 or below 3 and a half feet is maybe 5%. And likewise, as we go out to three standard deviations, finding someone above 7 foot 1 or below 3 feet is very rare. So this is either-or; as I say, it can be one-sided or two-sided, and we just divide those numbers by 2 to get the one-sided likelihood, or the significance of that observation. It's the same analogy I gave you before about grading. Class sizes in first-year university are 200 to 400 students, and when we grade on a curve, that middle interval is a C, those above one standard deviation get a B, those below one standard deviation get a D, and then A and F on the far sides. I think everyone is familiar with this, but it's important to know those basics, because now we're going to get into some of the real statistics.

There's a term we use in many fields, certainly in metabolomics, in statistics, in microarray studies, and in proteomics studies: the p-value, P standing for probability. It's the probability of obtaining a test statistic, which could be a score, a set of events, or a height, at least as extreme as the one that was actually observed. In statistics we use hypothesis testing, accepting or rejecting a null hypothesis. You reject the null hypothesis when the p-value is less than the significance value, which we call alpha. And actually, alpha is an arbitrary number. Historically we've chosen 5%, but you can choose a value of 10% or 1%; there's no reason why 5% is God's law. That seems to be forgotten by many people, but 5% is the one that's standardly used. So when you hear about political surveys trying to identify voter preference, and they say so-and-so's party is at 48% and the other party is at 32%, with the result correct 19 times out of 20, it's the 5% rule they're using. Anyway, the null hypothesis is that two populations are the same, or that an observation is inconsequential, and when you reject the null hypothesis you say, I found something that is significant. I found something significant at the 5% level, so my likelihood of being wrong is 5%. Or 1%, or 10%. Some people are satisfied with a likelihood of being wrong of 50%, and that's OK, if that's what you've agreed to. If someone feels you need a likelihood of being wrong of only 5%, that's again a bargain, a discussion, a debate. As I say, it's not God's law that 5% is followed.

So let's go back to this issue of height. The average height for an adult in North America, men and women combined, is maybe 5 foot 6 or 5 foot 7; men are a bit taller, women a little shorter, and it comes out to something close to that. If the standard deviation is 5 inches, then we can say that roughly 68% of the population is between 5 foot 2 and 6 feet. So we ask ourselves, what's the probability of finding someone who is more than 6 foot 10? If we choose an alpha of 0.05, then 2 sigma gives us roughly 6 foot 5, 3 sigma gives us 6 foot 10, and you have to be above 3 sigma to get to 6 foot 11. So you can frame your null hypothesis this way: if we find someone who's 6 foot 11, are they a human being?
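To make those tail numbers concrete, here is a minimal sketch in Python using only the standard library; the mean of 67 inches (5 foot 7) and the standard deviation of 5 inches are simply the example numbers from above, not real survey data.

```python
# Hedged sketch of one-sided normal tail probabilities (not from the lecture slides).
from math import erf, sqrt

def upper_tail(x, mu, sigma):
    """One-sided probability of seeing a value greater than x under a normal distribution."""
    z = (x - mu) / sigma
    return 0.5 * (1.0 - erf(z / sqrt(2.0)))

mu, sigma = 67.0, 5.0              # assumed mean height 5'7" and standard deviation 5 inches
for z in (1, 2, 3):                # one-sided tails at 1, 2 and 3 sigma
    print(f"P(> {z} sigma) = {upper_tail(mu + z * sigma, mu, sigma):.4f}")
# about 0.16, 0.023 and 0.0013; doubling these gives the two-sided 32%, 5% and 0.3% figures

print(f"P(height > 83 in, i.e. 6 foot 11) = {upper_tail(83.0, mu, sigma):.5f}")  # z = 3.2, about 0.0007
```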
And in fact, if we use this alpha of 0.05, then on finding someone who's 6 foot 11 we would conclude that they are not a human being, because the probability is below our cutoff. Now we could choose a different cutoff of 0.01 and say, oh, this fellow on the basketball court is a human being after all, because that was our hypothesis; we've simply changed our cutoff. Normally we don't have something that extreme, but the point is that the choice of alpha is somewhat arbitrary.

Here's the same sort of thing. If you flip a coin 20 times, and I think most of us have done something like this in high school or junior high, and that coin turns up heads 14 out of the 20 times, is this a fair coin? You can calculate the probability of this, and it comes out to about 0.058. So if we chose an alpha of 0.05, we would conclude that this is not a fair coin. If we chose a value of 0.1, we would conclude it is a fair coin. It's pretty hard to make an unfair coin, so I think most of us, from intuition, would say this is a modest observation and it's likely a fair coin. Rather than rejecting my null hypothesis because I've got a value of 0.058, I would probably just adjust my alpha, choose a higher one, and say, yes, this seems to be a fair coin. As I say, there's no official rule that alpha has to be 0.05. And I say this because I've seen a lot of studies where people publish things, get a significance value of 0.07, and their conclusion in the paper, stated in the abstract, is that the observation is insignificant. No, it's not. There's actually a pretty strong trend there, and they're choosing an arbitrary cutoff. Some are smart enough to say there's a strong statistical trend, an interesting observation. But a lot just use this cutoff, toss out the whole work, toss out the PhD thesis, and say nothing important came out of it. And that's not true.

We also have to be aware that not every distribution is perfectly symmetric. We sometimes deal with asymmetric or skewed distributions, and when distributions are skewed we have to talk about different measures of centrality. In a symmetric distribution the mean is the same as the median, which is the same as the mode, but in an asymmetric or skewed distribution they differ. The mean is the average of all values. The mode is the value that occurs most often. And the median is the middle value, the one halfway through the ranked list of observations. So they're calculated differently. If a skewed distribution is seen, and that's not uncommon, the average can be affected quite significantly by an extreme value, an outlier; it can totally throw off your average. The median, that middlemost value, is something a lot of people like to use because it's not so affected by extreme values. Alternatively the mode can also be a good choice. Unfortunately, a lot of people don't know about these two, and a lot just choose the mean. Again, not every distribution is normally distributed. Likewise, not every distribution has just a single peak; you can find bimodal and trimodal distributions, and that's sometimes important to identify very early on. So knowing the shape of your distribution, and actually looking at it, is critical, and it's critical in determining what sort of statistics you can use.
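To make the mean, median, and mode point concrete, here is a small, hedged sketch with made-up numbers showing how a single extreme value drags the mean around while the median and mode barely move.

```python
# Illustration only: invented heights (in inches), not lecture data.
import statistics as st

heights = [64, 65, 66, 66, 67, 67, 67, 68, 69, 70]   # a small, roughly symmetric sample
print(st.mean(heights), st.median(heights), st.mode(heights))   # 66.9, 67.0, 67 -- nearly identical

skewed = heights + [95]                               # add one extreme outlier
print(st.mean(skewed), st.median(skewed), st.mode(skewed))      # mean jumps to ~69.5, median stays at 67
```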
The statistics I've been talking about are largely based on assuming that the distribution is normal or near normal. If it's not, if it's bimodal, forget it. If it's highly skewed, forget it. Some of the other distributions that people have observed are the binomial distribution, which in the limit trends towards a normal distribution; that is in fact how the normal distribution was derived. At another extreme of the binomial distribution is the Poisson distribution, which is the statistics of very rare events. There's one called the extreme value distribution, which is applied to the analysis of sequence alignments. And then there are a variety of skewed or exponential distributions that you can see.

The binomial distribution was really the first one that was formally worked out, and it follows the structure of Pascal's triangle, or the binomial coefficients. So for p plus q, or x plus y, the first power tells you the coefficients; then you see x plus y to the second power, the third power, the fourth power, and these are the coefficients that sit in front of the variables. You also see, essentially, a distribution in terms of frequency. If p and q are the probabilities of observing heads and tails, you can calculate, based on the number of coin flips, how many times you might expect to see a given number of heads and tails. So after five or six coin flips, what's the likelihood of seeing no heads and five tails, one head and four tails, two heads and three tails, and so on. As n gets large, and as p and q trend towards 0.5, you end up with a curve that looks basically like a Gaussian. If n is 30, which is large, your binomial distribution will look pretty much like a Gaussian distribution. On the other hand, if p is very small or q is very small, or if the observations are relatively few, you get the Poisson distribution. An extreme Poisson distribution looks almost like an exponentially decaying function, but more typically, when someone talks about a Poisson distribution, it has this sort of shape. As the average gets larger, and as n gets larger, the Poisson distribution eventually trends towards a normal distribution. So its shape depends on its parameters: it depends on mu, and it depends on x, the variable. It doesn't have one fixed shape; those are probabilities.

The extreme value distribution is the distribution you get when you start sampling at the extreme ends of a normal distribution. What happens in a university is kind of instructive. In university we tend to choose the top students from high schools, typically the top 30%. In graduate school we tend to choose the top 20% from undergraduate. So already there's an extreme selection going on. Yet in many cases we still assume there's going to be a normal distribution, and in fact that's not the case. We should really be grading on what's called the extreme value distribution. Most profs don't know about it and don't know what the formulation is. The extreme value distribution is a skewed distribution. You could imagine if you simply took all of the A-plus students and had them take a test: even if it was a very tough test, they'd still probably score close to 100% on it. Can you grade on a curve with those?
The distribution you get looks like this: yes, some will do a little worse, but you'll still get this kind of shape, an extreme value distribution. It is a skewed distribution. You'll also tend to see these things that almost look like outliers, and if you don't modify that distribution, it messes up your normal calculations: the p-values, the t-tests, the ANOVAs, and things like that. So this is a skewed distribution, an extreme value distribution, and as I say, not every distribution is normal. It's a critical thing in statistics, a favourite activity of statisticians, to fix these skewed distributions. The way you can fix them, to make them look Gaussian, to make them look normal, is to do a log transformation. If you do a log transformation of an extreme value or skewed distribution, it does a rescaling, basically, and you can see that here. This is characteristic of an extreme value or skewed distribution: there are lots of values off to one side, very few in the middle, and the mean, median, and mode are all different. If I just took the log base 10, or the natural log, of every single value, this is what I get. This is standardly done with microarrays, actually; people often log-transform the intensities because many of the intensities have this skewed distribution. After the log transformation, you get a very nice normal distribution. Now, that was an artificial example. Here is some real data, and this is more typical of what you'll get: it just decays away, and it looks hopeless. Do the log transformation and this one isn't exactly normal, but it's much better than before, and this other one actually is normal. That's what the log transformation did. Once you've done that transformation, that normalization as it's sometimes called, you can do the standard statistics that were originally developed for normal distributions. If you tried to do your statistics on this or this without transforming properly, your results would be meaningless, useless. So looking at the shape of your distribution, looking at your data, is critical. Many people don't bother to look at the shape of their distribution; they just collect their data and start applying standard statistics. The numbers and formulas will work. The results will be meaningless.
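Here is a minimal sketch of that log transformation, using synthetic log-normally distributed "intensities" rather than any real microarray or metabolomics data; the point is simply that the skewness largely disappears after taking logs.

```python
# Hedged sketch: synthetic skewed data, not lecture data.
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)
raw = rng.lognormal(mean=2.0, sigma=0.8, size=500)   # heavily right-skewed "intensities"
logged = np.log10(raw)                               # the log transformation

print(f"skewness before: {skew(raw):.2f}")           # large positive skew
print(f"skewness after:  {skew(logged):.2f}")        # close to zero, i.e. near normal
```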
OK, so we've talked about a single population and the types of distributions we'll see, which tells us certain things and is certainly informative for how we do things like grading. But then there's the other situation: how do we distinguish between two populations? This is of the greatest interest, because when you're running experiments you're typically looking at two populations: healthy and diseased, cells grown in one culture versus cells grown in another, something exposed to one treatment versus another. Two populations: are they different? Did the metabolite levels change? So we'll use height again, since everyone understands that. We're looking at normal people, and we're looking at green leprechauns. If we measured their heights, it's fairly obvious from the picture that one population is different. We can plot the two height distributions, and then the question is: are they different? Are normal people different from leprechauns, at least in terms of height? Now let's say we did another population check and went to the leprechauns of Botswana, and that leprechaun population is much taller. We measure them: are they different from the normals? This is a statistical question that's been asked for a hundred years: how do you distinguish populations? It could be height, or it could be one batch of beer versus another batch of beer, which is actually where this kind of statistics originally came from, and you can measure a variety of features, whether it's height or taste or anything else.

The statistical test developed for this is called Student's t-test, or just the t-test. It's the one everyone uses to determine whether two populations are different; officially, a t-test allows you to calculate the probability that two sample means are the same. That's the null hypothesis: are the averages the same? Is the average height the same? So you calculate a t-statistic; we're not going to go through the math, because it's done on computers now. If the t-test gives you a p-value of 0.4 and you've chosen a cutoff of 0.05, then 0.4 is greater than 0.05, and we cannot say the two populations are different; we accept that the means are the same. In this case we would get something like this. If we did a t-test on another pair of populations and got a p-value of 0.04 with our cutoff of 0.05, then the two populations are different, and that's this one. So that's pretty intuitive: there's a cutoff, and if you're above the cutoff, you didn't make it; if you're below the cutoff, great, they are different. The point again is that this cutoff is arbitrary. By convention people choose 0.05, but it's not written in stone. There are other types of t-tests: paired and unpaired. The paired t-test is typically used when you've got a single population that you're monitoring over time, before treatment and after treatment. The unpaired test is the one that's more commonly used, and that's for two different populations, control and diseased, or something like that. You can also use Student's t-test to determine whether two clusters are different: is this cluster different from this cluster or not? Again, these are just populations, and we're looking at two variables, but we can scale the variables and combine them to create a single variable, so it's essentially a way of measuring the difference. It is a way of determining whether two clusters are different.
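As a concrete illustration of the unpaired t-test just described, here is a hedged sketch using scipy; the two height samples are simulated, not real measurements.

```python
# Sketch only: simulated heights for two groups.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)
normals     = rng.normal(loc=67, scale=5, size=30)   # mean 5'7", sd 5 inches
leprechauns = rng.normal(loc=40, scale=5, size=30)   # a clearly shorter population

t_stat, p_value = ttest_ind(normals, leprechauns)    # null hypothesis: the two means are equal
print(f"t = {t_stat:.1f}, p = {p_value:.2e}")        # p far below 0.05, so reject the null
```

For a before-and-after design on the same subjects, scipy.stats.ttest_rel is the paired equivalent; for three or more groups, the ANOVA described next is the appropriate generalization.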
What if you don't have two populations, but three? So we've got, in this case, the normals, the leprechauns, and the elves. Are they different? We could plot the heights, and most of us, whether from the picture or from the distributions, can probably say, yes, they are different. The statistics, the t-tests or the other tests we'll talk about, give you a formal mathematical justification for your obvious initial impression. Again, if we looked at different populations of normals, leprechauns, and elves, maybe from a different region of the world, are they different? Most of us would say no, they're not. Can we say that statistically, with any robustness? To do that we don't use a t-test, because we have more than two populations; we use analysis of variance, or ANOVA. So you switch gears: ANOVA is used to determine whether three or more test populations are different. It's a generalization of the t-test; it looks at group variance and determines whether or not the means of several groups are all equal, whereas the t-test only asks whether the means of two groups are equal. And instead of a p-value from a t-test, you get an F statistic. There are one-way, two-way, and three-way ANOVAs, and more, as you add factors. The most common ANOVA method is the one-way ANOVA, just as the unpaired t-test is the most common t-test. It just answers whether any one of those three populations is different; it's not going to tell you which pair is different, just whether any of them differ, which is fair enough, and that's why it's the most common form. You can also use the ANOVA concept to ask whether three or more clusters are different. Again, visually most of us can pick out three clusters, but can you say statistically that those clusters, or any one of them, are different? ANOVA can be used for that, whether it's three, four, or five clusters.

Now, do you need to compare each pair of groups against each other? If you're comparing each group with every other group, that's technically more like a two-way or three-way ANOVA. But if you just want to know whether any one of these groups is different, the one-way ANOVA, potentially, is enough. You could also just do pairwise t-tests to see whether particular pairs are distinct. We'll get back to this when we talk about clustering, because at that point you're really looking at a multivariate problem rather than a univariate one.
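Here is a minimal sketch of the one-way ANOVA just described, again on simulated heights; it only answers whether any one of the three group means differs.

```python
# Sketch only: three simulated groups, scipy's one-way ANOVA.
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(2)
normals     = rng.normal(67, 5, 30)
leprechauns = rng.normal(40, 5, 30)
elves       = rng.normal(50, 5, 30)

f_stat, p_value = f_oneway(normals, leprechauns, elves)
print(f"F = {f_stat:.1f}, p = {p_value:.2e}")   # small p: at least one group mean differs
```

When a test like this, or a t-test, gets repeated across a long list of metabolites, the multiple-testing problem discussed next appears.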
So far we've been talking about univariate tests. But what if we're looking at many variables at once, going from 2 or 3 to 100? This is the situation we typically see in microarray experiments and metabolomics experiments: there might be 10,000 genes, or 100 different metabolites, or 1,000 different proteins, and you may look at each one individually. So we're not necessarily doing ANOVA here; we're just doing pairwise tests, control versus disease, for this gene, the thioredoxin gene, the superoxide dismutase gene, asking whether each one is up or down. And in the course of looking through your list, you might find that 20 metabolite values, based on the distributions you're getting, had p-values less than 0.05; they seem to be different at a cutoff, an alpha, of 0.05. What's the probability that one of these is false? Well, if our cutoff is 0.05 and we've got 20 of these, we can do a rough estimate: 20 times 0.05 is 1, so we expect about one of them to be a false positive, and that may well be an underestimate. So how do you correct for this? There's lots of discussion. The Bonferroni correction is the most commonly used one, and there's also a false discovery rate correction, which is a little different.

The Bonferroni correction is the most trivial. Basically, if you've got a whole bunch of what appear to be significant hits, you take your standard cutoff of 0.05 and divide it by the number of apparently positive hits. So if you can find something with a p-value below 0.0025, that is, 0.05 divided by the 20 hits, those are the ones that are pretty robust, the ones you identify as significant. Now, this is a case with many apparent hits, which would be quite unusual. More typically you might find five results, or just one result, with a p-value less than 0.05, and if you're only finding one result below 0.05, that's the one you use. But if you're getting a whole bunch of what look to be positive hits, then you start to ask, is this real? So again, you essentially correct by the number of potentially significant hits, and you look for the very low p-values.

A question came up about the other 80 tests: yes, 80 of them showed nothing significant, so only 20 of these t-tests actually yielded something that seemed to be significant. Then why not divide by 100, since you don't know the outcome at the beginning? That's right, you don't, and there may be something to that. As I say, this example is pretty artificial, but these are the kinds of numbers we deal with; typically it wouldn't be 100 tests, it might be 1,000 genes, although you'd never get 1,000 metabolites; more like 100 or 200 metabolites in metabolomics. Another question: do I only divide by the number of positives? For example, if I have 17 metabolites and I'm looking for the ones with a significant difference, do I test it that way? Yes, though this is a bit of a specialty niche. We're still talking about univariate testing, and most of you should be doing multivariate work. If you did a whole bunch of paired or unpaired t-tests on your data, this is the sort of thing you'd have to worry about; you shouldn't normally do it this way, and we would be looking more at a multivariate analysis. But if you chose to do pairwise analysis down the whole list to see which individual metabolites were significant, these are the sorts of cutoffs you can use. People do this. It's perhaps a little tedious, and there's software that will do it, but it's essentially the thinking behind the Bonferroni correction. And if I chose a cutoff of 0.01 and got maybe 10 positives, would I divide by the 10? Yes, although if you chose your cutoff at 0.01 you might only get one result, and if you only had one result, that's fine, it's significant. The issue arises when you start to get a lot of hits and you say, I didn't expect this many; have I got some false positives? And because with a cutoff like 0.01 my false positives will be lower, right?
You can't really know that in advance; this is made-up data, so I don't know what you'd end up with. But if you chose an alpha of 0.01 and you only had one result below 0.01, you're done: there's no false discovery rate correction, no Bonferroni correction needed, and no question of whether it's real, beyond perhaps accepting that the FDR approach is stricter. Usually, when you correct the p-values properly, you don't need to insist on 0.05 as the cutoff; you could use 0.02, or 0.1, or 0.2, or even larger. The 0.05 is a suggestion, not the one true cutoff.

Here's perhaps another example, from people doing weather prediction: what's going to happen tomorrow? They usually give probabilities. It might rain, it might be sunny, foggy, cloudy, snow. Add all these up and the probability that something happens tomorrow is one. But which one of these would we choose as the best prediction? You can say, well, this one has the highest probability, so this is the outcome we would predict for tomorrow, the one we treat as most important, and it's the one that passes our Bonferroni-style correction; maybe six or seven of the outcomes were below 0.05. So this is an example, as I said, of these paired tests. And as James pointed out, most people use a higher value than 0.05; 0.1, or more like 0.2 in some cases. But this is univariate statistics applied to a long list of things, and that's not usually the wisest thing to do. You want to go to multivariate statistics.

Another point is this feature called normalization, or scaling. Sometimes they mean the same thing, and sometimes they mean different things; we're going to talk about this in MetaboAnalyst, and it sometimes causes confusion. So here, again, is a population of normals and leprechauns. But what if we measured the normals with a miscalibrated ruler? This can happen; it actually happens more often than you'd care to think. Because of that miscalibration, the two measured distributions may now look quite different in a way that doesn't match what our eyes tell us. By rescaling the ruler so that it measures properly, that is, by rescaling those values, we bring the two populations back onto a comparable footing. So rescaling is sometimes called normalization, but normalization also has other meanings, so you have to be fairly specific. In multivariate statistics, normalization can also refer to making the distribution look normal; that's the log-scaling thing. So scaling and normalization are sometimes used interchangeably, but sometimes they mean different things. Normalization in mathematics is typically a scaling operation; normalization in statistics often means transforming things to look like a normal distribution, as in converting a skewed distribution to a normal one. That, too, is normalization.

OK, so we're getting into slightly more complicated areas. Thinking about comparisons, we're still talking about univariate comparisons and univariate statistics. In comparisons, we may want to look at experiments before and after, longitudinal studies, which is sort of two populations. We may want to plot one variable against another, height versus weight. We may want to say, here's my prediction and here's my observation, expected versus observed; that's two variables.
And in many cases we're not just looking at one observation and one prediction; it's many predictions and many observations, many measures of height, many measures of weight, many patients before, many patients after. When you have lots of data like that, you use scatter plots. This is a typical scatter plot: the correlation between a husband's age and a wife's age, and we can see from this real data that they correlate pretty well. So this is a standard approach. When we're working with scatter plots, we're interested in identifying trends, seeing whether there are patterns in the scatter, and when we look at those trends or patterns, we typically look for correlation. In scatter plots of populations we can see positive correlations and negative correlations; both are equally significant, though many people don't seem to realize that. And this one is uncorrelated, meaning the two variables are unrelated to each other. Because things in nature don't follow perfect laws, we never see a perfect line; that's essentially because of the normal distribution. There can be measurement errors, there can be variation in the population, whatever. If things are correlated, we can have different degrees of correlation: a perfect correlation, which is never seen except for basic physical laws, and even then there's measurement error; a low correlation; and a high correlation. Qualitatively, people can see these differences. But again, statistics is the mathematics of impressions: we want to quantify what everyone in this room can see as very good, not so good, or excellent. To quantify it, the statistician Karl Pearson came up with the correlation coefficient. It's the mathematics of our impression, and it gives us a quantifiable way of saying this is perfect, this is pretty good, and this is not so good. Essentially, it's a measure of how much the points vary about that particular line. This is for linear correlation, and in most cases, when we're correlating observed versus expected, or x versus y, the relationship is assumed to be linear. That's not always the case, but it's the usual assumption. So a very good correlation coefficient is 0.85; not so good is 0.4. But deciding what you're going to call significant is, again, largely up to the person. You can calculate p-values for correlation coefficients, so people can use a cutoff and say, the correlation is 0.4, and the p-value for that r of 0.4 is less than 0.05, so yes, the correlation is statistically significant. But whether that correlation is something you want to use, or build a law around, or follow, is up to you. I wouldn't; that's a personal choice. Note that the correlation coefficient is r, not r squared. Many, many people make this mistake, especially because Excel reports r squared values, and that's not the correct quantity. The correlation coefficient, also called the product moment correlation coefficient, is a linear coefficient, and it's a quantitative way to assess things like predictions, simulations, comparisons, and dependencies. And so this is the beef I have about the correlation coefficient, which is r, versus r squared, which is formally called the coefficient of determination. I never see anyone write "coefficient of determination".
So they're different: don't use r and call it r squared, and don't call r squared r. Avoid the mistake that Excel encourages, which is reporting the r squared value. So, is a correlation significant? Here's a case where we've measured 100-plus points and obtained a correlation of 0.85. Is this statistically significant and meaningful? Here we've measured three points, and they fall on a perfectly straight line. Is that statistically significant? And yes, this is the Pearson correlation coefficient, so it's r, not r squared; whether you write it as a capital or a small letter doesn't generally matter. The point is that there are tricks people will play, such as measuring 100 points clustered here and then one point way out there, which also gives them a great correlation coefficient. So this is an issue, a challenge, and there are methods for assessing it. Here's that example where we just added two more data points to what used to be a perfect three-point correlation, and now it's all messed up; the correlation has dropped. So having a sufficient number of points, n of 30 or 40, is important for establishing a robust correlation. I mentioned the other trick people use: they get a cluster here, and sometimes just one or two points out there, and again they get a spectacular correlation coefficient. Or the other trick of just using small numbers of points. There are packages that will evaluate the significance of correlation coefficients: a t-test can be used to assess whether the correlation is real, that is, whether the slope of that regression line is statistically different from zero, the null hypothesis being that there's no correlation. Typically, with these examples, the three-point line might have an r of 0.95 but a p-value of 0.7, and you'd say this really isn't significant. Another thing you'll sometimes see in correlation plots is something that looks really good, except for what we call an outlier. If you calculate the correlation coefficient with that outlier included, it'll drop from, say, 0.85 to maybe 0.7, and with a couple more outliers the correlation is gone entirely.
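Here is a hedged sketch of computing r and its p-value, and of how a single outlier drags r down; the data are simulated purely for illustration.

```python
# Sketch only: invented data, scipy's Pearson correlation.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(3)
x = rng.normal(size=40)
y = 2.0 * x + rng.normal(scale=0.5, size=40)      # a genuine linear relationship plus noise

r, p = pearsonr(x, y)
print(f"r = {r:.2f}, r^2 = {r*r:.2f}, p = {p:.1e}")   # r is the correlation coefficient; r^2 is the coefficient of determination

x_out = np.append(x, 8.0)                         # one extreme, discordant outlier
y_out = np.append(y, -8.0)
r_out, _ = pearsonr(x_out, y_out)
print(f"with one outlier: r = {r_out:.2f}")       # noticeably lower than before
```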
So a question people often ask is whether an outlier is an error, a measurement or experimental error, or whether it actually represents something important. Outliers, from the perspective of scientists, can be good and bad. If you're doing predictions and you see outliers, you think, oh, my prediction model is terrible. But from a predictor's point of view, seeing that, they'll ask the experimentalist: are you sure you measured things correctly? I have seen more than a few students take outliers and an eraser, or whiteout, and just make them disappear. That's not kosher. If you want to remove an outlier, you need to go back to the original data and assess whether an error may have been made, and arguably only the experimentalist can know that. If they can recall something, or understand something, or repeat the measurement and find that it comes out different, then they can remove it. But you don't take an eraser and arbitrarily remove things. On the other hand, if you're trying to identify interesting trends, as in microarrays and in some cases metabolomics, an outlier is sometimes an indication of something significant. Maybe it corresponds to an individual who has some condition or disease. In some cases recently we've seen outliers when doing patient monitoring: people taking supplements, which seriously modify their blood or urine composition, and if no one has recorded what supplements they've taken, that can be really problematic. So outliers raise a discussion that sometimes goes to the heart of ethics in science, and it's one where people have to be particularly careful.

Here's a plot of height versus weight, and I think most of us understand that relationship: there's a correlation between height and weight. So here's a scatter plot of height and weight, and you can probably draw a reasonably straight line through it and get a nice correlation coefficient, saying there is a relationship and an equation that lets you estimate someone's weight from their height. But you could take a different approach: instead of trying to draw a line, look at clusters. Some people might see that there's a bit more to this, that there are very distinct clusters, maybe not humans but rodents and giraffes, or a clustering of male and female, and that the correlation you were getting was really just between the male group and the female group, because within each group, once clustered, there's almost no correlation between height and weight. So sometimes looking at the structure of the data with a different set of eyes, or with different knowledge, helps you make that distinction. Clustering is used a lot, as is correlation. We use clustering a lot in metabolomics; we use it in gene expression analysis and in proteomics; we use it in protein interaction analysis, in phylogeny and evolution, in protein structure. Lots and lots of clustering techniques get used, so these are really critical.

Clustering has a formal definition. It's something we can do very naturally; it's a pattern recognition process. But the formal definition is that clustering is a process by which objects that are logically similar are grouped together. "Logically similar" is the definition you have to supply to make that distinction; again, it can be framed mathematically, or sometimes in some sort of Boolean logic. There's also a distinction between clustering and classification. A lot of us use those two terms interchangeably in everyday language, but mathematically and statistically they are fundamentally different. In classification, objects have predefined classes: they have been labeled beforehand. In clustering, the data is unlabeled; we haven't identified anything about the objects, haven't colored them pink and blue, male or female. So clustering, without the labels, can sometimes tell us what to do with classification. If I hadn't colored this plot pink and blue, but simply said, here's a cluster and here's another cluster, you could go back to your data and see what you notice about those clusters. You might notice that all the points in this cluster were female and all of those were male. Then you recolor your plot to demonstrate that the clusters you originally identified correspond to labeled data, male and female.
If we had originally had unlabeled data, like this, and we were asked to cluster it and find those clusters, we would have got something similar. So clustering uses unlabeled data; classification uses labeled data. To do clustering, to define that logical similarity, you need some kind of measure of similarity: a similarity matrix, as in sequence alignment, or a dissimilarity coefficient. Sometimes it's a cutoff value to decide whether something is part of a cluster, a way of measuring the distance between objects and clusters. And usually, to start clustering, you start with a seed. We do clustering all the time; if you've ever had to sort socks from the laundry, you typically take a colored sock, a red sock, and look through the mess of other socks for another red sock that matches in both color and length, and you match it. You've clustered it. Then you take another seed, maybe a blue sock, and look for a similar one. That's clustering, and that's the idea of the seed: choosing a single object first.

There are several types of clustering algorithms. There's the k-means algorithm, which is partitioning: dividing a set of n objects into m clusters. There's hierarchical clustering, which is nested clustering: things are progressively nested to eventually produce one large, gigantic cluster, and this is the one more commonly used in microarrays and metabolomics. And then there are self-organizing feature maps, which are another way of doing clustering, but with a bit of training or iteration involved.

K-means clustering works like this: grab an object, randomly chosen, and say that this is your cluster center. Then start looking through your other objects and calculate the similarity of each one to that centroid, which so far is just your first object. If the similarity passes a threshold test, you add the new object to your cluster; now it's two objects, and you recalculate the centroid, the new average, based on those two. If you couldn't find any object to add to the cluster, you throw back your original one and start again. But say you found another object that does group with it: you shift your centroid, because you now have two objects, then grab another object and see if it matches those. So let's take five balls. We randomly choose one; it's a darkish turquoise. Since it's only one object, it defines the centroid. We start looking for other objects to see if they fit: choose another one, and another. If you chose this orange one, it wouldn't pass our threshold test in terms of color; we're using the color wheel here, with wavelength in nanometers, and orange to light blue is more than 50 nanometers apart, so we can't include it. (The color looks a bit different here than on my screen, where it's more like turquoise.) If instead you chose this one, which sits about here relative to our centroid, it fits, so we pair the two together, cluster them, and calculate a new centroid, some average of the two colors, which is about here. If we continue the clustering, we might get these two and this one clustering together, while this one forms a separate cluster and this one forms another separate cluster.
So from five balls, we end up with three color clusters. Hierarchical clustering, by contrast, aims to build one giant cluster. In this case your first calculation is a pairwise comparison of every object against every other object. Once you've done all of those initial pairwise comparisons, you merge the closest objects, and you keep clustering and clustering until everything forms one cluster. Take the same set of five balls: we do all of the pairwise comparisons, lots of comparisons, and we merge the closest pair first; in this case it wasn't this one, it was these two that were closest. Then we find the next closest, and the next, and finally this orange one is the last to join. In the end we've made one giant cluster of five, but we've organized them according to which one is closest, which is next closest, and so on. As I say, hierarchical clustering is what you find in gene expression studies: here's the first pair, a nested pair, another nested pair, and this is a heat map of gene expression. You can see, grossly, one red set and one green set; within the green set you can see a group highlighted in the middle, another at the end, and another set that's scattered. So there are three groups in the green, and in the red set maybe two well-defined groups, and you can partition it even further, but everything ends up as one giant cluster. That's hierarchical, or nested, clustering, and it's different from k-means.
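Before we move on to multivariate statistics, here is a minimal sketch of both clustering approaches just described, k-means and hierarchical, on made-up two-dimensional data; none of it comes from the lecture examples.

```python
# Sketch only: synthetic 2-D points, scikit-learn k-means and scipy hierarchical clustering.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.cluster import KMeans

rng = np.random.default_rng(4)
# three loose clouds of points (think height/weight pairs for three groups)
data = np.vstack([rng.normal(loc=c, scale=0.5, size=(20, 2)) for c in ([0, 0], [5, 0], [0, 5])])

kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(data)

tree = linkage(data, method="average")                     # nested (hierarchical) clustering
hier_labels = fcluster(tree, t=3, criterion="maxclust")    # cut the tree into 3 clusters

print(kmeans_labels[:10], hier_labels[:10])   # both recover the same three groups (label names differ)
```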
OK, so we've gone through populations, univariate statistics, correlation analysis, and clustering, and you need all of those to start understanding the next step, which is multivariate statistics. This is the essence of metabolomics; it's the essence of transcriptomics and of proteomics. Multivariate statistics means multiple variables. We can measure more than height: we can measure weight, eye color, hair color, clothing, whatever. In the case of metabolomics, we can measure 100, 200, or 1,000 metabolites; in transcriptomics, 20,000 genes. When you're measuring multiple features per object, you have to deal with multivariate statistics, and you move into the realm of, essentially, multidimensional mathematics, which is non-trivial. In a typical metabolomics, transcriptomics, or proteomics experiment you have biological samples: there might be two, or a dozen, or 100 rats that are normals, and then the ones that are treated, or diseased, or sick, or different populations of cells. You can also have technical replicates, where you collect multiple samples from a single animal, which is important for things like quality control. And then you're collecting your data: multiple measurements, multiple metabolites, genes, or proteins, measured by LC-MS, GC-MS, or NMR. Many variables from many objects or samples. Variables and samples: that's why all omics data requires multivariate statistics.

So what is the trick for doing multivariate statistics? The fundamental trick is to reduce the number of variables. We try to convert the multivariate data into univariate data so that we can apply the same things I was telling you about before: once we've converted multivariate data to univariate data, we can do t-tests and ANOVA tests. To do that, we do dimensional reduction. Reducing the dimensions, meaning the number of variables, can be done through a technique called principal component analysis, or PCA. It's a process to transform correlated variables, so we're back to correlation, into a smaller number of uncorrelated variables, and those uncorrelated variables are the principal components. Height and weight are correlated. What else would be correlated in human beings? Eye color and skin color, weakly correlated. Hair length and gender, correlated. If you measured all of those things on individuals, you would see some of these variables correlate and others not: hair color is not correlated with height; eye color is not correlated with weight; those are uncorrelated. So if we partition the features we can measure in a population, we see certain groups falling along certain dimensions. Doing this correlation analysis, separating the correlated from the uncorrelated variables, allows us to reduce maybe thousands of variables, genes, metabolites, proteins, down to two or three combined features.

Think of this as a spectrum. It could be NMR, GC-MS, LC-MS, whatever, from different cohorts; it might be the average spectrum, or it could be individual ones. What we're seeing is literally hundreds of peaks, hundreds of metabolites or genes. With principal component analysis we take all of that information, and in this case we've been able to group the samples into three clusters, the control cluster and the two treatment clusters, using just two variables. Hundreds of peaks, two variables. The x and y axes are now a multitude of metabolites, if you want: PCA has taken certain correlated metabolites, maybe alanine, threonine, leucine, and something else, and another set, a whole bunch of organic acids. So one axis is a linear combination of organic acids, say, and the other a linear combination of amino acids. The organic acids and the amino acids were uncorrelated with each other, but the amino acids were correlated among themselves and the organic acids were correlated among themselves. So we've grouped the variables that were correlated, kept the uncorrelated combinations apart, and plotted them. I don't know exactly what the combinations were in this example, but that's the idea. The x and y axes are linear combinations, or formally, eigenvectors.

A question: do you do the PCA transformation first and then run tests on the PCA-transformed data, or is it just a visualization? Typically it's a visualization, and people actually don't do that many tests on the clusters; the software will usually run the differential tests separately and then show the result. We'll also get to a related method called PLS-DA, which does give us statistics. But I'm going to talk about PCA a bit more, because it's a concept that's not trivial. One way of looking at it is a non-mathematical view of principal component analysis.
And in this one, it's the PCA of a bagel or donut. If this bagel is suspended in the air and we're shining light onto it, we're trying to get shadows. If we shine a flashlight onto the bagel and project it against this side, we see what looks like a sausage or a wiener. And if we shine a light the other way, we get an O. So there are two kinds of projections we can get for this three-dimensional object. The one that's most informative — the one that would tell us whether or not this is a bagel — is actually this projection, because we all know that bagels are roundish and have a hole in them. The other projection is somewhat uninformative, because we see a picture that looks like a sausage and someone says, that's a bagel? No, it's a sausage. Still, that projection tells us the bagel has some depth, so it's a secondary projection. The projection that captures the most information, most of the variation in the structure of the bagel, is what we call principal component one. The second projection, the one that's not so informative and only captures a portion of the variation, is principal component two. And this is what we do whenever we turn three-dimensional objects into two-dimensional projection drawings; it's what architects and draftsmen do. That's three dimensions to two dimensions. In principal component analysis, we're going from 100 dimensions to two, or 1,000 dimensions to two. Most of us can't wrap our heads around that, but this is an example of how three dimensions can be reduced to two.

So if you're dealing with 1,000 dimensions, you have to use math. What you're doing in principal component analysis is called a singular value decomposition or an eigenvalue decomposition, where you work from a covariance matrix — you're calculating, essentially, the variation, the covariance or correlation if you want. You have your samples — so these are your blood samples — and then you have metabolite 1, metabolite 2, metabolite 3. That's your initial data matrix. You then do this singular value decomposition, where you convert that data matrix into a set of loadings (weights) and a set of scores. Each score, T, represents a linear combination of your loading values, the P's, and the concentrations of your metabolites. It's a mathematical transformation — something you typically can't do by hand, can't do in your head. We let the computer do it, so we're not going to go into the theory. But it is formally called an orthogonal transformation, and as a transformation, it changes the data from 1,000 coordinates to two or three coordinates. So it's a new coordinate system; it's a coordinate transformation. And it identifies naturally, through the largest eigenvalues and their eigenvectors, the direction that has the largest variation — that's your first principal component — and then the second principal component. Technically, these orthogonal transformations, if you had 1,000 variables, could still give you 1,000 principal components. But the most significant ones have the largest eigenvalues, and in many cases two or three eigenvalues cover 95% of the variation. So you can neglect the other 5%, the other 990-odd components.
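For those who want to see the decomposition itself, here is a minimal numpy sketch of PCA done by singular value decomposition of a mean-centred data matrix: the scores T, the loadings P, and the eigenvalue-based fraction of variation explained. The matrix is just random numbers, so only the mechanics matter, not the result.

```python
# A minimal sketch of what PCA does under the hood: an SVD of the
# mean-centred data matrix X (samples x metabolites). The columns of V
# are the loadings (P), and T = X_centred @ P are the scores.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 50))                 # 20 samples, 50 made-up metabolites
Xc = X - X.mean(axis=0)                       # centre each variable

U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
loadings = Vt.T                               # P: one column of weights per component
scores = Xc @ loadings                        # T: new coordinates for each sample

# The eigenvalues of the covariance matrix come from the squared singular
# values; their relative sizes say how much variation each PC explains
# (with real, correlated data a few PCs often cover most of it).
explained = s**2 / np.sum(s**2)
print(explained[:3])
```

Keeping only the first two or three columns of the scores is the "neglect the other 990-odd components" step: you drop the directions whose eigenvalues contribute almost nothing to the total variation.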
So this is some data that Roy Goodacre at the University of Manchester had collected — or rather, a fellow working on US airports had collected data on all 5,000 or so US airports — and he used it to illustrate principal component analysis. The data track latitude, longitude, and altitude of these airports. So the samples are 5,000 airports, and the measured variables — the "metabolites" — are latitude, longitude, and altitude. Only three variables, so it's not that complex. They could also have picked the length of the airport's name or, who knows, the color of the airline associated with it; they could have had a bunch of other variables. Anyway, lots of samples, three variables. What are you going to get if you do a principal component analysis of this? Does anyone have any idea what you would get from a PCA of airport data? What, the map? Yes. What you get is a map of the US. This is the continental US — there's Florida, California, Texas, the eastern seaboard. This is Alaska; Canada's up here. This is Hawaii; that's Puerto Rico. They ran it through, did the eigen-transformation, the singular value decomposition, crunching, crunching, crunching, and essentially all it produced was a map. That's what PCA did.

And in fact, PCA is formally, mathematically, very closely related to k-means clustering. We just talked about what k-means clustering was, and people have shown mathematically how tightly the two are connected. So again, I'm trying to give you an intuition of what clustering is, what PCA is, and how the process works. This is, again, the k-means clustering idea, where we're just grouping things. What you find is that a whole bunch of airports cluster around the continental US, one group clusters around Alaska, another around Puerto Rico or Hawaii, and they reproduce the shapes of those places. Once you have the dimensional reduction, once you've done the clustering, those clusters are generally somewhat normally distributed; they do have means and variances. So you can test whether those clusters are statistically significantly different, and you can use a t-test or an ANOVA if you wish — but now you're using these combination variables. ANOVA could be used to do this; it's not done widely. Most people simply visualize the clusters, draw circles, and say, I see a cluster. And that's OK too, because this is a pattern recognition process, and pattern recognition is something people still do better than computers. You can have a PCA plot in two dimensions, or one in three dimensions. You can't really have one in four dimensions, because you can't visualize that. But in most cases, two or three is the number of significant eigenvalues and eigenvectors you can pull out.

Formally, this is called a scores plot. We call it a PCA plot, but the formal name is the scores plot, and it's plotting the main principal components, the PCs. In addition to the scores plot, there's another thing called the loadings plot. The loadings plot tells you how much each of the variables — latitude, longitude, altitude, or the metabolites and their concentrations — contributed to the different principal components. Two principal components: which metabolites played a role? So in that earlier example with the control and the two treated groups, I said there were organic acids and amino acids. Which ones? Threonine? Oxalic acid? We don't know.
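Here is a hedged sketch of the airport example in Python: PCA on three variables (latitude, longitude, altitude) for a set of invented airports. Because nearly all of the variance is geographic, the first two principal components essentially redraw the longitude–latitude map (possibly rotated or flipped). The coordinates below are simulated, not the real 5,000-airport dataset.

```python
# A minimal sketch of the airport example: PCA on three variables for
# 500 invented "airports". The data are made up for illustration only.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(4)
n = 500
lat = rng.uniform(25, 49, n)             # degrees, roughly a continental-US range
lon = rng.uniform(-125, -67, n)
alt = rng.normal(300, 150, n) / 1000.0   # kilometres, so it varies far less than lat/lon

airports = np.column_stack([lat, lon, alt])

pca = PCA(n_components=3).fit(airports)
scores = pca.transform(airports)

# Plotting scores[:, 0] against scores[:, 1] gives (a rotated/flipped
# version of) the longitude-latitude map, because those two variables
# carry nearly all of the variance; altitude contributes almost nothing.
print(pca.explained_variance_ratio_)
```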
But often, it's the variables out at the corners — down here, here, here, or here — that contribute most to the separations in the scores plot. So that's the loadings plot: it shows you how these variables contribute to what you see in the scores plot. OK, so remember, two types of plots, and it often helps to have both displayed.

Sometimes PCA won't give you clear clusters or obvious groupings that your eyes can see. If everything looks like one giant clump, well, maybe the game's over. If you can't see even a modest separation, then trying a whole bunch of other statistical techniques is not going to make much of a difference — it's trying to make a silk purse out of a sow's ear, as the saying goes. So PCA should typically be your first try. If you see something, or you're not quite ready to give up, you can go to the next level, which is called partial least squares discriminant analysis, PLS-DA. This is a classification technique rather than a clustering technique: in this case, you're labeling the data. In some respects you're cheating, because you know the answer. It's not really cheating — someone says, here's male and here's female, and that's information you need to know anyway — and if you see that they cluster differently, great. So PLS-DA uses labeled data; PCA uses no labels, so there's no prior knowledge. Because PLS-DA has that extra information, it's able to enhance the separation, and the way it does so is by rotating the PCA components so that you get maximum separation. Mathematically, it's a rotation — another orthogonal or coordinate transformation.

So PLS-DA is a statistical method for doing prediction; it's a class predictor. And there are other machine learning methods that can do what PLS-DA does, like support vector machines and artificial neural networks. When you do classification — that is, prediction — you're creating models and training them on the data. And when something is trained, it can overfit. If you're training to take an exam, and you test yourself on the same exam and the same exam and the same exam, and then someone gives you a different exam, you might flunk, because you've only trained on that one set of questions and that one type of test. So to measure how robust the training is and how good the model is, you use something called R squared and Q squared. Anyone heard of those terms? R squared, Q squared? It's not the regression R squared — the name is a bit of a misnomer; it's not formally the coefficient of determination from a regression. It's just a way of measuring the robustness of these models. The other approach is called permutation testing, which is the one I prefer. So you can evaluate with R squared and Q squared: R squared is the goodness of fit, and Q squared is the predicted variation, the quality of prediction. Because they're squared values, they nominally range from 0 to 1, and they tend to track each other closely — so if someone only gives you a Q squared, you can say that's pretty much the R squared. R squared is a quantitative measure of how well the PLS-DA model is able to reproduce the data. It's not a Pearson correlation coefficient; it's just a measure of mathematical reproduction. And there are some rules of thumb: if R squared is about 0.2 or 0.3, it's hopeless; if R squared is 0.7 or 0.8, it's very good. So that maps onto our idea of what a good correlation would be. Q squared is another number that gets kicked out by some programs.
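Here is one common way to set up PLS-DA and the R squared / Q squared check in Python: code the two classes as 0/1, fit a PLS regression against the labels, and compute Q squared by cross-validation. This is a sketch under those assumptions — it is not MetaboAnalyst's exact implementation — and the data are simulated.

```python
# A minimal sketch of PLS-DA with an R-squared / Q-squared check.
# PLS-DA is just PLS regression against class labels coded as 0/1.
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(2)
n = 40
y = np.tile([0.0, 1.0], n // 2)                     # two labelled groups
X = rng.normal(size=(n, 150))                       # 150 made-up metabolites
X[:, 0] += y * 1.5                                  # a few variables genuinely differ
X[:, 1] -= y * 1.0

pls = PLSRegression(n_components=2)
pls.fit(X, y)

# R2: how well the model reproduces the labels it was trained on.
y_fit = pls.predict(X).ravel()
r2 = 1 - np.sum((y - y_fit) ** 2) / np.sum((y - y.mean()) ** 2)

# Q2: the same quantity, but with each sample predicted while held out
# (cross-validation), which is what guards against overfitting.
y_cv = cross_val_predict(pls, X, y, cv=7).ravel()
q2 = 1 - np.sum((y - y_cv) ** 2) / np.sum((y - y.mean()) ** 2)
print(f"R2 = {r2:.2f}, Q2 = {q2:.2f}")
```

The design choice here is simply that R squared is computed on the training data while Q squared is computed on held-out predictions; an over-trained model shows a high R squared with a much lower Q squared.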
Q squared is estimated through cross-validation or permutation testing. If Q squared is better than 0.5, you're happy; a Q squared of 0.9 is almost impossible to get. So Q squared, as I say, leads into this idea of permutation testing, or cross-validation, and that is actually your best way of assessing whether you've over-trained something. So forget R squared, maybe even forget Q squared — think of permutation as the rule.

Let's say we have our PCA analysis, and for most of us looking at it, it's not looking good: we're not seeing two clusters. So let's label our data. At this stage we can see that the PCA, once it's labeled, shows a slight separation: the red stuff is biased toward the lower end, the blue stuff toward the upper end. So it's not completely hopeless. So then we use PLS-DA to accentuate the differences. We train it; PLS looks at this, plays around with those principal components, rotating and twisting things, and after a while it sees, ah, here's a way of separating these things. So our question is: has this PLS-DA — or this support vector machine, or neural network, or random forest, or any other classifier — actually just over-trained on the data? Is it seeing a pattern where there really isn't one? How do you test that? Well, you randomly relabel your data. You say what was red is now blue, and what was blue is now red — not everything uniformly; you just switch some number of them. So our randomly permuted, relabeled data, which is now intended to make no sense, produces some kind of cluster like this. OK, run it again through PLS-DA or an SVM and see if it can pull this apart. It's sort of like messing up a ball of string and saying, see if you can pull it apart — then I mess it up even more and see if you can still pull it apart. Well, after crunching for several hours, it gets this, and at this point you say, well, that isn't very good, it didn't separate. So relabel again, crunch away, and see if you get a separation. Relabel again, crunch away, and see if you get a separation. And it will attempt a hundred, a thousand, two thousand different relabelings to see if it ever gets a separation.

What you then do is measure your separation score: how far apart are these two clusters? What's their p-value or t statistic? And so you get, essentially, a distribution. We may find that in some cases we get a bit of a separation; in other cases, little or absolutely no separation. So we get a score for each relabeling. And if this is the one we started with — the real labels — then this is its separation score, and these are all the other ones we saw over a thousand or two thousand permutations. Is this significant? What do you remember from the very first points in the lecture? How many standard deviations away is this? Is the p-value less than 0.05? Is it less than 0.01? It's probably less than 0.001. So this is significant. This one has not been overtrained; this separation is real; the groups are clearly distinct; the means are not the same — which brings us back to the very original point about t-tests. So permutation testing is probably the best and computationally easiest approach. It's horrible to do by hand, but it's something a computer can do, and it lets you determine whether you've overtrained. As I said, there are other techniques beyond PLS-DA: SIMCA, OPLS, support vector machines, random forests, naive Bayes, artificial neural networks.
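And here is a minimal sketch of the label-permutation test just described: compute a separation score with the real labels, then recompute it a few hundred times with shuffled labels and see where the real score falls in that null distribution. Everything below — the simulated data, the number of permutations, the choice of cross-validated Q squared as the separation score, and the q2_score helper — is an illustrative assumption, not a prescribed recipe.

```python
# A minimal sketch of label-permutation testing for a PLS-DA model:
# refit on shuffled labels many times and see how often the scrambled
# data "separates" as well as the real labels do.
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import cross_val_predict

def q2_score(X, y, n_components=2, cv=7):
    """Cross-validated Q2 for a PLS-DA model (labels coded 0/1)."""
    pls = PLSRegression(n_components=n_components)
    y_cv = cross_val_predict(pls, X, y, cv=cv).ravel()
    return 1 - np.sum((y - y_cv) ** 2) / np.sum((y - y.mean()) ** 2)

rng = np.random.default_rng(3)
n = 40
y = np.tile([0.0, 1.0], n // 2)
X = rng.normal(size=(n, 150))
X[:, :3] += y[:, None] * 1.2                        # a real, if modest, group difference

observed = q2_score(X, y)

# Permutations: relabel at random, refit, and record the score each time.
perm_scores = np.array([q2_score(X, rng.permutation(y)) for _ in range(200)])

# Empirical p-value: how often does a random relabelling do as well?
p = (np.sum(perm_scores >= observed) + 1) / (len(perm_scores) + 1)
print(f"observed Q2 = {observed:.2f}, permutation p ~ {p:.3f}")
```

If the observed score sits far out in the tail of the permuted scores, the separation is unlikely to be an artifact of over-training; if it sits in the middle of the distribution, the model is probably fitting noise.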
These are all machine learning or statistical techniques developed over the last 20 to 25 years, all computer-based, that allow you to do separations and classification. But they all have to be validated; they all have to be checked so that they're not overtrained. And if you don't validate, you're cheating, and in many cases the results you get are probably false.

So when you're dealing with metabolomic, proteomic, or transcriptomic data, the first effort should always be to work with unlabeled data and see if you can get separations — whether the groups separate on their own. You can use PCA, which, as I said, is very closely related to k-means clustering; they do much the same thing. Factor analysis, which psychologists have used, is essentially in the same family as PCA. Those are the unsupervised methods. Check to see if you get some separation. If you do, if there seems to be a hint, then go to the next level — haul out the cannon to get past this big wall — and see if you can go further with PLS-DA, or PLS regression, or linear discriminant analysis. There are different tools available; see if, in fact, that separation gets further apart. Sometimes this may not work, and you can sometimes do a little better with things like support vector machines or random forests — more powerful methods that may enhance the separation and find some features. But in each case, once you've gone from your catapult to your cannons, remember you have to do some permutation testing. So you begin with the unsupervised method: unlabeled data, no prior knowledge of the natural clusters. Then you label the data with the information you have from the experiment or from the research coordinator you're working with, and see if the separation is enhanced. And then you assess how statistically significant that is using permutation — label permutation.

Supervised classification methods are very powerful; they generalize well; machine learning is everywhere these days. They perform pattern recognition very well — in some cases better than humans can, to the point that it's almost absurd, and that's exactly the situation where a model has become over-trained. A lot of people historically skipped the PCA and clustering steps, went straight to PLS-DA, and didn't even do the permutation testing. And so a lot of early papers in metabolomics, transcriptomics, and other fields produced spectacular results that no one was able to reproduce. This is where those errors begin. Many people still haven't done proper significance testing, permutation testing, or Q squared validation. And I think the other point is that if you can't see something in the data from the very first clustering, or even from some of the first measurements, then applying a heavier and heavier arsenal of statistics and trying all kinds of permutations — again, you're trying to make a silk purse out of a sow's ear, and you're treading on thin ice; it becomes very dangerous. Statistics just affirms the obvious.