We've got it. So these are the mandatory slides. But what we're going to begin with today is actually a background on statistics. Some of you are probably very familiar with statistics; others are coming into it relatively new. And we've found this to be quite useful. So this is one part that has not changed from last year, although it's been expanded a little bit because the field keeps on moving. In terms of the schedule, this is day two. So we're going to have the background in statistics. Then we're going to go into MetaboAnalyst, so that's a lecture highlighting the different features and components in that tool. And then we're going to have a lab that starts at 1:30. And I've thought about this a little bit, but depending on where people are, if after two and a half hours or so in the lab people are wanting to do something a little different, I can give you a short lecture on the future of metabolomics, if people want to listen to that. But if you're really enjoying the lab, then we won't do that and you can just carry on. I think the other point that Michelle raised is just to make things a little clearer. So yesterday, what we were looking at was how to go from spectra to lists. That's where you spent a good chunk of your time running things like Bayesil and GC-AutoFit and XCMS. And the idea was that once you had your lists of metabolites and concentrations, or relative concentrations, the focus would then shift to how to analyze that data. So today, that's what we're going to focus on. We're going to look at multivariate statistics, ROC curves, pathway analysis, and so on. So there's a design to all of this. Now that you have your data in place, you're going to be able to use it and interpret it. So the learning objectives today, at least for this lecture, are to learn a lot more about distributions and significance; to learn about what are called univariate, or single variable, statistics, so that's the t-test and then analysis of variance, or ANOVA; to learn a little bit more about correlation and clustering, and also about ROC curves, which are receiver operating characteristic curves; and then to learn about multivariate statistics, or two-or-more-variable statistics. And this is where you get into principal component analysis and partial least squares discriminant analysis. So statistics is something that a lot of people hate, and it's one that is sometimes difficult to learn. But I guess there are different thoughts on it. I think a lot of us have heard this quote: there are three kinds of lies, lies, damned lies, and statistics. It's often attributed to Mark Twain, but it was actually made by Benjamin Disraeli, who was a prime minister in England, and I think Twain picked it up. But I think it's very true. And then another one: 98% of all statistics are made up. Which also might be true. And then Aaron Levenstein had this interesting quote: statistics are like bikinis; what they reveal is suggestive, but what they conceal is vital. Anyway, I've played around with statistics off and on for a long time, and what I've kind of learned is that statistics is the mathematics of impressions or intuition. Our brains are wired to look at and identify patterns very, very quickly and very easily, and those things are, to us, impressions or intuition. A chess master can look at a chess board and instantly identify patterns and know what his or her moves will be, five or ten moves away. Computers can barely do that.
And yet it was essentially a pattern, an impression. The foundation is built on statistics, because they've seen so many patterns playing so many chess games. What we've done with statistics is formalize those impressions and try to capture them in ways that are more quantitative, more rigorous. And that's why some people find it difficult: if you're an intuitive person, and most of us are, converting that intuition into distributions and p-values and formulas is hard. So that's my view of what statistics is, and I'll try to give you a non-formal description of statistics so that at least it's perhaps a little more understandable to some of you. For those of you who think in formal theorems, lemmas, and proofs, this won't be particularly appealing. So the first way to think about statistics is distributions and significance. One thing that you can think of is that if you have a population, and this is an example of a population of about 30 people, I guess, there are all kinds of things you can measure about them. But if you measure a single variable, maybe the most obvious one from this population would be height; that's a single variable. Another one you could measure in this population might be weight. And again, just by looking at the people, you can probably see variability in both height and weight. We could have them take a test, maybe an IQ test, and that could also give us a measure, a value of something. So with univariate statistics, we talk about uni, single variable. And that's, as I say, a single measurement. Height, weight, test score, IQ: that's what we are measuring. And if you plot a population's height, so everyone's height in this room or everyone's weight in this room, measuring the frequency of that variable, you'll get this. That's called the bell curve; at least that's colloquially what we call it. And it's a remarkable thing, because almost any variable you measure in nature will tend to follow this distribution. It's not going to be perfect; there's always some slight variation. But as you get more and more measurements, from maybe 30 to 100 or 10,000, it gets closer and closer to this bell curve shape. Carl Friedrich Gauss actually noticed this, and so the shape itself is called a Gaussian curve, or the normal distribution. The normal distribution of a single variable is symmetric. It has an average, a mean, which is usually given the Greek letter mu, and it's at the center. The width of that distribution is called the standard deviation. And as I say, it's the most common type of distribution. Almost everything we measure in nature will follow this. There are exceptions, but it's one that's very, very common. So this sort of reiterates what I've already said. Whether it's biological or physical, we get that normal distribution. The larger the set, the more normal the curve. And the rule of thumb, which a lot of people wonder about, is that to get that kind of distribution you need a population of about 30 to 40. It's one of the reasons why historically class sizes in schools, grade schools, are about 30 students, because this allows them to get a nice distribution. So it wasn't about student-teacher ratios. It was sort of this drive, thanks to, I guess, the fanaticism in the 1920s about IQ tests, to have class sizes of about 30 to 40.
Anyway, Gauss, in I guess the late 1700s or early 1800s, identified this feature and actually wrote out an equation that describes the distribution, and that's given here. So this is a probability distribution, and it's an e to the minus x squared type of curve. Again, most people are familiar with this, everyone I think should be, but what I've done here is partition this distribution into different segments: between plus and minus one sigma, which covers about 68% of the area under the curve; plus or minus two sigma, two standard deviations, which covers about 95% of the area under the curve; and then plus or minus three sigma, which covers about 99%. The curve itself is continuous, or at least can be modeled as continuous, and that makes it very mathematically appealing. From that, you can actually get a number of quantities: the average, which is the sum of all the values divided by the total number; the variance; and the standard deviation. So sigma is the standard deviation and sigma squared is the variance. These are standard formulas that, again, most of us have probably been taught at one time, but most of us just use Excel to calculate now. But this is what they mean: the variance and the standard deviation represent the spread of the curve. You can have a bell curve that's quite narrow, and you can have a bell curve that's quite wide, and that width of the curve is reflected in the variance. And again, the center value in the middle is always the mean, or the average. So this just illustrates things again, which I've already highlighted, in terms of the portion of area between plus or minus one sigma, plus or minus two sigma, and so on: 68, 95, 99, then 99.99% for four sigma, five sigma. And when things are three, four, and five sigma away from the mean, that's very exceptional. So if that's someone's height, it means they're either a dwarf or a giant. If it's their IQ, it means they're sort of a super genius. If it's their weight, maybe they're basically a skeleton, or they're very obese. Those are examples of things that are three, four, five sigma away from the mean. You can also look at things in terms of half values, so plus one sigma, plus two sigma, and of course then you divide those areas by two. That's sometimes confusing for people, because when someone says it's one sigma away, do they mean plus or minus, or do they mean just the plus or just the minus? So there's some ambiguity there. So that value of sigma is something we typically use to talk about significance. The fact that someone is eight feet, eleven inches tall, is that significant? Are they significantly different from the mean? Your intuition would tell you, yes, that's pretty tall and that's probably pretty significant. But how do we explain it in a more formal, mathematical way? So this is how we define significance in statistics. If you're working with a normal distribution of a univariate variable, the probability that something could be more than one sigma away, one standard deviation away, is 32%; the probability that something could be more than two sigma away, plus or minus, is about 5%; for three sigma, it's about 0.3%, so it gets very, very rare. We use this significance measure to basically grade students. And this is particularly commonly done in universities, less so in maybe elementary or high school, but in universities with big classes, 400 students, first year, sometimes second year.
So if you score the average, in the middle, we award you a C. If you score about one standard deviation above the average for the class, we'll give you a B. And if you score about two standard deviations above the average, you'll get an A or an A plus. That's generally the rule of thumb. And if it's one standard deviation below, that's a D; two standard deviations below, it's an F. And that's partly why we have this five-letter grading scheme. So significance is one way of looking at it, and I think a lot of us have encountered this issue of one sigma, two sigma, and so on. We also use a different measure, which you'll often hear quoted when they talk about surveys or polls: the probability that this poll is correct is 19 times out of 20. This is essentially another way of talking about the p-value. The p-value, P for probability, is the probability of obtaining a score, a set of events, a height, some sort of univariate number, that is at least as extreme as the one that was actually observed. In statistics, we have a thing called the null hypothesis, which says that nothing is different. So if you reject the null hypothesis, it says you've found something that is different. So when the p-value is less than a significance level: 19 times out of 20 is often the significance level, or one minus that. That's how often it will be significant. So we use 0.05, or 5%, which is 100% minus 95%, 19 out of 20. In some cases people might use a cutoff of 10%. You can also use 1%. You can use any cutoff you want. Unfortunately, people have been religiously converted, I guess, to thinking that the 5% is an absolute. It just happened to be a convenient one that was recommended; there was no formal justification. Again, it's intuition. How often do you want to be right? Most people may say it's okay if I'm right 95% of the time, but some people want to be right 99% of the time. Some people are quite happy if they're right 80% of the time. So that's where you choose your alpha value. And when the null hypothesis, null meaning nothing's different, is rejected, then we say something is statistically significant. To make the p-value a little more intuitive: the average height, if you took all the people in this room, would probably be about five foot seven, maybe five foot six, and the standard deviation is about five inches. So what's the probability of finding someone who is more than six feet ten inches tall? You could define your test with the null hypothesis that anyone up to six feet ten inches is a member of the human species, and anything above that is not human. So that's our null hypothesis. We could choose an alpha value of 0.05, or we could choose an alpha value of 0.01. And if you do the quick calculation, adding five inches to five foot seven gets you to six feet, that's one standard deviation; add five more inches and that's six foot five, two standard deviations; six foot ten puts us at about three standard deviations. So even at the 0.05 level and the 0.01 level, we'd say there's potentially no one above six foot ten, or if we did find someone, they're either very, very tall or they're not human. So that's how you can structure your question or your hypothesis, but it's also how you can measure whether something is significant. At the 0.01 level, we're up at around six foot ten. And so at that level, is that significant or not?
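To put numbers on that, here's a minimal sketch in Python (assuming SciPy is available, and using the figures quoted above, a mean of 67 inches and a standard deviation of 5 inches, purely as illustrative values) that reproduces the sigma coverages and the tail probability for someone six foot ten:

```python
# Coverage of +/- k sigma for a normal distribution, and the one-sided
# probability of a height of at least 6'10" (82 in) given mean 67 in, sd 5 in.
from scipy.stats import norm

for k in (1, 2, 3):
    print(f"within +/-{k} sigma:", norm.cdf(k) - norm.cdf(-k))   # ~0.68, 0.95, 0.997

mu, sigma = 67.0, 5.0                     # assumed population mean and sd, in inches
p = norm.sf(82.0, loc=mu, scale=sigma)    # sf = 1 - cdf, the upper tail
print("P(height >= 6'10\"):", p)          # ~0.0013, about three sigma out

for alpha in (0.05, 0.01):
    print(alpha, "reject the null" if p < alpha else "fail to reject")
```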
Again, you use those distributions that we just talked about. You can do another calculation, which is a coin flip test. You can try this at lunch if you want: flip a coin 20 times, and let's say you get heads 14 times out of 20. On average, you'd expect 10 heads and 10 tails, but you could ask whether this is just random variation. So if heads came up 14 times out of 20, we can calculate the probability of that, and it's 5.8%. Our hypothesis is: is this a fair coin? Is it weighted, an unfair coin that's biased in some way, with a heavier part on the tail side or whatever? So again, if we know the probability of that occurrence, we can determine whether this is a fair coin at the 5% level, or at the 10% level, depending on what our alpha is. At the 5% level, it is a fair coin. At the 10% level, it's not a fair coin. So that choice of alpha is kind of arbitrary, but as I said, there's a tendency to use this alpha value of 5%. Okay, so most of this should be something that you're familiar with. I hope; is any of this completely new to people? All right. Now, if a distribution isn't symmetric, we actually have a different case, and we talk about means, medians, and modes. This is a way of dealing with skewed or asymmetric distributions. In the case of a normal distribution, the mean, the mode, and the median are all the same, because it's a symmetric distribution. In an asymmetric distribution, the mean, median, and mode are different. The mean is the average value, and in a skewed distribution it's affected by extreme values, so it's not as meaningful. The median is the middlemost value, and it sort of sits halfway between the mode and the mean. And the mode is the most common value. So the median and mode can be used particularly when things are skewed or biased. Our tendency, at least by intuition, is to choose the mode. Statistically, the preference is to choose the median value. So you can get skewed distributions, but you can also get multimodal distributions. A Gaussian is a unimodal distribution, a Poisson distribution is a unimodal distribution, but you can have distributions with two or three or four large peaks. Usually that means you're dealing with some very distinct populations and you haven't really pulled them apart properly, but it happens, and in some cases it has to be dealt with in a more challenging mathematical way. So I've mentioned the Gaussian distribution, but there's also the binomial distribution, which in its extreme becomes the Gaussian distribution. There's the Poisson distribution. There's the extreme value distribution, which is typically what you get if you take the tail end of a Gaussian distribution and treat that as a population. And that extreme value distribution is actually what you deal with in university. Typically, you're supposed to be taking the top students, maybe the top 40% of students, from high schools. So if we take that and still apply a normal distribution curve to it, we're actually making a mistake. You can argue to all of your professors that they should be using an extreme value distribution when they're grading people. Anyway, there are also other types, like exponential distributions and then just generally skewed distributions. So this is a picture of the binomial distribution. This is the one that occurs when you do the coin flips, with probability p of heads and q of tails. And it's essentially just a polynomial, if you want.
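Circling back to the coin flip for a moment: that 5.8% is just the upper tail of this binomial distribution. A minimal sketch, assuming SciPy:

```python
# Probability of getting 14 or more heads in 20 flips of a fair coin.
from scipy.stats import binom

p_value = binom.sf(13, 20, 0.5)   # P(X >= 14) = 1 - P(X <= 13)
print(p_value)                    # ~0.058, the 5.8% quoted above
# Fair at alpha = 0.05 (0.058 > 0.05), but not fair at alpha = 0.10 (0.058 < 0.10).
```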
And if you raise p plus q to the zeroth power, that gives you one. If you take p plus q to the first power, the coefficients in front of p and q are one and one. If you take p plus q and square it, you get p squared plus two pq plus q squared, so the coefficients are one, two, and one, in front of p squared, pq, and q squared. P plus q cubed is p cubed plus three p squared q plus three p q squared plus q cubed, so one, three, three, one. And you can go on, with n going from zero all the way up to five, ten, and so on. And you can see that the coefficients in front of those terms follow this particular pattern. If you take n up to 100 or something like that, if you want to try it during lunch, you end up with coefficients that look a lot like a Gaussian distribution. Now, a Poisson distribution typically occurs when events are much rarer, and it typically models things like radioactive decay or alpha particle emission. The Poisson distribution can look like an exponential distribution when the mu value is very small. As you start increasing things, as mu increases toward hundreds or thousands, the Poisson distribution actually starts following a Gaussian distribution. So the Poisson distribution, in its limit, becomes a Gaussian distribution, just like the binomial distribution eventually becomes a Gaussian distribution. The extreme value distribution, as I said, is this thing where you take the tail end of a Gaussian distribution and collect it. And in fact, we use extreme value distribution formulas to explain the significance of sequence matching. How many of you have ever done BLAST searches? A few, okay, most. Anyway, that's the extreme value distribution that's used to determine whether you've got significant sequence matches. And as I said, in universities or graduate school, we typically take a cohort of people who did very well in high school, or did very well in university as undergrads, and we move them up. So we are actually sampling an extreme value distribution. It's essentially a strongly skewed distribution, so it's more Poisson-like. And the fact that we apply Gaussian statistics to grades when we're working with extreme distributions is sometimes rather faulty. So if you have a skewed distribution, how do you change it so it's a normal distribution? Of course, you could ask, why change it? This is something that statisticians love to do, but it's also because the theory for most of statistics is based on the Gaussian distribution, the normal distribution. So if you can work with a Gaussian distribution, or if you can transform your distribution so it's Gaussian, then all of these tests, like the t-test, the ANOVA test, and many others, work pretty well. The transformation, changing that skewed distribution, brings some of the outliers closer together, closer to the mean. So it rescales things. The way that's typically done is through a log transformation. When I first started working with statisticians, I was just amazed that they kept on saying, well, let's just take the log, let's just take the log. And it was always a case of, okay, whatever. But if you look at this, this is an example of a skewed distribution on the left, and that's on a linear scale. This is most often seen in things like visual intensity data or acoustic data. Our eyes and ears actually do log transformations.
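Here's a small sketch of that transformation, with made-up log-normally distributed "intensity" data (the numbers are arbitrary, and NumPy and SciPy are assumed): the raw values are strongly skewed, and taking the log pulls them back toward something close to a normal distribution.

```python
# Log transformation of a skewed (log-normal) variable.
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)
raw = rng.lognormal(mean=3.0, sigma=0.7, size=1000)   # skewed, like intensity data
logged = np.log(raw)                                   # the "just take the log" step

print("skewness before:", skew(raw))      # large positive skew
print("skewness after :", skew(logged))   # close to 0, i.e. roughly Gaussian
```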
And when you log transform, you naturally convert that linear scale: this skewed distribution suddenly looks very, very normal. So now you can apply the standard statistics that we've all been taught, or will be taught today. Here's some more realistic data. Again, it looks basically like exponentially decaying functions on the linear scale; do a log transform, and now it looks maybe not perfectly normal, but certainly a lot more normal, or Gaussian, in distribution. So log transformation is very valuable for converting these kinds of messed-up Poisson or exponential or extreme value distributions into Gaussian distributions, which we can use many statistics on. There's another process that we'll do. Transformation is sometimes called normalizing, that is, converting the data into a normal distribution. But we can also do something that people will also call normalization, and it can also be called scaling. Let's say we've got two populations: a normal population in blue jeans, and then we've got these green folks. And let's say we wanted to measure the height of these people. If we were measuring the height and our ruler was miscalibrated, we would have a scaling problem. So even though the height of these people is essentially the same, and you can kind of see that intuitively, it looks like they're all about the same height, or the same population-average height, because we've messed up with our ruler, we end up with two populations having very different heights. Now, this happens. This happens in microarrays, this happens in metabolomics, this happens in proteomics, because our measuring devices are imperfect. So how do we scale or normalize to adjust for that error in measurement? Normalization, or scaling, adjusts for that systematic bias. And if we do that, we'll find that basically they're about the same height, so the two distributions almost overlap with each other. So as I said, there's this issue of normalization, which can mean two things. Normalization can mean this log transformation to get a Gaussian or normal distribution. Normalization can also mean scaling. And it's unfortunate that those two words are both used, so we try to distinguish the two of them in MetaboAnalyst, where we talk about normalizing the distribution, making it Gaussian, and then scaling or rescaling the data to correct for any potential biases. So as I said, normalization, meaning making the distribution normal or Gaussian-looking, is done by transforming it. And then, if we want to adjust for errors or bias in our measuring tools, we scale or rescale things. So it's just a terminology issue. Hopefully it doesn't get people too confused. Okay, so this is all mostly elementary stuff, stuff you could or should have learned in high school or first-year university. Most of us forget it. But now I want to get into something a little more detailed. How do we distinguish two populations? Here we've got the normal blue-jean people, and then we've got the leprechauns. The question is, are they different in height? Intuitively, our impression is, yes, they are. And we can do the measurement and we get this. So the distributions look different, the means look different. But is that significant? How about this population, which we looked at just before, the blue-jean people and the people in green? Are they different? We can take our corrected ruler now and measure them, and we get this. And there's a slight difference. But is that really different?
Is that significant? How many think they are different? Your entire grade depends on this. So how many people think they're the same? Why do you think they're the same? Because they overlap, or? Because of the extent. The extent of the overlap, yeah. So yes, they're probably the same, but the point is, how do you actually quantify that? Because everyone's just saying, it looks like they overlap, but where do I draw the line? If in fact we had a very, very large number of people and this is what the distribution was, they are statistically different. So this is where numbers and measurements actually become important. How do we quantify that? How do we deal with this impression? The answer to that is the Student's t-test. This is for two populations, and it's essentially the way we determine whether two populations are statistically different. So you calculate the t-test statistic, and fortunately this is usually done by computers and programs in R, and if it gives you a p-value, the probability that they are the same, of 0.4, and the alpha value is 0.05, that's 5%, then the conclusion is that the two populations are the same. If the t statistic gives you a p-value of 0.04 and your cutoff is 0.05, then the null hypothesis is rejected and the two populations are different. For t-tests, we use either paired or unpaired t-tests. With before-and-after experiments, treated versus untreated on the same subjects, whatever, those are best handled with paired t-tests. The unpaired t-test is for two randomly chosen samples. So depending on your experimental design, you should use different types of t-tests. You can also use t-tests to determine whether two clusters are different from each other. This is essentially like looking down from overhead on a distribution, similar to maybe those two height distributions that we saw. But this is also something that you can see with a PCA plot, which many of you have probably seen. So in fact, the Student's t-test can be used to distinguish groups or clusters if they follow a normal distribution. So what if the distributions aren't normal? You try a log transformation and things still look like this. Maybe they're bimodal, maybe they're just really bizarre. Maybe you're undersampling. Maybe you don't have 60 or 100 samples; maybe you've got 25, or 20. So the distribution hasn't trended to normal. In this case, the way you get around it is to use the Mann-Whitney U test. So there's a t-test and a U test. The Mann-Whitney is also called the Wilcoxon rank sum test by some people, so same thing, just different names. You use this when you have two non-normally distributed populations; you can also apply it in situations where you'd use a t-test. It's more robust than the t-test, and for non-normal data it can be more powerful. It deals with medians rather than means. So remember: mean, median, and mode. And that's why I said that the median is generally preferred in statistics, as opposed to the mode, which is more intuitive. So in this case, if the U test, not the t-test, gives you a p-value of 0.4 and the alpha that you've chosen is 0.05, the two populations are the same; the null hypothesis is accepted. If it's 0.04, the null hypothesis is rejected. So again, t-tests for normal distributions, U tests for non-normal distributions. The choice of alpha is up to you. The propensity is to choose 0.05, but it's not an absolute. And this is something that many, like 99% of people, forget.
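Here's a minimal sketch of those two-group tests on made-up height data, assuming SciPy; the group sizes and numbers are purely illustrative.

```python
# Unpaired t-test, paired t-test, and Mann-Whitney U test on two made-up groups.
import numpy as np
from scipy.stats import ttest_ind, ttest_rel, mannwhitneyu

rng = np.random.default_rng(1)
group_a = rng.normal(67, 5, size=30)    # e.g. heights of the blue-jeans people
group_b = rng.normal(64, 5, size=30)    # e.g. heights of the people in green

_, p_unpaired = ttest_ind(group_a, group_b)     # two independently sampled groups
_, p_paired   = ttest_rel(group_a, group_b)     # before/after on the same subjects
_, p_u        = mannwhitneyu(group_a, group_b)  # non-normal distributions

alpha = 0.05
for name, p in [("unpaired t", p_unpaired), ("paired t", p_paired), ("Mann-Whitney U", p_u)]:
    print(name, round(p, 4), "different" if p < alpha else "same")
```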
Okay, let's say we had three populations: the normals in blue jeans, the green leprechauns, and then we've got the elves. Are they different? So we plot those out, we measure with an unbiased ruler, and I think again most people will look at that and say, yes, they are different. Now we take these populations and measure their height. We measure those, get the distributions, and again, they're overlapping. Is this statistically different? Now we're dealing with three populations and not two, so we can't use the t-test, we can't use the U test. So what do we do? This is where we use analysis of variance, ANOVA. If you've got three or more populations, use ANOVA; it's a generalization of the t-test. Essentially, by looking at the group variances, it determines whether the means of several groups are equal or not. It uses an F statistic to test for significance. You can have a one-way ANOVA, a two-way ANOVA, a three-way ANOVA, and so on, depending on the number of factors you're dealing with, so it can go all the way up to N. Most people just do the one-way ANOVA, which is just concerned with whether any of those three or four or five populations are different. It's not trying to do a pairwise difference; for that you follow up with pairwise comparisons. So: are any of these different? Are the leprechauns different from the elves plus the normals? Just like with the t-test, you can also use ANOVA to determine whether three or four or five clusters are different, if those clusters follow a generally normal distribution. So again, ANOVA could be used to ask whether clusters in a PCA or PLS-DA are significantly different. Okay, so we've gone from two populations to three and four populations. Once we start doing comparisons with 10, 20, 100 populations, we have a bit more of a challenge. Here's a whole bunch of distributions, some of which are almost overlapping, some of which are quite different. So how do you distinguish N populations, or perform many different tests? This is the challenge where we're doing a whole bunch of pairwise t-tests. We could have looked at all of these and said, okay, let's do a t-test in the top left corner, a t-test in the middle, a t-test on the bottom right corner, and so on. So we're doing hundreds of t-tests. And if we use a standard cutoff p-value of 0.05, what are the odds that one of these findings, because we're doing so many comparisons, could be false? Well, you can calculate that. The probability that none of them is a false positive is 1 minus 0.05, that's 0.95, raised to the power of the number of comparisons, so 0.95 to the power of 100, which comes to about 0.006. So 1 minus 0.006 gives you 99.4%. So which ones are going to be false? There's a near certainty that at least one of those findings is going to be false. So if you want to make sure that none of these 100 pairwise t-tests is going to be false, you could divide your p-value cutoff of 0.05 by 100. So for you to be reasonably certain that none of these comparisons is false, you choose a cutoff of not 0.05, but 0.0005. That's called the Bonferroni correction. How many people have heard of the Bonferroni correction? Okay. So the Bonferroni correction is bad. Anyway, it's still used in GWAS studies, and it was probably misused for a long time in GWAS studies. It's an absolutist kind of correction.
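Before getting to the fix, here's a small sketch pulling those two ideas together: a one-way ANOVA on three made-up height populations, and the back-of-the-envelope calculation of how likely at least one false positive is across 100 comparisons at alpha = 0.05 (SciPy assumed, numbers illustrative).

```python
# One-way ANOVA on three groups, plus the multiple-testing arithmetic from above.
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(2)
normals     = rng.normal(67, 5, size=30)
leprechauns = rng.normal(50, 4, size=30)
elves       = rng.normal(60, 4, size=30)

f_stat, p_value = f_oneway(normals, leprechauns, elves)
print(f_stat, p_value)          # tiny p-value -> at least one group mean differs

# Chance of at least one false positive in 100 independent tests at alpha = 0.05:
print(1 - 0.95 ** 100)          # ~0.994, the near certainty mentioned above
```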
So again, if you were looking at these distributions, taking the second from the top on the left, you can see there's almost no overlap between those two distributions. Yet probably, using the Bonferroni correction, that's not significant. So that defies your intuition, and this is a case where it's too extreme. It wasn't until the 1990s that it dawned on people that this was a nutty correction. And so what they've essentially done is come up with another one, called the Benjamini-Hochberg correction. It's called a false discovery rate. Basically it says, yes, it would be nice not to make errors, but generally it's okay to be 95% correct. And if you use that 95% cutoff, then it's a much softer threshold, something on the order of maybe 0.01 instead of 0.0005. So this is another example that I grabbed from the internet a couple of years ago. Let's say there are probabilities for various weather conditions: an 8% chance of rain today, 5% it will be sunny, 6% it will be foggy, it might snow at 5%, thunder at 16%, 2% lightning, a tornado at 9%. So we can certainly be 100% certain that it will do something today or tomorrow. But which one is going to be significant based on the Bonferroni correction, where we're doing, in this case, I think it's 18 different comparisons? So we take the 0.05 and divide by 18. The only one that will be at a significant level is the eclipse one. So this is, again, fairly extreme. For the false discovery rate, to get away from the Bonferroni correction, the idea is: okay, if we do 100 different tests and we found 20 that produced a p-value less than 0.05, what's the chance that one of those will be false? In that case, we can calculate that roughly one of them will be false, so 1 out of 20. To adjust for that, rather than dividing by 100, we would, to simplify it a lot, divide by 20, and come up with not a p-value but a q-value. That's a false discovery rate corrected p-value. It's not quite that simple, and there are formulas in R that will calculate this for you. So it's the Benjamini-Hochberg false discovery rate correction. And this is how they came up with it. They basically took randomly distributed tests, calculated what their p-values would be for each test, for each of these comparisons, and then ranked them. So it's called a rank fraction. A test that had a p-value of 1 or 0.99 would be ranked at one end, and a test that had a p-value of 0.002 would be ranked at the other end. And you get basically a straight line when things are just randomly distributed. And if you use this cutoff of 0.05, just based on this, and they've done lots of simulations, you might get 10 or 11 that would be considered significant. As I say, the plot is basically a straight line with a slope of 1. If you do this with real data, like biological data or physical data, you don't get this straight line. You actually get a curve that drops and then bends, so it looks sort of like an exponential function. So this is where you're ranking the p-values by rank fraction, and in this particular case, instead of getting 10 or 11 things that are below 0.05, we actually, in this example, got 62. This curvature was something that was picked up in the 1970s and 80s, and people said, there must be something significant here. And so it was extended by Benjamini and Hochberg, and they came up with this formalism and these q-values. And basically, to look at that curvature, they sort of looked at the trends for something that would just be a random set versus something that was a little more realistic, and they re-plotted them. So red dots represent p-values.
And then they corrected those p-values to q-values, and those are plotted in blue. So in the top curve, we've seen that one before, but now the false discovery rate corrected versions are well above p-values of 0.05. They're all hovering around, basically, 0.9, 0.8; a few of them start dribbling down to 0.2. None of them are significant. They've also introduced a little red line, which goes from 0.05 and drops down toward 0.000, that you can just barely see in these plots. That's also part of the correction, to deal with the curvature phenomenon. In the real data, which is in the graph below, you can see that the FDR-corrected values almost exactly match the p-values, and that's just a feature of the distribution. So it's non-trivial to calculate the q-values. But you can see that in the first one, which is just a random distribution, the q-values tell you nothing is significant, not 10 or 11 values as the simple plot would suggest; nothing is significant. With the one below, the q-values basically match, or superimpose on, the p-values. So why am I going into this issue of the false discovery rate and Benjamini-Hochberg? Well, it's because in metabolomics, as you saw yesterday, you're often dealing with hundreds of comparisons: hundreds of metabolites, hundreds of cases and controls. So effectively, you're doing many pairwise comparisons. So what you really want to do, when you're assessing the significance of concentration changes or metabolite changes, is report the FDR-corrected p-values, that is, the q-values, not the raw p-values. And don't report the Bonferroni-corrected ones. Okay. So for pairwise tests, you wouldn't use ANOVA, you'd use a t-test. So two populations, treated, untreated, control, whatever, and you're going to have a whole bunch of comparisons. I mean, you can talk to Jeff perhaps a little more about this. But essentially, what you're dealing with is that you've got many, many values. So in that population, 20 treated, 20 control, whatever, you've measured, through metabolomics, 100 metabolites. So there are 100 values that you're going to be looking at that could be potentially different. So you're going to calculate a t-test. Let's say it was alanine, treated versus control, and to your eyes it looks like there's a huge difference. You calculate the p-value, and the p-value says 0.03. Okay, is that still significant? Given the number of comparisons you're looking at, because you also looked at glutamine, you looked at glucose, whatever, you need to take that whole collection of comparisons, run it through the Benjamini-Hochberg procedure, and get the FDR-corrected value, or q-value. And so if the q-value, instead of the p-value, is 0.02 or 0.01, it's significant. If the q-value is 0.3, it's not significant. So we still use that alpha value of 0.05. You can choose it; you could choose something different. But if we were choosing that alpha value of 0.05, and your false discovery rate corrected p-value, the q-value, is below that, then you can say this alanine difference is significant. If you had done the Bonferroni correction, you would have had to divide that 0.05 cutoff by 100, because you had 100 metabolites, and say that in order for this one to be significant, its p-value had to be below 0.0005. And odds are you won't find any that would be significant. And then you'd throw out your experiment and your results.
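In practice you let software do this correction. Here's a minimal sketch, assuming statsmodels is available, on 100 made-up metabolite p-values; it compares the Bonferroni cutoff with the Benjamini-Hochberg q-values.

```python
# Bonferroni vs Benjamini-Hochberg FDR correction on made-up p-values.
import numpy as np
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(3)
# 90 "null" metabolites plus 10 with genuinely small p-values (illustrative only)
p_values = np.concatenate([rng.uniform(0, 1, 90), rng.uniform(0, 0.005, 10)])

reject_bonf, p_bonf, _, _ = multipletests(p_values, alpha=0.05, method="bonferroni")
reject_fdr, q_values, _, _ = multipletests(p_values, alpha=0.05, method="fdr_bh")

print("significant after Bonferroni:", reject_bonf.sum())
print("significant after BH FDR    :", reject_fdr.sum())
# q_values are the FDR-corrected p-values (the q-values you would report).
```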
And throwing it out would be a terrible waste, because in fact there are probably several things that you found that are significant, and which would be reproduced if you repeated the population study again. Does that make sense? Yes? Any reactions? Yes. Yeah, it assumes independence. And there are often cases where things are correlated. That's one of the things you want to try to deal with as well, if you see those correlations between multiple metabolites or other factors. We're not going to get into too much of that, but it's partly dealt with by some of the other tools that we'll see in MetaboAnalyst. Okay, let's talk about data comparisons and dependencies now. So it could be treated versus untreated, before and after; we're trying to measure one variable against another. Are they independent or are they dependent? In some cases, people also want to see whether something matches a prediction. Here's my formula, and I think I can predict everyone's height based on how much they ate at breakfast today. In these cases, what we're doing is plotting x versus y, before versus after, or prediction versus observed. We do this a lot in biology, and physical chemistry, and physics, and what we generally generate are things called scatter plots. So here's a scatter plot of several hundred people where they plotted the wife's age versus the husband's age, and basically it says that people tend to marry people of the same or nearly the same age. Not a big surprise, but this is a scatter plot, and we can get some information from it. We can see, in fact, that the two are correlated. If there's some dependency between two variables, whether it's before and after or some effect, then that relationship is called a correlation. For most of us, again, this is fairly intuitive. We've seen positively correlated data, like husband's age and wife's age. You can see negatively correlated data; it depends on how you plot it. And then you can see uncorrelated data. Those are the independent variables; they're not linked to each other. And that's sort of related to Martha's question about what we do if data is correlated or uncorrelated or independent. So there are different degrees of correlation. There are things that are highly correlated, which cluster, and basically follow an ellipse. There are things that are poorly correlated: you can see a trend, and for some people it's obvious, other people will say it's not obvious. And then there's the perfect correlation, where all the dots line up in a row. That never happens in real life. Most things always have scatter, and that's just a function of measurement error and randomness. So how do you quantify those things? Again, it's intuition. We can see that there's a trend. Many of us are very good at identifying trends. In fact, people are so good at identifying trends that we tend to over-identify them, based on two points or one point. We'll say, oh, all people who've had a cold end up dying of Alzheimer's disease, or something. That's where we tend to go overboard. But we can quantify this, and we get what's called a correlation coefficient. This is the Pearson product-moment correlation coefficient, or coefficient of linear correlation. It's a formula that's hard to calculate by hand, but easy to do in Excel or on a computer. So r is the correlation coefficient. The r for that cluster that I showed on the left is 0.85, the poor one is 0.4, and the perfect correlation is 1.
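Here's a small sketch of computing that coefficient, assuming SciPy and made-up age data in the spirit of the husband's-age versus wife's-age plot:

```python
# Pearson correlation coefficient r (and its p-value) on made-up age data.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(4)
husband = rng.normal(40, 10, size=200)
wife = husband + rng.normal(0, 4, size=200)    # correlated, with scatter

r, p_value = pearsonr(husband, wife)
print("r =", r, " R^2 =", r**2, " p =", p_value)
# If your software only reports R^2, take its square root to quote r.
```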
So again, statistics is about taking impressions and intuition and mathematically formalizing or quantifying them, and that's what the correlation coefficient does. So if you're coming up with a prediction, a good correlation coefficient is good. It quantifies this relationship, this linkage. It's a measure of either independence or dependence, and usually the relationship is linear. As I say, it's often used to assess predictions, simulations, comparisons, or dependencies. So it's integral to most things that we do in the life sciences, and the physical sciences too. I have a bit of a pet peeve, and it's courtesy of the folks at Microsoft and Excel. The correlation coefficient is what we just described. There is also a thing called the coefficient of determination, and that's R squared. Many people call R squared the correlation coefficient; like 99% of people call it that. But that's not what it is. Formally, it's the coefficient of determination. The strong preference, historically, before Microsoft came along, was to just use r and to quote that as the correlation coefficient. If you call R squared the correlation coefficient, that's wrong, technically wrong, semantically wrong. So if you're using Microsoft software, as I think everyone in this room does, take the square root of the R squared and quote that as your correlation coefficient. So correlations can be used to hide things or misappropriate ideas. Here's a plot with literally hundreds of points all shown, and you can see a correlation coefficient of about 0.85. So can you say, is this significant? Is this relationship real? Is the prediction pretty solid or robust? And then you have another example where someone's just taken three points, perhaps very selectively, taken at sort of extreme positions, and they say, I get a correlation coefficient of 0.99. Do you think that's significant? Does anyone think it is? So again, this is your intuition saying, well, that's not enough, I'd like to see more. And just like the sampling issue that we saw with t-tests and distributions, we'd like to see a larger sample size. We could have an example where we had this nice plot that gave us 0.99, and then they did two more measurements, so now they've increased it to five points, and you can see it's no longer correlated. So this is a sampling issue, and again, we've talked about needing generally 30 or 40 points to get a good sample. Another trick that people can use to get good correlation coefficients, and it's done very frequently, is to measure a whole bunch of things at one end and then another bunch of things at the other end, but nothing in between. If you see a plot like that, you should be very dubious about whether the correlation is appropriate or significant. And then, as I say, people can also selectively measure things, just report a small number, and say, my correlation is great. So you can use things like t-tests to measure the statistical significance of a correlation, to determine whether the slope of the regression line is actually different from zero. And basically, the more points in the calculation, or in the scatter plot, the greater the level of confidence you have in that correlation. There's also a situation you'll see where you've got this great correlation plot, you've measured it for hundreds of points, and then suddenly you get an outlier. And outliers can be good or they can be bad.
So if you're predicting or simulating things and your formula or predictor says, this is where it should be, and the measurement is way off at the other end, well, it says something's wrong with your model. But it can also be a question for the experimentalists: are you sure you really measured this correctly? In some cases, it turns out it's an experimental error. Typically, when looking at tables of concentration values in metabolomics, or in proteomics, or in microarrays, or in RNA-Seq, you actually like to see things sort of off the curve. You want to see those outliers, because that usually represents something significant. So when we evaluate predictors, we often like to use correlation coefficients. If something is trying to predict a person's weight based on their height, or a person's BMI based on glucose levels, we tend to assess those predictors based on correlation. Those are continuous values. If we're dealing with binary values, say we take a person's pulse and their glucose reading and their BMI, and then we want to say, are you sick or are you healthy, that's a binary measurement. You can have the same sort of binary measurement when it's a SNP, a G SNP or a C SNP at some base, or in the secondary structures of proteins, whether it's a helix, a beta strand, or a coil. So in those cases where we're using predictors, whether it's secondary structure, gene or not a gene, SNP or not a SNP, we have other terms called sensitivity and specificity. And these relate to the ideas of a true positive, a true negative, a false positive, and a false negative. So if we have, let's say, a population of 100 people, and they're all lined up by first name, last name, alphabetically, we could look at how many of them were predicted to be in a certain state and how many were not in that state. What I've shown here is a binary state, drawn as either a line or a blue bar. The line is one state, the blue bar is another state. You could also say this is gene versus non-gene, it could be secondary structure, or it could be, as I say, a whole bunch of people ordered alphabetically who are sick or healthy. Sick is the line, healthy is the blue bar. And then we use our prediction. The prediction is down below, marked in red. So we predicted the first group nicely, and there we have the true positives. But then we've overpredicted in red, and those are false positives. Then we have a whole bunch that are the true negatives. But then we've underpredicted, and now we have a bunch of false negatives; you can see the red is shorter than the blue bar. So these can be evaluated using terms called sensitivity, specificity, and precision. Sensitivity is essentially a measure of how many false negatives you produce, and specificity is a measure of how many false positives you produce. There are formulas for calculating sensitivity, specificity, and precision, and these are important formulas, particularly useful when you're doing treated versus untreated, case versus control comparisons. In the realm of computing science, people generally don't report sensitivity and specificity; they report precision and recall. Recall is the same thing as sensitivity, and precision plays a similar role to specificity, although it's calculated from the predicted positives. In the world of life science, we tend to use specificity and sensitivity. So just be aware of the different terminology.
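Here's a small sketch of those quantities using the usual definitions, on made-up counts of true and false positives and negatives (the numbers are arbitrary):

```python
# Sensitivity, specificity and precision from a made-up confusion matrix.
TP, FP, TN, FN = 40, 10, 35, 15

sensitivity = TP / (TP + FN)   # recall: fraction of real positives we actually caught
specificity = TN / (TN + FP)   # fraction of real negatives correctly called negative
precision   = TP / (TP + FP)   # fraction of predicted positives that are real

print(sensitivity, specificity, precision)
```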
So how do you measure sensitivity and specificity at the same time? The way you do that is through the receiver operating characteristic curve, or ROC curve. That's a graphical plot that lets you look at sensitivity versus specificity, so the true positive rate versus the false positive rate. It was originally developed in the 1940s by electrical engineers, radar engineers, to detect enemy planes flying over England. They wanted to know whether they were actually mistaking planes for flocks of geese, or not. So there was the true positive: yes, it was a German plane. The false negative: it was a miss, we didn't see it in time, it bombed London. The false positive: a false alarm. And then the true negative: nothing to worry about. And so they would plot these things, sensitivity versus specificity, and you'd see these lines. A really random predictor would produce a straight line with a slope of one, but a better predictor, a better radar system, would have these curves that climb. You can actually calculate a ROC curve by hand, if you want. Say you have the radar system and you're trying to distinguish between flocks of geese and German planes. With the radar turned off, you would not detect any German planes, so the sensitivity is zero; and if all the flocks of geese are correctly identified, the percent of flocks of geese incorrectly identified, one minus specificity, is also zero. If you turn your radar up to setting one, now you're detecting 35% of the German planes, and you're correctly identifying the flocks of geese 93% of the time, so your one minus specificity is 7%. You change your setting again, and you're up to 60% sensitivity, and specificity is down to 85%. And as you fill out the values in this table, you can plot them, sensitivity versus one minus specificity. In this case, we plot 1, 2, 3, 4, 5, 6 values, and you can draw lines connecting those points. That creates a ROC curve. So you can calculate them by hand, or you can have a computer calculate them. A good ROC curve should look like a logarithmic curve: it should go almost straight up and then bend over, flattening out. A poor ROC curve would be a straight line with a slope of one. And then you can measure the area under the ROC curve. The AUC, the area under the ROC curve, is actually a measure of the quality of whatever you're evaluating. It could be a measure of the radar's sensitivity, or of the biomarker that you're trying to develop into a metabolomics test. As a rule, if you get an area under the curve greater than 0.75, that's pretty good. An area under the curve of 1 is perfect; that rarely happens. So this is what ROC curves look like. There's one that's random, that's in blue, and then the ones in turquoise or purple, those are ROC curves that are exceptional. And then there's one that looks like an upside-down L; that has an AUC of 1. So again, this is something that you'll be dealing with a little later, but these are very important ways of assessing binary classifiers. For continuous-variable classifiers, that's the correlation coefficient; for binary classifiers, healthy versus sick, use ROC curves.
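Here's a minimal sketch of building a ROC curve and its AUC with scikit-learn, on made-up detector scores (planes versus geese, or sick versus healthy; the data are invented):

```python
# ROC curve and AUC for a made-up binary classifier.
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(5)
y_true  = np.concatenate([np.ones(50), np.zeros(50)])                    # 1 = plane, 0 = geese
y_score = np.concatenate([rng.normal(2, 1, 50), rng.normal(0, 1, 50)])   # detector output

fpr, tpr, thresholds = roc_curve(y_true, y_score)   # 1 - specificity vs sensitivity
print("AUC =", roc_auc_score(y_true, y_score))      # > 0.75 is decent, 1.0 is perfect
```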
Okay, clusters. So here's a plot where we've plotted some population's height versus weight. Those are continuous variables. And we could draw a line through those points and produce a regression line, and it gives us a correlation coefficient of 0.73. So there's a relationship between height and weight. But is that the right thing to do with this particular scatter plot? If you actually color things, in this case according to gender, you'll see that the pink dots are female and the blue dots are male. So in fact, yes, there is a correlation, but it might be more interesting to cluster these things, because there's more information in this grouping. So whether you do a correlation or clustering can tell you different types of information. Clustering is something that we do in addition to computing correlations. We use clustering a lot. We'll use it today in metabolomics. Clustering is done very frequently in microarray, RNA-Seq, and gene chip analysis. It's done in proteomics and protein chip analysis. Clustering is done in protein interaction analysis. It originally evolved in phylogenetic and evolutionary analysis. We use it to cluster structures, proteins, and chemicals, and protein sequence families. So it's ubiquitous in most fields of omics science. Clustering is a formal process; it's mathematically well defined. It's a way in which we group objects that are logically similar in their characteristics. Formally, clustering is different from classifying. For classifying, we have to know something about, or have created, predefined classes. In clustering, we don't know anything about those classes; we assume nothing. However, if we've done a clustering process and we see obvious clusters, then we can start saying, oh, maybe we should classify these things. So in this case, we saw some clusters, and it clustered this way. In this set, we have now classified the clusters into male and female; the information has been added on. And so in that respect, this isn't so much clustering as classification. To do clustering, you have to have a method of measuring similarity. You can use a coefficient of similarity or a coefficient of dissimilarity, or you can use a similarity scoring matrix; that's how we measure sequence similarity. You also have to have a threshold value to decide whether something is a member of the cluster or not. And then you also need a way of measuring the distance between two clusters: a Euclidean distance of some kind, a Jaccard distance, whatever. You also, to start clustering, have to have a seed, an object to start the process. So there are three different types of clustering algorithms. There's the k-means or partitioning method. There are hierarchical methods, which we see more often in phylogeny or evolutionary plots. And then there are SOMs, or self-organizing feature maps, which produce a cluster set through training. The k-means method of clustering essentially takes an object and defines that as the center, or centroid, of a cluster. Then you grab another object and calculate its similarity to each existing centroid. If it's greater than a threshold, the object is added to the cluster. But once you've added that second object, you now have to redetermine the centroid. If it's not similar enough, you grab another object to start a new cluster, or use the object you just measured to start a new cluster. And you repeat that process. You could use this for, say, colored billiard balls as an example. We have this initial group, we determine a sort of centroid based on the average color of these things, and we may have a measurement rule based on the absorbance wavelength of the color, which defines the color we see, and we use plus or minus 50 nanometers of the centroid color.
So we choose another ball, compare its color to the centroid color, and if it's close enough, and I think you'll see the two colors are close enough, you join it. And then you grab another one and carry on. If you grab the orange one, it's not close enough, and so it would not be a member of that cluster. So that's k-means. Hierarchical clustering is, I guess, a little more intuitive. It's the way we might match socks if you've done laundry and have a whole bunch of colored socks. You're trying to get the red socks matched, and not just the red with the orange, but you want the two red socks together, and the two blue socks together. But if you've got a blue and a purple sock, do you put them together? So with sock matching, basically, you look for the two closest objects and merge them, pair them up, then find and merge the next two closest objects. In some cases, let's say you've got 10 white socks and they all look the same, you might just make one big stack of white socks and put them all together. So that process is repeated. Again, you start with your initial collection, and then you do a pairwise comparison. It's something all of us are familiar with when we're looking at colored socks; you ask which ones look the most alike. In this case, we don't have any socks, we're looking at balls. And so the light blue and the turquoise ball are the most similar, so they form a pair. Then the next slightly darker turquoise ball is the next most similar. And if we do another comparison, the dark blue ball would probably be the next one that's joined. And then the orange ball, which is very different, would be in its own cluster, on its own. So as I say, hierarchical clustering is most commonly used. We use it in microarrays, we use it in metabolomics, we use it in proteomics, we use it in evolution. And this is just the representation where we show, by the length of the connecting bars, the level of similarity. So if it's a case of metabolite expression values, you find the first pair that have the most similar metabolite expression profiles, then the next closest pair, and then you iterate. Eventually you've structured this graph, where you can see essentially a heat map illustrating similarities in overall metabolite expression level over maybe multiple days, where green is increased expression and red is decreased expression. And you can see the patterns that exist: one group where it starts off at nothing, climbs, then falls; another group where it starts off at nothing, then climbs; other ones where it oscillates up and down; and others where the metabolite expression drops. So k-means and hierarchical clustering are two methods that are very important for obtaining information, both for visualizing and for understanding omics data.
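Here's a small sketch of both approaches on a made-up height/weight table, assuming scikit-learn and SciPy; the data and cluster counts are purely illustrative.

```python
# K-means and hierarchical (agglomerative) clustering on made-up height/weight data.
import numpy as np
from sklearn.cluster import KMeans
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(6)
group1 = np.column_stack([rng.normal(64, 3, 50), rng.normal(140, 15, 50)])  # height, weight
group2 = np.column_stack([rng.normal(70, 3, 50), rng.normal(175, 20, 50)])
data = np.vstack([group1, group2])

km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(data)

tree = linkage(data, method="average")               # merge the closest pairs first
hc_labels = fcluster(tree, t=2, criterion="maxclust")
print(km_labels[:5], hc_labels[:5])
```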
In the case of metabolomics, rather than height and weight or eye color, we're measuring 20 amino acids, all the sugars, all the organic acids: 100 variables or more. When we're measuring hundreds of variables, we're in the realm of multivariate statistics, and we're dealing with multidimensional analyses. So we'll talk about dimensions, and what you want to do to make things simpler is to reduce those dimensions. That's called dimension reduction. In a typical metabolomics experiment, we have one cohort, the A's, that are treated, and the B's, which are not treated or are given a dummy or sham treatment. We'll have biological replicates; in the case of a mouse study, we might have 30 controls and 30 treated. We may also have technical replicates, where we take a sample and split it, just to make sure that different handling or different storage conditions haven't altered things. Regardless, we're going to be measuring lots and lots of data from those sets. It might be that, including both technical and biological replicates, we've got dozens in the treated group and dozens in the control group. As I said before, with metabolomics we're measuring many types of metabolites from these animals or samples, and any given animal will produce 100 or more variables related to concentrations and chemical types. So essentially everything you do in metabolomics requires multivariate statistics. The way statisticians have dealt with multivariate statistics is to try to reduce the number of variables and make the problem more like univariate statistics, which we spent the last hour on. Univariate statistics is perhaps more intuitive: we understand normal distributions; we don't understand how to compare two clusters and say they're different. If you can reduce the dimensions, you can apply the same rules we just learned about p-values, t-tests, ANOVAs, and so on, things that we understand, that are more intuitive, or that we learned in high school or first-year statistics. So to make multivariate statistics work in the univariate world, we do dimension reduction. And the best-known way to do dimension reduction is principal component analysis. It transforms a whole bunch of potentially correlated variables, where we want to get rid of the interdependence and make everything independent, into a small number of uncorrelated variables. It looks for correlations, groups the correlated variables together, and essentially reduces the data to only the uncorrelated or independent combinations. Those are called the principal components. So it's possible to reduce literally thousands of variables to maybe two or three key features, and at that point we can start using things like ANOVA, t-tests, U-tests, and so on. Here's an example where we've got a whole bunch of spectra, maybe from hundreds of animals, with hundreds of peaks and hundreds of metabolites. These are some exemplars of one treatment with a drug called PAP, another treatment with ANIT, and another where we've got controls. We could have been dealing with hundreds of spectra from hundreds of animals, and if we cluster it, we can plot it in what's called a scores plot, where it's principal component one versus principal component two. And from that, we can see three obvious clusters: the controls, the ANIT-treated, and the PAP-treated.
So now, what would have been almost impossibly hard to look at, and hopefully you appreciated this yesterday when you were staring at tables and tables of data values and struggling to see any trends, becomes tractable. With principal component analysis, which you'll do later, you'll be able to see those trends quite clearly. As a rule, if you've done this dimension reduction with principal component analysis and you can't visually detect some clustering, then doing more statistics on the data probably won't help. This is where intuition comes in: if you're not seeing obvious clusters with your eye, there's probably nothing there. So how do you understand principal component analysis and dimension reduction? One way of thinking about it is to think about what we're eating today, which is bagels. You can do the PCA of a bagel. If you've got a bagel hanging in space on a thread, and you have a flashlight, you can shine two-dimensional projections of the bagel. If you shine the flashlight one way, you get a nice O-shaped picture. If you shine it the other way, onto the side of the bagel, you get something that looks like a wiener. So you can see the two shadows there. Which projection captures the most variation? Which is most bagel-like? Well, the projection that gives you the O is most bagel-like. That one captures the largest variation and has the largest eigenvalue, and that's what we use, so it's usually identified as principal component one. The wiener projection captures the fact that the bagel has some depth, some volume; it tells you how thick the bagel is. That's principal component two. And in fact, those two principal components are sufficient to essentially describe the bagel. It's the same sort of thing we use in projections for building diagrams or architecture. PCA involves the calculation of eigenvalues from a data covariance matrix. You're looking for correlations or covariation, and you use singular value decomposition to help identify those eigenvalues and eigenvectors. Formally, in mathematics, it's an orthogonal linear transformation: it takes the data and translates it to a new coordinate system so that the greatest variance in the data lies along the first principal component, the second greatest variance lies along the second principal component, and you can carry on to three, four, five, six, and seven. As a rule of thumb, it's not worth plotting or considering much more than three principal components. So you have a table of variables, which might be your metabolite concentrations as the columns, with the samples as the rows. What you're trying to do is convert that table of values into a collection of scores, built from eigenvectors that are uncorrelated and orthogonal, and a set of weightings, called loadings, that you multiply the data by. So a score t1 might be the loadings p1, p2, p3 multiplied against your concentrations: valine for x1, leucine for x2, glucose for x3, and so on. This is something you don't do by hand; you let a computer do it, and this is essentially what singular value decomposition does. It's linear algebra. If you've taken linear algebra, you've probably tried this at some point on small matrices, but it's not something that is trivially done.
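To make that concrete, here's a minimal sketch of the scores-and-loadings calculation using NumPy and scikit-learn on a made-up samples-by-metabolites matrix. The sizes and values are arbitrary, and this is only an illustration of the decomposition, not the exact workflow you'll run later in MetaboAnalyst.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical data matrix: rows are samples (animals), columns are metabolites
# (e.g. valine, leucine, glucose, ...). Values here are just random placeholders.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 100))          # 60 animals x 100 metabolites

# Mean-center each metabolite; PCA works on the covariance structure
Xc = X - X.mean(axis=0)

# Fit PCA and keep the first three principal components
pca = PCA(n_components=3)
scores = pca.fit_transform(Xc)          # the t's: one row of scores per sample
loadings = pca.components_.T            # the p's: one row of loadings per metabolite

# Each score is a weighted sum of the original variables: scores = data x loadings
print(np.allclose(scores, Xc @ loadings))
print(pca.explained_variance_ratio_)    # fraction of total variance per component

# A scores plot is just scores[:, 0] vs scores[:, 1];
# a loadings plot is loadings[:, 0] vs loadings[:, 1].
```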
One way of visualizing a PC analysis comes from an example Roy Goodacre gave some years ago, which I thought was really neat. What he was asked to do, and this was a real project given to him, was to take data about airports in the US. There were about 5,000 airports: little regional airports, major airports, landing strips on highways. They tracked the latitude, the longitude, and the altitude of all these airports, and for whatever reason they wanted some PC analysis done on that airport data. Maybe they were thinking they might find trends, that airports tend to be higher up, or lower down, or more on the east coast, whatever. Anyway, he took this data, given to him by the national airports association or whatever it was, and he calculated the PCA plot. So what do you think you'd get? Basically, what he got was a map of the US. You can make out the continental US, there's Florida, there's Texas, there's the coastline. You can see Alaska, which is a bit fuzzy, you can see Hawaii way off in the corner, and you can see Puerto Rico off of Florida. Those are the first two principal components, and they turned out to be basically latitude and longitude. The third principal component was this bizarre-looking thing in the corner, which was just the altitude. Anyway, people have formally shown that principal component analysis is essentially mathematically equivalent to k-means clustering. Remember the clustering algorithm we talked about, the one for grouping balls. So PCA and clustering are essentially the same thing. Intuitively, even though it's a complex transformation involving singular value decomposition with eigenvalues, eigenvectors, and everything else, it's basically clustering in the spirit of k-means. But it still achieves the same end, which is that it allows you to do dimension reduction. And it gives you clusters that are roughly normally distributed, with means and variances, so it's possible to use things like t-tests and ANOVA to determine whether those clusters are real or significant. You can take a cluster, plot it along PC1 and PC2, and actually use ANOVA to determine whether the clusters are significant. Most people don't do that; most people simply say, yep, looks like there are three clusters there. That's OK. When we generate PCA plots, we generate either a scores plot or a loadings plot. Remember the variables we talked about a moment ago: the scores equal the data multiplied by the loadings, where the loadings are the p's and the scores are the t's. So if you've got t's and p's, you can plot t's and p's. This is a picture of a scores plot. It's in three dimensions, with red, blue, and green points, and you can see three clusters, or maybe four, depending on how you group them. And here's a loadings plot. The loadings plot is not shown as often; the scores plot is shown more frequently because it tends to show the clusters along the principal component axes and tells you there's some structure to your data, some clustering that's obvious or not obvious. The loadings plot shows you how much each of the variables, in this case each metabolite, contributes to the different principal components. As a rule, it's the variables at the extreme ends and corners that contribute most to the separation.
So if it was a two-dimensional scores plot and you saw two clusters, one on the right side and one on the left, you would look at your loadings plot and say, well, the things really differentiating these groups are the metabolites labeled on the far right and the metabolites on the far left. If the clustering ran from the top right corner down to the bottom left corner, then you'd look to see which metabolites sit in the top right and which sit in the bottom left, because those contribute most to that separation. So you essentially overlay your loadings plot on your scores plot to see which metabolites are really driving the separation. OK, so once you've done principal component analysis, in some cases you won't succeed in getting any clear clusters or groupings, whether you're using the first two or first three components, or you try comparing principal component four with principal component five, which is not recommended. In that case, it's better just to say the data aren't clustering. This is where your eyes are not lying, and it goes back to the point that statistics is about impressions and intuitions. If PCA fails to achieve even a modest separation, then it's not worthwhile doing other statistical tests to force one. This is a mistake people make, because you can do some trickery to make things separate, and a lot of people do it because they just don't want a negative result. So there is another tool, which is an example of such a trick, to make things separate when they can't or shouldn't: partial least squares discriminant analysis. This is not clustering; it's classification, a supervised classification method. It says: you know the answer, you're going to use labeled data, and now you're going to let the answer produce a separation. PLS-DA uses prior knowledge; PCA uses no prior knowledge. PLS-DA essentially enhances the separation by rotating the PCA components so that the separation among the classes is maximized. So it's sort of a mathematical trick, and it will enhance your separation. If you have no separation, you can run PLS-DA and you'll get separation; it pretty much always guarantees some separation. So it's a trick that can give you an answer, and it's an abused trick. How can you make sure you're not fooling yourself? What you need to do with PLS-DA models is validate them, to make sure you haven't put in so much information that you're basically filling in the answers for the test from the answer key. The way we assess the quality, robustness, or validity of a PLS-DA separation is to measure what are called the R-squared and Q-squared values, or to perform permutation testing. This is how we make sure that the separation we're seeing is real and not a trick. As I say, PLS-DA is a lot like PCA: it gives you these clusters and usually separates things pretty nicely. So how do you check? One way is to see whether the R-squared and Q-squared values are appropriate. R-squared is the coefficient of determination, not the correlation coefficient, and it refers to the goodness of fit or the explained variation; it ranges from 0 to 1. Q-squared is the predicted variation, or the quality of the prediction. Remember, PLS-DA is a model that makes predictions. Q-squared also ranges from 0 to 1.
Q-squared and R-squared are highly correlated, so they track closely together: a high Q-squared usually means a high R-squared and vice versa. Basically, a high R-squared indicates that the PLS-DA model is able to mathematically reproduce the data in the data set. A poor model will have an R-squared, just like a poor correlation, of about 0.2 or 0.3; a good model will have an R-squared of about 0.7 or 0.8. These are just rules of thumb. To make sure you don't have an overfit, because PLS-DA is building a model, Q-squared is often preferred. It's measured through cross-validation or permutation testing to determine how robust the model is and whether it suffers from overfitting. The rule of thumb is that if the Q-squared is greater than 0.5, you're OK: you don't have an overfit and you haven't cheated. And if you can already see from your PCA that there are two hugely distinct clusters, you can pretty much guarantee the Q-squared will come out around 0.9. So Q-squared just spits out a value like that. You can also do something that is, at least to me, a little more appealing, which is permutation testing. Say in the PCA you got this blue plot here, and most of us looking at it would say, I can't see two clusters there; it looks like one cluster. That might be the point where you throw up your hands and say, I give up. Now, if you label the points, say blue for male and red for female, you can see that there is in fact a bit of a trend: the red cluster sits mostly below and the blue cluster mostly above. They overlap, yes, but there's a trend, so maybe we're not so badly off after all; there is a difference there. So we then try partial least squares discriminant analysis. We apply it to the same data, and now we get this huge separation, blue on one side and red on the other, and we say, oh great, I think we've got something. Well, don't get your hopes up too fast. What you've got to do now is validate that, and this is what the R-squared, Q-squared, and permutation testing are for. What we can do is go back to our original data and relabel it. In this case, we randomly say one set is male and another set is female; you just use a random number generator to assign the labels. So we end up with a new, random labeling, shown down below. That's called the permuted data. And you'll notice that with the new labeling we don't see any trend anymore; there's no separation in the PCA. Then we conduct a PLS-DA on the permuted data, and the PLS-DA also gives us something that doesn't look like there's any separation whatsoever. So that's one permutation. We can relabel the data again and perform another PLS-DA, and look to see if there's a separation. And we can do it a third time and a fourth time; typically we do it 2,000 times, relabeling the data in all kinds of random ways, swapping who's called male and who's called female, and each time testing how well the groups separate. Then, after those 2,000 iterations, we plot the separation scores, and that's shown in the little graph on the left. You can see that in most cases the permutations don't really separate: the scores form a normal distribution. But there's this one, which is our original labeling, with a fantastic separation score. And based on that, we can calculate a significance value; in this case, maybe the significance value is 0.002 or something like that.
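Here's a minimal sketch of that permutation procedure, using scikit-learn's PLSRegression as a stand-in for PLS-DA with a two-class label coded 0/1. The data matrix and the particular separation score are made-up choices for illustration, not the exact statistic a tool like MetaboAnalyst reports.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 100))          # made-up metabolite matrix (samples x metabolites)
y = np.array([0] * 20 + [1] * 20)       # true class labels, e.g. male / female

def separation_score(X, labels, n_components=2):
    """Fit PLS against the labels and score how well the two groups separate
    along the first PLS component (difference of group means / pooled SD)."""
    pls = PLSRegression(n_components=n_components).fit(X, labels)
    t1 = pls.x_scores_[:, 0]
    a, b = t1[labels == 0], t1[labels == 1]
    pooled = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    return abs(a.mean() - b.mean()) / pooled

observed = separation_score(X, y)

# Permutation test: shuffle the labels many times and rebuild the model each time
n_perm = 2000
null_scores = np.array([separation_score(X, rng.permutation(y)) for _ in range(n_perm)])

# Empirical p-value: how often does a random labeling separate as well as the real one?
p_value = (np.sum(null_scores >= observed) + 1) / (n_perm + 1)
print(observed, p_value)
```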
So in this case, we could say that even though the cluster wasn't obvious in the PCA, we performed PLS-DA and we validated it, either with the Q-squared value or with this permutation test. And if the permutation test gives us a significance value of 0.002 or whatever, then yes, this is real. The separation is real, the groups are distinct, and we haven't been tricked by the statistics. On the other hand, if we didn't have that peak off in the corner, but instead just produced that Gaussian distribution, then we would have to say that, in fact, no, there are no clusters, there's no separation, and we were tricked. So validation is very, very important in PLS-DA, and it's not done often enough or well enough in many cases. So how do we assess a PLS-DA plot? We talked about scores plots and, sorry, there's a question. So what we would do here is say that you're female and Dinah's male, and we take the metabolite values we measured for each of you but with the labels switched. And what do we get? If we relabeled everyone, or just switched your names around randomly, we'd come up with a new labeling scheme, and you can imagine the thousands of combinations just from the 30 people in this room: we could relabel all of us by swapping each other's names. That's the relabeling that's done. But your metabolite values haven't changed; it's just the naming. So if we're trying to distinguish male and female based on metabolite levels, and there truly is a difference between male and female metabolites, we should get a separation with the real labels. But once we've done the random relabeling, we shouldn't see that separation anymore, because we've scrambled it. So does that make sense, this relabeling? Yes? So the relabeling process is essentially a model validation step. It originally came not from statisticians but from machine learning people, who were developing models that, in many cases, seemed to learn too well, and they wanted to make sure they weren't tricking themselves or overtraining. So permutation testing is computationally intensive, but it's just a way of calculating whether something you've seen is statistically significant. Yes? So the point of permutation testing is that you generate a null distribution based on the data you have, then you look at where your observed result falls on that distribution, and if it's out in the tail, say the top 5% or beyond, you can call it significant. You're basically saying: since PLS already tends to separate your data too well, why not generate a null distribution from that and see where the real result falls? That's right. That's right. I guess the concern I have is that if you're forcing something, when it comes down to the biology, is it significant, or are you trying to see something that may not actually have any impact? So with PLS-DA, we're not forcing anything. It's just helping reorient what was shown in, say, a two-dimensional PCA plot into something a little different, maybe a slightly different dimension, a slightly different component, so the separation is better than what was originally seen with just those two principal components. It's not so much enhancing the separation; the separation exists.
It's just that we didn't pull it out; we didn't have enough knowledge to pull that separation out. Usually, in any experiment, you do know which samples are controls and which are treated, so you've added a little bit of information to the clustering process. That's what PLS-DA uses. That's OK; it knows the answer, but you know the answer anyway. It now allows you to separate things more obviously. In that regard, it's modified the principal components in a way that lets you see the projection a little better. So it's not forcing it. Can I give you an example of what we're thinking about? With the whole metabolite list we had yesterday, you might see tyrosine pop up in a small group of people. So yes, that one pops up. But are you then putting too much emphasis on it just because the tyrosine levels happen to differ in that group? Yeah, well, that's again where this permutation testing would help sort out whether that's really what the model is picking up. It also comes out of the VIP analysis, which identifies whether tyrosine was actually playing an important role, and we'll get to VIP analysis in a moment. Those are the things where, when you look at your data a little more closely and ask, OK, I see a nice separation, but what's driving it, the answers come out. And this is where having a significance value, a p-value or whatever, helps you say: yes, it's significant, yes, it's real, yes, it is this metabolite. Connie? Yeah. So this is where it's useful and informative to include other variables in your collection. If you have clinical data or metadata, where people live, gender, body mass index, ethnicity, anything, you can include that as part of your metabolomic data in addition to the metabolite concentrations. Clinical variables can be included, and they'll be handled: the analysis will look for correlations, because that's partly what both a PCA and a PLS-DA do; they look for covariance. So if, in fact, the levels of tyrosine vary based on place of birth, that would show up, it would be bundled as a covarying collection, and it might be that tyrosine becomes a less important variable. You can also do it more systematically: you can analyze for those kinds of variations and remove that variable, or reduce its importance through weighting. There are lots of things a trained statistician can do, but most of us aren't trained statisticians, so the best route is to include the metadata in your metabolite lists. It's perfectly fine and very feasible to do that. Yeah. Yeah. OK. So we're getting on to how to assess the model. We talked about the scores plot for PCA; for PLS-DA, rather than the scores plot, we use VIP plots, which stands for variable importance in projection. A VIP plot estimates how important each variable is, the metabolites in this case, or clinical or physiological variables, in driving the PLS-DA separation. The general rule of thumb is that if a VIP score is greater than 1, the variable is important, and if it's less than 1, it's maybe not so important. These are some examples of VIP plots. You can see the scores start at maybe 0.8 and go up to about 1.6 or 1.7, and you can see some metabolites: an acylcarnitine, propionylcarnitine, carnosine, butyrylcarnitine, methionine, aspartate, proline. You can see how important they are; the ones up in the upper right corner are the important ones.
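As an aside, here's a minimal sketch of how VIP scores can be computed from a fitted single-response PLS model, again using scikit-learn's PLSRegression as a stand-in for PLS-DA. It follows the standard VIP formula, but the data matrix, the two-component choice, and the class labels are made-up assumptions for illustration.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(2)
X = rng.normal(size=(40, 30))           # made-up samples x metabolites matrix
y = np.array([0] * 20 + [1] * 20)       # two classes, e.g. healthy vs disease

pls = PLSRegression(n_components=2).fit(X, y)

def vip(pls_model):
    """Variable importance in projection for a single-response PLS model."""
    t = pls_model.x_scores_             # sample scores, one column per component
    w = pls_model.x_weights_            # variable weights, one column per component
    q = pls_model.y_loadings_           # y-loadings, shape (1, n_components)
    n_vars, n_comp = w.shape
    # Variance in y explained by each component
    ss_y = np.array([(q[0, a] ** 2) * (t[:, a] @ t[:, a]) for a in range(n_comp)])
    w_norm = w / np.linalg.norm(w, axis=0)          # normalize weights per component
    return np.sqrt(n_vars * (w_norm ** 2 @ ss_y) / ss_y.sum())

scores = vip(pls)
important = np.where(scores > 1)[0]     # rule of thumb: VIP > 1 means important
print(scores.round(2), important)
```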
The VIP plots also indicate the direction of the difference: in this case, for EoE the metabolite is elevated and for healthy it's decreased, and then for another one, the C14, it's the opposite trend. So the VIP plot tells you what's up and what's down relative to, in this case, the controls. And then you can see another VIP plot for another condition, or actually I guess it's the same condition, but maybe urine versus serum, I can't remember. These are just VIP plots, and they contain information similar to what we saw in the scores plots and the other PCA outputs. There are other kinds of classification methods besides PLS-DA. There's a technique called SIMCA, soft independent modeling of class analogy, and there's OPLS. There are also machine learning methods that can do classification: support vector machines, random forests, naive Bayes classifiers, neural networks. These are tools, some of which are actually available in MetaboAnalyst, that will do these kinds of separations; they learn to identify classes and figure out what should be used to separate them. I think a useful analogy for how to approach the data is attacking the giant fortress of data that we collect in the big data world. For the light artillery, like the catapults, we typically use unsupervised methods to see whether our data has some trends in it and is useful. PCA, which is a form of factor analysis, is one such technique, and as we showed, PCA is basically mathematically tied to k-means clustering. So that's the first line of attack; it's what you should use to see if there are any interesting trends. Then, if you're seeing a trend but it's not as clear as you'd hoped, you start doing something like partial least squares discriminant analysis, or PLS regression, to see if you can get something a little stronger. It's a little more powerful, so from the catapult you go to the cannon, and that might knock down the wall and get you a little further. If that's still not doing what you need, you can bring out the really heavy artillery, which is the machine learning techniques. These are less statistically based; they can handle exclusive-or type relationships and do nonlinear comparisons. They're very powerful, but they can also be subject to overtraining and can lead you down the garden path. So if you're doing the data analysis, this is the progression. You go from unsupervised methods, to get the natural clusters using no prior knowledge, just seeing what the data tells you on its own. The next step is to try supervised learning or supervised methods, using prior knowledge, which is fair if you know that this group is treated and that one is untreated, or sick and healthy, and see if you get that classification. But then you use Q-squared or permutation testing to see whether that classification is valid. And as I said, assessing statistical significance, especially if you've applied these supervised methods, is absolutely critical. You need to validate them, and validation, whether through the cross-validated Q-squared or through permutation testing, is what's needed. So the techniques in multivariate statistics are very powerful, and machine learning is very powerful. But they learn from experience the same way we do, they generalize from examples the same way we do, and they perform pattern recognition the same way we do.
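For the heavy artillery, here's a minimal sketch of one of those machine learning classifiers, a random forest with cross-validation and a built-in permutation test, using scikit-learn on made-up data. MetaboAnalyst offers similar methods, but this is only an illustration under those assumptions, not its actual implementation.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, permutation_test_score

rng = np.random.default_rng(3)
X = rng.normal(size=(60, 100))          # made-up metabolite concentrations
y = np.array([0] * 30 + [1] * 30)       # e.g. control vs treated

clf = RandomForestClassifier(n_estimators=200, random_state=0)

# Cross-validated accuracy: how well the classifier predicts held-out samples
acc = cross_val_score(clf, X, y, cv=5)
print(acc.mean())

# Permutation test on the classifier itself: the same idea as the PLS-DA validation
score, perm_scores, p_value = permutation_test_score(
    clf, X, y, cv=5, n_permutations=100, random_state=0)
print(score, p_value)

# Feature importances play a role loosely analogous to VIP scores
clf.fit(X, y)
top = np.argsort(clf.feature_importances_)[::-1][:10]
print(top)
```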
In some cases, they do it even better than we do. There's a tendency I've seen, and others have remarked on this, for many people to skip the PCA step and jump straight into supervised methods, which is not good. And when they do their PLS-DA and get a great separation, that's where they end; they don't do the validation. This tendency to forget the validation is a fault that still occurs widely in the field of metabolomics, and it still occurs in other omics fields as well. Unless you're working with a statistician who can help you with this, or unless you're listening to today's lecture, you'll fall into that trap. And as I say, a good rule of thumb is that if the separation isn't obvious just from what you're seeing, your impression, your intuition, then you're probably treading on thin ice. Statistics won't make things happen. Of course, you can trick it, or trick others, into making things appear to happen. But statistics is really intended to confirm, mathematically, what your intuition or impressions initially told you. OK, so that's it for our introduction to statistics. Any questions, any additional questions?