So what we've done, both this morning and yesterday, is focus on getting lists: peak lists, compound lists, metabolite concentration lists. We've looked at some of the databases and some of the tools that can help facilitate that. What we're going to concentrate on for the rest of the day is how to interpret those lists. So this is a sort of background on statistics. There are a couple of you who probably have far more statistical training than all of us combined, so you might fall asleep for this, but there are others who are still learning a lot about statistics, and hopefully this will give you some perspective, because whether it's proteomic data, metabolomic data, transcriptomic data, or genomic data, you're often having to deal with statistics, and most of the introductory statistics courses we took as undergrads don't really address that. So we're going to learn about distributions and significance. We're going to learn about univariate statistics; everyone has probably heard of a t-test and an ANOVA. We'll also learn about correlation and clustering, which are important aspects of all omic studies. And then we're going to get into multivariate statistics, which is the kind of statistics you do have to deal with when you're measuring many variables at once. Just some quotes about statistics. Benjamin Disraeli, the British Prime Minister, probably has the best-known quote about statistics, the one about "lies, damned lies, and statistics." There's another one, that 98% of statistics are made up, which is sometimes true. And then Aaron Levenstein's remark that what statistics reveal is suggestive, but what they hide is often vital. My feeling is that statistics is really formalizing impressions in mathematical terms. Sometimes these are impressions we haven't quite put into words, observations that our brains just process. Why does the toast you knock over land jelly side down? We have some inclination about the reason, and we may be able to formalize it in a mathematical context that justifies our observations. So the first thing to think about with statistics is distributions. What we'll largely focus on first is what we call univariate statistics, which is really about how to compare a population, or in essence two populations, using one variable. In this case we've got a population of people all lined up, men and women, and what's perhaps most obvious is that they all have different heights, and that's what we're going to focus on here. Univariate means single variable; it's a fancy term. So if we looked at that population, we could measure height. Another variable we could measure is weight. We could measure test scores or IQ scores. But it's one variable that we're measuring, nothing else; we choose one and we stick with it. If we were to plot height, or weight, or IQ from our population back here, and assuming it's a large population (large in statistics is a minimum of about 40, a maximum of millions), we get this. It's called a normal distribution, colloquially a bell curve, and it was formalized in mathematical terms by Carl Friedrich Gauss, hence the Gaussian curve. I think everyone has seen this shape and understands it. We're just counting the number of people with a certain height.
And if we're going from, say, 4 foot 11 to 6 foot 10, we'd have all of those heights marked off and count the number of people in our population at each height, over intervals of one inch. So the normal distribution, as I think everyone knows, is symmetric. It has a mean, or average, and the mean lies at the center. It has a width, and that width is characterized by the standard deviation, or sigma. It seems to be almost a rule of life: it persists in many observations of living systems, and of the physical and chemical world as well, so it's the most common type of distribution we know about. Whether the measurements are biological, chemical, or physical, the variation almost always follows a normal or near-normal distribution. Typically, the more measurements we get, the smoother the curve becomes and the more normal it becomes. It's a fascinating phenomenon; I don't know whether it's ever been formalized into a law like the laws of motion (the central limit theorem is probably the closest thing), but it seems to be something that persists. Typically, when we work in statistics, the rule of thumb, and it seems to be remarkably robust, is that we work with distributions or populations of a minimum of about 30 or 40. This is one reason why class sizes in schools typically have about 30 students: it's not only about how many students a teacher can historically handle, it's because we can get good statistics with those numbers, and it allows classes to be compared. It's also how many experimental studies are set up, and you'll see that most studies in metabolomics have 30 cases and 30 controls; the same is true of many proteomic and even transcriptomic experiments. Obviously more is better, but the minimal number of measurements to get that normal distribution is somewhere between 30 and 40. As I said, Carl Friedrich Gauss came up with a way of describing the normal distribution as a probability distribution, p(x). There's a normalization constant out front, and then an exponential function, e to the minus, essentially, x squared; sigma, again, is the standard deviation, a measure of the width of the curve. The integrated area under the curve within plus or minus one standard deviation is about 68% of the total; two standard deviations cover about 95% of the area. We have these intervals, and this is a function of the unique character of the e to the minus x squared function. From that Gaussian distribution we can calculate quantities we all learned in elementary school. The mean or average, summing all of the heights and dividing by the number of individuals, also gives the position of the center. The variance, which is the square of the standard deviation, is the summed squared difference between each population member and the mean, divided by the number of individuals in the population. The standard deviation is just the square root of the variance. We generally work with the standard deviation rather than the variance in most fields, and we usually quote errors using the standard deviation, not the variance. So here is that point about the area under the Gaussian curve: 68% lies within one standard deviation of the mean, which tells you there's about 16% of the area above that and about 16% below. Two standard deviations is the 95% cutoff.
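For reference, here is my reconstruction of the formulas just described, in standard notation (the symbols follow the usual conventions, not necessarily the slide's):

```latex
p(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}, \qquad
\mu = \frac{1}{N}\sum_{i=1}^{N} x_i, \qquad
\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2, \qquad
\sigma = \sqrt{\sigma^2}
```

With these definitions, about 68% of the area lies within one sigma of the mean and about 95% within two sigma.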
And this is something we're going to see in other forms when we talk about 0.05 as a cutoff for a t-test: that's the 95% confidence level we generally talk about. It's the line we draw in the sand when we say someone is getting an A or an A+, typically, that they are scoring two standard deviations above the mean. That's exceptional. And so we've chosen, arbitrarily, by consensus, that two standard deviations is the important threshold. It's not a law; unfortunately, a lot of people think it is. It's just something we've chosen by consensus as a community. You can use one standard deviation if you, or your community, agree that's important. You can use three standard deviations; in that case the cutoff is around 0.003. Again, it's up to you, and there's no law that says the value has to be 95%; by popularity, I guess, this is how we've chosen it. As we go up to three, four, five standard deviations, these are very, very exceptional. When you're talking about, say, basketball players, typically you're looking at people up around the fourth standard deviation in height. People with IQs over 180 are somewhere up in the fourth standard deviation as well. So this puts into words what I'm trying to explain in that figure. The probability of being more than one standard deviation away from the mean is about 32%. The probability of being more than two standard deviations away, whether larger or smaller, is about 5%. And being more than three standard deviations away, either larger or smaller, is about 0.3%, a very small probability. This is how we grade in universities, typically, when we talk about grading to a curve. The average is usually set at C for a large class of first-year students. Students who score one standard deviation above the average are usually assigned a B, those two standard deviations above get an A, and maybe three standard deviations above is an A+. Yes? [Question from the audience about how to relate these standard deviations to numbers or percentages of the population.] So, the question is how we can relate standard deviations to percentages of the population. Roughly, if we had a class of 400, about 250 would get a C; of those scoring above one standard deviation, roughly 60 would get a B, maybe 10 would get an A, and one would get an A+; and symmetrically, about 60 would get a D and 10 would fail. One sigma on either side of the mean covers roughly 68% of the population, leaving about 32% outside it. We also use in statistics this p-value, which I brought up before, and p stands for probability. It's the probability of getting a test statistic, a t-test or ANOVA statistic, or a score, or a set of events, a height, anything, at least as extreme as the one that was actually observed. Is that a highly probable event or a weakly probable event? A p of 0.05 says it might occur 5% of the time, which is not very common, so we reject the null hypothesis, the hypothesis that nothing is going on. We reject the null hypothesis when the p-value is less than a chosen, arbitrarily chosen, significance level, and that value is called alpha in statistics.
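To make those tail probabilities concrete, here's a small sketch (not from the lecture; it just recomputes the numbers above) using scipy:

```python
from scipy.stats import norm

# Two-sided probability of being more than k standard deviations from the mean
for k in (1, 2, 3):
    p = 2 * norm.sf(k)          # sf(k) = 1 - cdf(k), the upper-tail area
    print(f"> {k} sigma from the mean: {p:.3f}")   # ~0.317, ~0.046, ~0.003
```

These are the roughly 32%, 5%, and 0.3% figures quoted above.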
And as I said, the value we usually use by consensus is 0.05, but nothing says it has to be. You can choose something smaller; you can even choose something larger. You just have to argue why you think your choice is appropriate. Typically, if you have a p-value of 0.1, which is 10%, people say it is trending towards significance, and if you have a p-value of 0.01 or less, people say it's highly significant. When you reject the null hypothesis, which basically says nothing is happening, you are saying you've found something statistically significant or interesting. So take height again. The average height, if you combine men and women in North America and Europe, is about 5'7", and the standard deviation is about 5 inches, meaning that roughly 68% of people are between about 5'2" and 6'0". So then you can ask: what's the probability of finding a basketball player, a 6'10" person? You could also frame it differently: if you find someone who is above 6'10", are they even human? Because that is very, very far outside the observed standard deviation. And likewise, depending on the threshold alpha you've chosen, 0.05, 0.1, or 0.01, do they fall into the realm of ordinary humans or into the realm of basketball players? That's the kind of question you can ask, and if we know the mean and standard deviation of the distribution, we can actually get an answer. The same sort of thing goes for flipping coins. If you flipped a coin 20 times and, instead of getting 10 heads and 10 tails, you got 14 heads and 6 tails, is it a fair coin? You can calculate the probability of that occurrence, and it would happen about 6% of the time. So, depending on your threshold: with an alpha cutoff of 0.05 the answer is that it is a fair coin; with an alpha cutoff of 0.1 you might say it is not a fair coin. Again, it's an arbitrary threshold, and sometimes things cut very close, and by looking at those trends you can sometimes add a commentary to your decision. So it's not hard and fast, but it is a convention that we, and the community, use a lot. Now, there are also distributions that are not Gaussian, not normal, called skewed distributions. In that case we have to use some different terminology, and the meaning of the mean or average changes as well. We have terms like mode, median, and mean. In a normal distribution the mode, median, and mean are all the same, because it's a symmetric distribution. But in skewed distributions, like Poisson distributions or extreme value distributions, we have to worry about the difference. The mean, the average, can be strongly affected by extreme values if you have a highly skewed or strange distribution. The mode is the most common value, the one with the highest frequency. The median is the middle-most value, with half the population below it and half above. So in this distribution, the most common value, the one with the highest frequency, is here; the mean, because it has been pulled by these extreme values, is shifted away from that peak, to the right; and the median sits roughly halfway between the mode and the mean.
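As a quick illustration of how those three measures separate in a skewed distribution, here's a small sketch with made-up data (a log-normal sample, which is right-skewed):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=0.8, size=10_000)   # a right-skewed sample

mean = x.mean()
median = np.median(x)
# approximate the mode from a histogram, since the data are continuous
counts, edges = np.histogram(x, bins=100)
mode = 0.5 * (edges[counts.argmax()] + edges[counts.argmax() + 1])

print(f"mode ~ {mode:.2f}  <  median ~ {median:.2f}  <  mean ~ {mean:.2f}")
```

For a symmetric, normal sample the three numbers would essentially coincide.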
Most distributions are what we call unimodal, with one maximum; but this one is a bimodal distribution, with two maxima. You can have trimodal distributions, or sums of distributions. These exist, and statisticians worry about them; we'd probably see this sort of bimodal distribution if we plotted the measured heights of men and women together. There are other distributions. The binomial distribution, which taken to its extreme actually becomes the Gaussian distribution, is the one we use when we calculate the probability of flipping coins. The Poisson distribution is a limiting form of the binomial distribution, for when probabilities are very low, and it's typically used in understanding radioactive decay. Then there's the extreme value distribution. It describes a phenomenon that happens as you go from elementary school to high school to university to graduate school and beyond: each time you're taking, say, the upper 60 or 70% of performers, then the upper 40%, and finally, by the time you're in grad school, the upper 5% of the class. Unfortunately, we assume that as we keep winnowing these groups down they should still follow a normal distribution, when in fact they follow an extreme value distribution, a very skewed distribution, and that's perhaps the statistic we should be using to grade people in graduate courses. There are also other skewed or exponential distributions, so the Gaussian is not the only distribution out there. The binomial distribution, in the original way people discovered it, comes from a polynomial: if you expand 1 plus x to the nth power, you get a set of coefficients. (1 + x) to the zeroth power gives you 1; to the first power it gives two coefficients of 1; squared, it gives x squared plus 2x plus 1; cubed, it gives x cubed plus 3x squared plus 3x plus 1; and so on. If you take those coefficients you get Pascal's triangle, and these can be used to generate the distribution; as n climbs into the hundreds, the curve gets very smooth and starts to look a lot like a Gaussian distribution. The Poisson distribution, as I said, is a limiting form of the binomial. When the value of mu is small, you get something that looks almost like an exponential decay; as mu gets progressively larger, it looks more and more like a Gaussian curve. So the shape is a function of both mu and the x value.
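To tie this back to the coin example, here's a short sketch (my own numbers, not from the slides) that uses the binomial distribution to get the roughly 6% figure for 14 heads out of 20 flips, plus a Poisson probability for comparison:

```python
from scipy import stats

# Probability of seeing 14 or more heads in 20 flips of a fair coin
p_extreme = stats.binom.sf(13, n=20, p=0.5)           # upper tail, P(X >= 14)
print(f"P(>= 14 heads out of 20) = {p_extreme:.3f}")  # ~0.058, about 6%

# The same question phrased as a hypothesis test (needs scipy >= 1.7)
res = stats.binomtest(14, n=20, p=0.5, alternative='greater')
print(f"binomial test p-value = {res.pvalue:.3f}")

# A Poisson example: if a source averages 2 radioactive decays per second,
# the chance of observing exactly 5 decays in one second is
print(f"Poisson P(k=5 | mu=2) = {stats.poisson.pmf(5, mu=2):.3f}")
```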
As I said, the extreme value distribution is highly relevant, particularly in sequence comparison: it's the basis of the statistics in BLAST, and it's the basis of the statistics in Mascot for proteomics. It comes from this idea of always sampling from the extreme. It's the same with hockey players: if from each division you always choose the best, by the time you're in the NHL these are the extremes of the extremes, and you can't grade them against a normal distribution; it's a skewed distribution, and it arises from very selective sampling. The same thing happens with sequences: when we compute sequence matches and sequence scores, we are essentially sampling from a skewed distribution, mostly sampling from the right-hand side. This is what an extreme value distribution looks like: it tails off a little at one end, but then it has these long outliers, a long tail, so it's a little more like a Poisson-type distribution. When distributions are not normal, a lot of the statistics we like to use break down, so skewed distributions are the bane of statisticians; but in many cases, especially in metabolomics, we tend to get these kinds of skewed distributions. You can change a skewed distribution with a trick called normalization or transformation, specifically the log transformation. What it does is bring the outliers on the extreme right a lot closer in, and in effect it can make an extreme value or exponential-type distribution much more Gaussian. Here's a fairly skewed distribution: if you look at the turquoise bars, you can see it spikes up a little here and then just tails off. Take the log of all of those values, and what you get is an almost perfect Gaussian or normal distribution. That's called a log transform, and once you've done it you've helped yourself a lot, because you can then start using a lot of the statistical rules and techniques that were developed about a hundred years ago for making comparisons. Here's something a little more extreme, an exponential decay, and another one, each shown with its log transform; it's not perfect, but the log-transformed distributions now look somewhat more Gaussian, and therefore better statistics can be calculated from them. Okay, so those are distributions; the idea that log transformations matter is a response to the challenge of working with skewed distributions, and that's another thing we have to remember.
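Here's a minimal sketch of that log-transform trick (invented data, just to show the idea): a strongly right-skewed sample becomes much closer to normal after taking logs.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
raw = rng.lognormal(mean=2.0, sigma=1.0, size=200)   # skewed "metabolite intensities"

logged = np.log(raw)                                  # the log transform

# Skewness near 0 and a non-significant Shapiro-Wilk p-value suggest near-normality
print(f"skewness before: {stats.skew(raw):.2f}, after: {stats.skew(logged):.2f}")
print(f"Shapiro-Wilk p before: {stats.shapiro(raw).pvalue:.4f}, "
      f"after: {stats.shapiro(logged).pvalue:.4f}")
```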
So far we've looked at univariate measures, height or test scores or the like. In statistics we usually want to distinguish between two or more populations: we want to know whether something is different from something else, cases versus controls, sick versus healthy, whatever. So here's our hypothetical test. We've got some normal people, and then we've got a whole bunch of leprechauns, all coloured in green. Leprechauns, if you haven't heard of them, tend to be short, and they're found in Ireland. What we can do is plot the two population curves, one for the normals and one for the leprechauns: leprechauns in green, normals in a light turquoise blue. Now we've got their height distributions, and the question we're asked is: are they really different? This is where, as I say, statistics is the math of impressions. Most of us could look at this and say, yep, normal people are taller than leprechauns, but a statistician would ask: can you prove it? So this is our first step: we plot up the distributions and then we do a statistical test. Now, another challenge might be that we come across a tall tribe of leprechauns. Then we'd say, okay, how about these two distributions, the normals and this tall group of leprechauns? We plot them out, and the question becomes a little harder; we can't really trust our intuition anymore, and we're asking: are the two populations really different? This is what statistics can help with, and the approach that was developed about a hundred years ago is Student's t-test. It's basically used to determine whether two populations, normals and abnormals, leprechauns and regular people, healthy and sick, are different. We won't go through the calculation of the t-test statistic, because it's done automatically these days; you can do it on the web, you can do it on a calculator. The bottom line is that the t-test gives you a p-value, and given a threshold alpha that you have arbitrarily chosen, you can make an assertion that people will generally agree on. So if the p-value for our test statistic here was 0.4 and our cutoff was 0.05, we would say that these two populations are not different. If, on the other hand, we ran the t-test and got a p-value of 0.04, maybe for this comparison, and our cutoff was 0.05, then we could say with statistical confidence that those two populations are different. That frames it in mathematical, essentially quantitative, terms: based on these statistics, this pair of populations is not different and this pair is. There are different types of t-tests, called paired and unpaired. Paired t-tests are typically used in before-and-after experiments, while unpaired t-tests are for two independently chosen populations; so ideally, for leprechauns versus normals, we would use an unpaired t-test. You can also use a t-test to determine whether two clusters are different, particularly if the clusters have roughly normal distributions. Imagine we've plotted things on x and y axes, it could even be a PCA plot if you wanted: here's one cluster, here's another cluster, and these represent populations, so in principle you can use a Student's t-test on the variables describing those distributions.
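A quick sketch of how that looks in practice (invented heights, just for illustration): an unpaired t-test for two independent groups and a paired t-test for a before/after design.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Unpaired: two independently sampled populations (e.g., normals vs. leprechauns)
normals = rng.normal(loc=170, scale=8, size=40)       # heights in cm
leprechauns = rng.normal(loc=120, scale=8, size=40)
t, p = stats.ttest_ind(normals, leprechauns)
print(f"unpaired t-test: t = {t:.2f}, p = {p:.3g}")   # p << 0.05: populations differ

# Paired: the same subjects measured before and after a treatment
before = rng.normal(loc=5.0, scale=1.0, size=30)
after = before + rng.normal(loc=0.4, scale=0.5, size=30)
t, p = stats.ttest_rel(before, after)
print(f"paired t-test:   t = {t:.2f}, p = {p:.3g}")
```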
Now, one of the challenges is: what if the distributions are not normal? As I said, we could use the log transform to convert things so that they look normal, but even if we tried a log transform on this population and compared it to this one, it would still be pretty grim, and we wouldn't be able to trust some of the numbers. That's why we use a test called the Mann-Whitney U test, also known as the Wilcoxon rank-sum test. Has anyone ever heard of these before? One, two... anyway. This is actually a more robust test than Student's t-test for non-normally distributed populations. Essentially, what you're comparing is not the means but the medians. The t-test deals with means, because you're dealing with Gaussian distributions; here, because the distributions are so messed up, we look at medians, and the test uses ranking to sort out some of the problems. When the populations are Gaussian, the Mann-Whitney U test essentially becomes a t-test; it converges, and that's been shown mathematically. Where before we had the t statistic, this one is called the U statistic, and if the U test, which you can run in R or in various programs online, comes in under the same thresholds, you can make that same distinction. Again, the choice of your alpha is the decision point; it's your threshold or cutoff. So that's univariate statistics with two populations. Now, what about univariate statistics with three populations? Say we're looking at normals, leprechauns, and elves, which are really, really short. How do we distinguish between those? We could look at these three different populations, the elves, leprechauns, and normals, and ask whether they are truly different. Same sort of question if we have tall tribes of elves and leprechauns: would we know if they're different? Again, that's a challenge; in some cases it's obvious, in some cases it isn't, and this is why we turn to statistics. The statistic we use here is not the t-test or the Mann-Whitney U test; it's ANOVA, analysis of variance, which is for three or more populations. It's a generalization of the t-test (there's a similar generalization for non-normal data along Mann-Whitney lines). Essentially you're looking at group variances, and you can determine whether several groups are all equal; that's the null hypothesis. There are one-way, two-way, three-way, N-way ANOVA tests. The one-way ANOVA just determines whether any one of the three or more populations is different; it's not doing a pairwise comparison. Just as there are paired and unpaired t-tests, there are one-way, two-way, and three-way ANOVAs. Remember, the most common clinical study and the most common biological study is typically controls versus cases, which is why we usually do t-tests; but there are many studies with three or four populations, and when you get to that, you have to go to ANOVA. ANOVA can also be used on clusters, just like the t-test, as long as those clusters are normally or roughly normally distributed. [Question: if I did t-tests between each pair, would I get the same results as ANOVA?] I think so, although I don't know the robust statistics behind that, and typically, if you tried it, people might ask why you didn't just do ANOVA; I haven't tried it to see how different it is. This is also where the one-way versus two-way distinction matters: the one-way ANOVA can only tell you that at least one group among the three is different; working out which groups differ takes a further analysis.
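For completeness, a small sketch of both tests in scipy (made-up numbers): the Mann-Whitney U test for two skewed groups, and a one-way ANOVA for three groups.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# Mann-Whitney U: two groups with skewed (non-normal) values
group_a = rng.exponential(scale=2.0, size=40)
group_b = rng.exponential(scale=3.5, size=40)
u, p = stats.mannwhitneyu(group_a, group_b)
print(f"Mann-Whitney U: U = {u:.0f}, p = {p:.3g}")

# One-way ANOVA: are normals, leprechauns, and elves all the same height?
normals = rng.normal(170, 8, size=30)
leprechauns = rng.normal(120, 8, size=30)
elves = rng.normal(90, 8, size=30)
f, p = stats.f_oneway(normals, leprechauns, elves)
print(f"one-way ANOVA:  F = {f:.1f}, p = {p:.3g}")   # tiny p: at least one group differs
```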
Like Jeff was pointing out, you could do a post-hoc t-test: once ANOVA has told you that something is different, you could then use t-tests to work out which groups differ and perhaps by how much. And with ANOVA, the world has changed because we have computers and free software for doing the statistics. It used to be really challenging when you had to crunch it out by hand with calculators, and people would come up with quick fixes to get answers; but there's no point saying "I'll only do a t-test, I only have the capability of doing a t-test." The software is there; you can get ANOVA in any cheap package now. The next issue arises when we're looking at large numbers of variables: a hundred microarray tests, thousands of genes, hundreds of thousands of proteins or metabolites. This is the issue of false discovery and the false discovery rate. It might be that you're running a hundred different t-tests: gene X for cases versus controls, gene Y for cases versus controls, gene Z, and so on down your list of genes or metabolites. You've done your t-tests and you've got p-values reported for cases versus controls, and let's say you've got 20 results out of that list of hundreds of metabolites, all with a p-value of 0.05. The question is: are all of those statistically significant? You can roughly estimate that one of those results, one of those metabolites or genes or proteins, will likely be a false positive: multiply the number of results at that significance level, 20, by the cutoff, 0.05, and that gives you one. One way of absolutely ensuring that none of these findings are false is to take that p-value threshold and divide it by the number of results that would translate to one false discovery: divide 0.05 by 20, and use a cutoff of 0.0025 instead of 0.05. If you use that cutoff, you can essentially guarantee that none of these are going to be false discoveries, false positives. That's a trick some people use, and it's a very extreme trick, because it might be that if you chose that cutoff, none of your values would be considered significant anymore, and you'd have lost what was probably some really useful data just to get rid of that one false positive. So there's still a lot of debate in the community about what sort of correction, the Bonferroni correction, a false discovery rate correction, or other variations, should be used so that you can still capture useful information while dealing with the possibility of a false positive. Here's an example with probabilities on weather, like what we had yesterday. Here's our forecaster coming on with his probabilities of whether it's going to be rainy, sunny, foggy, cloudy, snowy, windy, whether there'll be hurricanes or thunder or lightning or hail or sleet. These are all probabilities that might be published for today's weather, and we can be 100% certain that, whether it's today or tomorrow, something will happen; one of these twelve or thirteen things will occur. But only one of them would actually come out as significant using that multiple-testing criterion: with around 14 of them, if we used 0.05 and divided by 14, none of these would be significant in terms of the p-value, except perhaps the eclipse, using the Bonferroni correction criterion. As I say, this kind of correction is used a lot in microarray work; I don't think it needs to be applied as heavily in metabolomics.
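Here's a minimal sketch (hypothetical p-values) of the two correction strategies mentioned, the Bonferroni cutoff and a Benjamini-Hochberg false discovery rate correction, using statsmodels:

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

# Imagine 100 metabolite t-tests: a handful of small p-values, the rest uniform noise
rng = np.random.default_rng(4)
pvals = np.concatenate([rng.uniform(0.0005, 0.01, size=5), rng.uniform(0, 1, size=95)])

alpha = 0.05
reject_bonf, _, _, _ = multipletests(pvals, alpha=alpha, method='bonferroni')
reject_fdr, _, _, _ = multipletests(pvals, alpha=alpha, method='fdr_bh')

print(f"raw p < 0.05:        {(pvals < alpha).sum()} hits")
print(f"Bonferroni:          {reject_bonf.sum()} hits")   # very conservative
print(f"Benjamini-Hochberg:  {reject_fdr.sum()} hits")    # keeps more of the true signal
```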
Normalization and scaling: this is another thing we need to talk about, and it's critical for distributions. Typically this is about systematic error. Imagine something was miscalibrated: we start with two populations that, to our eyes, are essentially identical in height, but because our ruler is miscalibrated, we end up with something like this. Clearly someone has made a mistake; how do you deal with that mistake? It happens a lot in metabolomics, transcriptomics, and proteomics, and we try to identify those systematic or scaling errors. That was essentially what we were doing in peak alignment: we have scaling errors because things come off at different retention times. So another term, alongside transformation, is normalization, and it's about adjusting for that systematic bias in your measurement tool. If we normalize, or adjust, or rescale (the terminology is sometimes confusing), we end up with things that are mostly close, and that would probably tell us the two populations are statistically identical. As I said, normalization is an unfortunate term, because it can also mean making things look like a normal distribution, and it can mean scaling. The point is that for ANOVA, for the t-test, and for a lot of the other statistics, including PCA and PLS-DA, everything we do assumes that we're looking at a Gaussian, or normal, distribution. We've talked about log transforms for making things look like a Gaussian distribution, so that's another meaning of normalization, and I don't think anyone has really sorted out that confusing terminology. So I'm going to shift now and talk about data comparisons and data dependencies in statistics. Many studies in metabolomics, and in biology generally, are case-control studies, or before-and-after studies with some treatment or intervention. We might measure one variable against another. We might also try to assess how an observed property matches a predicted property, which is another type of comparison we do. In all of these we're measuring lots of samples, which you might call biological replicates, sometimes technical replicates, and we're looking at a population. When we make those kinds of data comparisons, before versus after, predicted versus observed, we typically plot things on a scatter plot. Here's a scatter plot of wife's age against husband's age; it's real data, and you can see a trend, I think most of us can. If there's a dependency between two variables, or between predicted and observed, or before and after, you'll get patterns, and the pattern we're looking for is called a correlation. What we saw between husband's age and wife's age is a positive correlation, but you can also have a negative correlation, where the slope of the line is negative, and you can have things that are totally uncorrelated, for instance the price of gas and the distance of Halley's comet from the Earth, two unrelated things, which we'd find are not correlated. You can also quantify correlation, especially when it's visible. Here's a high correlation, something many people would be quite happy to use for a prediction or a trend; here's a low correlation, which people in epidemiology might still be happy with; and then the perfect correlation, which never happens in biology, is essentially 1.00; you might see that in physics.
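As a sketch of what "scaling" can mean in practice (this is one common recipe, not necessarily the one on the slides): sample-wise median normalization to remove a systematic bias, followed by variable-wise autoscaling.

```python
import numpy as np

rng = np.random.default_rng(5)
# rows = samples, columns = metabolite intensities; one sample has a systematic bias
data = rng.lognormal(mean=3.0, sigma=0.5, size=(6, 10))
data[2] *= 2.5                                  # e.g., 2.5x too much sample was injected

# 1) sample-wise normalization: divide each sample by its median intensity
normalized = data / np.median(data, axis=1, keepdims=True)

# 2) variable-wise autoscaling: mean-center and divide by the standard deviation
scaled = (normalized - normalized.mean(axis=0)) / normalized.std(axis=0)

print(np.round(scaled.mean(axis=0), 2))   # each column now has mean ~0
print(np.round(scaled.std(axis=0), 2))    # and standard deviation ~1
```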
To quantify correlation, we calculate the Pearson correlation coefficient, and this is r, not r squared. It's unfortunate that Excel has confused the world, because it only quotes r squared. The Pearson correlation coefficient, a bit like the standard deviation being the square root of the variance, is essentially a normalized measure of how much the x and y variables co-vary about their means. The correlation coefficient ranges from -1 to +1 (so from 0 to 1 in magnitude): here's a high correlation, here's a modest correlation, here's a perfect correlation. It gives us a number, it's quantitative, and that's important. We call it the correlation coefficient, sometimes the linear correlation or the Pearson product-moment correlation coefficient, and we use it in a lot of situations: predictions, comparisons, simulations. Then there's the thing Excel gives you, the coefficient of determination, which is r squared. You shouldn't confuse them; unfortunately, much of the world does, and now calls r squared a correlation coefficient, which it is not. So hopefully I can at least get a few of you to stop using Excel's messed-up version of correlation: take the square root, that's the real correlation coefficient. One challenge is the significance of a correlation. Here's a plot of real data with a correlation; here's another plot, with only three points, that gets a correlation coefficient of 0.99. My question to you: is this correlation significant? Who thinks it is? Who thinks it's not? Who's neutral? It's a classic case of "lies, damned lies, and statistics", where people work with a very small number of points; you can add just two more points to a small data set and suddenly you've gone from a perfect correlation to essentially no correlation. So sampling is important. There's also another trick people use, and I can't remember if I included this slide: you collect a bunch of points that are basically random values down here, and then a couple of values way up at the extreme that are also essentially random, and you can get an almost perfect correlation coefficient out of that. It may be a limitation of the study design, but it's usually a hint that this is not actually a great correlation; really what you should be doing is computing the correlation within this cluster and within that cluster, and you'd typically find it falls quite close to zero. So that's one trick people use; the other, as I said, is using just a small number of good points and blissfully ignoring the others. Those are things you need to look for when you read studies, and they're more common than you might think. The t-test can also be used to measure the significance of a correlation: it determines whether the slope is zero or not. The null hypothesis is that the slope is zero, no correlation; if it isn't zero, the correlation is perhaps significant. So in better statistical reporting you'll find a p-value that says whether your correlation is significant or not.
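Here's a small sketch (synthetic data) of the r versus r-squared point and of how the number of points affects the significance of a correlation:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)

# A reasonably sized sample: report r (not r squared) together with its p-value
x = rng.uniform(20, 60, size=50)                    # e.g., husband's age
y = x + rng.normal(0, 3, size=50)                   # wife's age, correlated plus noise
r, p = stats.pearsonr(x, y)
print(f"n=50: r = {r:.2f}, r^2 = {r**2:.2f}, p = {p:.2g}")

# Three nearly collinear points: r looks almost perfect, but the p-value is weak
x3 = np.array([1.0, 2.0, 3.0])
y3 = np.array([1.0, 2.1, 2.9])
r3, p3 = stats.pearsonr(x3, y3)
print(f"n=3:  r = {r3:.2f}, p = {p3:.2g}")          # high r, unconvincing p
```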
Sometimes we plot data, get a nice correlation, and then we see this: an outlier. Outliers are handled differently by different people. Some people, if they see one, take their pencil out and erase it; that is, technically, unethical. What you're supposed to do is repeat the experiment and see whether that point is really there. If you repeat it, and repeat it again, and it keeps showing up, that actually says you may have found something very significant. Obviously a bad point can destroy a correlation, and it may destroy your theory, but it is something important. In some cases we actually want to look for outliers, so, as I say, they can be both good and bad. If you're trying to do a prediction and model your data, you don't like outliers; if an outlier is there, it tells you to repeat your experiment, because perhaps there's a measurement error. But if you're plotting metabolite concentrations, or looking at gene or protein expression levels, you actually like to see outliers: they say something is being perturbed by the treatment or the condition, and those outliers are telling you something significant is happening. That's how we analyse microarray or RNA-seq data. So that's correlation, and now I'll shift to clusters. Here we're looking at the correlation between height and weight in a scatter plot; these might be individual humans, or dogs, or cats, and you see this sort of trend. You can calculate a correlation coefficient from the scatter and say there's a relationship between height and weight, which I think most of us would expect. But if we didn't simply do the correlation and instead started thinking about clustering, we might see something a little more compelling: we might find that, for this particular set of animals, all the males cluster in one group and all the females cluster in another, and the clustering should have been obvious just by eye. So yes, there is a relationship between height and weight, but the clustering is probably more important, and that's why clustering sometimes tells you more than the well-known relationship between two variables. Clustering, in addition to correlation, is used in lots of areas of bioinformatics and cheminformatics. We use it in metabolomics, in microarray and gene expression analysis, in proteomics, in protein interaction analysis; it's used in phylogeny and evolution, in structural classification of proteins, and in sequence analysis. It's used everywhere. Clustering has a clear definition that many people confuse with classification. Clustering is grouping objects that are logically similar; the logic is your choice, and you can choose a threshold to decide when things are similar. You might choose to cluster socks because they look similar, and you usually have to pair socks of the same colour and size; but rocks and shoes don't logically cluster together, so why cluster them? You have to define a criterion, you usually have rules, and we'll talk about them now. The important point is that clustering is different from classification. In classification we know what the objects are; they're labelled. In clustering we don't necessarily know what the objects are, so it's naive, blind to some extent with respect to the data. Typically, if you see clusters starting to form, then you can start classifying things; if clustering isn't producing any clusters, it's hard to do classification. Clustering is a kind of test of whether classification is appropriate. To cluster, you have to have some measure of similarity: it could be a distance matrix, a similarity matrix, a dissimilarity coefficient; different people use different terms. You also have to have a threshold or cutoff, so we could use p-values or t-statistics if you want.
You also need a way of measuring the distance between two clusters, and typically, to start clustering, you need a starting object that begins the clustering process. There are different clustering algorithms. There's the k-means method, where you divide things up and create a set of clusters with no overlap. There's hierarchical clustering, which is more commonly used and produces nested clusters; you see this commonly in gene expression studies. And there are self-organizing feature maps, where things are essentially trained to cluster. K-means is the one where you take an object and it serves as the centre, or centroid, of the first cluster; then you grab another object at random. It might be a gene, but let's stick with socks: you dive into the drawer, pull out a sock, and that's your centroid; then you go in, find another sock, and hold it up against the one you grabbed to see whether it matches by some measure. Some of us are colour blind, so we'd match them differently than others, but it depends on the criterion: you use that similarity with some threshold, each of us may have a different threshold, and you repeat the process. If you've sorted socks, as many of us have, you will generate many clusters of pairs of socks. In this case we can think of trying to cluster billiard balls of different colours. We might choose colour as the criterion, perhaps a very precisely defined colour measured with an absorbance meter at some particular wavelength. We choose one ball, which defines a centroid, and then we look at another one: if the first one we try doesn't match, we don't join it; the next one does, so they're joined; then we repeat, asking whether we can join another one. If not, we start that one as a new cluster and see if something else joins it, and we end up with three independent clusters based on colour. Hierarchical clustering is a little different. You find the two closest objects and merge them into one cluster, then you find and merge the next closest objects using some similarity measure. You start by building a bunch of tiny clusters, and you end up creating one massive cluster in which everything is connected. So in clustering we're measuring distances: here's the same set of billiard balls, and we compare every one of them, doing a pairwise comparison initially. This is similar to sequence alignment, where we do pairwise comparisons across all sequences and then match the closest, then the next closest, and the next, and based on the sequence distance we draw lines showing how far apart they are, creating a hierarchical clustering map: the light and pale blues group together, the dark blues group together, and the orange sits off on its own. This is also how we cluster to produce a heat map for gene expression, where we're looking for trends. We can see one block that is mostly dark green, several hundred genes, and that's one cluster; then a block that is mostly red, which is a totally different cluster from the green ones. So this cluster is B, this cluster is A, and then we can refine things further into smaller sub-groups.
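As a sketch of the two approaches just described (toy data; the parameter choices are mine, not from the lecture): k-means with scikit-learn and hierarchical clustering with scipy.

```python
import numpy as np
from sklearn.cluster import KMeans
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(7)
# Toy data: three groups of "billiard balls" described by two colour coordinates
balls = np.vstack([
    rng.normal([0, 0], 0.3, size=(10, 2)),
    rng.normal([3, 0], 0.3, size=(10, 2)),
    rng.normal([0, 3], 0.3, size=(10, 2)),
])

# k-means: partitions the objects into k non-overlapping clusters around centroids
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(balls)
print("k-means labels:      ", km.labels_)

# hierarchical clustering: successively merge the closest objects/clusters,
# then cut the resulting tree into three groups
tree = linkage(balls, method='average')
print("hierarchical labels: ", fcluster(tree, t=3, criterion='maxclust'))
```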
So that's hierarchical clustering, and the linkage connections are what measure the distances. We've now talked about correlation, clustering, and classification; we've looked at the different clustering methods; and we've talked about distributions, univariate statistics, t statistics, ANOVA, and Mann-Whitney U tests. All of that is the foundation for what we're mostly concerned with when we look at metabolomics, genomics, transcriptomics, and proteomics. We could take the same population as before, but now, instead of just measuring height or weight, we measure height and weight and hair colour and eye colour and clothing colour, so we have multiple parameters for each individual in our population. It's the same thing if we think of metabolites or genes: everyone has many different genes, everyone has many different metabolites at different concentrations, and now we're trying to compare, group, assign, or distinguish members of our population. Once you have more than one variable, you are in the realm of multivariate statistics, and it's more complicated. Whether it's metabolomics, proteomics, or genomics, we typically carry out experiments with both technical replicates, repeating measurements multiple times, and biological replicates, which are essentially our populations. That's why we'd say we have healthy controls: those are our biological replicates; they're not identical twins or one large family, but they are the healthy individuals we've measured. Healthy mice, and then the biological replicate group of sick mice. There may be multiple treatments as well, so we could be looking at multiple populations, two or three or four, and then we get into MANOVA, multivariate ANOVA. Typically, whether it's proteomics, genomics, transcriptomics, or metabolomics, we measure many compounds, many variables, at once. So like all the other omics, metabolomics is multivariate. The trick to make it simple is to try to convert all that multivariate data into something closer to univariate data; if you can do that, you can start applying t-tests and p-values and ANOVA. One trick for this is dimension reduction. You've got hundreds of variables, all those metabolites you've just measured on your mass spec or NMR, and the dimension reduction technique is called principal component analysis. It transforms potentially, or probably, correlated variables into a smaller number of uncorrelated variables, and those are the principal components. Technically it can reduce thousands of metabolites, genes, or proteins into two or three features: vectors, combinations of metabolites, usually linear, sometimes even nonlinear combinations. That's the key; it shifts you from an incredibly complicated multivariate problem to one that is mostly univariate, or at least a simple multivariate one. So we can have spectra with hundreds of peaks, really complicated data sets, and here we've reduced them to two dimensions and three clusters, and at that level we could actually use ANOVA, or even t-tests if we wanted. Again, this highlights the point that statistics is the mathematics of impressions: you don't really have to be an NMR expert.
If this is what the spectrum looks like for most of the controls, and this is what it looks like for the affected samples, I don't have to be a genius to say that they're different, and the same goes from this one to that one. So what PCA did here was essentially the obvious: it grouped things separately. As a rule, if you can't see it, PCA won't help. So what is PCA? One way of thinking about it: instead of this picture, imagine we're looking at a donut or a bagel, and this is how you do principal component analysis of a bagel. It's suspended in space, hanging from a thread, and you use flashlights; the idea is that you're doing a projection. This is a three-dimensional object, but you're trying to project it onto two dimensions. If you shine your flashlight from this direction, the projection makes the bagel look like a wiener or a sausage. The other projection, shining from this direction, makes it look like an O. The O projection captures most of the variation; in statistical terms, that direction is the largest eigenvector, and it's the first principal component. If I showed you that projection and told you it's a food item, most of you would say donut or bagel. If I showed you the other projection and told you it's a food item, all of you would say sausage or wiener, and you'd all get it wrong. So one projection tells you more information. What the second projection, principal component 2, does tell you is that it's not a flat piece of paper, it has thickness; so it gives you at least some information, but not as much as the first principal component. Principal component analysis is really looking at variation, or covariance: you're doing what's called a singular value decomposition of the covariance matrix. So you're looking at covariance and correlation; it's an orthogonal linear transform, in mathematical terms, and it essentially moves things into a new coordinate system in which the greatest variance comes to lie along the first coordinate, like shining the light on the bagel in the direction that captures the greatest variance, the most information. You have samples and you have variables, and that's your data matrix; what you reduce it to is a set of scores along the eigenvectors, which are your uncorrelated principal components, plus a set of loadings, or weightings, which are the coefficients that weight the original variables, the x's. So you get loadings and scores, and in PCA you have loadings plots and scores plots that present the information. Here's an example that's sometimes instructive for PCA, from Roy Goodacre. He was given a data set for all of the airports in the US, about 5,000 airports, little ones and big ones, with the latitude, the longitude, and the altitude of each, and he was asked to do a PCA analysis to see whether airports cluster in certain locations, whether they're mostly high or mostly low, who knows what he was going to find. At least it's a multivariate problem, three variables rather than two, and you want to see whether there's a relationship between altitude and latitude, or longitude and altitude, or something else. So the question is: what would you expect? I think all of you have probably got the notes out, and you'll see the PCA that was done.
If you know a little geography, what his principal component analysis did was basically generate a map of the US: here's the continental US, there's Florida, there's Texas, there's California, there's the east coast; here's Alaska, here's Hawaii, and I think that's Puerto Rico. These are the principal components, and all it came out with was basically latitude and longitude. What this tells you is that principal component analysis is, in effect, a clustering algorithm; it's closely related to k-means clustering, and that relationship has been worked out mathematically. He did also plot the relationship between altitude and latitude, and it produces something bizarre, no sense to be made of it; but the map was a compelling result, and it reiterates this relationship between principal component analysis and clustering. So that's really what you're doing mathematically with PCA: it's fancy, you've got scores plots and eigenvalues and lots of cool terms, but this is really what PCA is. Once you've done PCA, you've got clusters that are largely normally distributed (although in the case of the US map, not really), and you have means and variances in PCA space, so it's actually possible to use things like t-tests and ANOVA to determine whether those clusters are significantly distinct and separate or not, because now we're working in two dimensions. You can do this: you can ask whether these clusters are distinct, whether they are different, and assuming they aren't shaped like the US but are somewhat normal, you can make those assessments. As I said, in standard PCA nomenclature, PCA generates two types of plot: the scores plot and the loadings plot. Most of us tend to look at the scores plot. Here are the three dimensions, and you can see a set of three clusters, coloured: a green cluster, a blue cluster, and a red cluster. The axes are the principal components: x, y, and z are PC1, PC2, and PC3. The other type of plot, which isn't shown as often, is the loadings plot; it shows how much each of the original variables, the metabolites, contributes to the different principal components. Remember, we've condensed all of those variables into a couple of components, and they're summed together in a weighted way: p1 x1 plus p2 x2 plus p3 x3. The variables out at the extremes, in the corners, typically contribute most to the separation we see in the scores plot; the stuff in the middle is not very important. So it's the variables at the extremes that typically tell you what's driving or pushing those scores plots apart. In some cases PCA won't succeed in identifying clear clusters or groups, and if that's not happening, you just have to accept it: if things are not falling apart into groups, it probably isn't going to happen, and if you can't get even a modest separation using PCA, it's probably not worthwhile trying other statistical techniques to force them apart. The tough thing, obviously, is that if you've done a lot of work and you're not seeing any separation at all, it's hard to walk away, and most people don't; they keep trying and trying to make it separate. It's sometimes a challenge to know when there's a genuine hint of separation, and as a rule most people aren't very good at deciding when to quit and when to say, let's try a little more.
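A minimal sketch of PCA on a small made-up data matrix (scikit-learn; the variable names and numbers are mine): the transformed coordinates are the scores, and the components are the loadings.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(8)
# Toy data matrix: 30 samples x 8 "metabolites", with two underlying groups
controls = rng.normal(0, 1, size=(15, 8))
cases = rng.normal(0, 1, size=(15, 8))
cases[:, :2] += 3                            # the first two metabolites drive the difference
X = np.vstack([controls, cases])
X = (X - X.mean(axis=0)) / X.std(axis=0)     # autoscale before PCA

pca = PCA(n_components=3)
scores = pca.fit_transform(X)                # coordinates of each sample: the scores plot
loadings = pca.components_                   # weight of each metabolite on each PC: loadings plot

print("explained variance ratio:", np.round(pca.explained_variance_ratio_, 2))
print("PC1 loadings:", np.round(loadings[0], 2))   # the largest weights are the driving variables
```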
In many cases people only look at the two-dimensional PCA plot, but sometimes you can actually see some nice separations in three dimensions, and of course it gets very hard in four dimensions. If your PCA is giving you components that each explain around 20% of the variance, that suggests you should be visualizing things in multiple dimensions. If you're getting a PCA plot where the first principal component accounts for 98% of the variability and 2% is in the others, you're done, but it also says that all of the variability is captured by just one particular vector, one small set of metabolites. If you are seeing a hint of separation and you're not ready to give up, you can use a technique called PLS-DA, partial least squares discriminant analysis (there is also a variant people call OPLS-DA). This is classification, not clustering. PCA is unsupervised; with PLS-DA you have some answers, you have prior knowledge, you're labelling things, you're colouring things. Arguably, if we went back to this plot and didn't colour the points, if they were all blue, I think I'd be hard pressed to say they actually separate into clusters; but by providing some prior information, most of us will agree that they seem to fall into three potential clusters. That's what PLS-DA is: it allows you to cheat a bit and use labelled, coloured data to enhance the separation, and in arm-waving terms it rotates the PCA components so you get maximum separation between the classes. So PLS-DA is actually a prediction, a class prediction; think of it as a form of machine learning if you like, although it's statistically based. What you have to do is make sure you're not cheating too much, so you have to validate and assess PLS-DA to make sure things are not over-trained. This is something that can happen in machine learning: you can over-train a model so that you always get a nice answer, then you try it on real data from someone else and it fails miserably; that's the failure of over-training. People assess the quality of PLS-DA using quantities called R-squared and Q-squared, and also a technique called permutation testing. R-squared here is a goodness of fit (an index related to, but not the same as, the correlation coefficient), and just like a correlation it ranges between 0 and 1. Q-squared is the predicted variation, a measure of the quality of prediction, and it also varies between 0 and 1. R-squared and Q-squared correlate quite well, so it's almost silly to use both; I think it's just historic, people didn't know which to trust, so they quote both. R-squared indicates how well the PLS-DA model is able to capture or reproduce the data: just as with correlation coefficients, 0.2 or 0.3 is bad, 0.7 or 0.8 is good. Q-squared is a measure of whether you're over-fitting: generally, a Q-squared greater than 0.5 is a good sign, and only rarely do you get a Q-squared of 0.9, but sometimes you do. Q-squared is partly derivable through permutation testing or cross-validation. My own preference, and what you'll find in MetaboAnalyst, is to do permutation testing, and here's the idea. Maybe we have our data, we've done our PCA, and we don't see any really obvious clusters; but now we label our data, and looking at it you might say, okay, they sort of separate, the red ones tend to be down here, the blue ones tend to be up there, but it's not great. So then we bring in the heavy guns, PLS-DA: we've labelled things, the program knows the labelling, and now it's going to rotate the coordinates to maximize the separation.
So we run this, using PLS-DA, or support vector machines, or something like that, and we get this nice separation. Okay, looks good, what do we do next? We permute: we randomly relabel all the data. If there are 100 data points, we randomly reassign which ones are red and which are blue. Now we get this: this is what the permuted data looks like, and we try, with this permuted data, to do a PLS-DA and see whether we get a nice separation. We run it, and this is what we get, and it doesn't look separated. So we do another test: we permute the data again, run it, and see what we get, and again, and again, recording how often we see a good separation and how well separated the permuted models are compared with the original. We can actually plot a separation score, and if we find that the original model sits way out here and all the permuted models sit over there, then we have a measure of the significance, of how good that PLS-DA model is, and that significance, based on several thousand permutations, might be 0.0005, so it's really significant, it's real. As I said at the very beginning, the numbers you generally want to work with are 30 or 40 cases and 30 or 40 controls; you can't know a priori whether your model is real, and that's what this permutation testing assesses. Okay, so there are other types of classification methods besides PLS-DA: there's SIMCA, there's orthogonal projection to latent structures, there are support vector machines, and there are machine learning methods, random forests, naive Bayes, neural networks; those are also machine learning techniques, and they're all classification tools. Some are slightly better in certain circumstances, some are very mysterious, some are very straightforward. There are also decision trees. All of these are essentially supervised classification methods. So there's a progression you follow in trying to handle your data. You've confronted this wall: how do you get through it and interpret it? If you're coming up like an army trying to beat this challenge, you might try the small arms first: something like PCA, or k-means clustering, or what in the old days was called factor analysis, and see if you get some nice separation. If those catapults don't knock things down, you can start bringing out slightly heavier guns, supervised methods like partial least squares discriminant analysis, linear discriminant analysis, or regression methods; now we've got our cannons and we're trying to knock down the wall. If the wall is still holding up, you can bring out the really heavy guns, the Big Bertha: the neural networks and support vector machines that are becoming more and more ubiquitous. With each of these approaches, if things look promising at one level, you can bring out something a little heavier, and potentially get stronger separation and more information about how things are different and why. So, in terms of that progression from the small catapults to the heavy guns: start with the unsupervised methods, see if you get something tempting, then go to the more supervised methods, where you start labelling the data, calling it red and blue or whatever. And if you are using supervised methods, you must report the validation, particularly things like the permutation test: label permutation gives you your significance, your level of confidence.
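Here's a rough sketch of that workflow (scikit-learn has no PLS-DA as such, so this uses PLS regression against dummy-coded class labels, a common stand-in; the data and parameters are invented):

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import permutation_test_score

rng = np.random.default_rng(9)
# 40 controls and 40 cases, 50 "metabolites", a few of which actually differ
X = rng.normal(0, 1, size=(80, 50))
y = np.array([0] * 40 + [1] * 40)
X[y == 1, :5] += 1.0
perm = rng.permutation(80)                  # shuffle sample order for clean cross-validation folds
X, y = X[perm], y[perm]

# PLS-DA approximated as PLS regression on the 0/1 class label
plsda = PLSRegression(n_components=2)

# Permutation test: refit on many randomly relabelled copies of the data and ask
# how often the shuffled models score as well as the real one
score, perm_scores, p_value = permutation_test_score(
    plsda, X, y, n_permutations=500, random_state=0)

print(f"real model score:    {score:.2f}")
print(f"mean permuted score: {perm_scores.mean():.2f}")
print(f"permutation p-value: {p_value:.4f}")   # a small p suggests the separation is real
```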
The permutation test tells you whether you're essentially over-trained, or whether you're really just fishing and pulling up an empty net. People have historically made a lot of mistakes by jumping straight to supervised classification. A lot of people skip the PCA, don't even look at it; they run PLS-DA and, as I said, get some great separation, or run an SVM separation, and they don't do the permutation test, they don't do cross-validation, which is important. And as I said, if the separation you're getting from PCA isn't obvious, you should tread carefully and be cautious. So I think that wraps it up, and I think that takes us…