Good morning everyone. This first session of the morning is on statistics. The goal is to give you the background on basic univariate and multivariate analysis and clustering approaches, and we will follow with MetaboAnalyst, because a lot of these concepts are built into the tools — if you understand the concepts behind them, it makes the results much easier to interpret. Here's the schedule, which you already know, so we'll just skip it. Yesterday we covered several topics, particularly compound identification and quantification for NMR using Bayesil and GC-AutoFit. For those results, if you open the Excel sheet, you see a list of compounds with names, sometimes with LODs or ranges. For untargeted metabolomics, if you use XCMS, you get a peak list, so you won't get compound names. Either way, we eventually get a large table containing all the values — be they compounds, be they peaks; they're all kinds of features. From this large table we want to extract the important information: what are the patterns, what are the pathways, and can we build a model to predict? That's the main theme for today: what are the statistical approaches for these tasks, and what tools help us get there. This slide is from yesterday; today we just go from tables to patterns to pathways to models. The left-hand side shows a representative table. When we do statistical analysis, sometimes we upload a list of compound names, but most of the time it's a table, like an Excel spreadsheet, containing numerical information — and such a spreadsheet often also carries a lot of information meant for humans to read. So before we do anything today, we sometimes need to clean it manually, because some information is for humans and some is for the machine; there's a difference. We have three main objectives. One is the basic concepts: summary statistics, the normal distribution, and p-values. In particular we'll focus on the t-test and ANOVA — for two groups, or for three or more groups — in univariate analysis. This is very basic but actually very useful: with large data we start from univariate and gradually move to the more advanced multivariate methods. There we'll focus on clustering, principal component analysis, and partial least squares discriminant analysis. If you're in the metabolomics field you'll hear these names a lot: you get data, and most of the published literature uses these statistical techniques to analyze it. So it's better to understand the statistical concepts — how to use them, how to interpret them, and what makes a model good or not good. We'll try to cover these basics. Let's start at a very high level: what is statistics? There are different definitions; for our purposes, statistics is a set of tools to help us understand complex data that we just cannot eyeball in a spreadsheet. The data are large and complex — so many peaks, so many compounds, so many samples — and even if we spent a significant amount of time viewing them, we couldn't be sure whether a pattern is robust, whether it's a real pattern. So we need statistics to help.
Statistics helps you extract the information and also test whether that information — the patterns — is more likely to be a real signal or likely to have arisen just by chance. So statistics is very important for distilling information from large amounts of data, and this information helps us make a decision: whether to accept an effect as true or to reject the null hypothesis. That is statistics. To use any of these tools, even a command-line program, we need to understand the input and the output. For metabolomics data in this context, most of the time we give the tool a data matrix — a matrix, like a data frame, a table in your Excel spreadsheet — containing all the values: concentrations, or peak intensities if you use the other report from yesterday's exercise. The other part of the input is called the metadata. That may be a new term: metadata is essentially the group labels, plus things like age and gender — all the information describing your experimental design. It's called "data about data." A lot of the time, especially in clinical studies, there are multiple metadata fields, so it can become very complex — the metadata itself is a table too. But today we mainly focus on one experimental factor, basically disease versus control. Sometimes there are also time points across different experimental stages — that's another commonly used design. From this type of input, what do we want to get? First, the significant features — basically what we think of as biomarkers; if you're familiar with gene expression analysis, these are the differentially expressed genes. The features can be compounds or peaks. Second, patterns, which we see through clustering analysis. We'll cover techniques like PCA, which is essentially a form of clustering, the heat map, which is also a clustering display, and self-organizing maps. We'll talk briefly about how to place samples or features that are similar closer to each other and represent that visually — that's clustering. Third, we can build models to make predictions. Some models are transparent: you can actually see the rules behind them — decision trees, logistic regression — you can see how the decision was made: this concentration plus that concentration is higher than some threshold, that kind of condition. Those models are easier to understand. But many of the more advanced algorithms are very complex, and people still treat them as black boxes: there are rules inside, but far too many for us to interpret intuitively. Now, data. Let's go back to the input. The matrix for metabolomics is quantitative data, and there are two types. Concentrations in metabolomics are called continuous, because a concentration runs from zero up through any measurable value. The other type is also very common: discrete. This comes up with RNA-seq data — in gene expression by RNA-seq, expression abundance is measured in counts: how many times do you see this transcript? It's a sequencing count, an integer — one, two, three, or 10,000. There are no fractions.
That's the type called discrete. Either way, it's a quantitative table, and on this table we do a lot of statistics and visualization. The other kind of data is categorical — our group labels, the metadata. Typically we have binary: basically yes or no, male or female — two groups. Nominal is usually more than two groups where the order doesn't matter, and ordinal — low, medium, high — is multi-group data where the order does matter: not only multiple groups, but the order carries meaning too. Here, though, we mainly work with quantitative data. Yesterday we saw our metabolomics data: all concentrations, which are continuous values — they have fractions; they're floating-point ("double") values. The screenshot on the right is RNA-seq data: the left column is gene identifiers (from Ensembl), and on the right-hand side are the expression values. You can see they're all integers — some zero, some 100. This is from next-generation sequencing, so they call it discrete. Continuous and discrete data usually need slightly different treatment — RNA-seq statistics versus metabolomics statistics — or you transform the counts so they behave like continuous data. For categorical data: binary (two groups) can be labeled with numbers like 0/1, yes/no, case/control, normal/disease. You can write codes only you understand — nobody else — and that's fine, but the suggestion is to use short, concise labels. Some people use very long labels with spaces in between; that habit is okay when you work manually, but when you upload to a computer program, plotting a long name with spaces often causes issues. Here we just have yes/no, case/control — very clear. For nominal data, the example is single, married, divorced, widowed: multiple groups where the order is not important. For ordinal data, an example is low, medium, high; sometimes you can even treat time series as ordinal. Ordinal is the most special category, but we're not going to cover it — we'll focus on binary (two groups) and multi-group data. Now some terms, since we keep talking about features and tables. An important statistical concept: the data we get are the observed values of variables. In our metabolomics data we have a big table, and each compound is a variable; all the values we measured for that compound across the different samples are the observed values of the variable. A variable is a characteristic of the population or the samples — be it genes, compounds, or peaks — and across a large population it takes a range of possible values. From those values we can compute the mean and standard deviation, establish a normal range, or see whether in disease a variable is up-regulated or down-regulated across populations, that is, across different experimental conditions. We also talk about high-dimensional data. The dimension means the number of variables. For targeted metabolomics — like the GC data from yesterday; I don't quite recall, around 15 or 16 compounds quantified and identified — the dimension is 15 or 16.
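To make the input format concrete, here is a minimal R sketch of the two tables such a tool expects — a data matrix of samples by compounds, and a metadata table with group labels. All names and numbers are made up for illustration:

```r
# Hypothetical toy inputs: a data matrix (samples x compounds) and metadata.
set.seed(1)
conc <- data.frame(
  alanine = rlnorm(6, meanlog = 3),  # continuous concentration values
  glucose = rlnorm(6, meanlog = 5),
  acetate = rlnorm(6, meanlog = 2)
)
rownames(conc) <- paste0("sample", 1:6)

meta <- data.frame(
  sample = rownames(conc),
  group  = rep(c("control", "disease"), each = 3)  # the group labels
)

ncol(conc)  # the "dimension" = number of variables (compounds): here 3
nrow(conc)  # the number of samples: here 6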
If instead we're talking about peaks, yesterday we had hundreds to a thousand peaks, so the dimension is much higher. The dimension is the number of variables in your data, not the number of samples. We also constantly use the terms univariate and multivariate. Univariate means we consider only one variable per subject at a time: we run the analysis considering that single variable — one gene, one compound. Bivariate means we consider two genes or two compounds at the same time. Multivariate means we consider essentially all the variables simultaneously. Note that if we apply a t-test across all the compounds, that's still univariate, because we consider one variable at a time and just apply the test again and again — repeating the same thing does not make it multivariate. Later we'll talk about PCA, which takes the whole table in one go and computes a pattern — that is multivariate. This all relates to why we care about statistics, why we pay attention to p-values and probabilities in omics data analysis. Fundamentally, we collect samples drawn from a large population, and unless we exhaust the whole population — tens of thousands, all the patients — we only ever have a small subset of it. From this small subset, say 20 controls and 20 diseased, we derive descriptive statistics — ranges, means, standard deviations — and we use those values to claim that in the whole population, normal looks like this and disease looks like that. We run a real risk: we didn't see the whole population, only a small part, and we're trying to infer the big picture, so there is a certain probability, an uncertainty, involved. How do you quantify that uncertainty? You need statistics. Here's an intuitive feeling for why our inference is sometimes inaccurate (a small R illustration follows below): from the same normally distributed population we take samples and try to infer what the population looks like — its variance. Based on one set of sample points, the inference looks more or less like the overall population; based on another, the inferred distribution looks much broader. Remember, these all come from the same population; the only difference is that the sampling landed in different places, so the inferences come out slightly different. That variation is the uncertainty we need to quantify and describe. So, the question: how do we know whether the effect we observed in our samples is true, genuine? The answer is: we don't. We only know our samples; whether the effect holds for the whole population we cannot be sure, unless we have the whole population's data. How do we quantify the uncertainty? We use p-values. P-values indicate our level of certainty that our results represent a genuine effect — that they represent the whole population. Put another way, it's about the risk we're willing to take: if the probability of seeing these patterns or signals purely by chance is very low, we accept them. In our case, we usually talk about 0.05.
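Going back to that sampling picture, here is a small R illustration of the uncertainty, assuming a hypothetical population with mean 100 and standard deviation 15: every draw of 20 subjects gives somewhat different estimates, even though the population never changes.

```r
# Draw repeated small samples from one and the same normal population
# and watch the estimated mean and spread vary from sample to sample.
set.seed(42)
for (i in 1:3) {
  s <- rnorm(20, mean = 100, sd = 15)  # population: mean 100, sd 15
  cat(sprintf("sample %d: mean = %.1f, sd = %.1f\n", i, mean(s), sd(s)))
}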
So if the p-value is less than 0.05, we consider it acceptable — most likely true. The p-value is the probability that the observed result was obtained by chance: basically, what's the chance this is actually a fake pattern? If that chance is below 0.05, we consider it acceptable. Again, this really depends on the field, but we usually use 0.05 as the cutoff; in certain fields you need to be more stringent. Yes? (Question: in genomics they use 0.01.) 0.01, yes. (Question: for metabolomics, is there a reason we have to use 0.05, or can we go up?) You can go up — it really depends on the particular situation. Especially with clinical samples, which are hard to obtain, people sometimes use a slightly higher cutoff; I've seen 0.1, and people can argue it's still okay. But if you go toward real clinical applications and very stringent settings, they want to be much stricter, and there are concerns with looser cutoffs. Still, there are always situations where you just cannot get enough samples, or whole fields where the signal is very low. So 0.05 is not a magic number — 0.01 is fine, 0.001 is fine. The cutoff really depends on the literature in your community: look at which field you're applying this in and what cutoffs appear in the published data, and try to respect that. If you cannot meet it but you really think you did good work and it's just marginal, you can make an argument based on the situation and on validation. The p-value just quantifies the chance that the pattern arose by chance; if you follow up with qPCR or another validation and you actually see the effect, then the p-value doesn't really matter anymore — you validated it, it's true, it's there, and you found it. The point is that at the initial exploratory stage, the p-value helps you filter out signals that are more likely to be false positives. If you already have validation — biological or analytical — that's the real gold standard; you can say "I ranked it, I picked it, and it turned out to be true." Now let's talk about some widely used summary, or descriptive, statistics. If we have a bunch of values — say a million — we cannot just enumerate them; we need to summarize them in as few numbers as possible. What we usually do first is describe the central tendency, the location: where are most of the values concentrated? We capture that in a single value: the mean, median, or mode. The second thing is the spread: given that single central value, most values scatter around it — how widely? If everything sat at exactly the same spot — a million values, all identical — one value would represent everything. But usually there's a center with values spreading around it, so you want to measure the spread. The wider the spread, the less meaning the center has, because the values scatter so widely that the single value — the mean or whatever — doesn't mean much. But if the spread is very, very narrow around that single value, then the mean or median is very meaningful.
Because most of the other values are very close to it. How do we measure variability? With the standard deviation and the variance — basically the spread. A lot of the time we also use other measures, like quantiles and the interquartile range (IQR). Why? Because the variance and standard deviation are really derived from the normal distribution; for irregular distributions they are not a good summary of spread, so people use quantile-based ranges, and the IQR is quite common. Let's see what each one represents. First, the mean, median, and mode. Mostly we focus on the normal distribution, and if the data are perfectly normally distributed, the three are equal; in a skewed distribution they differ. The mean is the average value, and extreme values on the left or right affect it. To address this, people compute the median — the middle-most value — which is more robust, less affected by extreme outliers. The mode is the most common value. For us, the most useful are the mean and the median. The slide shows a slightly skewed distribution and where the mean, median, and mode sit, just to give you a feeling. The mode is the most frequent value — the highest peak. The median is the value in the middle: if you rank everything from high to low, the median sits at the 50% point. And the mean is dragged toward the extreme values — here the extreme values are on the right-hand side, somewhat larger, so they drag the mean to the right. For a very symmetrical distribution like the normal, all three overlap. So that's the center: we use one value to represent it, and the more symmetrical the distribution, the more meaningful that central value is. After the center, we want the spread: how the distribution scatters around the center. For that we use the variance and the standard deviation. The variance is the average squared distance to the center, where the center is the mean — you could actually swap in the median, though that's less standard. Within R you can calculate these statistics very easily — single commands, no real code to write (a small sketch follows below). The standard deviation is the square root of the variance: taking the square root of the squared distances brings the units back to the same units as the data. So you can meaningfully talk about one standard deviation above or below the mean, because it's in the same units — but you don't talk that way about the variance, because its units are squared, and squared units are hard to relate to anything. The other commonly used term is the standard error of the mean, which quantifies the precision of the mean. In the formula on the right you see it's the standard deviation divided by the square root of n, the number of samples. So with a large sample, the standard error of the mean always gets much smaller, much narrower — the estimate gets better. For this quantity, increasing the sample number is guaranteed to decrease the standard error.
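These really are one-liners in R. A small sketch with made-up numbers, including one extreme value so you can see the mean/median difference:

```r
x <- c(2.1, 2.5, 2.8, 3.0, 3.1, 3.4, 9.8)  # one extreme value on the right

mean(x)    # pulled toward the extreme value
median(x)  # middle-most value; robust to the outlier
var(x)     # squared distances to the mean (R divides by n - 1); units squared
sd(x)      # square root of the variance; same units as the data
sd(x) / sqrt(length(x))  # standard error of the mean: sd / sqrt(n)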
That's guaranteed because n sits in the formula. Now, the first plot: the box plot. Sometimes people send me a screenshot asking, "what is that image, how do we interpret it?" — and I send them a Google link for "box plot" so they can see what it looks like; they don't even know the name. That's how this slide came about: just to make sure we all understand the box plot. (Question: I was taught that you only use the standard error of the mean if each sample is itself already an average of multiple measurements — each value is the average of several repeats — and that otherwise you shouldn't use it.) Exactly, that's the case. In our setting we use the variance and the standard deviation; we don't use the standard error of the mean — that's for a different application, exactly as you mentioned. The point is that I don't want people to get confused and think the standard deviation and the standard error are the same thing; they're not. So, the box plot. It is very, very useful for all of this data — genomics, gene expression. We should always get a box plot overview of our samples; it makes outliers and the like easy to spot. In a box plot there are several marks: here's the minimum, here's the maximum, and in between are the quartiles. Quartiles: you rank all your values — say from 1 to 100 — and mark where 25% of them fall below, then 50%, which is the median, then 75%, up to 100%. That's how your samples get ranked. Some box plots also flag outliers: points beyond roughly 1.5 times the IQR outside the box (the exact convention varies) are drawn as little circles. So the plot shows you the range; details can differ slightly, but overall this is it. Q3 minus Q1 is the interquartile range I mentioned — from the 75% mark down to the 25% mark. For a normal distribution, the majority of the data values are captured there, and if the data are relatively symmetrical and centered, it's the more meaningful measure: comparing spread is better done on this range, because the full range from minimum to maximum is more susceptible to outliers. Especially when we don't know whether the distribution is normal, the interquartile range is the more robust approach: if the data are normally distributed, the IQR and the full range tell a similar story, but if not, the IQR is more robust because outliers at either end are ignored. (Question: I was curious whether there's a minimum number of samples — an n value you wouldn't go below — to use box plots; we've had this discussion in my lab. Three to five, with less than that being a bad idea? Or higher, do you think?) For differential analysis, people usually recommend a minimum of five or six per group — so about 10 or 12 in total. Below that you can still draw a box plot, but how much confidence is there? Especially with a large number of samples in omics, we can also borrow information across the many compounds that have similar ranges: you can still see the trend of whether everything shows a similar variance. So five or six per group is the minimum; in metabolomics we should aim for more. There's no hard rule, but the more samples you have, the more confidence: those are the data points used to estimate where the mean and the quartiles are, so more data points means more confident estimates. And if across different compounds, across different genes, you see more or less similar patterns, you should feel more comfortable.
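Going back to the box plot itself, a quick R sketch of the quantities behind it (toy data; R's boxplot uses the 1.5 × IQR whisker convention by default):

```r
set.seed(7)
x <- c(rnorm(50, mean = 10, sd = 2), 25)  # 50 values plus one extreme point

quantile(x, c(0.25, 0.5, 0.75))  # Q1, median (50%), Q3
IQR(x)                           # interquartile range = Q3 - Q1
boxplot(x)  # box spans Q1..Q3, line = median; points beyond the
            # whiskers (1.5 * IQR by default) are drawn as circles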
Next: mean versus variance. We've talked a lot about the mean and a lot about the variance, and the two are very related. This graph shows that we want to make our judgment based on the mean — the means differ, we think it's significant, we conclude the two populations are different, the treatment differs from the control. But the confidence is actually based on the variance, and that's very important. You can see it here: with a smaller variance the separation is much sharper, the overlap much smaller, and we're much more confident the groups are really different. In the bottom panel the variance is large and there's a lot of overlap — the same difference in means, but with so much overlap the confidence is much lower. This is exactly the uncertainty we quantify: if you look at how the t-test or ANOVA is actually calculated, they precisely weigh the mean difference against the variance. The larger the variance, the less significant the p-value. Given the same variance, the bigger the difference in means, the more likely these are two populations — that's very clear; and given the same means, the narrower the spread, the better. Now let's look at univariate statistics in more detail. We focus on one particular variable — one compound, say alanine; or think of weight, height, IQ test scores. If we measure that variable over a whole population and plot the frequencies — how the values are distributed across the population, like the histogram we showed in our first tutorial — plotting millions of values gives a shape like this: a bell curve. This is the typical distribution for many biologically relevant measurements: with a large population, thousands of people, you'll very often see this shape — normally distributed values. Human height and weight are classic examples; especially with a large population, it's quite satisfying to see this distribution appear again and again. The normal distribution has a formula — you don't have to remember it; the hint to take away is that it is governed by the mean and the standard deviation, and we calculate everything in R anyway. So we focus on the normal distribution. It is symmetrical, meaning the mean, median, and mode overlap, and it has a standard deviation which, as mentioned, is in the same units as the data — so we can actually quantify coverage in standard deviations: one standard deviation around the mean covers roughly 68% of the population, two standard deviations about 95%, and three about 99.7%.
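You can check those coverage numbers directly in R with the cumulative distribution function of the standard normal:

```r
# Area under the normal curve within 1, 2, and 3 standard deviations:
pnorm(1) - pnorm(-1)  # ~0.683
pnorm(2) - pnorm(-2)  # ~0.954
pnorm(3) - pnorm(-3)  # ~0.997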
People are actually very familiar with this distribution and have intuitive interpretations of it; I hope after this lecture you also have a feeling for these numbers. The normal distribution is very often observed in biological and physical measurements: a lot of the time, if you plot the numbers they look normally distributed, and when they're slightly off you can often fix that with a log transformation or something similar to make them normal. Then we can compute many things using the normal distribution. For our measurements, like metabolomics, to actually see the normal shape emerge, a rule of thumb is a few dozen samples: with that many, plot the values and you'll probably see the shape. With only about 10 or 12, the pattern won't stand out — you need more. Once you have several hundred, the pattern stands out very clearly, and in large population studies with thousands it's just a natural thing. We already discussed the mean, variance, and standard deviation: the mean is the sum of the values divided by how many values you used; the variance is the average squared distance to the mean, again divided by the number of values; and the standard deviation is its square root. And as I mentioned: for perfectly normally distributed values, one standard deviation around the mean covers 68% of the area, two cover 95%, and three cover 99.7%. Some people have these memorized and reason with them directly: if your sample or value sits outside three standard deviations, low or high, the chance is less than 1% — because 99.7% of the population lies within three standard deviations of the mean. A lot of the time, though, we don't see a perfect normal distribution; we see slight deviations. One case is unimodal but skewed: the plot looks normal-ish, one big hump, but lopsided. Sometimes the data are bimodal — you see two peaks. You may see this when control versus treatment are mixed together: the population shifts, giving two modes. That's actually not a bad sign — it means you have a very strong signal; the two populations are already visibly separated. We know the normal distribution very well, so if a distribution is not normal, we try to make it normal so that our statistics apply. We don't develop a different model for every distribution; we use one model — the normal distribution — and if the data aren't normal, we try our best to make them normal so we can still use that model. That's our approach. Now consider a distribution like this — we actually saw it in our tutorial: some of our concentration data look like this across all the compounds measured. Many compounds sit at very low concentrations, skewed, just like here, while some compounds — acetate, glucose — are very high.
So when you plot the whole thing, you see a very skewed distribution across all the compounds measured, and we cannot apply normal-distribution statistics to it — unless we go nonparametric. But before we try that — yes? (Question: we need to find the outliers. What is the general rule to identify an outlier, and is there a definition after which I can just exclude it?) That's a very good question. Outliers are really case by case. In a large population we do expect to see certain outliers: even in a normal distribution, 99.7% lies within three standard deviations, but there's still a fraction outside. The outlier is still part of the normal picture — it's just uncommon. For each outlier you need to double-check whether you have an interpretation for how it happened — especially in the experiment: was it contamination, was it mislabeling? If you have a good reason, you can exclude it. But you don't want to exclude a point just because it sits slightly outside the range: in this case it sits in the tail, which looks like a natural extension tapering off, so I really don't think you should exclude these. An outlier always deserves more attention, and if you find a reason and can justify it, you can remove it. But if you keep removing outliers — you remove this one, the next point becomes the new outlier, and you remove that too — you really introduce bias. In a large population there are always some outliers; that's part of reality, and we shouldn't obsess over them. As long as the overall pattern is there, the outliers actually give us a certain confidence. Removing one is fine when you have an experimental or biological reason for it. (Question: is there a statistic to find outliers?) In MetaboAnalyst there are checks that draw your attention — "this feature is standing out" — but they don't automatically remove anything. Automatic removal is dangerous, because a human should think about what is being done. Sometimes the focus actually is on finding outliers: in a large population study of some rare disease, maybe only one patient in a thousand has that mutation, and that outlier is exactly what you need to focus on. So the tools prioritize and make you pay attention, but how to define an outlier always depends on which part of the data you're looking at, on the perspective: all the other measurements are normal, but for this particular gene or compound the value is so high. The definition of an outlier is context-specific. (Question: maybe you'll talk about this more, but isn't this also a problem for things near the detection limit? Say a metabolite is abundant in one population with a normal distribution, and in the other population most samples are just zero with maybe a little bit in some — then the distribution on the low-abundance side is extremely weird, right? Can you do statistics between them? It seems a pretty clear case.) Yeah, that's a very good question. For low-abundance compounds you usually don't have as much confidence as for the more abundant ones; if your judgment is based on one of those, you need to think more — double-check.
I know how gene expression — RNA-seq and microarray, a more developed field — deals with this across large studies. In the latest Nature Biotechnology MAQC studies they suggest something quite extreme: rank the genes from most abundant to least abundant, take the median split, and only compare the top half. That half is reproducible and robust; the low-abundance half they just ignore, because it is hard to reproduce. This is based on large-scale studies over the past 10 or 20 years. It's extreme, but it does underline the problem with low-abundance signals. If you work within your own lab and you really make sure you have standards and spike-in mixes, you run QC samples and watch the coefficient of variation, then you can see that the measured changes are real and have your confidence. But when you compare across studies, you don't know whether that quality is there, and you do run this danger. Running QC samples is important: if a QC compound is low-abundance but you measure it accurately, your platform really has a good dynamic range and you're fine. Very high abundance is also not good — it saturates the signal. The best measurements actually sit in the middle, the 25% to 75% span, the interquartile range — that's the sweet spot where most instruments are at their best. But data is valuable; nobody wants to casually trim their data down to a small subset. When you're paying for omics you want the whole picture — you really want to squeeze all the information out. So visualization is important, use good statistics, and judge against your background knowledge: if you think a value is very low abundance, or very high and probably beyond the detection limit, give it less weight, less confidence. That's all good practice, I would say. So, how do we fix a skewed distribution? Skewed distributions are very common, and what we try to do is data normalization or transformation to make them more normally distributed. An outlier in the earlier plot can easily stop being an outlier after transformation — another reason not to rush to exclude it. Sometimes a proper transformation saves the data: once it's normally distributed, the statistics become more robust and we can draw our conclusions more confidently. Here's a skewed distribution — it looks like an exponential — and after log transformation you see it becomes normally distributed. That's a theoretical example you can generate in R (see the sketch below): apply the log and it turns normal. And here is actual real data, I think from about 14 or 15 samples: plotted raw, the concentrations are heavily skewed, with a lot of values at the low end. Apply the log and you can see it gets better. The log shrinks extreme values so they have less impact: magnitudes like 10,000, 15,000, 20,000 — just too much — become something like 10 to 16 on the log scale, so values become more comparable. You still see a little skew, but it's more normal, and running normal-theory statistics like the t-test on this will be more robust than running them directly on the raw values. This distribution actually became better.
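A minimal R sketch of that theoretical case, on simulated data (the normality test shown at the end is one common check, with the caveats discussed in a moment):

```r
set.seed(3)
x <- rlnorm(500, meanlog = 8, sdlog = 1)  # skewed, like raw intensities

par(mfrow = c(1, 2))
hist(x, main = "raw: right-skewed")
hist(log(x), main = "log: roughly normal")

shapiro.test(x)$p.value       # tiny p-value: clearly not normal
shapiro.test(log(x))$p.value  # large p-value: consistent with normal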
(Comment: one thing about the log — it makes the very high values less influential, but on the other hand it brings up the low end, so the low values get more impact; if the low values aren't trustworthy, maybe you shouldn't do it.) Personally, I don't think it matters much: you take the log and mostly you see the same effect — plot log2 against log10 and you won't see much change. (Question: first, can you tell whether a distribution follows the normal distribution — are there statistics for that? And second, if you find it skewed and non-normal, when do you log-transform and use parametric tests versus just applying nonparametric statistics?) Those are all good questions. How do you tell whether it's normal? There are tests that tell you whether it's normal or not, but they're not robust, especially with very small sample numbers. With only 20 samples, those are all the numbers you have, and these normality tests rest on assumptions — they really need hundreds or thousands of measurements to give solid conclusions. So testing normality is possible, but it doesn't work well here, and most of the time it misleads you: it will reject 90% of your variables as non-normal, and then what do you do? The reality is that the t-test is quite robust — slightly non-normal data still works fine. Most of the time you actually judge by visualization. There are also newer statistics that borrow information across samples and across variables, and that kind of thing improves your estimates of the variance and the mean. If it looks roughly like this, it's okay; but sometimes you do see something very extreme, and then you can try nonparametric methods. In those cases — and I've seen it clearly many times — a parametric t-test gives you no significant hits or very few, while the nonparametric test fits better and has more power. If something looks worth trying, just try it: sometimes the nonparametric version genuinely works better. (Question: is it always good to apply a log to your data, or are there cases where you just shouldn't?) For concentrations — at least in gene expression and in metabolomics — the log works very well most of the time; all the standard normalizations in RNA-seq, microarray, and metabolomics work on a log scale. So unless you have other strong reasons not to use it, it's the first thing to try. There are other transformations too, like the cube root — you see people go to extremes trying all different means of making things normal. (Question: biological samples usually don't follow a normal distribution, so it's better to use nonparametric — is that true?) I don't know where you got that. (If you screen a large population, the distribution is usually not normal, so nonparametric is better?) No — I think if you have a large population and look at one particular compound...
...I'd say in a large population, if you plot it, it will mostly be normally distributed. As we showed before with weight and height, the normal distribution is actually common in biological settings — at least in my experience. So if the data aren't normal, take a log or do something to make them normal, because nonparametric tests are usually less powerful than parametric ones. Make it normal, then apply parametric — that's the way; nonparametric is the last resort. Going straight to nonparametric is, to me, not making the best use of your data. Here's a paper — not the very latest, but still a very thorough exploration of all the different ways to improve the data: by centering, by scaling, by transformation, making the data look more normal. It was published in 2006, I think, and it's widely cited because they did a very thorough job comparing the effects of the different choices and their possible consequences, and they make recommendations based on that. The recommendations aren't universal, but it's a real review of how complex this can get. One thing about how we want to treat our metabolomics data: we measure hundreds of metabolites, and we cannot use a different transformation for each compound. If we transform, we do it consistently: if we log, the log applies to the whole data matrix. We don't say "this compound isn't normal, log it; that compound is already normal, don't." We don't make judgments on individual compounds — we make them on the whole data matrix; treating each compound differently is usually not good practice. Okay, so: consistency across the whole table. Now, you probably already know this, but here are two normal distributions: are they different or not — two populations or one? Raise your hand if you think they're the same. ... And how about this pair? Less certain, right? As we already mentioned, both the difference between the means and the variance play an important part in our decision — in how confident we are that these come from two populations rather than the same one. We need to quantify that uncertainty, and we do it with a t-test on the two samples. The t-test is about the most commonly used test; it compares the means between two conditions. Our assumption is that if the samples come from the same population, they should share a similar mean — and actually a similar variance too; though, as I showed you, if you really draw samples, the estimated variances can come out slightly different. The t-test accordingly comes in two flavors, via a parameter for the variance: equal variance or unequal variance. If the two means are statistically different, the samples are likely to be drawn from two different populations — that's the conclusion we're trying to reach. Types of t-test: if the samples are independent — each from a different patient, different people — we use the independent t-test. But sometimes you do a drug treatment with before and after on the same subject, so the samples are paired — the paired t-test. And for each t-test we also have a nonparametric version: the Mann-Whitney U-test for independent samples, and the Wilcoxon signed-rank test for paired samples.
Among the parametric tests, as I mentioned before, Student's t-test tests whether two normally distributed populations have equal means; most of the time, if we assume they come from the same population, they should have the same mean and the same variance, so we assume equal variance. In some cases people use Welch's t-test, which is designed for unequal variances. When you run a t-test you have that parameter — equal or unequal variance; the default here is equal, which is the most powerful. (Question: what do you mean by equal variance?) Look at this pair: they have different means, but the spread is the same, right? And here as well — we compare the means, but the spread is the same, so we can pool the two groups to estimate a common variance. But if one variance is small and the other big — you saw it on the earlier slide — here the variances are different. Student's t-test assumes equal variance, and under that assumption it's the most powerful test. As I said: if the data are relatively normally distributed, try to use a parametric test — it has more power; if you have extreme values and clearly see outliers, try nonparametric. Sometimes you do see the nonparametric test give you a larger number of significant features. In metabolomics it's parametric most of the time, because the data are continuous and close to normally distributed. Microbiome data, on the other hand, I've seen be so extreme that no matter what you do, it's just not normal. (Question: there's no hard and fast rule — some data have nearly equal variances and some have larger differences. How do you decide what is too much or too little, given that you run the same test and the same transformation on all the samples, so your decision has to be based on all of them?) Yes — you're describing a very real situation; we do see exactly that. When we run a large-scale study, we have to consider the overall trend and apply a consensus approach. For an individual variable it might really be better to use the unequal-variance test, or a different normalization that makes that particular variable more normal. But we just cannot afford the time to decide, for each particular compound or gene, the best transformation, the best statistic, equal or unequal variance. We apply the method the majority of variables seem to agree with; in a particular case it may not be optimal, but that's the practice — we can't go back and optimize individual cases, because even when you know how, you can't afford to do it. (Follow-up: so generally you would use nonparametric, or transform? Because my understanding is that assuming equal variance is a serious problem in most cases.) In most cases, once you transform and visualize, the data are roughly normally distributed and the variances are largely equal — unless there's some special reason, the assumption holds most of the time, I'd say 95%, so you don't have to worry too much. Certain cases do need attention, but we can't focus too much on them; across most studies the majority practice holds. We're talking about general rules here — for an individual case, I agree: you spend the time, plot it, and find out what's best to do.
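Here's how those t-test flavors look in R, on simulated data. (One note on R specifically: t.test defaults to the unequal-variance Welch version; you set var.equal = TRUE to get Student's.)

```r
set.seed(11)
control <- rnorm(10, mean = 5.0, sd = 1)
disease <- rnorm(10, mean = 6.5, sd = 1)

t.test(control, disease, var.equal = TRUE)  # Student's t-test (pooled variance)
t.test(control, disease)                    # Welch's t-test (R's default)
wilcox.test(control, disease)               # Mann-Whitney U, nonparametric

# Paired design: the same subject measured before and after treatment
before <- rnorm(10, mean = 5.0, sd = 1)
after  <- before + rnorm(10, mean = 0.8, sd = 0.3)  # consistent per-person shift
t.test(before, after, paired = TRUE)        # paired t-test
wilcox.test(before, after, paired = TRUE)   # Wilcoxon signed-rank test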
(Follow-up: if we do log-transform, is it the custom in metabolomics to report, as the descriptive statistic, the arithmetic mean or the geometric mean?) The p-values, and graphics like PCA, are computed on the transformed data. But if you want to report absolute quantification — say for a biomarker — then once you see something significant, you go back to the original values to report the absolute numbers. So: compute p-values and run PCA or PLS-DA on the transformed data; report values from the raw concentrations. (But do you report the geometric or the arithmetic mean, if you have a skewed distribution and you tested log-transformed data?) In metabolomics you'll see both: actual means from the actual values, and box plots of the log-transformed data. Again, respect your field: if it reports things one way, follow that; the tools will give you both. It's not a very hard question — it's just about the practice in your field, I think. So we've been through the independent test; now the paired t-test. The paired t-test helps reduce the variance: the same person is tested before and after, so when you subtract, the person's own background drops out and you get a much sharper signal. If the design really is paired and you use the paired test, the significance — everything — improves a lot. The slide shows it: on the left, the spread is so big, the variance so big, that there's a lot of uncertainty; with the narrower, smaller variance from the paired design you get much more confident, much stronger significant p-values. The narrower the variance, the more significant the p-value. So if your samples are paired, use the paired test. Sometimes physicians even construct matched pairs based on age and medical history — not the same person, but paired on background to reduce variance. Now I think we've covered the t-tests, and we move on to ANOVA. ANOVA is for more than two groups. The null hypothesis is that all the group means are the same; the alternative is that two or more means differ from the others. Notice what that implies: a significant ANOVA tells you there is a difference among the means, but not who differs from whom — that's just what ANOVA does. And the issue is that "different" is relative: how much difference is big enough? We compare the mean differences against the variance, but here the variance splits into within-group and between-group parts, which makes this test a bit different from the t-test, where two samples are relatively easy. ANOVA formulates the statistic as the between-group variability divided by the within-group variability. What we want to see, for a big effect, is a huge difference between group A and group B.
Within each group, meanwhile, things should be consistent — basically each population homogeneous within itself, but very different between groups. Here you see blue, red, and green: within each group, very consistent and homogeneous; between groups, a large difference. That's what the test measures: the more homogeneous within groups and the more different between groups, the larger the F-value and the more significant the p-value. It's very intuitive — you can see it here: ideally each group is narrowly focused, here and here, the groups far apart, but everything within a group packed close together. This F-test idea — between-group variability against within-group variability — is a very useful way of thinking; it isn't only used in this test, other algorithms use it too. On the slide, the between-group and within-group variances are defined. Within-group variance: within the red group, say, each value is compared with that group's own mean — the group has its mean here, and we ask how different each value is from it. Between-group variance: from here to here — each group's mean is compared with the overall grand mean, so each group acts like a single data point. In the formulas, the bar denotes the grand mean and k is the number of groups; if you calculate this in R, you don't need to worry about the details. Then F is the between-group variance divided by the within-group variance: the larger the F, the more clearly the groups differ relative to the spread within groups — we get a significant p-value, we reject the null hypothesis, and we conclude there is a significant effect. The drawback of ANOVA is that from the single p-value we can tell there is a difference between the groups, but not where the difference lies. So after finding significance, we need a post hoc test to ask which two groups differ. We won't go into post hoc testing in depth — within this type of analysis the tools let you run it easily (see the sketch below) — and we'll just cover the flavors of ANOVA: one-way ANOVA and two-way ANOVA. Two-way ANOVA means you're testing more than one factor — say male/female and control/disease — so you have two factors in your design. That's two-way ANOVA.
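A minimal R sketch of a one-way ANOVA with a Tukey post hoc test, on simulated groups:

```r
set.seed(5)
values <- c(rnorm(8, mean = 10), rnorm(8, mean = 12), rnorm(8, mean = 15))
groups <- factor(rep(c("A", "B", "C"), each = 8))

fit <- aov(values ~ groups)
summary(fit)   # F = between-group / within-group variability; one overall p
TukeyHSD(fit)  # post hoc: which pairs of groups actually differ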
(Question: say I run ANOVA with 100 groups, each with a relatively normal variance within the group, but one of them is way off from the others. The between-group variance still seems very small, because everything sits pretty close together except that one group. If I run the ANOVA, do I just conclude the F isn't big?) No — you will get a big F if that group is really that different. (But the more groups that are similar to each other, the smaller the F gets when one group is a total outlier, right?) You'd need to normalize first — once you normalize, that group won't be so extreme. What you're describing is really data that aren't normally distributed: if 99 groups are nearly equal and one group is that different, you're talking about something extreme. When you analyze it you need to normalize, and the normalization either brings everything up or pulls that one group back toward the rest of the distribution. I see your point, but in reality the transformation will address it — you can't just apply ANOVA to that raw, unless you artificially create such data. (I'm just wondering — a nonparametric ANOVA, maybe? Say it's a drug test: one group of people took some drug, and there's just nothing in all the other groups.) I think nonparametric is probably going to help in that case. So, to conclude: the t-test assesses two groups — it can compare two samples, or one sample against a given value (for example, whether a sample differs significantly from zero). ANOVA compares more than two groups, weighing the between-group variance against the within-group variance. Now, p-values again — I mentioned them before, and here we make sure it really goes home. The p-value is the probability of seeing a result as extreme as, or more extreme than, the result from the given sample if the null hypothesis is true. Basically: what chance are you willing to accept of seeing this when there is no real effect? How do we calculate p-values? We already know that if a value falls outside three standard deviations, the chance is less than 1% — we know that from the normal distribution. So if the data are normally distributed, we can compute p-values directly. But in some cases we can't, because the data aren't normal and we don't know the chance of anything. To get robust p-values: first, normalization — we normalize the data as much as possible, because statistics based on the normal distribution are the most robust and efficient. Anything beyond that — nonparametric tests or permutation, which we'll discuss more — makes things more complicated and takes a lot of computing. Say 90% of the time the data are, or can be made, normally distributed and we're happy; in some cases we need a nonparametric test; and in extreme cases, where we're pretty sure we don't know the distribution, we need permutation. In large-scale studies — look at the latest Nature papers comparing multiple omics, with all kinds of complexity — nobody knows the distribution, and the only thing you can do is permutation. We'll go through these procedures briefly. Normalization: the normal distribution is assumed in most statistics, the t-test and ANOVA, so we really want the log transformation. The other one is z-scaling, also called auto-scaling: each variable gets mean zero and standard deviation one — that's the z-scale. It's easy to do in R; the scale() function gives you exactly that (a sketch follows below). If you box-plot the scaled data, you can see each variable distributed around its mean — or the median: this one, for example, is very normally distributed, while some, like this one, look more problematic. But you shouldn't aim for every variable to end up equally nice. With large omics data you apply a generally useful approach: some variables will be happy, some won't. Making every single variable normally distributed just cannot be done — you'd have to work on them individually, and for a single-variable analysis that's fine, but in omics, in multivariate studies, you simply cannot afford individual treatment of each variable.
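Auto-scaling is indeed one line in R. A small sketch on a made-up matrix, applying the log first as discussed:

```r
set.seed(2)
mat <- matrix(rlnorm(50, meanlog = 4), nrow = 10)  # samples x variables

z <- scale(log(mat))    # log first, then auto-scale each column
round(colMeans(z), 10)  # every variable now has mean ~0
apply(z, 2, sd)         # ...and standard deviation 1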
With a box plot of the standardized data, you can see each feature's distribution around its median. Some features, like this one, look very normally distributed, and some of them seem more problematic. But you shouldn't aim to make every single feature look normal, because with a large omics dataset you apply one generally useful transformation: some features will be happy, some won't. If you want to make every feature normally distributed, you really just cannot do it, because you would have to work on each one individually. Doing individual analysis is fine for a single compound, but in omics you don't do that: you cannot afford the time, and in multi-omics, multivariate studies you simply cannot treat every feature differently.

A participant asked about the slide: is this plot on z-scaled data without transformation, or on log-transformed data? I think it's just for illustration; I'm not actually sure whether it's log or z-scale, but since the values run from about minus 4 to 4, this one is z-scaled. Another question: conceptually, would you do the transformation and then the scaling, or does it not matter? You do the log first, and if the data still doesn't look normal, you can further apply z-scaling. The thing is, the more transformations you do, the more complex it becomes to interpret your data. You can do these things to make the data more normal, but then how do you relate the transformed values back to the original scale? You need to keep that link.

Next, non-parametric tests. These are based on ranks, and the disadvantage is that a lot of information is lost. Values of 1,000 and 1 become ranks 2 and 1, and values of 1.1 and 1.0 also become ranks 2 and 1, so the size of the difference disappears when you convert to ranks. Ranks are useful, but not as powerful as a parametric test. Power means the ability to detect a difference when the difference is really there: if you convert to a non-parametric test, sometimes you detect no difference where the parametric test would have found one. That's the power we usually talk about.
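For two groups, the standard rank-based choice in base R is the Wilcoxon rank-sum test, wilcox.test (kruskal.test is the analogue for more than two groups). A minimal sketch on simulated data:

```r
# Sketch: rank-based alternative to the t-test on one metabolite measured
# in two groups. Data are simulated; only base R is used.
set.seed(2)
control <- rlnorm(15, meanlog = 0)
disease <- rlnorm(15, meanlog = 1)

t.test(control, disease)$p.value       # parametric, assumes normality
wilcox.test(control, disease)$p.value  # Wilcoxon rank-sum: uses ranks only
```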
The other part is a very important concept called empirical p-values. We can compute a p-value from normally distributed values, as in the t-test or ANOVA, but sometimes we just don't know whether the data is normally distributed or not. What we do then is a permutation, from which we can compute an empirical p-value. How? We create a null distribution computationally, by permuting the group labels. Say we have 30 people, 15 controls and 15 patients, and we believe these labels are meaningful, that they really characterize two populations. Now we randomly shuffle the labels, putting patient labels on healthy people and healthy labels on patients, and recompute the same statistic. If, after permuting the labels, we still get the same statistic, it really means the labels are not meaningful: what you thought was patient versus control carries no signal, because a random labeling gives the same result. The empirical p-value quantifies this confidence: if it is very small, the chance that your pattern is a random effect is very small, so you can confidently say that the control-versus-disease labels really carry biological meaning and the significant features are robust.

We don't do this manually, because it would take a lot of time, but computers are very good at it. Here are the steps: under the null hypothesis, the data from both groups come from the same distribution. We calculate the original statistic, for example a t-statistic, or an F-statistic for ANOVA. Then we shuffle the group labels, recompute, and repeat, a thousand times or even millions of times. The more permutations you run without finding anything as good as the original, the more confident you are. If you shuffle one million times and the original statistic is still the best, with all the others very different, your result is very significant: the p-value is less than one in a million. But you can never say the p-value is zero, because you did not run infinitely many permutations, which is impossible.

I'm not going to bother you with all the details, but here is basically how the labels are shuffled. We have cases and controls, defined as samples 1 to 9 and 10 to 18; that is the original assignment. Now we sample: this sample 9 moves here, this one moves there, and every time we recompute the group means for the shuffled data. We repeat this hundreds or thousands of times and ask whether the shuffled mean difference is close to the original one or very different. Doing this many times, we can plot the distribution of the differences: the difference you actually observed sits way out here, while almost every permuted labeling gives a much smaller difference. In that case the signal is very strong, and you can report a p-value of less than 0.001, because you did 1,000 permutations and none of them was as good. You cannot say zero, but you can say p < 0.001. This is very common in large omics studies. I hope you understand the permutation procedure; you don't have to program it yourself, but at least you understand how the number was obtained.

The general advantage is that you do not rely on any distribution assumption, normal or not normal; you can do a permutation test for almost anything, as I said. There are also hidden correlations among the variables: in omics, some components are correlated with other components, and because you only permute the sample labels, this correlation structure is preserved. The disadvantage is computational intensity, and that you have to write a procedure that really recaptures the reality while permuting only the sample labels. Writing a correct permutation procedure can be hard; it is customized to each situation, and you have to think carefully about what exactly to permute. So here's a question: if you are asked to compute an empirical p-value for multiple groups, what value would you compute? The F-statistic from the F-test I already mentioned. If you shuffle the labels, the between-group versus within-group variance changes: for the original labels the F value will be very high, and for the shuffled ones it will be very low. If you see that pattern, the signal is real, not random; if the original and shuffled F values are similar, then there is a problem.
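Here is a minimal sketch of the two-group version in base R; the data, the group sizes, and the choice of the absolute mean difference as the test statistic are all just assumptions for illustration:

```r
# Sketch of an empirical p-value by label permutation, assuming 'x' holds
# one metabolite's values and 'labels' the group assignment. Simulated data.
set.seed(42)
x <- c(rnorm(15, mean = 5), rnorm(15, mean = 6))
labels <- rep(c("control", "disease"), each = 15)

diff_means <- function(v, lab)
  abs(mean(v[lab == "disease"]) - mean(v[lab == "control"]))

obs <- diff_means(x, labels)                 # the original statistic

n_perm <- 10000
perm <- replicate(n_perm, diff_means(x, sample(labels)))  # shuffle labels

# +1 in both places so the empirical p-value can never be exactly zero
(sum(perm >= obs) + 1) / (n_perm + 1)
```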
Next: hypothesis testing and the multiple testing issue. Usually when we do a statistical test, we assume the null hypothesis is true, that is, we assume there is no effect, and then we compute the chance of the observed result; if that chance is very low, we reject the null hypothesis. This is the very traditional approach, and there are some issues with it. The goal is a statement about a population based on samples. The assumption is that the sample is randomly selected from the population; we set up the null hypothesis, test it, and get a p-value; and based on a threshold, usually 0.05, we reject or fail to reject. It is almost a ritual, and we all follow it, consciously or unconsciously. If the p-value is very small, below the threshold, we declare the result statistically significant.

The issue here is that we do multiple testing. We have hundreds to thousands of compounds, or tens of thousands of peaks, so we are testing many times. We are willing to accept a small chance of error, 0.05, for each individual compound, but if you test 100 times you expect 5 false positives, and if you test ten thousand times you expect about 500 false positives. That is the multiple testing issue. How do we address it? One very natural way is to make the cutoff more stringent: divide the significance threshold by the number of tests, or equivalently multiply the p-values by that number, which gives the corrected p-values. If a p-value is so significant that even after correcting across ten thousand tests it is still below 0.05, you are fine. This controls what is called the family-wise error rate, and the most famous version is the Bonferroni correction, which a lot of people use. If your feature is still significant after Bonferroni correction, that is a very conservative criterion, so it is a very significant result. But a lot of the time, after Bonferroni, many features that were significant become not significant, simply because Bonferroni is so conservative: it makes the safest possible bet, and in reality it is often too conservative, so you lose a lot of real signals, real significant features.

A participant asked: one of the results of XCMS Online is the cloud plot of the most significant features from an experiment; some of them are Bonferroni-corrected and some are not, so which ones do we really have to go for? I don't use XCMS Online myself, but that would be consistent with what you are saying, that some are corrected and some are not. Don't they give you an option to use Bonferroni or not?
Yeah, so that's the kind of dilemma everybody faces, even me when I'm doing metabolomics. Bonferroni-corrected results basically give you more confidence, but you don't want to disregard the features that are not Bonferroni-corrected, because the false discovery rate also gives you something. It's up to you what you choose: if you have a lot of significant features, you can use the most conservative criterion and still have plenty; if you don't, you can use the FDR-adjusted, false-discovery-rate-adjusted values and still have much to think about. The statistics are not the truth; they just help you select and prioritize. If you trust a cutoff too much and believe there is no true biological signal below it, that's not right: the biology is there regardless of the statistics. If a feature falls below all the cutoffs but your gut feeling says this is my significant molecule, and you do the validation and it is significant, then the statistics simply didn't capture it. So I think tools like XCMS Online and MetaboAnalyst shouldn't make the judgment for you; they should reveal as much information as possible and let you think, let you make the judgment.

So here is the false discovery rate, and it is very popular; in MetaboAnalyst we use the false discovery rate because it is a reasonable compromise between the very liberal raw p-values and Bonferroni. FDR 0.05 means that 5% of the features declared significant are expected to be false positives: if 100 compounds are significant, about 5 of them are expected to be false. With FDR we can choose how much we are willing to accept: people can use 0.1, or even 0.15 if you are really willing to stretch it. So the FDR threshold is not fixed at 0.05; it can be somewhat higher.

A participant recommended an article that taught them how FDR works: it's actually pretty cool, you can do it in a spreadsheet, going step by step, saying this one is bigger than that one; it's all about ranking and p-values, and if you spend three hours doing it, you know FDR by heart. Yes, if you really want to go in that direction, that's a good article.
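As a quick illustration, base R's p.adjust covers both corrections; the vector of raw p-values below is simulated, mimicking mostly-null features plus some real signal:

```r
# Sketch: Bonferroni versus Benjamini-Hochberg FDR with base R's p.adjust,
# on a made-up vector of raw p-values from thousands of feature tests.
set.seed(7)
pvals <- c(runif(9000), rbeta(1000, 0.1, 1))  # mostly null, some signal

p_bonf <- p.adjust(pvals, method = "bonferroni")  # family-wise error rate
p_fdr  <- p.adjust(pvals, method = "BH")          # false discovery rate

sum(pvals  < 0.05)  # "significant" before any correction
sum(p_bonf < 0.05)  # survivors of the conservative Bonferroni cutoff
sum(p_fdr  < 0.05)  # FDR 0.05 keeps more, with a controlled false fraction
```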
Now, high-dimensional data. We have talked about the very basic t-test and ANOVA, and now we move to multivariate statistics. Multivariate analysis is actually more natural for omics data, because the data consists of all compounds simultaneously, and we know the compounds are not working by themselves independently: they work in pathways, in biological processes, and a lot of them are interconnected one way or another. It is a system, and we should consider the whole system in one go rather than separately, so multivariate statistics fits naturally: it considers all the variables together. Think of your personal profile: your height, weight, hair color, clothing, eye color. Individually, a lot of people share the same height or the same weight, but put all of these together and it is uniquely you, your profile. We need to put everything together into a multivariate descriptor, and it is the same thing here: a multivariate descriptor of the metabolome for a particular sample or a particular population. That is where multivariate statistics is superior; the question is how a multivariate method actually captures the complexity of reality, and there is a lot of development going on, but we will stick to the basics.

A normal distribution for a single variable you have already seen, and we can also have a bivariate normal: it looks like an oval, and if you project it onto either dimension you see a normal distribution on each one. A trivariate normal looks like a globe, and you can project it in each direction. The reality is that we can visualize up to three dimensions, but when we go to 10 or 100 dimensions we cannot visualize easily. Most of the statistics developed in the early days was univariate: you focus on one variable, or maybe two or three, but you have many samples, many observations, hundreds of them, and then everything is fine; linear regression, the t-test, logistic regression, all the classics I learned in the early days work. But once you take these tools to omics, to gene expression, we cannot use them directly, because the assumptions are violated, so we really need new approaches. The special challenge is that we have far more variables than samples: p is much larger than n, where n is the number of samples and p is the number of variables. What we do is still use univariate statistics to help us understand the data, because it is simple and people know it, so it remains useful; but to describe the overall picture we use multivariate statistics, machine learning, chemometrics, and visualization tools.

Let me give you some terms. In machine learning there are two general approaches. One is called unsupervised learning. Unsupervised means the method just explores your data without looking at the labels: it doesn't know which samples are from healthy people and which from diseased ones; it just reveals the patterns hidden in the samples. If, with an unsupervised approach, you see the samples naturally separating according to the sample labels, controls in one group, then your study, your data, really contains that signal, because it stands out by itself. But a lot of the time the data, your study design, is confounded by many other factors you cannot control, especially in large population studies, and when you visualize the data or run an unsupervised method you see everything mixed together; it doesn't mean there is no signal, just that the signal is buried. In that case you try supervised learning. Supervised learning tries to find the hidden signals, using linear or sometimes nonlinear combinations, and to work out which features relate to your labels: it uses not only your data but also the labels, so it can say this feature really relates to disease, this one to healthy; it distills and extracts the signal and then lets you see it. So supervised learning is more powerful and can show you the patterns, but on the other hand it can be over-powerful and overfit: it will reveal patterns that are not there, because it is trying to please you; you asked it to find patterns, so it shows you patterns. With supervised learning you really need to do cross-validation: hide part of the data with no labels, and once the model claims to have found the patterns, hand it the new data and ask which samples are healthy and which diseased. For supervised learning you really need to consider cross-validation and permutation
to make sure the pattern the model found is real, because machine learning is very powerful and a lot of the time it gives you false patterns; it is very easy to be naively convinced that something is real. Unsupervised learning is more or less safe, because it reveals the patterns naturally hidden within the data, so you don't have to justify it in the same way. You can apply cross-validation to unsupervised learning too, but usually it is not required. If you really want to check a cluster, you can do some permutations and see how stable the cluster is, running it, say, 100 times to see how often it holds. There are many variations at each step; because our time is very limited I am only telling you the general things, and at each step there are many flavors.

In unsupervised learning we have two approaches: one is clustering, the other is called dimension reduction. For dimension reduction we mostly talk about principal component analysis, but a lot of people think of PCA as clustering too, because in PCA space, if the dots are close to each other they are similar to each other, like a cluster, so you can group them together if you wish. Clustering is a process by which objects that are logically similar in their characteristics are grouped together: we want to find the things that are similar to each other and put them together. This seems very intuitive, but we are not doing it manually; we let the computer do it, and that's the point. To do clustering we need to measure similarity, deciding which variables or which samples are more similar to each other; we need to decide at what threshold they are similar enough to belong to one cluster; we need to decide how to arrange them by distance, placing them far from or close to each other; and we need a cluster seed, somewhere to start, often random samples or random variables. That sounds like a lot of thinking, but in reality many tools are already implemented, and some of them have proven generally very useful, so we can try to understand how they work.

Two common clustering algorithms. One is k-means, a partitioning method: it divides the n objects into m clusters without overlap, and you need to define how many clusters you want. You usually know n, the number of samples, say 100, but the number of clusters you often don't know: is it three, five, ten? So k-means requires some prior knowledge of how many clusters to expect, though you can explore different values. The other, most commonly used, is the hierarchical method: it builds the clustering across all levels, from everything in one big cluster all the way down to every object in its own individual cluster, so you don't have to decide the number of clusters in advance; you see the patterns at every layer and choose one at the end, because the method gives you everything. In k-means, assuming we know the number of clusters, the computer picks random seeds, assigns each sample to its closest seed, merges the ones that are close enough, recomputes the new centers, and repeats the whole process again and again until it stabilizes and converges, no longer shifting around.
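A minimal k-means sketch in base R; the matrix, its size, and the choice of two clusters are made-up assumptions for illustration:

```r
# Sketch of k-means with base R, on a simulated auto-scaled matrix
# (30 samples in rows, 8 metabolites in columns; all values made up).
set.seed(3)
auto_mat <- scale(rbind(matrix(rnorm(15 * 8, 0), 15),
                        matrix(rnorm(15 * 8, 2), 15)))

# nstart = 25 restarts from 25 random seed sets and keeps the best
# converged solution, guarding against a bad random start
km <- kmeans(auto_mat, centers = 2, nstart = 25)
table(km$cluster)   # how many samples landed in each cluster
```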
A participant asked whether we are comparing, say, 100 measurements of one patient to 100 measurements of another patient. First, we are not looking at just two patients; we are looking at many, say 15 healthy and 15 diseased, and for each patient we have measurements across about 100 metabolites. We compare the concentration profiles across those 100 metabolites: are two patients' profiles similar enough to each other? If you visualize the Excel spreadsheet with all the concentration tables, maybe for a normal, healthy patient the concentrations are similar to each other, everything is low, and in disease everything is high; we want to compare the patterns of concentration values over all the metabolites. It is the same idea as doing a t-test on a particular metabolite, but this time we compare all metabolites together, and that is what multivariate means: each patient is described by a vector of 100 values, one per metabolite. I think once we see real data you will see exactly what is being compared.

Here is k-means clustering step by step. Before we start there is no group assignment, so we randomly assign the samples: we guess this one goes to this cluster, that one to that cluster (you need to decide the number of clusters, and there are some ways to help you). Then the samples are recolored by clustering on the distance to each seed. After that a new center is recomputed, and with the new centers the groups are reassigned: everything close to the new center here becomes the same color. The process iterates, but after a few rounds it becomes stable and doesn't change much: if you run 3, 4, or 10 more iterations, the new centers stay where the previous ones were. The biggest change actually happens at the start, where the randomly dropped seeds move toward the real centers; after that, things shift a little toward the center and then stabilize. The random start followed by convergence is what gives you some guarantee that the patterns you found are more or less robust.

Hierarchical clustering I think most of us are already familiar with: we join the pair that is most similar, then move on to the next, all the way until everything is joined together, from everything merged down to everything as individuals. How do we do it? We calculate the distance between each pair of samples (or each pair of features), combine the most similar pair into a cluster, recalculate all the distances, and go again; at each layer we grow the tree, gradually joining, moving all the way up to the top. It is an iterative process until everything is done.

And similarity, how do we calculate it? This should help answer the earlier question. We compute the distance between two samples, say patient A and patient B: at each metabolite we take the difference, and we add the contributions over all the metabolites into one distance between these two patients. With all the measurements of one patient here and all the measurements of the other patient there, we can calculate, for example, the Euclidean distance, and we get a single number: if the two profiles are identical the distance is zero, and if the differences are very small the distance is very small.
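And a matching hierarchical clustering sketch, reusing the simulated `auto_mat` from the k-means example; average linkage here is just one of the linkage options described next:

```r
# Sketch of hierarchical clustering on the same simulated 'auto_mat'.
d  <- dist(auto_mat, method = "euclidean")  # pairwise sample distances
d2 <- as.dist(1 - cor(t(auto_mat)))         # Pearson-correlation distance
hc <- hclust(d, method = "average")         # average linkage
plot(hc)                                    # dendrogram: cut at any level
```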
That Euclidean distance is one similarity measure; this is very natural. Another one is the Pearson correlation, which looks at whether the changes are synchronized: with a positive correlation the two profiles change consistently in the same direction, and with a negative correlation one changes in the opposite direction of the other. So those are the Pearson correlation and the Euclidean distance.

In the clustering settings you are also going to see parameters called linkage. With single linkage, if we have two clusters, the distance between them is taken between the two closest points, one from each cluster. With complete linkage it is the distance between the two furthest points, and complete linkage tends to generate compact clumps; if you really play around with it you will see this very typical pattern, and actually, when I see certain dendrograms, I can say which linkage generated them. Very commonly used is average linkage, where you essentially compute the cluster centers and compare the averages; and there is Ward linkage, which is very close to average linkage with certain adjustments. Here is a typical result from hierarchical clustering: the closest pair is put together first, then the next closest is merged with it, then from another branch you gradually merge more, and eventually everything joins into one huge pattern. By doing this, similar things end up close to each other and dissimilar ones further apart, and you can see the pattern stand out.

Now PCA, which we have mentioned several times. PCA tries to find the directions of most variance in the data, and the assumption is that the main direction of change also reflects your data's characteristics: if your data really has no influence from other experimental factors, the main differences will be correlated with your experimental factor. (I think I'll finish in about ten minutes.) You can think of PCA as clustering, but most people think of it as a projection: imagine a three-dimensional object and a flashlight projecting it onto a lower-dimensional space. One direction captures the oval, bagel-like shape, and another direction captures less; PCA helps you choose the dimensions that capture most of the variance of the data. If that variance is caused by your experimental condition, you are safe, but sometimes your result is a mixture of multiple factors, and quite often PCA reveals your batch effect. Batch effects cause huge issues: you will see one cluster is running day one and the other is running day two, and PCA tells you that very clearly, because the batch has a strong influence on your data if you don't address it properly.

So how do we compute PCA? If you want to do it manually, it is actually easy in form: it is a linear operation. You combine all the variables, each with its own weight, and project onto the scores, T1, T2, and so on. T1 captures
most of the variance, T2 captures the second most, T3 the next, and so on, but we usually focus on T1 and T2. Then we look at which variables contribute the most by looking at the weights, the P values in the formula: if a variable's weight is large, it contributes a lot to the pattern in your data. So PCA is a linear combination of all the variables projected onto a small number of dimensions; you don't have to keep only a few components, but this is the basic idea. Again, the scores are what we focus on, usually score 1, 2, or 3, in the score plot; and the loadings are the weights of your variables, metabolite one, metabolite two, and so on, in the loading plot. If the weight is high, that metabolite has a high influence. The score plot is where you see the patterns, and from the patterns you go back to the loadings, because these weights are your loadings.

Let me give you another view of PCA: the components are uncorrelated, so T1 and T2 are orthogonal, which means they give you independent information about the same data. T1 gives you the largest variance of the data and T2 the second largest. If the largest variance happens to be your experimental factor, you are fine; but sometimes the first component is your batch effect and the second one actually carries disease versus control, so look at T2 versus T3, or T3 versus T4 as well; you need to examine several components. Also, because we can transform the data, we can calculate PCA on a covariance matrix or on a correlation matrix, which comes down to how we normalize the data. With the correlation matrix, which you get by auto-scaling, all variables have the same impact on the variance even if they have different units; with the covariance matrix, variables with large values will have a strong effect. If you are not sure which is best, remember that you choose through the normalization: if you auto-scale your data, you are effectively doing correlation-matrix PCA.

Here is a typical PCA result: you have spectra from different patients, and when you do PCA you see them nicely clustered in different places in the PCA space, and you are happy. Now you want to see which features contribute to this, so you go back to the loading plot, which shows how much each variable contributes to the difference. As the formula showed, each score is a weight multiplied by the metabolite values, and the weights have signs, positive and negative, and that is what is plotted here. When you read a loading plot, you look at the same location as in the score plot: the features sitting here in the loading plot are positively correlated with the samples sitting here in the score plot. In the simplified picture, a feature in the same direction as a sample group is positively correlated with that group, and a feature in the opposite direction is negatively correlated with that group, because it is located on the opposite side. Very intuitive: same direction, positive correlation; opposite direction, negative correlation. These are the features that drive the separation pattern, and that is why PCA is so widely used: the interpretation is visually intuitive.
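A sketch with base R's prcomp, reusing the simulated `auto_mat` from the clustering sketches; because that matrix is already auto-scaled, this is effectively correlation-matrix PCA:

```r
# Sketch of PCA with base R's prcomp, on the simulated 'auto_mat'
# (samples in rows, metabolites in columns).
pca <- prcomp(auto_mat)       # input already centered and auto-scaled,
                              # so this equals correlation-matrix PCA

summary(pca)                  # variance captured by each component
scores   <- pca$x             # T: sample coordinates (score plot)
loadings <- pca$rotation      # P: variable weights (loading plot)

plot(scores[, 1], scores[, 2], xlab = "PC1", ylab = "PC2")  # T1 vs T2
head(sort(abs(loadings[, 1]), decreasing = TRUE))  # top-weight variables
```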
Sometimes, though, PCA will not succeed in showing clear clustering, no matter how many different transformations you try. If that is the case, it basically means the signal in your data is not strong enough to be revealed by PCA, and you probably want to try a more supervised approach. Still, PCA is very good for a data overview, for outlier detection, and for looking at the relationships between variables. PCA should be one of your first views of the data: just as you look at a box plot first, you also need to look at the PCA.

PLS-DA is similar to PCA; the only difference is that PLS-DA uses the group labels and tries to maximize the covariance between your data and the class labels. So it tends to always produce a separation, always, because it is really eager to please you and give you something better, and you need to be cautious about whether that separation is real or artificial; in a side-by-side comparison, PLS-DA always looks better than PCA. Under the hood, PLS-DA is based on regression: first convert the class labels to numbers, then perform PLS regression. It is very susceptible to overfitting; this is a well-known fact about PLS-DA. It is widely used, but you really need to pay attention to overfitting, so you need cross-validation and you need permutation.

A participant asked: is it allowed, or does it change things, if you pre-filter your data? Doesn't that make it like a supervised method? It's a gray area; people do filter data based on some t-test, and if you do that, your PCA will look very significant, and if I'm the reviewer, I will comment on it. You can filter on something like signal-to-noise, but do not use the group labels. People do filter before PLS-DA, and what you filter with is the key: filtering with the group labels before PLS-DA you should never do, and even with PCA, if you filter a lot using the labels, you will manufacture a pattern, because you remove everything not correlated with the separation and what remains becomes artificially coherent. A follow-up question: say there are two populations and about ten metabolites that are really different between them; will that show in a PCA out of, say, a hundred thousand metabolites? There is data filtering in metabolomics, but the filtering should be based on the overall signal quality of the data, not on the variance with respect to the group labels. If you really filter based on the group labels, you shouldn't be doing it; if you filter just a little, five percent, sometimes people accept it, and I know it's very popular in the microbiome field, but in general it's frowned upon. If your work isn't reviewed, it's fine; if I review it, I will definitely write a comment that you shouldn't do that. Don't filter the data based on the group labels before the analysis; it will always make things look much nicer.

So, cross-validation is for the validation of PLS-DA, and it is very intuitive: you hold out part of your data. You divide the data into three parts, use two to train your model, and predict on the held-out test part, and you repeat this three times.
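Here is a minimal sketch of the fold logic in base R. To keep it self-contained, a simple nearest-centroid classifier stands in for PLS-DA; the data and every name in it are invented for illustration, but the train-on-two-parts, predict-on-the-third structure is the point:

```r
# Sketch of 3-fold cross-validation: fit on the held-in folds, predict
# the held-out fold, repeat for every fold. All data simulated.
set.seed(9)
X <- scale(rbind(matrix(rnorm(15 * 8, 0), 15), matrix(rnorm(15 * 8, 1), 15)))
y <- rep(c("control", "disease"), each = 15)

k <- 3
fold <- sample(rep(1:k, length.out = nrow(X)))  # random fold assignment

hits <- 0
for (f in 1:k) {
  train <- fold != f
  cen_c <- colMeans(X[train & y == "control", , drop = FALSE])
  cen_d <- colMeans(X[train & y == "disease", , drop = FALSE])
  for (i in which(!train)) {                    # predict held-out samples
    pred <- if (sum((X[i, ] - cen_c)^2) < sum((X[i, ] - cen_d)^2))
              "control" else "disease"
    hits <- hits + (pred == y[i])
  }
}
hits / nrow(X)   # cross-validated accuracy
```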
If you have a very small number of samples, you can do leave-one-out instead, so you run it many more times, because with three-fold you don't always get very smooth patterns. Cross-validation is used to determine the optimal number of components for building a PLS-DA model, because for prediction you need to decide how many components to use. Three-fold and leave-one-out (n-fold) cross-validation are the same idea: with leave-one-out, if you have 100 samples you build the model on 99 and predict the one held out, repeating 100 times, so the result is more averaged. You can also do 10-fold cross-validation; it's all cross-validation, just different flavors.

Now, components and features. Because with PLS-DA you are building a model to predict, you want to see how many components to include. You can use R-squared and Q-squared: R-squared is the proportion of the sum of squares captured by the model, and the cross-validated R-squared is called Q-squared. Q-squared is more objective, because R-squared is based on your training data while Q-squared is based on cross-validation; you can also use the prediction accuracy. You check how many components give the best performance; this is the commonly used approach, and cross-validation helps you determine the best model. The other tool is permutation, which is used to decide whether the signal is significant: we already discussed how permutation is done, and it is exactly the same thing here. With two groups or three groups, you permute the labels and calculate the empirical p-value; a very small empirical p-value, say below 0.01, means essentially none of the permutations was as good as your original labels, so you are safe. If your original labels land in the middle of the permuted distribution, it really means you cannot be sure; it doesn't necessarily mean your model is fake or false, just that your confidence in this PLS-DA model should be very low, because random labels can produce the same patterns.

The other output is the PLS-DA VIP score. The VIP score is a summary of your loadings: a weighted sum of the squared correlations between the PLS-DA components and the original variables, where the weights correspond to the percentage of variance explained by each PLS-DA component. It is commonly used in metabolomics to see which features are most important: you could use the loadings directly, but the VIP score captures more information. People use VIP greater than one as a cutoff for the more significant biomarkers, and less than one for the less important ones. MetaboAnalyst gives you summaries, and you can see plots of how these features change across the different conditions. A final comment: VIP scores help you decide which features are most correlated with your conditions, for biomarkers; they are not for outlier detection, that is not what they measure in this context.

Next, regression. For regression we don't compute a prediction accuracy; we use the RMSE, the root mean squared error. That is for when your outcome is not patient versus control but a quantity, like weight, so you do regression. For classification performance we can use accuracy: out of all predictions, how many are right; 9 out of 13 is about 69% accuracy, and the error rate is one minus the accuracy, 31%. But there is an issue when your data is unbalanced: if one sample is diseased and 99 are healthy, and I always predict healthy, I am 99% accurate,
which is not a useful model for prediction. When the samples are unbalanced, accuracy becomes a problem, and what is most useful for unbalanced data is sensitivity and specificity. Sensitivity is the percentage of true positives you call correctly: of the truly diseased, how many you identify as diseased. Specificity is the percentage of true negatives you call correctly: you say it's healthy and it really is healthy, you say it's disease and it really is disease. If we combine sensitivity and specificity across different cutoffs, at each threshold deciding disease versus normal and calculating the sensitivity and specificity, we can actually draw a curve called the ROC curve, the receiver operating characteristic curve, which is widely used for clinical biomarkers. It combines the true positive rate with the false positive rate: you have the two populations, you try different cutoffs, at each cutoff you get a sensitivity and a specificity, which gives you a point, and connecting the points gives you the ROC curve. The ROC curve may seem slightly unintuitive, but if you really spend some time thinking about it, it does combine both sensitivity and specificity: if you choose a cutoff in a certain region of the curve, you get the optimal trade-off between the two, which is why it is so widely used for clinical biomarkers. In one region you have much higher specificity but low sensitivity, in another high sensitivity but low specificity, and here is the balanced one, so you can see where you want your cutoff to be.

So, the area under the ROC curve: how do we get one single value combining sensitivity and specificity? We use the area under the curve, the AUC, so we can compare different models. If the area under the curve is 0.95, that is all good; random guessing gives 0.5; and in between you will see values like 0.7 and 0.8. For clinical use you usually want to be at 0.8 or above; 0.75 or 0.7 is still not good. I know you can get 0.75 and still get published, but it is just not a good one; you need to be high.
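Here is a hand-rolled sketch in base R that builds an ROC curve and a trapezoidal AUC from simulated biomarker scores; the variable names and the data are made up:

```r
# Sketch: an ROC curve and its AUC computed by hand, assuming 'score' is
# one biomarker's value and 'truth' the true class (1 = disease).
set.seed(5)
truth <- rep(c(0, 1), each = 50)
score <- c(rnorm(50, 0), rnorm(50, 1.5))   # diseased samples score higher

cuts <- sort(unique(score), decreasing = TRUE)  # sweep the cutoff
tpr <- sapply(cuts, function(th) mean(score[truth == 1] >= th))  # sensitivity
fpr <- sapply(cuts, function(th) mean(score[truth == 0] >= th))  # 1 - spec.

plot(fpr, tpr, type = "l",
     xlab = "False positive rate", ylab = "True positive rate")
abline(0, 1, lty = 2)                      # the random-guess diagonal
sum(diff(c(0, fpr)) * (tpr + c(0, head(tpr, -1))) / 2)  # trapezoidal AUC
```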
There are other supervised methods, like SIMCA, OPLS-DA, support vector machines, and random forests; a lot of them are actually in MetaboAnalyst, but they are more advanced and we are not going to cover them here; consider the basics covered. And there are even more powerful ones, the big cannons of machine learning, neural networks, really super powerful stuff, all trying to help you dig the signal out of your data. But generally, if the effect is large enough, an unsupervised method should pick it up; you really only need supervised methods and machine learning to get at the subtle stuff. If your pattern is strong, the biological effect is very strong, an unsupervised method will pick it up easily: you have your PCA, it shows good separation, you have your biological story, everybody is happy, it's done; you don't even need PLS-DA, because the pattern is already there.

So, data analysis goes from unsupervised to supervised, and then comes statistical significance. For supervised methods especially, you need to do cross-validation and permutation to make sure what you fit is a real signal; unsupervised is generally safe, and people don't require that of it. The caution is that supervised methods are powerful: they learn from the labels and do pattern recognition, and too many people skip the PCA and the clustering, jump straight to the supervised methods, find patterns, happily send out the manuscript, and get rejected, because the reviewers actually know. So do your homework, follow the steps, and take the precautions: if you use a supervised method like SIMCA or PLS-DA, do the cross-validation, report the R-squared and Q-squared, and really refer to the literature and the tools you used. And if you still don't see the pattern, the separation, it really means your data doesn't have a strong signal: you need to increase the sample size or have a better design. If that is the reality, then it is a negative result; people are not happy with that, but a lot of the time it is the case with clinical studies like this.