OK, good morning, everyone. The first section this morning is background statistical methods. The purpose is to go through the basic concepts behind the statistical tests and machine learning algorithms, because in the sections that follow we are going to use MetaboAnalyst, which has a lot of these ideas built into the tool. It is user-friendly, but you still need to understand the underlying concepts, so you can understand the parameters, judge whether a result looks reasonable, and choose a different method when needed. There is a lot of button-clicking; it seems easy, but it can lead you into silly results if you do not understand the statistics behind it. As was mentioned, the main idea is: we have spectra, and from the spectra we want to identify and quantify the metabolites, and we end up with a big table. For that they will use tools such as GC-AutoFit and XCMS Online. For those of you who did not attend last night: we actually managed about 20 minutes, the server worked, and I was able to do the demo using my account. Overall, the data upload is a little glitchy; you need to disable a lot of the Java security settings, but it is doable — just be patient and read the alerts, and you should be able to get through all the steps and upload your data. All the main steps and screenshots are in yesterday's slides, so it should be fine, and I hope they fix the server and make it more stable in the future.

Now the task is this: we have a table of compounds and their concentrations. For untargeted metabolomics, a peak ID is usually defined by its mass and retention time or retention index. We want to do some statistical analysis, and for targeted analysis we can also map these compounds onto pathways and networks, so we can directly see whether some biological processes or functions have changed in your experiment. Today we are going to cover this process: statistics, biomarkers, and functional analysis. The first part, probably 20 minutes, is about summary statistics. We usually do not look directly at all the values in the table, because it is simply too much to take in, so we need summary statistics. After that we will do some basic comparisons: univariate statistics and ANOVA. Even for high-dimensional omics data, these basic steps still help a lot as a first overview of how much difference there is. We also need to understand p-values — this is important because we are doing multiple testing in omics — and how they are calculated: for a normal distribution, and, if the data is not normal, what other approaches can give us p-values. We are going to talk about permutations; this is slightly more advanced, but very popular, so I will give you the basic steps behind those p-value calculations. The last section is multivariate statistics: clustering, principal component analysis, and partial least squares discriminant analysis. The last two are chemometrics methods, very popular in metabolomics data analysis. Just make sure you understand the key parameters. Any time you have a question, just raise your hand and ask me.
I will try to cover these basics, but if you have a related question, even if it is not about this particular topic, you are welcome to ask and we can discuss it; I will do my best to answer. So what is statistics? Different people will give you slightly different answers. For our purposes, statistics is simply a way to help us understand the data. The data is so large that we cannot just eyeball it, so we need something to bring out the patterns and the most significant features, so that we can focus on those features and assume the remaining ones are less informative. Statistics basically helps us condense the information and bring out the most important things, which can sometimes become knowledge directly if you can interpret it; it is the whole process from data to information. People also talk about data science, data mining, and machine learning; we are not going to discuss the differences between statistics, data mining, and machine learning — they all refer to this whole process, and for now I will just call it statistics. It is a very general definition, but the subtle differences are not our focus.

For any statistical analysis, you need to think from the program's, or the computer's, point of view: what is the input, and what is the output? Once you have clearly defined "this is my input, this is the output I want", you can figure out the steps and which algorithms lead you there. It is a very logical process, and if you have the input and output well defined, it is much easier to discuss with other people and think about the necessary steps and the possible algorithms. For the output, you need to have some idea of what you want to get; otherwise you just have data and no direction, the analysis is very open-ended, you can spend a lot of time on things that may or may not be interesting, and it is not the most efficient use of your time. Once you have the data, you should have at least a general hypothesis about what you expect to see, and use an appropriate method to check whether it is likely to be true; different statistical methods help you do that. Sometimes you really have no idea, and a very general overview like a PCA or a heatmap can help you build a hypothesis, but the majority of studies are well designed with a clear hypothesis, and in that case it is much easier to do the analysis.

Now about the data. Our data has two main components. One is the quantitative numerical values: the concentration table, the peak intensity table, or the spectral bins. This is your big X; the capital X is a table, not a single column but a long, wide table. If you are familiar with microarrays, this is the same kind of big matrix, containing all numerical values. But we are not just interested in the data itself; we want to relate it to our phenotypes, which is very important. The phenotype is disease versus control, and sometimes a time point. For most tools and analyses, the best-supported design is binary: basically two categories.
If you have more than two categories, a lot of statistical analyses do not actually work that well, and the results are also slightly more difficult to interpret. So if you can frame your design as a two-class comparison, that is the easiest to interpret and has the best tool support, especially for statistical testing and machine learning. That covers the X and the Y; we need both to be present in order to analyze them. As I mentioned, the quantitative data is basically a data matrix you can open in an Excel sheet: like a microarray, it contains gene expression intensities, or here metabolite concentrations. There is a separate category, read counts: if you are doing next-generation sequencing you get read counts, which are discrete rather than continuous values, and in general you need a different family of statistics for them. We are not dealing with next-generation sequencing, so I will not discuss it, but keep in mind that counts call for a different category of statistics, and although there are ways to convert discrete data to continuous, it is a different category.

Now the metadata, the phenotype labels. The first kind is binary: zero/one, yes/no, case/control. This is the most common, and the majority of designs fall into this category. The second kind is nominal data, or simply multi-group: groups one, two, three, four, where the order is not important; you can do ANOVA or clustering on this and it is all fine. The third kind is ordinal data, like low/medium/high; a time series can also be thought of as belonging here. The order actually carries information, and people want to keep it in the analysis, but most statistical tests, at least to my knowledge, do not have a special way of dealing with ordinal labels. The most common approach is to treat them as numerical values — one, two, three — and do something more like a regression; that is how ordinal data is usually handled. The other option is a dedicated time-series analysis.

Now some terms and jargon used in statistics. Observed values: basically our data, what we measure from our instrument. A variable is a characteristic of a population; in our case a variable is a compound, metabolite, or peak, depending on whether you are doing targeted or untargeted metabolomics, and for each dataset we have many variables. Each variable has a range of possible values, and that range depends on your samples: if you collect, say, 100 samples, you can estimate the concentration range from them. For metabolites, the normal concentration ranges of the majority of common human metabolites are actually quite well known. We also talk about high-dimensional data: high-dimensional means we have many variables, so if you measured 200 metabolites, you have 200-dimensional data.
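To make that concrete, here is a minimal R sketch of what the two components typically look like; the sample names, metabolite names, and values are made up purely for illustration.

```r
# Quantitative data: a samples-by-metabolites matrix (the big X)
set.seed(1)
X <- matrix(rlnorm(6 * 4), nrow = 6,
            dimnames = list(paste0("sample", 1:6),
                            c("glucose", "lactate", "alanine", "citrate")))

# Metadata: binary phenotype labels (case/control) for the same samples
y <- factor(c("control", "control", "control", "case", "case", "case"))

# Ordinal labels (low/medium/high) can be kept as an ordered factor
severity <- factor(c("low", "low", "medium", "medium", "high", "high"),
                   levels = c("low", "medium", "high"), ordered = TRUE)

dim(X)     # 6 samples x 4 variables (a real study would have hundreds of variables)
table(y)   # two classes
```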
Think about it: one dimension, two dimensions, three dimensions are fine, but 200 dimensions you do not know how to draw. Still, that is what we call it, and when we analyze it, the computer treats all the dimensions the same way. Any questions so far? Okay. I will use these terms a lot: univariate, bivariate, multivariate. Univariate means we study just one variable at a time; basically we measure one thing — for blood, say, we measure only glucose. If we measure two metabolite concentrations, it is bivariate. In metabolomics we generally measure as many as possible, so we call it multivariate. The name simply reflects how many variables are measured. For omics, we usually expect at least 10 or 20; if there are too few, you would not really call it omics.

Here is an important concept I want to bring to your attention. Any time we do a study — we recruit patients, we grow plants — we are sampling from a large population; we then measure that sample and calculate statistics based on it. But the goal is ultimately to take the result observed in the sample and put it in the context of the big population, assuming our sample is representative of that population. How much confidence we have in going from the sample to the population is very important. There are some issues you can see clearly in this example. Assume the population is normally distributed: if we could take the whole population and measure it, we would know the distribution exactly. In reality we only get a small subset of the population, measure it, and try to derive the distribution of the population from that sample. As the example shows, depending on which part of the population your sample happens to come from, you get different estimates: with the top one you conclude "here is the distribution, here is the variance", and with the bottom one you get a much wider estimate of the variance. This all reflects bias in the sampling. Of course we try to be as representative as possible, but sometimes you simply have no control, and this is the kind of issue it can lead to. If your sample happens to be very narrow, like the top one, you can get very significant p-values; with the bottom one you basically need a larger sample to increase the power. So this is something to think about: the population versus the kind of sample you actually get. It is sometimes hard to control, but it is the reality we deal with when we try to infer from the sample to the population.

This leads to the very basic question: how do we know whether the effect we observed in our sample is true in the population? The answer is that we do not, unless we measure the whole population; otherwise it is our best guess. So how do we quantify our uncertainty about whether the effect is likely to be true in the whole population? One of the most widely used tools is the p-value. The p-value indicates our level of certainty that our result represents a genuine effect in the whole population. That is why people talk about p-values and why they are important: they give us confidence.
A small p-value means the result is more likely to be genuine in the whole population. In statistical terms, the p-value is the probability that the observed result was obtained by chance. The threshold below which you call a p-value significant is the alpha level. Most people just use the default of 0.05; if 0.05 gives you too many significant metabolites, you can use 0.01 or 0.001, and all of these are commonly seen in the literature. If the chance of getting the result at random is lower than that threshold, we reject the null hypothesis and declare the effect, or the result, statistically significant. That is the basic procedure for any statistical test: we state the null hypothesis, then we calculate the p-value, which is the chance of seeing the result if there is no association, no correlation, no information in your data. If that chance is low enough — where "enough" is defined by your predefined alpha, for example 0.05 — you call it statistically significant. A lot of the work is in how to calculate the p-value; that is the key thing. For a univariate test it is more or less standard and straightforward, but it becomes trickier when we have multivariate or mixed data. I will come back to this topic later, but let us start with summary and descriptive statistics. Before I start, any questions?

How do we describe the data for a given variable, say the glucose concentrations from 100 samples? It is 100 numerical values; how do we summarize them? Because 100 values is still too much for our brain to grasp, we want to summarize. The most commonly used summaries are the mean and the variance. The mean is the location, the center of the data; it can be the mean, the median, or the mode. The variability is how spread out the data is. If everything sat exactly on the same dot — every glucose concentration exactly the same — the data would be trivial to absorb, because one single value describes it. In reality the values spread out. The narrower the spread, the better the mean summarizes your values; if the spread is large, the mean carries much less information, because many values are far away from it. So when we summarize, we need to report both the center (the mean or the median) and the spread of the data; these two together give you a feel for the data. This still assumes the data is roughly normally distributed, which is how we usually think about it; sometimes the distribution is very different and these are not good summary statistics, but most of the time they work very well.

Variability is one aspect. The other commonly used concept is relative standing: the distribution of the data within its spread. You can think of this as a more robust version of the variance; it ignores outliers such as the maximum or minimum and focuses on capturing the majority. Examples are the quantiles and the interquartile range (IQR), which is very commonly used. Basically, you rank all the data from smallest to largest and cut it into quarters.
So you get the 0-25%, 25-50%, 50-75%, and 75-100% quartiles, and then you look at the middle: the range from the 25th percentile to the 75th percentile captures the middle majority and ignores the head and the tail. This makes it robust against outliers, and it is very commonly used for many purposes; it is not always the variance — the quantiles and the interquartile range are very common.

Here is a graphical summary of the mean, the median, and the mode. In a perfectly normal distribution the three are the same; here we show a slightly skewed distribution, skewed to the left, and you can see the difference. The mean is the green one; you know how to calculate the arithmetic mean. The median is the center if you rank all the values from top to bottom: half the values are above it and half below. The mode is the value that appears most often; if frequency is on the y-axis, the peak of the frequency corresponds to the mode. The mode is not used as often as the median and the mean. Mean, median, and mode describe the location, or central tendency, of your data.

The second aspect, as I mentioned, is the spread of the data. The commonly used measures are the variance and the standard deviation, and people sometimes confuse these with the standard error of the mean, so let me clarify. The variance is basically the average of the squared distances to the center, calculated with respect to the mean. The standard deviation is the square root of the variance; compared with the variance it has the same units as your measurement, so in terms of interpretation it is easier — one unit above or below the mean rather than a squared quantity. The standard error of the mean is not a measure of the spread of the data; it quantifies the precision of the mean and takes the sample size into account, so you can always improve the standard error of the mean by increasing your sample size. Do not confuse it with the standard deviation and the variance, which are what we use in statistical analysis to quantify the spread of the data.

Now the quantiles. Most people use the box plot, and the box plot is actually drawn from quantiles: the top whisker is the maximum, the bottom whisker the minimum, the box runs from Q1 to Q3 — the 25th to the 75th percentile — and the red horizontal line in the middle is the median. I like box plots very much: if you plot your samples side by side, you get a real feel for the data, and it is a very good quality check. You do not even need the y-axis labels; you can see from the measurements whether there is a strong batch effect, or whether one sample has everything shifted high — all the boxes shifted up. You can get a lot of information just by checking the quantiles.
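As a minimal R sketch of these summaries (the simulated glucose values are just placeholders), the built-in functions cover everything mentioned above:

```r
set.seed(42)
glucose <- rlnorm(100, meanlog = 1.6, sdlog = 0.3)   # 100 simulated concentrations

mean(glucose)                         # location: arithmetic mean
median(glucose)                       # location: median (robust to outliers)
var(glucose)                          # spread: variance (average squared distance to the mean)
sd(glucose)                           # spread: standard deviation, same units as the data
sd(glucose) / sqrt(length(glucose))   # standard error of the mean (precision of the mean)
quantile(glucose, c(0.25, 0.5, 0.75)) # quartiles Q1, median, Q3
IQR(glucose)                          # interquartile range, Q3 - Q1

# Side-by-side box plots are a quick quality check across groups or samples
group <- rep(c("control", "case"), each = 50)
boxplot(glucose ~ group, ylab = "glucose concentration")
```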
If a sample looks only slightly higher or lower, that is probably normal variation. But the box plot is simple and very effective for this kind of quality check: if something obviously jumps out, it definitely needs more attention — is that sample real, or is it a batch effect?

Now the relationship between the mean and the variance. Most of our tests, like the t-test and ANOVA, assume, without telling you, that the variance in your two groups is equal. The groups are different — case and control, normal and disease — but the underlying assumption is usually that the two populations have a similar variance, and on that basis you compare the means. As shown here, if the means are very different, you conclude these are two different populations and you get a very significant p-value. In the bottom panel the distance between the means is the same, but the variance is so large that there is a lot of overlap, and the p-value will not be as significant as when the variance is low. Of course, when we do a t-test we have the option to allow different variances; the p-value in that case is still not going to be as good when the picture is not clear-cut. The point is that we need to consider both the mean and the variance to decide whether something is significant, not the mean alone; the confidence is much higher when the variance of the data is small.

Now back to the details of univariate statistics. We study a single variable: for every object we focus on just one characteristic, like height, weight, or a test score. If we take a large sample from the population and plot that single variable — say height against frequency — most of the time we get a shape like this: the bell curve, the normal or Gaussian distribution. It is probably the best-characterized distribution, and it actually describes the majority of measurements in the life sciences: concentrations, heights, weights. A lot of the time, if the population is large enough, the plot more or less follows this curve. The normal distribution is very robust, and most of the time it is safe to assume your data is normally distributed. The question is how to quantify it: what is the mean, what is the variance? Estimating these parameters is actually a key challenge, especially when we have few samples, as in the early microarray days. How do you estimate the variance with just a few samples? You have probably heard of limma, the linear models for microarray analysis package, and how it improved the power for detecting differentially expressed genes: it uses an empirical Bayes approach that borrows information across genes to improve the variance estimates, which significantly improves the results compared with a standard t-test.
A lot of the focus in statistics is really on getting a good estimate of the mean and the variance so that we can calculate p-values, and most of the time we assume the data is normally distributed. Why do we care? Because the normal distribution has so many nice features: it is symmetric, its mean is at the center, it is described by the mean and the standard deviation, and most people know it very well. If we talk in these terms and values, everybody is comfortable; if you can show your data is normally distributed and report significant p-values, people understand, they follow, they accept. So if we can put our data into this category, show it is normally distributed, and calculate robust parameters, we can do a lot of estimation and inference on our data. That is why we need to understand it well. As I mentioned, most biological and physiological measurements are in fact approximately normally distributed, so the usual approach is to assume normality, and if the data is not normal, apply a normalization to make it so. A lot of effort goes into getting the data into this category, and then we can apply our favorite t-test and the rest.

Calculating the parameters: I already mentioned the definitions. The mean: sum everything and divide by the number of subjects or samples; very easy. The variance: the squared distances to the mean, summed up and divided by the sample size (or by n minus one for the sample variance). The standard deviation: take the square root of the variance. If you are using R this is trivial; the functions are simply mean(), var(), and sd(), so you do not need to write out the formulas. The standard deviation also tells you how much of the area is covered, and many of you are familiar with this: one standard deviation covers about 68%, two standard deviations about 95%, and three standard deviations about 99.7% of the area. So if we know the mean and the standard deviation, the likelihood that a value falls within a given range is almost intuitive to anyone who knows the standard normal distribution well.
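As a quick sanity check of those coverage numbers, here is a tiny R sketch (not from the slides, just the standard normal calculation):

```r
# Probability mass within k standard deviations of the mean of a normal distribution
k <- 1:3
round(pnorm(k) - pnorm(-k), 3)   # 0.683 0.954 0.997 -> the 68 / 95 / 99.7 rule
```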
In reality, our data, including metabolomics data, tends to take different shapes. If we plot the values on the x-axis against their frequency on the y-axis, we can see the distribution. The left-most one is unimodal: it has one peak. A lot of the time we also see a bimodal shape with two peaks; you can sometimes see more, but mostly it is these two. The skewed shape is also quite common, especially if we plot p-values: you see a pile of p-values close to zero while the rest are fairly evenly distributed, which means your data actually contains signal — a lot of significant features. The one shown here is not a p-value distribution; it is a smooth skewed distribution, but you can also see heavy-tailed shapes like this, which can be an exponential or extreme-value distribution. There are statistical models for these types of distributions, but a lot of the time we do not use them; we apply a normalization instead, and we can usually do a pretty good job of bringing the distribution close to normal and then apply our favorite statistics to understand the data.

So now, common approaches to fix a skewed distribution. Because the normal distribution is the best-known and best-understood one, we want to take other distributions, which are not, and make them more normally distributed, so that we can use the familiar statistics. For both gene expression (microarray) and metabolite concentrations, most cases turn out to be roughly log-normal: if you take the log of the concentrations or intensities, the data looks much more normally distributed. This was found and validated from microarray to metabolomics, and most of the time it works well; you can just check it visually. So the log transformation is very simple and widely used: most of the time we do not do statistical analysis on the raw scale but on the log scale. Always check visually before and after, but you should consider this the default, the standard thing to try.

Here is a log transformation on real data. The top left is a measurement — I think an untargeted peak area, I am not quite sure. On the right, after taking the log, it is much less skewed and more normally distributed. I am pretty sure that with more data points, a larger sample, it would look even more normal; I do not know the sample size here, but you can already see it is much better after the log transformation. The log transformation is simple and easy to understand, so try it first. Sometimes it is not enough, and in that case there are several other methods to apply on top; sometimes people just apply them directly to see the effect.

Beyond the log, there are several other quite popular approaches. One is centering: for each variable you subtract its mean, so the samples end up above and below zero. Most of the time centering alone is not enough, so we also divide by some measure of spread. The most common version is auto-scaling: subtract the mean and divide by the standard deviation. You can think of this as the standard score, the z-score. After this procedure, every variable has mean zero and standard deviation one. This also answers a question a lot of people ask me about heatmaps: why do my numbers run from minus four to four, or minus two to two, and how do I interpret them? If you see scores symmetrically distributed around zero like that, the data has most likely been auto-scaled; it is better for a heatmap and is almost the standard there. There is also range scaling, which is less popular but useful in certain cases: the main difference is that you normalize by the range, from the maximum to the minimum, or by percentiles, rather than by the variance, because dividing by the variance can sometimes introduce too strong a correction factor.
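Here is a minimal R sketch of these transformations, assuming X is a samples-by-metabolites matrix of positive intensities (for example the toy matrix defined earlier); the +1 offset before the log is just one common way to avoid taking the log of zero:

```r
# Log transformation (add a small offset if zeros are possible)
X_log <- log2(X + 1)

# Centering: subtract each variable's (column's) mean
X_centered <- scale(X_log, center = TRUE, scale = FALSE)

# Auto-scaling (z-score): subtract the mean, divide by the standard deviation
X_auto <- scale(X_log, center = TRUE, scale = TRUE)
round(colMeans(X_auto), 10)   # ~0 for every variable
apply(X_auto, 2, sd)          # 1 for every variable

# Range scaling: divide the centered values by each variable's range (max - min)
rng     <- apply(X_log, 2, function(v) diff(range(v)))
X_range <- sweep(X_centered, 2, rng, "/")
```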
Dividing by the variance puts a big penalty on very large values and boosts the apparent significance of very small values. Sometimes that is good, sometimes it is not: for example, very small values close to zero are more likely to be noise, and you do not want to give them too much weight. So people sometimes divide by the square root of the standard deviation, or by some other quantity, instead; there is always some debate. If you are interested, I put the reference here, a BMC paper that really goes into centering, scaling, and the different transformations: their strengths, their weaknesses, and in which situations you can apply them. It is a very informative and comprehensive review and discussion on real data. But in the majority of cases I suggest trying the simplest option first, because the more complicated the transformation, the harder it is to interpret. If the simple one does the job, leave it at that; you can spend a lot of time trying options, and it is actually hard to decide which is best. Only in very rare cases is there a clear winner; in my experience it is usually not that clear-cut.

[A question from the audience, partly inaudible, about whether to dichotomize the data — converting values to zero/one — for example when a result stays significant across different transformations.] The point is that here we are talking about concentrations, which are real continuous values. Converting them to zero/one is not something I see commonly done; if you do that, you lose a lot of information, and at least in this case the data covers a genuinely continuous range. If something is clearly wrong, I would rather remove that sample and redo the analysis. In some cases dichotomizing is probably valid, especially in clinical situations, but what you describe sounds to me more like the Y variables, the patient phenotype descriptors, rather than the metabolomics measurements, because the measurement itself is something you actually measured; it is not subjective. If you find outliers — we can discuss this later, but based on metabolite ranges this is usually not something we do. Even with outliers, if something is totally off you just exclude it, if it really is an outlier; you can see it. And after normalization, double-check with a box plot, a heatmap, and a PCA plot; if it is not that extreme, leave it. For an outlier, unless it is clearly identifiable — visually distinctive, and ideally with some biological or sample-collection reason behind it — I am not comfortable removing it.

Now, the basic idea of comparing populations based on their distributions. Say we measure the height of two groups of people and plot them: are they different? Here is another pair of groups, plotted the same way: are they different? These are the questions we want to answer, with some level of confidence about whether the groups are similar or significantly different.
Of course, based on the distance between the means, the first pair looks more different than the second, but we always want a quantifiable measure: what is the uncertainty, how much confidence do we have? We need p-values, and with two groups we need a t-test. The t-test compares the means of two samples or two conditions. The assumption is that if both samples come from the same population, just drawn from slightly different regions of it, they should logically have the same mean; if the means are statistically different, then we accept that they come from two different populations — they are different. That is what the t-test tells you. Most of the time, if we really believe both groups come from the same population, they should more or less have the same variance too; that is why the default is to compare the means assuming equal variance. In some cases you want to assume different variances, which is also fine; then you are thinking of the groups as representatives of different populations. So there are different types of t-tests. Most commonly we use the independent-samples t-test — for example Welch's t-test when the variances differ — and we also do paired t-tests. There are also non-parametric counterparts: the Mann-Whitney U test for independent samples and, for related (paired) samples, the Wilcoxon signed-rank test. You will see all these names; just remember there are paired and unpaired versions of both the parametric and the non-parametric tests, and when you use R or MetaboAnalyst, these names correspond to particular tests.

The other point I want to bring to your attention: given the same mean difference, if you have a way to reduce the variance, you will significantly improve your power to detect the difference. The figure shows the same means throughout, with the measured value on the axis: with a very narrow spread, a small variance, the difference is very significant; with a large spread it is not. Same means, but reducing the variance improves the result a lot. If you are a clinician and your patients can be paired, pairing often removes a lot of variance and therefore significantly improves the results. So if you can do a paired analysis — matched patients, or the same patient before and after treatment, compared against themselves — a paired t-test gives a much better comparison than treating the samples as independent.
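As a minimal sketch in R (the two groups here are simulated), the different flavors map directly onto t.test() and wilcox.test():

```r
set.seed(7)
control <- rnorm(12, mean = 5.0, sd = 1)
case    <- rnorm(12, mean = 6.2, sd = 1)

t.test(case, control, var.equal = TRUE)    # pooled-variance (Student) t-test, equal variances
t.test(case, control)                      # Welch t-test, unequal variances (R's default)
t.test(case, control, paired = TRUE)       # paired t-test, e.g. before/after on the same subject

wilcox.test(case, control)                 # Mann-Whitney U test (unpaired, non-parametric)
wilcox.test(case, control, paired = TRUE)  # Wilcoxon signed-rank test (paired, non-parametric)
```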
So far I have talked about the two-sample t-test; in a lot of cases we have multiple groups, and with multiple groups what we usually do is ANOVA, analysis of variance. Here the focus is not the mean anymore but the variance: we assume there is no difference in variance between the groups, and the test is based not on the t-distribution but on the F-distribution, so it is an F-test. ANOVA is commonly used, but it has certain disadvantages: after running ANOVA and getting a significant result, people are left wondering which groups actually differ. ANOVA does not tell you that, so you really need an additional test to find out where the difference lies. ANOVA is based on the F-test: the F statistic is the between-group variance divided by the within-group variance. The idea is that if the between-group variance is large compared with the within-group variance, you get a larger F value, and since we know the F distribution, we get a significant p-value. So we can see there is a significant difference among the groups, but we do not know where the difference lies, and we usually follow up with a post hoc test; there are several ways to design one.

The tricky part is this: in omics studies we are doing multiple testing. Multiple testing with t-tests is fine, multiple testing with ANOVA alone is fine — we can correct for it — but for ANOVA followed by post hoc tests there is no well-defined procedure for the multiple-testing correction, at least none that I am aware of. A lot of the time it still helps you see where the difference probably lies, and as exploratory data analysis that is fine — "this feature is most likely where the difference lies" — but the beautifully clean single-variable procedure is not directly applicable at the multivariate, omics level; there is simply no well-defined procedure for that correction.

Like the t-test, ANOVA comes in different flavors. Most common is one-way ANOVA: with three or four groups you can apply it directly. There is also multi-factor ANOVA — for example different time points plus a treatment and a non-treatment group — and there is mixed-design ANOVA. For a single variable you can try all the different ANOVA designs, but at the omics scale you are very limited: one-way ANOVA is fine, two-way ANOVA you can do if you have more samples, but three-way and multi-way ANOVA I rarely see done.
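Here is a minimal one-way ANOVA sketch in R (three simulated groups), with Tukey's HSD as one common choice of post hoc test:

```r
set.seed(3)
group <- factor(rep(c("A", "B", "C"), each = 10))
value <- c(rnorm(10, mean = 5), rnorm(10, mean = 5.2), rnorm(10, mean = 6.5))

fit <- aov(value ~ group)
summary(fit)    # F statistic = between-group / within-group variance, overall p-value

TukeyHSD(fit)   # post hoc test: which pairs of groups actually differ
```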
That is because it is very difficult to interpret, there is no well-defined procedure to adjust for the multiple testing, and if you really want to pull out the significant features it becomes very complicated once you have a multi-way design at the omics scale. So what most people actually do with multiple groups is pairwise comparisons: with three groups, compare group one against group two, group one against group three, and so on, and combine the results with something like a Venn diagram. It is easy to interpret, and every step is statistically well defined. People like to talk about fancy designs and want to apply complex ANOVA at the omics scale; I am basically strongly against that, because there is no well-defined interpretation of all those terms, it is very hard to define a valid procedure and interpret it properly, and to get a good result the required number of samples is usually much larger.

To conclude: the t-test tests whether two group means are different; we can compare two samples, or compare one sample against a given value — for example, whether it is significantly different from zero. ANOVA compares more than two groups and more complicated scenarios, and it uses variance: the between-group variance against the within-group variance. I mention this because the idea behind it is simple, and we can use the same concept to help us do comparisons even when we are not literally running ANOVA.

[A question from the audience about excluding samples.] Always interesting. If you collected the samples, go and find out why; if you are only in charge of analyzing the data, you still need to contact the person who collected them. If there is no clear reason, you see a strong outlier, and you have a relatively good number of samples, just exclude it; I do think that helps — sometimes one bad sample really messes up your whole analysis. If it is totally out of range, the case is clear; sometimes it is not that clear, and people cherry-pick: they visualize the PCA, say "this one seems to overlap, this one looks like it is causing trouble", remove it, and remove some of the overlap to make the separation look cleaner. I do not see that as an ethical way to do it, because in the population there is real variation: some subjects are simply not that clear-cut, and we need that uncertainty. Otherwise you end up with an apparently clear biomarker that fails when you really test it in the population. So it is really case by case, and I prefer exclusions that are biologically or analytically justified rather than statistically justified. Outliers: different people have different definitions, so always discuss with the people who collected the samples and understand the situation before you decide to exclude anything. Once you have several hundred samples it is easier: you can stratify. For example, you can compare the more severe cases against the less severe ones across the whole range of symptoms and look for the most important patterns. Stratifying is fine, but that is not outlier removal; you are still stratifying the population.
Now I am moving on to p-values, which I mentioned earlier. The p-value is the probability of seeing a result as extreme as, or more extreme than, the result from a given sample, if the null hypothesis is true. What is the null hypothesis? That there is no difference: the groups come from the same population, and the pattern you see is random. So the question is: if the null hypothesis is true, and this is the population distribution, and our value falls somewhere out here, how do we compute the p-value? This is the issue I mentioned before. We know how to calculate a p-value for a normal distribution; it is easy because we know how to do a t-test and how to do ANOVA, and we know their distributions. If a value lies two or three standard deviations out, we know how small that chance is; for ANOVA we likewise know the chance from the F distribution. For non-normal distributions, our approach is usually not to model the non-normal distribution with other tools; we normalize the data toward a more normal distribution, and then we can still use the t-test. That is one option. Still, in some cases we cannot do it — the distribution is just strange — and then we can use a non-parametric test. Non-parametric tests are in general less powerful: you essentially use only the ranks, not the continuous values, so they are appropriate when the values differ wildly but you only care about who is higher and who is lower. They are usually less powerful, but if you cannot normalize, they can be more powerful than a forced normalization. The last option is the permutation test: when you are not even sure what the distribution is, you can simulate it, and regardless of what kind of distribution it is, you can still calculate p-values. I will discuss that in a moment.

Here are the methods I mentioned before: the log transformation and the auto-scaling, the two things to try first. And here is the result on gene expression or metabolomics data: you look before and after, and afterwards it is pretty normally distributed; compared with before, this is a distribution you can feel comfortable with — you can apply your regular t-test to it. Here is the box plot of the scaled data: you can see that each sample looks symmetric.

For the non-parametric test, we only care about the order, the ranks; we do not care about the absolute differences. For example, one pair of values is 1,000 versus 1, and another pair is 1.1 versus 1.09: in the non-parametric view the ranks are the same — the first value is higher than the second — even though in one case the difference is huge and in the other it is tiny. That means you lose information: the quantitative difference is lost. So keep in mind that if you choose a non-parametric test, most of the time it is not as powerful; but if you cannot normalize and you have a relatively large sample, you can still try it, and if you do get statistically significant values from a non-parametric test, there really is a substantial difference there.
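A tiny R illustration of that point about ranks (the numbers are just the ones from the example):

```r
x <- c(1000, 1.00)   # huge quantitative difference
y <- c(1.10, 1.09)   # tiny quantitative difference
rank(x)              # 2 1
rank(y)              # 2 1  -> identical ranks: the size of the difference is lost
```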
Now I am going to introduce empirical p-values. This is a slightly more advanced concept for calculating p-values: how much confidence do we have, what is the chance of getting these values compared with random? It is useful because in a lot of cases with metabolomics or other omics data we are just not sure what the distribution is, and in that case we need to simulate from random data and calculate the p-values from that. If you read the methods of a lot of publications — large-scale data, complex designs — and you wonder how they calculated their p-values when there is no well-defined statistical approach, you will find that a lot of the time they did some kind of permutation.

The basic idea behind permutation is this: if we assume there is no relationship between your data and your phenotypes, then you should be able to shuffle all the class labels and still get a similar result. If the original labels — disease or control — actually carry information, the original result should stand out; if there is no relationship, everything should look similar. So you shuffle the labels, recalculate the t-statistic, the F-statistic, or whatever value you have chosen, on the randomly shuffled data, and compare. If the value based on the original labels is similar to the values from the randomly permuted labels, then you cannot tell whether your data contains information related to your phenotype — it is not statistically significant or meaningful compared with random — so you cannot reject the null hypothesis; basically there is no statistical significance in your data. That is the basic idea.

The steps are actually simple, because we usually have a very powerful computer. First, state the null hypothesis — everything comes from the same distribution — and decide which value you are going to calculate: the mean difference, as in a t-statistic, or, if you have multiple groups, the between-group variance against the within-group variance. Then randomly shuffle the class labels, recalculate the statistic, repeat this thousands of times, and compare the values from the permuted data with the value from your original data. This used to be a barrier when we did not have access to computers; now it is very easy, and we can calculate p-values for almost any kind of complex situation. The one thing we do need is slightly more samples: we want to permute, say, a thousand times, but if you have a very small sample there are not many possible label arrangements — with just six samples there may be only a hundred or two hundred combinations in total — so you will not be able to permute enough times to compute a meaningful p-value. The only requirement for permutation is a relatively large number of samples.

Here is a very simple example that I hope everyone can follow: a univariate t-test done by permutation. Here are the values for case and control with the original labels, and we calculate the mean difference: 0.0541. These are just the values from the two groups with their original labels. Now we reshuffle the case and control labels: originally the cases are subjects 1 to 9 and the controls are subjects 10 to 18; after shuffling, the "case" group is a different set of subjects, and so is the "control" group, and we recalculate the value. We repeat this process a thousand times and calculate the mean difference each time. Of course you could do it manually, but with a computer program it is very easy.
We do this a thousand times, calculating the mean difference for each permuted case/control labeling, and it looks like this; and here is what we actually observed with the original labels. We can be pretty sure the difference between case and control in our original sample is meaningful, because when we permute the labels, the permuted differences never come close. So, based on the permutation, how do we get the p-value? Suppose we do one thousand permutations and in three of them we get a result at least as extreme as our original one — our original result sits somewhere here, and three permuted samples give a slightly better result. Then the empirical, or permutation, p-value is three divided by one thousand: p = 0.003. A lot of the time, though, we never observe a permuted value as extreme as the original, and then what we can say is that the p-value is less than 0.001: if we kept permuting many more times, perhaps one permutation would eventually exceed our original value, but we do not know, because we stopped at one thousand. So we cannot report a p-value of zero; we report p < 0.001. That is usually how the empirical p-value is calculated. Any questions?

This is very commonly used: even if we do not know the distribution, we can still calculate p-values, and it is getting more and more popular, because with complex experimental designs the assumption of normality can be hard to justify, whereas if you demonstrate the difference with a permutation procedure, people are more comfortable that there is a real difference. You can apply this approach in almost any situation, and the permutation also takes care of hidden selection or hidden correlations. The only disadvantage is that it is computationally intensive, especially for large data: if you want to permute a million times it may not finish overnight. But if it runs and gives you a good p-value, that is fine.

[A question from the audience: do you run the simulation in R — is that how you program the permutation?] Programming a permutation in R is very easy. The only caveat is that for very large data you need to think about memory at each round: if you do a million permutations and save every intermediate result, you will run into memory problems, so you need to write it a bit more efficiently. But writing a permutation is easy; the steps are just as described. Actually, I already gave you the answer a moment ago: the example compared the mean difference between case and control. If we have three groups, a time course, or other designs, how do we calculate empirical p-values when the data is not normally distributed and ANOVA is not appropriate? We can still calculate p-values by permutation: in this case we use the F statistic — instead of the mean difference we calculate the between-group variance against the within-group variance — and compare the F values from the permuted samples against the one from the original sample to get the empirical p-value. These are two typical situations, but more complex situations are still doable; we just need to think about what statistic, what "distance", to compute. Questions?
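Here is a minimal permutation t-test sketch in R (simulated case/control values, 1000 permutations as in the example above; the +1 in the p-value is a common convention that avoids reporting exactly zero):

```r
set.seed(10)
case    <- rnorm(9, mean = 5.5)
control <- rnorm(9, mean = 5.0)
values  <- c(case, control)
labels  <- rep(c("case", "control"), each = 9)

obs_diff <- mean(values[labels == "case"]) - mean(values[labels == "control"])

n_perm <- 1000
perm_diff <- replicate(n_perm, {
  shuffled <- sample(labels)   # randomly reshuffle the class labels
  mean(values[shuffled == "case"]) - mean(values[shuffled == "control"])
})

# Empirical p-value: fraction of permutations at least as extreme as the observed difference
(sum(abs(perm_diff) >= abs(obs_diff)) + 1) / (n_perm + 1)
```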
Now, some more detail on hypothesis testing and the multiple testing issue I mentioned before. Whenever we do a statistical analysis we have a null hypothesis versus our alternative: basically we want to know how likely it is that the result we got from our sample arose by random chance, and the p-value tells you that chance. I am going to go slightly faster here — first because I do not have enough time, and second because I have already covered much of this; I put a lot of details on the slides, and if you have questions we can definitely discuss them. In hypothesis testing, the p-value reflects the chance of obtaining our observed value if the null hypothesis is true; we reject the null if the p-value is below the cutoff — for example 0.05 or 0.01 — and in that case we declare the result statistically significant.

The issue is that we are not testing once; we are testing tens of thousands of times. For each test there is a 5% chance of getting the result just by random chance, so if we run ten thousand tests, by random chance alone we expect about 500 false positives. This is what we usually call the multiple testing issue: a single test is fine, but at the omics scale it becomes a problem, so we need some correction to control this high level of false positives.

The most commonly used correction is Bonferroni. The goal is still to control the overall false positive rate at the same alpha level, say 0.05, but since we are doing so many tests, each individual test has to be much more stringent so that, combined, we stay within the 0.05. So for each statistical test, on each gene or metabolite, we use a more stringent cutoff: for 1,000 genes or metabolites the cutoff becomes 0.05 / 1000 = 0.00005, which is very stringent. With this we do control the false positives, but it is very conservative, so we increase the chance of false negatives. If you do not have large samples, you will end up with very few significant metabolites or genes, which makes Bonferroni not that desirable for multiple testing.

The alternative is the false discovery rate (FDR). The idea is that we accept that a certain proportion of the selected features can be false positives; we just control how large a proportion we are comfortable with. For example, FDR = 0.05 says that 5% of the selected genes are expected to be false positives, which is fine; if you have far fewer samples, you can use 0.2, meaning 20% can be false positives. This is more lenient than Bonferroni, and the downstream pathway analysis is actually fairly tolerant of false positives. If you use a Bonferroni correction and get only a handful of genes or metabolites, you cannot even do a meaningful enrichment analysis — you no longer benefit from the omics scale. So FDR is very well suited here; you can treat the adjusted p-values as the FDR for your selected features.
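A minimal sketch of both corrections in R, assuming pvals is the vector of raw p-values from testing each metabolite (simulated here: mostly null, with a few real effects):

```r
set.seed(5)
pvals <- c(runif(950), rbeta(50, 1, 50))   # 950 null features + 50 with small p-values

p_bonf <- p.adjust(pvals, method = "bonferroni")  # family-wise control: very conservative
p_fdr  <- p.adjust(pvals, method = "BH")          # Benjamini-Hochberg false discovery rate

sum(p_bonf < 0.05)   # features surviving Bonferroni
sum(p_fdr  < 0.05)   # features at FDR 5%: usually considerably more
```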
So in multivariate statistics we are not considering just one metabolite; we consider multiple ones together. A univariate normal distribution looks like what I just showed, and a bivariate normal looks like this; three- or four-dimensional versions are hard to draw, but the idea is the same. If the data look like this, we really cannot analyze them one feature at a time like a t-test; it's better to consider everything as a whole, the correlations and the structure all taken together. In principle this should be superior to t-tests and ANOVA, and that is the advantage people always mention for multivariate analysis. The issue is that many multivariate procedures need more samples to estimate all the parameters, and in omics data we just don't have that many samples. Metabolomics is in a much better situation than RNA-seq or microarray because it's cheaper, so we have more samples, and the feature space is not 10,000 but a few hundred, which makes things easier, although we still suffer from small sample sizes.

Here are the main categories of multivariate statistical methods, and I put visualization here as well, because visualization is always helpful. The main ones are chemometrics, basically PCA and PLS-DA; machine learning, like random forests and support vector machines; and clustering approaches, which all consider multiple variables at the same time. Never forget about visualization: all these approaches give you numbers, and a lot of the time we need visual inspection of the features, their distributions and values, to help us make decisions. Even a simple heatmap tells us a lot about data quality and patterns.

For machine learning there are two kinds of approach: one is called unsupervised and the other supervised. The main difference is that unsupervised methods are based only on your data, the capital X, and do not consider your metadata; supervised learning tries to learn the correlation between X and Y, so it builds a model that connects your data to your phenotype. It is much more powerful, but supervised learning usually requires more samples to train, so we need to be more cautious when using it: powerful, but to be used carefully. Unsupervised learning, as mentioned, is based on the data itself. One example is clustering: we want samples or variables that are similar to each other to cluster closely together, so we can see patterns, subgroups of populations or patients that are more coherent; that is the kind of information we want to discover with a clustering method. The other is dimension reduction, like PCA, which you can also think of as clustering-like, because when we project the samples into a low dimension, samples that are similar to each other naturally end up close to each other; this is also an unsupervised approach.
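On the visualization point, here is a minimal sketch of a clustered heatmap for a quick data overview. This is only an illustration with a simulated table; the DataFrame, its size, and the scaling and distance choices are assumptions, so substitute your own concentration table and preferences.

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
# Toy table: 20 samples x 30 metabolites (replace with your own concentration table)
df = pd.DataFrame(rng.normal(0, 1, size=(20, 30)),
                  index=[f"sample_{i}" for i in range(20)],
                  columns=[f"met_{j}" for j in range(30)])
df.iloc[10:, :10] += 2.0          # make half the samples a distinct subgroup

# z_score=1 standardizes each metabolite so colours show relative change;
# rows and columns are reordered by hierarchical clustering.
g = sns.clustermap(df, z_score=1, cmap="vlag",
                   method="average", metric="euclidean", figsize=(8, 6))
plt.show()
```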
We also need to decide how close samples have to be, the threshold at which we say they belong to one cluster or group, and then how to calculate distances between clusters, not just between samples; sometimes we also need to start from a random seed, because we have to start somewhere to gradually build the clusters. There are two common clustering algorithms. One is k-means, a partition method: you first decide how many groups you want, then the computer picks some starting samples and builds the clusters, trying multiple rounds until the clustering stabilizes. The other popular approach is hierarchical clustering: you calculate the distances between all the samples (or all the features) and gradually build clusters by merging, finding the two closest samples and merging them, then the next closest, and so on, so the clusters build up from the bottom. With hierarchical clustering you don't need to specify how many clusters you want; you start with every sample as its own cluster, build up to one single cluster, and then decide where you want to make the cut.

Here is the k-means illustration: the colours basically show the relative distance between samples. You start with, say, this cyan point, find the closest one to it, and if the distance is within the threshold they merge and become a new cluster; then you calculate its centroid and the new distances to everything else, and the process repeats until you reach the number of clusters you defined. That is k-means; nearest-neighbour methods are more or less the same, and self-organizing maps have a slightly different flavour, but more or less you specify the number, the algorithm finds the distances, groups the samples, and stops once it has stabilized at the number of clusters you defined, so there is some randomness in there.

The most popular approach is hierarchical clustering, which basically goes from the bottom to the top, and as I mentioned there are two similarities you have to define: the similarity between samples and the similarity between clusters. For similarity between samples, a popular choice is the Pearson correlation coefficient, which is easy to calculate; how to calculate the distance between two clusters I'll show in a moment. You can also use the Euclidean distance between samples, which is simple: compare each metabolite concentration in sample A and sample B, take the differences, square them, and sum them up. With the Pearson correlation, a value of 1 means the samples are most similar and -1 means they are opposite, so that maps to closest and farthest apart. For the distance between clusters in hierarchical clustering, the common choices are single linkage, where you take the closest pair between the two clusters; complete linkage, where you take the pair that is furthest apart; and average linkage, where you use the centres of the clusters. These are the common ways to calculate the clustering distance.
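Here is a minimal sketch of both algorithms on simulated data, showing the distance and linkage choices just described. It is only an illustration; the data, the number of clusters, and the specific metric/linkage choices are assumptions you would adapt to your own table.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
# Toy data: 20 samples x 50 metabolites, with two underlying groups
X = rng.normal(0, 1, size=(20, 50))
X[10:] += 2.0                                  # shift half the samples

# --- k-means: you must choose the number of clusters up front ---
km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# --- hierarchical clustering: choose a sample distance and a linkage ---
d_eucl = pdist(X, metric="euclidean")          # sqrt of summed squared differences
d_corr = pdist(X, metric="correlation")        # 1 - Pearson correlation
Z = linkage(d_corr, method="average")          # "single" or "complete" also possible

# Cut the tree into 2 clusters (equivalent to choosing a height on the dendrogram)
hc_labels = fcluster(Z, t=2, criterion="maxclust")

print("k-means labels:     ", km_labels)
print("hierarchical labels:", hc_labels)
```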
If we do hierarchical clustering, as I mentioned, we can go from the top down, starting with everything in one big cluster and gradually refining down to individual samples, or from the bottom up, where every sample starts as its own cluster and they gradually merge back together. You don't have to specify how many clusters, because the tree contains all the possibilities and you choose where to cut. For example, if you cut at this level you can see two clusters, one red and one more grey; if you cut down here, you get, I don't know, probably five different clusters. So it really gives you more control over how you interpret your data based on the patterns you see.

Now let's talk about principal component analysis. The main idea of PCA is to capture the most variance in the data; the basic assumption is that the major directions of variance in your data are its main characteristics. PCA does not consider your phenotype labels, so it really describes the data itself. For example, if you have a strong batch effect, the first principal component will capture it and tell you that the largest variance in your data actually reflects the batch, because PCA just reports the main directions of variance. If your data are highly correlated, as metabolite peaks usually are, a lot of the time the data can be summarized in just a few components, say the top three, which explain the majority of the variance, and that is very helpful for understanding the data. But if your data are uncorrelated, the first principal component explains only a small fraction of the variance and PCA is not useful; in that case you can just do t-tests, because the projection is not designed for, and not helpful on, uncorrelated data. So PCA tries to capture the most meaningful directions of your data: even if you have 100 variables and therefore 100 dimensions, only the top few dimensions capture most of the information. It's like shining a flashlight on a bagel: the oval shadow you see captures most of the information about that object, so that direction is the most meaningful and we can basically ignore the rest; the assumption is just to focus on the most meaningful ones.

If we look at how PCA is actually done, it is a projection from your original space into a low-dimensional space based on an eigen-decomposition, the same kind of machinery originally used for things like image processing. It is designed so that the first component captures the most variance, and each following component captures as much of the remaining variance as possible under the condition that it is uncorrelated with the previous ones, so each component is independent of the others while capturing as much variance as it can. We don't need the algorithmic details, but the key decomposition is X = T P^T + E, where X is your metabolite concentration table, T holds the scores (the positions of the samples in the new space), P holds the loadings, and E is the residual; variables with strong loadings contribute more to the scores. Each component is orthogonal to the others; orthogonal means the second component sits at 90 degrees to the first.
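Here is a minimal sketch of that decomposition with scikit-learn, showing scores, loadings, and the variance explained per component. The simulated correlated data and the choice of three components are assumptions for illustration only.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
# Toy data: 20 samples x 100 correlated "metabolites"
base = rng.normal(0, 1, size=(20, 5))
X = base @ rng.normal(0, 1, size=(5, 100)) + rng.normal(0, 0.3, size=(20, 100))

# Autoscaling each metabolite first makes PCA on the covariance matrix
# equivalent to PCA on the correlation matrix.
Xs = StandardScaler().fit_transform(X)

pca = PCA(n_components=3)
scores = pca.fit_transform(Xs)        # T: sample coordinates for the score plot
loadings = pca.components_.T          # P: one column of weights per component

print("variance explained:", np.round(pca.explained_variance_ratio_, 3))

# Approximate reconstruction X ~ T P^T (plus the discarded residual E)
X_hat = scores @ loadings.T
```

Because the toy metabolites are built from only five underlying factors, the first few components should capture most of the variance, which is exactly the correlated situation described above.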
In the picture, the direction that captures the most variance is the red one, and the second one is the green one. The principal components can be computed on the covariance matrix or on the correlation matrix; that is your choice, but if you apply autoscaling first, the covariance and correlation versions are actually the same, because every variable then has unit variance, so dividing by the variance doesn't change anything. As I mentioned, this approach was originally developed for things like face recognition: the eigenfaces basically capture the main, generic features of a face. Metabolomics is very well suited to PCA and other chemometric methods because, at least in my experience, the spectral features or metabolites are more correlated than microarray data; the correlations between metabolites or peaks are much stronger, so when we do PCA the variance captured in the first few components is larger than for microarrays. Different omics have different characteristics, and that's why PCA and PLS are well suited to metabolomics.

Here, applying PCA to all these peaks, we can clearly see in the score plot that the samples cluster by patient group, and as I showed before, we can look at the loadings, the coefficients, to see which features actually contribute to that separation; for that you really need to read the loading plot. Here are the scores and loadings side by side, with each colour representing a group. Looking at PC1 and PC2 in the score plot and the corresponding loading plot, we can see which compounds, in this case some minerals and other features, combine to drive the separation. Features in this area of the loading plot will be positively correlated with the green group, because they drive the separation in that direction, and features on the opposite side will be negatively correlated with it. That is the intuitive way to read scores and loadings together, and it's quite easy.

If PCA gives you good patterns, you are basically safe, because PCA is unsupervised: the patterns come from the data itself, so if they overlap with your phenotype labels, it tells you that the biggest variance in your data really is your experimental factor. But sometimes that's not the case, and then you may need a more powerful, supervised approach, which is called PLS-DA. PCA is very good for a data overview, for detecting outliers, and for looking at relationships between variables; PCA, heatmaps and box plots should be your friends for tracking data quality and spotting outliers. If PCA gives you a good result, that's great, because PCA is relatively safe to use; if not, you may need the other approach, PLS-DA. The biggest difference between PLS-DA and PCA is that PLS-DA is supervised: it considers both your data X and your data labels Y, basically case versus control, your phenotype, and it tries to build a model correlating them. That's nice, but a lot of the time we find that PLS-DA is too eager to please you: it will almost always produce a separation pattern with respect to your conditions, even if you give it random labels, and visually you say, wow, there's separation. Don't be too excited; you need to double-check
and make sure it's justified. First, PLS-DA will always produce a better separation than PCA; I hear this reported a lot, but it says nothing new about your data, it's just what the method does. Every time you apply PLS-DA you get a better-looking separation because it's supervised; that is expected and means nothing by itself, so we need to double-check. PLS-DA is susceptible to a phenomenon called overfitting: even when there is no real pattern, it will try to find one for you, so we need to guard against this and check whether the pattern is true or not.

In the machine learning field there is a well-developed approach for this called cross-validation. The idea is to divide the samples into groups, build the model on one part, and test it on the held-out part. What I show here is three-fold cross-validation: you use two-thirds of the samples for training and test on the remaining third, then repeat with a different two-thirds, and so on for three rounds, and then calculate the performance. That's fine if you have a large number of samples; but if you have very few samples and you hold out a third each time, you may not have enough left to train the model. In that case you use an approach called leave-one-out cross-validation: if you have 20 samples, you do 20 rounds of building the model on 19 samples and predicting the one left out. That is cross-validation, and the point is that if the pattern is fake, the prediction on held-out samples won't be good. Cross-validation for PLS-DA also has a second purpose: deciding how many components to use to build the model. Like PCA, PLS-DA has components, although they are called latent components rather than principal components; the first one is the most predictive, but the second and third can also be predictive, less so, yet still contributing to your model, so you need to decide how many components to use. So cross-validation first guards against overfitting, and second tells you how many components you need to achieve good performance.

For PLS-DA the performance measures are, first, the prediction accuracy; second, the sum of squares captured by the model, another measure of how good the model is, basically how much of the covariance between your data and your Y the fitted model explains, which is called R-squared; and third, because we are doing cross-validation, the same quantity computed on the held-out predictions, the cross-validated R-squared, which is called Q-squared. So R-squared, Q-squared and prediction accuracy are the three performance measures, and they help us decide whether our model is a good model or an overfitted one. Because PLS-DA is so easily overfitted, there is also a second check: the permutation test. I'm not going to cover the details; it's exactly the same idea I covered before for the t-test, except that here we compute it on the PLS-DA model. It's very easy: we simply compute Q-squared, R-squared, or even the prediction accuracy on permuted data, basically shuffling the class labels and redoing the analysis, and then compare the statistic from the permutations with the one from the original data. So we have cross-validation and permutation as a double check
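Here is a minimal sketch of these two checks around scikit-learn's PLSRegression: leave-one-out Q², R² on the full fit, and a label-permutation test. This is my own illustration, not what MetaboAnalyst runs internally; the toy data, the two-component model, and the 100 permutations are assumptions for the example.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import LeaveOneOut, cross_val_predict

rng = np.random.default_rng(3)
# Toy data: 20 samples x 200 peaks, binary class encoded as 0/1
X = rng.normal(0, 1, size=(20, 200))
y = np.repeat([0, 1], 10).astype(float)
X[y == 1, :10] += 1.0                      # small real signal in 10 peaks

def q2(X, y, n_comp):
    """Cross-validated R2 (Q2) using leave-one-out predictions."""
    pls = PLSRegression(n_components=n_comp, scale=True)
    y_pred = cross_val_predict(pls, X, y, cv=LeaveOneOut()).ravel()
    press = np.sum((y - y_pred) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1 - press / ss_tot

n_comp = 2
r2 = PLSRegression(n_components=n_comp).fit(X, y).score(X, y)   # fit on all data
q2_obs = q2(X, y, n_comp)

# Permutation test: shuffle the class labels and recompute Q2 many times
q2_perm = np.array([q2(X, rng.permutation(y), n_comp) for _ in range(100)])
p_perm = (np.sum(q2_perm >= q2_obs) + 1) / (len(q2_perm) + 1)

print(f"R2 = {r2:.2f}, Q2 = {q2_obs:.2f}, permutation p = {p_perm:.3f}")
```

R² on the full fit will look flattering even for random labels; it is the drop from R² to Q², and where the observed Q² sits relative to the permuted ones, that tells you whether the model is overfitted.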
to see whether this PLS-DA model is valid and actually tells us something. If both of these checks give good results, we are safe: our PLS-DA model works. So don't just show a PLS-DA separation; if you don't report these measures, reviewers will definitely question how valid it is. It's much better to also report Q-squared, R-squared and the permutation results, which makes the model far more credible; this is quite different from just using PCA. For PLS-DA we can look at scores and loadings as in PCA, but a popular alternative is the VIP plot, the variable importance in projection. Usually people take VIP greater than 1 as indicating significant features. Here is the result you get in MetaboAnalyst: there are actually quite a lot above one, so you can just choose the top ones, you don't have to take all of them; these are the features with VIP above 1.0, and the small coloured squares tell you whether each feature is high or low within each group.

[Question from the audience about the performance measures.] Here are the three, and we actually compute all of them: the prediction accuracy (or error rate), the cross-validated R-squared, which is Q-squared, and R-squared itself. Q-squared and the prediction accuracy are both prediction-based; R-squared is based only on the model fit itself. For all of them, higher is better, and on top of that you should run the permutation test, so use both. The reason is that PLS-DA does not have the best reputation: PLS was not initially designed for this kind of prediction, but people want to use it that way, so we need to pay more attention. Compared with a random forest or a support vector machine, PLS-DA is very susceptible to overfitting.

So far I've mentioned the binary case, yes or no, case versus control, but you can also do regression; for regression you use a different measure, the root mean squared error, basically how far the predicted values are from the original ones. That is not commonly done here, though; we mainly focus on classification, yes or no, and do regression much less. So how do we measure a classification result? The simplest measures are accuracy and error rate: accuracy is the percentage of cases we get right, which is easy, but judging a classifier only on accuracy is a problem with imbalanced data. If the majority of the population is healthy and only one or two are patients, then predicting that everybody is healthy gives very high accuracy but is useless, because the data are imbalanced. So for evaluating performance, especially for biomarkers in clinical settings, people use a different pair of measures, sensitivity and specificity, that is, the true positive rate and the true negative rate: if a case is actually positive and we predict positive, it's a true positive, and if it's actually negative and we predict negative, it's a true negative; that is how it's done clinically. For example, if we have these two populations and we place a cut-off here, everything on the left is called negative and everything on the right positive. This part is then a true negative; this part is a false negative, because it is green, an actually positive case, but falls to the left of the cut-off; this part is a true positive; and the red tail to the right of the cut-off is a false positive.
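Here is a minimal sketch of those definitions at a single cut-off, with simulated scores; the values and the threshold of 0.5 are purely illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
# Toy scores: 10 actual negatives centred lower, 10 actual positives centred higher
y_true = np.array([0] * 10 + [1] * 10)
scores = np.concatenate([rng.normal(0.3, 0.15, 10), rng.normal(0.7, 0.15, 10)])

cutoff = 0.5
y_pred = (scores >= cutoff).astype(int)   # right of the cut-off = called positive

tp = np.sum((y_true == 1) & (y_pred == 1))
tn = np.sum((y_true == 0) & (y_pred == 0))
fp = np.sum((y_true == 0) & (y_pred == 1))   # the "red tail" to the right
fn = np.sum((y_true == 1) & (y_pred == 0))   # positives left of the cut-off

sensitivity = tp / (tp + fn)   # true positive rate
specificity = tn / (tn + fp)   # true negative rate
accuracy = (tp + tn) / len(y_true)
print(f"sensitivity={sensitivity:.2f}, specificity={specificity:.2f}, accuracy={accuracy:.2f}")
```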
Based on these true positives and false positives we can calculate the sensitivity and specificity, and this is what makes the evaluation different for biomarkers. If we calculate these values at different cut-offs, we can easily create a curve called the ROC curve, the receiver operating characteristic curve, which is very popular in biomedical applications. With an ROC curve we don't need to worry too much about imbalanced data, because the true and false positive rates are built in. As I mentioned, we can already calculate them: with these two populations we move the cut-off, compute the true positive and false positive rates at each position, plot the points and connect them, and that is the ROC curve. It shows the trade-off between sensitivity and specificity: here we have very high specificity but low sensitivity, here very high sensitivity but low specificity, and everything in between. A lot of the time people take the point closest to the top-left corner as the optimal cut-off for diagnosis, because it balances high sensitivity and high specificity, but in some cases people deliberately choose a different cut-off with much higher specificity even at the cost of sensitivity, so it's really case by case. If we want to capture the whole trade-off in one number, that is the area under the ROC curve, the AUC, an overall measure of performance. A good test can reach an AUC of around 95%; 100% is a perfect test, and a random classifier gives 50%, the diagonal line. I see a lot of reports around 70%; for clinical use that is not enough, you should be pushing towards something like 80% or above. At 70-75% you can publish, but it's still not good enough for actual clinical use.

I mentioned PLS-DA, which is really part of machine learning, and there are other supervised classification methods: SIMCA, OPLS, support vector machines, random forests and neural networks, all powerful. Several of them are in MetaboAnalyst, and you are welcome to explore them and ask me questions, but I'm not going to cover them in this lecture since we are already over time. Supervised methods are very powerful, but they require more samples. The general progression is: start with PCA and see whether the samples naturally cluster and separate well; if so, PCA may be enough for your needs. If you want to do prediction, use a supervised method, but pay attention: supervised methods need more data. And if you don't want to do prediction, if you just want to find the significant features, then simple statistical tests, t-test and ANOVA, are what you need. The caution is: don't naively apply a lot of supervised approaches to limited data. You will tend to overfit, and if you don't pay attention and are just excited to write the paper, the reviewers will come back with comments that you will not be happy to address, because it will look like you didn't understand and didn't deal with the overfitting, and were just over-excited about the results. So basically, try the simple things first, and if you do use supervised methods, make sure you do the cross-validation and the permutations and report all those values; then people will be more confident that the result is robust.
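Before the questions, here is a minimal sketch of sweeping the cut-off to build the ROC curve and AUC described above, using scikit-learn; the simulated scores and group sizes are assumptions for illustration only.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(6)
# Simulated test scores: the true positives tend to score higher
y_true = np.array([0] * 50 + [1] * 50)
scores = np.concatenate([rng.normal(0.40, 0.15, 50), rng.normal(0.65, 0.15, 50)])

# Sweep all cut-offs to get (false positive rate, true positive rate) pairs
fpr, tpr, thresholds = roc_curve(y_true, scores)
auc = roc_auc_score(y_true, scores)

plt.plot(fpr, tpr, label=f"AUC = {auc:.2f}")
plt.plot([0, 1], [0, 1], "--", label="random classifier (AUC = 0.5)")
plt.xlabel("1 - specificity (false positive rate)")
plt.ylabel("sensitivity (true positive rate)")
plt.legend()
plt.show()
```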
Any questions? Yes, power calculation. Power calculation is not covered here, but MetaboAnalyst does have a power-calculation module. Power calculation in omics is always difficult, and the best approach I have found is to use pilot data: either a small-scale study from your own research, or published data that is very close to what you are going to study. From that you can do a power calculation based on the effect sizes, and given the cut-off values you choose, it will tell you how many samples you are going to need. MetaboAnalyst actually has a module for this, so you can try it and ask me questions; the key point is that you need some data to estimate the parameters.

[Question:] Just one quick question: after your PCA, could you still establish a false discovery rate in later analysis, depending on what tools you're going to use? [Answer:] PCA is for patterns; it doesn't really tell you anything about significance, so the false discovery rate isn't even relevant there, because PCA just gives you patterns and doesn't declare anything significant. You can't blame PCA for anything; it's a visual tool. [Question:] Right, but could you, later in your data analysis, at least establish what you think is a meaningful false discovery rate, a cut-off for what you will and will not consider? [Answer:] Not within PCA itself: any such cut-off comes from a statistical test, like a t-test. If you want to act on the patterns you see in PCA, you need to do a further statistical test on them, which is a separate step; it's not PCA any more, but you can always do something based on the PCA patterns.
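As a minimal sketch of the pilot-data idea from the power-calculation question, here is a standard two-sample t-test power calculation with statsmodels; this is not the MetaboAnalyst module, and the effect size, alpha and target power are assumptions you would replace with your own pilot estimates.

```python
from statsmodels.stats.power import TTestIndPower

# Suppose pilot data suggest an effect size (Cohen's d) of 0.8 for a key metabolite.
# For omics-scale testing you might plug in a stricter alpha (e.g. an FDR- or
# Bonferroni-style per-test level) instead of the plain 0.05 used here.
analysis = TTestIndPower()
n_per_group = analysis.solve_power(effect_size=0.8, alpha=0.05, power=0.8,
                                   ratio=1.0, alternative="two-sided")
print(f"samples needed per group: {n_per_group:.1f}")
```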