Okay, good morning everyone, and I hope you had a good night's sleep and recharged for statistics and metabolomics today. These slides are under Creative Commons, as you saw yesterday. And today's module, this morning, the first one is background in statistical methods. So yesterday David basically introduced all the technologies underlying metabolomics, how the data is generated, and we also had labs on how to do the spectral processing for NMR, GC-MS and LC-MS. So in the end we have a list of metabolites with concentrations, or peaks with intensities. And today's goal is to try to understand what the patterns are, what the biological changes are, and hopefully you can come up with a good hypothesis to do more validation studies. In general, omics gives you a lot of leads and potential directions, but usually you need to do some more targeted analysis to finish the story, so in the end it comes back to the traditional approaches. Here, let's see, I cannot proceed, okay. So today's main goal is to go from the tables, from the list, to patterns, to biomarkers, and to pathways. These are the learning objectives. Before we actually start using the tools and getting results, we need to understand the basic concepts in omics data analysis. And although the focus is on metabolomics, a lot of these concepts are very generally applicable to other data analysis, like transcriptomics or microbiome, so I do want to share them; if you learn them here, it will definitely help you learn other omics. And a key focus is actually on p-values. There is a lot going on there. People are very superstitious about p-values: if a p-value is significant, you think it is true, and that should be moderated. It is not necessarily the case; the biology is most important. And there is also the other extreme, where people don't run enough replicates to have the proper power to get a good p-value. So we always need to have the right attitude about p-values, interpret them properly, and come up with a reasonable interpretation of the hypothesis. We're also going to introduce common multivariate statistical methods like PCA and PLS-DA, which are very, very popular in metabolomics. They're also quite useful elsewhere; for example, PCA or its variants are quite common in single-cell RNA-seq data analysis. So understanding how PCA works and how to interpret it is important. A few years ago you would think PCA is advanced; now it's really routine. So try to have an intuitive feeling for PCA and how to interpret it. PLS-DA is basically similar to PCA, but it is supervised; we will come to it in more detail later. And we're also going to introduce the more classical machine learning, like clustering, performance evaluation, and cross-validation. I'm not going to talk about deep neural networks and AI this year. Maybe next year, if there are strong feelings, even though it seems quite hot. But let's see. So as I mentioned, all the omics share the same data analysis pattern. I put it in a nutshell, summarizing how, at a high level, everything is almost identical. The data processing and quality checking is platform-specific, omics-specific. For example, if you're doing RNA-seq, you are dealing with sequencing data and how to handle the FASTQ files, and that's specific to RNA sequencing. If you're coming from NMR, that's a specific data format straight from the machine, so you need specific handling. Mass spec also has its own unique formats.
But after you get out of that specific format and process the data using the tools we discussed yesterday, you get a list of features, be they genes or metabolites, or tables with their abundance profiles, and then you can do statistical analysis. And the statistical analysis is shared across all omics. For example, a common goal is comparison, like t-tests and ANOVA, and clustering: how do you do a heat map and see the patterns? And classification, which shows up as PCA, linear discriminant analysis, even SVM. So all the techniques we use today are applicable to other omics. And then, after you've found patterns and signature features, what you usually want to do next is interpret them. How do you do that? Enrichment analysis, pathway analysis. Again, this is common; just the libraries need to be built differently for metabolites versus genes or proteins. And finally, there are some unique functions. For example, in clinical studies you have survival analysis and biomarker analysis, and in environmental toxicogenomics there is dose-response analysis. So that part is slightly different, but let's say 70%, 75% or 80% is the core of the approach shared across omics. So by understanding the concepts, you basically understand other omics as well. So let's go to the statistical analysis. These are the steps we will go through in the coming 17-ish slides. How do we do it? First, we need to read in the data. And this data is usually a matrix, a table. You can open it in Excel and you see samples in the columns or samples in the rows, and the features or metabolites in the other dimension, so it's basically a rectangular data table. You save it as a text file and upload it to statistical software; MetaboAnalyst, for example, is going to accept such a table. So you read the table into the computer. The next step is to visualize the data. Why do we want to visualize? We want to see what the data looks like. This is a very important procedural step: you want to make sure the data looks normal before you spend time going further. It is a quality-checking step. And if you have QC samples, you clearly would like to see where the QC samples sit compared to the other samples, and whether there are some patterns already emerging. If you see some patterns at this step, that's already a very good sign; hopefully the pattern is meaningful. The other one is, after this QC, you want to do normalization. Why normalization? Because the statistical tools are mainly developed around the normal distribution, so they work much better if your data is close to normal. And batch effect correction usually happens at the normalization step, so normalization can usually reduce the batch effect. So normalization is going to enhance your signal, and it's much better for the statistical analysis in the next step. Finally, we go to statistics and machine learning. We're talking about univariate methods like t-test and ANOVA, plus PCA and heat maps. We just need to understand the concepts and how to interpret the results, because the tools themselves are quite accessible nowadays. It's just that, before we get there, we need to do all the necessary steps to make sure the data is good, so the result is meaningful. And believe it or not, I have found that most of the time when using a tool,
a tremendous amount of effort goes into getting the data into the proper format that the tool can accept. This is interesting, but with time we realized there are other format issues, for example how to save a CSV or TXT file from Excel. It seems easy, but a lot of the time it's not so straightforward. So from the computer perspective, or the data analysis perspective, there are two types of data. One, the big X here, is the real data, like a peak intensity table or a compound concentration table. That is your numerical table, okay? And the other one is metadata. Metadata is data about data; it describes the X. For example, class labels, experimental factors. And sometimes people even want to describe something about the compounds, like compound class labels. Most of the time we analyze just the X, and if we use a Y, it's either for labeling, coloring your samples, or for supervised analysis. So these are two different things, and the computer must understand what is your data and what is your metadata. So what are the goals once we get this data into the computer? First, to tell you which features, and here we mainly mean your compounds or peaks, are significant, say with a p-value lower than 0.05. This is the most common question: find the significant features that are different among experimental conditions. On the other side, you want to see the big picture: what are the group patterns, the cluster patterns? And if there is something significant, can you build a classifier to predict the next sample, so that when you see a new sample you know whether it's disease or control, healthy? For that you want rules, you want models, so that is more advanced. Our focus is mainly on the significant features and the cluster patterns; the classifier side is more advanced. MetaboAnalyst does have these capabilities, but I will wait for your questions if you actually get there. So X is quantitative data, and it could be metabolite concentrations, microarray intensities, or RNA-seq gene counts. Whether it is continuous or discrete, you use slightly different statistical models: continuous data usually uses the normal distribution, and discrete data like sequence counts uses a Poisson distribution. So they are treated differently in statistics. But there are ways to convert; for example, you can transform Poisson-distributed counts, a discrete distribution, into something closer to a normal distribution, then apply the regular statistics, and you don't lose power. So it is possible to switch between them, but you need to have the concept. If you don't do the proper procedures and directly apply the statistics, conceptually it's not right, even if it sometimes works well. For example, you apply a t-test on discrete values and still get some good results, but the test was not really designed for that, so you need to do some conversion. And if we talk about Y, Y is metadata. Sometimes people write a very detailed free-text description of what the sample is about. It's readable and understandable by a human, but not by a computer. For the computer, it should be yes or no, case or control, zero or one; they are all the same to the computer. And I would suggest that most of the time we should try to focus on binary comparisons, because that's very easy to interpret, and the statistical methods are very powerful for a two-class problem. Excuse me, Dr. Z. Yep.
I have a question on your previous slide. How do you know if your data is continuous or discrete? Oh, if you open it, you can see here that you have fractions, you see there are decimal points, right? This is concentration. Oh, oh, okay. I'm sorry, I get it. This one is on top. That's a figure with Ensembl gene IDs; this is from RNA-seq data, and you can see it's integers, zero or an integer count. But sometimes, people doing bioinformatics don't receive the raw data; they receive data that has already been processed. Once you normalize, you get fractions, and at that point you're probably not sure what the original data looked like; you need to contact them. So for data analysis, I do encourage people in data science and bioinformatics to understand how the raw data was generated, because once you come in in the middle, you don't know the previous steps, and that's very dangerous: all your effort could be wasted if the previous steps were not done properly. So we should definitely get into that. So, sorry, Jeff. You mentioned that we should not use some kinds of normalized data. For example, if it is about gut microbiota, we should not use the relative frequencies for doing statistical analysis. Is that correct? Yeah. If you are doing bioinformatics, if you are doing data science, you should get the raw counts. Don't use normalized data. A lot of statistics work best on the raw counts; they have built-in normalization to handle that. Once the data has been normalized, I can guess you probably used one of the most common normalizations, counts per million, but I'm not sure. Other approaches will introduce some bias and some statistics won't be applicable. So if you know what you're doing, ask for the raw data and do your own normalization. Once you only have normalized data, you cannot go back to the raw data; if you have the raw data, you can always get normalized data. That's what I'm saying: raw data is preferred, just a table. The next step is not so hard to do, but once you only have normalized data, it's hard to go back. That's the thing. For the Y, I recommend focusing on two groups. Yes, you can have three groups or four groups or ten groups, and it's good to have a global view, like using PCA or clustering, to see the patterns. But when you come to interpretation, and also to building a classifier, it's much better with two groups; it's very easy to understand. And for me, if I'm doing two-way ANOVA or three-way ANOVA, my mind doesn't work; I just can't think about it that way. So it's much better to stratify your data, or just think about how you compare your current main question against the rest. Don't put everything you want to solve together; in the end, it's very hard to interpret. I'm not against it in principle; it's just that such designs introduce many factors. If you have many replicates, that's probably fine in traditional statistics, but in omics you usually don't have that many replicates, and you actually have many features. And the statistics and machine learning designed for omics don't work well with such complex designs. That's just the reality. So try not to make a very complex multi-factor experimental design for omics. Even if you have many replicates, still compare each condition against your control, disease one versus control, disease two versus control, but don't do everything together in one go.
If you just do control versus condition one, control versus condition two, and you come to understand your data very well, then you can put them together. What I'm saying is: start simple and gradually introduce complexity. Doing everything in one go at the beginning will prevent you from developing a good story. That's my feeling through the years. So this is a screenshot of the data opened in Excel. For metabolomics analysis, for what we use today, most formats should look like this. You can put the samples in rows, for example control one, control two, so your samples are in rows, and the first column is your group label, the metadata. The next columns are compound names, and these are concentrations; this is targeted metabolomics generated by NMR, I'm pretty sure. And here is untargeted metabolomics, generated from something like XCMS. Because you have a lot of features, the features are now in rows and the columns become samples; in that case, the metadata labels are in the second row. So the metadata always directly describes the samples. This is intuitive, and you can upload everything in one data table. It was only yesterday, because you uploaded spectra, that we needed a separate metadata file, because the spectra themselves don't tell you much about the groups. Most of the time this format is very convenient: you don't need to upload two files, just one table you edit in Excel, save as CSV or TXT, and upload. So here are some common terms when we talk about statistics. The first is dimension. Dimension is the number of variables; we're not talking about samples. In metabolomics it's usually hundreds of dimensions, hundreds of metabolites; you rarely get thousands of metabolites, though with peaks you probably will. That's the dimension. Univariate means we measure one variable per subject, and multivariate means we measure many variables per subject. So omics is always multivariate, but a lot of the time we use univariate statistics, where we treat one variable at a time and ignore the other variables. That loses the huge advantage of the covariance between variables, because we consider each variable in isolation. But univariate is the basic starting point, and don't dismiss it: univariate is very good for initial data analysis, just to see what's going on, for understanding the data. So we shouldn't have a bias that multivariate is naturally better than univariate. When it comes to understanding the data, univariate is by far the easiest to understand. And when we see what differs between the univariate and multivariate results, we can probably see what the covariance contributes, what new story comes out. So we need to use both. A question, please. Dimensions, is that the number of variables, or should that be the number of measurements? Yes, the number of measurements. Oh, the number of variables. So by measurements, we're talking about compounds, right? Yes, in that sense. So here is how we summarize the data. If we measure a compound across... I'm sorry to interrupt you. I have a question regarding the variables. Normally in statistics, for doing analysis, you need to use independent variables. And for me, that was a question every time I use ANOVA and do multiple comparisons, because we take the intensity of each peak from the same chromatogram. Are those independent variables?
That's a good question. I totally understand what you said. Yes, statistics does talk about independent variables, with Y being the response variable. I'm strongly influenced by machine learning; even though I understand what traditional statistics means by it, I just say variables. And here, it refers to the independent variables. Why independent or not? Because in omics we know the variables actually have some dependence. Maybe you are referring to Y, or to the variables themselves; the X here is what would traditionally be called the independent variables, yeah. Yeah, but my question was that ours are not independent. If you use metabolomics, targeted or untargeted, you use the same chromatogram for measuring different variables in the same subject, so the variables are linked to the run. Yeah, I get it. But this is a naming thing; we just refer to them as variables. In statistics, I know they talk about independence, IID, independent and identically distributed. There are a lot of assumptions there and we clearly violate them; we know that. But what's a better term? I'm with you, but I don't know how we can give better names here. We just talk about the data, okay? And here, let's go back to this: how do we understand the data? For example, one metabolite, glucose, measured across about 85 patients. How do we describe it? A computer can keep 85 numbers in memory with no problem, and a million data points without much trouble. But our human brains cannot deal with such large numbers. We can handle a few, five or ten; below ten we can do it, more than that we need some help, and we need to summarize the data. If we talk about one variable measured across 85 patients, how do we summarize it? There are several things. One is the center, the central tendency. If it's a normal distribution, the center is very informative; you can see the first one on your right, at the top. The center is the densest part, and if you know the location of the center, you already know a large part of the information. The second one is the variability, the spread of the data. If it's very narrowly focused, everything almost hits the center, and the center is very descriptive, very representative of your data. So the mean, if it's a normal distribution, is very meaningful; you don't need much more. I'm talking about a normal distribution, but if it's bimodal, using the center alone is not helpful, so you need more, like the quantile range, the IQR. Center-and-spread summaries work best for a normal distribution, better still if it is narrowly focused. The more spread out, the more it deviates from a normal distribution, the less meaningful such a summary becomes. So keep in mind that when we summarize data using the mean or median, we are mainly assuming a normal distribution. And you can see the challenge: if the data behaves normally, we are fine; if it doesn't and we just use these numbers, we are actually far from reality.
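As a minimal sketch (in Python, with simulated glucose values standing in for the 85 patients mentioned above, since no real data accompanies the slides), the center and spread summaries just described might be computed like this:

```python
# Minimal sketch (simulated data): summarizing one metabolite measured
# across many subjects with center and spread statistics.
import numpy as np

rng = np.random.default_rng(42)
glucose = rng.normal(loc=5.0, scale=0.8, size=85)   # simulated concentrations, 85 subjects

mean = glucose.mean()                  # center (informative if roughly normal)
median = np.median(glucose)            # robust center
sd = glucose.std(ddof=1)               # spread
q1, q3 = np.percentile(glucose, [25, 75])
iqr = q3 - q1                          # interquartile range, robust to outliers

print(f"mean={mean:.2f}  median={median:.2f}  sd={sd:.2f}  IQR={iqr:.2f}")
```

For a roughly normal variable the mean and median land close together; for a skewed or bimodal one they drift apart, which is one quick way to see that the simple summary is no longer telling the whole story.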
So this is the box-and-whisker plot. A lot of the time I get questions about how to interpret it; I just refer people to Wikipedia, because this figure is also from Wikipedia. And I hope you guys are fine with the box-and-whisker plot. The middle line is the median, and the box spans the quartiles: the top is the 75th percentile and the bottom is the 25th percentile, so the IQR is basically from the 25th to the 75th percentile. This is more meaningful, more robust, than using the whole range. Why? Because you can have outliers at the top and bottom, but if you use the 25th to 75th percentile it's more conservative and more robust. This is something we also use in normalization: the upper quantile or interquartile range is often used to estimate the variance, rather than the whole range from minimum to maximum, because outliers can play a big role, and the IQR is much more stable. And here, since we're talking about the problem of comparing means without considering the variance, we can see that even with the same distance, the same difference between the means of group one and group two, if the spread of the distributions differs, the confidence differs. In the top right, we clearly see we are going to get a very significant p-value, because the two groups are almost totally separate, with very little overlap. But at the bottom, there's a lot of overlap. So the challenge is that we need to consider both the mean and the variance, and both are estimated from your data. The more data points you have, the more confident the estimate; with fewer data points it becomes less confident. That's a challenge for omics, with its small numbers of replicates. So we talk a lot about the normal distribution. Everybody should be very grateful for this magic distribution that people discovered and described, and it actually works for a lot of life science. My background is not originally in statistics; I'm from a more medical field. So when I hear "normal distribution", I always want to ask, is that real? And most of the time it is indeed distributed like that. The magic number is around 13 to 14: when you have 13 to 14 replicates and you measure some biological response, you see the distribution already comes very close to it, and if you apply some normalization, it's actually very normal. So eventually I became more and more comfortable with the concept of the normal distribution, but at the beginning I kept wondering why it behaves like that. You can see the formula; it looks scary, but we don't need to remember it. Computers handle the normal distribution very efficiently, and that's another reason we want things to be normally distributed: otherwise the computing time is going to be much longer if you assume something else. We will describe permutation later; if we don't know the distribution, we do permutation, which is quite costly in terms of computing time. So the normal distribution is our friend; it's very powerful. If we can normalize our data toward it, we have a much better chance to discover something significant. In reality, we find a lot of distributions that are unimodal, bimodal, or skewed. Skewed is quite common, and bimodal is also quite common nowadays in RNA-seq data, presumably from different cell populations or developmental stages. This is sometimes concerning, because the downstream statistical analysis uses the normality assumption. So normalization is always preferred: you cannot make it perfectly normal, but getting closer is at least an improvement.
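Going back to the point about comparing means without considering the variance, here is a small illustration with simulated data (all numbers made up): the same mean difference gives very different p-values depending on the spread within each group.

```python
# Minimal sketch (simulated data): identical mean difference, different spread,
# very different t-test p-values.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 15
# Tight groups: means differ by 1, small standard deviation
tight_a = rng.normal(0.0, 0.3, n)
tight_b = rng.normal(1.0, 0.3, n)
# Spread-out groups: same mean difference, large standard deviation
wide_a = rng.normal(0.0, 1.5, n)
wide_b = rng.normal(1.0, 1.5, n)

print("tight groups p =", stats.ttest_ind(tight_a, tight_b).pvalue)
print("wide groups  p =", stats.ttest_ind(wide_a, wide_b).pvalue)
```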
And so if the data is not normal and the algorithm assumes normality, we cannot simply redesign a new algorithm. If you're really talented, you can design an algorithm to fit your data, but most of us don't develop algorithms; we have control over the data. So we transform our data toward a normal distribution to suit the algorithm, rather than developing an algorithm to suit the data. That's how it works, because we have no control over the algorithm: if we want to apply a t-test, the data needs to be roughly normally distributed, unless you come up with something different. So, Jeff, I have a question here. Starting from raw data, we know that in a data analysis pipeline we do normalization. But how can we decide when the data requires normalization or not? I mean, in the pipeline it's fine, the next step after raw data processing is normalization. But given some data, how do we decide whether it requires normalization? Yeah, that's a very good question. You need to visualize your data. A lot of the time, QC and data visualization is the first step before you go to normalization; that's the step I mentioned in the previous slides. Visualize the data: if the data is already normally distributed, there's nothing to do, you just go to the next step. Normalization also has no fixed rule for which approach is most suitable for your data, so it's also trial and error: you try the simple one and see whether it fits. But again, you need to visualize your data and look at its distribution, see whether it looks normal or not. There are statistical tests for normality, but the tests for multivariate normality are not very strong; the univariate ones are fine. In the end, you visualize and look; we will demonstrate how that works. Here is the distribution: you can see the skewed distribution, and on the right the normal distribution. All it took was applying a log. A lot of distributions have some outliers, some very extreme high values, so you get a very long tail on the right; this is common in biological measurements. But if you apply a log, the log basically gives a big penalty to big values and exaggerates the small values, so it pushes the whole distribution, which was piled up on the left, toward the center. So here, while we are here, just make sure you understand the log transformation: it compresses large values and stretches out small values, reducing the impact of the big values. Usually we don't have too many issues with the big values being slightly reduced, because they are still ranked at the top. The issue is usually on the left side, the small values, which get exaggerated or enhanced; a lot of the time the small values are noise, close to the baseline. This is usually not a big issue for targeted data, because for targeted assays you actually use reference standards, so it's usually fine. But for untargeted data, you sometimes enhance the noise. Besides the plain log there are other flavors of log transformation. Log transformation is very simple and very powerful, but there is a downside: it can make some noise more pronounced. That's what happens when we just choose log. So you can see several compounds, sometimes a ratio: you apply the log and it becomes closer to normal.
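As a minimal sketch of the log transformation just described, using simulated right-skewed intensities (the small offset guarding against zeros is an assumption for illustration, not a fixed rule):

```python
# Minimal sketch (simulated data): a log transform pulls a right-skewed
# distribution toward normal by compressing the large values.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
raw = rng.lognormal(mean=1.0, sigma=0.8, size=200)   # right-skewed, like many peak intensities
logged = np.log(raw + 1e-9)                          # tiny offset in case of zeros

print("skewness before:", stats.skew(raw))
print("skewness after: ", stats.skew(logged))
print("Shapiro-Wilk p before:", stats.shapiro(raw).pvalue)
print("Shapiro-Wilk p after: ", stats.shapiro(logged).pvalue)
```

The skewness drops toward zero and the normality test stops rejecting, which is exactly the "push the pile on the left toward the center" effect described above.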
So again, we can clearly see the left side is not normal, so we need to normalize. After doing that, you look: are they normal? It's better than not normalizing. We tried our best. We cannot be 100% sure we made it an absolutely normal distribution; we don't know that, and there's no guarantee. But at least not trying at all is wrong. That's how I would like you to see it: try your best to make it more normal. And sometimes the community publishes papers on which method is best, using good benchmarking data, and such publications usually get cited like crazy. The one I show here, on centering, scaling and normalization for metabolomics data, was published in 2006 and has been cited almost 2,000 times. It's very simple, but it's very thorough and very satisfying: you read it through and it clears up a lot of doubts. And this is something I want to stress about normalization: there are no hard rules. You need to understand the risks, the challenges, and the common practice, together with your own experience from reading what's been published. Try simple first, gradually go to more advanced methods, and visualize the data before and after. So this is what you see within MetaboAnalyst: there are many different options for sample normalization, data transformation and data scaling. For data transformation, for example, there is log and there is cube root transformation, and data scaling covers mean centering, autoscaling, Pareto scaling, and range scaling. I won't comment on each individual method, because that just takes time; on the other hand, we have good documentation on each method, and papers are referred to that you need to read and think about. It takes some thinking to find out which one is most suitable for your data. And there are definitely established practices in each field. For example, in NMR they tend to use normalization by a reference compound or reference sample, or normalization by the median. So there are established practices, and you can always gain some comfort by seeing papers published in Nature or Science that use an approach and following it; at least you know you are not wrong. As you gradually understand it better, you can be more adventurous and see whether it improves your result. But begin with something more established and gradually move out. Don't just ask, "What's the best for my data?" There is no good answer to that, because you know your data best. So now, say you've read in the data, visualized it, decided whether or not you need normalization, and you've got your data normalized; once the data is normalized, you do the statistical and machine learning tests. So this is the overall thinking. Right, Jeff? Yeah. Sorry, before moving to the next step: especially in LC-MS, we always end up with a lot of missing values. So is it advisable to estimate the missing values, and what's the best practice for missing values? Yeah, that's a very good question. MetaboAnalyst does have multiple methods, five or six, for estimating missing values. And again, this is a question with a lot of publications behind it. My preference is to do the simplest thing possible. For example, if you don't choose a missing value estimation method, the default follows this thinking: why are these values missing? Because they are below the detection limit, so they are not necessarily zero.
And if you put a zero or leave a missing value, some statistics won't work well, because they were not designed for missing values. But you can give a very small value. So in MetaboAnalyst, by default, it finds the lowest non-zero value in your data and divides it by five or by ten, something like that, and assumes that is your detection limit. Then it uses that detection-limit surrogate to replace all your missing values, so you can go to the next step. This is reasonable, because you are not absolutely sure the value is zero, that the compound is not there; it's just below the detection limit. There are also other approaches, like KNN: for example, find another sample that is very similar to my sample but has a value for this particular variable, and replace with that, or replace with the mean or median. There are many different options. But I want to say this: because we don't fully understand the distribution of LC-MS peaks, doing this can change the data a lot. And especially if the replacement value is something like the mean, that's a sizeable number, and when you see such a number you should have been able to detect it; so why didn't you detect it? So for me, the more conservative way is to see it from the biology, from the analytical point of view: I think it's below the detection limit. Purely from the analytical or statistical point of view there are many different methods, but they come with risks. For example, another one is quantile normalization; I'm not sure I should even touch on it. It works very well for RNA-seq data, but when you go to metabolomics it causes a huge difference in the end, because it forces every distribution to be the same, and we are not quite sure in metabolomics that the data looks like that. If you're doing targeted analysis with many replicates and tens of features, it's more or less probably okay; but for untargeted data with hundreds of features, doing that has a tremendous impact. So statistics can give you a lot of power, but whether a method is applicable to this technology, whether it suits this biology, is not always clear. So it's a good question, but I'm always on the conservative side: try to make minimal changes to the data.
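As a minimal sketch of the conservative below-detection-limit replacement just described, assuming a small pandas table with NaN marking the missing peaks (the column names are hypothetical and the divisor of five is one common choice, not a fixed rule):

```python
# Minimal sketch: replace missing peak values (assumed below the detection
# limit) with a small fraction of the lowest observed positive value.
import numpy as np
import pandas as pd

peaks = pd.DataFrame({
    "mz_101": [5.2, np.nan, 4.8, 6.1],
    "mz_205": [np.nan, 0.9, 1.2, np.nan],
})

min_positive = peaks[peaks > 0].min().min()   # smallest observed positive value
fill_value = min_positive / 5                 # assumed "detection limit" surrogate
imputed = peaks.fillna(fill_value)
print(imputed)
```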
Jeff, can I ask a question? Sure. About outliers: do you have suggestions on how to deal with outliers, or should there actually be no such thing as an outlier in metabolomics data? Yeah, outliers in metabolomics are an issue, and so are batch effects. We do have some illustrations of what you might consider an outlier. With outliers, you always compare against the mainstream, and compare against the QC samples. If you clearly see that the pattern is different, you need to consult the people who collected the samples: whether the sample was mislabeled, whether it was contaminated. Then you can decide whether you can remove it or not. Also, some outliers can actually be addressed with normalization. For example, with urine samples, if you don't normalize, some of them are very concentrated; if you normalize by the total sum or by the median, that outlier is gone. Why? Because that normalization deals with the overall mass or overall volume, so once everything is normalized, the outlier disappears. So the best thing is not to do things blindly: we try to understand the cause of the outlier, and then what we do is more meaningful, and we are more confident about the result. I personally learned this because I did my PhD in David's lab, sitting among the people who collect the samples and run the instruments, so I always understood what was going on. I find it very worrisome when the people doing the data analysis are very separated from the people doing the data collection; then you can sometimes do something that actually changes the conclusion. So again, try to understand more and choose the approach that is meaningful. Okay, thank you. So here is... Yeah. Yep. I'm sorry, I don't know if I missed it or not, but I would like to know more about the quality control samples that are integrated in the series of injections: how they are prepared, how they are ordered in the series of injections. Maybe David can answer your question better. There are definitely protocols for how to do this for LC-MS or GC-MS, how you insert the standards, reference runs and blanks, and it really depends on the platform, or even on your technician, who has their own tradition: every ten samples, every five samples. So I don't have a good answer myself, because I don't do it in the lab. Jeff, I think there are protocols that people have published, particularly papers by David Broadhurst. If you look him up and some of the things he has described, they've developed a pretty robust protocol in terms of putting in quality control samples or reference samples to help with normalization or scaling. But it does differ between labs, and it's okay for it to be a little different between labs. In some cases it's how much you can afford, or what your instrumentation is like, the platforms you use. But as long as you can justify it, and as long as you can point to other protocols that are similar, most people are satisfied. Many kits, like the Biocrates kit, also come with a protocol for their quality controls, and that's also a pretty good model to use and think about. And let's move on to p-values. So why do we need p-values? Because of uncertainty: we need more certainty, a probability. We do statistics because we cannot measure the whole population. If, at the end of the day, we could measure everybody, then we would just be using all samples and we wouldn't need statistical p-values, because we would have the whole population. But until that day comes, we are using a small subset of samples, and from the samples we want to estimate the population, we want some confidence, so we need statistics. You can see at the bottom that, based on your sampling, there are variations: from the big population, random sampling is going to generate subsets that are probably not identical. But is the difference statistically significant or not? You don't know, and you need to do some statistics. So what is the p-value about? The p-value gives you an indication of the level of certainty, or uncertainty, about whether our result represents a genuine effect in the whole population. Because we don't know the truth unless we measure everything, we don't know the truth. So here is an intuitive interpretation of p-values, and I'm talking about p-values in the frequentist sense, the usual meaning, as in a t-test. The p-value is the probability that the observed result was obtained by chance.
So basically, if we assume the null hypothesis, and the null hypothesis is that there is no effect, what's the chance you would see such an effect? For example, say the p-value equals 0.05. If you repeated this experiment, assuming there is no effect, how many times would you get that effect? If you see 0.05, that means if you did it 100 times, five times you would observe that effect even though there is no true effect, just by random chance. So the p-value tells you the probability that the observed result was obtained by chance. If the p-value is very low, that means the chance of obtaining it by chance is low, and we are comfortable saying there is probably a real biological effect there, so we reject the null hypothesis and accept that the treatment actually is effective. That's what p-values are about. So how do we calculate p-values? It's very easy if we know, or assume, a normal distribution. You don't have to remember everything, the computer will do it: you know the mean and the variance, you see where your value is located, and the p-value follows directly. Some people really do keep in mind what the p-value is at one or two standard deviations. That's why we always want the normal distribution: the p-value comes so easily. But a lot of the time, with our samples from omics, from metabolomics, we don't know whether the data is normally distributed, or we are not sure. How can we still get p-values? Because if you don't report p-values, the reviewers are not satisfied, so we need p-values. One option is not to assume a distribution at all: we can generate a null distribution from the data itself and calculate the p-value from that. Such a p-value is not based on a model but on our permuted data; it is an empirical p-value. Again, if we have a large number of samples, say a million samples or even just more than 200, we probably get p-values very close to the real p-values. So we can use computer simulation. How do we do that? We assume the null hypothesis, and the null hypothesis is basically that there is no effect, that your control and your disease group are the same. If they are the same, you can shuffle the samples: your patient group and your healthy group can be mixed, and you recalculate the same statistic and count how many times you get a value as extreme as the one from the original labels. The idea is that if the null hypothesis is true, you can randomly shuffle the labels without affecting the conclusion, because everything comes from the same distribution, with no significant difference. And if we do this again and again, say one million times, we will definitely see what the chance is of getting, just at random, an effect like the one from the original labels. One million times is computationally intensive, and we also may not have enough samples to support that many distinct permutations. So here is a schematic view of how the permutation is done. This example shows cases and controls with the original labels: the cases are samples 1 to 9 and the controls are 10 to 18. You can see there's a big difference: the case mean is minus 0.0-something and the control mean is 0.5-something. Okay, but if we assume the null, then we can shuffle them. If we shuffle, you see some controls become cases and some cases become controls, and we recompute their difference; you can see the mean difference at the bottom.
So there is a difference, and we repeat this again and again, a thousand times; for the computer that's nothing. Then we look at the differences computed with the permuted labels. This is a very simple graphic but it summarizes very well what the chance, the p-value, is. On the right is a red vertical line labeled "observed": this is the difference using your original labels, patients versus healthy, and you see a large difference. But if you assume the null hypothesis, that there is no difference, you shuffle the samples, so patient and control labels are randomly reassigned, calculate the difference, and repeat a thousand times. All the permuted differences sit on the left, shaded in grey. You can see that what you observed is quite different from the shuffled results. That means the chance of obtaining, just by chance, a difference as large as the one observed is very, very low; in this case it's almost zero, I don't see anything out there. But if you want a threshold like 0.05: you did a thousand permutations, so you are willing to accept 50 of them exceeding the cutoff, and that gives the blue line here. Anything to the right of the blue line would be called significant if you happened to observe such an effect. And your observed effect is actually far beyond that, so you are very confident. Again, because this is a permutation, even if you permute one million times you still cannot get a p-value of exactly zero; there is always the chance that if you permuted a billion times you would exceed the observed value, but you just cannot afford that. So let's be reasonable and stop at one thousand or ten thousand times. If three permutations out of a thousand exceed the observed value, the p-value is p = 0.003; that is your empirical p-value. If none exceed it, you cannot report a p-value of exactly zero, because you cannot exclude the chance if you ran it forever; so you report p < 0.001, or sometimes people add one to the top and one to the bottom of the fraction. It doesn't change the conclusion; more precision here is not that meaningful, we know it's significant, right? I can tell you this is probably one of the hardest questions people ask when using MetaboAnalyst, how this p-value comes about, but this is the very intuitive interpretation behind it: we just shuffle the labels and redo the difference, okay? The advantage is that we just use the computer to do the permutations and don't need to assume a distribution. Also, some hidden correlations are accounted for, because when you permute, those are carried along as well. The disadvantage is that you need a reasonably large number of samples: if you have fewer than 10, I can tell you, you probably cannot get 1,000 distinct permutations, it's not enough to permute 1,000 times; with 20 or more, you're probably fine. It is computationally intensive, but that's not a big deal because computers are so powerful these days: a thousand or even one million permutations can finish in a reasonable amount of time.
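As a minimal sketch of the empirical permutation p-value just described, on simulated two-group data (sample sizes and effect size are made up), compared with the parametric t-test p-value:

```python
# Minimal sketch (simulated data): empirical p-value from label shuffling,
# alongside the parametric t-test p-value, for one feature.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
cases = rng.normal(0.6, 1.0, 12)       # e.g. patients
controls = rng.normal(0.0, 1.0, 12)    # e.g. healthy

observed = cases.mean() - controls.mean()
pooled = np.concatenate([cases, controls])

n_perm = 10_000
count = 0
for _ in range(n_perm):
    rng.shuffle(pooled)                          # shuffle labels under the null
    diff = pooled[:12].mean() - pooled[12:].mean()
    if abs(diff) >= abs(observed):               # as extreme as the observed difference
        count += 1

p_perm = (count + 1) / (n_perm + 1)              # the "+1" keeps p from being exactly zero
p_ttest = stats.ttest_ind(cases, controls).pvalue
print(f"permutation p = {p_perm:.4f}, t-test p = {p_ttest:.4f}")
```

With well-behaved data the two p-values land close together; the permutation version simply does not require the normality assumption.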
So, multiple testing issues: I hope most of you are already quite familiar with this. The idea is that, under the null hypothesis, if the chance is very low, like 0.05, we call it a small chance and we are willing to accept that there is a true effect, disregarding that small chance. But here we are not testing just one thing, we are testing a lot. For example, if we have 10,000 peaks, and each one has a 0.05 chance of crossing the threshold, then by chance alone we are going to have 500 significant results. So it is very clear, just from theory, that we will get 500 significant things by chance. This is called the multiple testing issue: for each test we accept a small chance, but when we do it many times the effect accumulates and the overall chance becomes very high. So how do we deal with that? We can use the Bonferroni correction, which simply makes the cutoff p-value more strict. How strict? You divide 0.05 by the number of tests you have done. For example, if you use a univariate test at 0.05 on 1,000 genes, you divide by 1,000, so the adjusted cutoff becomes 0.00005. This is very conservative, but what survives is definitely something you can be more confident about. The other popular approach is called the false discovery rate, the Benjamini-Hochberg method. Its interpretation is that an FDR of 0.05 means that, among the metabolites called significant, I expect 5% to be false positives. We won't go into how the FDR is calculated, but it is more accommodating. So if you use Bonferroni and don't get significant features, you should use FDR. FDR is well accepted and more generous, so you can get features to work on; a lot of the time, if you don't get significant features, you cannot continue. So a strict p-value is not critical in omics data analysis. Omics is there for you to develop your hypothesis; if you are really strict with p-values, you may end up with nothing, and you basically kill the omics study. Omics is more exploratory, so you should pay attention to p-values, not be overly optimistic, and you do need to adjust for multiple testing, but you cannot be too strict either. The reason is that even if you don't get significant features here, you have a hypothesis, and you validate it using PCR or targeted metabolomics, measure the compound, and indeed see the change; then the p-value doesn't matter, the biology matters. So at this stage you need to be generous: form a hypothesis and validate it at a later stage. Do things reasonably, but you don't have to be very strict. That's the beauty of omics: it gives you some flexibility, and you know validation needs to be done; as long as you do that, you'll be fine. So, the high-dimensional data we are dealing with: with t-tests and ANOVA, what we are doing is analyzing a single variable, then applying the same procedure to all variables, and finally doing a multiple-testing adjustment. I would like you to feel comfortable with what we have discussed: mainly t-tests and ANOVA, followed by the multiple-testing adjustment. That is the univariate approach. And visualization is limited to three dimensions, so how can we handle high dimensions? That's where we move on to machine learning and multivariate statistics. Questions so far?
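As a minimal sketch of the Bonferroni and Benjamini-Hochberg adjustments just described, applied to a handful of made-up p-values (using statsmodels purely for illustration; any statistics package offers the equivalent):

```python
# Minimal sketch: Bonferroni vs Benjamini-Hochberg (FDR) adjustment of a
# small set of made-up p-values.
import numpy as np
from statsmodels.stats.multitest import multipletests

pvals = np.array([0.001, 0.008, 0.020, 0.049, 0.30, 0.76])

bonf_sig, bonf_adj, _, _ = multipletests(pvals, alpha=0.05, method="bonferroni")
fdr_sig, fdr_adj, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")

print("Bonferroni adjusted:", np.round(bonf_adj, 3), "significant:", bonf_sig)
print("FDR (BH) adjusted:  ", np.round(fdr_adj, 3), "significant:", fdr_sig)
```

Bonferroni typically keeps fewer features than Benjamini-Hochberg at the same alpha, which is exactly the conservative-versus-accommodating trade-off described above.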
Okay. Yes, Jeff. About the FDR correction: what I found is, I have a study that for now only has about thirty variables; it's not metabolomics, it's just clinical outcomes. And for those thirty variables I have two groups, so only about thirty comparisons. But what I found is that after FDR, all the significant results, about five or six of them, disappeared, even though thirty multiplied by 0.05 is only about 1.5 expected false positives. All the significance disappeared. Do you think that makes sense? Yeah, I think sometimes after doing FDR everything becomes 1.0 or so; it really depends on the ranking, on how the FDR is calculated. What cutoff did you use? 0.05. With FDR, I usually see people go up to 0.2, so 20%, and that's fine. And I do sometimes see results like yours simply because of the ranking of your p-values. So on the other hand, you can relax it: 0.1, 0.15 or 0.2 are all okay with FDR. It's not like... yeah. Do you think it would make sense to do it another way, to tighten the raw p-value from 0.05 to 0.01 and consider that significant? Well, it's all about context, all about evidence. Some people don't even use the p-value; they just use a fold change plus some evidence plus some literature, and as long as you convince the reviewers, convince yourself, and do validation, it's fine. I can tell you there is a strong movement arguing against being too strict about p-values. But my suggestion is: look at your normalization, whether you did it properly, and look at your approach, whether it can be improved. If you want to go to the extreme and give up p-values, that's a route I don't favor. You use statistics to guide your thinking, but you need to have more evidence. If your statistic is close, marginally significant like 0.05 or 0.06, but the fold change is substantial and you also have other peripheral evidence or literature to support it, there is nothing wrong with that. You just need to build the context and have an argument, right? We can discuss later if you have your data and want to see what it's about. Okay. Sure. Thank you. I think it's also important to remind people that the 0.05 is sort of an arbitrary number; it's not strict, it's not like a law of gravity. It's a consensus that people have almost turned into a law. And in fact, as Jeff was pointing out, you can relax it to something like 0.1 or 0.15 as long as you rationalize or explain it. But it is quite remarkable how people, if they've got a p-value of 0.055, throw out their data and say it wasn't significant. And that's just wrong. It's sort of like saying there's only up and there's only down. Statistics is inherently a fuzzy science; there are lots of grays. That's right. Yeah. So we need to understand the statistics and understand the pitfalls, and use the tests properly to help us make decisions; but we cannot let statistics restrict us. I'm just saying that biology always wins, because at the end of the day it's really a person, it's clinicians, and you see the patient who actually has the condition; it's not statistics. Statistics helps you develop a hypothesis, and then you need to do more validation. Some clinicians don't use much statistics; they just see things, come up with biomarkers, and get things to work. For omics analysis we definitely need the statistics and we need visualization, but we also need strong context, the biology, the literature, the peripheral reading.
All these things need to come into play when making decisions. Don't cut off your downstream analysis too early. That's what I would like to see. Okay. Yes. So here we talk about the general challenges. Traditional statistics was really not designed for omics data, because it requires a large number of replicates, which we cannot afford. So if we want to use classical multivariate statistics for such things, you can do it, but nobody is going to understand it, nobody is going to use it. I took matrix algebra, advanced mathematics, matrix algebra, read through it and tried to use it. I spent a solid one and a half years, and I gave up. Why? I just didn't find it suitable for omics: it's hard to use, hard to understand, and it was not designed with omics features in mind. A lot of the matrix algebra is beautiful, but we just don't have the data; you cannot force our data into that format, and there's no easy way to connect them. So that was disappointing to see. What else can we do? Pattern recognition, machine learning. When I switched to this direction, I found it much more comfortable and much more useful, and it makes the results interpretable. The other thing is called chemometrics, or dimension reduction, which is actually also machine learning, like PCA and PLS-DA. So let's put it all together. Here's a slide borrowed from someone who presented at the Metabolomics conference two years ago, and he is really a pioneer in data science, in chemometrics, in machine learning and artificial intelligence. So here's the big picture, and here's where we are. On the right side we have statistics: hypothesis testing and parameter estimation, which is means and variances. And we gradually move to clustering and classification, which is what we will cover next. We're not going to go further into machine learning and artificial intelligence in this workshop, maybe next year. But the whole idea is that once we understand the basics, it's much easier to absorb the more advanced material; and sometimes the basics are the more useful part. So, machine learning and pattern recognition: the idea is that we don't focus on individual features, we focus on groups, on patterns, which is group behavior. One dot doesn't feel like a pattern, but if several dots, many dots, move together to a location, you think that's a pattern. Our eyes are very good at finding patterns, and computers can also be trained to find patterns. So here, visualization, which I am very much in favor of, plus unsupervised machine learning, is very helpful for exploratory data analysis of large omics data. I hope everybody is familiar with supervised versus unsupervised. Supervised means you do the analysis with reference to Y: basically, you try to find the patterns in X that relate to Y. Unsupervised means I only want to find the information within X; I don't know whether it's related to Y or not, I only look for the patterns within X. So unsupervised is X alone, and supervised is X plus Y. Okay, so for unsupervised machine learning on our high-dimensional data there are two general categories, clustering and dimension reduction. We will focus on PCA, though sometimes people think of PCA as a form of clustering too.
That's all fine; it depends on how you view your results. So what's the key idea here? With our human brains we can only think about a few variables: if I give you 1,000 or 10,000, you cannot comprehend them easily. What you can do is clustering: basically, you put the things that are similar to each other into one block. For example, you put 1,000 variables into 10 clusters; each cluster becomes more homogeneous, and you can think about these 10 clusters, 10 units, rather than 1,000 units. That makes it easier to interpret, to think about the behavior: you're not thinking about the 1,000, you're thinking about 10. That simplifies the interpretation. And then, once you see the 10 different clusters, you focus on each individual cluster and what it is about. So we divide and conquer; that's how this approach tries to help you. How do we do clustering? Clustering is putting similar things close to each other, so we need to measure the similarity, and at some point we decide things are similar enough to put into one cluster. So we need a similarity measure and a threshold to decide that things are the same; farther than that, they are not the same. These are the key parameters for clustering. Commonly used methods like k-means are called partitioning methods. With k-means you give it a K, and K is based on your belief of how many clusters you anticipate; for example, K equals three means you anticipate three clusters. There are also variants like k-medoids or k-medians, which are not based on the mean but on the medoid or median, so they are more flexible, more robust. The only downside is that you're not sure beforehand how many clusters there are; of course, you can run a test from one to ten and see which clustering gives very tight clusters. So that's common. The other most commonly used method is hierarchical clustering. This one is basically hands-free: you don't need to specify any parameter; it clusters from the bottom all the way to the top, from everything being its own individual cluster to everything becoming one cluster in the end. So let's talk about hierarchical clustering. You don't need to specify how many clusters you want. What it does is find the two closest objects and merge them together, then merge the next two closest objects, and it keeps merging, because at the next layer a merged cluster becomes an individual object. So you need to measure the distance between objects and the distance between clusters, and it just keeps going until it's done. It's boring, it just repeats and repeats, and computers are very good at that. So the key parameters are the similarity between samples and the similarity between clusters; then everything is calculated in a few seconds, and you can easily see the result, from everything being its own cluster up to everything in one cluster. You can visualize the group patterns and decide how many clusters you want to have, and you cut at that level. That's the flexibility of hierarchical clustering.
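As a minimal sketch of the bottom-up procedure just described, on simulated samples: compute pairwise distances, merge with a chosen linkage, then cut the tree at a chosen number of clusters (the distance metric, linkage, and cut level are all illustrative choices, not recommendations).

```python
# Minimal sketch (simulated data): hierarchical clustering, then cutting the
# tree at a chosen number of clusters.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(3)
# 20 samples x 50 features, two loose groups
data = np.vstack([rng.normal(0, 1, (10, 50)), rng.normal(2, 1, (10, 50))])

dists = pdist(data, metric="euclidean")            # pairwise sample distances
tree = linkage(dists, method="average")            # average-linkage agglomeration
labels = fcluster(tree, t=2, criterion="maxclust") # cut the tree into 2 clusters
print(labels)
```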
Someone asked whether by similarity I mean distance, or something like Spearman correlation. Yes, we will get to that: similarity is a parameter you need to choose, and absolute distance, Euclidean distance, and correlation coefficients are all kinds of similarity. Some of them, for example ecological similarity, are quite different; if you are doing microbiome work the environmental setting is very different, and there is also evolutionary distance. So we just use the concept of similarity, but how it is calculated is something you can define based on your specific field. We will discuss that shortly.

Now, k-means clustering. For k-means we need to specify K, and then we want to group everything around K seeds. The seeds can be dropped at random positions, or you can randomly choose K of the existing data points and start from those. You then assign each object to the seed it is closest to, so objects that are similar enough end up in the same cluster. For example, here is a k-means run where you can visually see there are two clusters, and then you let the computer work it out; because the structure is so clear, the computer easily finds the two clusters. Your job is to seed it: I think there are two clusters, so I drop two seeds. In panel A you can see the red points assigned to the red seed and the blue points assigned to the blue seed, because they are closest to those seeds. Then you recalculate the centroids: once the initial assignment is made, you discard the two random seeds and compute the centroid of each cluster. In panel B the centroids have already moved into their respective clusters, the red cross down in the red cluster and the blue cross up in the blue cluster. You run one more iteration and it has already converged and stabilized: the blue centroid sits in the centre of the blue cluster and the red centroid in the centre of the red cluster. Once further iterations produce no change, it has converged and it is done. K-means is very efficient; even with a huge amount of data it runs easily, whereas some other clustering methods can take much longer.

So, similarity: how do you measure it? I mentioned there are many different approaches. Euclidean distance, in Cartesian coordinates, is just the square root of the sum of the squared differences between the coordinates. It is what we usually use, and it is common, but it is not necessarily the best choice. Pearson correlation is more about similarity of shape, and it has a sign: on the left, a positive correlation of plus one; on the right, a negative correlation of minus one. Sometimes the direction matters, so how you calculate similarity really depends on whether direction is meaningful for your question. There is also absolute (Manhattan) distance and many other options, so there is no absolute answer; you need to try, or think carefully about which distance measure is meaningful. Then there is the question of distance between clusters: once objects are grouped, how do you decide how close two clusters are? That is where linkage methods come in.
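Here is a small, hypothetical sketch of the same two-cluster k-means idea, plus the two similarity measures just mentioned; the data and numbers are made up purely for illustration.

```python
# Minimal sketch: k-means with scikit-learn, plus two common similarity measures.
import numpy as np
from sklearn.cluster import KMeans
from scipy.spatial.distance import euclidean
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(25, 2)),      # one cloud of points around (0, 0)
               rng.normal(5, 1, size=(25, 2))])     # another cloud around (5, 5)

# We tell k-means to look for K = 2 clusters; it drops two seeds, assigns points,
# recomputes the centroids, and iterates until the assignments stop changing.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)           # cluster membership for each sample
print(km.cluster_centers_)  # final centroids

# Two ways to measure "similarity" between a pair of (hypothetical) metabolite profiles:
a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
b = np.array([2.1, 3.9, 6.2, 8.1, 9.8])   # roughly 2x a: far apart in distance, but highly correlated
print(euclidean(a, b))      # Euclidean distance looks only at magnitude
print(pearsonr(a, b)[0])    # Pearson correlation keeps the sign / direction (close to +1 here)
```

The last two lines show why the choice matters: the same pair of profiles looks "far apart" by Euclidean distance but "very similar" by correlation.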
Single linkage uses the closest pair of data points between two clusters (you can see the connecting lines here), and of course it tends to generate very long chains. Complete linkage uses the furthest pair of data points, which tends to generate tight, compact clusters. Most people probably use average linkage, which usually gives more reasonable results, but single and complete linkage are there because sometimes they do work well. There is also Ward's method (or Ward2 in some implementations), which is slightly improved for some subtleties; most of the time you don't need to worry too much about that choice.

So, hierarchical clustering and the heat map. Hierarchical clustering we have just discussed: on the right-hand figure the result is drawn as a tree, or dendrogram, and you can see some of the clusters there. The most interesting part is the heat map. A heat map is exactly your data; it just doesn't show the numbers from your Excel sheet, it shows them as colours, here red and green (hopefully you are not colour-blind). Our eyes are very sensitive to colour, so we can easily spot patterns: dark red, dark green, or black patches inside the heat map. If you display everything as raw absolute concentrations without normalizing, you won't see anything, so you need to normalize first. A clustering tree plus a heat map really gives you a lot of power to discover patterns and to generate hypotheses for further testing. This display was developed for microarray data, but it has become very popular across many kinds of studies.
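As an illustration of the clustering-tree-plus-heat-map display, here is a minimal sketch using seaborn's clustermap; the sample and metabolite names are placeholders, and the z-scoring step stands in for the normalization just mentioned.

```python
# Minimal sketch: a clustered heat map (dendrograms + colour-coded matrix) with seaborn.
# `table` stands in for a normalized metabolite table; all names are placeholders.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
table = pd.DataFrame(rng.normal(size=(15, 30)),
                     index=[f"sample_{i}" for i in range(15)],
                     columns=[f"met_{j}" for j in range(30)])

# z-score each metabolite (column) first; on raw concentrations the colours are
# dominated by a few abundant compounds and you "don't see anything".
sns.clustermap(table, z_score=1, cmap="RdBu_r",
               method="average", metric="euclidean")
plt.show()
```

The rows and columns are reordered by hierarchical clustering, so any coloured patches line up with the dendrogram branches, which is what makes the patterns pop out.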
Now, PCA. We mentioned that PCA is very important for metabolomics and is also used widely in single-cell RNA-seq. The main idea of PCA is that the main directions of variance are the main characteristics of the data. We said that dimension means the number of variables in your data; in metabolomics we have hundreds to thousands of dimensions, and PCA condenses them into two or three dimensions that capture most of the variance. The assumption is that the biggest changes in the data are its most important characteristics, and most of the time this is meaningful: if your phenotype or drug treatment causes a big change in the data, PCA will definitely capture it. Let me just mention something as well, because the issue of dependent and independent variables was brought up earlier: another way of thinking about PCA is that it converts a lot of dependent variables into a small number of independent variables. It inherently finds the correlations; things that are highly correlated get merged and simplified, and the data is reduced to components that are essentially uncorrelated with each other. Yes, that is a good point: once you do PCA, the new dimensions are orthogonal, they are not correlated. The downside is that these new dimensions lose their original identities; they are not metabolites any more, they become PC1 and PC2, so the interpretation can be challenged, but the interdependence is removed. For metabolomics, PCA works very well, because metabolites, and peaks, are quite interdependent, so the data can be summarized very well by PCA. It is interesting that people don't use it as much in transcriptomics and microarray data; when I tried PCA on that kind of data, I found that most of the time it doesn't work well. Why? Because the correlation between genes is not as strong as between metabolites, so PCA does not reduce the dimensionality in a meaningful way, the way it does for metabolomics data. That is why it is not found as useful in transcriptomics, although in single-cell data it works well. So the important point is that PCA exploits interdependency: the more interdependent and correlated your data, the better PCA will work. If every variable is totally random, PCA will give you a perfect sphere, a round ball, which does not tell you much.

Here is the bagel analogy: you have a three-dimensional bagel and you want to project it into two dimensions. The "O" view captures more variance, and that is your PC plane; the "hot dog" side view captures much less. If you have to choose between the O and the hot dog, you choose the O, because it clearly gives you more variance and captures more of the characteristics of your data.

How does PCA do this? You don't need to pay too much attention to the mathematical details; it is a linear transformation, so it is very fast most of the time. At the bottom of the slide, t1 is your PC score, and it is calculated as a weighted sum of your original variables, t1 = a1·x1 + a2·x2 + … + ak·xk; the coefficients a are the loadings. The larger a coefficient is in absolute value, the more impact that variable has on the PC score. So you can easily rank variables, positive or negative, by absolute value to find the most impactful loadings or features; we will come back to this when we discuss scores and VIPs.

Keep in mind that PCA is applied to a matrix, and you can use either the covariance or the correlation matrix. If you use the covariance matrix, the variables must be in the same units, and the variables with the largest variance will dominate. With the correlation matrix you essentially standardize each variable by its mean and standard deviation, so the variables can be in different units and all of them carry the same weight in the analysis. This is a consideration to keep in mind: sometimes you scale (standardize), and sometimes you use the covariance matrix and deliberately emphasize the dominant variables.

I need to go a bit faster because we are close to our break. So here is what PCA does: from these spectra you get this PCA plot, which beautifully separates the samples, and you feel happy about it. The plot of the samples is called the scores plot; these are your new, transformed coordinates capturing most of the variance, and here most of that variance is actually correlated with the phenotypes, which is very good news. Now you want to see which variables contribute to this separation, and that is the loadings plot: the loadings are essentially your coefficients, and whether they are positive or negative, large or small, determines in which direction they drive the samples.
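Here is a minimal sketch of computing PCA scores and loadings with scikit-learn, using the correlation-matrix (autoscaled) flavour described above; the data is random placeholder data, and the feature ranking simply sorts loadings by absolute value, as on the slide.

```python
# Minimal sketch: PCA scores and loadings with scikit-learn.
# Autoscaling (mean 0, SD 1) corresponds to the "correlation matrix" flavour of PCA.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(7)
X = rng.normal(size=(30, 200))                 # 30 samples x 200 features (placeholder data)

X_scaled = StandardScaler().fit_transform(X)   # every variable gets equal footing
pca = PCA(n_components=2)
scores = pca.fit_transform(X_scaled)           # scores plot coordinates (one row per sample)
loadings = pca.components_.T                   # loadings (one row per feature, one column per PC)

print(pca.explained_variance_ratio_)           # fraction of variance captured by PC1 and PC2

# Rank features by the absolute value of their PC1 loading: the biggest movers
# (positive or negative) are the ones driving the separation along PC1.
top = np.argsort(np.abs(loadings[:, 0]))[::-1][:10]
print(top)
```

Skipping the StandardScaler step would be the covariance-matrix version, where high-variance variables dominate, as noted above.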
If we go to the next slide, where I put the scores and the loadings side by side, you can see how to interpret them intuitively. For example, here there are four groups that separate very well on the scores plot on the left, and on the right are the loadings. The loadings at the top left are positively correlated with the green group; essentially those features drive the separation towards the top left. Similarly, the loadings at the bottom right are positively correlated with the brown stars and drive those samples in that direction, and the same logic applies to the other corners. This example is very easy to interpret because the groups are well separated; most of the time the separation is not as clean, and you mainly pay attention to the features at the edges of the loadings plot. Features far from the centre, out on the margins, have more influence on the separation.

You can also interpret PCA as a rotation: PCA transforms your data, but the data itself stays essentially the same, it is the axes that rotate. The first axis corresponds to the biggest variance, the second to the second biggest, and so on. PCA is good for getting an overview of your data; for QC you will usually use PCA (you can also do box plots across samples or features); it is good for outlier detection; and it is good for looking at relationships between variables. Going back to the loadings plot: the loadings plot is about variables and which ones contribute to the separation, while the scores plot is all about samples. On the loadings plot, variables that sit close to each other are similar to each other and are most likely positively correlated, so this is also a way to look at relationships between variables.

PCA is unsupervised, as I mentioned, and if it works well you are almost done, because it means the inherent structure of your data already reflects your experimental conditions: the effect is strong and obvious. But a lot of the time, especially with clinical data, it is not that promising, and then you need a more supervised approach. PLS-DA is a supervised approach: it maximizes the covariance. PCA maximizes the variance; PLS-DA maximizes the covariance between the data matrix X and the class labels Y, so it considers X and Y together, which is what makes it supervised. PLS-DA always tries to please you: it will always produce some separation, even for random data. That is why PLS-DA is so popular, because you can always get something, but it also has overfitting issues, which we will discuss, so we need to be cautious. Here is a result from a metabolomics study where we did PCA and then PLS-DA: at the top left there is not much separation with PCA, and with PLS-DA the separation becomes much better. In general you will always get a better-looking result from PLS-DA, but whether it is meaningful requires more testing and validation. PLS-DA is very susceptible to overfitting: the separation the computer shows you is not necessarily a real pattern. You can give it random samples with absolutely no real separation, and PLS-DA can still produce a pattern like that.
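For reference, here is one common way to run a PLS-DA-style analysis in Python. Scikit-learn has no dedicated PLS-DA class, so this sketch dummy-codes the two groups and uses PLSRegression, which is a frequent workaround rather than the exact implementation used in dedicated metabolomics tools.

```python
# Minimal sketch of PLS-DA via PLSRegression on a dummy-coded class label.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(3)
X = rng.normal(size=(40, 150))          # 40 samples x 150 features (placeholder, pure noise)
y = np.array([0] * 20 + [1] * 20)       # two groups, dummy coded as 0/1

X_scaled = StandardScaler().fit_transform(X)
pls = PLSRegression(n_components=2)
pls.fit(X_scaled, y)

scores = pls.transform(X_scaled)        # PLS-DA scores plot coordinates
print(scores[:5])

# Note: with 150 random features and only 40 samples, this scores plot will often
# show apparent "separation" anyway -- exactly the overfitting warning on the slide.
```

Because the components are built to track the labels, the scores plot will flatter you even on noise, which is why the validation steps below are not optional.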
How do we deal with that? We need to use some machine-learning approaches: one is called cross-validation, and the other is the permutation test. Permutation we have already discussed: you randomly shuffle the labels, redo the computation, and check whether the distance between the two groups is the same as with the real labels or clearly different; we can always do that. Cross-validation we will discuss in a moment.

First, overfitting. Overfitting means we have found patterns that fit only this particular dataset; when we use them to predict new data they don't work well, which means they do not generalize to the population. We want biomarkers that work in the population; we don't want something that works only in our own data, even if it is 100% accurate there, because we ultimately want to translate the discovery to the general population. So overfitting has to be reduced if you want to see the true performance. How does overfitting happen? The figure shows a set of data points: on the right you can see overfitting, because the data naturally has some variation, and if you fit every single point with a wiggly curve you are fitting the noise. On the left there is a real pattern, but fitting a straight line through it is underfitting, not doing a good job either. The best one is in the middle, which is just right: it fits the true signal, not the noise. Conceptually, you should try to get the most powerful model that does not overfit; underfitting is also not good, because you could do better than that.

So how do you check this? Cross-validation is the most commonly used approach. You divide your samples into sections, use some sections to train, and use the held-out section, which was not used to build the model, to test. For example, at the bottom, you train your model on two-thirds of the samples and predict the remaining one-third, then rotate through the three sections: two sections predict the first, then two predict the middle one, and so on. You can do ten-fold cross-validation or leave-one-out cross-validation; leave-one-out means that with 100 samples you build the model on 99, predict the remaining one, and repeat 100 times. It is more granular, but all of this is doable. (Sorry to interrupt, but we are definitely running late; please try to wrap up. Thank you.) Yes, and we have a lot of questions, okay. So cross-validation tells you how good your model is; for PLS-DA this is reported as R-squared and Q-squared. The cross-validated R-squared is also known as Q-squared; it is the prediction accuracy, and it is what is usually reported. As for the permutation test, I am not going to spend much time on it because we covered it well earlier: you shuffle the labels, redo the analysis, and calculate how separated the groups are. If our real result sits far away from the permutation results, we are doing fine; if it sits squarely in the middle of them, it means there is not much signal in our data, it behaves almost like shuffled data.
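Here is a hedged sketch of how cross-validation and a label-permutation test could be wired together for the PLS-DA model above. Taking Q2 as the cross-validated R2 of the dummy-coded labels is one common convention, not the only one, and all names and data here are placeholders.

```python
# Minimal sketch: k-fold cross-validation (Q2) and a label-permutation test for PLS-DA.
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import cross_val_predict, KFold
from sklearn.metrics import r2_score

rng = np.random.default_rng(5)
X = rng.normal(size=(40, 150))          # placeholder data: pure noise
y = np.array([0] * 20 + [1] * 20)

def q2(X, y, n_components=2, folds=7):
    """Cross-validated R2 (Q2): every sample is predicted by a model that never saw it."""
    cv = KFold(n_splits=folds, shuffle=True, random_state=0)
    y_pred = cross_val_predict(PLSRegression(n_components=n_components), X, y, cv=cv)
    return r2_score(y, y_pred.ravel())

observed = q2(X, y)

# Permutation test: shuffle the labels many times and see where the observed Q2 falls.
null = [q2(X, rng.permutation(y)) for _ in range(100)]
p_value = (np.sum(np.array(null) >= observed) + 1) / (len(null) + 1)
print(observed, p_value)
```

On this random data the observed Q2 should land squarely inside the permutation distribution, which is the "no signal" outcome described above; a real effect should sit well outside it.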
One thing we found with PLS-DA is that it is not built around the variance, it is built around the covariance, yet the scores plot still labels the axes with explained variance. So sometimes your first component explains less variance than your second component. Why? Because the components here do not maximize the variance, they maximize the covariance. I sometimes get questions about this, and I wrote the explanation on the website, so make sure you scroll down and read the interpretation. We also have R-squared and Q-squared, which we will cover in the lab session next.

PLS-DA also gives VIP scores. We talked about the feature coefficients; VIP is basically adapted from those coefficients, with some modifications to make it more robust. The empirical rule is that a VIP higher than one is important and lower than one is less important. Here, for example, is a VIP plot from MetaboAnalyst, and you can use one as a guide: features above one are probably more useful as biomarkers and for downstream analysis.

Finally, let's talk about performance. There is accuracy, or error rate, but this is not suitable for imbalanced data. Why? Take a disease that does not occur often, say five cases in one thousand people: if you simply call everyone healthy, you are 99.5% correct, but that is useless. So accuracy is not very meaningful clinically. Clinically, people look at sensitivity and specificity: true positives, true negatives, false positives, and false negatives. Basically, you want to see how many healthy people are diagnosed as healthy and how many diseased people are diagnosed as diseased. True positive and true negative rates emphasize that the test has to work well for both the positive and the negative class, rather than focusing on a single overall number; if we emphasize both, we are making a sounder decision.

If the two populations naturally overlap and we choose a cutoff, we know we are going to sacrifice something: there will be true negatives, false negatives, true positives, and false positives. If we want to combine all of this into a single summary, we use the ROC curve, the receiver operating characteristic curve. It was originally developed for radar studies, but it summarizes exactly what we want, because it keeps track of both sides. How do we generate it? We move the cutoff across the whole range, compute the sensitivity and specificity at each cutoff, and connect the points; that is the ROC curve. The ROC curve is widely used in clinical work because it captures both the true positive and the true negative performance. What people usually select is a trade-off point, and it really depends on the costs: here is a cutoff with high sensitivity, here is a balance of sensitivity and specificity, and here is a cutoff with high specificity; which one you want depends on your disease and the associated costs. When you combine it all, the area under the ROC curve is called the AUC, and it is commonly used to compare biomarker performance. Here is an AUC of 95%, which is pretty good, and 100%, which is very rare; in our research, most of the time we see something like 70%, which is not that useful, so you want to be higher than that. The diagonal at the bottom is random, meaning no predictive performance at all.
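To illustrate sensitivity, specificity, and the ROC/AUC summary on a toy example, here is a minimal sketch with scikit-learn; the biomarker scores are simulated and the 0.75 cutoff is arbitrary.

```python
# Minimal sketch: sensitivity, specificity, and an ROC curve / AUC for a hypothetical biomarker.
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score, confusion_matrix
import matplotlib.pyplot as plt

rng = np.random.default_rng(9)
# Placeholder biomarker values: the diseased group is shifted slightly higher than the healthy one.
healthy = rng.normal(0.0, 1.0, size=100)
disease = rng.normal(1.5, 1.0, size=100)
score = np.concatenate([healthy, disease])
truth = np.concatenate([np.zeros(100), np.ones(100)])

# One fixed cutoff gives one sensitivity / specificity pair...
pred = (score > 0.75).astype(int)
tn, fp, fn, tp = confusion_matrix(truth, pred).ravel()
print("sensitivity:", tp / (tp + fn), "specificity:", tn / (tn + fp))

# ...while the ROC curve sweeps the cutoff across the whole range and the AUC summarizes it.
fpr, tpr, _ = roc_curve(truth, score)
print("AUC:", roc_auc_score(truth, score))
plt.plot(fpr, tpr)
plt.plot([0, 1], [0, 1], "--")   # dashed diagonal = random performance
plt.xlabel("1 - specificity"); plt.ylabel("sensitivity"); plt.show()
```

Sliding the cutoff trades sensitivity against specificity, which is exactly the trade-off decision described above; the AUC collapses the whole curve into one number for comparing biomarkers.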
Other classification methods such as SIMCA, OPLS-DA, SVM, and random forest are also available, except that SIMCA is not offered in our tools; why? Because it is proprietary and I don't want a lawsuit. It is very good, but it is commercial. OPLS-DA, support vector machines, and random forests are very powerful too.

So we have talked about unsupervised methods, supervised methods, and statistical significance. Just pay attention: when you are doing supervised analysis, watch out for overfitting. Don't get too excited when you see some patterns; until you have checked for overfitting, you cannot assume they are real, so make sure cross-validation and permutation testing are done properly before you start to celebrate. If an unsupervised method already gives good separation, you are mostly safe. As we just mentioned, if you don't see good separation in PCA, you need to be very cautious about the next steps; if PCA shows good separation and PLS-DA then performs well, most likely you are doing fine and you have real signal. If you want to do supervised analysis like PLS-DA, you usually need a good number of replicates, a decent sample size. If you don't see good separation in PCA, you have only about six samples, and you still claim a good classification, nobody is going to believe you; it is simply an impossible task.

Finally, it is biology. Don't just think about the statistics and machine learning; using the right tools and using them well makes a huge difference, and it is up to you to understand them and use them properly, but you also have to bring in the biology and the literature, interpret everything in that light, and come up with a convincing story. Simply reporting "I tried this metabolomics method, and this method, and this method, and got these results" is not enough; if I am reviewing a paper like that, I will be very annoyed, because where is the biology? You still need to think a lot and put all the results in context. You don't have to use all the results; use the results that are easy to interpret and that support your story. From a reviewer's point of view, that is much, much easier to accept. You cannot just present a pile of results without your own interpretation, and with metabolomics it is easy to generate a lot of results, so just be cautious about that. I guess that's it.