Okay. Good morning everyone. It's my pleasure to continue our workshop from yesterday. David, the laptop, please. So, let's start. This is our Creative Commons agreement, and here is today's topic: we are going to focus on data analysis. We are going to mirror almost all of the schedule from yesterday. For the first one and a half hours we will focus on omics data science: the basic concepts and how we do it. A lot of these concepts form quite strong shared patterns across all omics data analysis, and there are also things that are more unique to metabolomics, so I am going to go through both. The next session is on the specifics of functional analysis and raw spectra processing, which is more unique to metabolomics. After that there is a lab, similar to yesterday: two hours, and we will go through four tasks. The first one is multiomics, and I see we have one and a half hours, so let's see how much we can cover, because I already saw a question on multiomics this morning.

So, the first session is on the basic concepts behind omics data analysis. This is very important: when we use MetaboAnalyst to do the data analysis, a lot of the reasoning and motivation behind it is driven by these concepts, and if we understand them we will have a very smooth journey in the lab session. The next part is about p-values and how to calculate them. Everybody talks about t-tests and ANOVA, which are easy, but when we do omics analysis there are specific issues about p-values, including how to calculate them when we don't know the underlying distribution. We will also cover the very commonly used PCA and PLS-DA: what the concepts are, how to use them, and how to interpret the results. And we will cover some basic machine learning concepts, including clustering, classification, and performance evaluation, which is very important if you are doing biomarker analysis.

Yesterday, David covered how the data are generated from NMR, from GC-MS, and from LC-MS. We don't do that hands-on on the instrument; we start from the raw spectra onward. If you are using MagMet and GC-AutoFit or the LC equivalent, you will get a concentration table. So now we have the data, and today's topic is how we get from the data to biology, to patterns, to biomarkers. I guess for most of you doing a research project this is the most time-consuming part, and we are going to help you go through it. From the tables or peak lists, how do we get patterns and biomarkers? That is what we cover in this session, and then in module six we focus on how to get from the significant compounds or peaks to the functions. So that is covered across the two modules.

Now, here is the overview of omics data analysis. We covered yesterday how data are generated in metabolomics, so we have NMR, LC, and GC. If we go beyond metabolomics, we also see a lot of data generated by next-generation sequencing, where you get FASTQ files and need to do read mapping and so on, and there are also many studies using imaging data. This raw data is very platform specific: you need specific algorithms to process the raw data and generate a table. Once you have that table, we can use a variety of statistical and machine learning algorithms to actually understand the data.
We can do simple analyses like t-tests and ANOVA, or more advanced clustering and classification, which we are going to cover soon. After that we want to understand what is behind these patterns, what the potential biomarkers are, and how well they perform. That part is in module six. Further down there are more specialized analyses, which we are not going to cover, but the majority share this stats-plus-functions pattern. Okay, that is the main shared workflow.

If we focus on the statistical data analysis, it also has four major steps. The first is how you get your data into computer memory so that you can apply all these methods to it. This seems simple but is actually quite challenging, especially for beginners: computers and humans are quite different and expect different things. A human can look at data and mentally correct or understand it, but the computer needs things to be exactly in the format it expects. So this part is about how we prepare the data for MetaboAnalyst.

After we upload the data, the next step is to visualize the data and do quality checking. This is very, very important. We know garbage in, garbage out, and data analysis takes a lot of time, so we must make sure the data are okay before we spend that time. Quality checking requires a lot of understanding about what matters: the experimental design, which type of instrument was used, and the QCs if you have QC samples, so it requires both statistical understanding and understanding of your own study.

After that we do data normalization, the set of procedures that help improve the data. Most data generated nowadays at a metabolomics centre like TMIC are already high quality, and you just need to make them more compatible with the assumptions of the analysis, for example closer to a normal distribution. But in some cases, if the data quality is not good, you need batch correction and missing-value imputation to improve it.

The last part is when the data are clean, high quality, and hopefully close to normally distributed, and we apply statistics and machine learning and spend most of our time trying to understand the patterns. Keep in mind that you cannot skip these steps: you cannot read in the data and jump directly to the stats or functions. You might get some result, but it will not be reliable. These steps are really common, not specific to metabolomics.

Now we focus on how to prepare the data and the main path we want to go through. The data input for almost any machine learning or data analysis has two parts. One is your data table, X, which contains the quantitative values you measured: in targeted metabolomics these are concentrations, and in untargeted metabolomics they are relative intensities from peaks or bins. This is the data you measured on the machine. The other is called the metadata, which here we call Y. Y is your study design: healthy versus disease, group one versus group two, time series. Both X and Y must be understood by the computer to do the data analysis. From these two types of data, what we want to get are biomarkers, clusters, and some understanding of how they relate, as rules or models. A lot of things can be done.
What sits in between is what we try to accomplish today; that is the journey. Preparing the data: X is the data table I just mentioned, and most of the values we use should be continuous, because they are intensities or concentrations. If you are doing next-generation sequencing, the values are counts, and count data are integers. You can see at the top an example from RNA-seq data, where everything is a count: 2, 5, 100, and so on. Statistically, count data and continuous data should be treated with different algorithms: for continuous data you can assume a normal distribution, while count data are by default modelled with a Poisson distribution. Nowadays there are normalizations that convert count data into a continuous range so the same analysis can be applied, but overall you really need to pay attention to what type of data you have, because it requires different pre-treatment.

Now, the metadata Y. Metadata is about your study design, and you can label it any way you want: zero/one, yes/no, case/control. In general you should use descriptive and concise labels, so that you can understand them and so that when they appear in a plot the plot is also informative and people can guess what they mean. If labels are very long, or contain spaces, they can get cut off, so although almost anything works, think a bit about what the best labels are. The next type is ordinal data, usually for more than two groups, for example low, medium, high, where the order actually means something; in statistical analysis or visualization the tool will respect such an order if it is indicated. For multi-group data in general, everything we can apply to two groups carries over, but if there is an order you need to specify that ordering.

On the right-hand side you see some examples. The top one is a single factor: you have your samples and you have your factor, and the labels are disease, control, QC, and so on; it is one factor because it is just one column. Below that you have multiple columns; here there are about four, but in a clinical setting you can have 20 or 30, and it gets very complicated. In MetaboAnalyst we try to be as accommodating and flexible as possible. If you have only one factor, one column, you can directly embed the design within your data table, because each label labels a sample. The samples can be in the rows or in the columns, so the label must directly follow the sample. In the top-left example the samples are in rows, so you put the label in the second column, right after the sample name. In the bottom one the samples are in columns, so the first row is the sample name and the second row is the label. Make sure the layout is transposed consistently. Once you have multiple factors you need a separate table to describe them, so you upload the data X and the metadata Y separately. Today's lab will focus on the simple case, because the data we generated yesterday have basically one factor: lung cancer versus control. Just pay attention, when you upload data, to whether your samples are in the rows or in the columns.
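To make the expected layout concrete, here is a minimal Python sketch of that input structure. This is illustration only, not MetaboAnalyst code; the file name and the "Label" column name are hypothetical.

```python
# Expected layout: samples in rows, features (metabolites) in columns,
# plus one class-label column. File and column names are made up.
import pandas as pd

df = pd.read_csv("concentration_table.csv", index_col=0)  # hypothetical file
y = df["Label"]                  # metadata: one factor, e.g. "Cancer" vs "Control"
X = df.drop(columns="Label")     # numeric data table: concentrations / intensities

# If your samples are in columns instead of rows, transpose first:
# X = X.T

print(X.shape)            # (number of samples, number of features)
print(y.value_counts())   # how many samples per group
```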
We also need to be clear about terminology. A sample is a sample, that is consistent, but when we talk about metabolites or genes, they are called features or variables. In machine learning these are the features or variables, and the dimension is the number of features: if we measure 100 metabolites, the data are 100-dimensional. Univariate analysis means we analyze one compound, one gene, or one peak at a time with regard to the class labels. Multivariate means we consider multiple at once; for example in PCA we consider all the variables, all the data, simultaneously. Multivariate, high-dimensional data is really what omics is about, because we are measuring hundreds to thousands of compounds, genes, or peaks and analyzing them simultaneously.

We receive a lot of questions about how to do basic data checking and understand individual features. Although we focus on the global picture, we have to understand how individual features are described. Suppose we look at an individual feature like glucose and ask what its measurements look like. One thing we talk about is the mean or median; this is called the central tendency, the centre of the data. The other is the spread: how dispersed the data are. If the values follow something like a normal distribution, a summary using the mean is meaningful, and this is very useful for statistical analysis and data summaries. You can see on the right-hand side that if we plot all the numbers in a row, the centre is meaningful because the values are concentrated there; whether you choose the mean or the median, you can use it as a representative number for a lot of computing, and the spread is basically the variance. This is why the normal distribution is so useful: the mean and the variance capture the key characteristics of the data. Below that you can see a distribution that is more uniform. In that case, summarizing the pattern with a mean is not very useful, because the values are not centred, not condensed into one region, and one or two numbers cannot capture the characteristics of the data. So the distribution matters, and most methods built around a mean and a variance work best for normal-looking data like the first example.

At the bottom you can see data that are skewed, here skewed to the left. With skewed data we can apply a log transformation, or sometimes another transformation or scaling, to make it more normally distributed. So visualizing the data and understanding how they are distributed helps you choose which normalization algorithm to use.

For the spread or distribution of the data we also have other measures, like the quantiles, which tell you what the range looks like in each slice of the data, and the IQR, the interquartile range. These are important summaries. If the data are not normally distributed, we can still summarize them with a box plot, which looks like this: the line is the median, and if the data are really normal the median and the mean will be very close. The box is the interquartile range, spanning the 25th to the 75th percentile, so a quarter of the data lie above the box and a quarter below, and the whiskers then extend further out towards the minimum and the maximum.
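As a quick illustration of these summary statistics and of how a log transform tames a skewed feature, here is a small Python sketch on simulated data (not the workshop dataset):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.lognormal(mean=2.0, sigma=0.6, size=200)   # simulated, skewed "intensities"

print("mean    :", np.mean(x))
print("median  :", np.median(x))
print("variance:", np.var(x, ddof=1))
q25, q75 = np.percentile(x, [25, 75])
print("IQR     :", q75 - q25)                      # interquartile range

plt.boxplot([x, np.log10(x)])                      # raw vs log-transformed
plt.xticks([1, 2], ["raw", "log10"])
plt.show()
```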
So the box plot is a very good summary of the data, and if your data look like this you can be fairly comfortable that they are roughly normally distributed. It is quite useful; sometimes people also plot the individual dots on top of the box plot to make it more informative. It is a very good visualization, so make sure you can interpret a box plot intuitively, to help guide yourself on which methods to use.

The mean and the variance are the two important parameters: the mean summarizes the central tendency and the variance summarizes the spread. If you compare two populations where the means are quite different and the variances are very narrow, they are clearly two different things, as in the top-right panel: the mean summarizes each group well, each group is very tight, and the p-value will be very close to zero, very significant. But with the same difference in means and a much wider spread, the groups overlap, and some samples sit in a fuzzy region where you are not sure which group they belong to. That is reflected in the p-value, which will not be as significant. This is why we care about the mean and the variance: they are what matter for detecting the difference, and the statistics just summarize them into measurements that reflect these distributions.
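Here is a quick numerical illustration of that point: two simulated comparisons with the same difference in means but different spread (the numbers are made up, and a plain two-sample t-test stands in for the general idea):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
diff = 1.0                                     # same difference in group means
tight_a = rng.normal(0, 0.3, 20); tight_b = rng.normal(diff, 0.3, 20)
wide_a  = rng.normal(0, 2.0, 20); wide_b  = rng.normal(diff, 2.0, 20)

print("narrow spread p =", stats.ttest_ind(tight_a, tight_b).pvalue)  # very small
print("wide spread   p =", stats.ttest_ind(wide_a, wide_b).pvalue)    # much larger
```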
Why do we care so much about the normal, Gaussian, distribution? Because it shows up in almost all biological and physical measurements: once we collect a large enough number of data points, perhaps after some transformation, we find they are approximately normally distributed. That is quite an amazing discovery. The normal distribution is not only common in biology and the life sciences, it is also computationally very efficient to work with. For example, take the heights of a class: whether you plot 30 or 40 people or 100 people, you will see something roughly normal; there are a few very tall outliers, but the majority follow this distribution. There is a formula for the normal distribution that looks complex, but we never need to memorize it because the computer handles it very efficiently, which is a big advantage. However, most of the time with omics data, especially studies that mix different populations, disease states, or developmental stages, we do see distributions that are not normal. Bimodal distributions are quite common: with RNA-seq of mixed cell populations, each population is more or less normal, but mixing two populations gives you two modes, which is intriguing when you see it. Skewed distributions are also quite common; in raw metabolomics data a lot of features are skewed, and we need a log transformation. Data normalization really tries to make the data more normally distributed, because a lot of downstream analyses rely on that assumption and we cannot feed them raw data directly; we want to feed the algorithms the right data so they are happy and the results are more reliable.

So this is the normalization step. We call it normalization, but a lot of the time it also includes batch correction and so on; here we focus on transformations. The common approaches are scaling the distribution and a log transformation, and the log is very common: for almost all metabolomics data, blood concentrations and so on, after the log transform the data look much more normal, so it is usually your first line of choice. Do the log first; do not try anything fancy at the beginning. Keep the data close to raw, start with simple steps, and only gradually increase the complexity, because the more transformations you layer on top of each other, the harder the result is to interpret, and you also introduce overfitting issues, which we will discuss.

Here is real data, and with only 10 or 20 samples there is no guarantee of a beautiful normal distribution, but you can see it gets better after the log transform, so overall the log transformation is a safe bet; you do not want to work directly on the raw data. One reason MetaboAnalyst was very popular in its early days is that it supports a very large number of data normalization methods. Why? Because there are so many different types of data and normalization methods, and many publications claim that one or another is better or more suitable. To accommodate different needs, we tried to capture the different types and implemented them so people can choose. It may seem like a lot, but you always start from the log and the very basic ones, perhaps with some centering; if you do have internal standards, or something similar built into your design, choose the corresponding option. The options are not that complicated; every one has a reason behind it, and some people really like a certain normalization because it matches their study design. If you have a very general dataset, just start with log, look at the result, then try another one and see how it looks. There are no hard rules across all data; normalization really depends on the instrument and the data.
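As a minimal sketch of the "log first, then scale" idea, here is what those two steps look like on a simulated peak table. This mimics the concept, assuming auto-scaling as the scaling step, and is not MetaboAnalyst's exact implementation.

```python
import numpy as np

def log_transform(X, offset=1.0):
    """Log-transform intensities; a small offset guards against zeros."""
    return np.log10(X + offset)

def auto_scale(X):
    """Mean-center each feature and divide by its standard deviation."""
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# fake peak table: 20 samples x 50 features, skewed raw intensities
X = np.random.default_rng(2).lognormal(3, 1, size=(20, 50))
X_norm = auto_scale(log_transform(X))

# after auto-scaling, each feature has mean ~0 and standard deviation ~1
print(X_norm.mean(axis=0)[:3], X_norm.std(axis=0, ddof=1)[:3])
```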
So, after data QC and normalization the data are clean and normalized, and this is when we try to understand them, using simple statistics, more advanced machine learning, and visualization; this is where you will spend a lot of your time. Before we really get there, we need to understand p-values. Everybody thinks they understand p-values, but I receive a lot of questions from users asking how to interpret them and when a difference matters. The p-value is critical and helps us simplify our understanding of the data, and people always ask: can we just use p-values to make our decisions? I always hesitate to give that answer, because in the end the biology really matters, the science matters; the stats just help you simplify and summarize the data.

Over-reliance on p-values has its pitfalls, and there is actually a recent movement warning about exactly this: always be cautious about over-reliance on p-values, and always consider your study, its context, and the other evidence. If we really move into multiomics, we often just cannot get enough replicates to meet the p-value requirements; we may have a lot of evidence pointing to the same thing, logically convincing and meaningful, that simply cannot be captured by p-values, which were designed for settings with many replicates and repetitions. That said, we still need to understand how to interpret p-values and how to use them to help us.

Overall, we measure a subset of samples, and from that subset we want to extrapolate to the whole population. Biomarkers are a good example: you find them in a group of patients and you want them to work for the whole population. Unless we measure the whole population we can never be sure; we are using a subset of data to extrapolate, so there is a probability attached. How large is that probability? It depends on what we sampled: if the values are very tight, we have high confidence; if they are very spread out, we have low confidence. This is what I mentioned about the spread of the population and how heterogeneous or homogeneous it is. We can never say anything for sure, so we need a measure of this uncertainty, and the p-value gives us such a measure of confidence. That is how p-values are normally interpreted.

In the commonly used frequentist framework, for tests like the t-test and ANOVA, the p-value is defined as the probability of observing a result at least this extreme just by chance. Basically: if we repeated the experiment enough times and the signal were not real, what is the chance we would get this result purely at random? If that chance is very, very low, we think we can ignore that possibility, we think what we observed is likely to be real, so we reject the null hypothesis and conclude there is an effect. That is the traditional p-value, and most of the time that is how we interpret it. But how do we calculate p-values, given that we do not actually repeat the experiment many times? If the data follow a normal distribution, we can use that distribution to get the p-value directly. In the old days you would look it up in a table from the mean and standard deviation; nowadays the computer gives it to you immediately. That is perfectly fine if our data really fit the assumption and we are comfortable with it.

Here I want to introduce another quantity called the empirical p-value. This is important because some advanced features in MetaboAnalyst calculate empirical p-values. The empirical p-value tries to address the issues with model-based p-values, such as the assumption of a normal distribution, because we know the data are sometimes not normally distributed, and often we do not even know what the distribution is.
Can we compute the p-value from the data themselves, rather than assuming normality? We know reality is complex, and assuming normality makes a lot of people uncomfortable when the plotted data clearly do not look normal, even after trying different normalization approaches. Nowadays we have powerful computers and can easily do the simulation: we can derive the null distribution from the data themselves and calculate the p-value from that. Instead of model-based, we call this the empirical p-value, because it is based on the data.

Our null hypothesis is that there is no effect: treatment and control are the same, the treatment does nothing. If we accept this, then we can shuffle the labels and see whether the difference stays the same. It is very simple: if there is no effect, the class labels are meaningless, treated or not treated is the same, so we can shuffle the labels, calculate the difference, and compare it with the original. We can repeat this many times, thousands or even millions of times, and then see how different the original data are from the shuffled data. Conceptually it is simple; doing it by hand would be hard, but for a computer it is easy. I just hope you find the concept simple enough to interpret the empirical p-value intuitively.

Here is the example. For case versus control, in the original data the difference in means between case and control is about 0.54. If we assume there is no effect, we just shuffle the labels. Originally, samples 1 to 9 are the cases and samples 10 to 18 are the controls, with the case mean, the control mean, and their difference shown below. After shuffling, samples such as 15, 16, 17, and 13 become cases, and cases and controls are mixed. This is permutation number one; we shuffle at random because, if there is no effect, the difference in means between "case" and "control" should be similar to before. We can do this hundreds, thousands, millions of times, each time calculating the difference, and then plot the differences. We also know the original difference between case and control, before any permutation. The differences between the shuffled group means come out distributed roughly like a normal, which is not surprising. If we talk about a 5% level, then if you repeat 100 times, about five times by chance you will see a difference bigger than that cutoff. And where is the original difference located? Way out here. Out of roughly 1,000 permutations, nearly all fall below it and only a handful are higher, and this is the original one, so we can be really confident there is an effect.
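Here is a small sketch of that permutation procedure in Python. The group sizes and values are simulated, and the add-one adjustment at the end matches the pseudo-count idea discussed below.

```python
import numpy as np

rng = np.random.default_rng(3)
case    = rng.normal(1.0, 1.0, 15)           # e.g. samples 1-15
control = rng.normal(0.0, 1.0, 15)           # e.g. samples 16-30
values  = np.concatenate([case, control])
labels  = np.array([1] * 15 + [0] * 15)

observed = values[labels == 1].mean() - values[labels == 0].mean()

n_perm, count = 10000, 0
for _ in range(n_perm):
    shuffled = rng.permutation(labels)        # null hypothesis: labels don't matter
    diff = values[shuffled == 1].mean() - values[shuffled == 0].mean()
    if abs(diff) >= abs(observed):
        count += 1

# add-one so the empirical p-value can never be exactly zero
p_empirical = (count + 1) / (n_perm + 1)
print("observed difference:", observed, " empirical p:", p_empirical)
```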
Yes, a question? Okay, the question is how many permutations we should do. The answer is: the more the better, but it depends on how big your sample size is. If you have six controls and six diseased samples, you can exhaust all the combinations fairly quickly, and you will not get a fine-grained distribution of p-values even if the effect is strong, just because you have so few samples. Once you have more samples, random shuffling gives so many different combinations that you get a very rich distribution like the one shown here; this example is about 14 versus 14 or 13 versus 13, which is okay. If you only have 10 versus 10, you will not get a very good distribution, simply because the sample size is too small. Also note that with 1,000 permutations the most precise p-value you can report is 0.001; for smaller values you might need 10,000 or more. So it is really up to you: if you want to claim a very small p-value, you need to run more permutations.

Okay, the follow-up question is how many permutations you need to get a sufficiently significant p-value, and there are already some hints in what I just said. With an empirical p-value we can never get exactly zero, even if the data are large: if you do one million permutations and never see a permutation with a larger difference than the original data, you can only say the p-value is less than about one in a million, because the very next permutation might still come out higher. So we can never say zero, but we can be very confident the p-value is below that number. Sometimes people add one as a pseudo-count, as done here, so the p-value never comes out as exactly zero, because a zero can cause computational issues. But overall, if the data are reasonably large, you have run more than 1,000 permutations, and the original still stands out, we usually have a good feeling that this is a very promising result.

Okay, the next question was how the threshold was calculated. Here we have simply plotted the usual p-value threshold of 0.05, which means that if you repeated the experiment 100 times, about five times you would get a value this large or larger purely by chance. We plot this 0.05 cutoff, and you can see about 5% of the permutations land above it; out of 1,000 permutations that is about 50 times, and the original data sit far beyond it. So this is how the traditional p-value threshold appears within the empirical p-value calculation: if we accept the same 0.05 cutoff, this is where it sits, and the interpretation of the empirical p-value and the regular p-value is the same.

So how should we think about empirical p-values? They are robust, because there is no distributional assumption, and if there are hidden correlations in the data they are handled automatically, because the permutation is driven by the data themselves. What are the disadvantages? They require a fairly large number of samples, say at least 20 to 25 per group; below that you do not have enough combinations. They are computationally intensive, although with today's computers that is not a big issue. And the last point to keep in mind: if your data can actually be normalized so that they fit a normal distribution,
then a parametric approach is much more sensitive: you will get stronger p-values than with a permutation-based approach, simply because parametric statistical models are always more powerful than non-parametric ones, as long as the distribution really fits the model. You should be clear on that.

Now, multiple testing issues. With omics data we have hundreds to thousands of variables and we test them one by one. Each test carries a chance of a "significant" result purely by chance, so with a 0.05 cutoff and 10,000 features (peaks or genes), about 500 will come out significant just by chance, because that is how the p-value is defined: repeat often enough and you will see it, and here we effectively repeat the test 10,000 times, once per feature. This really bothers statisticians, because when the p-value was defined nobody was thinking about omics data, and now this is exactly what we do: by chance alone we will get a very high number of significant p-values. How do we adjust for that? The traditional answer is the Bonferroni correction: make the cutoff more stringent by dividing the threshold, 0.05, by the number of tests, so with 10,000 genes the cutoff gains several more zeros. In theory this looks fine, but in practice, if we do that, we often have nothing left to work on; it is too stringent, and most clinical data and most experiments simply do not have enough power. Statistics has its own operating rules, but we are biologists, and we know many of these signals can still be real and meaningful; a cutoff that stringent would kill our research. Still, we do need to control this false discovery, this random chance. A popular alternative is the false discovery rate (FDR) with the Benjamini-Hochberg method, abbreviated BH. In MetaboAnalyst, unless stated otherwise, the false discovery rate is computed by the BH method. The interpretation is easy: an FDR of 0.05 means that of whatever you declare significant, about 5% are expected to be false positives. That is much more acceptable.
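For reference, here is a compact sketch of the Benjamini-Hochberg adjustment itself. It is a generic implementation written for illustration, not code taken from any particular tool.

```python
import numpy as np

def bh_fdr(pvals):
    """Return BH-adjusted p-values (q-values) for an array of raw p-values."""
    p = np.asarray(pvals, dtype=float)
    n = len(p)
    order = np.argsort(p)
    ranked = p[order] * n / np.arange(1, n + 1)        # p * m / rank
    # enforce monotonicity from the largest rank downwards
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    q = np.empty(n)
    q[order] = np.clip(ranked, 0, 1)
    return q

raw = np.array([0.0002, 0.01, 0.03, 0.04, 0.20, 0.50])
print(bh_fdr(raw))   # features with q < 0.05 are significant at FDR 5%
```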
So, before we move on to the multivariate part, let us summarize what we usually do, and what you should always do: start with normalization, then try univariate analysis, because it is simple, robust, and meaningful — t-tests and ANOVA for two groups or multiple groups — then apply a multiple-testing adjustment, and then check whether the number of significant features makes sense. That is the basic baseline check. Now, how do we move on? How do we visualize all the samples and think about them together? This is where we move to machine learning and multivariate statistics, and a lot of what we try to do borrows concepts from computer science.

Of course, statistics has its own set of tools for high-dimensional data, but much of traditional multivariate statistics still does not fit the current omics scale, because we just cannot get that number of replicates. In the literature this is called the "n much less than p" problem, where n is the number of samples and p is the number of features. Most multivariate statistics has degrees-of-freedom requirements: to estimate the coefficients or parameters, you need roughly two or three times as many replicates as parameters (I do not remember the exact empirical rule), which is almost impossible here; think about 25,000 genes and how many samples you would need, which is really hard to get. So current practice has gradually moved towards pattern recognition, machine learning, and dimensionality reduction, such as PCA or PLS-DA, which come from machine learning and chemometrics.

Here is the big picture of where we stand: artificial intelligence and machine learning, and within that, clustering, classification, and regression. We already talked a little about hypothesis testing, because all the p-values we discussed really belong to hypothesis testing, and I assume everybody is more or less comfortable with t-tests and ANOVA. Parameter estimation is the really traditional statistics; our p-value discussion mainly focused on that side. What we are going to focus on now is clustering and classification. There are many areas of machine learning we will not touch, simply because we do not have time, but these are the basics to understand before moving on to more advanced topics, and there are plenty of tutorials for the future.

For machine learning, keep in mind that we now have three categories; traditionally there were only two, unsupervised and supervised. Unsupervised learning is really about understanding the data themselves: we do not consider the class labels, we do not consider who is control and who is disease, we just want to see how the data are distributed and whether there are patterns, which samples are similar to each other. It is about the inherent data structure, about data mining. Supervised learning tries to find the features or patterns in X that correlate with the class labels: you tell the algorithm to find the features in X related to Y. So supervised learning uses both X and Y, while unsupervised learning looks for interesting patterns within X alone and ignores Y entirely; that is the main difference between them. The last category, which people have started talking about, is reinforcement learning. This is not a simple one-step analysis; it is multi-step, task oriented, and tries to maximize a reward: how to drive safely, how to play a game. Because it is multi-step and very complex, with a lot of exploration and strategy building, it is the kind of learning behind things like playing Go or driving a car; there is a lot of literature there, but we are not going to touch it. We will really just look at unsupervised and supervised learning.

So, unsupervised data analysis. We will talk about two things: one is clustering and the other is dimensionality reduction.
Clustering tries to find the samples or features that are most similar to each other. If they are similar, we can simplify the data: if a set of features are all similar to each other, we can almost use one representative to summarize them, so clustering helps us reduce the dimensionality by treating each cluster as a unit. Dimensionality reduction tries to find new directions onto which we can project the data, for example principal component analysis, which we will focus on later. So here we first look at how to do clustering and then at dimensionality reduction.

Clustering is very simple: we find the samples or features that are most similar to each other and merge them; once merged, they become a new unit, and we look for the next most similar pair among what remains. We repeat the process until we end up with a number of clusters or groups that we can understand. Clustering really reduces the samples or features to blocks whose members are most similar to each other. In MetaboAnalyst, and in the literature in general, the two most commonly used clustering approaches are these. The first is hierarchical clustering, which starts from each individual sample or feature and builds all the way up until everything is merged; during the process the hierarchy grows gradually from the bottom to the top, and you will see a lot of beautiful figures generated this way. The other is k-means, a partitioning method: you give it a number, say "I want the samples in three groups", and it will try to do that. Some clustering algorithms even allow fuzzy membership, so one object can potentially belong to two clusters; that is slightly more advanced, but many things are possible.

We will touch on hierarchical clustering first. It is very simple: find the most similar objects, merge them, and repeat until done. We start from the bottom, from every sample or every feature, measure similarity, and keep going until we are finished. It is very intuitive, just tedious if you had to repeat it by hand, but computers are really good at this. (Okay, I see there is some formatting issue from the slides.) Overall, to do the clustering we need two measurements. The first is the similarity between samples. Here we assume we want to find the samples that are most similar to each other, but of course you can switch samples and features: you can cluster the data by samples or by features, both are fine; here we use the samples. We measure how similar two samples, and later two clusters, are, merge the closest pair, and after merging we repeat the whole thing until everything is finished. How do we calculate similarity? The easiest way is to compute the Euclidean distance.
The Euclidean distance treats each sample as a vector: sample A has all its metabolite values, sample B has all its metabolite values, and you compare compound by compound between sample A and sample B and combine the differences into the Euclidean distance. This works in many dimensions, not just two; the number of dimensions equals the number of features, and it is very easy for a computer to calculate. The other common measure is the Pearson correlation. For the Pearson correlation you centre the values and look at how each profile varies with respect to that centre, so it captures how two profiles change relative to each other. The Pearson correlation can be positive or negative: positive means the two change in the same direction, negative means they change in opposite directions. Completely synchronized profiles and completely opposite profiles are both meaningful; random, uncorrelated change is not. Changing together, or in opposite directions, gives you a strong hint of biological relevance: either co-regulation, or maybe one inhibits the other, so these are really meaningful signals.

After we calculate the similarity and merge objects into a cluster, we need the distance between that new cluster and the remaining objects or clusters: how do we link clusters? There are different, equally reasonable, choices. Single linkage uses the minimum distance between members of the two clusters: you pick the pair of members, one from each cluster, that are closest to each other, and use that distance. It is very common, so you will see this parameter in MetaboAnalyst; do not be put off by the name, single linkage just means using the closest data points. If you choose it, you tend to get long, chain-like clusters. Complete linkage uses the furthest members of the two clusters as the distance; it is also widely used, and sometimes different linkages give very different patterns, some more biologically meaningful than others. Another one is very intuitive, called average (group) linkage: instead of using any particular member, you calculate a centre for each cluster and use the distance between the two centres, so theoretically the average sits in between the other two.
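Here is a minimal sketch of hierarchical clustering with the two distance measures and different linkage rules, using SciPy on simulated data (the groups and values are made up):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 1, (10, 30)),       # "group A" samples
               rng.normal(2, 1, (10, 30))])      # "group B" samples

d_euclid = pdist(X, metric="euclidean")          # Euclidean distance
d_corr   = pdist(X, metric="correlation")        # 1 - Pearson correlation

Z = linkage(d_euclid, method="average")          # try "single" or "complete" too
clusters = fcluster(Z, t=2, criterion="maxclust")
print(clusters)                                  # cluster membership per sample
# dendrogram(Z)  # draws the tree used to order the rows/columns of a heat map
```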
Here is hierarchical clustering visualized across the samples: if we start by merging A and B, which are most similar, A-B becomes one unit, then C merges in, then D, and we gradually merge everything iteratively and build a hierarchical heat map across the samples. If you look at the samples, say the block at the bottom is more like the cancer group and another block is more like the healthy group, you will see a strong signal to interpret; within that you might spot a small subcluster that looks like an immune response, or one that is clearly metabolic. You feel excited because the genes or metabolites show patterns that line up with the sample labels. The computer can do a lot of fancy things, but in the end it is up to you to interpret whether a pattern is meaningful and to write a good paper from it. If you do not have enough biological background, you can completely overlook these patterns and miss important discoveries. So what I want to say is that statistics and machine learning help you, but nothing can replace you: you have to observe, and spend time with your data.

K-means is the other clustering method most commonly used in machine learning, and in k-means, k is a number you have to supply. There is also k-medians clustering, where you use the median instead of the mean, and other variants; k-means itself uses the mean as the centre of each cluster. How does it start? First you give the number k. For example, with k equal to two we want two clusters: the algorithm places two seeds in the data, each data point is assigned to the closer seed, and those assignments form two new clusters. Once you have the new clusters you update the centres, then reassign the points, and repeat again and again until nothing changes; that is k-means. K-means is very efficient, but people are often uncomfortable with the question: how do I choose the initial number k, how do I know? One way is to run hierarchical clustering first: you look at it visually and think, huh, probably two groups, or maybe five. Sometimes you actually believe there are five groups because you have five diagnoses, and you just use that. The other way is computational: you try different numbers of groups, say from two to fifteen, and compute the homogeneity within each cluster; maybe at five or six groups you find that each cluster is very pure and tight and the clusters are very different from each other, and that measure of purity tells you, in a data-driven way, that k should be five or six. So yes, there is some subjectivity in setting k, but you have different ways to estimate it, and that flexibility is actually a good thing.
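Here is a small k-means sketch that also tries several values of k and prints the within-cluster tightness (the data-driven "try k from 2 to 15 and look at cluster purity" idea mentioned above); the data are simulated:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0, 0.5, (20, 10)),
               rng.normal(3, 0.5, (20, 10)),
               rng.normal(6, 0.5, (20, 10))])    # three hidden groups

for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # inertia_ = within-cluster sum of squares; look for where it stops improving
    print(k, round(km.inertia_, 1))

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print(labels)                                    # final cluster assignments
```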
If there are no questions I will continue; the next topic is principal component analysis. PCA is probably the most widely used multivariate statistical method in metabolomics. It projects the high-dimensional data into a few dimensions that capture the most variance in the data. That is what it tries to do: capture the most variance, with the implicit assumption that the largest variance is the biologically important part. If you have a well-controlled study, say E. coli or C. elegans or another model species where everything is controlled and the only thing that changes is the experimental factor, PCA works very well, because the variance in the data is driven by that experimental factor.

That assumption is something to keep in mind: if your data are more observational and there are other factors, the largest variance is not necessarily related to your factor of interest. So how does PCA actually work? You can always think of it as projecting the data. Here we show a bagel projected from two different sides: from one side it looks like an O, from the other it looks like a hot dog. Now, if you have to go from 3D to 2D, which view do you keep? The O-shaped view is more informative about the bagel than the hot-dog view, so going from three dimensions to two we keep the projection that captures the main characteristic of the bagel, the O shape. Keep in mind that once you choose a lower dimension with PCA, you lose information: you do not keep 100% of the original data, but you focus on the part that is most variable, in this case the O. We lose the hot-dog direction, but that is unavoidable; with high-dimensional omics data there is always something we have to give up, and we focus on what is most interesting.

You can run PCA without knowing the internal details, as long as you can interpret the result, but if you really want to know, here they are. You can use R, or write your own code, to linearly transform the data so as to capture the most variance in PC1, then subtract that component, compute PC2 with the most remaining variance, then PC3, and so on. The math behind PCA is not that deep; it just looks intimidating, but it is doable. Early on I actually spent the time to write PCA manually to make sure I understood it, which made me feel comfortable, but in the end the interpretation is the same. Each principal component is calculated by combining a set of coefficients with the original features: the coefficients are the loadings, and the new values are the new coordinates, the scores. PCA can be computed on two matrices, the covariance matrix or the correlation matrix; you do not need to understand all the details, but if you do auto-scaling you are effectively working with the correlation matrix, and if you centre the data the principal component scores will be centred at zero. If you did not centre or scale, the origin may not be at zero, but what we care about is the relative positions of the samples and features, so whether the origin is exactly zero is not that critical. Principal component analysis is all about the relative positions of the samples with respect to each other.

This is the score plot from PCA, and it tells you how similar the samples are to each other. And this is the loading plot; the loadings are the coefficients that get multiplied with the features, and they are what place the samples in this pattern. A loading in this direction is positively correlated with the samples located here, and drives them there, and a loading over here is positively correlated with the samples over there. This is why PCA is so popular: it is very intuitive. The scores are about the samples, the loadings are about the features underlying them, and the same direction means positive correlation. A feature with a high loading here will have a high concentration in this blue group, and this other feature will have a high concentration in that group, because they drive the separation: each is most abundant in the group it points towards, and typically least abundant in the opposite group. So this is PCA; I hope you feel comfortable with it, because it is intuitive, useful, and statistically sound. It is used for getting an overview, for outlier detection, and for looking at relationships between samples or variables, and you can almost swap the roles: samples close to each other are similar to each other, and features close to each other are similar to each other. You can rotate the data table and compute it either way; which side becomes the scores and which the loadings really depends on your purpose.
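Here is a minimal PCA sketch showing where the scores and loadings come from. The data are simulated, and the auto-scaling step is written out explicitly; it is a sketch of the idea, not MetaboAnalyst's exact code.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(6)
X = rng.normal(size=(30, 100))
X[:15, :5] += 3.0                                   # first 15 samples differ in 5 features

Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # auto-scaling (correlation-matrix PCA)
pca = PCA(n_components=2).fit(Xs)

scores   = pca.transform(Xs)          # sample coordinates for the score plot
loadings = pca.components_.T          # feature coefficients for the loading plot
print("explained variance:", pca.explained_variance_ratio_)
print("scores shape:", scores.shape, " loadings shape:", loadings.shape)
```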
PCA is unsupervised, so if your experimental factor really causes the most variance in the data, PCA will capture it: you will get a good separation and a good interpretation of which features changed the most and drive the separation, and you can almost write your paper, apart from the functional interpretation we will discuss in the next module. But a lot of the time it is not that easy. For a lot of data — clinical data, observational data, field-study data — you have little control, you do not get a good separation, and you may not have a large sample size. Then you can use supervised learning, which tries to find the features that are actually related to the groups you are interested in, so it considers both X and Y. PLS-DA is similar to PCA; it is the supervised version of PCA, and what it tries to maximize is not the variance of X but the covariance between X and Y. Because it is supervised, PLS-DA can essentially always produce some separation between the groups you are interested in. Why? Because you have so many features that you can always find some that seem related. As we mentioned, in omics data some features will be randomly correlated with your group labels just by chance; because the data are so high dimensional, PLS-DA can always find such features and produce separations that make you happy, but they are not necessarily real.
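As a hedged sketch of what PLS-DA is doing: scikit-learn does not ship a dedicated PLS-DA class, but fitting PLS regression on a dummy-coded class label is a common way to mimic it, and that is what this illustration does on simulated data.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(7)
X = rng.normal(size=(40, 200))
y = np.array([0] * 20 + [1] * 20)          # dummy-coded class label
X[y == 1, :10] += 1.0                      # small real signal in 10 features

pls = PLSRegression(n_components=2).fit(X, y)
scores = pls.transform(X)                  # X scores: sample coordinates for the score plot
pred = (pls.predict(X).ravel() > 0.5).astype(int)
print("apparent (training) accuracy:", (pred == y).mean())
# Caution: with 200 features and 40 samples this training accuracy is optimistic --
# exactly the overfitting risk described above; judge it with cross-validation.
```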
Here is a result on a difficult set of samples. Going from PCA to PLS-DA, the separation clearly gets better. This is data from patients, from a human study, so it is very challenging; with PCA we do not see a clear separation, and that is very typical. When we apply PLS-DA, we do get a better separation. To use PLS-DA you do need a rather large number of samples; here you can count the dots, probably 60 or 70, which is usually a decent number. If instead you have six samples, three controls and three cases, and PLS-DA shows you a pattern that makes you feel good, I personally have zero confidence in that: applying a supervised method to three versus three just does not make sense. Once you have more samples, like here, you have a lot more leverage to test it.

Here is what happens with clinical samples and all their complexity. You see a good separation in the PLS-DA score plot on the right-hand side, and it looks similar to PCA, but now component 1 explains only 7.9% of the variance in X, while component 4 actually explains more of the variance in X. This happens in PLS-DA precisely because PLS-DA does not maximize the explained variance of X; it maximizes the covariance between X and Y. So the first component may explain less of the X variance than component 4, but it explains the most covariance between X and Y; the percentages shown refer only to X. I get a lot of people asking whether this is valid, and I should tell you it is valid, because the method maximizes covariance, not variance in X: the first component in PLS-DA is the most predictive one. You may never see a plot quite like this yourself, but some users' data are quite interesting, so if you do see it, do not be too surprised.

I keep mentioning overfitting with PLS-DA, because it is supervised and the data are high dimensional, so overfitting is very common. In machine learning there are whole books on overfitting; what does it mean? The counterpart is underfitting. The orange dots are your data and you want to fit them. A good fit, the blue line here, follows the trend with random variation around it, and that is fine. If you fit a straight line with very few parameters, you will underfit, because the model misses real structure and leaves a lot of variance unexplained. But if you try to fit every single dot and use a lot of parameters, you will overfit, because in forcing everything to fit you are also fitting noise. Overfitting is very common when you use very advanced algorithms, especially nonlinear ones like neural networks with many parameters: you can make every data point fit the model, but you are capturing noise rather than meaningful patterns, and that is what overfitting means. It is typical in machine learning, and we need to be cautious about it. There are two approaches to deal with it. One is cross-validation: you split your data into different sections, use one part to train and another to test. For example, with 300 samples we use 200 for training and 100 for testing, then rotate so a different 200 train and 100 test, and do that three times.
If the model is robust, good training performance should be matched by good testing performance, because the model built on two-thirds of the data captures a real signal that also works on the held-out data. But if the training accuracy is near 100% and the testing accuracy is only 50% or 40%, it really means the model has captured noise. So it is very clear: a huge difference between the training and testing results means the model is overfitting. Cross-validation is standard in machine learning, and as long as you have enough samples it is easy to do.

If you have a small number of samples and you run cross-validation, sometimes the result looks good and sometimes it does not. Why? With few samples, a random split may happen to put all the cases in the training set and all the controls in the test set; training on cases and predicting controls will of course give very bad performance. This is unavoidable with random splitting when your data is small or unbalanced. What you can do is force balanced (stratified) sampling, so each split draws proportionally from the cases and the controls; that is what we do, and it gives you a much more stable result. The other option is leave-one-out cross-validation: if you have a small number of samples, you train on all of them except one and predict that one. With 20 samples, you do it 20 times, training on 19 and predicting the remaining one, so it is more stable. Overall, machine learning favours something like 5- or 7-fold cross-validation; if you have hundreds of samples, 10-fold cross-validation will be fine and you will get a very stable result. But if you really only have 10 or 20 samples, you do not have much choice: to get a stable result you use leave-one-out cross-validation.

What results do you get? For PLS-DA we report a few measurements. One is prediction accuracy, basically how often you predict correctly. But because PLS is traditionally a regression method, we also report R-squared and Q-squared: R-squared is the variance captured by the model on the training data, and Q-squared is the cross-validated version. Q-squared is more robust, so we use Q-squared to guide our choices.

We talked about cross-validation; the other approach is permutation, and we already touched on using permutation to compute empirical p-values, so you should be familiar with what a permutation test tries to do. Here, instead of testing a mean difference, what we test is the class separation after permuting the sample labels. We permute the class labels, run PLS-DA again, and count how many times the permuted separation is as good as, or better than, the original one. This is exactly analogous to counting how many times a permuted mean difference exceeds the observed one, and from that count we compute an empirical p-value.
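A small sketch of that label-permutation idea follows: compare the real cross-validated score against scores obtained after shuffling the class labels, and report how often the shuffled labels do as well. This uses scikit-learn's permutation_test_score on toy data with a logistic regression stand-in; it illustrates the principle rather than MetaboAnalyst's exact PLS-DA permutation procedure.

```python
# Permutation test sketch: empirical p-value from shuffled class labels.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, permutation_test_score

X, y = make_classification(n_samples=60, n_features=200, n_informative=5, random_state=2)

score, perm_scores, p_value = permutation_test_score(
    LogisticRegression(max_iter=1000), X, y,
    cv=StratifiedKFold(n_splits=5), n_permutations=100,
    scoring="accuracy", random_state=2)

print(f"observed accuracy {score:.2f}, empirical p-value {p_value:.3f}")
# If permuted labels separate almost as well as the real ones, the "separation"
# is probably a high-dimensional artefact rather than real signal.
```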
So with the original labels you get the strongest separation; with permuted labels PLS-DA still tries to please you and produces some separation, but it is not as strong as the original one, and that gives you confidence. Overall this adds another layer of assurance: the permutation test tells you whether the separation in your data captures a real signal, and cross-validation tells you how good it is, through the prediction accuracy, R-squared and Q-squared.

For those of you who really want to understand what is computed and how to interpret it, we provide citations in almost every figure; if you want the details you can click through to the literature references. I am not going to go into every detail, because some of you are comfortable with the current level of interpretation and some of you need something very precise. There is also a forum where people discuss these things in more detail.

Another result is the VIP score. VIP stands for variable importance in projection (you will also see other abbreviations). It is a weighted sum of squares of the correlations between the PLS-DA components and the original variables, and it is claimed to summarize a feature's contribution better than just using the loading plot. It is widely used in publications, so we provide it in the PLS-DA results. Typically, features with VIP greater than one are considered potential biomarkers. Next to the score there is a mini heat map showing the relative abundance in each class; if you have ten classes you get ten mini heat maps, which gives you some information about where each feature changes.

I think I have about ten more minutes, so I should be able to finish on time. The last part is performance measures and ROC analysis. We have already touched on the fact that supervised learning is very susceptible to overfitting: the model captures noise, performs very well on the training data, but performs poorly on real data it has never seen. That means the model captured some random signal and is not as good as you claim. If you report 95% but on real data it is actually 65%, that is overfitting, and if people invest money in that, 65% is not very useful in a clinical setting. So we need more robust measures of classification performance.

The simple one, for balanced data, is accuracy: if we predict 9 out of 13 correctly, that is 69% accuracy. The other is the error rate, which is one minus the accuracy, so 31%. Both are common; they are quick, simple and intuitive. This is all fine, but it stops working when you have imbalanced data, which matters a lot in clinical work, because only a few people are actually sick, as with HIV or cancer. If you screen a general population for cancer, say 1,000 people of whom only five have cancer, and I simply predict that everybody is healthy, I am 99.5% accurate, but I have totally missed the point, because the goal was to detect those five cancer patients. So in the clinic they use different measures.
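This tiny sketch makes the imbalance point explicit: with five true cases among 1,000 people, the lazy "everyone is healthy" classifier scores 99.5% accuracy while detecting none of the patients we care about. The numbers are invented for illustration.

```python
# Why plain accuracy is misleading on imbalanced data.
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

y_true = np.array([1] * 5 + [0] * 995)   # 1 = cancer, 0 = healthy (5 cases in 1,000 people)
y_pred = np.zeros_like(y_true)           # predict "healthy" for everyone

print("accuracy   :", accuracy_score(y_true, y_pred))   # 0.995
print("sensitivity:", recall_score(y_true, y_pred))     # 0.0 -- no patient detected
```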
They use not just accuracy, but true positives, false positives, true negatives, false negatives, and finally sensitivity and specificity. This is very important if you have a medical background or want to communicate in a clinical setting, because doctors talk in terms of sensitivity and specificity. What is a true positive? If you are actually positive and you are predicted positive, that is a true positive. A true negative is someone healthy whom you predict to be healthy. In the clinic, positive and negative have a clear meaning, because you are either sick or healthy; in between sit the false positives and false negatives. Sensitivity measures how well you can detect the patients among all the actual patients, and specificity measures, among those you call healthy, how many are truly healthy.

All of this matters for diagnosis. As I showed, when two populations are not clearly separated, their distributions spread and overlap, and wherever you place the cutoff for calling positive or negative there is some risk of misdiagnosis, either false positives or false negatives. The measurements for this are sensitivity and specificity. Sensitivity is true positives divided by true positives plus false negatives: the denominator is all the people with the disease, and the numerator is the ones you actually diagnosed with the disease. Specificity is true negatives divided by true negatives plus false positives: the denominator is all the healthy people, and the numerator is the ones you correctly called negative. So sensitivity is how good you are at catching the truly positive, and specificity is how good you are at correctly calling the healthy people healthy. Both are critically important in the clinic.

We want to combine them, because sensitivity and specificity are two different things and we need to balance them. How do we find the right balance? One tool is the ROC curve, the receiver operating characteristic curve; the name has its own history. It tries to balance the two. If you have these two populations and you choose different cutoffs, each cutoff gives you different numbers of true positives, false positives and false negatives, and therefore a different sensitivity and specificity. Plot one point per cutoff and connect them all, and you get this curve, the ROC curve. You want to maximize the area under the curve, and somewhere along it you find the right balance of sensitivity and specificity. Of course this is the mathematically optimal balance; clinically, different errors have different costs, so you can add weights to push the cutoff toward being highly specific or highly sensitive. The default choice is mathematical, but in reality the costs differ, so you can always push it higher or lower by weighting. In clinical terms, an AUC around 95% is very, very good performance, and 85% is already very good.
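Here is a minimal sketch of building an ROC curve: sweep the decision cutoff over a single biomarker's values, compute the true positive rate (sensitivity) and false positive rate (1 - specificity) at each cutoff, and measure the area under the curve. The simulated "biomarker" distributions are made up for illustration.

```python
# ROC curve and AUC for a single simulated biomarker.
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(3)
y_true = np.array([0] * 50 + [1] * 50)                 # 0 = healthy, 1 = disease
biomarker = np.concatenate([rng.normal(0.0, 1, 50),    # healthy distribution
                            rng.normal(1.5, 1, 50)])   # disease distribution, shifted higher

fpr, tpr, thresholds = roc_curve(y_true, biomarker)    # one (FPR, TPR) point per cutoff
auc = roc_auc_score(y_true, biomarker)
print(f"AUC = {auc:.2f}")   # roughly 0.85 here; ~0.5 would be random guessing
```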
If you are lower than that, something around 70% is quite common and you can still publish it, but once you get down to the 60s and approach random guessing, it is not useful. And if you see 100%, you should suspect overfitting, because 100% in a clinical setting is very rare, and you would need many studies across centres and multiple populations to support such a claim.

There are a lot of other supervised methods, such as OPLS-DA (orthogonal projections to latent structures), support vector machines, random forests and logistic regression, which are all available in MetaboAnalyst, so if you have the chance you can explore them and we are happy to help. SIMCA is commercial and proprietary, so we do not have it, but overall, based on other people's studies and our own, these methods are not necessarily better than one another; OPLS-DA, PLS-DA, support vector machines and random forests are all well established and very robust.

Finally, the data analysis progression. This is really your guide: start with the simple things, the univariate tests. Then move to multivariate methods and think about PCA, which simply tells you whether there are patterns in the data. If you have a large number of samples and you do see patterns you want to separate, you can use supervised methods like PLS-DA, but keep in mind that cross-validation and permutation testing are important; do not jump straight to "I see patterns, I want to write a paper". Be cautious. If you only have about 10 samples, you can probably only do some univariate tests; a t-test or ANOVA is okay, and a PCA is okay. I get people asking me how they can get better separation or a lower p-value when they have six samples and want PLS-DA, and basically you cannot do it; nobody can. You need a large number of samples to do supervised classification and proper validation. So, I think I am right on time.