Good morning everyone. Today is the second day of our workshop. This morning we're going to focus on statistics, and in the afternoon we'll be on MetaboAnalyst. MetaboAnalyst is mainly for statistical and functional analysis, and a lot of its statistical concepts build on this morning's lecture, so if you have questions or are not quite sure about something, raise your hand and I'll go into more detail. In past years we sometimes went too low-level and sometimes too high-level, so this year we've adjusted: as Andrew mentioned, we're not going to go into deep detail on distributions; we'll shift slightly toward machine learning, and hopefully that will be better received. Yesterday we mainly focused on the analytical part: the different instruments, collecting spectra, and using different databases to identify and quantify metabolites. From that you get a list of metabolites, or a table of metabolites with their concentrations. If you're doing untargeted analysis you probably just get a list of peaks, identified by retention time and mass, with their peak intensities, which is the standard output from XCMS Online. Today we want to show you the methods for going from that peak list or compound concentration table to significant features, pathways, and functions. Here is the overall concept: on the top left we have multivariate statistics; the top right is basically chemometrics, PCA and PLS-DA; the bottom shows ROC curves, which help you identify robust biomarkers and evaluate performance; and on the back end are the KEGG pathways, where you can map your compounds, run analyses, and visually explore the results. All of this will be covered today. Metabolomics is a relatively new field, and it follows the standard, proven practices of the established omics fields, mainly microarray and transcriptomics.
You can think of metabolite concentrations like gene expression values: they go up and down, and many are regulated by the same biological processes, so when we get to the analysis we find a lot in common, but there are also some methods that are more specific to metabolomics. We won't spend too much time on the t-test itself, but we do need to talk about the different approaches, the idea behind p-values, and how to test even when the distribution is not normal. That is a very important concept for complex data where we don't know the distribution. The other very common method, quite specific to metabolomics, is multivariate statistics like PCA and PLS-DA; these are not used nearly as widely in gene expression analysis. The module we've expanded slightly this year is machine learning concepts: how do we tell which model is doing a good job, what are the pitfalls, and what is clustering? A lot of people ask how statistics differs from bioinformatics, or machine learning from artificial intelligence. These are fuzzy concepts, and there has been a lot of progress in recent years. Regardless of how similar or dissimilar you consider them, in practice people identify themselves as doing machine learning, bioinformatics, or statistics, and there is overlap, as you can see here. Machine learning comes more from computing science, while statistics is more concerned with hypothesis testing and parameter estimation; it's more traditional. Now they are merging into data science, and the boundary between machine learning and statistics has become fuzzy. You'll see the common ground during the class, such as classification and regression analysis: a lot of statisticians use machine learning to evaluate performance, and machine learning uses statistics to estimate parameters, so the line has blurred, which is a good thing. The one thing I'm not going to talk about is artificial intelligence, which these days mostly means deep learning; it's a very hot area, and you should attend a different workshop if you're interested in that. Again, if you have questions, raise your hand. Okay, here is the general workflow for any omics analysis. You should gradually build this workflow in your mind, whether for metabolomics or, in the future, RNA-seq analysis; it's the same high-level steps. For RNA-seq you need a count table; for metabolomics you need a feature table, which could be peaks or compound concentrations. After you get your data, the first step is quality checking: does the data look normal, is anything fishy, are there outliers? Before you spend months analyzing your data, you need to make sure it looks okay and deserves your time. How do you do that? There is no single button that declares your data okay. You need to consider the experimental design and visualize your data with common approaches like box plots, scatter plots, and heatmaps; summarizing the data and looking at it from different perspectives helps you make sure it's okay.
With more experience this process gets shorter: you might spend just one hour and decide the data is fine, while people who are just starting will probably spend more time. If the data is fine, the next step is to normalize it. Omics data is highly multi-dimensional, and some metabolite concentrations are very high while others are very, very low; if you compare them directly it becomes very hard, they are simply not comparable. A small change in something like glucose can be hundreds of micromolar, while a low-concentration metabolite shows only a tiny amount of variance, but that doesn't mean it is less important. So we normalize, so the comparison becomes more equal and you can find significant features with less bias. Comparing directly, regardless of the concentrations and variances, is usually not very powerful; think of it as being like a nonparametric test: unless your data is really robust with very strong signals, you'll miss too much. So data normalization is very, very important. The last step is statistics and machine learning. If you have good quality data, properly normalized, then you can explore all the possible approaches. You should usually start simple, see whether there are interesting patterns, and then gradually move to more advanced methods. Don't go the other way, from advanced back to simple; our minds don't work that way. Gradually build up a hypothesis, check whether another method confirms it, start from the method you're comfortable with, and expand and learn from there. Also keep input and output in mind. For our lab today, the input is not a spectrum; spectral processing is a very specialized task, and we should respect what other tools like MZmine and XCMS do; they do a very good job there. If you're doing targeted analysis, you're manually doing spiking and quantification, which is a lot of work, and I don't think automation can give you really good quality input data in that case. So the input is supposed to be like an Excel spreadsheet: a matrix of numerical values, such as concentrations or peak intensities. The other thing you saw yesterday is the metadata. We want to know the class labels, especially for coloring: if I want controls in red and disease in blue, or whatever colors, the program needs to know the labels associated with the samples. So you supply the metadata, which also lets the program run a t-test on those two groups; it indicates which samples are of interest. The output is, first, what is significant: the significant features and compounds. Second, sometimes you see nothing significant but you still see overall interesting patterns; in a PCA or a heatmap you may see structure, so significance testing is not the final word, it's just one part. Clustering is also very important: even without significant genes you can still see good patterns that help you develop your hypothesis. Third, there are rules: with classification you can sometimes see which compounds are associated with the outcomes and develop hypotheses from that. And there are models: if you fit something like a linear or logistic regression model, you can actually see how the variables interact. This has become less commonly used in omics, because the models have become more and more complex and it's hard to develop gut feelings about them.
The more powerful models, like SVM and random forest, are usually black boxes: you have to trust them, and it's hard to interpret their internal workings, but they are still very valid, robust models. As for data types, I hope most of you are familiar: we have X and Y. The data itself is your X; the data labels, the metadata, are your Y. Our data is mainly continuous, because it's concentrations. RNA-seq data is counts, and a count is an integer, so it's discrete. Keep in mind that statistics treats discrete and continuous data very differently, with different models. Regarding Y, the metadata: the most robust, easiest case is binary, yes or no, one or zero, disease or control. Another is nominal, just categories like A, B, C. The last one is ordinal. Ordinal means the order is meaningful, as in a time series or a dose response: doses of 10, 20, or 100 have an inherent order, and when you do the analysis you want that order to be recognized and used in the modeling. This was less common when I studied, maybe ten years ago, but now people talk about personalized or precision medicine, collecting data for each person across different days, so time points and ordering have become more important. It's not a key focus today, but some methods in MetaboAnalyst have started respecting order, so in the demo I'll show you the places where you need to be careful about which option to select. As I already said, discrete and continuous are different: this one is continuous because it has fractions, these are concentrations; this one is discrete because the values are all integers, counts of how many times reads for a gene occurred in your sequencing. Very generally in statistics, the normal distribution is for continuous data and the Poisson distribution is usually for counts; they're different schools within statistics. I cover this just so you keep in mind such distinctions exist and statistics treats them differently. Some methods can take the order into account, but a lot of methods cannot: even if you give them ordered labels, they won't handle it. Recent versions of MetaboAnalyst have this option: "class order matters" versus "class order does not matter". "Class order matters" means ordinal: if you have a time series and you really want to respect the order 1, 2, 3, 4, keep it; if you don't, the labels are treated like A, B, C, D with no particular order. What users report is that changing the labels changes the group pattern: with the ordered option, relabeling or switching the order changes the result, whereas with the unordered option the result stays the same. This is something we'll discuss during the lab when we go through these options.
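If you want to see concretely what that distinction means, here is a minimal sketch in Python with pandas (MetaboAnalyst itself is R-based; this is just an illustration, and the dose labels are made up):

```python
# A minimal sketch (not MetaboAnalyst itself) of the difference between an
# unordered (nominal) and an ordered (ordinal) class label in pandas.
import pandas as pd

doses = ["low", "medium", "high", "low", "high"]

nominal = pd.Categorical(doses)  # order is ignored; categories sort alphabetically
ordinal = pd.Categorical(doses, categories=["low", "medium", "high"], ordered=True)

print(nominal.categories)   # ['high', 'low', 'medium'] -- just labels A, B, C
print(ordinal.categories)   # ['low', 'medium', 'high'] -- the order you declared
print(ordinal < "high")     # comparisons like this are only legal when ordered=True
```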
Now some common terms, which I know many of you already know, so we'll go through them quickly. Univariate means one variable, one feature, at a time: alanine across all the samples is just one variable. Bivariate is two variables, say alanine versus creatinine. Multivariate is more than two. The dimension of the data is the number of variables: for human gene expression it's around 22,000; for metabolites we usually get hundreds or thousands of dimensions, so it's a much lower dimension at the moment. We already talked about reading in your data, and I'd guess that's actually the trickiest part, but once you understand the required input and the labels the tool is looking for, you should be able to prepare it quickly in Excel. Next, visualizing your data. It can seem very simple, but it's very helpful, and there's no magic bullet that visualizes multi-omics data and gives you the whole picture, so you have to try things, starting from the basics. We cannot look at all the data, so we look at summaries: the center, basically the mean, median, and mode, and the spread. For a normal distribution this works beautifully; we have many models, and we can summarize the data with simple values like a mean and a standard deviation. But when your data looks more like a uniform distribution, it's hard, because there is no strong convergence anywhere; summarizing it with a mean or median is uninformative, because the data is not actually concentrated at the mean, even though you can still calculate one. With extreme distributions it's worse still, so it's really worth looking at your data and choosing a proper way to summarize it; some data is not even summarizable, and even PCA will show points everywhere with no pattern. For relative standing within a distribution, you look at quantiles; the IQR is the interquartile range, which brings me to the box plot. I get emails asking about box plots, and I just reply "Google it", not to be dismissive, but because Wikipedia explains it much better than I can in an email. Anyway, let's go through it. Assume this red curve is normally distributed, which is very nice: you can see the median, and this is the interquartile range. Basically, you rank all the concentrations from high to low and divide them into quarters: the top 25%, 75% to 50%, 50% to 25%, and 25% to 0. The middle box is the focus; we assume the majority of the information is there. A lot of the time we just use the mean or median to represent the whole data set for modeling, because we cannot use every value. If your data follows this nice distribution, most statistics will work well; if, like the previous example, it's spread everywhere, it's going to be hard. A lot of the time we also use group averages: we don't look at each individual sample, we use the group average of the concentrations and ignore the variance. But the variance is very, very important. We often assume the groups have roughly equal variance, which frequently is not true, and that's hard. Smaller variance is much better: with the same difference in means, if the variance is smaller you have more confidence the two groups differ, because the distributions don't overlap; with the same distance between means but higher variance, the overlap is large. So variance matters a lot.
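Here is a minimal sketch, with simulated numbers, of exactly this point: two groups with the same mean but very different spread, summarized by the statistics we just discussed:

```python
# A minimal sketch of summary statistics: same center, different spread.
# All numbers are simulated, purely for illustration.
import numpy as np

rng = np.random.default_rng(42)
tight = rng.normal(loc=10.0, scale=0.5, size=200)   # small variance
wide  = rng.normal(loc=10.0, scale=3.0, size=200)   # large variance

for name, x in [("tight", tight), ("wide", wide)]:
    q1, med, q3 = np.percentile(x, [25, 50, 75])    # quartiles, as in a box plot
    print(f"{name}: mean={x.mean():.2f}  median={med:.2f}  "
          f"IQR={q3 - q1:.2f}  sd={x.std(ddof=1):.2f}")

# Same center, very different IQR and sd: the mean alone is a poor summary
# of the 'wide' group, which is the point made above.
```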
By now I hope you really have the concept: because we cannot consider all the data points simultaneously, we use summary statistics, basically the mean, to do the calculation; we choose a single representative value. If your data is centered tightly around the mean, very dense there, then the mean is a very good value to represent the data. If your data is spread out everywhere and you still choose a center point like the mean to represent it, it won't work well: the overlap between groups is large and you're representing everything with one value, and that's where things break down. But if the data is tight, one value represents the whole thing, groups separate well, and the summary is very representative. The whole idea: summarize your data with one or two values and use those for the computing. Okay, the normal distribution; we'll go through this quickly. Why are we so interested in it? Because it's well accepted that both biological and physical measurements tend to be normally distributed, all the complex theory behind it has been worked out, and computation with normal distributions is very fast; everything else is much slower by comparison. Because of this convenience, we really want our data to be normally distributed. Most of the time it is; sometimes it isn't, and then we need some tricks to make it normal. Here is the scary formula, which I hope most of you have seen; you don't need to remember it because it's all built into the computer. A lot of the time in metabolomics we see data that looks like this: not normal, but close, so we can apply something like a log and it becomes very normal. The last case is when your population is truly divergent, bimodal; that kind of data is hard to normalize. Skewed data is also quite common in raw metabolomics data, everything piled up low with a long tail on the high side, though that's becoming less common. The whole idea of data normalization is this: all the statistical models work on a summary number, mostly a mean or average, which works best if the data is normally distributed; for anything else they don't work as well. For that reason alone we normalize the data, so the statistics work best, where "best" means the p-values are more accurate and the conclusions more robust. We have to please the statistical method and respect its rules. So here's a skewed, exponential distribution; we apply a log transformation, and now it's nearly perfectly normal.
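As a minimal sketch, with simulated data rather than real intensities, here is that log trick, plus the caution about near-zero values that comes up below (the half-minimum offset is one common convention, not a rule):

```python
# A minimal sketch of the log transformation: a right-skewed (log-normal)
# variable becomes roughly normal after taking logs. Simulated data only.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
raw = rng.lognormal(mean=2.0, sigma=1.0, size=500)   # skewed, like raw intensities

print("skewness raw:", round(stats.skew(raw), 2))          # strongly positive
print("skewness log:", round(stats.skew(np.log(raw)), 2))  # near zero

# Caution from the lecture: values near zero blow up under the log. One
# common (hypothetical) guard is to offset by half the smallest positive value:
shifted = np.log(raw + raw[raw > 0].min() / 2)
```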
Question: when you do a log transformation on your data, do you do the reporting on the log-transformed data? Are you asking about within MetaboAnalyst? When you upload your data, you have a copy called the original data; after filtering and processing there's a second copy, the processed data; and the third is the normalized data. So you have three copies in the tool. In theory every step should keep its own copy, but that would create a lot of copies of the data and get confusing, so instead every step is recorded: you have the normalized data, the original data, and the processed data, along with the parameters chosen at each step, including which normalization you used, and you get an analysis report telling you exactly what was done. Now, this is real metabolomics data, and you can see the distribution is skewed, with a lot of values very close to zero; after normalization it shifts to the right and is more normally distributed. One thing to keep in mind: applying a log exaggerates very small values, values close to zero. If those near-zero values are low quality, you are actually enlarging the noise. So pay attention: the log transformation is not magic. If all your data is high quality, applying a log is no problem, but if values close to zero are low quality and you're not sure about them, the log gives them more weight relative to the high values, and that can be a side effect. The good thing about the log is that it's easy to understand. There are fancier approaches; this paper was published in 2006 but is still widely cited, because the topic is important yet also a bit boring, so people don't research it much. The authors did their best to evaluate all the possibilities and give guidelines; there are no golden rules saying "do this, then that", but it does give you ideas about the advantages and limitations of each approach. The title covers centering, scaling, and transformation; you can read the paper, it's very educational, but I'm not going through the details, it's just too long. Question: I read this paper and I'm interested in Pareto scaling, because it says it reduces the effect of the high-concentration metabolites. Would that be an acceptable approach to make the comparison between high- and low-concentration metabolites more fair, or should I do my analysis in two separate steps, one for the high-concentration metabolites and one for the low? No, I don't suggest two separate steps; it should all be measured in one go. A lot of the time your instrument has a certain detection limit, and above it the measurements just work well: a value in the thousands is not measured more accurately than one at 100. Down near fractions of a micromolar things change, but above a certain threshold the machine is accurate and you should trust it; you shouldn't discriminate between high and low. I think that's an artificial division you'd be creating, and it causes trouble later: we're already doing multi-omics integration, and you'd be subdividing even within metabolomics. Method-wise, it really just introduces more work. Regarding Pareto scaling: the authors somewhat advocate for that approach for metabolomics, and we did put it in the tool early on, but in reality I don't find it much better.
My advice: if your data, say from targeted metabolomics, all the measurements seem fine, just do the log; it's easy to try, and you can compare log alone against log with Pareto and see the difference. I don't strongly advocate any particular method, whether Pareto or, say, probabilistic quotient normalization. What I do suggest: first, double-check your data quality; second, start from the simple options, which are easy to interpret, and see whether the resulting patterns fit your biological sense, your gut feeling; then gradually increase the complexity. Don't just follow a prescription, "I only use this because this paper says so"; your data is different, and it's really case by case. Here you can see all the methods offered; we could potentially add three or five more based on user requests, but this is enough for now, and we'll probably keep adding one or two gradually. Data normalization happens at several levels. The first is sample normalization, a biological normalization: samples can have different dilution effects, for example in urine, or differences in tissue amount or volume, so you can normalize by the sample median; this is per-sample. Then there is transformation, like log or cube root; somebody argued strongly that cube root is very good, so we put it there, although honestly I never use it, I only try log, and cube root also looks fine. Then there is scaling: Pareto scaling and auto scaling, which involves centering. And of course you can normalize by a reference sample: if one of your samples is effectively a reference standard and you use it, it will change your results dramatically, and normalizing by the median is the same kind of decision. So the first, biological, level really depends on your design and your reasoning, and it can have a dramatic effect on your results; we can discuss it later if you get an interesting result from it. If your samples are all good quality and there's no particular reason to do otherwise, my suggestion is to use log only, without scaling: the fewer procedures applied, the easier it is to walk backwards and reason, "this concentration means this." Once you apply a chain of calculations to your data, you can't just walk back and think, "this increased by this amount and caused that"; it becomes too complicated. So start simple and gradually add steps.
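For those curious what these scaling options actually compute, here is a minimal sketch (my own illustration, not MetaboAnalyst's code; the two-metabolite matrix is made up):

```python
# A minimal sketch of the column-wise scaling options discussed above
# (van den Berg et al. 2006): mean centering, auto scaling, Pareto scaling.
import numpy as np

def center(X):
    return X - X.mean(axis=0)

def auto_scale(X):
    # unit variance per feature: every metabolite contributes equally
    return center(X) / X.std(axis=0, ddof=1)

def pareto_scale(X):
    # divide by sqrt(sd): damps, but does not erase, the abundant features
    return center(X) / np.sqrt(X.std(axis=0, ddof=1))

X = np.array([[100.0, 0.2],
              [120.0, 0.3],
              [ 90.0, 0.1]])          # one abundant, one scarce metabolite
print(auto_scale(X).std(axis=0, ddof=1))    # both columns -> sd 1
print(pareto_scale(X).std(axis=0, ddof=1))  # sd becomes sqrt(original sd)
```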
Question: what about bimodal data? Yes, that's a good question. If your data is truly bimodal, your instumption breaks down, because we assume all the samples come from the same population; that is the basic assumption. If you can see there are really two populations in your data, the best approach is to model them separately. In practice, you can upload your data labeled with, say, gender, and check whether gender causes issues; if the genders really have different distributions, analyze males and females separately. That is genuinely two populations, and statistics always talks about one population; you shouldn't merge two populations together. I don't think you should try your best to force a bimodal data set into one model; you'd just be hiding the truth, and the truth is two populations, two models, analyzed separately. The only downside is whether you have enough replicates, and whether, for example, males and females each have an equal or reasonably balanced design of treatment and control; you can only do that analysis if they do, because with very few samples you'll have no confidence. Good question. Now we come to the last step, statistics and machine learning, which will probably take the most time. The first two steps are really simple, but they are very important: if you don't do them properly, you'll have trouble here, and your conclusions probably won't be robust. So, understanding p-values; everybody thinks they understand them. Question: is there graphical output for the normalization? Yes, below that there is graphical output that lets you see whether it looks better; I'm just showing the options now, and you'll see it when we do the lab; you can visually explore whether the normalization helped, but be careful. Another question: how do you deal with values below detection? That's a good call; it depends on how much trust you give the feature. If a lot is missing, say 35 percent below detection, you can try replacing the missing values with a very small value, the smallest value, if you really want to keep the feature; if more than half is missing, just remove it, you can't do much with it. In MetaboAnalyst we usually replace missing values by half of the lowest value: if your detection limit is 0.05 and a value is missing or undetected, we fill in something around half of that, so you can still calculate without discarding your data. So, statistics and p-values. The p-value is important, and actually every time I revisit it I feel I understand it more deeply, so let's go through together what the p-value is really about. Here is the population, and statistics is all about populations: you sample from it, compute summaries like the mean and standard deviation, and from them you infer that the whole population probably behaves like this. That is what statistics tries to do, model the population, and it's based entirely on your sample: if your sample is biased, you probably won't get an accurate estimate. Assume this is the whole distribution and you sample from this central area: you get a mean and standard deviation that reflect the overall population well. But if for some reason you sampled from this other area, the mean is shifted away from the population mean and the variance is large, so your estimates don't reflect the population closely. That is sampling bias: in that case you have bias, and what's worse, with only your data in hand you have no clue how close your estimate is to the population. This is the traditional problem, and measuring that uncertainty is where p-values come in. We don't know the population and we don't know the truth; instead we use the p-value to indicate our level of certainty that our result represents a genuine effect in the whole population. The p-value is the probability that the observed result was obtained by chance: if there were no real effect, just randomness, what is the chance of seeing such a result? You have your model and you can compute that chance; for the normal distribution, some people can even do the calculation in their head. I can't, but because I use R, I can always ask it. For example, at 0.05 under the normal distribution: if the chance is very low, we think the result is not random but a real effect, so we reject the null hypothesis.
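As a minimal sketch, the classical calculation looks like this; I do it in R, but any statistics library gives the same number (the z-score here is a made-up example):

```python
# A minimal sketch of the classical p-value: under a null N(0, 1),
# how surprising is an observed test statistic?
from scipy import stats

z = 2.5                                   # hypothetical observed z-score
p_two_sided = 2 * stats.norm.sf(abs(z))   # sf = 1 - cdf, i.e. the upper tail
print(round(p_two_sided, 4))              # ~0.0124, below the 0.05 cutoff
```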
If the p-value is somewhat high, the result could be random; we're not sure, so we do not reject the null hypothesis. Note the wording: a non-significant p-value does not mean your treatment has no effect; it just means you don't have sufficient evidence. If you increase the sample size or design a better experiment, you may well be able to detect the effect. The p-value only tells you the evidence is insufficient; it is not for decision making by itself, and you really need to think beyond it. Calculating p-values is easy for the normal distribution: you check whether your observation falls within a certain range and what the probability of that is. If this is the population, your model, and your observation is far out in the tail, it can be very significant, no problem; and if a lot of your observations land out there, you can see it's really a different population, the picture becomes almost bimodal, and you'll get something very significant. But we've been talking about an ideal world where everything is normally distributed. In many cases we don't know the distribution, and a reviewer won't believe it's normal. How do you convince reviewers the conclusion is robust when your "very significant" p-value rests on a normality assumption? It's hard to believe that complex omics data, metabolomics, targeted and untargeted, across different platforms, is normally distributed; most of the time, if you look, it isn't. The robust way to handle this is to calculate empirical p-values. Empirical p-values have become very common in big data, because big data is complex, mostly not normal, and we now have powerful computing, so we can calculate p-values numerically. Get comfortable with this and try to understand it as deeply as possible, because it's going to become very common. It's used when we don't know the distribution and don't believe it's normal. What do we do? We construct a null distribution, one in which there is no effect and everything is random, and ask how often we would observe our result under it. The idea is simple. Under the null hypothesis there is no effect, meaning your class labels, male and female, treatment and control, don't mean anything. So you mix and randomize all the samples, then calculate what you observe, say the mean difference or the fold change; you do this many times, say a million, and compare against the value from the original class labels, the original design, where you believe there is an effect. How many times is the effect in the randomly shuffled data more extreme than in your original data? If that happens often, you cannot believe the effect you observed is real, because even random shuffling does better. But if the random shuffling never generates data as extreme as yours, even in a million permutations, you should be very, very confident.
So what is your p-value? Less than one out of a million. This is called an empirical p-value, and an empirical p-value can never be exactly zero, because that would require infinitely many permutations, which you cannot do: if you permute a thousand times and never get a better result, the p-value is less than one out of a thousand. The principle should be clear: because we have powerful computing, we can simply run the simulation, the permutation. If you want to write the code yourself, this is how you'd do it, but most of the time it's built into MetaboAnalyst and many other tools. Let me use a very simple example: student test scores, one group of cases and one group of controls; or think of it as concentrations of one compound in two groups. The mean difference between the groups is about 0.5. Is that significant or not? You're not sure, of course. So we assume there is no difference, and if there's no difference, we can shuffle the labels. In the original data, samples 1 to 9 belong to the case group and 10 to 18 to the control group; now assume those labels mean nothing, so we shuffle, drawing the case group randomly from the whole pool. Some samples stay where they were, some are swapped in from the other group. We compute the mean difference again, say 0.329, and we repeat the shuffle a thousand times. This point is your originally observed difference, and these are all the shuffled differences: if the original case/control labels meant nothing, the observed difference would sit in among the shuffled ones, but when it sits far outside them, you should be very confident there is a real difference between case and control. If three out of a thousand permutations give a larger difference than the observed one, the p-value is 3/1000 = 0.003; if none do, you report p < 0.001 rather than zero. Sometimes people add one to the numerator and denominator; that's all fine, all standard. In past years this seemed challenging for some people, so if you'd like, I'm happy to explain it a different way; or maybe you find it easy, which is good. The general advantages: it does not rely on any distributional assumption, so the data doesn't need to be normal, and it corrects for hidden correlation, because a lot of data has hidden structure we don't know about; permuting within the same data preserves that structure, so the randomization accounts for it too. So it's quite robust. The disadvantage is that it's computationally intensive, especially for large data. And on the other hand, think about how many distinct permutations are even possible: with only three samples per group there are only about 20 possible label assignments, so you can never get a very small p-value; you exhaust the possibilities quickly. For a good permutation test of, say, a thousand permutations, you want on the order of ten to twelve samples per group at least, because a permutation is a random draw, and with very small samples a random draw will quickly repeat assignments you have already used.
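If you did want to write that code, a minimal sketch looks like this (the scores are simulated, and the +1 convention is the one just mentioned):

```python
# A minimal sketch of the empirical p-value by permutation described above.
# Group labels are shuffled many times; we count how often shuffled data
# give a mean difference at least as large as the one actually observed.
import numpy as np

rng = np.random.default_rng(1)
case    = rng.normal(10.5, 1.0, size=10)   # made-up scores, one compound
control = rng.normal(10.0, 1.0, size=10)

observed = case.mean() - control.mean()
pooled = np.concatenate([case, control])

n_perm, hits = 1000, 0
for _ in range(n_perm):
    rng.shuffle(pooled)                    # "the labels mean nothing"
    diff = pooled[:10].mean() - pooled[10:].mean()
    if abs(diff) >= abs(observed):
        hits += 1

# Adding 1 to numerator and denominator keeps the p-value away from an
# impossible exact zero, as mentioned in the lecture.
p_empirical = (hits + 1) / (n_perm + 1)
print(p_empirical)
```

With ten samples per group there are plenty of distinct relabelings, so a thousand permutations is meaningful; with three per group you would exhaust the roughly twenty possibilities almost immediately.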
Here is another important concept before we move on. We've covered p-values and empirical p-values, which are very important concepts I hope you accept. The second one is the multiple testing issue, probably already familiar to most of you, so I'll go over it quickly. Question: why not always use permutation? Because it's not as powerful. If your data really does follow a normal distribution, the parametric p-value can come out significant where the permutation p-value for the same comparison does not; permutation also has some variation, so running it multiple times doesn't always give the same result. We don't have one universally better method to replace the normal distribution. So, hypothesis testing: we already covered the p-value calculation, how likely the result is if the null hypothesis is true. That's the standard approach from conventional statistics: we compute p-values and apply somewhat arbitrary cutoffs like 0.05 or 0.01, assuming normality. The issue is that all of this is univariate, one variable at a time. If we're willing to accept a small chance of a false positive per variable, then in metabolomics or gene expression, with hundreds to thousands of variables, the errors accumulate: with 10,000 hypothesis tests, by chance alone we'd call 500 significant. That is the problem we need to address. The traditional remedy is the Bonferroni correction: become very stringent and divide the threshold by the number of tests, so 0.05 divided by n, which is very strict. That can leave you with no significant genes or metabolites to work on later, which is not ideal: the statistics would be telling you there is nothing to pursue. People found this unreasonable, and so did I. My view is that statistics should help people develop hypotheses and narrow the search space, so they can design a more focused follow-up experiment; statistics should not dictate biological research. So there needs to be something else, and the more common alternative is the false discovery rate, FDR, which is more lenient. FDR at five percent means that if you select 100 significant compounds, about five of them will be false positives, which is fine for an omics analysis, because validation, like qPCR or a targeted assay, will follow; the omics experiment is usually the start, not the end. A slightly lenient cutoff at the beginning is actually conducive to your later research.
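As a minimal sketch of the two corrections side by side, with simulated p-values where only a few features carry a real effect (my illustration, not any tool's internals):

```python
# A minimal sketch of Bonferroni (strict) versus Benjamini-Hochberg FDR
# (more lenient). P-values are simulated: 990 null features, 10 real effects.
import numpy as np
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(7)
pvals = np.concatenate([rng.uniform(size=990),           # null features
                        rng.uniform(0, 1e-4, size=10)])  # real effects

bonf = multipletests(pvals, alpha=0.05, method="bonferroni")[0]
fdr  = multipletests(pvals, alpha=0.05, method="fdr_bh")[0]
print("Bonferroni hits:", bonf.sum())   # very strict
print("FDR 5% hits:   ", fdr.sum())     # more lenient
```

Typically Bonferroni keeps only the most extreme few, while BH recovers most of the planted effects at the cost of an occasional false positive, which is exactly the trade-off described above.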
Now, high-dimensional data. So far all these concepts have been mainly for a single variable, like the t-test and ANOVA, and most people still try these first, which is mostly well accepted. The common approach: if you have your metabolomics or other omics data with sufficient replication, try the t-test or ANOVA, apply these univariate statistics across all the variables, visualize the results, and explore. This is all fine, and you really should try it at the beginning; jumping directly to the more advanced methods is something I'm slightly against. You should explore variables individually first, especially because you have prior knowledge: this compound is supposed to be high here if the hypothesis is true, and you have high confidence in your data. Once you've done this, you move to the more advanced statistics, multivariate methods, which try to summarize your data at a lower dimension so you can still explore and visualize it. The three examples here show why we cannot directly view data in high dimensions: a normal distribution in one dimension, in two dimensions, in three dimensions; four dimensions is already hard to picture. Multivariate statistics means considering multiple variables simultaneously, not one at a time across all variables; that "simultaneously" is what defines it, and it is genuinely more complex. The reality is that most of us were trained on univariate statistics, because these classical concepts, the Fisher exact test, even the t-test, were developed early in the last century, before omics data existed; they are well established and well accepted, but now we need more. Applying them directly gives us the multiple testing issue and also a modeling issue: a very low number of replicates against a huge number of measurements. This is the "n much less than p" problem, where n is your number of replicates and p is the number of variables; it's very hard for any program to model that, and it's just the reality. So what do we do? Classical statistics largely gave up here: the theory developed early in the last century dealt with low-dimensional data and cannot simply absorb omics data, and even its answers, like Bonferroni, are unsatisfactory, because they are stringent and do not consider the variables simultaneously. Computer science came to help; computer science is very open-minded. At the time I was quite frustrated with statistics: I attended classes and asked these questions, and they looked at me as if I made no sense. But come on, I was working in omics, a real data challenge; why wouldn't they address it? So I went to computer science, took the classes, and they genuinely tried their best. The fields are converging now: it's not statistics versus computing science anymore, we call it data science. The two borrowed approaches here are clustering and classification; regression comes more from the statistics side. This has been very helpful: omics data analysis has developed beyond the scope of classical statistics and borrows a lot of strength from machine learning fields like image processing and chemometrics, particularly dimension reduction and machine-learning clustering.
Actually, you can think of PCA itself as having a clustering flavor. Before we really jump into machine learning, let me put one overall distinction in your mind: unsupervised versus supervised learning. Is supervised learning new to anyone? No? Fine, I won't go into details since this is basic: unsupervised methods don't consider the class labels, they work from the data itself; supervised methods do the modeling with regard to the class label. So how do we make large, high-dimensional data understandable? One way is clustering. Close your eyes and think about it: the data is so high-dimensional, say 20,000 features; how can we make it more understandable? One way is to put features that are similar to each other into bins, say 100 bins, based on their similarity; now you've reduced 20,000 features to 100 bins, each represented by its average. That is clustering: reduce the number of things to consider, and as long as each bin represents its data well, that's fine. Dimension reduction, on the other hand, tries to project high-dimensional data into low dimensions, and the way it does this is by trying to preserve the variance. Question: when you cluster, is it applied to the samples or to the features? Good question. Of course we can apply it to both; in R it's just a matrix, you can transpose it, so direction doesn't matter to the algorithm. But for us, as we discussed, "high-dimensional" refers to the variables, so by default it's applied to the features. If you want to see which samples are more similar to each other, apply it to the samples, but mainly we apply it to the features, to see which features are close to each other so we can organize them: these compounds look similar, most likely because they're involved in the same pathways; that is the understanding of the data we're after. That's the default, but you can always go the other way; the algorithm is agnostic to the direction you apply it in. Now, if we're putting samples or features into bins with whatever is most similar to them, how do we measure similarity? This is important. For clustering we need a similarity measure and a threshold: "you are similar enough to go into the same bin." And once we've organized, say, 20,000 features into 100 bins, how similar are the bins to each other? Each bin also needs a similarity measure. We want to respect the data as much as possible while still summarizing it; there's nothing fancy here, it's just practical: reduce the data while still representing it faithfully, in a more digestible, smaller form. Clustering does exactly this, and there are some standard methods with fancy names. K-means belongs to the partitioning methods: it divides the objects into m clusters, m bins, without overlap. Hierarchical clustering doesn't commit to a particular number of clusters; it just builds the whole tree, from the top down or from the bottom up.
So it's up to you to visually check the tree and decide where to cut it. K-means is very widely used in machine learning, but biologists tend to prefer hierarchical clustering, because they usually have no idea in advance how many clusters to look for: "show me all the possibilities and I'll decide." So hierarchical clustering is very commonly used, especially by biologists who know their data and can see where a cutoff makes biological sense. Now, k-means clustering; I'm not sure how many of you want the details, but it's a very basic building block, with a lot of randomness. Suppose your data is two-dimensional, just points, and you want two groups, A and B. You don't know where the groups are, so you drop in two random seeds, anywhere. Whichever seed each point is closest to, it gets that label, say red or blue. Then you recalculate the centroid of each colored group, move the seed there, and redo the assignment; you go back and forth, and after a few iterations it converges. If your data has a strong pattern, then regardless of where the seeds start, you converge to a nice separation like this. In many cases it's fuzzier: different starting points sometimes end with slightly different clusters, which is also quite common; unless you have nicely separated data that guarantees the same split, the result can depend on the initial seed positions, and that is expected. Computationally it's very fast and easy; k-means is quite popular, and a lot of online recommendation and customer profiling works this way: your shopping and browsing habits can be binned into categories in real time to give you suggestions. Hierarchical clustering is the one more commonly used by biological researchers. We find the two objects closest to each other; an object here can be a feature or a sample, and in reality it's applied to both: in a clustered heatmap, you see a dendrogram on both sides. We merge the closest pair and repeat the whole process. It's like k-means in spirit, but you don't need to specify initial seed positions; it just works from one end of the hierarchy to the other. The key parameters are the similarity between samples or features and the similarity between clusters; given the earlier question, I should say similarity between samples and similarity between features, because you can cluster across either. So we're just computing similarities, merging, and recomputing; unless you're going to write the code yourself, I won't go into more detail, because it's all done by the program.
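For the curious, here is a minimal sketch of both methods on made-up two-dimensional data; the library defaults stand in for whatever a given tool actually uses:

```python
# A minimal sketch of the two clustering approaches just described,
# on simulated two-dimensional data with two obvious groups.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 0.5, (20, 2)),    # group A
               rng.normal(5, 0.5, (20, 2))])   # group B

# k-means: random seeds, assign to nearest centroid, recompute, repeat
km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# hierarchical: merge closest objects bottom-up; cut the tree at 2 clusters
tree = linkage(X, method="average")            # average linkage, as above
hc_labels = fcluster(tree, t=2, criterion="maxclust")

print(km_labels[:5], km_labels[-5:])           # the two groups get two labels
print(hc_labels[:5], hc_labels[-5:])
```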
Why do we care about similarity measures? Because these parameters are exposed to you in MetaboAnalyst, and you need to make an informed choice. The most commonly used similarity measure is the Euclidean distance: for each pair of corresponding features you take the difference, square it, sum everything up, and take the square root; it's just like physical distance in space. The other common similarity measure is the Pearson correlation, which is normalized: it's divided by the standard deviations. If your data has already been auto-scaled, Pearson and Euclidean give essentially the same ordering, because everything has unit variance. One difference with the Pearson correlation is that it has direction: because it isn't squared, it can be negative, so positively correlated profiles give +1 and negatively correlated ones give -1. Question: should we remove the highly correlated features before this analysis? No, no; this is already part of analyzing your data. You've normalized; now, doing PCA or the heatmap clustering, you're looking at the patterns, at which groups of features are close to each other. You are analyzing the data at this point; I'm walking through the different parameters so that when you see the heatmap and ask "which option should I choose?", you know. For clustering you also need a linkage. Single linkage uses the closest data points and tends to generate long chains: sometimes you look at a clustered heatmap and can tell single linkage was used from the long, straggly chains. Complete linkage, which is quite commonly used, takes the furthest data points and produces compact, shallow clumps. Average linkage sits more or less in between. Here is a very typical example, probably one of the first clustered heatmaps published for gene expression: clustering is applied to the variables, the genes, and you can see up- and down-regulated genes nicely separated from top to bottom. Anyway, if you have difficulty understanding hierarchical clustering, we can definitely discuss it during the lab, so I won't go into more detail.
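Before moving on, here is a minimal sketch of the difference between the two measures, on made-up vectors:

```python
# A minimal sketch of the two similarity measures exposed in the heatmap
# options: Euclidean distance versus Pearson correlation.
import numpy as np
from scipy.spatial.distance import euclidean
from scipy.stats import pearsonr

a = np.array([1.0, 2.0, 3.0, 4.0])
b = a * 10                     # same shape, very different magnitude

print(euclidean(a, b))         # large: the magnitudes differ
r, _ = pearsonr(a, b)
print(r)                       # 1.0: perfectly correlated direction

# After auto scaling (unit variance per feature) the two measures agree,
# which is the point made above.
```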
Hierarchical clustering is very natural and intuitive, and PCA is also supposed to be intuitive for metabolomics, though less so if you come from RNA-seq or gene expression. I work in both fields, and I've found PCA often doesn't work well on gene expression but works very well on metabolomics; it's an interesting observation about the two kinds of systems. What PCA tries to do is project high-dimensional data into low dimensions while keeping as much variance as possible; the idea is that the variance of the data is the key information, so keeping more of it means higher fidelity to your data. A commonly used way to think about PCA is to ask what your data's characteristics are. Since we can only view things in 3D, picture a three-dimensional bagel that you must project into two dimensions: there is an O-shaped projection and a hot-dog-shaped projection, and you need to choose one. If you must sacrifice something, keep the O shape: it's more informative than the hot dog, because it better captures the characteristics of the data. PCA is also commonly used in image analysis; people found they could identify the components that represent the characteristic features of a face, for facial recognition, so a lot of this is borrowed from machine learning and has been refined. For us, PCA is not only for visual representation: a lot of the time we want to see which features contribute to the separation, not just the pattern but which compounds drive it. So here is a scary slide showing how a principal component is computed: it's a linear combination of all your original features, each multiplied by a weight; in a PCA plot, these weights are called loadings. This is your original data and this is your weight; the sign, positive or negative, and the magnitude of the weight determine how much each feature contributes. How do you rank important features? By the sign and absolute value of the loading: the larger the absolute value, the more that feature contributes to the final score t, which is what shows up on the plot. Since x is your original data, the features with bigger weights have more influence; that is what's going on under the hood, the mapping from the original dimensions to new ones, with the components ranked by variance explained. We're looking at the top components, say the top two, and hoping they explain at least half of the variance in your data. In a bit more detail: once PCA extracts the first component, the one explaining the most variance, it extracts the second under the condition that it is uncorrelated with the first; this is called orthogonality. This constraint makes the components easier to interpret and tends to give a good separation. There are more complex approaches where the components overlap and are not orthogonal, but they don't necessarily make your data easier to interpret; PCA remains the standard, and if it works well, just use it, because it's easy to interpret and it's unsupervised, based purely on your data. If your data already separates well in PCA, you can just think about the biological story; based on PCA plus some ANOVA or t-tests you can write your paper, no further analysis needed, since PCA already shows the difference. Another way to see PCA is as a rotation of the axes: in 2D it creates new axes, moving from the current black ones to the red and green ones, such that the first axis follows the direction of greatest data variance, here almost the diagonal, and the other is orthogonal to it. PCA can be interpreted in different ways; just pick the one you're comfortable with.
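Here is a minimal sketch showing, on simulated data, that the scores really are weighted sums of the centered features and that the loadings rank feature influence (my illustration, not MetaboAnalyst's implementation):

```python
# A minimal sketch: each PCA score is a weighted sum (the loadings) of the
# centered original features, and loading magnitudes rank feature importance.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(5)
X = rng.normal(size=(30, 4))          # 30 samples, 4 made-up metabolites
X[:, 0] += np.linspace(0, 5, 30)      # feature 0 carries extra variance

pca = PCA(n_components=2).fit(X)
scores = pca.transform(X)

# the scores are exactly the centered data times the loading vectors
manual = (X - X.mean(axis=0)) @ pca.components_.T
print(np.allclose(scores, manual))     # True

print(pca.components_[0])              # PC1 loadings: feature 0 weighs heaviest
print(pca.explained_variance_ratio_)   # variance explained per component
```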
PCA can operate on two different matrices: the covariance matrix or the correlation matrix. I am not sure I should go into too much detail here, because if you normalize your data with autoscaling, covariance and correlation become the same thing. If you do not autoscale, the covariance version will be dominated by whatever is most abundant. Glucose is very abundant, and a change in glucose causes a huge variance change across your whole dataset simply because it is abundant. If you autoscale, every feature contributes roughly equally.

Question: when would using the covariance matrix be the relevant choice?

Answer: I knew you were going to ask that, and there is no automatic answer. If you genuinely believe that changes in the most abundant compounds matter more than changes in the low-abundance ones, choose the covariance matrix. But if you think a two-fold change means more or less the same thing at 20 micromolar, say alanine, as at 5000 micromolar, say glucose, then you should autoscale to stop the covariance being driven by the highly abundant compounds; otherwise you will always see those few big compounds, and they will overshadow the small ones. Sometimes people even remove glucose and urea entirely because they know they are interested in the less abundant compounds. Again, this is up to you: how you think about your system and how you want to treat it. People would like me to come up with clear rules and build them into MetaboAnalyst, click a button and get a result, but that is not good, because people should explore and learn. Once everything is wrapped in a box and you just click and believe it, that stops being fine in research. There are no clear rules; you have to use your brain and convince the reviewers.

So why do people use multivariate methods in metabolomics? If you get a PCA like this, you should be very happy. These are NMR spectra from three patient groups and a control, BAP, and I think it was one of our first studies: you take the spectra straight into PCA and, wow, a perfect separation. In cases like this, PCA can effectively be used as a clustering method, even though textbooks do not describe PCA as clustering; a proper clustering method has an explicit similarity measure between samples and between clusters, whereas PCA's primary purpose is dimension reduction. But in the end, on a score plot like this, things close to each other are more similar to each other, so reading clusters off it is natural.

Question: so if I interpret it that way in a paper, that is no problem?

Answer: yes, it is acceptable. When I have to file PCA under chemometrics or clustering, I go back and forth myself.
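Before moving on, a quick sketch of the covariance-versus-correlation choice discussed above (same made-up data; the scale factor on "met1" is invented purely to mimic an abundant compound like glucose): prcomp with scale. = FALSE is covariance-matrix PCA, scale. = TRUE is correlation-matrix PCA.

```r
# Give one feature a glucose-like scale so it dominates the total variance
set.seed(42)
X <- matrix(rnorm(30 * 50), nrow = 30, dimnames = list(NULL, paste0("met", 1:50)))
X[, 1] <- X[, 1] * 100                  # "met1" plays the abundant glucose

pca_cov <- prcomp(X, scale. = FALSE)    # covariance-matrix PCA
pca_cor <- prcomp(X, scale. = TRUE)     # correlation-matrix PCA (autoscaled)

round(pca_cov$rotation[1:5, 1], 3)      # PC1 is essentially just met1
summary(pca_cov)$importance[2, 1]       # PC1 swallows nearly all the variance
summary(pca_cor)$importance[2, 1]       # autoscaled: variance spreads out
```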
Here is the loading plot that goes with it, and the correlation idea behind how the loadings are calculated. People sometimes do not like this plot, but look at the signs and the magnitudes: features with large absolute values sit out toward the corners, top and bottom, and those are the ones with the most influence, while the sign tells you which direction they push, this way or that way.

The nice thing with PCA is that the two plots are always correlated. Let us just focus on the control group on the score plot: the group of compounds loading on the same side will have a strongly positive correlation with the samples in the control group, and the group of compounds on the opposite side will have a negative correlation with them. It is a very intuitive interpretation; if you walk through the math in your mind you will find it is not a coincidence, just a very nice property. If you put the score plot and the loading plot side by side, you can see which compounds are positively and negatively correlated with each separation, which groups separate where. This is why people like PCA: once you understand it, you honestly find it very pleasing to use.

Now, in some cases PCA will not succeed in identifying any clear clustering no matter how many components are used, and in that case it is wise to accept the result and assume the groups cannot be distinguished. That is true a lot of the time: if your data are high quality and everything is okay, and the PCA shows no strong pattern, it is very rare that a more advanced method will give you a genuinely good result. When you look at a PCA you should already have a feeling for what is coming. If the PCA shows heavy overlap and then PLS-DA, which we will talk about later, hands you a beautiful separation, and you are so happy you rush to publish, that is most likely overfitting: the pattern is not robust and you cannot be sure it is real.

So: PCA is very good for a data overview, for outlier detection, and for looking at the relations between variables. Use PCA together with box plots and the rest as your quality control, your first line of defense for understanding your data.

PLS-DA is a sibling of PCA, and people like to skip PCA, go straight to PLS-DA, and report a good separation, which is wrong. I am telling you: do not do that. Do PCA first; going directly to PLS-DA is really dangerous. The reason is that PLS-DA is supervised: it tries to maximize the covariance between the data and the class label, and because it always sees the class label, it will always produce some separation with respect to your conditions. It always tries to please you and make you feel happy. Think about that: sometimes you do want encouragement, fine, but you need to be cautious. Here you can see PCA and PLS-DA applied to the same data, before and after, and the PLS-DA separation looks better. If PCA already shows some separation, applying PLS-DA to tease out the signal is legitimate: it makes it easier to find which features contribute to the separation and to generate hypotheses. But if there is no separation at all, which is usually the more common case, PLS-DA can still manufacture one, and that is overfitting. So always use PCA and PLS-DA together.
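Back to the loadings for a moment: a tiny sketch, on the same made-up data, of ranking features by absolute loading and checking the sign-versus-correlation property described above.

```r
# Same made-up data as the earlier PCA sketch (signal in met1..met5)
set.seed(42)
X <- matrix(rnorm(30 * 50), nrow = 30, dimnames = list(NULL, paste0("met", 1:50)))
y <- rep(0:1, each = 15)
X[y == 1, 1:5] <- X[y == 1, 1:5] + 2

pca <- prcomp(X, scale. = TRUE)

# Rank features by the absolute value of their PC1 loading
head(sort(abs(pca$rotation[, 1]), decreasing = TRUE))

# The sign of a loading tracks the direction of the feature's correlation
# with the PC1 scores (and so with whichever group separates along PC1)
cor(X[, "met1"], pca$x[, 1])
```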
Question: looking at the upper two plots, my question has always been how much counts as modest separation versus no separation whatsoever. In the upper PCA, do you consider that to be no separation, or is that hard to say?

Answer: you can probably see some separation there; there is some, and if you use a different normalization you can see more. I know which dataset that is; it is from clinical samples, so the separation is not good, but it is not nothing. On the other hand, with PLS-DA, which I will discuss in a moment, there is validation: if you run a permutation test you can see how good the separation really is, which helps stop you jumping to the conclusion that there is a strong signal. For the PCA in that top-left panel by itself, I honestly could not tell you.

So, PLS-DA is based on partial least squares regression. What does it do internally? I can tell you, because I wrote the code, and it is all open source: it converts the class labels into numbers and performs a PLS regression against those numerical values. It is susceptible to overfitting, producing patterns of separation even when there is no real signal, so we need to perform cross-validation and permutation testing; we will discuss both concepts more later.

Overfitting is very, very important, so do not tune out now. Do not believe that some advanced method will magically solve your problem: the more advanced a method is, the more likely it is to overfit and to please you. What is overfitting? This figure shows data points and candidate lines fitted through them. The straight line is underfitting: it follows the trend a bit but does not really fit. The curve that connects every single dot is overfitting. Why is overfitting bad? Within any population there is variation around the true pattern; that is noise, and the middle curve is the truth. Once you start fitting the noise itself, the model becomes error-prone and bad. Notice also that to draw the wiggly curve you have to estimate many parameters, and those parameters become very sample-specific, whereas the simple fit has only one or two. The overfitted model is unrealistic and will not generalize: use it to diagnose a new patient and it will probably be completely wrong, while the well-fitted model will work, and the underfitted one will work somewhat, just not as well.

Question: you usually produce indicators to evaluate the goodness of fit and determine whether you are overfitting; when you do PLS you have the root mean squared error of prediction from validation. For PLS-DA do you produce the same indicators?

Answer: yes, I think so; we calculate the cross-validated R-squared and the Q-squared, so we try to follow best practice as much as possible. Because MetaboAnalyst's target users are not computer scientists or statisticians but regular bench scientists, it needs to be user-friendly and also to prevent certain errors, and in this workshop I want to spread some caution too: do not jump to conclusions.

So, cross-validation. How do you do it? You train your model using one part of the data and try to predict the part you have not used.
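Here is a minimal sketch of that dummy-coding idea using the open-source pls package (my choice of library for illustration, not necessarily the exact routine MetaboAnalyst calls internally; the data are the same made-up matrix as before). The fold mechanics behind the validation argument are described next.

```r
# install.packages("pls")   # if needed
library(pls)

set.seed(42)
X <- matrix(rnorm(30 * 50), nrow = 30, dimnames = list(NULL, paste0("met", 1:50)))
y <- rep(0:1, each = 15)                 # class label dummy-coded as 0/1
X[y == 1, 1:5] <- X[y == 1, 1:5] + 2

# PLS-DA as a PLS regression on the numeric dummy label, with built-in
# cross-validation (segments = 3 gives three-fold; validation = "LOO"
# would give leave-one-out)
fit <- plsr(y ~ X, ncomp = 5, scale = TRUE, validation = "CV", segments = 3)

R2(fit, estimate = "train")    # R2: fit on the data the model has already seen
R2(fit, estimate = "CV")       # Q2: cross-validated R2, the honest number
RMSEP(fit)                     # root mean squared error of prediction
```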
For example, with three-fold cross-validation we split the data, say 30 samples, into 10, 10, and 10, and we try to keep the folds balanced, so every fold of 10 has five controls and five diseased if we can do balanced sub-sampling. Then you train on two folds and predict the third, compute its performance, rotate so each fold is held out once, and average the performance: how many predictions were wrong, what the R-squared or Q-squared is. That gives you a good feeling for how good the model is.

Sometimes people want to be more granular and do leave-one-out cross-validation: if you have 100 samples, you train on 99, predict the one left out, and repeat for every sample. This is also more reproducible, in the sense that every sample is guaranteed to be held out exactly once, whereas with three-fold cross-validation the random draw of 10, 10, 10 differs between runs, so different tries can give slightly different results. The only drawback of leave-one-out is that it is more computationally intensive. And if you have a large sample, the choice stops mattering: three-fold, five-fold, leave-one-out, it is all the same, because once you get to, say, 500 samples things saturate, the parameter estimates become very stable, and going from 500 to 1000 or from 1000 to 2000 changes nothing. It is with 10 or 20 samples that you see a lot of fluctuation, because there are not enough samples to estimate the parameters.

So, for a PLS-DA model, what are the performance-evaluation values? The sum of squares captured by the model is the R-squared; the cross-validated version is the Q-squared; and there is the prediction accuracy. If you are a traditional statistician, or a reviewer asks what these values are, they are all provided. I will not go into too much detail, because they all describe the same thing, the quality of your prediction. Also keep in mind that PLS does a regression first and then converts it into a classification.

Then there is the permutation test, which is the same empirical p-value concept as before. Because PLS-DA is very susceptible to overfitting, it will try to please you regardless of what the data are, so let us see how it can trick you. Assume there is no real separation: we randomly shuffle the class labels and let PLS-DA do the classification anyway. In this case we do not judge visually; we compare the R-squared, the Q-squared, or the prediction accuracy between the real labels and the shuffled ones. This example was actually computed from the same data someone just asked about: is that separation real or not? It turns out to be robust. You run the permutations and compare how good the separation is with the original labels versus the randomly permuted labels. Even with random labels PLS-DA still produces some separation, because it is trying to please you, but it is narrow, while the separation with the real labels sits far beyond anything the permuted labels achieve. So there is definitely signal being captured there.
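Here is a hand-rolled sketch of that permutation logic on the same made-up data. For simplicity, a between-group-to-within-group variance ratio stands in for the model's R2/Q2 as the separation statistic; the label shuffling and the empirical p-value are the point.

```r
set.seed(42)
X <- matrix(rnorm(30 * 50), nrow = 30, dimnames = list(NULL, paste0("met", 1:50)))
y <- rep(0:1, each = 15)
X[y == 1, 1:5] <- X[y == 1, 1:5] + 2

# Separation statistic: between-group / within-group sum of squares
sep_stat <- function(X, y) {
  total  <- sum(scale(X, scale = FALSE)^2)
  within <- sum(sapply(split(as.data.frame(X), y),
                       function(g) sum(scale(as.matrix(g), scale = FALSE)^2)))
  (total - within) / within
}

obs  <- sep_stat(X, y)
perm <- replicate(999, sep_stat(X, sample(y)))   # shuffle the class labels
(sum(perm >= obs) + 1) / (999 + 1)               # empirical p-value
```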
Question: so is the permutation test an alternative to cross-validation?

Answer: no, they are completely complementary. For PLS-DA you need to show both the cross-validation and the permutation test. For many other methods, like SVM or random forest, just one is usually fine; sometimes I also run the permutation there, but those two methods are much less susceptible to overfitting, so cross-validation alone is a fair indicator of their performance. For PLS it is not: PLS-DA is so heavily abused, in my view, that you need the double check.

The other PLS-DA output is the VIP score, the variable importance in projection. It plays a similar role to the loadings; you can use the loadings, but a lot of people want the VIP, which is fine. It is a different formula that captures more information: a weighted sum of the squared correlations between the PLS-DA components and the original variable, with the weights corresponding to the percentage of variation explained by each component of the model. People always want a cutoff, and in the literature most people treat a VIP above one as important. But really, look at the ranking of the VIP scores: often there is a clear gap at some point, and the group above the gap clearly stands apart from the rest. Then that group is enough, and you do not need to insist on exactly 1.0; sometimes the natural break sits a little above or below one.
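For the curious, here is a sketch of computing VIP scores in R, adapted from the helper function circulated in the pls package FAQ (an external recipe, not something from this lecture); it assumes the orthogonal-scores algorithm and a single response.

```r
library(pls)
set.seed(42)
X <- matrix(rnorm(30 * 50), nrow = 30, dimnames = list(NULL, paste0("met", 1:50)))
y <- rep(0:1, each = 15)
X[y == 1, 1:5] <- X[y == 1, 1:5] + 2

fit <- plsr(y ~ X, ncomp = 2, scale = TRUE, method = "oscorespls")

vip <- function(object) {
  # sum of squares of y explained by each component
  SS    <- c(object$Yloadings)^2 * colSums(object$scores^2)
  Wnorm <- colSums(object$loading.weights^2)
  SSW   <- sweep(object$loading.weights^2, 2, SS / Wnorm, "*")
  # rows = components (cumulative), columns = variables
  sqrt(nrow(SSW) * apply(SSW, 1, cumsum) / cumsum(SS))
}

head(sort(vip(fit)[2, ], decreasing = TRUE))   # VIP over 2 components; > 1 = "important"
```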
Question: if you then go back and remove the variables with low VIP scores, doesn't that reduce the number of variables and make the model less prone to overfitting?

Answer: so the question is whether you should go back to the PLS-DA model and remove the variables that ranked very low, since they seem uninformative. Will it improve the model? Of course it will, but doing it this way is unethical, because you have already seen the overall result, and then you go back, remove what performed poorly, and retrain. The performance value will creep upward: if it was 85 before, it becomes 86, 87, and you can keep gradually improving it, because the model building already tries its best to estimate the signal, and once you have seen the whole picture and feed that back in, you will definitely improve the numbers. But is the model actually better? Probably not, because you are tuning the fit to your particular dataset and making it more likely to be overfit.

Comment from the audience: in clinical medicine you would routinely do backward stepwise regression, starting with everything that shows some significance and successively removing variables; in that setting it is quite common.

Answer: yes, and what I am calling unethical is this: once you have seen all the data, ranked the variables, whether by VIP or by t-test p-values, removed the ones that rank poorly, and rebuilt the model, you have already violated something, because you have already used the outcome label. Think about how the p-value in a t-test or the VIP score is calculated: they use both x and y, so they already know which variables are less associated with your class labels. Once you act on that, you have already used that information.

The whole point is this: you will probably end up with a more predictive model, which is all fine, but it needs to be validated on a new set of patients that you never used. Once you do that, everything is fine and I have no objection. The problem is using the current data to trim away the uninformative variables and then telling the reviewer you are 99 percent accurate on that same data, which is not true. Your model itself may well be good; its real performance is 85 percent, not 99. Why does it show 99? Because you kept manually trimming the uninformative variables. So most likely the model itself, the key features, the weights, the parameters, is all right; it is the reported performance that is over-optimistic. And that over-optimism is what moves a paper from a regular PLOS ONE to Nature Biotechnology, because 99 percent looks amazing; it just is not real, since you evaluated on the same data you trained and trimmed on. If you validate your final model, after whatever manual trimming, on a new patient cohort that was never used to build it, and report that performance, that is all fine.

Comment from the audience: I get some of what you are saying; let me bring it up in the lab session, because it seems at odds with how we do multivariable regression in clinical medicine in terms of what counts as overfitting.

Answer: yes, let us talk about it. I think part of the answer is that model building and biomarker discovery are different things. What we are doing here is exploratory statistical analysis; biomarker development is an iterative process, and for real model building you do need a new cohort.

Now, assessing classification model performance. It is intuitive if your data are balanced, half and half. Accuracy is the number of correct predictions out of the total: if you predict 13 samples and 9 are correct, that is 69 percent accuracy. The other commonly used number is the error rate, which is just one minus the accuracy, 31 percent here. I mention both because different tools in MetaboAnalyst report one or the other, so you need to switch back and forth between them.

Unfortunately, clinical data are often heavily imbalanced. Take HIV screening: five cases in 1000 samples. If you simply predict that everyone is healthy, you get 99.5 percent accuracy. A majority vote looks easy and scores wonderfully, but it is not right and not useful. There are screening biomarkers and diagnostic biomarkers, and for different purposes you balance the metrics differently; clinical applications have very strict requirements. Over the years, sensitivity and specificity have become more and more the standard, built from the true positives, true negatives, false positives, and false negatives, and you need to develop a feeling for what they mean. Sensitivity is the true positive rate: of the samples that are actually positive, the percentage you correctly call positive. Specificity is the true negative rate: of the samples that are actually healthy, the percentage you correctly call healthy.
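In code these definitions are one-liners; the counts below are made up purely to show the arithmetic.

```r
# Hypothetical confusion-matrix counts
tp <- 40; fn <- 10    # diseased: 40 correctly flagged, 10 missed
tn <- 45; fp <- 5     # healthy:  45 correctly cleared, 5 false alarms

accuracy    <- (tp + tn) / (tp + tn + fp + fn)
error_rate  <- 1 - accuracy
sensitivity <- tp / (tp + fn)   # true positive rate
specificity <- tn / (tn + fp)   # true negative rate
c(accuracy = accuracy, error = error_rate,
  sensitivity = sensitivity, specificity = specificity)
```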
For example, you have the healthy distribution, the true negatives, on the left, and the true positives on the right, and you can see there is no clean separation, so you have to pick a decision point, a cutoff. Given a cutoff, you can read off the regions: samples on the blue side that really are positive and are predicted positive are true positives; samples that are actually positive but fall on the other side and get predicted negative become false negatives; and the same logic gives you the false positives and true negatives on the healthy side. You can slide the cutoff threshold left or right, and in doing so you trade how many false positives you produce against how many false negatives. Because the distributions overlap, there is no perfect solution; you have to decide the cost. What is the consequence of a false negative versus a false positive? That depends entirely on the clinical question. If someone is actually positive and we call them negative, we miss them, and the consequence is probably very high; whereas if we say someone could be positive and send them to the next round of screening when they are actually healthy, that is probably more tolerable. So you move the bar back and forth depending on those costs. In other words, the cutoff is not really a computational question; we can always calculate the cutoff that gives the best balance, but in a real-life clinical application many other considerations come in.

One tool for making that decision is the ROC curve, the receiver operating characteristic curve. The name is historical, from early signal-detection studies, and it is very widely used in biomedical applications, both for assessing classification performance and for comparing different biomarker models. Essentially it plots the true positive rate against the false positive rate: you take your classifier, run through different cutoffs, compute the true positive rate and false positive rate at each cutoff, plot those points with false positive rate on the x axis and true positive rate on the y axis, and connect them; that is your ROC curve. So it is not completely straightforward: you first build a classifier, then predict, then step through cutoffs and compute the rates at each one, which is best done by a computer, not by hand.

From the curve you can see the trade-off between sensitivity and specificity: at one end you get high specificity but lower sensitivity, at the other high sensitivity but lower specificity, and somewhere in between is the balanced region. The computer draws the curve, but you still have to decide which part of it you want to operate at. For comparison you can compute the area under the curve, the AUC: the bigger the value, the better the performance, and it lets you compare different biomarker models. A perfect classifier hugs the corner at essentially 100 percent, and an AUC of 0.95 is excellent.
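A sketch with the pROC package (my choice of library, not necessarily the lecture's tool) on invented scores, just to show the mechanics: roc() sweeps the cutoffs and auc() summarizes the curve.

```r
# install.packages("pROC")   # if needed
library(pROC)

# Invented scores: diseased samples tend to score higher, with overlap
set.seed(42)
labels <- rep(c(0, 1), each = 50)
scores <- c(rnorm(50, mean = 0), rnorm(50, mean = 1.5))

roc_obj <- roc(labels, scores)   # steps through cutoffs, computing TPR/FPR
plot(roc_obj)                    # the ROC curve
auc(roc_obj)                     # area under the curve
coords(roc_obj, "best")          # the cutoff with the best computed balance
```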
A random guess, 50-50, gives the diagonal, which is an AUC of 0.5; this example is around 0.7. In practice we see a lot of AUCs in the 0.7 to 0.75 range, which is quite common, and if you can get up to 0.8 or 0.85 it becomes more promising.

Question: so the ROC curve can only be used with classifiers for two classes? It is defined for only two classes; can we use it for multi-class classifiers, or do we have to use another technique to validate those?

Answer: the ROC curve, as you see, is defined through sensitivity and specificity, so it applies only to yes-or-no, true-or-false, binary problems. Whatever multi-group classifier you have will still be able to handle two groups, so to draw an ROC curve you have to reduce the problem to two groups and decide which group counts as positive and which as negative. If you have, say, control, low, and high groups, you can compare control versus low, or control versus high, and so on; that is simply how it is defined. You can use accuracy and other performance values for multi-group problems, but the ROC curve is strictly two groups.

There are other supervised methods commonly used in metabolomics. One is SIMCA, soft independent modeling of class analogy; this is a commercial, proprietary tool, so we will not talk about it much here, and if you use SIMCA, read the vendor's manual. I am not a fan. Another is orthogonal projections to latent structures, which is also available in MetaboAnalyst, again built on open-source R; and you should say "orthogonal PLS" rather than the trademarked abbreviation, because I once received a legal notice from the company, which does not even want me to mention the name. Then there are SVM and random forest, which are available there too; they are good models, so use whichever you want. SVM and random forest are very widely used, but mostly treated as black boxes. The important concepts are the ones I have already covered, cross-validation and variable importance; once you have those concepts, you can use these methods as they are. There is no way I can give you an intuitive explanation of SVM internals, because it is genuinely complicated, and a formal introduction is not suitable for this audience, but feel free to explore, interact with me during the lab session, and note that because these methods are so well used, a lot of other material is available.
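Since random forest keeps coming up, here is a minimal sketch with the randomForest package (same invented data; the package choice is mine): the out-of-bag error it reports is itself a built-in form of cross-validation, and importance() gives a feature ranking analogous to loadings or VIP.

```r
# install.packages("randomForest")   # if needed
library(randomForest)

set.seed(42)
X <- matrix(rnorm(30 * 50), nrow = 30, dimnames = list(NULL, paste0("met", 1:50)))
y <- factor(rep(c("control", "disease"), each = 15))
X[y == "disease", 1:5] <- X[y == "disease", 1:5] + 2

rf <- randomForest(x = X, y = y, ntree = 500, importance = TRUE)
rf                    # prints the out-of-bag (OOB) error estimate
head(importance(rf))  # per-feature importance, analogous to loadings / VIP
```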
So again, to re-emphasize what I said before: start simple, with univariate tests like the t-test and ANOVA, then move to unsupervised multivariate methods, then to supervised ones, and only then to machine learning. Do not go the other way around: the fancy methods are harder to interpret, and you will probably overfit your data. If your data already work fine with PCA, you can just develop your hypothesis and write your manuscript; you do not need very fancy tools. Of course you can use PLS-DA to further sharpen the signal, but the story is already pretty well set.

On the overall progression through an analysis: PCA and clustering are unsupervised, based on the natural structure of your data. The class label is not used and no prior knowledge goes in, so there is no ethical issue of having seen the label, knowing the secret, and going back to adjust things; you can use these methods completely safely. Once you move to supervised methods, you start using prior knowledge, the class labels. If you already see separation in PCA or clustering and you use a supervised method to improve the signal and further refine your hypothesis, that is all fine. But if there is no separation in PCA or any unsupervised view and suddenly there is a good separation in the supervised analysis, you really need to be cautious at that moment. On statistical significance for supervised methods, especially PLS-DA, which can be misleading because it always tries to please you, you should run the permutation test; MetaboAnalyst already provides it for PLS-DA, so we will not go over that again.

There are no hard rules I can give you about what to do and what not to do, only general guidance. Every time, think actively about your data, about what you expect and what is surprising, and when something looks too good, it is usually too good to be true; that is the moment to stop and think. You can ask me, or look at what other publications did, though sometimes you will find review papers where certain tools were really used improperly. You spend so much money and so much time on these experiments; really try to think about your data, think about the methods, and the marriage of the two is what makes your paper an enjoyable read.

We are actually running early, so, any questions?

Question: coming back to the false discovery rate. When we do a multivariate analysis with all the variables, do we apply FDR correction, or is that only for univariate analysis, variable by variable?

Answer: let me rephrase your question, and first a correction: the false discovery rate is designed for univariate analysis, where we test the individual features one after another with a t-test or ANOVA, maybe twenty thousand times; that is where FDR applies. When we do a multivariate test, we consider all variables simultaneously and we test only once, so there is no multiple-testing problem and no FDR adjustment. The features are simply ranked, by VIP, by loading, or by correlation coefficient, and the top ones, say the top 20, are the candidates suggested by the multivariate model. There is no false discovery correction because there is no repeated testing, just a ranking based on the weights.
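To make the univariate side concrete, a sketch of where FDR actually applies (same made-up data): one t-test per feature, then a Benjamini-Hochberg adjustment across all the p-values.

```r
set.seed(42)
X <- matrix(rnorm(30 * 50), nrow = 30, dimnames = list(NULL, paste0("met", 1:50)))
y <- rep(0:1, each = 15)
X[y == 1, 1:5] <- X[y == 1, 1:5] + 2

# One univariate test per metabolite: this is the multiple-testing situation
p_raw <- apply(X, 2, function(f) t.test(f[y == 0], f[y == 1])$p.value)
p_fdr <- p.adjust(p_raw, method = "fdr")   # Benjamini-Hochberg correction

sum(p_raw < 0.05)   # raw hits, some expected to be false discoveries
sum(p_fdr < 0.05)   # hits surviving the FDR adjustment
```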
Question: sometimes, when there is a whole set of variables, it feels similar to the FDR situation: if I just choose the top-ranked ones, shouldn't there be something like an FDR correction for that?

Answer: I am not aware of an FDR procedure designed for PCA or these multivariate scores; that is new to me, so unless you show me the paper I cannot really comment. It may be theoretical rather than something applicable here.

Question: for evaluating a classification model, you said we should use either permutation or cross-validation. Are they equally effective, or are there scenarios where you have to use cross-validation and others where you need permutation?

Answer: cross-validation is more commonly used in machine learning; the permutation test has only started to be widely used in recent years. They are testing different things. Cross-validation tells you how good your model is: say 95 percent, meaning roughly 95 times out of 100 it will get the prediction right. The permutation test gives you a p-value: if it is significant, it means there is real signal in your model, that it is not random. It does not tell you how good the model is, only that there is signal, which is different from random. So they are very complementary; they are not talking about the same thing.