Okay, module six. We are going to focus on more metabolomics-specific data analysis, and we will cover three topics. One is the basic steps in LC-MS spectral processing. Then we will focus on functional analysis: first for quantitative metabolomics — essentially the concentration data we worked with yesterday — and how we move from peak tables, or tables of compound concentrations and lists of significant compounds, to their functions. The last topic is how we do functional analysis for untargeted metabolomics directly from LC-MS peaks. In module five we taught statistics and machine learning, which are neutral — you can apply them to metabolomics or to other omics. Now we focus on what is specific to metabolomics. Metabolomics has some unique aspects, such as targeted versus untargeted workflows, and my emphasis here is that the raw data processing is different: the LC-MS spectral processing you are going to learn is unique, and we try to make it as smooth, automated and easy as possible. Yesterday you worked from GC/LC data that had already been processed, so this part really is new. After that we will do the functional analysis. Although I am presenting it as two topics — quantitative metabolomics and untargeted LC-MS metabolomics — you should feel the similarity: the concept of enrichment analysis is essentially the same in both, and we try to emphasize that. We want you to see a consistent pattern that you can leverage across both, rather than treating them as completely separate things; there is one consistent framework.

Let us start with the data input. As I mentioned, a metabolomics experiment cannot be done with one sample or one replicate. We usually need many replicates and a cohort; many samples give you confidence because they are representative of a population, so you can extrapolate using p-values. Metabolomics also sometimes includes technical replicates, because in the early days instrument stability was not that high, so we ran technical replicates and averaged them. That is less common now, but we still often have both biological and technical replicates. As you saw in yesterday's slides, there is targeted (quantitative) metabolomics and untargeted metabolomics, sometimes called global metabolomics. For targeted metabolomics, data generation uses calibration curves and internal standards, so you actually get absolute quantification; you obtain a concentration table, do statistical and pathway analysis, and get biomarkers and pathways — quite similar to what we have done with gene expression data. For untargeted metabolomics there is a slight difference: we start from the peaks and want to use them directly to find patterns, and then annotate only the significant peaks, because annotating and identifying peaks takes time — it is one of the bottlenecks. But today we want to share more or less the same workflow, so we can go almost directly from peaks to functions without that extra step; you will get a feel for this by the end of the module. MetaboAnalyst is designed to work with all of the most common types of data matrices.
For targeted metabolomics you generate a concentration table that you can upload directly; sometimes you need to add labels if your data does not have them. For untargeted NMR you can do spectral binning — an algorithm chops the spectrum into individual bins. For LC-MS (or GC-MS) data you can upload a peak list, generated with your vendor's software, proprietary software or whatever software you prefer, and you can manually annotate some of the peaks if you wish — that does not really matter. The other option is to upload the raw spectra. This is the truly untargeted route, and the format must be a common open format: mzML, mzXML or mzData. We do not support proprietary formats, simply because they are proprietary.

So now we start with LC-MS raw spectral processing. Yesterday David already described what LC-MS raw spectra look like: they are very large and noisy. With a high-resolution instrument you can easily get gigabytes per run, so it is quite demanding to upload data like this and process it. What we need to do with LC-MS spectra is find the MS features. A feature is usually defined by its m/z value, retention time and intensity. On the left you can see that each peak has a two-dimensional coordinate that serves as its tentative ID or tag — a retention time and an m/z value — and this is really the unit of untargeted data analysis. But keep in mind that each compound can give rise to multiple mass signals: David mentioned neutral losses, adducts and isotopes. These are all very informative, and we can use them for compound annotation; hopefully next year we will also cover MS2 and how to leverage it for annotation. All of these are MS features — they are not noise, and they are useful for compound annotation and functional analysis. The purpose of processing raw MS data is to detect, quantify and align all these MS features across all the samples: we want a table of features, indexed by retention time and m/z value, with their quantitative information for subsequent statistical analysis. The quantitative information here is peak height or peak area. We understand these are not absolute quantification, but within the same instrument and the same batch they are comparable. They do suffer from matrix effects, but if we are talking about a similar matrix and the same type of samples run on the same machine, they are comparable and statistics can be performed robustly. You just cannot compare across different studies, locations or instruments — that is a limitation — but within the same machine and matrix, run around the same time, it should be very stable. Nowadays the instruments are quite robust, and with internal standards you should be able to achieve stable spectra; if not, you need to calibrate your machine.
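To make that concrete, here is a toy illustration of the kind of feature table this processing aims to produce — the tags, values and sample names below are made up for illustration only.

```r
# One row per MS feature (an m/z + retention-time pair), one intensity column per sample
feature_table <- data.frame(
  feature  = c("M85T34", "M147T129", "M203T88"),  # tag built from rounded m/z and RT
  mz       = c(85.0295, 147.0764, 203.0821),
  rt_sec   = c(34.2, 128.8, 88.4),
  sample_A = c(1.2e5, 8.4e6, 3.1e4),              # peak area (or height)
  sample_B = c(1.4e5, 7.9e6, 2.2e4)
)
```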
Next, profile versus centroid mode. I think David mentioned this, and we will cover it a little here: we do not upload raw profile spectra to MetaboAnalyst. The raw spectra generated by the instrument are in profile mode, and we need to centroid them — convert the MS data into centroid mode by condensing each Gaussian profile into a single centroid. How do we do that? We use ProteoWizard. If you deal with raw vendor data, you need to install it and do the conversion. It is very versatile and widely used in both proteomics and metabolomics. We will not ask you to do that today, because our example data is already in an open format, but if you need it you should download it — it is very commonly used.

Now, what are the options for raw spectral processing? Here are several of the most popular ones. XCMS was first released in 2006; then there are newer tools — MZmine, now at version 3, MS-DIAL, and OpenMS. OpenMS was originally developed for proteomics but has since been adapted to metabolomics as well. All of these can process raw spectra. We are going to focus on XCMS. Why? MetaboAnalyst's underlying algorithm is based on XCMS. We have also run a lot of evaluations of different tools, and we find XCMS very robust and efficient across platforms — both high-resolution and low-resolution data are fine. We closely follow the development of the tool, we have done a lot of optimization of its parameters, and we are very comfortable offering it because we use it internally a lot. XCMS also has the largest community and plenty of help available, and you can use it for LC-MS, MS/MS, multiple reaction monitoring and isotope labeling — it is very versatile. The R package is well documented and quite handy, and you can modify it for different purposes.

If you look at how XCMS was originally designed, this is the overall high-level flow. You read the raw MS data — mzML or CDF — and do peak detection, which runs on each individual sample. Then you do peak matching across the different samples, because there can be shifts in retention time; this requires retention time correction and alignment, to make sure peaks generated by the same compound are compared with each other across samples — if you miss that, you are basically comparing apples to oranges. Then comes peak filling: after aligning the peaks you will notice that some peaks are present in many samples but missing in others, so you should guess that those peaks are probably present there too, and you can ask the algorithm to zoom in on that region and try to extract them. Overall, peak detection, alignment and filling are iterated a few times to get a good result. After that you get a peak intensity table, you can do statistical analysis directly from it, and of course you can also do annotation from that result. The most important algorithm in XCMS is called centWave. There is another peak detection algorithm called matchedFilter, which was designed for low-resolution spectra; centWave, when it was designed (around 2010, I believe), was aimed at high-resolution data — though today's resolution is much higher than what was typical then.
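As a rough sketch of that flow using the current xcms (version 3) R interface — file names, sample groups and parameter values below are placeholders, and in MetaboAnalyst these steps are wrapped and auto-tuned for you:

```r
library(xcms)
library(MSnbase)

# Vendor files are assumed to have been converted to centroided mzML beforehand
# (e.g. with ProteoWizard's msconvert)
files <- c("KO_1.mzML", "KO_2.mzML", "WT_1.mzML", "WT_2.mzML")
raw   <- readMSData(files, mode = "onDisk")

# 1. peak detection per sample -- centWave for high-resolution data
cwp <- CentWaveParam(ppm = 10, peakwidth = c(5, 20), snthresh = 10)
xd  <- findChromPeaks(raw, param = cwp)

# 2. retention time correction / alignment across samples
xd  <- adjustRtime(xd, param = ObiwarpParam())

# 3. correspondence: group peaks of the same compound across samples
pdp <- PeakDensityParam(sampleGroups = c("KO", "KO", "WT", "WT"))
xd  <- groupChromPeaks(xd, param = pdp)

# 4. gap filling: recover features missed in individual samples
xd  <- fillChromPeaks(xd)

# feature table: rows are m/z x RT features, columns are per-sample peak areas
ftab <- featureValues(xd, value = "into")
```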
The centWave algorithm works very well; it considers many factors — peak height, peak shape, how signals are defined — and overall there are something like 30 or 40 parameters to tune to find the best-performing combination. This part actually scares a lot of people away. In the early days of these metabolomics workshops this was the most challenging part, because even I — and I can write and understand all the code — do not always know why you would choose a particular parameter value: it depends on the data, it depends on the instrument, and different machines, different data and even different people lead to different preferences. The XCMS group at Scripps does provide defaults for different instrument types — a TOF and an Orbitrap have different defaults — but we found that even the defaults do not transfer across instruments of the same type, because the sample matters and the chromatography column matters too.

So the best approach is to make it automated. We automated the whole thing based on your data: we start from the default for your sample and instrument type, then optimize based on the data itself. This is what you are going to use — the auto-optimization workflow. You upload the data, make a very high-level choice about instrument type and positive or negative ion mode, and we run the optimization and return the best parameters, rather than asking you to specify parameters you have no idea about — most people find that hard and need many test runs. As I mentioned, when it comes to trial and error and repetition, a machine can do it very well.

How do we actually do that? We need to find the signals we believe are more likely to be true, pick up those peaks, and evaluate whether a given set of parameters produces peaks that look real. We also do not want to use the whole spectrum, because in high-resolution LC-MS most regions have no signal — they are flat, basically just noise. Instead we use only some regions, chosen so that they contain real signal, and use those to tune the algorithm and obtain the parameters. You can see here that we identify the regions containing signal and extract just those; at a high level they contain signal, but within them both high- and low-intensity stretches are present, which is sufficient. We do not use the whole spectrum because that would pull in a lot of baseline noise.

Then we start to train: how do we find the best combination of values for these parameters? There are many parameters, and we chose the eight or so most important ones — peak width, peak height, m/z difference, signal-to-noise threshold, and several others we found most meaningful — and we vary their values away from the defaults. Then we compute a quality score, QS. It may look scary, but what it does is reward parameter settings that maximize a few criteria. The first is the ratio of reliable peaks, identified by their isotopes.
There are a lot of false-positive or noise peaks. David mentioned that if we really clean an untargeted peak table, we can go from around 12,000 peaks down to 2,500 or a few thousand — the majority of peaks get removed, because they are really noise; many regions contain nothing but noise. What do real peaks look like? Several things. One: they have isotopes — isotopic peaks — which means they are more likely to be true; that one is very clear. The second is the Gaussian peak ratio: real peaks have a Gaussian shape. This is a tried-and-true criterion — if you run these instruments you know it is a reliable identifier. The third is consistency: a true signal should give a stable group of features that appears across the pooled or QC samples, not something random. So we give these criteria different weights, and we reward parameter sets that produce reliable features — ones that appear consistently across samples, have a Gaussian shape and have isotopic support. Whichever parameter set achieves the higher score, we take as the best set.

So this is an empirical approach: we select the most meaningful regions, we select the set of parameters, we select the most meaningful criteria, and we optimize. We then wanted to see how good it could be, so we compared against the other available optimization algorithms — there are definitely others trying to do the same thing, such as IPO, AutoTuner and Opti-LCMS, which is the MetaboAnalyst component. I can tell you that IPO, at the time we were developing this, could run for weeks without giving you a result, because it uses the whole spectrum and tries combination after combination — they really want to give you the best result, but in practice it just cannot finish, because high-resolution data takes a long time, so it gets stuck. We wanted a reasonable runtime while still getting the best result. We implemented our algorithm and tested it on a standard mixture and on real IBD samples collected on an Orbitrap. You can see that our approach gives quite similar results, but notice the scale: theirs is days, ours is minutes — we can get things done in a few minutes. And comparing XCMS without auto-tuning against XCMS with the optimizer, we actually get better, higher-quality peaks: more isotopes, more adducts, more formulas assigned. Overall it captures the signal better and yields more real peaks. We have run this many times and mostly get results like this — sometimes 10% better, sometimes 20% better — so it seems to work very well, and we were confident enough to put it on our website. Of course you can use the defaults, or do manual tuning if you know what you are doing; but if you do not, just let the algorithm do its best — it is really good at this.
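To give a feel for the idea — not the actual OptiLCMS implementation — here is a purely illustrative sketch of scoring candidate parameter sets on the signal-rich regions; the weights, the peak-table columns and the detect_on_rois() helper are all hypothetical:

```r
# Illustrative quality score: reward isotope support, Gaussian shape and
# consistent grouping across samples (weights chosen arbitrarily here)
quality_score <- function(peaks) {
  iso   <- mean(peaks$has_isotope)          # fraction of peaks with isotope support
  gauss <- mean(peaks$gaussian_fit > 0.9)   # fraction with a near-Gaussian shape
  grp   <- mean(peaks$in_stable_group)      # fraction grouped consistently across samples
  2 * iso + 1.5 * gauss + 1 * grp           # weighted sum; higher is better
}

# Brute-force search over a few centWave settings, evaluated only on the
# signal-rich regions (rois) extracted beforehand
grid <- expand.grid(ppm = c(5, 10, 15), pw_min = c(3, 5, 10), snthresh = c(6, 10))
scores <- apply(grid, 1, function(p) {
  peaks <- detect_on_rois(rois, ppm = p["ppm"],
                          peakwidth = c(p["pw_min"], 4 * p["pw_min"]),
                          snthresh = p["snthresh"])   # hypothetical wrapper around findChromPeaks
  quality_score(peaks)
})
grid[which.max(scores), ]   # best-scoring parameter combination
```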
Today in the lab you will use the built-in example data, run through the process and look at the results. Overall, we want to get you directly to results — for example a PCA where you can see the group patterns. You can also see the significant peaks, and for each feature you can click to see the TIC and what the raw peak looks like across the different groups; there is a box plot summarizing the groups, and each dot can be clicked to show the spectrum — it is almost as if you were connected to the instrument. If you actually run an instrument, this level of detail will make you feel comfortable; for most of us, I do not think we need to go that deep at the feature level. The philosophy behind this kind of analysis is that we hide the complexity but do not hide the data — if you want to see it, you can see it. You just get a result, we do a lot of quality checking behind the scenes, but everything is there: you can see the visualizations, you can see the R code, you can see the parameters.

Once we have MS features across samples, we have basically opened a new door, and suddenly you can do almost everything. You have a data table of peaks, so you can do statistical analysis, which does not care whether the values are concentrations or peak intensities; you can do biomarker analysis — of course those biomarkers will not be named compounds but peak features, and you will need to spend more time identifying what they are; and you can do functional analysis and even pathway analysis, because with high-resolution peaks you can nowadays map directly into pathways with good accuracy. So up to this step there is already a lot you can do, and without spending years identifying every compound you are very close to writing your paper; you may need a bit more work, such as MS/MS, to identify a few key compounds, but global or untargeted metabolomics with the current tools has become very smooth and streamlined.
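To make that "new door" concrete, here is a minimal sketch of statistics run straight off the feature table (ftab from the xcms sketch above, assumed complete after gap filling; groups is a hypothetical two-level sample-class factor):

```r
x <- t(log2(ftab + 1))                      # log-transform; samples as rows
pca <- prcomp(x, center = TRUE)
plot(pca$x[, 1:2], col = groups, pch = 19,
     xlab = "PC1", ylab = "PC2")            # quick look at group separation

# feature-wise t-tests with FDR adjustment -- each "biomarker" is still a peak,
# not yet an identified compound
p     <- apply(x, 2, function(v) t.test(v ~ groups)$p.value)
p_fdr <- p.adjust(p, method = "fdr")
head(sort(p_fdr))
```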
One more thing I would like to mention: there is a new generation of spectral processing software called asari. It is Python-based and we are currently evaluating it; it seems to be about 10 times faster, and the results are quite similar to XCMS, so we will probably add it in the future, but we still need to make sure it performs well across different platforms — so far it is designed for Orbitrap data. The feeling is similar to RNA-seq: there is an algorithm called kallisto, and before it, to process RNA-seq you needed a supercomputer with a lot of memory, but with kallisto you can run it on a laptop, quickly — amazing and hard to believe, yet the results are almost the same as the heavyweight algorithms like HISAT or TopHat. It is all about how you program the algorithm and strip out the redundant, memory-intensive parts. I think asari will be published soon; so far it is on bioRxiv, so if you are eager and hard-core you are welcome to read it — like us, they use the properties of high-resolution data to simplify the analysis. Any questions? Okay, good — no questions, let's move on.

So: we have peaks, we have compounds, and we want to understand the functions. What are functions? Here we show a KEGG map, and yesterday we showed SMPDB, the Small Molecule Pathway Database. When we talk about functions we usually think about pathways — we are talking about biology; the two are almost equivalent. How do we get there? We need to link our individual peaks and compounds to their groups. The groups are metabolite sets or gene sets, which can be pathways, joint pathways or networks. Once we see that the members are connected and working together on something, we know they are carrying out a function. And how do we test whether such a group is active? We use a procedure called enrichment analysis. Enrichment analysis is very intuitive: it simply tests whether biologically meaningful groups of metabolites or genes are significantly enriched — or significantly depleted — in your data. That is exactly what we want it to do; we just give it the name enrichment analysis.

What do we mean by biologically meaningful? We have mentioned pathways, but as David said, we have roughly a quarter of a million metabolites and far fewer pathways, so we need to expand the notion of function beyond pathways. What are the possibilities? There are signatures associated with diseases (disease signature databases). There are also SNPs associated with metabolites: we know that mutating certain SNPs is associated with changes in many related compounds, which means they are related. These libraries are biologically meaningful — not as explicit as pathways, but they are groups we know are correlated, so they are meaningful. That is one reason MetaboAnalyst is used so much: we have defined a lot of these functional libraries, manually curated from HMDB, SMPDB, KEGG, DrugBank, and from blood, urine, CSF and fecal reference sets; we also collected metabolite-SNP associations from genetic studies, and we include chemical class groupings (main class, subclass). All of this is meaningful — it helps you refine your hypothesis to a narrower scope for further testing, and we keep updating it. The functions have to be defined somewhere; this cannot all be done by hand each time, and building and maintaining these libraries takes a lot of our effort. We make them computable and query the underlying databases directly for the testing. Below you can see an option to use only metabolite sets containing at least two entries, and this is fairly unique to metabolomics. Why? When we talk about metabolite sets, functions, groups, we are usually not talking about single, individual compounds; the more members of a group you observe, the more confident you are that a function is involved rather than an individual compound changing. Unfortunately, metabolomics still has a coverage issue, and a lot of metabolite sets contain very few metabolites. So pay attention to how many hits a function has: more hits mean the function is more likely to be real; very few hits mean you do not really know, because a metabolite can be involved in multiple functions, and if a function is defined by only one or two members its boundary is not clear. This is unlike gene sets, which have much better coverage — often 20 or more genes — whereas most metabolite sets have fewer members.
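As a toy picture of what such a library looks like and why the minimum-size filter matters — the set names and members below are only illustrative:

```r
# Toy metabolite-set "library" as a named list -- the real libraries are curated
# from HMDB, SMPDB, KEGG, etc.
lib <- list(
  "TCA cycle"  = c("Citrate", "Isocitrate", "Succinate", "Fumarate", "Malate"),
  "Urea cycle" = c("Ornithine", "Citrulline", "Arginine", "Urea"),
  "Tiny set"   = c("Compound X")                 # single-member set, hard to interpret
)
lib <- lib[lengths(lib) >= 2]                    # the 'at least two entries' filter

sig  <- c("Citrate", "Fumarate", "Arginine")     # your compounds of interest
hits <- sapply(lib, function(s) sum(sig %in% s)) # how many hits per set
hits                                             # more hits = a more trustworthy call
```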
The figure here shows how we think about functions: functions are coordinated changes. Intuitively, a function is carried out by a group of metabolites or a group of genes; to do that, they must talk to each other — like going out to watch a movie or play soccer together. To be able to do that, they must act in a pattern, not at random. They can be up-regulated or down-regulated — what matters is significant, coordinated change. I get asked a lot: should all the metabolites be up-regulated, or all down-regulated? My answer is: it depends. A metabolite can be a product or a substrate, so what matters is that it is significantly changed. Direction in metabolomics is, I think, not that meaningful; if you use a t-test score, use its absolute value — a p-value is fine, since a p-value has no direction. In gene expression, up versus down is sometimes meaningful; in metabolomics I do not think directionality matters as much, so set it aside. We also show here that in metabolomics the changes are sometimes very strong, and features generated by the same compound must change in a similar way: if multiple peaks correspond to the same compound, they must go up or down together, because they are the same compound — that is, in fact, how peaks have sometimes been annotated in the past: they must change together.

Now, we have talked about functions, metabolite sets, pathways — the defined functions — and we need to evaluate whether these groups are enriched. Ideally the changes go in a consistent direction, but we do not require direction — we only require that the group changed relative to random; as long as it is not random, up or down is fine in metabolomics. For the enrichment analysis itself, statistics already offers two widely used approaches. One is over-representation analysis (ORA), based on Fisher's exact test or the hypergeometric test — if you do not know them, you can Google them; they are very simple statistics, but very robust. Over-representation analysis starts from the significant compounds, using a cutoff: to run the test you first need to define what the significant compounds are. You can use 0.05, you can use 0.1, as long as you believe it yields a meaningful group to test. I should also point out that you can test enrichment for any pattern, not just significant compounds: you might want to test a group of co-regulated genes, or metabolites that change in a similar way. You can manually select them and ask which functions are enriched. Most people test significant compounds defined by p-values, but you can test any set — if you see a group of 10 compounds changing together, all up or all down, you can ask which functions they involve, which functions are enriched. It is not tied to any particular statistical test; it is tied to what you want to test. You can define the group by a t-test, or you can define it by clustering.
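Here is a minimal sketch of the ORA calculation for a single metabolite set; the counts are made up, but the hypergeometric and Fisher formulations are the standard ones:

```r
N <- 2000   # metabolites in the reference library (background)
K <- 25     # metabolites belonging to this pathway / metabolite set
n <- 80     # significant metabolites in your data (e.g. t-test p < 0.05)
k <- 7      # of those, how many fall inside this set

# P(X >= k): chance of seeing at least k hits by random draws alone
p_ora <- phyper(k - 1, K, N - K, n, lower.tail = FALSE)

# Equivalent one-sided Fisher's exact test on the 2x2 table (rows: in set / not in set)
tab <- matrix(c(k, n - k, K - k, N - K - (n - k)), nrow = 2)
fisher.test(tab, alternative = "greater")$p.value   # agrees with p_ora
```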
The other approach is gene set enrichment analysis (GSEA), which is common in genomics and transcriptomics. It uses the complete ranked compound list and is cutoff-free. In metabolomics we talk about metabolite set enrichment analysis, because we use metabolite sets, but the input is the ranked compound list, not just the significant compounds. It tests whether the top or the bottom of the ranking is highly enriched for certain functions. The motivation is that whatever significance cutoff we apply is arbitrary: use 0.1, 0.05 or 0.01 and you can potentially get different results. GSEA avoids this by using the complete ranking and asking whether certain themes are enriched at the top or the bottom. So it is cutoff-free and seems less subjective, but in reality the two approaches are complementary — both are useful, cutoff-free and cutoff-based alike, so there is no need to insist on one over the other.

For functional analysis in metabolomics we reuse a lot of what we learned from transcriptomics. If you have compound concentrations, you are doing something very similar to gene expression profiling: you can directly upload the compound concentration table and run enrichment analysis against pathways or metabolite sets, testing which ones are significantly changed under your conditions. That part is smooth and streamlined. Untargeted metabolomics is slightly more challenging — how can we do it there? The reason is that we cannot directly map peaks to pathways. We can do it for genes because a gene maps into a gene set by name, and targeted metabolomics can match metabolite sets by name, but peaks cannot be matched to pathways directly, because pathways are defined by metabolites, not by peaks. This part puzzles a lot of people, who see it as the main obstacle — but we will show that we can get around it quite easily with high-resolution mass spec results. Here we focus on targeted metabolomics for this session, and in the next one we cover how to do it for untargeted data.

In MetaboAnalyst we have three different modes for enrichment analysis. The first: if you give a list of metabolite names, we do regular over-representation analysis. That list could come from a t-test, or from a clustering result where you want to know what these co-regulated compounds are involved in. The second is called SSP — single sample profiling — which takes a list of metabolite names plus concentration data. This one is unique to targeted metabolomics, because you compare your concentrations against reference values, like the clinical chemistry ranges defined in textbooks. It is really how diagnosis works — how glucose is used to diagnose diabetes: there is a reference range, and if your measured absolute glucose concentration falls outside it, you can make the call. This is not possible with untargeted data. The last one is quantitative enrichment analysis (QEA), which is similar in spirit to GSEA: you directly upload the concentration table, no cutoff is needed, and we use the globaltest algorithm, which is designed along the same lines as GSEA but is more sensitive. So these are the three ways into metabolite set enrichment analysis: over-representation analysis, single sample profiling, and quantitative enrichment analysis.
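A minimal sketch of the QEA idea, assuming the Bioconductor globaltest package (the global test mentioned above); conc, pheno and set_cols are hypothetical inputs:

```r
library(globaltest)   # Goeman's global test

# conc: samples x metabolites concentration matrix; pheno: factor of sample classes;
# set_cols: the columns belonging to one metabolite set
res <- gt(pheno, conc[, set_cols])   # does this whole set associate with the phenotype?
p.value(res)                         # one set-level p-value, no per-compound cutoff needed
```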
For over-representation analysis, you take your results, select the compounds of interest, compare them against the metabolite set libraries, run the enrichment analysis with a hypergeometric test, and get ranked pathways. For single sample profiling, you compare against the normal reference values — these come from HMDB, which provides normal concentration ranges — so if a value is really high you can flag it as abnormal, and if several compounds in the same pathway deviate in a consistent way you will detect that pathway. QEA is the same idea and very streamlined: you upload a compound concentration table, it assesses the changes, tests them against the metabolite set library, and gives you the results. So overall we take the different kinds of input, apply the appropriate algorithm, compare against the metabolite set libraries, and give you a report.

For over-representation analysis you upload a compound list. For single sample profiling you need the list plus concentrations, and because the reference values carry units, you need to make sure your units match — the example here shows urine concentrations in micromoles normalized to creatinine — so you really need absolute concentrations in standard units in order to compare. If you upload compounds with concentrations, you can compare them with the reported ranges: you will see many literature-reported ranges, and some of your values will look high or low relative to the literature. These reference values come from textbooks and literature mining collected by HMDB. When a value is flagged as high, keep in mind there may be, say, 10 studies reporting that range, each from a different region and cohort, with different confidence intervals and different means — there is a lot of variation between them — but if your measurement sits far outside all of them, you can be quite sure it is truly high; whether that reflects a measurement error or a real disease marker is then the question. My point is that to use this kind of comparison you really need fairly extreme values; much of the time your value will overlap with some studies and differ from others. For this to work well it would require every study to report with similar technology and absolute quantification — it requires a lot of community effort.
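A small sketch of the single-sample-profiling comparison just described — the compounds, units and ranges below are toy values, not actual HMDB reference ranges:

```r
# Flag measured concentrations that fall outside a reported normal range
ref <- data.frame(compound = c("Glucose", "Citrate", "Alanine"),
                  low      = c(3.9, 0.05, 0.21),   # lower bound of normal range
                  high     = c(6.1, 0.15, 0.45))   # upper bound (same units as 'measured')
measured <- c(9.8, 0.08, 0.12)                     # your absolute concentrations, same order

status <- ifelse(measured > ref$high, "HIGH",
                 ifelse(measured < ref$low, "LOW", "normal"))
data.frame(ref$compound, measured, status)
```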
From the quantitative enrichment analysis — the globaltest — we get the pathways, ranked from top to bottom, the top being the most significant; they are also coloured by significance and by average fold change, and you can choose how to display them. But you need to understand that many pathways share the same compounds: sometimes two pathways are both reported as significant, yet the underlying compounds are essentially the same set — you are really reporting the same thing twice, once for pathway A and once for pathway B. This is quite common. Another visualization we provide is the enrichment map, which links pathways to each other: if two pathways share many compounds, they are connected. So the first view, the ranked list, is cleaner but can hide some information, and the second is slightly more informative because it tells you which pathways are correlated; you can click on a particular pathway to see which underlying compounds are shared, and if you are choosing biomarkers this helps you prioritize.

People keep asking what the difference is between metabolite set enrichment analysis and pathway analysis. The enrichment analysis part is the same, but pathways carry extra information — structure, or topology. If we know the structure, we can gain additional confidence about whether the pathway has changed. For example, if the pathway is a graph and we see that a compound is changing at the periphery versus changing at the center — the red or blue nodes — the consequences are different: change a central node, whether it is the metabolite itself or the enzymes and genes around it, and you have a big influence on the whole pathway. This is the intuitive extra information: if the same number of compounds changed in two pathways, but in one of them the changed compounds sit at strategically important positions, that pathway is more likely to be altered as a whole, whereas changes that sit downstream or at the periphery may not affect the whole pathway. How do we measure this? Graph theory gives us degree centrality and betweenness centrality: high degree indicates a hub, high betweenness indicates a bottleneck, and both are important measures. What we decided not to do is collapse everything into one combined score that just tells you what to think; instead we make it transparent and give you an extra dimension. One axis is your p-values — pathways ranked from most to least significant — and the other axis is the pathway impact. If the changed metabolites fall at strategically important locations, the pathway moves out along that axis; so if you see a pathway that is both significant and has its changes at important locations, it is more likely to be genuinely affected. If you click a node here you can see that many compounds are significant and many of them sit upstream or in the center, so this pathway is more likely to be affected. Pathway analysis therefore gives you this extra, structure-based information: you can click to see how many compounds are involved, and the visualization tells you not only the group-level result but also where the hits sit — upstream, downstream, central or peripheral — which gives you more confidence about the biology, because you know the structure.
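A sketch of the topology measures on a toy pathway graph, assuming the igraph R package; using relative betweenness as the importance measure is one common choice, not necessarily the exact MetaboAnalyst formula:

```r
library(igraph)

# toy pathway graph: nodes are metabolites, edges are reactions (hypothetical)
g <- graph_from_literal(A - B, B - C, C - D, B - E, E - F)

deg <- degree(g)                            # high degree = hub
btw <- betweenness(g, normalized = TRUE)    # high betweenness = bottleneck

# pathway impact: importance of the matched (significant) metabolites relative
# to the total importance of all metabolites in the pathway
importance <- btw / sum(btw)
hits   <- c("B", "C")                        # metabolites significant in your data
impact <- sum(importance[hits])
impact
```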
The final piece is report generation. A lot of people use the tools and want to save the whole process, so we have PDF report generation with the results embedded: once you have results, you go to the last page, click download, and a report is generated for you. The thing is, once a report is generated it is static — you cannot label it, it is no longer interactive; interactivity is online. Once generated, it becomes static, which is unfortunate, but you have to accept that the offline version is more static. In practice, you put the figures in a PowerPoint and label them manually before you forget — online you can mouse over and see the names, but for a publication you put the figure in PowerPoint or similar and label the few pathways you want to highlight. We could label everything automatically, but then you would not be able to read it, so the way to do it is to decide which ones you want to emphasize and label those — it is not hard to do.

Now we have covered how to do this for targeted metabolomics — can we do the same for untargeted metabolomics? The answer is yes. We have been pushing this to the community for almost 10 years, and I think the community is now starting to accept that we can do the same thing for untargeted metabolomics, with the assumption that you use high-resolution LC-MS. If we do not do it this way — the way things were done in earlier days — then to get functions from untargeted metabolomics you collect the spectra, do the cleaning, do the peak picking and alignment, and then even the peak annotation takes weeks to months, and if you really want to confirm with standards it takes years; only after identifying most of the compounds can you understand the functions. That part is the most time-consuming, it requires deep mass spectrometry knowledge, and it genuinely prevents people from using LC-MS-based metabolomics — it slows everything down. In genomics and RNA-seq everybody understands ATGC, but nobody "understands" peaks, and going from peaks to compounds takes a lot of analytical chemistry knowledge. So how can we accelerate this and make it easier? Our thinking is that with high-resolution peaks we can do putative annotation directly on the peaks — high-resolution MS, potentially with retention time — and map directly to pathways and potential compounds. Can we do that despite the annotation errors? Yes. What we realized builds on work published by my collaborator,
Dr. Shuzhao Li, who is now at the Jackson Laboratory. The concept is similar to GSEA: if we see a function change, the change must be coordinated, so when we map peaks to candidate compounds and those compounds to pathways, a real function will show up as a pattern — it is not random — and that pattern will jointly point to the pathway we are looking for. For an individual compound the peak-to-compound assignment may be wrong, but as long as the errors are random, with high-resolution peaks the majority still point in the right direction. You can show this by simulation, and we see clearly that it is true: we recently did a simulation study and found that as long as roughly 25 to 30% of the peaks are annotated correctly — even around 13% — and the remainder is random, we can recover the perturbed pathways almost 100% of the time, no problem at all. So with a certain percentage of accurate peak-to-compound assignments, even as low as about 13%, we can find the pathways accurately. That is quite comforting, because the simulations confirmed exactly what we expected; I would put the comfortable working point at around 30% correct, with all the rest random.

So this is how mummichog works. The algorithm, developed by Shuzhao Li, is called mummichog; in MetaboAnalyst we call the module peak set enrichment analysis (PSEA), because we are literally using peak sets to interrogate the pathways, but to respect the original work it is mummichog. What it does is similar to what we have done elsewhere with empirical p-value calculation. We take the original data — the significant peaks — map them to the possible pathways, and record which pathways are enriched. Then we randomly draw from the whole peak list — the whole list, not just the significant ones — a set of the same size (in the illustration here, say eight peaks), and test which pathways are enriched for that random draw. We repeat this again and again, and eventually we see how robust the original result is compared to the random ones, which gives us a p-value. The original mapping gives the answer — with high-resolution peaks you will be able to identify the enriched pathways — and the permutations give you the confidence, that is, how easily the same result could arise by chance.

Overall this requires high-resolution mass spectrometry: we are talking about a complete peak list of roughly 2,500 to 3,000 peaks or more — if you have very few peaks, the confidence will not be good. And if we talk about 3,000 peaks, how many are used to identify the pathways? Roughly the top 10% — about 300 peaks. You do not need to obsess over exact p-value cutoffs; you can simply rank the peaks and take about the top 10% to identify the most likely pathways involved. You cannot select just a handful — the eight peaks in the illustration are only an example; it has to be a meaningful number out of the whole high-resolution peak list, say several hundred significant peaks out of 3,000, to identify the pathways. So to use untargeted metabolomics this way, all we need is multiple groups, the complete information, and high-resolution data.
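A rough sketch of that permutation scheme; count_pathway_hits() is a hypothetical helper that maps m/z values to putative compounds within a ppm tolerance and counts hits per pathway, and the Gamma smoothing shown differs in detail from the actual mummichog implementation:

```r
obs <- count_pathway_hits(sig_peaks, pathways, ppm = 5)      # significant peaks only

null <- replicate(100, {
  rand <- sample(all_peaks, length(sig_peaks))               # same-size draw from ALL peaks
  count_pathway_hits(rand, pathways, ppm = 5)
})

# empirical p-value per pathway: how often does a random peak list do as well?
p_emp <- (rowSums(null >= obs) + 1) / (ncol(null) + 1)

# one way to smooth the null: fit a Gamma distribution to the permutation scores
# for a pathway (pathway name here is hypothetical)
library(MASS)
fit <- fitdistr(null["Glycolysis", ] + 0.5, densfun = "gamma")  # offset avoids zeros
p_gamma <- pgamma(obs["Glycolysis"], shape = fit$estimate["shape"],
                  rate = fit$estimate["rate"], lower.tail = FALSE)
```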
The last part shows the validation: we benchmarked mummichog against several other tools, including a straightforward adaptation of the GSEA approach, to see how they compare — and mummichog is considerably more sensitive, because many design choices are specific to metabolomics, which makes it powerful and sensitive. In the results, with only about 30% of the peaks correctly annotated we recovered roughly 95 to 98% of the pathways we had perturbed — we knew the ground truth because it was a simulation — so this is quite satisfying. We know that with high-resolution untargeted metabolomics we can recover the functions very accurately.

This is the functional analysis page and module you are going to use: you click here and upload your peak intensity table or ranked peak list — the complete peak list, not just the significant ones. One recurring issue is that people try it without reading the tutorials or attending a workshop, do not get a good result, and then ask us for help — so I will just repeat it again: high-resolution MS is really critical for an accurate result, and you need the complete peak list. Some people upload only a few hundred peaks in total, which we then have to assume are only the significant ones; but function is a group behaviour, and we need multiple peaks pointing at the same pathways to gain that confidence. This is not a problem as long as you really upload the whole table. When you get a result you can see the potential hits, and the output looks quite similar to the targeted analysis — we deliberately keep it consistent. In between there is some fuzzy mapping going on, but it is hidden under the interface; you do not need to deal with it. What we would like you to know is that the result is robust and accurate — you can always drill down to see where things were mapped, and we have done a lot of validation of the mapping.

Some notes on interpreting the result. Two p-values are reported, Fisher and Gamma; the Gamma one comes from the permutation-based null distribution. The ranking of the functions is quite accurate; as for which p-value is more accurate, the Gamma distribution is in principle more robust, but it depends on the number of permutations, and on the public server we only run about 100 or 200, so treat the two together rather than relying on one alone. There is also the GSEA mode, which can consider both up and down; you get results with more hits, and from there you can see the underlying compounds for each match and how many annotations — isotopes, adducts and so on — support it. If you really understand LC-MS, this view will make you happy, because you can see why a hit is likely to be real. Another view shows your peaks as a heatmap: the viewer does hierarchical clustering, and the peaks you select show up in the center — there is a focus view and an overall view; the real, complete peak list would be much longer, but this shows how you can pick out patterns. You can then test a selection with mummichog — not significant peaks this time, just peaks with an interesting pattern you want to understand — and see which pathways are enriched. So you are free to take any pattern in the peaks and ask about its functions. As for annotation, if you click on individual peaks you will see annotations right beside them — adducts, isotopes and so on — telling you what each peak most likely is. It is not 100% certain, but it does its best.
Once you see one pathway hit by so many different peaks — yes, individual annotations could be wrong, but collectively they are very likely correct. That is the idea. Next we are almost ready for the lab; the lab session is going to be at 2, for two hours. Right — we are on a break now.