So, hello everyone and welcome back. The next topic is MetaboAnalyst, and this is directly related to the lab this afternoon. In the last module we discussed statistics and general strategies and considerations in data analysis, and now we are going to introduce MetaboAnalyst, a tool developed over the last 10 years or so; I started it about 10 years ago. The objectives are to become familiar with the standard metabolomics data analysis workflow, and to be aware of the key elements: data integrity checks, outlier detection, quality control, normalization, and scaling. These come up in a lot of questions, and here we will show how to handle them with MetaboAnalyst, and how to use MetaboAnalyst 4.0 to facilitate data analysis. The current version is 4.0; 5.0 is on the horizon, and hopefully a lot of cool features will be introduced there. Before we do any data analysis, we need to understand the goal and how the experiment was conducted. At a minimum we need to know the biological replicates, and sometimes, especially during the initial step of developing the technical platform, we also need technical replicates. Technical replicates help you decide which metabolites you can measure reliably. If you do technical replicates and find that the values change all the time, then clearly something is wrong: either the platform is not good, or certain compounds you are looking at are not suitable for the platform. Metabolomics data analysis follows two routes. One is targeted metabolomics, also called the quantitative approach: you need to identify and quantify the compounds before you do the downstream statistical and functional analysis. This part used to be quite time consuming, and now it has become more automated. With the quantitative approach you basically get a metabolite concentration table, and you can do a lot of functional analysis, because once you have identified metabolites you can assign them to pathways, and you already have some intuition about what certain compounds do. That is an advantage. For untargeted metabolomics, also called chemometrics or global metabolomics, the idea is to use all the features without identifying them: do the statistical analysis first, and only try to identify the features that turn out to be significant. That way you save some time. Both approaches are popular. The quantitative approach is probably most common for NMR, while for high-resolution LC-MS a lot of the work is untargeted: you use the peaks first, before quantification, because a lot of things are unknown, and if you insist that everything be identified and quantified up front, it will be very slow. Comparing the advantages and disadvantages of the two: the similarity is that you always need to do a data quality check, because garbage in, garbage out always applies. For chemometrics or untargeted metabolomics you need to do spectral alignment or binning. Yesterday we worked with LC-MS spectra: we need to do peak picking and alignment, and it is time consuming. This is definitely not done manually; you need a good computer and good software tools. If you use the targeted method, you need to do compound identification and quantification, and if you do it manually it takes a while. I did it in my first year.
It took about one month to get about 100 samples done properly using Chenomx; that is the time it takes. After that, the statistical analysis is similar for both routes: normalization, QC, outlier removal if you detect any, and then PCA or PLS-DA. With the targeted method you can basically go straight to pathway analysis and write your paper, because you already know the IDs at the beginning, so data analysis and functional analysis are easy. But for chemometrics or untargeted work, you still need to identify the compounds; you cannot report peaks as biomarkers and write up a story. If you want to publish in journals about the biology, you have to identify the compounds. The first step is data integrity and data quality, and this is closely related to your machine and platform. You need to calibrate your machine and make sure you have blanks, you have QC samples, and everything is set up properly. This is your initial step in building up a platform. NMR is robust, but LC and GC need better sample prep and platform control. Well-established protocols need to be followed and technicians need to be trained, to make sure the reproducibility of the technical replicates is good and that the compounds or regions of interest are being measured reliably. If you are doing untargeted metabolomics, you can do spectral alignment: you collect one spectrum after another, and once you have 100 or so, you align them all together and then do the processing. If you are doing thousands of spectra, the problem we are running into is that you may use XCMS or MZmine or something similar, but with a large-scale run of thousands of spectra you just cannot complete the job because of memory. Spectral processing for untargeted metabolomics is fine if you are doing around 100 samples, but when you go to the thousands it is in theory doable, yet the computational cost becomes quite significant, and the tools, even including XCMS, cannot deal with that many samples. That is a current issue, and I think a lot of researchers in this area realize it. But most of us will not feel it, because hundreds of samples work pretty well and studies with thousands of samples are quite rare. For NMR, one common approach is binning: you basically chop the NMR spectrum into small bins. Here it shows about 10 bins, but in practice you have to pre-define the width of the bin; for example, 0.03 or 0.04 ppm is commonly used for NMR. You basically divide the whole roughly 12 ppm range into these bins, giving you hundreds to thousands of small bins. Sometimes a bin will contain multiple features, sometimes it contains no features, as you can see from the example here. So you do spectral alignment or spectral binning, and the whole purpose, because you are not doing targeted analysis and not identifying compounds, only using features, is to make sure you are comparing apples to apples and oranges to oranges: you want the spectra aligned or binned properly. After that, the next step is normalization. For example here: are these two samples different or the same? I believe this is a urine sample, and if you do normalization they become almost identical. Normalization matters a lot for urine; it is not like blood, which is more tightly controlled, so the dilution effect or other factors need to be accounted for. This effect is well known.
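Going back to the binning step for a moment, here is a minimal base-R sketch of equal-width binning, purely illustrative: the function name, the 0.04 ppm bin width, and the simulated spectrum are assumptions, and real tools also handle solvent regions and peak alignment.

```r
# Toy equal-width binning of a 1D NMR spectrum.
# 'ppm' is the chemical shift axis, 'intensity' the spectrum, both numeric vectors.
bin_spectrum <- function(ppm, intensity, bin_width = 0.04, range = c(0, 12)) {
  breaks <- seq(range[1], range[2], by = bin_width)          # bin edges across ~12 ppm
  bins   <- cut(ppm, breaks = breaks, include.lowest = TRUE) # assign each point to a bin
  tapply(intensity, bins, sum)                               # summed intensity per bin (NA if empty)
}

# Example with a simulated spectrum of 5000 points and one fake peak at 3.9 ppm:
ppm       <- seq(0, 12, length.out = 5000)
intensity <- abs(rnorm(5000)) + dnorm(ppm, mean = 3.9, sd = 0.02) * 50
binned    <- bin_spectrum(ppm, intensity)
length(binned)   # about 300 bins at 0.04 ppm width
```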
For the normalization itself, methods such as probabilistic quotient normalization, internal standards, and sample-specific factors like sample weight or volume are all well supported in MetaboAnalyst. All these methods were added because the community keeps responding and developing new approaches; when something works well, it gets incorporated. So again, depending on your particular samples and protocols, you can choose the appropriate method. Data transformation and scaling can be applied to samples or features, and scaling the features helps manage outliers. For example, you can choose log transformation, auto scaling, Pareto scaling, probabilistic quotient scaling, or range scaling. These are all commonly used and have been discussed thoroughly; for example, the 2006 paper I mentioned earlier has one of the most comprehensive analyses of these methods. I don't recall any subsequent study on that topic being as thorough, so it pretty much sealed the deal. It discusses the advantages and disadvantages very well, and I encourage you to read it if you want to spend more time on this. The summary is that all of the methods have advantages and disadvantages, and it is very hard to come up with a single solution that always works well. That is why MetaboAnalyst contains a lot of options for you to choose from. Now let's talk about QC, outliers, and data reduction. One thing is that people are always eager to remove whatever does not follow their expectations. That is outlier removal, and it needs justification. First, sample collection and measurement take time and cost a lot, so you don't want to remove samples lightly. Sometimes an outlier actually indicates something very novel, so don't do it too early or too eagerly. The other topic is data filtering, removing features; we will discuss it later. Some features look like nothing more than background noise, while other features have a clear, beautiful, well-measured signal but simply do not change: they do not constitute biomarkers related to your biological process, so they contribute little information and you can think about removing them. In transcriptomics (RNA-seq or microarray) there is a tradition that you can remove up to around 25% of all genes based on either abundance or variation, and it almost always improves the downstream analysis. This applies to comprehensive transcriptomics or global metabolomics, which contain a lot of noise. For targeted metabolomics you can usually skip this, because you manually went through the data and only have on the order of a hundred metabolites, so you should keep as many as possible. Dimension reduction and feature selection can also help deal with these issues, and we will discuss them later. So let's go directly to MetaboAnalyst. It is a comprehensive web-based tool for statistical and functional analysis of metabolomics data. We now have spectral processing, added just before the workshop, and we also have multi-omics analysis with networks, so it has grown a lot since it was first published in 2009; development started in 2008. At the beginning, the first version did just multivariate statistical analysis; it was focused on stats.
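Before moving on, here is a small base-R sketch contrasting the feature-wise scaling options listed above, assuming a samples-by-features matrix called `X` (the matrix, its size, and the toy data are assumptions; MetaboAnalyst's own implementation may differ in details):

```r
# X: numeric matrix, rows = samples, columns = features (metabolites)
auto_scale   <- function(x) (x - mean(x)) / sd(x)               # mean-center, divide by SD
pareto_scale <- function(x) (x - mean(x)) / sqrt(sd(x))         # mean-center, divide by sqrt(SD)
range_scale  <- function(x) (x - mean(x)) / (max(x) - min(x))   # mean-center, divide by range

X <- matrix(rlnorm(20 * 5, meanlog = 2), nrow = 20,
            dimnames = list(NULL, paste0("met", 1:5)))          # toy log-normal "concentrations"

X_auto   <- apply(X, 2, auto_scale)    # each column now has mean 0, SD 1
X_pareto <- apply(X, 2, pareto_scale)  # large fold changes are shrunk less than with auto scaling
X_log    <- log2(X)                    # log transformation is element-wise and done before scaling
```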
Why focus on statistics? Because at the time, the only things you could do were either use SIMCA, a commercial tool that costs probably thousands of dollars for a license, or use Excel for the data analysis, or, if you knew R, write your own R-based scripts; but at the time R was not that popular. I found it very time-consuming to do it manually every time and write my own R code, even though I am comfortable with R. So I thought: let's put it online so people can do their own data analysis there, and I can have more time for my own work. That was my motivation. I put all the popular, common statistics online, and it turned out to be quite useful, and people asked to add more things. In 2012 we published version 2. One addition was a stronger focus on function: metabolite set enrichment analysis and pathway analysis. So you can see versions 1 and 2 were mainly for targeted metabolomics, with univariate statistics plus a focus on functional interpretation. After that, version 3 was published in 2015. The main thing was that we realized a lot of people were using it, and we also learned that some people were using it for microarray gene expression data. I was surprised: why don't they use their own tools? There should be plenty developed for that. But it turned out they said MetaboAnalyst was much more intuitive and everything looked better. That also put a big burden on the server, so everything needed to be better. Version 3 basically redesigned the whole interface, made it faster, and made it use less memory. We also added biomarker analysis, which turned out to be an interesting discovery in itself: everybody had been discussing biomarkers for about 20 years, and you would think there would be tools for that, but there weren't really. Once we added some biomarker analysis, people found it so useful that whenever there is a bug they keep sending me emails until it is fixed. So I know it is actually quite useful, and I feel honored to keep developing and sustaining it. Throughout the years we kept improving the tools. As more users came on board, a big concern became reproducibility: "because you updated your tools, now I cannot reproduce my result." I would ask what exactly they did, and they didn't remember. So we gradually introduced more support for reproducible research: we exposed the R commands. Why? Because after 10 years R had become popular, and a lot of people now come with some R background, so when they see an R command they don't feel scared. I found it a very good educational tool: they see the command, the graph, the statistics online, and they learn. Eventually I hope they will stop using the web version and use the R package instead, which reduces the server burden. On the other hand, it is also a cloud service: since version 3 we have been running on Google Cloud, which turned out to be very useful and easy to manage. We also added a lot more support. Since version 3 we started moving more towards untargeted metabolomics. Why? Because untargeted metabolomics became more and more LC-MS based and started to play a big role in studies of things like the microbiome and environmental exposures. So we started adding more for that. In version 4 we introduced something called MS peaks to pathways, so you don't have to identify the peaks first.
You just use the peaks and you can get pathways. We also added meta-analysis and network-based data integration. So there is a lot of evolution behind the scenes, and it is basically driven by the whole community: what users need, MetaboAnalyst responds to, sometimes leading, sometimes just following. It is very responsive; we usually have at least one or two updates per month. That is a lot of work, and because the tool is used so much, we want to make sure updates don't introduce new bugs and that performance stays as good as before. This is the overall workflow. You have a lot of different inputs: tables, peak lists, and now raw spectra, because we just added that support a day or two ago. After you upload raw spectra you need to do the data processing. The whole idea is to go from your raw data to a normalized table. That normalized table can then be used for statistical analysis, pathway analysis, power analysis, and biomarker analysis. This concept overlaps with "omics analysis in a nutshell": you need to know which step you are in. Overall, regardless of how fancy the interface is, the underlying workflow remains the same: data pre-processing, which is especially important for raw spectra; then data normalization; and, if you are doing a really large-scale study, batch effect correction, which lives under Other Utilities. Once everything is normalized and free of batch effects, you can do the data analysis and interpretation. This is the common approach. We provide a lot of example datasets; if you click Data Formats you will see them and you can download and re-upload them, but you can also go directly to Start Analysis, because the examples are available right below the upload page, so you don't have to upload anything. That is what is used for the screenshots in the next few slides. We will use this example: compound concentration data collected from cow rumen, from four groups. The cows were fed diets with different proportions of grain: 0, 15, 30, or 45%. The hypothesis is that a very high proportion of grain in the diet actually stresses the rumen, so you are going to see some changes in the metabolome. This was a collaboration between David and a professor in the agriculture department. We use this data to demonstrate data processing. The common task is to convert the raw data into a data matrix suitable for statistical analysis, and here we have several common formats. The one we will demonstrate is the concentration table, which comes from targeted metabolomics. I would like to draw your attention to the last three formats, which are also becoming common. Everybody knows the first one, but a peak list is also common: if you collect the peaks from your spectra, each sample is one file, so 100 samples means 100 peak list files with m/z and retention time; you upload them as one zip file, and MetaboAnalyst will align the peaks and do the analysis. And of course you can upload raw spectra, and we will do the peak picking and peak alignment. If you pick the peaks yourself, the peak lists will be much smaller.
So what I am saying is: if you have more than 100 spectra, you should pick your own peaks; if you have fewer than 100 spectra, you can upload the raw spectra. Spectral bins can also be uploaded, for NMR spectra. Raw spectra processing is temporarily disabled on the public server; this one is public, but we are using the dev server, which is more powerful. We will do more testing and by the end of the week we will open it to the public, so spectral processing will be publicly available soon. With this example data we downloaded, four groups, cow rumen, targeted metabolomics, we use the Statistical Analysis module. This module is the most popular one. The others are also popular, but even if you are doing biomarker analysis or something else, you should start from statistical analysis, because it gives you an overview of your data, and a lot of it is quite useful. For the data format, you tell it the data are concentrations with samples in rows; if your data are saved in the transposed format, you choose samples in columns. You can see there are other options: spectral bins and peak intensity tables, which come from the untargeted approach. Below that, we also recently added mzTab-M 2.0, which is a standard designed for metabolomics. I am not sure how many of you actually use it; there is a part of the metabolomics community that wants to promote this format, so we added it, though we have doubts about whether people are actually using it. We put it there recently just to be a good citizen and follow the community recommendation, and we will see whether it is useful. The other option is a compressed file: you collect your peak lists, one peak list file per sample, and then upload the zip. As I mentioned, you don't have to download and re-upload the example; you can select it directly here, save some bandwidth, and go straight to the submit page. This dataset contains samples from cows fed different grain concentrations, including a zero-grain group. It was analyzed by NMR spectroscopy using quantitative metabolomics techniques, so the concentrations should come from a tool like Chenomx. The hypothesis is that a high-grain diet puts stress on the cows. If we select the dataset and click Submit, we get the next page, called the data integrity check. What this page does is make sure your data are suitable for the next few steps. This step is actually very frustrating for a lot of users, and also for us: we have to make sure the exceptions are handled and that errors are communicated back to the users. When people use MetaboAnalyst, we have no idea who they are or what their data are, because every day there are about 3,000 users; they come, upload data, download results, and the data are removed. So the best approach is for them to take care of it themselves, and we develop the tools to more or less help them help themselves. That is actually very hard to do. We try something, it works well, people complain, we try to understand why, and we improve; back and forth, things get more stable. I also think users gradually learn the tools and understand their data better, so they start adopting good practices. Either way, if you click the example data and go to this page, the data checks basically pass. If they fail, you are going to see some red lines.
Some of the messages basically highlight what the check thinks could be the cause of the errors. Most of the time this is quite meaningful; sometimes it is not, and that is when we receive emails and need to look at the data and tell people what could be wrong. Sometimes we find it is a common issue and we improve the code to make it less dependent on user interaction. You can see here the missing value estimation: a total of zero missing values was detected, so it is not applicable. If you really have a lot of missing values and want to handle them properly, you need to click this; but if you click Skip, any missing value will, by default, be replaced by a small value, namely the lowest positive value in the data divided by five, which serves as a baseline value. This page shows the normalization options. You have seen this page before, and you can see there are many options. First there is sample normalization. Sample normalization is based on the biology and the experiment: if you collected samples with different tissue weights or different fluid volumes, you should use a sample-specific normalization such as normalization by weight or volume, and you would select that here. For this cow rumen data, what we want to use is a reference group. The option is called normalization by a reference sample, which is probabilistic quotient normalization (PQN), widely used for NMR metabolomics. What it does is take one sample as a reference and normalize everything against that sample. What we did here is create a reference group instead, because selecting a single sample can be subjective and hard to reproduce; we would rather use an average, so we create an average sample from a reference group. For example, the 0% group is the control, zero grain, so we create an average sample from that group and then apply probabilistic quotient normalization against this group average. This is more stable. Then you choose auto scaling. Now, going back: I know some of you will ask why we chose these particular options, and this is just for demonstration purposes; you can certainly choose others. On the other hand, below here there are visualizations that let you see whether the data look normalized or not. A lot of the time the choice is based on your experimental design and on the practices people in your particular field usually apply; follow the standard before you try something unusual. Auto scaling, which is mean-centering followed by dividing by the standard deviation, is very common; log transformation is very common; and this one is also common. Beyond that, we don't want to apply too much; we want to do the minimum. If you apply too many normalization steps, the scaling and even the results become hard to interpret. What I am saying is: if your samples have some biologically relevant difference, you should adjust for it here, in sample normalization; that biological part is very important. The statistical adjustment happens more or less here: log transformation is common and auto scaling is common, and we try to use the common ones first, because everybody understands them. We already mentioned this, and we talked about row-wise versus column-wise, which here becomes sample-wise versus feature-wise. This part is based on that scaling paper, and we also added several other procedures. And here is the visualization of the result.
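As an aside, here is roughly what the steps just described amount to in code: replacing missing values with one fifth of the minimum positive value, PQN against an average reference sample built from the control group, then log transform and auto scaling. This is a minimal base-R sketch, not MetaboAnalyst's actual code; the objects `X` (samples-by-features matrix) and `group` (factor of diet groups) and the label "0" for the control group are assumptions.

```r
# X: samples x features concentration matrix; group: factor with levels "0","15","30","45"
replace_missing <- function(X) {
  min_pos <- min(X[X > 0], na.rm = TRUE)
  X[is.na(X) | X == 0] <- min_pos / 5     # baseline value = smallest positive value / 5
  X
}

pqn_by_reference_group <- function(X, group, ref_level = "0") {
  ref <- colMeans(X[group == ref_level, , drop = FALSE])   # average "virtual" reference sample
  quotients <- sweep(X, 2, ref, "/")                        # per-feature ratios to the reference
  dilution  <- apply(quotients, 1, median, na.rm = TRUE)    # one dilution factor per sample
  sweep(X, 1, dilution, "/")                                # divide each sample by its factor
}

X_norm   <- pqn_by_reference_group(replace_missing(X), group)
X_scaled <- scale(log2(X_norm))   # log transform, then auto scaling (mean 0, SD 1 per feature)
```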
In the before-and-after plots, you can see that before normalization the distribution is clearly skewed: a lot of the concentrations are very low, which is very typical for metabolomics, and some are very high. I am pretty sure this very high one is probably acetate. If you do the normalization as we selected, the distribution becomes much more normal looking. You cannot get a perfectly normal distribution, but it is clearly already quite symmetrical. The thing is, you don't know in advance what the best normalization will be, so you can use trial and error, and also stick to procedures you can understand. After normalization, it is good to check whether there are outliers or anything suspicious. How do you spot them? Partly experience, and partly visualization: something stands out completely and you want to see whether it is normal or not. How to deal with outliers: visual inspection, normalization, and exclusion, and you need to do exclusion with caution. If you suspect a sample, start by thinking about whether you can address it by normalization, because every sample is actually precious; try to see if you can correct it first. That was about samples. If we talk about noise reduction instead, we are talking about features. Feature filtering is mainly for untargeted data, because the noise is very large for untargeted metabolomics, be it NMR spectral bins or LC-MS peaks. People want to include everything in the analysis; they don't want to filter anything, which I find somewhat understandable, but a lot of the time, if you don't filter, you pay a big penalty later. Why? If you include all this noise, you actually reduce your power, because of multiple testing adjustment. If you have, say, 1,000 features, you are going to adjust for 1,000 tests; with 500 features you adjust for 500, and that is a huge difference in the penalty. Reducing the noise helps not only the p-value calculation; a lot of other signals also become clearer. Noise is most likely low signal close to the baseline, which is usually not accurately measured on a lot of platforms. In transcriptomics, earlier studies showed that if you use just the top, say, 15% most abundant signals, the overall pattern barely changes; in other words, you can remove a large fraction of low-abundance features (genes) without damaging the biological story. That is what I am saying: omics analyses have a lot of features but also a lot of noise, so you should abandon the idea that you must include everything in the downstream analysis. With some sensible early data filtering, you will improve a lot of the downstream analysis. This figure shows what an outlier looks like; this one was created artificially, by randomly adding some big numbers to one sample. On the left, in the PCA, sample P080 is standing out completely from the rest, and that is definitely a cause for concern, because the majority of samples cluster together and this one stands apart. Sometimes that means something is wrong. That was at the sample level; if you look at a heatmap, you can see the same sample is also very striking, with very, very high concentrations. It could clearly be a dilution issue, or it could be something else; you need to spend time understanding it.
If you can normalize to adjust it, that's good. If you cannot, you can exclude it. How do you exclude it? You can re-upload the table with the sample removed, but you can also do it within MetaboAnalyst using a page called the data editor: under Edit Samples, just scroll down, find the samples, and remove them. So let's remove the outlier sample. Feature removal is feature filtering, also called data filtering. There are studies, published in PNAS and even Nature journals in the early days of omics data analysis, showing that if you apply feature filtering you get much better results, and they strongly urge people to consider it. So what are noise features, or uninformative features? There are several characteristics. One: features with very, very small values, close to the baseline or the detection limit; that is understandable. Another: features that are nearly constant. You can measure them very well and they are abundant, but they don't change; like housekeeping genes, they are not informative for biomarker analysis and they will not come out as significant in pathway analysis. The last one is low repeatability. This is only applicable if you have QC samples, because QC samples are essentially technical replicates; they should be within about 20% RSD for LC-MS or 30% for GC-MS. If a feature has a large RSD across your QC samples, it has low repeatability and should be excluded. These are some of the basic rules for what we want to exclude, and we usually find they work very well. For untargeted metabolomics, you really don't want to include everything. So here is the noise reduction, or data filtering, page: you can filter on different criteria, such as low variance, low intensity, or low repeatability. If you choose none, there is still a limit on how many features the server will carry forward, which is usually sufficient; and if you have more than about 5,000 features, filtering will be applied whether you choose it or not, based on low variance or low intensity. We found this is healthy for the server and also healthy for the data analysis. If you don't like it, you can use MetaboAnalystR and keep everything, but the web server does have some constraints. I don't think these constraints are unreasonable: some people want to include everything, which causes server stress and is not necessarily good for their own data, so this is what we adopted. Next step: statistical analysis. What are the common tasks? We identify important features, just like differential gene expression analysis; then we try to identify interesting patterns; and we want to test whether there are differences between phenotypes, do classification, and so on. How do we do that? We start with simple univariate approaches like ANOVA, then gradually move to PCA, PLS-DA, and clustering. We already covered the statistical details behind these. If we go to the Statistical Analysis module, we see several available links. The arrows show the ones we are going to explore; there are several others we will not cover, so it is up to you to explore them. Some options are grayed out, so you cannot use them. Why? Because they are not applicable to this data.
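Going back to the filtering criteria for a moment, here is a rough base-R sketch that flags features by low intensity, near-constant behaviour, and high RSD in QC samples. The matrix `X`, the logical vector `is_qc`, and all thresholds are assumptions for illustration; real cutoffs depend on the platform.

```r
# X: samples x features matrix (raw intensities); is_qc: logical vector marking QC injections
rsd <- function(x) sd(x) / mean(x) * 100

low_intensity <- apply(X, 2, median) < quantile(apply(X, 2, median), 0.10)       # bottom 10% abundance
near_constant <- apply(X, 2, function(x) IQR(x) / (abs(median(x)) + 1e-6)) < 0.05 # hardly changes at all
bad_qc_rsd    <- apply(X[is_qc, , drop = FALSE], 2, rsd) > 20                    # >20% RSD in QC (LC-MS)

keep <- !(low_intensity | near_constant | bad_qc_rsd)
X_filtered <- X[, keep]
cat("Kept", sum(keep), "of", ncol(X), "features\n")
```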
One example of a grayed-out option: orthogonal PLS-DA is only suitable for two groups, but this dataset has four groups, so it is not directly usable. You could exclude groups using the data editor; once you have only two groups and come back, it will be enabled. So let's do ANOVA. ANOVA tries to identify the metabolites that differ between the groups. ANOVA is what people commonly use, so we just run it. What is the cutoff? The cutoff is 0.05 based on the FDR-adjusted p-value. Then there is post hoc analysis. What does that mean? With ANOVA you see significance, but you don't know which two groups are significantly different; you know something differs, but not which pair. So the post hoc analysis takes only the significant features and does pairwise comparisons, I believe using Fisher's LSD, and then tells you which pairs are significantly different. The result here is an interactive graphic: significant features are red, non-significant ones blue, and you can click a point and see the figures. We have since enhanced these figures, so they look slightly different now, but better. To get the details, there is a table icon; click it and you get the details table. So far this is ANOVA: on the left you see the results, and if you click one of the compounds listed here, you can see both the original concentration and the normalized concentration. I receive a lot of questions asking why the normalized values look like that. A lot of clinicians are only accustomed to the left side, and once they see normalized values they are lost; they don't understand how a concentration can be below zero. Sometimes I try to help, sometimes I just give them a link about normalization, because at this point we assume people are comfortable with normalization and with box plots; that is good to learn. On the other hand, the rightmost column is the result of the post hoc test: it shows which two groups are significantly different. For this compound, for example, you can see 15 vs 0, 30 vs 0, and 45 vs 0, and remember that all the statistical analysis is based on the normalized data. So let's look at the right-hand side of the figure: the control (in red) is low, and the 15, 30, and 45 groups are high. Everything changes relative to the control, but among the treatment groups themselves there is not much difference, not between 15 and 30, nor between 30 and 45. This is what the post hoc test tells you. It is simple, but very useful, and a lot of people really like it: otherwise you know there is an overall difference, but you still haven't answered the question of who differs from whom. These features help you get there. We are not saying it is novel, but it is really useful for people who just want to understand their data. What's next? We can compare different compounds to see which ones behave most differently, or most similarly, across the four groups. What people usually do is a feature correlation analysis.
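Before the correlations, here is a minimal base-R sketch of the ANOVA-plus-post-hoc logic just described (illustrative; `X_scaled` and the factor `group` are assumed from earlier, and MetaboAnalyst's exact post hoc procedure may differ):

```r
# One-way ANOVA per feature, FDR adjustment, then pairwise comparisons for significant features
anova_p <- apply(X_scaled, 2, function(x) summary(aov(x ~ group))[[1]][["Pr(>F)"]][1])
fdr     <- p.adjust(anova_p, method = "fdr")
sig     <- names(which(fdr < 0.05))

# Post hoc: unadjusted pairwise t-tests with pooled SD (Fisher's LSD style), significant features only
posthoc <- lapply(sig, function(f)
  pairwise.t.test(X_scaled[, f], group, p.adjust.method = "none"))
names(posthoc) <- sig
posthoc[[1]]$p.value   # which pairs of diet groups differ for the first significant metabolite
```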
For the correlation analysis, a common choice is the Pearson correlation coefficient computed across all the compounds; compounds that are highly correlated with each other end up grouped together. This is a heatmap of the correlation table. For example, at the bottom right you can see several red squares, which means those compounds are strongly positively correlated, things like isobutyrate and butyrate and a few related compounds. This group of compounds probably has some functional relationship. That is positive correlation, with values close to one, and there are also negative correlations. This figure is not interactive, unfortunately; we may make it interactive later. But you can download the correlation matrix and the p-value matrix, and people obviously find that useful because they can regenerate the figure with other tools. Some people want publication-quality figures, so you need higher resolution: PNG, PDF, or SVG, and you can select the DPI (dots per inch). With this icon you generate the figure: select the format, select the resolution, click Submit, and you get a 300 DPI version. So we have the figures; now we go to the next feature, an approach called Pattern Hunter. Because this design is almost like a series, 0, 15, 30, 45% grain, we want to see whether certain compounds follow particular patterns: for example, a linear increase as the dietary grain concentration increases, or an increase at the beginning followed by a decrease, or a decrease followed by an increase at the end. Some people do have a story behind such shapes, so they look for particular patterns. How to do that: you can go back to the statistics overview and select Pattern Hunter, but since you are already on the correlation page, you can just click Pattern Hunter on the left, go to that page, and define the pattern you want to find. The pattern is defined as a series of numbers: if you define the pattern as 1-2-3-4, you are looking for a linear increase; 4-3-2-1 is a linear decrease; and something like 1-3-3-1 means higher in the middle and lower at the beginning and the end. If you specify a pattern like that, the result looks like this: correlation bars, with positive correlations in orange and negative correlations in blue, so you can see the strongest positive and the strongest negative correlations at a glance, which is basically material for developing your story. Now let's move from that pattern analysis to multivariate methods like PCA and PLS-DA. For PCA we discussed scores and loadings, usually shown in 2D, but with web-based tools we can do it in 3D, and 3D is more interactive here. This is the PCA scores plot; by default the different groups show up with group colors and sample labels. Some people don't like sample labels, so you can deselect them, and you can use grayscale colors; the PCA figure has several options to let people use their preferred coloring. Actually, if you click at the top left, there is a processing panel, and I think there is also an image editor.
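Stepping back to the correlation heatmap and Pattern Hunter for a moment, both boil down to correlation calculations. A base-R sketch, illustrative only: `X_scaled` and `group` are assumed from earlier, and the numeric coding of the four diet levels as a template is my own choice for the example.

```r
# Feature-feature correlation matrix (what the correlation heatmap displays)
cor_mat <- cor(X_scaled, method = "pearson")
heatmap(cor_mat, symm = TRUE)            # quick look; the web tool draws a nicer version

# "Pattern hunter": correlate every feature with a template pattern across the groups.
# 1-2-3-4 over the diet levels 0/15/30/45% encodes a linear increase.
template <- c("0" = 1, "15" = 2, "30" = 3, "45" = 4)[as.character(group)]
pattern_cor <- apply(X_scaled, 2, function(x) cor(x, template, method = "spearman"))
head(sort(pattern_cor, decreasing = TRUE))   # features that increase most consistently with grain
```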
Back to that panel: you can actually specify the colors, the green, red, or blue, and if you don't like them you can change them, and even the symbols. MetaboAnalyst lets you do it right there: there is a panel under Processing that allows you to set the colors and symbols for everything. The loadings plot is more interactive, meaning you can click on the dots. What we see is that the groups separate along a diagonal from top right to bottom left in the scores plot, and you see the same diagonal, top left to bottom right, in the loadings plot. To find the most important features driving the separation, you click the dots at the ends of this diagonal. You will see something like 3-PP at the top, and probably endotoxin-related features. You will find that the features you identify from the multivariate PCA are similar to those you identified with ANOVA, which is comforting. That is what I keep suggesting: multiple lines of evidence pointing to a similar set of features is much more convincing. A lot of the time PCA tells you the same story as ANOVA; sometimes it is different, and then you need to think about whether it is real or something else. Here is the synchronized 3D plot. Why synchronized? We mentioned that you want to see the separation and also the features that contribute to the separation, so the two views must share the same perspective: if you rotate one, the other has to rotate simultaneously. We added that function, so when you rotate one, the other rotates too. At the moment you can see the separation of the groups along the same angle, from bottom left to top right, on both sides, which helps you discover the important features. At the top there is a button that says "save for report". This is important. Some of the static figures, like the ones on the previous pages, are generated on the server and sent to your browser; if you generate a report (and we have report generation) those figures will be the same, because they were generated on the server, so we know them. But this 3D view is generated in your browser: the server has no idea what you are looking at, how you are interacting, which angle, how you rotated it. So if you generate a report, it will not match what you see here. If you find the best angle and want it in the report, you need to click "save for report"; the current view is captured like a snapshot, like taking a photo, and that image is sent back, so the report gets the same image. Since we don't know how you interacted with the figure, you have to explicitly tell us to use that view. PCA is unsupervised and PLS-DA is supervised. We already discussed this: PCA is safer, so if you see a pattern it is real; PLS-DA is usually more powerful, but you need to pay attention to overfitting, using things like Q2 and R2, cross-validation, and VIP scores. The PLS-DA scores plot shown here is clearly better than the PCA one; the separation is better. What I would like you to pay attention to is the option at the top that says class order matters, implying time points or disease severity.
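Before getting into the class-order option, a quick aside on what the scores and loadings plots are computationally; a base-R sketch with `prcomp`, using the `X_scaled` matrix assumed earlier (illustrative only):

```r
pca <- prcomp(X_scaled, center = FALSE, scale. = FALSE)  # data already centered/scaled above

scores   <- pca$x[, 1:3]         # sample coordinates: what the 2D/3D scores plot shows
loadings <- pca$rotation[, 1:3]  # feature weights: what the loadings plot shows
summary(pca)$importance[2, 1:3]  # proportion of variance explained by PC1-PC3

# Features at the extremes of the PC1 loadings drive the group separation along PC1
head(loadings[order(abs(loadings[, 1]), decreasing = TRUE), 1])
```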
So, back to that option. If you only have two groups it doesn't matter, but when you have more than two groups this option shows up. If you think the class order matters, PLS-DA will take the order into account, which is appropriate when there is a time series or a concentration gradient, so the order carries meaning; it will try to separate the groups according to that expected trend, basically behaving more like a regression. But if you have multiple groups and you don't think the order matters, you need to uncheck this. Why do I want to emphasize this? Some people send me emails describing what they did: basically groups A, B, and C, labeled for example 1, 5, 10, or 1, 2, 3. What they report is that with different labels they sometimes get different separation patterns. That is exactly because of this class order setting. So if you have multiple groups and the class order does not matter, uncheck it; then even if you change the class labels, you always get the same result. But if your class order does matter, keep it checked; that is what I would like to say. If you have questions, you can ask me to clarify. The thing is, if this is checked, the groups are ordered alphabetically by name, so if you change the names, the alphabetical order changes and the pattern can change. That is expected behaviour when it is checked; so if your sample classes have no meaningful order, make sure it is unchecked. That is one common issue. Sorry, I have a quick question about that topic. Yeah. Would that only apply to ordinal values, or does it apply to any experimental groups? Yeah, it is just about whether you think the class order matters. In this case we selected it because 0, 15, 30, and 45 are concentration changes, and we do expect some inherent meaning in that ordering, so we keep it checked. If your design has no such ordering, you probably want to skip it; if you think the order doesn't give you any extra information, just uncheck it. Jeff, for time series, shouldn't we use a paired type of analysis? Paired analysis is only for two groups, and I think a paired PLS-DA is not available here; I can tell you we do have paired t-tests. A time series is not necessarily one patient before and after treatment, and in a lot of cases a paired test actually means something different. Okay, thank you. Oh, Jeff, there is a question in the chat. I'm not seeing the chat here. I'll read it then. It's about PCA: regarding the PCA scores plot, I have seen some people apply a t-test or ANOVA to the principal component scores (for example PC1) assigned to each sample from different groups, to test whether a separation visible on the scores plot is statistically significant along a principal component. What are your thoughts? This is from Juan. I think I received that email before, and I'm not sure. PCA is unsupervised; it reflects the inherent patterns within your data, and there is not much of a model assumption. You can frame it in different ways, but the statistics behind that test, for me, are puzzling; I don't get it, I can tell you honestly. Of course you can write a program to test it, but what is that testing for?
And I can tell you the other thing is probably more meaningful. I also received a suggestion, which I haven't implemented yet, so it is not available: assessing the stability of the PCA scores. Sometimes the PCA separation is driven by one or two strongly influential samples. One way to check this is cross-validation: you leave out one sample at a time, recalculate the PCA, and repeat until everything has been cross-validated many times. You can then see how the position of each sample changes: you essentially draw a small trail of possible positions for each sample around its center, showing the variation across the cross-validation. That I find meaningful: you cross-validate and look at whether the positions are stable. If removing one or two samples makes the separation disappear, the pattern is not very robust. But the statistical testing on PC scores, to be honest, I don't get it. I'm not saying it is right or wrong; I just don't understand what it adds, because PCA is not really meant for prediction, it is for looking at the structure of your own data. If whoever asked can point me to a paper in a good journal from a respected group, I would be happy to read it; otherwise I tend to ignore it. I do get suggestions to add it, but I'm not convinced. So, Jeff, this is Dave. Mathematically, it is correct that they can do a t-test on the scores along a principal component and assess the separation; it is mathematically correct. Yeah, that is true, mathematically it is correct. It is just that I don't understand what the statistical test answers beyond saying the separation along that component is significant; I don't relate it back to the overall big picture. At least I can say it is not common practice, so I would wait and see. But mathematically, clearly we can do it, we can test it; that is true. I think part of the problem with PLS-DA and PCA is that we can bias people's perception. What we are showing here are colored ellipsoids, and if we remove the colored ellipsoids, people might not agree that there are four groups; they might not even be able to see clearly that at least three of the groups separate. So I think there is something to be said for trying to provide a more robust assessment of the grouping, because we can color all we want, and by drawing ellipsoids of a certain size we can bias the perception of clustering. In the end, you want some statistical test or some mathematical rationale to say why you see four groups here rather than three or two. Yeah, that's true. One of the things we could do is the cross-validation, even for PCA; I think I got a script for that around 10 years ago, and we could plug it in and let people do that. But again, I understand all the concerns and people's frustration, especially when the signal is not strong: how do we get some confidence from the result? Statistics and p-values do help, but we want to make it more justifiable. So I need to appreciate those feelings, and also understand the assumptions and what a test can really tell us, so we can adjust accordingly.
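For concreteness, here is a tiny base-R sketch of the leave-one-out PCA stability check just described. It is illustrative only: `X_scaled` is the matrix assumed earlier, sign flips of components across refits are handled crudely, and component swapping is ignored.

```r
# Refit PCA leaving out one sample at a time and track how the remaining scores move
n    <- nrow(X_scaled)
full <- prcomp(X_scaled)

loo_scores <- lapply(seq_len(n), function(i) {
  p <- prcomp(X_scaled[-i, ])
  s <- p$x[, 1:2]
  # align signs with the full-data PCs (PCs are only defined up to a sign flip)
  for (k in 1:2) if (cor(s[, k], full$x[-i, k]) < 0) s[, k] <- -s[, k]
  s
})

# Spread of each sample's PC1 position across the refits it appears in:
pc1_spread <- sapply(seq_len(n), function(j) {
  pos <- unlist(lapply(seq_len(n)[-j], function(i) {
    idx <- which(seq_len(n)[-i] == j)    # row of sample j in the fit that left out sample i
    loo_scores[[i]][idx, 1]
  }))
  sd(pos)
})
round(pc1_spread, 2)   # large values flag samples whose position is unstable
```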
So clearly I'm open to a lot of these suggestions; a lot of the time we want to see the larger community's take on it first. For the PLS-DA cross-validation, by default leave-one-out cross-validation is used and the results are plotted. Why do we do this? We want to see how many components give the best performance. Here we test up to five components, using one, two, three, four, or five, and it turns out three components gives the best performance in terms of Q2, so that number of components is used. Why does it matter? Because the more components you use, the more likely you are to overfit. Jeff, what do you mean by component? You know how PCA has principal components? PLS-DA also has components. If you look at the PLS scores plot, you see component 1 and component 2; that is what we are talking about. We use those components to build the classification model, whether three, four, or five; it is all derived from the same underlying matrix. Here we only plot component 1 versus component 2, but in the computation we use all the selected components, and the more you use, the more likely you overfit, so we want to choose carefully. Okay. The other thing is Q2: sometimes you get a negative value. Unfortunately that happens, and it is actually not uncommon. People ask about it all the time, so I put the answer directly below the plot: read the linked paper; it is very well written and interprets this in a statistically sound way. What it says is that when Q2 becomes negative, the model has no predictive power at all, it is overfitted. So a negative Q2 means the model is overfitting, not performing well; that is the bottom line. Then the VIP score. The VIP score was promoted, I think, by the same company behind SIMCA. VIP has a proper formula behind it, essentially a weighted contribution of each feature across the PLS components, and most of the time you get a similar set of top features whichever importance measure you use. If you apply a cutoff, for example a VIP score above 1.5, you get these seven features, or you can just take the top two. And what do the different colored boxes show? They show the abundance. Suppose this feature is significant and you call it a biomarker; the next question is what its abundance or expression level is in each group. Here it shows that 3-PP is most abundant in the 0% group, less abundant at 15%, and least abundant at 45%: with more grain in the diet, 3-PP goes down, while endotoxin actually goes up. So you don't have to open another page to figure out how a significant feature relates to your phenotypes; the abundance per group is plotted right beside it, which helps you interpret it. Thank you. Next is the permutation test. We mentioned PLS-DA overfitting; we have cross-validation, but that is not enough, so we also do permutation. The idea is that you have a separation, and we measure it using an ANOVA-like statistic: between-group versus within-group separation. So it is like ANOVA: we compare the between-group and within-group variation, permute the class labels, and repeat.
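As a rough illustration of that between-versus-within permutation idea, here is a base-R sketch. It is a simplified stand-in: `scores` and the factor `group` are assumed from earlier, and for a proper PLS-DA permutation test the model should really be refit for every permuted labeling rather than reusing fixed scores.

```r
# B/W statistic: between-group sum of squares over within-group sum of squares on the scores
bw_ratio <- function(s, g) {
  grand <- colMeans(s)
  between <- sum(sapply(levels(g), function(l)
    nrow(s[g == l, , drop = FALSE]) * sum((colMeans(s[g == l, , drop = FALSE]) - grand)^2)))
  within <- sum(sapply(levels(g), function(l)
    sum(sweep(s[g == l, , drop = FALSE], 2, colMeans(s[g == l, , drop = FALSE]))^2)))
  between / within
}

observed <- bw_ratio(scores, group)
null     <- replicate(1000, bw_ratio(scores, sample(group)))  # shuffle class labels
mean(null >= observed)   # empirical p-value: how often random labels separate as well
```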
If the observed between-to-within-group ratio is much larger than the permuted values, far to the right of the null distribution, you are safe. And again, we agree we could apply this approach to PCA as well; it just needs some thought, because PCA is inherently not built for prediction, so the cross-validated PCA stability check probably makes more sense at the moment. Now we go back to more visual methods. After PCA and PLS-DA we do cluster analysis, which is more visual, and the main tool is the heatmap. People really like the heatmap. Why? Because it is very comprehensive, you get a lot of customization, the defaults look good, and you can get a publication-ready figure. You could do the same thing locally with your own R, but here you get it almost without writing code. So here is the heatmap visualization. You can see there are a lot of options: distance measure, clustering algorithm, color contrast, data source. You can cluster the raw data or the normalized data; you can choose whether to standardize (auto scale) or not; and you can use the overview or the detail view, where the detail view gives you a huge plot with more detail but is restricted to a couple of hundred features. For distance, Euclidean distance and correlation-based distance are the commonly used ones, and several other distance measures are available. There is also the view mode: whether or not to reorganize the samples, because by default both the samples and the features are clustered. If you don't reorganize the samples, they stay ordered by the class label, which makes it much easier to see how features change across the different conditions. The other options: you can show cell borders, and you can show the group average, which is a more recent addition. For example, here only the features are clustered and the samples are left in their original order, so you can see the variation across the different class labels; if you let the clustering mix everything up, it is probably not as good for visually detecting patterns. And if people want, they can show the group average: instead of plotting every sample, it plots the average per group. Some people want that, especially when the replicates are not very consistent, so it hides some of the variation; but personally I want to see the whole data, every feature across all samples, because once you average you are no longer sure whether the apparent pattern is meaningful. So we have done clustering, PCA, and PLS-DA, and MetaboAnalyst keeps all the plots and graphs you have generated. You can generate a printed report that summarizes what you have done and what you have found. One option is to download the R history, so you can regenerate everything locally if you want; the R history also saves all the parameters. You can also click Generate Report and you get an analysis report with all your figures; everything is reported and saved as a PDF. Why is this feature important? Because so many people are using the tool, we want them to capture the parameters and be able to reproduce the analysis. Also, MetaboAnalyst is evolving gradually, and we cannot afford to keep every old version online, so the best we can do is give them a complete record; and if they want, they can use the free R package locally.
That said, the cutting-edge pieces change, but the core is always stable, and we keep compatibility in mind. Here is the correlation analysis in the report: since we ran the correlation analysis, it summarizes which compounds are correlated with which, with the table; and because the last thing you selected for the heatmap was the group average, that is what gets plotted here; and you have the appendix with the R commands. Everything is there, and it is up to you. If you do a lot of analyses, the report can easily run to tens or even hundreds of pages, so there is a lot for you to read. That is what I mean when I say we make a lot of effort to help you, and also to save ourselves time. We do sometimes get questions about how to interpret p-values or how to interpret a box plot, and we just suggest that you please read our tutorials and documentation, because the common questions are all covered. It is not that we are mean; it is just that we want to focus on the cutting-edge issues, and for things we have already covered we will refer you back to our FAQs. The R command history is on the right-hand side; it is always there, and every time you run a step, the corresponding command is added to it. Then there is MetaboAnalystR, which has been updated quite a lot. Over the last month I'm pretty sure there have been a lot of updates we have refrained from committing yet, because we know a lot of people follow these commands; if we release something that is not fully tested, people will send us a lot of requests. So we only release a stable version after a lot of testing. For you: try to install it, and if you have issues let us know. I'm not sure whether the issue someone mentioned yesterday is solved; otherwise that is definitely a main focus, because we want it to be installable on all platforms: Windows, Mac, and Linux. The overall idea is that we want to maximize the use of MetaboAnalyst and the knowledge behind the code. I always say about 75% of the functions are available on the web, and the web is very easy to use; but for people who want to build a pipeline, especially for batch analysis, it is much better to use the R package. If you don't care about the interactive, real-time part and just want to run commands with fixed parameters, the MetaboAnalystR package is much easier. So that finishes the statistics, and we will move on to enrichment analysis. I think I'm still on time; it's 2:30, so we still have one hour. Do you have questions, or shall I keep going? We have been talking about statistics since early this morning, and from here on we move to functional analysis. Functional analysis still uses statistics, but it also brings in our background knowledge, our prior knowledge. Everything before, PCA, PLS-DA, univariate statistics, is purely data-driven and does not consider prior knowledge; you have to think hard yourself about how a cluster or a top compound relates to the biology, and connect the dots on your own. For large omics studies that is really not easy, and we need the computer to help. How do we do that? We found that enrichment analysis really helps a lot; the first enrichment analysis approaches, or at least gene set enrichment analysis (GSEA), made this a popular and well-accepted way to understand a list of significant genes.
Why is this meaningful? A single gene can be involved in multiple pathways or multiple processes, so even if it is significant you are not sure which pathway is affected. A group behaves more specifically: group behavior pinpoints a particular pathway. Once you move to a group, a gene set or a pathway, you become more confident, because one single gene changing is not the same as ten genes changing together. So the overall feeling is that set-based analysis, whether gene sets or metabolite sets, is much more powerful and better suited to omics data. The purpose is to test whether biologically meaningful groups of metabolites are significantly enriched in your data. To do that we need to define these biologically meaningful groups, and that is our knowledge. The enrichment analysis itself is just a statistical test, such as a hypergeometric test, but the knowledge has to be built, and that is the critical part. Where does the knowledge come from? Pathways, diseases, subcellular localization: if all the compounds located in the mitochondria are enriched, you know something is happening with the mitochondria. This knowledge needs to be captured as sets and evaluated with statistics. Enrichment analysis combines our knowledge with statistics and gives us the link between patterns and biology. On the statistics side, there is over-representation analysis, single sample profiling and quantitative enrichment analysis; they all test whether something is enriched beyond what you would expect at random. But most of our time actually goes into collecting the knowledge. Most of this knowledge is based on HMDB. For example, we have disease-associated metabolite sets: for a disease there is a signature, a few compounds that change together, detected in blood, in CSF or in urine. HMDB has a disease browser, so some of these sets are already defined; we worked with HMDB to collect them and put them in a computer-friendly form. We can also test location-based sets, such as cell- and tissue-specific metabolites, and pathway-based sets, which are probably the most widely used. Another one is SNP-associated metabolite sets, which is quite interesting: a SNP change can be associated with a phenotype, and some blood metabolites show a corresponding signature, so these are SNP-associated metabolite sets, and that is meaningful. There are also drug pathways: if you take a drug, the changes in your blood or urine are captured here. And following yesterday's comment about pesticides and nanoparticles: if an exposure has a signature in terms of metabolites, we can capture it here, and people can test whether their data is related to it. So that is the knowledge; now, how do you upload your data? For over-representation analysis you upload a list of compounds, your significant list. How do you get that? You do a t-test, or you take compounds that cluster together. The second option is single sample profiling, where you upload the list of compounds together with their concentrations.
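The over-representation test itself is, as mentioned, essentially a hypergeometric test. A minimal sketch in base R, with purely illustrative numbers (the metabolome size, set size and hit counts are made up):

```r
# Minimal over-representation analysis sketch (hypergeometric test).
# Numbers are illustrative, not from a real study.
N <- 2000   # reference metabolome size (all compounds in the library)
K <- 30     # compounds belonging to one metabolite set (e.g. a pathway)
n <- 50     # compounds in your significant list
k <- 6      # significant compounds that fall inside the set

# probability of observing k or more hits by chance
p_enrich <- phyper(k - 1, K, N - K, n, lower.tail = FALSE)
p_enrich

# In practice this is repeated for every set and the p-values are
# adjusted for multiple testing, e.g. p.adjust(pvals, method = "fdr").
```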
So you compare against a normal reference; for now the reference values are for humans only. We know that some compounds should fall within a certain concentration range, and if you are high or low it becomes significant. The third option is quantitative enrichment analysis: you upload a concentration table, it runs the statistics internally, finds the significant compounds and performs the enrichment analysis. QEA is probably the more powerful approach, because the statistics are built into the procedure and tuned for metabolite sets; if you do your own statistical analysis beforehand, it will not necessarily give the best input. Once you select a metabolite set library and run the enrichment analysis, you will see the enriched pathways. So, as discussed, you can upload a compound list, a compound list together with concentrations, or a full concentration table. Again, this is all for targeted metabolomics; for untargeted data we will talk later about mummichog. Here you upload a compound list by copy and paste. These are compound names, and the tool tries to recognize them. One issue with compound names is typos: for example, "isolucine" here is missing an "e", so the tool will suggest "maybe you mean isoleucine" and ask you to correct it. If you use HMDB IDs or other public identifiers it is easier, but many of you will find compound names more practical, so standardize your names. Then you select the library: which pathway set you are interested in, disease-associated sets, urine- or CSF-associated metabolite sets, and so on; they are organized to help you select. Then you get the result, shown either as a network or in the more traditional plot. The network view emphasizes that the sets are not fully independent: each set shares metabolites with other sets. That gives you some feeling about causality: if, say, alanine metabolism and glycine-serine metabolism are both significant but share the same core metabolites, it is hard to tell one from the other. So the network view complements the traditional plot, which presents each set as isolated and independent; the network really helps you understand how correlated the sets are. Below that is a table, ranked by significance, and you can click "view". In this example, alanine metabolism has two significant matches, which is not very convincing: the set contains many compounds and you only matched two. Hopefully you get more matches; the more, the more convincing. You can also click through to SMPDB, because here we only show the name, and there you can see the full pathway description and where the compounds are located, which will help you build your story. Next is single sample profiling. What you need here is concentrations; they are compared with the reference ranges, which are based on SMPDB. When you compare, each compound is flagged as medium, high or low: medium means you are within the range, and high means you are above the range.
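The "maybe you mean isoleucine" suggestion boils down to approximate string matching against a dictionary of known compound names. A rough base-R sketch of that idea follows; the name list here is a tiny stand-in, not the real compound dictionary.

```r
# Minimal sketch of compound-name checking with approximate matching.
# 'library_names' is a tiny stand-in for the real compound dictionary.
library_names <- c("Isoleucine", "Leucine", "Glycine", "L-Threonine",
                   "Alanine", "Serine", "Creatinine")

suggest_name <- function(query, candidates, max_dist = 2) {
  d <- adist(tolower(query), tolower(candidates))   # edit distances
  best <- which.min(d)
  if (d[best] == 0) return(candidates[best])        # exact match
  if (d[best] <= max_dist)
    return(sprintf("no exact match; maybe you mean '%s'?", candidates[best]))
  "no match found"
}

suggest_name("isolucine", library_names)   # -> maybe you mean 'Isoleucine'?
suggest_name("glycine", library_names)     # -> exact match
```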
So, for example, here one compound is flagged high, and several are high overall, and you can include those high ones. Sometimes you may see a medium one that looks very marginal and you want to include it; you can click to select it. Then, for the compounds flagged high that you selected, you run the statistical enrichment analysis on them. For this module you can only work from single samples, since you need concentrations: you measure all the compounds in one sample, copy and paste the values here, and they are compared with the reference values collected from the literature. For example, for L-threonine here, what you measured is 39.19, it is labeled high, and you can take a view. In the view you see your measured value, 39.19, and all the concentrations reported in the literature with their PubMed references: 36.2, 12.2, study one, two, three; there were about ten studies reporting measurements in that range, and none of them reached a value that high. So either something is wrong with the patient, or the sample is more concentrated than it should be; it is essentially a sanity check. The third option works on a real data table: you have the whole concentration matrix, you click through data normalization and missing value imputation, and then the enrichment analysis, and you can click a node to study the details further. As I said, this algorithm is more powerful, so you get more sets lighting up. When you click one, you not only see the matched names, you also see a box plot, because the data was uploaded as a concentration table, so the tool actually knows the concentration ranges. This example is the cachexia dataset, cachexic versus control, with three matches, all of them higher in cachexic than in control. So that is enrichment analysis. Enrichment analysis really tests a set, and a set is just a list of names; it has no structure. You do not know who interacts with whom, who is upstream and who is downstream; you only know they change together. Pathway analysis is more refined, because a pathway contains more knowledge: there is a structure, and the visualization helps you see who is upstream and who is downstream. Currently we support 21 organisms, probably 23 by now, as we gradually add more model species. So we click pathway analysis and use the example data and just go through it. This is again a cancer dataset, a lung cancer dataset. We choose autoscaling, just for demonstration purposes, and the default parameters: the global test for enrichment, and KEGG as the pathway library. We also have HMDB and SMPDB, but a lot of the libraries are actually based on KEGG; KEGG has a long tradition, and parsing and rendering KEGG is quite well established in a lot of R packages. So how do we do the pathway analysis? One part is the enrichment analysis we already used for metabolite set enrichment. The other part is topology analysis, and that is new.
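For the single sample profiling step, the medium/high/low flag is just a comparison of the measured concentration against the reference range reported in the literature. A hedged sketch follows; the reference bounds are invented for illustration, only the 39.19 value echoes the example above.

```r
# Minimal sketch of single-sample profiling flags.
# Reference ranges are invented for illustration, not real library values.
reference <- data.frame(
  compound = c("L-Threonine", "Glycine", "Creatinine"),
  low      = c(10,  80, 30),   # lower bound of the reported normal range
  high     = c(30, 250, 90)    # upper bound of the reported normal range
)

measured <- c("L-Threonine" = 39.19, "Glycine" = 150, "Creatinine" = 20)

flag <- function(value, low, high) {
  if (value > high) "high" else if (value < low) "low" else "medium"
}

data.frame(
  compound = reference$compound,
  measured = measured[reference$compound],
  flag     = mapply(flag, measured[reference$compound],
                    reference$low, reference$high)
)
```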
So topology analysis: what does topology mean? Within a pathway or a network, we want to identify the positions that are more important. Which positions are more important? One is the hubs: nodes that are highly connected. If a hub changes, it probably has more effect than a leaf node at the edge. The other is the bottlenecks, the blue nodes here: if such a node changes, it has a lot of effect on everything downstream of it. Hubs can be captured easily with degree centrality, and bottlenecks with betweenness centrality: red means very high degree, blue means very high betweenness centrality. This is a slightly more advanced way to weight the nodes that are more important. The idea is that metabolites that change at an important position are more likely to have an impact on the pathway than those sitting at the very downstream end, at the lowest level; because of directionality, a change downstream does not affect the upstream. That is the usual way of thinking, and we simply incorporated this intuitive reasoning. So basically, for the metabolites that change, we also consider the topology, and we combine the two and visualize them. We put the impact on one axis: the further to the right, the greater the impact, meaning the changes sit in more important positions. On the other axis is the p-value: the smaller the p-value, the more significant the enrichment. What we really want are pathways where the changed metabolites are both significant and located at central, important positions. For example, if you click on glycine, serine and threonine metabolism, you will see the pathway laid out, and you will see that a lot of the changes are actually upstream or in the center: this node is connected to so many other things, so its degree and betweenness centrality are very high. Something else may give the same p-value but have changes only in the downstream part. So my point is: if you want to see which pathways are enriched, you should consider the impact, the structure, rather than the p-value alone. I have a quick question about that: how do you quantify the pathway impact, since it is obviously numerical here? Sure. Within each pathway, every node has a degree, based on how it is connected, and we normalize so that the values sum to one across the pathway; it is not the absolute degree, it is normalized by the total. So the maximum impact you can get is about one. For each matched metabolite you take its normalized importance value and add them up: the more important the matched nodes, the more the impact adds up. So let's see.
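To make that answer concrete, here is a minimal sketch of the idea (not the exact MetaboAnalyst implementation) using the igraph package: node importance is a centrality measure normalized to sum to one over the pathway, and the pathway impact is the sum of the importances of the matched metabolites. The toy pathway and matched compounds are made up.

```r
# Minimal sketch of the pathway-impact idea (not MetaboAnalyst's exact code):
# node importance = centrality normalized so all nodes in the pathway sum to 1;
# pathway impact  = sum of importances of the matched (changed) metabolites.
library(igraph)

# toy directed pathway: A -> B -> C, B -> D (edges point downstream)
g <- graph_from_literal(A -+ B, B -+ C, B -+ D)

cent <- degree(g, mode = "out")     # out-degree: upstream hubs score high
# betweenness(g) could be used instead to capture bottleneck nodes
importance <- cent / sum(cent)      # normalize so importances sum to 1

matched <- c("A", "B")              # metabolites found significant in the data
impact <- sum(importance[matched])  # pathway impact, at most 1
impact
```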
So the root of my question is really: will something that has, for example, a lot of downstream impact look different from something that has only, say, three downstream players? Would those be weighted differently? Yes. This measure considers direction: only upstream-to-downstream connections count. What you see here is essentially out-degree. If a node only receives edges and has no outgoing ones, its contribution is basically zero; even if it changes, it does not contribute to the impact. Only changes to nodes with downstream connections are added. Okay. Yeah, this is not rocket science: we just do a lot of normalization to keep the values comparable, encode the idea that upstream nodes should have higher impact than downstream nodes, and add them up. Whether there are better ways, I do believe there could be, but I am not aware of them, and there is a need for benchmarking. How should you interpret the proportions? For example, changes here and here we would all agree are important, but a change only down here you probably do not want to call high impact, and the algorithm reflects this: with the same number of changed metabolites, or even more, in a downstream position, you may still call the pathway significant, but it will not end up on the high-impact side. Thank you. The result also gives you some statistics: out of a total of 32 compounds in the pathway you have 8 matched, which is more convincing. You can click through to see the details, and again we capture everything: you get these figures, you get the report, and we are doing a benchmark analysis now. So let's keep going. This morning we talked about classification, looking at accuracy and error rate, which is fine, but that is not biomarker analysis. Biomarker analysis is actually quite different: it is about sensitivity, specificity, ROC curves and AUC, which are more meaningful in the clinic. That is why, as I said at the beginning, people do not simply apply some machine learning method; we realized it is not that straightforward, because we need to consider sensitivity and specificity, and we also need to respect tradition: clinicians traditionally use univariate analysis, one biomarker at a time, while the more modern approach is multivariate. So we can use random forest, SVM or PLS-DA, no problem. And a lot of the time people want to do it manually: "I believe these three metabolites are the most important, I want to select them manually." So there are three different modes, and you can do everything: univariate, multivariate with automatic selection, or manual selection. You go to the biomarker module and select the first example dataset. This is preeclampsia: serum samples from pregnant women, 25 with preeclampsia and 25 with normal pregnancy, and the goal is to find biomarkers for preeclampsia. The data pre-processing is similar to before; we only have five missing values, so it is fine, let's skip ahead. Here we click log normalization, because we want to keep things as simple as possible.
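Since sensitivity, specificity and AUC are the core quantities from here on, a small base-R sketch of how an ROC curve and AUC can be computed from a classifier's predicted scores may help; the scores and labels below are toy values, not the preeclampsia data.

```r
# Minimal ROC/AUC sketch from predicted scores (toy data).
set.seed(3)
labels <- rep(c(1, 0), each = 25)                      # 1 = case, 0 = control
scores <- rnorm(50, mean = ifelse(labels == 1, 1, 0))  # classifier scores

cuts <- sort(unique(scores), decreasing = TRUE)
sens <- sapply(cuts, function(t) mean(scores[labels == 1] >= t))  # sensitivity
spec <- sapply(cuts, function(t) mean(scores[labels == 0] <  t))  # specificity

plot(1 - spec, sens, type = "l",
     xlab = "1 - specificity", ylab = "sensitivity", main = "ROC curve")

# AUC via the Mann-Whitney relationship: P(score_case > score_control)
# (ties ignored for simplicity)
auc <- mean(outer(scores[labels == 1], scores[labels == 0], ">"))
auc
```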
Why? You may recall the question about building biomarker models: if you do a lot of data processing, you change the scale. Yes, we are aware of that. We would prefer not to touch the raw values, but on the other hand we have to, because the downstream machine learning and statistics expect it, so we want to use as simple an approach as possible. While we are on this page, there is something slightly new at the top: compute and include metabolite ratios. There is a well-accepted observation, or at least a conceptual expectation, that a metabolite by itself may not be as predictive as a metabolite ratio, because metabolism is about flux: metabolites are products and reactants, and their ratio is often more pronounced than either metabolite alone. A lot of the time this turns out to be true, so it can be better for biomarker analysis; people really want to do everything they can to improve performance, and they do find that a ratio biomarker sometimes performs better than the original metabolites. So we do allow you to include ratios. On the other hand, I will warn you: if you include the original metabolites plus the ratios, you significantly increase the number of features and therefore the risk of overfitting; there is a potential overfitting risk associated with the procedure itself. So whatever you do, if you use ratios you need to validate on an independent cohort: you should have a separate validation set that was never used in the cross validation. Including ratios will increase your chance of discovering biomarkers, but whether the performance you see here is realistic, most likely not, so you need to validate on a brand new cohort; that is what I would like to see. Let's move on with log normalization only; we did not use ratios here, that is a different story. We just use the simple setup, and let me explain how it works. We do the normalization, and after the log transform everything looks fine. We do not do the classical univariate ROC here, we do the multivariate ROC curve, and the default is a linear SVM with built-in classification. To build the ROC curve, especially a very smooth ROC curve, we need a lot of resampling: you need many subsamples to get smooth curves. What we do is MCCV, Monte Carlo cross validation with balanced subsampling: in each iteration we subsample the data, use two thirds for training and one third for testing. The details are described here, and the R code is all available. We try to be clean and efficient, and also not susceptible to imbalanced samples. For example, in biomarker studies people sometimes give you samples with only a few cases and a majority of controls, which affects your cutoff. So here we use balanced subsampling: every subsample has an equal number of cases and controls, and in that case we have a clear cutoff at 0.5, so we can work with that easily. For example, here we see the results using SVM, and we compare all the models.
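Here is a rough sketch of the balanced Monte Carlo cross validation splitting just described, assuming a two-class label vector; this shows the splitting logic only and is not the MetaboAnalyst implementation.

```r
# Minimal sketch of balanced Monte Carlo cross validation (MCCV) splits:
# in each iteration, draw an equal number of cases and controls,
# then use two thirds for training and one third for testing.
balanced_mccv_splits <- function(labels, n_iter = 100, seed = 42) {
  set.seed(seed)
  cases    <- which(labels == 1)
  controls <- which(labels == 0)
  n_per    <- min(length(cases), length(controls))   # balance the classes
  n_train  <- floor(2 / 3 * n_per)

  lapply(seq_len(n_iter), function(i) {
    case_s <- sample(cases, n_per)
    ctrl_s <- sample(controls, n_per)
    train  <- c(case_s[1:n_train], ctrl_s[1:n_train])
    test   <- c(case_s[(n_train + 1):n_per], ctrl_s[(n_train + 1):n_per])
    list(train = train, test = test)
  })
}

labels <- rep(c(1, 0), each = 25)      # e.g. 25 cases, 25 controls
splits <- balanced_mccv_splits(labels, n_iter = 5)
str(splits[[1]])
```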
One thing I need to mention: you need to specify the classification method and the feature ranking method. For SVM there is a linear SVM with a built-in ranking, and similarly for the other algorithms. Within each subsample we also do feature ranking. Why? Because we want to use a subset of features to build the biomarker models; we do not want to use all the features, since the more features you use, the more susceptible you are to overfitting. And if you can find two biomarkers that predict very well, it is much easier to develop them clinically. So feature selection and cross validation are built in. This is one challenge of all omics-based biomarker discovery: if you use all the features, you overfit; if you choose a few features, you also need to be very careful about the cross validation, because you must not select your features using all the samples and then marvel that the classification is so good, which is clearly overfitting. You cannot do that. Feature selection has to happen within each fold, inside each round of cross validation, not outside it; the two cannot be separated. I want to make sure this is clear if you are interested in doing biomarker discovery. Jeff, I have a quick question. First, how many features do you consider here? And second, did you adopt any feature reduction technique to avoid overfitting and improve model performance? We do not know in advance how many features will work best, so we create a ladder: 2, 3, 5, 10, up to 100 features. Using the pre-ranked features you take the top 2, 3, 5 and so on, and hopefully a simple model actually performs best. The point is that selecting more features does not necessarily give you a better model, because of overfitting, and the cross validation will tell you when you are overfitting. That is exactly what I want to show: the bottom part is cut off on this slide, but just below there is a ranking of the models with 2, 3, 5 or more features, and the top-performing model is not necessarily the one with the most features; two or three features probably give you the best model, and the cross validation does tell you that the model using 100 features is actually not working well. Jeff, sorry, is there a risk of information leakage if we select the most important features to build the model? Yes, that is what I just mentioned: you have to do the feature selection within each fold; you cannot do the selection using the whole dataset. If you select the features within the two-thirds training portion and use them to predict the one third that was not used for selecting the features, then you are not going to leak information. That is really the only way, the best way, to address that issue. It can still cause some problems if your feature ranking is not stable, which will come up in the next few slides. Here it shows that model 4 is probably the best; model 4 has 10 features. When you select one of the models you can also see the confidence interval, which people want to see: from 0.9 to 1, the 95% confidence interval, which is common. So this exactly addresses the question about features: first, we do the feature selection only within each fold.
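To make the "select features only inside the fold" point concrete, here is a sketch where, within one training subsample, features are ranked by an absolute t-statistic and only a ladder of top-k features is carried into the model; the data and the simple nearest-centroid classifier are made up for illustration and stand in for the SVM used in the tool.

```r
# Minimal sketch: feature ranking happens inside the training fold only,
# then a ladder of top-k features is evaluated on the held-out third.
set.seed(4)
X <- matrix(rnorm(50 * 30), nrow = 50)        # 50 samples x 30 features
y <- rep(c(1, 0), each = 25)
train <- sample(seq_len(50), 34)              # roughly two thirds
test  <- setdiff(seq_len(50), train)

# rank features by absolute t-statistic, using the training rows only
tstat <- apply(X[train, ], 2, function(f)
  abs(t.test(f[y[train] == 1], f[y[train] == 0])$statistic))
ranked <- order(tstat, decreasing = TRUE)

# evaluate a trivial nearest-centroid classifier for each ladder size
for (k in c(2, 3, 5, 10)) {
  feats <- ranked[1:k]
  mu1 <- colMeans(X[train, feats][y[train] == 1, , drop = FALSE])
  mu0 <- colMeans(X[train, feats][y[train] == 0, , drop = FALSE])
  pred <- apply(X[test, feats, drop = FALSE], 1, function(s)
    as.integer(sum((s - mu1)^2) < sum((s - mu0)^2)))
  cat(k, "features: held-out accuracy =", mean(pred == y[test]), "\n")
}
```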
For example, we use two thirds of the samples to select the features and predict the remaining third, which was not used for feature selection; then we take another two-thirds subsample, select features again, and predict again on data that was not used for the selection. The subsampling and prediction are repeated as many times as possible, because we are doing Monte Carlo cross validation. The issue is that we do not get exactly the same significant features in each fold; they change. For example, in one two-thirds subsample glycerol may rank first, and in another it may rank third. So if you always take the top three, the set is not stable. If it were stable, life would be easy, but a lot of the time it is not. So in the end, how do you know which features are most meaningful? The only robust approach we found is to rank the features by the frequency with which they are selected: how many times a feature is selected into the models is meaningful, but the exact rank is not, because of this instability, especially with smaller datasets. Sometimes a dataset does give you a more or less consistent ranking, but a lot of the time the frequencies form an almost straight line without much variation. So far, enrichment analysis and pathway analysis are mainly for targeted metabolomics, while the biomarker and statistics modules are essentially orthogonal: you can use them for targeted or untargeted data. Now here is another, newer module called MS Peaks to Pathways, and this one is for untargeted data. The motivation is that people spend a lot of time identifying compounds before doing functional analysis, and then find out that the compounds are not significantly changed; a lot of time is wasted. If you knew which compounds were more likely to be involved in certain functions and significantly changed, you could focus on those. With untargeted metabolomics as discussed so far, you do statistical analysis on the features and try to identify the significant ones, which is fine, but a lot of the time you cannot identify those features: no compound is assigned to those peaks, no pathway is involved, and you cannot explain them. What we want to do instead is run the enrichment analysis first, and then try to identify the compounds involved in the enriched functions: the features that map to an enriched pathway are most likely the compounds within that pathway, so you only need to focus on those, and the result is interpretable, functional and meaningful. That is how this whole story comes about. We already mentioned that accurate annotation of features is very time consuming, because you need reference standards and MS/MS, and it can take from a day to a month; but you do not have to do that first, because if you look at the group level, the whole pathway level, statistics can help you a lot to reduce the uncertainty. How do we do that? We borrow the thinking from GSEA. We already mentioned that enrichment analysis looks at group behavior: an individual can be random, but a consistent pattern across a group is more convincing. How do you assess the significance? We do permutation; I already mentioned permutation.
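And for the stability point, here is a sketch of counting how often each feature lands in the top-k list across many subsamples, which is the "selection frequency" you would report instead of the per-fold rank; the data is again made up for illustration.

```r
# Minimal sketch: rank features inside many random subsamples and
# count how often each one lands in the top k ("selection frequency").
set.seed(5)
X <- matrix(rnorm(50 * 30), nrow = 50,
            dimnames = list(NULL, paste0("feat", 1:30)))
y <- rep(c(1, 0), each = 25)
k <- 5
n_iter <- 100

counts <- setNames(numeric(ncol(X)), colnames(X))
for (i in seq_len(n_iter)) {
  train <- sample(seq_len(nrow(X)), 34)       # roughly two-thirds subsample
  tstat <- apply(X[train, ], 2, function(f)
    abs(t.test(f[y[train] == 1], f[y[train] == 0])$statistic))
  top_k <- names(sort(tstat, decreasing = TRUE))[1:k]
  counts[top_k] <- counts[top_k] + 1
}

# features ranked by how frequently they were selected
head(sort(counts / n_iter, decreasing = TRUE), 10)
```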
GSEA is a permutation-based approach, which takes some time, but nowadays it is much, much faster because we understand how the statistic behaves; we can run GSEA almost immediately, and the GSEA used alongside mummichog is also very fast. The key idea is to leverage the power of the order inherent in biological systems to tolerate random errors in the accuracy of individual peak assignments. The important word is random: if the errors are random and we still see a consistent pattern, we know there is real signal. We always test whether the observed order is significant, and the chance of seeing such an order from a random group is very low. That is what makes group-based functional analysis, without accurate identification of each individual peak, make sense. So, LC-MS peaks: you upload your peak list; you do your peak alignment and peak-level analysis in your favorite tool, or in MetaboAnalyst, and then copy and paste the ranked peak list. The peaks can be ranked by p-values or by t-scores. I recommend that you always use a two-group comparison, not multiple groups, because two groups are easier to interpret. You can use different formats, one, two or three columns, depending on whether you include the p-value and the t-score; we have a lot of examples, so you are welcome to try them and see which one is most suitable. Alternatively, if you do not want to do the statistics yourself, you can upload a peak intensity table. There are two algorithms, mummichog and GSEA: mummichog was designed for metabolomics, and the GSEA version we adapted from gene set enrichment analysis; we found both are helpful, so we provide both. There is also a setting for adducts and currency metabolites, which lets you refine how adducts and currency compounds are handled and usually refines the results a little. By default we use settings based on best practice, but some people are real experts and we allow them to adjust it. We have a lot of pre-built pathway libraries, updated last year. Here, I believe this is the GSEA result, because it shows, for each pathway, how many of the uploaded peaks were up- or down-regulated; you can click to see the hits. At the next level you see the matches overlaid on the KEGG pathway: you select a pathway, it is highlighted here, and you can click the individual matched compounds. You will see how many matches there are, with the different adducts, and judge whether they are meaningful. So for an individual peak you may not know the identity, but the function-level result is more trustworthy. When you click a compound you see which peaks are assigned to it; a compound with many potential hits is more likely to be real, especially when a lot of the matched compounds concentrate in a particular region of a particular pathway, which gives you some confidence that the activity there is real. So if you want to do validation, you should focus on the compounds that map to your enriched pathways. That is mummichog using the statistical test. The other option is that we do not do a global t-test and rank the peaks; we just want to explore the patterns within the data. For example, I think this dataset is immune cells, probably dendritic cells, treated with different antigens.
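The putative annotation step underlying this kind of analysis, matching m/z values to candidate compounds under different adducts within a mass tolerance, can be sketched roughly as follows; the compound masses, adduct shifts and ppm tolerance are illustrative values, not the tool's actual library or defaults.

```r
# Minimal sketch of m/z -> candidate compound matching with adducts.
# Monoisotopic masses and adduct shifts are illustrative approximations.
compounds <- data.frame(
  name = c("Glycine", "Alanine", "Serine"),
  mass = c(75.0320, 89.0477, 105.0426)
)
adducts <- c("M+H" = 1.00728, "M+Na" = 22.98922)   # positive-mode examples
ppm_tol <- 10

match_mz <- function(mz, compounds, adducts, ppm_tol) {
  hits <- lapply(names(adducts), function(a) {
    expected <- compounds$mass + adducts[[a]]
    ppm <- abs(mz - expected) / expected * 1e6
    ok <- which(ppm <= ppm_tol)
    if (length(ok) == 0) return(NULL)
    data.frame(mz = mz, compound = compounds$name[ok],
               adduct = a, ppm = round(ppm[ok], 1))
  })
  do.call(rbind, hits)
}

match_mz(76.0394, compounds, adducts, ppm_tol)   # ~ Glycine [M+H]+
```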
I forgot the full context, but the treatments cause different changes. If you upload a peak intensity table, you get this interactive heatmap. It shows regions or patterns that look interesting; for example, you can drag with your mouse and stop, say right in the middle here, and that becomes your region of interest. You think the peaks in this region change a lot with respect to the experimental conditions, and you want to know whether certain functions are enriched there; so you run the enrichment analysis on that region. In other words, you are not doing a global ranking by p-values; you are using your eyes to select a region of interest and then testing whether certain things are enriched in it. Here, for example, you can see CoA biosynthesis and phenylalanine metabolism; you can label them with different colors, see how many hits there are, all labeled, and click through to see the different adducts on the right. So this complements the statistical test: it allows you to visually select regions, do the enrichment analysis, and then design your MS/MS experiments based on that. So, again, we have covered about half of the functions within our tools and there is still a lot more to cover, and I think it is almost your turn now, or we still have some more time.