standard slides, and we're going to be looking at MetaboAnalyst; that's the software we're going to learn about today. Just as an introduction to MetaboAnalyst, remember that we're doing a metabolomics experiment where we'll have collections of samples, biological and technical replicates typically, so dozens to hundreds of samples, and from those biological samples we're going to be measuring hundreds to thousands of metabolites. Now, it could be metabolites, it could be genes, it could be proteins, the same ideas apply, but conceptually we're going to be working with metabolites.

You can measure metabolites with the targeted approach, which is largely what we talked about yesterday, or you can measure metabolites as bins and peaks with the untargeted approach. Both are valid, both are used, both are useful. I've told you my bias or preference, and hopefully I've convinced you of its merits, but they're both still used and useful, and each has a certain type of workflow.

With the targeted one, typically, as we saw yesterday, you look at your spectra, check the quality of the spectral data and fix it, then you do some compound identification, and with that some compound quantification, and at that point you could download the data or save it as a file. That was one biological sample; you do it for 10, 20, 100. The next step is data normalization, and that normalization, as I say, can mean scaling, that is, adjusting for concentration differences: some things are dilute, like urine; blood can be diluted; CSF you sometimes don't have to worry about. The other aspect of normalization, remember, was converting a log-like distribution into a normal distribution. So those are the two things you have to do after your data has been compiled. Then you might also look at your data to see if there are outliers, mistakes, data entry errors, typos, things that were zeroed out for whatever reason; this is quality checking, quality assurance, outlier removal. Then the next part is data reduction: dimension reduction, PCA, data analysis, data interpretation, pathway mapping, all those things. So that is the workflow.

Now with the chemometric or untargeted approaches, which are more common for MS-based techniques, the first thing to do is check your data to see if the spectra are properly referenced or normalized. The next thing is to take the spectra and align them; if you collect 100 spectra, whether they're HPLC traces or TICs, you're going to try and match them all up, hopefully having collected them under good conditions. Once they're aligned, you can sometimes work straight with the aligned spectra; if you're data-limited or computer-limited you can bin the data, which helps reduce it and lets you focus on the peaks and ignore the noise. Then you can do some data scaling, making sure everything is adjusted to the right concentrations, and also trying to get the distributions of those relative heights normalized. Then there's quality control, checking that nothing strange happened: noise removal, noise reduction, outlier reduction. Then you have your data and you can do PCA, you can do the data analysis, but at this stage everything is just peak one, peak two, peak three. So you've done your PCA and you've identified that peak one and peak two are
distinct from peak three; that's when you have to go to your compound identification to figure out what was driving those peaks apart. So this method, as I say, is used by a lot of people, and this is where the challenge is: if you've found a nice separation, and your loadings plot clearly shows that, say, two peaks out here are driving the separation, you have to hope you can identify those two peaks. If you can't, then you can't publish.

Okay, so all of these workflows begin with a data integrity check. In some cases it's this issue of trying to identify false positives. These are presumably all controls, all replicates, technical replicates or something like that, and what you're seeing here is that this is the normal one, but now you're seeing this peak changing: it shifted, or it appeared and disappeared. So what's going on? You might have to figure out whether it's something contaminating the column, or contaminating the inlet or the probe. We've talked about adducts in mass spec; with gas chromatography we didn't talk about the issue of extra derivatization products, which produce false positives; then there are the isotopomers, certain breakdown products, neutral losses and ionization effects. All of those things produce this sort of flaky data. This is not a problem with NMR, which is a different technology, a different technique; it's partly saved because it's not that sensitive, so lack of sensitivity is one benefit, oddly enough. Anyway, people will obviously check to see if they can eliminate some of this data, and there are programs that come with the various mass spec systems to help do this, but there are also manual ways to eliminate some of these things. So you check your data, you try and get rid of the false positives, and you try to merge your peaks so that you're not counting adducts eight times and isotopomers twelve times.

Once you've done that, you can start the next phase, which is the spectral or data alignment. Whether it's a total ion chromatogram or an HPLC or UPLC run, it's not unusual that this is one run in red, and half an hour later this is the next run on the same or a similar sample, and there's a drift. This is particularly problematic for LC. What you can do with software is actually shift these things to deal with the drift; sometimes the column wears out, things start stretching out, peaks start broadening. So you can do the alignment, and the favourite tool I think many people use is XCMS, but there are others, like MZmine, and the technique is called time warping. I've listed some places where you can grab or download these different alignment algorithms.

Binning used to be more common; it's not done so often today, but it's a way of grouping peaks together, and it allows you to get rid of essentially noisy data. It comes from when CPUs were a little smaller and memory was a little tighter, but conceptually it's still a way of clustering or grouping things so that you can get peaks and simplify the spectra to some extent. This is commonly done in NMR and could also be done in mass spec, but as I say it's an optional thing; it still creates groups of peaks.
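To make the binning idea concrete, here's a minimal Python sketch (not part of MetaboAnalyst itself), assuming a made-up 1D NMR spectrum and a 0.04 ppm bin width:

```python
import numpy as np

# Hypothetical 1D NMR spectrum: 20,000 points across 0-10 ppm (made-up numbers).
ppm = np.linspace(10.0, 0.0, 20000)
intensity = np.abs(np.random.randn(20000))        # stand-in for real intensities

# Sum the signal inside fixed-width bins (0.04 ppm is a commonly used width);
# ~250 bins then stand in for 20,000 raw points, which suppresses small
# peak-position jitter and a lot of the noise.
bin_width = 0.04
edges = np.arange(0.0, 10.0 + bin_width, bin_width)
binned, _ = np.histogram(ppm, bins=edges, weights=intensity)

print(len(ppm), "points reduced to", len(binned), "bins")
```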
So you've tried to clean your data, you've tried to align your data, maybe you've done some binning; now you do this normalization slash scaling, and this is important. This could be a sample from one person, and this could be a sample from the same person at a different time, or it could be a sample from a different person. You can see that these two look very similar in terms of where the peaks sit, but one is about two or maybe three times higher. Is this a real difference? Is this tuning, related to the sensitivity of the detector? Is this a dilution effect? If this were urine, maybe the person at the bottom had been drinking a gallon of water or something. So this is one of the issues you have to deal with: are we seeing a dilution effect, because someone added too much buffer or someone drank too much, or are we dealing with a real phenomenon, where these are physically very different individuals or different processes?

You can sometimes deal with dilution at the whole-sample level: you can try to match the total integrated area across samples, and there are also probabilistic quotient methods. Sometimes people add internal standards to make sure the same concentration is being measured. And it depends on the sample: for urine it's a big issue; for blood it's not so much; for cell samples or cell extracts, as long as the same number of cells was taken it shouldn't be an issue. But only the experimentalists know what sort of scaling factor they need; it's not something that can be uniformly applied.

So here scaling and normalization are different things; it's just that people use the terms interchangeably. By normalization I mean real statistical normalization, making the distribution normal. Sometimes people will scale to certain features; if there's some unique feature here and you try to work on only that, well, there's only a tiny, tiny amount there, that's something you'd ignore, so instead you look at these lines that are perhaps more informative. Then, after the scaling, you can do the normalization, which is trying to get the concentrations to follow a normal curve rather than some sort of skewed distribution. That's the other thing you still have to do after the scaling, and there's log transformation, auto-scaling and things like that. We'll show you that MetaboAnalyst will do this for you, but you don't know in advance which method is going to give you a nice normal distribution, so you typically have to try a couple and iterate: okay, let's try a log transformation; sometimes you don't have to do anything at all and it's already nicely normally distributed. So those are the things you have to adjust for.
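As an illustration of the dilution-correction side mentioned above, here's a small generic sketch of total-area and probabilistic quotient normalization on a hypothetical samples-by-metabolites matrix (this is a sketch of the general idea, not MetaboAnalyst's exact implementation):

```python
import numpy as np

# Hypothetical data matrix: rows = samples, columns = metabolite signals.
rng = np.random.default_rng(6)
X = np.abs(rng.normal(loc=1.0, scale=0.2, size=(39, 47)))
X[0] *= 3.0                          # sample 0 is "concentrated" 3x (e.g. less dilute urine)

# Simplest fix: scale every sample to the same total integrated area.
X_total = X / X.sum(axis=1, keepdims=True)

# Probabilistic quotient normalization: compare each sample to a reference
# profile (here the median sample) and divide by the median ratio, which is a
# more robust estimate of the dilution factor than the total area.
reference = np.median(X, axis=0)
quotients = np.median(X / reference, axis=1)
X_pqn = X / quotients[:, None]

print("estimated dilution factors:", np.round(quotients[:3], 2))
```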
Okay, so we've done those first three steps in our data flow, and the next ones are quality checking and quality assurance, some aspects of outlier removal, and then the data reduction. For the quality checking and outlier removal, sometimes you can remove solvent peaks, you can remove noise and outliers, but typically if you're removing data, make sure you have some reasonable justification. Don't remove data because "it makes my fit better"; that's unethical. And then, after you clean things up, we do the PLS-DA, the PCA, whatever else you want, clustering; that's when we get into dimension reduction and all that fun multivariate statistical stuff.

Question: so the QC here is really based on the data we've got? Because what I heard is that from the instrument we'll have another QC. Yes, there are multiple levels where you need to do quality control. At one level you're trying to make sure the instrument is producing reproducible data, so every tenth or every twentieth sample you will put in something that has a defined composition, a standardized sample, and you'll compare those quality control samples throughout the run to see if they look identical and superimposable and everything else. There may be quality control checks you do at the beginning and end of the day to make sure the instrument is tuned properly and detects a certain standard compound. And then, because you can rarely do a metabolomics experiment in one day (it may take several weeks or months), you're going to look at how that varies over days or weeks. Sometimes you may be combining data from different platforms, so again you want to find out if there are ways of combining them; this is another point about quantitative metabolomics, that combining data from platforms is trivial, whereas if you are using only chemometric methods, combining from different platforms, forget it. But if you've been working on one platform for a long time, you want to see if there's been systematic drift, and we'll see some examples at the end of this lecture where tests have run over multiple weeks and the very first batch of eighty samples is at one level, while the last batch of eighty samples, which is largely the same group, is way up here; there was a drift, and it has to be normalized or scaled so it's brought back down to the normal values. All of those things are done and need to be done, just because instruments aren't perfect, people aren't perfect, and things drift over time.
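One common way to use those pooled QC injections (a generic practice, not something specific to MetaboAnalyst) is to drop features that don't reproduce well in the QCs; a rough sketch, with an assumed 30% RSD cutoff:

```python
import numpy as np

# Hypothetical untargeted peak table: rows = injections in run order, columns = features.
rng = np.random.default_rng(5)
X = np.abs(rng.normal(loc=5.0, scale=1.0, size=(100, 500)))
qc_mask = np.zeros(100, dtype=bool)
qc_mask[::10] = True                      # a pooled QC sample injected every tenth run

# Relative standard deviation of each feature across the QC injections only;
# features the instrument can't reproduce in identical QCs aren't trustworthy
# in the real samples either.
qc = X[qc_mask]
rsd = 100 * qc.std(axis=0, ddof=1) / qc.mean(axis=0)

keep = rsd < 30                           # an assumed cutoff; 20-30% is a common choice
print(f"keeping {keep.sum()} of {X.shape[1]} features")
X_filtered = X[:, keep]
```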
Okay, so we're going to talk about MetaboAnalyst. I've gone through the general workflow of how you do this, and MetaboAnalyst is a server designed to support that kind of workflow, both for targeted and for non-targeted metabolomics. It's designed to work for NMR, LC-MS and GC-MS, so the major platforms. You can do univariate statistics, you can do multivariate statistics, all the things we saw earlier: t-tests, ANOVA, PCA, PLS-DA. It does this on the web, with graphics, with nice plots, and it explains things; it's designed to help people do analysis, especially those who aren't fluent in R or MATLAB or who don't do statistics in their dreams. What we also try to do is link that data to other resources that we've developed and that we talked about yesterday, like the Small Molecule Pathway Database, the HMDB, KEGG and others.

So the workflow is: data preprocessing, which is the thing we talked about, quality checks and everything else; data normalization, which is scaling slash normalization to Gaussian; then the data reduction, which is essentially the data analysis; and it can also include data annotation, especially if you have to do that at the end of an untargeted study. In a little more detail, you can put in a variety of data: raw spectra, peak lists, bins, or the tables that you got from Chenomx, which are compound identifier plus concentration, which, as I say, is the recommended input that more and more people are suggesting. Once the data is in, you can do a variety of things: if it's spectra, you can do some spectral processing; if it's a list of peaks, you can do some peak processing; if it's bins or spectra, you can do some noise filtering. If it's concentration tables or peak lists, you can also do some missing value estimation, which helps deal with the normalization problem. You're dealing with hundreds, perhaps tens of thousands, of data items, so there's a data integrity check just to make sure that everything's been entered correctly.

Then we can do this scaling slash normalization, and this is where you interactively see how you can change your data set, which may have been very skewed, or may have some odd concentrations up at the molar level while everything else is at the millimolar level, and you try to get the data to look like a normal distribution. If you don't do this, everything you do from here on is probably not correct. That normalization step is interactive, because you don't know whether it needs a log transform or no transform at all, and you don't know whether you have to do row-wise or column-wise normalization, so you do an interactive thing: try this, try this, try that. Finally, if you're happy with the normalization, you can start doing the data reduction. Question: for the data, should we start with the row normalization, or which one comes first? Um, I can't remember right now which one is typically first; I think the row-wise is the most common one; we'll go through that a little later.

From here you have several options. You can do the multi-group analysis, which we spent the morning going through; you can do time series analysis, whether it's a longitudinal study or a cross-sectional study. With either of these, you can then do the next step, which is to try to get some biological interpretation: you can do pathway analysis and you can do enrichment analysis, which is similar to gene set enrichment analysis but is metabolite set enrichment. This is where you can take those clusters and actually go to people and say, you know what, these are the pathways that have changed, these are the metabolites that are most interesting. We can then take that data and spin it out into tables, graphs and images at very high resolution. We can also do other things that we could have done up here as well: some quality checking, checking things like temporal drift and batch effects, so that could have been here or here. And then, if as I said we were starting with just raw LC-MS data and we got some of these clusters, now we want to figure out what those compounds are, so there is support for data annotation.

Question: there's some trouble with imputing missing values, though, isn't there? On the one hand, if you just fill in the same value for everything, you reduce the variation, which can lead to things like overfitting; you can create an artificial signal. What are some things that you like to do for imputing missing values?
The one that we most often do, for missing values with the type of metabolites we measure, which is more targeted, is this: if we can't see it, we know it's below our detection limit, and we know what our detection limit is, so we take that detection limit and divide it by two. It then becomes a common value, so it's not zero, but it is a value, and the value is uniform; you could also make it have some variation about that norm, about half the value of the lowest limit of detection. There are problems when you zero things. I think Jeff has a comment.

I think one of the things that's becoming clear, and this is maybe what Jeff was pointing out if you didn't hear it, is that if you're seeing mostly missing values for a feature, get rid of the whole thing. Basically, human, animal, plant and microbial variation is not that large. Yes, we can see things varying by a factor of two, but you're not going to find a situation where, if we looked at the urine or blood or CSF of the people in this room, someone had only half the compounds of everyone else; they're going to have essentially the same compounds, maybe with a variation of five percent. So someone took acetaminophen this morning and I didn't, so I'm not going to have that in my CSF; that's one compound, but it's not going to represent half the variation. In LC-MS, as you say, there are lots of situations where you're going to see these missing features, and I think that's leading to a lot of problems. Really what you want to do is get to the point where you're only keeping features with high reproducibility, the ones that are common peaks in about seventy-five or eighty percent of samples. Yes, that's cutting down what you're seeing, but that represents the real variation in animal systems; we're not that different. And when we do this with techniques like NMR, where everything is identified, that's about what we see: about eighty to ninety percent consistency in what is present in each organism or sample.
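To make the half-of-detection-limit idea concrete, here's a small sketch on a hypothetical concentration table: first drop mostly-missing features, then fill what's left with half the lowest observed value as a stand-in for half the detection limit (the 50% cutoff is an assumption for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical concentration table with missing values (NaN = not detected).
df = pd.DataFrame({
    "glucose":  [1.2, 0.9, np.nan, 1.1],
    "alanine":  [0.4, np.nan, 0.5, 0.6],
    "rare_cpd": [np.nan, np.nan, np.nan, 0.02],
})

# 1) Drop features that are missing in most samples (here >50% missing).
df = df.loc[:, df.isna().mean() <= 0.5]

# 2) Replace remaining missing values with half the lowest observed value of
#    that metabolite, i.e. a stand-in for "detection limit divided by two".
df = df.apply(lambda col: col.fillna(col.min() / 2))
print(df)
```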
Okay, so we're going to go through MetaboAnalyst, and we're going to look at four steps: raw data processing; data reduction, dimension reduction, PCA; then metabolite set enrichment analysis; and then metabolite pathway analysis. MSEA was a separate module and is now part of MetaboAnalyst; MetPA was a separate module and is now part of MetaboAnalyst, so you can access them both ways, I suppose, but we're going to treat this as all one happy family.

This is a step-through, so I'm not doing it live, because sometimes the internet dies and it's hard to do while you're talking, so I've taken screenshots and we'll talk through them, and the idea is that after lunch you're going to go through some of these screenshots yourselves: you're going to walk through this, run the program and access it. Anyway, this is what the home page looks like, or close to it, and it explains some things; there are overviews you can read, and there are different data formats, and by clicking on this panel you can navigate to different regions of the program. Jeff is the person who wrote this, and he's here, so he can answer more of the questions than I can. It's been out for a couple of years, it's used by, on average, I don't know, 100 people a day I think, and the server has had to be upgraded about four times because the use is pretty intense.

If we click through to the data formats, there are some example data sets provided, and one of the data sets we're going to look at is cow data, just because it's different enough. This one is compound concentrations, so it's the targeted approach, and we're doing this for a couple of reasons: partly to keep consistent with the theme of the course, but it's also faster; the data processing is faster when you have concentration tables. If we were using peak intensity data or bin data it takes longer, and with 16 people hitting the site it's just not going to work, so to save processing time we're using compound concentration data. There are also other zip files people can use, including the full LC-MS data; that's one you can download and try to process, and it's fairly taxing on the server, but it can work. There's an explanation of the format: it's comma-separated values, which Excel handles, so every Excel spreadsheet can be saved as a .csv file.

The first thing to do is to convert that data into matrices suitable for statistical analysis, so peak lists, spectral bins and raw spectra can be converted into tables. We're working with concentration data, so that's almost already converted; it's almost ready to be suitable for statistical analysis. Once you've gone to the first step, you upload your data, and you can see the different steps: upload, process, do some statistics, do metabolite set enrichment, do pathway analysis. If you have untargeted data, you can figure out what your compounds are and start labelling your metabolites; since we already have concentration data, we don't even have to worry about that part, we just worry about these ones. So here's our data format; in this case we've actually uploaded the cattle data, and it can identify whether it's concentration data, peak data or spectral bins; it could have been a zipped file, and it'll ask about that; and then it'll ask whether the samples are in rows or in columns, so you have to know that.

So here's the data. As I said, these are rumen samples, 39 of them, measured by NMR, from animals fed different proportions of grain. You can imagine a bunch of cattle on grain diets for, I don't know, two weeks, going from grass to grain to more grain to lots of grain; this is how they do the fattening up of cattle and also how they improve milk production. Cattle are designed to eat grass, not grain, and it's been known for many years that as you increase grain, cattle start having problems, and some of them get into very serious problems, but the researchers wanted to know what it was doing, so in this case they were sampling what was going on in their stomachs; that's the rumen. These are, as I say, different types of test data that you can select, click on and upload. These are dairy cattle; they've also done it with beef cattle and get similar results. Different proportions of grass versus oats and barley, at fifteen, thirty and forty-five percent of the diet; stick a needle in them, pull out the rumen fluid, that's the digesting food, and analyze it by NMR, just like what you did for CSF. And as I said, grain is known to be stressful to cattle and actually kills a lot of them. So we've uploaded our data.
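For reference, the uploaded file is just a comma-separated table; here's a miniature, invented example of that layout read in Python (the exact column arrangement shown is an assumption, so check the format page for the real specification; all numbers are made up):

```python
import io
import pandas as pd

# A tiny version of the comma-separated layout (assumed: samples in rows,
# first column = sample name, second = group label, remaining columns = metabolites).
csv_text = """Sample,Diet,Acetate,Butyrate,Glucose
cow01,0%,42.1,11.3,0.8
cow02,15%,39.5,10.9,1.1
cow03,30%,44.7,12.8,1.9
cow04,45%,47.2,13.5,2.6
"""

df = pd.read_csv(io.StringIO(csv_text))
groups = df["Diet"]                          # 0%, 15%, 30%, 45% grain
conc = df.drop(columns=["Sample", "Diet"])   # the numeric concentration matrix

print(conc.shape)      # the real cattle table would be 39 samples x 47 metabolites
print(groups.value_counts())
```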
Now it does a data integrity check, and it says the data check passed. It's identifying four groups: zero percent, fifteen percent, thirty percent and forty-five percent. These are not pairs, so this is an unpaired comparison; thirty-nine samples, forty-seven metabolites, so thirty-nine cows, roughly ten for each of the four groups. All the values are numeric and none are negative; concentrations shouldn't be negative, and if there are negative values you know you've got a typo. Then there are the zero values and missing values: here there are no missing values, and again people often make mistakes with those. For the zero values we're substituting small values, which can be arbitrarily chosen or calculated from the data. Anyway, everything looks fine, so we can go ahead and skip some of that.

Now we want to do the normalization. For the rows and columns: the rows are the samples and the columns are the metabolites. In many cases you may not need to normalize the rows at all, but you could also normalize to a reference sample, which was chosen here. Then there's the column-wise normalization; these are the metabolites, so you have things that are millimolar and micromolar, and the question is whether they're normally distributed or not. We could have chosen none, we could have done a log transform, or we could do something called auto-scaling, Pareto scaling or range scaling, and the explanations and citations for these are provided on the website, so you can learn a little about them. In this case we chose to normalize by a reference sample (arguably we could have just done none), and then we've done auto-scaling, which works really well for things like urine and rumen fluid.

So we now have a data matrix with the samples in rows and the compounds, the metabolites, in columns, and these are the different types of normalization, row-wise and column-wise. Row-wise, we're trying to make the rows comparable to each other, so if there were a dilution effect, this is the scaling that would help deal with it; in principle there shouldn't have been one, because rumen fluid isn't like urine, so we didn't try to adjust for dilution. Column-wise, we're trying to deal with the huge range in concentrations, some at sub-micromolar and some at high millimolar, so things spanning several orders of magnitude; in some cases with mass spec you'll have things at nanomolar and at millimolar, a concentration range of about six orders of magnitude. When those concentration ranges are that large there's log transformation, which is the most common, auto-scaling, which works pretty well for a lot of things, and Pareto and range scaling. As I say, for the row-wise we didn't really have to deal with it.

Now we're looking at the metabolite concentrations, and this is usually where the biggest problem is. You can see that most things are at very low concentrations, probably in the micromolar range, but then there are a couple of compounds, butyrate and acetate, which are in the high millimolar range, so what you've got here is a distribution that is almost exponential, and you can't do good statistics on an exponential distribution. So we did auto-scaling, and now what looked very biased has become something that looks quite nice; what's shown here is the distribution of concentrations, and to my eye that looks like a normal, Gaussian distribution.
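For reference, here's roughly what those column-wise options amount to numerically, sketched on a made-up samples-by-metabolites matrix (these follow the usual textbook definitions; MetaboAnalyst's documentation has the exact details and citations):

```python
import numpy as np

X = np.abs(np.random.randn(39, 47)) + 0.01   # hypothetical concentration matrix

# Log transformation: pulls a near-exponential concentration range together.
X_log = np.log10(X)

# Auto-scaling: mean-centre each metabolite and divide by its standard deviation.
X_auto = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Pareto scaling: divide by the square root of the standard deviation instead,
# which shrinks large values less aggressively than auto-scaling.
X_pareto = (X - X.mean(axis=0)) / np.sqrt(X.std(axis=0, ddof=1))

# Range scaling: divide by the (max - min) range of each metabolite.
X_range = (X - X.mean(axis=0)) / (X.max(axis=0) - X.min(axis=0))
```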
Now, we could have tried log scaling; you could try that, and maybe you'd get a curve with a slightly different shape that also looks normal. It's a qualitative assessment, asking how normal does this look, but that is a good normal distribution. You can sometimes get into the situation where none of your scaling helps and you still end up with something bimodal, and then you've got a problem; you probably have to break your data set up into two parts.

Question: can we do the log transform and then do the normalization? Well, we could have done the log transformation, we had the option; this was auto-scaling, so we could have turned that off and just clicked log. But first log and then auto-scaling? No, I don't think it supports two at the same time; it's one or the other. Can you try all four? Yeah. Because basically in microarray analysis the routine in the pipelines is to first take the log transformation and then do the normalization, I mean the auto-scaling. Auto-scaling already behaves a bit like a square root, so it's basically like a log; I've already tried all of them. So then it's divided by the unit variance? That's right.

Another question: on the workflow you showed there's alignment, so do we not do alignment? We're not doing alignment here, because what we're working with is concentration data; alignment is something you do for the untargeted approach, so yes, that's being skipped for this demo. And if you split the data into two sets, at which point do you combine them? If you've got a bimodal distribution, you might as well treat it as two different experiments, two different platforms, whatever you want. Yes, I understand, but the rows: the rows are samples, yes. So the most important one is actually the column-wise; in this case it's normalizing the concentrations, and if you don't do that, a lot of the statistics get messed up. The row-wise, or sample, normalization or scaling is there especially for dilution effects or some sort of systematic bias that may have occurred; that's important to see and important to know, but typically if you're sampling blood, probably CSF, or cell extracts where the same number of cells has been taken and you've been careful about that, you shouldn't really have to do row-wise normalization. But because the instruments we work with can measure concentrations from nanomolar to almost molar, you do often need to do the column-wise, the metabolite-wise, normalization.

I think we ran into this issue we were talking about the other day: there's this Biocrates kit that generates mass spec data, and it has a bunch of metabolites up in the millimolar range and a whole bunch in the nanomolar range, so you typically get a bimodal distribution. I don't think we've been critical enough, because when we do the Biocrates data we don't get this quite so nicely; we get something that's sometimes bimodal, and really what we should be doing is splitting it into two different experiments, arguably. Question: what happens to the most extreme concentration, where does it fall after normalization? It comes in with the others; I think it's actually up here in the normalized concentrations. Does it appear in the middle of the peak or at the rightmost end, the one at the highest concentration? This one? Yes. So this is the maximum concentration, the green part, and this is the average. Where does it go in the normalized concentrations? This one is now here. Is that really true?
So this has gone through auto-scaling, but isn't it supposed to still represent the highest concentration? Well, I think if we'd done log normalization it would just have been shifted, but auto-scaling shifts things in a couple of different ways. Is that true? Is that a true representation of the extreme values in the normalized data? The way to think about it is that in this diagram each column is one metabolite, say a high-concentration one; all the numbers within that column get mean-centred and divided by that column's own standard deviation, so every metabolite ends up sitting around zero. So even though that compound is very high in absolute terms, after auto-scaling it ends up in the middle with everything else. So you mean-centre, but which mean do you use, the mean of this compound itself or a global mean? The mean of the compound itself, which is the standard approach; if you applied a single global mean you'd distort the whole trend. I understand this is no longer the absolute concentration, but the question we're answering isn't what the absolute value of acetate is; it's whether acetate is higher under certain conditions, and if you do the right normalization you can see how that population moves, whether it's plus or minus one unit or whatever; the unit itself is somewhat arbitrary. If you're looking at univariate comparisons you may really want the original, absolute concentrations; if you're looking at multivariate analysis it's the overall variance structure, the relative variation, that matters, not the absolute values. That's the idea behind it. Okay, so I think we'll move on.

So that's the normalization, and I think, as Jeff pointed out, these methods are standardly used, they're not special to MetaboAnalyst, and as Chris pointed out, we're really not worried about the absolute values here, we're worried about the relative changes. Now there's an aspect of quality control, and because we're dealing with concentration tables we don't have to worry about it that much, but if we were dealing with spectra this can still be an issue. As I mentioned, or as was shown in the diagram, there are things you can do for quality checking with drift, time drift and overall changes, but there are also cases where, although we did look for missing values, we could also look for values with typos, and typos are not going to be detected in that original data integrity check. Some of these typos can be picked up by inspection. Here's an example of a typographical error: we did principal component analysis and we actually see two very nice clusters, and then we see this point way out here. It could be real, but what you want to do is go back and look at the number, and in this case it was probably not real; someone had forgotten to put a zero in or put the decimal point in the wrong place. You could also use a clustering diagram, also called a heat map or hierarchical clustering; you can see this black line here, and that's actually the same sample showing up, so again some sort of typographical mistake. You couldn't have picked that up in that very first check, which just looks at missing values and whether everything is numeric in the rows and columns, but this allows you to sort some of these things out and go back and see if you had some typos.
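A rough sketch of that kind of by-eye check done programmatically: flag samples whose PCA scores sit far from everyone else, then inspect them by hand (the 3-standard-deviation rule here is an assumption for illustration, not a MetaboAnalyst setting):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(7)
X = rng.normal(size=(40, 47))                 # hypothetical, already-normalized data
X[17, 5] = 40.0                               # simulate a decimal-point / typo error

scores = PCA(n_components=2).fit_transform(X)

# Flag samples whose scores sit far from the rest; these are candidates to
# inspect manually for data-entry errors, not automatic deletions.
z = (scores - scores.mean(axis=0)) / scores.std(axis=0, ddof=1)
suspects = np.where((np.abs(z) > 3).any(axis=1))[0]
print("check these samples by hand:", suspects)
```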
Rather than re-entering everything and starting all over, you can go straight back into the sample editor and delete that problem value. But this is not something you should do just because you say, oh, I don't like this point, it doesn't make my curve look normal, I'd like something better, I'm going to keep deleting until I finally get the answer I want. It should be something that, as I say, is an obvious typo.

Then there's noise reduction. This is again something that's more specific to raw data rather than concentration-type data, but there are tools for filtering to eliminate certain variables or peaks. This is important particularly for LC-MS data and certain types of GC-MS data, and it uses standard techniques that people have established for this sort of noise reduction. So those last few slides are more specific to raw data; with concentration tables you can still look for outliers, but now we're more interested in data reduction and analysis.

This is where you've done your study, control versus disease, or growth medium one versus growth medium two, whatever it is, and we're trying to identify the important features, the important patterns, the differences between phenotypes, trying to classify and predict. So after we've finished the processing or pre-processing, we get to the statistics, and now we can do a whole bunch of things: fold change analysis, t-tests, volcano plots, ANOVA, correlation analysis, PCA, PLS-DA, significance analysis of microarrays slash metabolites, empirical Bayes analysis, heat maps, self-organizing feature maps, k-means partitioning, random forests and support vector machines. These are all supported, so it's quite an array of tools; many people only need a couple of them, but you can explore and try all of them as you wish. What's highlighted here is what we're going to do: some ANOVA, PCA, PLS-DA and some hierarchical clustering, and clicking on those items takes you into each analysis.

So our data is uploaded, and we can just click on ANOVA. Here is our analysis of variance; we're dealing with four groups, remember, 0, 15, 30 and 45 percent, so you use ANOVA, not t-tests, because we're trying to see if any one of these is different. We can choose a p-value threshold, that's your alpha; we arbitrarily chose 0.05, but it doesn't have to be, we could have chosen 0.1 or 0.01. There's some post hoc analysis, in this case Fisher's LSD. Here's our analysis, and there are icons here which you can click on to get a little more detail. In this case these are the compounds that have been measured: endotoxin, which is lipopolysaccharide, a bacterial metabolite; glucose; 3-PP, I can't remember what that is; alanine, isobutyrate, methylamine. And these are the original concentrations between the 0% and 15% groups for this compound, uracil. If we click on uracil, up pop these plots, a chart and a box plot of the normalized concentrations: at 0% grain, the grass-fed cattle, you can see something here, and for the grain-fed cattle you can see something here, and by the ANOVA test, with a p-value of 0.0002, we can be quite confident that yes, this one is different from the other three; that's what the ANOVA test is supposed to tell us. So grass-fed is clearly different from the other three at this level. We could click on this other one and maybe find that it too is different from the other three, I don't know, but you can query each one of these compounds by clicking on its information icon.
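Outside the web interface, the same kind of one-way ANOVA across the four diet groups might be sketched like this (synthetic data, and a hand-rolled Benjamini-Hochberg FDR step; MetaboAnalyst handles all of this for you):

```python
import numpy as np
from scipy import stats

# Hypothetical normalized data: 4 diet groups x 10 cows each, 47 metabolites.
rng = np.random.default_rng(0)
groups = np.repeat([0, 15, 30, 45], 10)
X = rng.normal(size=(40, 47))
X[groups == 45, 0] += 2.0                  # make metabolite 0 respond to heavy grain

# One-way ANOVA per metabolite: is at least one group different from the others?
pvals = np.array([
    stats.f_oneway(*(X[groups == g, j] for g in (0, 15, 30, 45))).pvalue
    for j in range(X.shape[1])
])

# Benjamini-Hochberg FDR correction (written out to stay dependency-light).
order = np.argsort(pvals)
bh = pvals[order] * len(pvals) / (np.arange(len(pvals)) + 1)
bh = np.minimum.accumulate(bh[::-1])[::-1]          # enforce monotonicity
fdr = np.empty_like(bh)
fdr[order] = np.clip(bh, 0, 1)

print("significant at FDR < 0.05:", np.where(fdr < 0.05)[0])
```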
So as I say, you did your ANOVA; zooming in by clicking here tells us what's going on, in particular which compounds these are. These are the ones that are significant. Question: so that means these are significantly different? Yes; the axis is the p-value on a log scale, minus log p, so this line is the 0.05 threshold and this point is down around 0.0001. But each point is still a different metabolite? Yes, each point is one metabolite; there are 47 of them, and it's too hard to write out 47 names on one graph, so that's why it lets you zoom in. So it's just telling me whether there is a significant difference for each metabolite? Yes, for at least one of the four groups; it's a one-way ANOVA that's being done, not a three-way or a four-way, just one-way, so it's just saying that at least one group is different from the other three. So this lets us look at the details, look at the specific compounds, and identify which ones are most significant; in this case LPS is the most significantly different and glucose is next. And this is what we'd like you to do as you go through this: I'm stepping you through it now, but I'd like you to do it after lunch on your own and try to answer some of these questions.

In addition to that zooming in, we could also have clicked on this graph from the ANOVA, and now it's essentially showing a correlation heat map; we're seeing some general broad clusters grouping certain types of metabolites, and we're also seeing certain groups here which correspond, I believe, to grain-fed and grass-fed. We can take those plots, those images, go to the image centre, and convert and print them for papers or presentations: choose the resolution, format and size, and save or download that particular image. Again, these are sets of questions we'd like you to try to answer in the course of the lab after lunch.

You can also do something that looks for specific patterns. In this case we're looking at four different states; you could think of this almost like a temporal study, 0, 15, 30, 45, it could be days or hours, whatever, they're states, and we're looking for trends, or maybe a periodic pattern. As I say, if you're trying to establish a trend you need more than two points, three or four, and this has four sets, so as we go from 0 to 15 to 30 to 45, does something climb, does something fall? We don't know, so we're going to look for this pattern and see if we can find trends; in this case the pattern we're looking for is whether things go from low to middle to higher and higher, so do things go up according to this 0, 15, 30, 45 ordering. We could look for things going down, we could look for things that go up and then down, any sort of pattern. Question: does the first row define the pattern you're looking for? That's right, this is the pattern we're searching against, to see whether certain metabolites follow that particular pattern. So which ones actually have this climbing pattern? Well, you get more and more LPS as you go to higher and higher levels of grain, and you get more and more glucose as you go to higher levels; these are the strongest ones, with the strongest correlation coefficients, so this one is up around 0.7 and this one is maybe about 0.6, I can't read it that well.
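The pattern-hunting step is essentially a correlation of each metabolite against a template that encodes the expected trend; here's a small sketch of that idea on made-up data (the 1-2-3-4 template and the group coding are just for illustration):

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(1)
grain = np.repeat([0, 15, 30, 45], 10)        # % grain in the diet (hypothetical)
X = rng.normal(size=(40, 5))
X[:, 0] += grain / 20.0                       # metabolite 0 climbs with grain level
X[:, 1] -= grain / 25.0                       # metabolite 1 falls with grain level

# Template matching: correlate every metabolite with the pattern 1-2-3-4
# (low -> high across the four diet levels); a decreasing or up-then-down
# template works the same way.
level = {0: 1, 15: 2, 30: 3, 45: 4}
template = np.array([level[g] for g in grain])

for j in range(X.shape[1]):
    r, p = pearsonr(template, X[:, j])
    print(f"metabolite {j}: r = {r:+.2f}  (p = {p:.3g})")
```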
But there are also some that have the opposite behaviour: they have a negative correlation, so instead of going 1, 2, 3, 4 they go 4, 3, 2, 1, and the strongest of those is 3-PP. Now, if you recall, I think it was endotoxin that had the highest significance, and 3-PP was something like number three in the ANOVA test, while another one was way down, so if you looked at 3-PP you'd see something else distinguishing the cohorts. Again, we're seeing different trends; and the things that are around negative 0.3 or positive 0.3, those aren't really strong correlations, so you can make your own decision about which ones to keep and which ones are most interesting. These are trends that you can see and that you can try to answer questions about.

So we've gone through several steps in the data processing: we've done the ANOVA, we've done some correlation analysis, we've done a few heat maps. Now we can get to the thing everyone likes to do, the thing that normally gets published: PCA. We have four groups and roughly ten samples in each group, and here is what the clusters look like. Now, these are labelled, and I think if we didn't label them it would be a little difficult to actually see any kind of clusters, it looks like a big mess, but we've labelled the grass-fed and the grain-fed at 15, 30 and 45 percent, and I think if we just labelled 0 and 45 you'd see two very distinct clusters. But we're including this progression, and you can actually kind of see a progression: things are sort of drifting this way in the scores plot. We can then look at the loadings plot and ask which compounds are causing that drift, the separation as you go from here down to here, from the grass-fed to the heavily grain-fed. So we look along this direction to see which metabolites are driving that shift in the PCA scores plot, and the loadings plot tells us that endotoxin, glucose, isobutyrate and 3-PP are the ones pushing it, which is exactly what we got when we did the correlation analysis: endotoxin, glucose, isobutyrate, 3-PP. If you look at our ANOVA test, the most significant ones include these as well; yes, there's aspartate and maybe valine, but those don't seem to be driving this trend that goes from here to here on the scores plot. You're looking for the direction of the shift from here to here. Question: so the light blue one is the highest, the most grain-fed? Yes, and that's why we're looking for things along this direction that are most different; your scores plot separates along this axis, so you look for the loadings that point the same way.
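Here's a compact sketch of that scores-and-loadings logic with scikit-learn on synthetic data (illustrative only; MetaboAnalyst produces these plots for you):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
diet = np.repeat([0, 1, 2, 3], 10)                   # four groups, hypothetical
X = rng.normal(size=(40, 47))                        # stand-in for auto-scaled data
X[:, :4] += diet[:, None]                            # a few metabolites track the diet

pca = PCA(n_components=2).fit(X)
scores = pca.transform(X)            # one point per sample: this is the scores plot
loadings = pca.components_.T         # one value per metabolite per PC: the loadings plot

# Metabolites with the largest |loading| on PC1 are the ones "driving" the
# separation you see along PC1 in the scores plot.
top = np.argsort(np.abs(loadings[:, 0]))[::-1][:5]
print("variance explained:", np.round(pca.explained_variance_ratio_, 2))
print("top PC1 metabolites (column indices):", top)
```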
Okay, so we can do two-dimensional PCA, and you can also do three-dimensional PCA. Is this from the same data? No, this one is from a time series. Which one? Oh, it was time-series data, yeah. Okay, so this is a sample three-dimensional PCA; you'll try to generate one yourselves, which hopefully will correspond to what you see with the cattle data, and as I say, ask which metabolites contribute; you'll try to answer some of those questions.

So we saw some clustering and we saw some trends; let's see if we can bring in the heavy artillery and exaggerate that trend. We've gone from ANOVA to correlation to PCA; now we're going to click on PLS-DA, and then on a 2D scores plot, and now you can see: here's grass-fed, 15% grain-fed, 30% grain-fed, and 45% maybe in the dark blue, and we can actually see three very distinct clusters. PLS-DA has pulled things a little further apart and made it a little more distinct than what we originally saw with the PCA.

We can get Q-squared and R-squared values through cross-validation, and the choice is how many components to use to calculate Q-squared and R-squared; the maximum number we typically use is relatively small, typically five or six. We've done ten-fold cross-validation, and what it shows is that things start to stabilize, with the R-squared almost one and the Q-squared about 0.8, after about three components; this is a measure of the overall accuracy. So three components is essentially what to use for the modelling; this is one of the mysteries of SIMCA-P, I think, how many components to use, but overall this is a very robust PLS-DA model, so it's telling us that it's (a) strong and (b) not over-trained. What are the variables driving this PLS-DA model? Not unusually, the same ones we saw with the ANOVA, with the correlation analysis and with the PCA plots: 3-PP, endotoxin, glucose; then there's alanine, which we haven't seen before, and here's uracil, which we have seen before. These boxes indicate how they differ between groups, and this indicates their trend, from high to medium to low, or from low to medium to high, so some of them have opposite trends; that's also depicted in this variable importance plot. Question: I don't understand, now we have three components; is the model only looking at three metabolites? No, it's not three metabolites; the components are latent variables built from all the metabolites, and the number of components is just how the cross-validation is set up. It's simply a matter of how you compute R-squared and Q-squared, and the R-squared/Q-squared analysis is still a bit of a mystery; it's not formally published in the textbooks, because the people who developed it run a commercial enterprise, so we've just tried to emulate what they did.

So that, as I say, is the more conventional approach; the statistics are more interpretable with the permutation test, which is something I prefer over R-squared and Q-squared. We go over here and click on the permutation test, and here it has done 100 permutations, and with those 100 permutations you can see that not a single permuted model outperformed the original one. Because no permuted models outperformed it (in fact this model is so much better), you can be confident the p-value is below 0.01; if you did 1,000 permutations and none outperformed it, it would be below 0.001, and with 10,000 it could go lower still. This is a really good model, and that's confirmed, as I say, by the R-squared and Q-squared values. So these are again some questions you can try to answer.
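A rough, generic sketch of the PLS-DA cross-validation and permutation idea in Python (synthetic data; the Q-squared formula and the permutation count here are standard choices, not necessarily MetaboAnalyst's exact implementation):

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(3)
diet = np.repeat([0, 1, 2, 3], 10)                 # four diet levels, hypothetical
X = rng.normal(size=(40, 47))
X[:, :4] += diet[:, None]                          # give the model something to find
Y = np.eye(4)[diet]                                # one-hot class matrix for PLS-DA

def q2(X, Y, n_comp, cv=10):
    """Cross-validated Q2 = 1 - PRESS / total sum of squares."""
    Y_hat = cross_val_predict(PLSRegression(n_components=n_comp), X, Y, cv=cv)
    return 1 - ((Y - Y_hat) ** 2).sum() / ((Y - Y.mean(axis=0)) ** 2).sum()

n_comp = 3
r2 = PLSRegression(n_components=n_comp).fit(X, Y).score(X, Y)   # fit on training data
print(f"R2 = {r2:.2f}   Q2 = {q2(X, Y, n_comp):.2f}")

# Permutation test: shuffle the class labels and see how often a permuted model
# matches the real one; zero wins out of 100 implies an empirical p-value < 0.01.
observed = q2(X, Y, n_comp)
wins = sum(q2(X, rng.permutation(Y), n_comp) >= observed for _ in range(100))
print(f"{wins} of 100 permuted models did as well as the real one")
```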
So we've done ANOVA, correlations and pattern analysis; now we can do some heat-map analysis, and here is our heat map. It's a little dark, it might look better on your screens, but here are the four clusters: here's the 45%, here's the grass-fed, the 15% and the 30%, and partly just by visualization you can see that there's a distinct cluster for the 45%, there's more red here for the 30%, and for the grass-fed the red is over here. So you can see that there are clear clusters, this one being quite distinct, and the metabolites listed here are the ones that help with that distinction. Again, this is just another way of visualizing the data, but hopefully the message you're getting is that whether we use ANOVA or PLS-DA or this, we keep getting a lot of the same answers about the things that are driving the differences.

Question: so the four groups cluster together rather than being mixed? Yes, they cluster separately; it looks almost perfectly clustered. And you're just clustering the samples here: you're showing the ten samples for each group individually, you don't take a mean, you keep the different samples separate, right? Right, and actually I'm quite sure that if you clustered on these you'd still get them separated very clearly. Always? Well, not always, but I think so here. Is that because they've been reorganized? You have options; these are the tools you can choose, and you can reorganize or not reorganize. So you'd get another dendrogram on the other dimension? Yes, so the clustering starts on these, then this one, and we might have one here moving down to here, but you can see the four groups are pretty nicely separated. Well, this is why you're supposed to do the lab; this is what we can do after lunch, so you get to explore, try this out, and try to answer some of these questions.
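Behind a heat map like that is hierarchical clustering of the samples; here's a small SciPy sketch on synthetic data (Ward linkage and Euclidean distance are common defaults assumed here, not necessarily what the server uses):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(4)
diet = np.repeat([0, 1, 2, 3], 10)
X = rng.normal(size=(40, 47))
X[:, :5] += diet[:, None]                     # hypothetical diet-driven metabolites

# Hierarchical clustering of the samples (rows), i.e. what sits behind the
# heat map's row dendrogram.
Z = linkage(pdist(X), method="ward")
row_order = dendrogram(Z, no_plot=True)["leaves"]   # the row order the heat map would use
clusters = fcluster(Z, t=4, criterion="maxclust")   # cut the tree into 4 clusters

print("first few heat-map row positions:", row_order[:8])
print("recovered cluster labels:", clusters)
# seaborn.clustermap(X) would draw the clustered heat map itself.
```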
So we've done about a half dozen analyses, and at this stage it's generated a fair bit of data, so you can actually download it; some of it comes as figures, some as reports. In this case this is the analysis report, and it essentially writes your paper for you: it explains things and produces the plots, the tables and the values; you just have to write the introduction.

Okay, so that's stepping through the analysis approach. The point is that there are lots of options; I'm trying to demonstrate them, and it's up to you afterwards to play around and also try to answer some of the questions we've posed. Now, within MetaboAnalyst there's also this component called metabolite set enrichment analysis, which is modelled on gene set enrichment analysis as used in transcriptomics, and within it there are sub-types: there's over-representation analysis, there's quantitative enrichment analysis, and there's single sample profiling. To do metabolite set enrichment analysis you actually have to have metabolite sets, and there are a lot of sets: some are from disease groups, some from pathways, and then there's this one which I guess was mostly from the SNP study published a few years ago, so it's a little biased that way, but these are things that people are going to keep adding to. The point is that we found separation, we found that 3-PP, glucose and endotoxin are making a difference, but what's the biology behind that? What's significantly enriched, what corresponds to certain pathways or to diseases, or are they localized in certain organelles or tissues? Now, the problem is that we were doing this with cattle, and this whole MSEA framework is really designed for humans; humans don't have rumens and we're not ruminants, so we're going to shift a little bit: we're not going to use the rumen data, we're going to use some human data that was collected, to do this.

As I say, you can input data that's just a list of metabolites; glucose, endotoxin, 3-PP, that could have been our list. We could go a little further: here's our list of metabolites and here are the concentrations that were significant. Or we could use almost the original data set, just as it was when we started doing our PCA analysis. For over-representation analysis, we take our list of important compounds, the ones we identified from our PCA analysis, and just type in that list, about 15 or 20 of them, and it goes through its libraries to see if these are associated with any particular pathways, diseases or other known sets. Alternatively, we could supply a list of concentrations, glucose, methylamine, whatever, along with their measured values, and there are standard, known, normal concentrations to compare against, to see whether some of those are unusual and then identify pathways from that. Or you can give it the full data set, all the compounds, all the concentration data, and it will do a more detailed analysis of what's enriched, and then we can interpret that.

As I said, MSEA only works for human data, so that may rule out about half the people in this room. It's just hard; of the organisms that have been studied, humans are certainly the best studied, but it does let you create your own custom sets if you happen to have collected a lot of data for certain standard organisms. So we're going to look at lung cancer and colon cancer patients who were suffering from cancer cachexia, and as I say we have the three options, ORA, SSP and QEA, and we're going to choose the first one. Here's our list of metabolites, 60 or 70 of them, and it's just their names, so we don't even have to worry about concentrations; again, though, it's useful to have quantitative metabolomics. The first thing it does is make sure we didn't have any typos, and it turns out someone mistyped isoleucine, they didn't put the E before the U, so it picks this up and tries to map each name to possible names and identifiers, PubChem identifiers, KEGG identifiers. So it's just making sure that your names and spellings are correct, and you can say, yes, sorry, I made a mistake, because when you've got lists of compounds there are usually about ten synonyms for every compound and everyone uses their own preferred one; this step is important, and it's called name normalization. Then we can ask: are these metabolites associated with any pathways? Since this was collected from urine, we could also have looked for disease associations, or associations with metabotypes and SNPs; there are other metabolite sets, and we could have used a self-defined metabolite set; if we had defined something for cattle that we knew about, we could have created something like that, but there just hasn't been a lot of work done on cattle.

So here's our ORA analysis on the important metabolites we identified, and interestingly, the ones that come out as most important all seem to be associated with amino acid metabolism. Question: what does ORA stand for? Over-representation analysis.
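At its core, over-representation analysis is a hypergeometric test per metabolite set; here's a tiny sketch with made-up counts:

```python
from scipy.stats import hypergeom

# Over-representation analysis sketch with invented numbers:
# a reference library of N metabolites, a pathway ("metabolite set") containing
# K of them, a significant list of n compounds, of which k fall in that pathway.
N, K, n, k = 1000, 20, 60, 6

# P(observing k or more pathway members in the list by chance alone)
p = hypergeom.sf(k - 1, N, K, n)
print(f"ORA p-value for this pathway: {p:.4g}")

# In practice this is repeated for every pathway / disease set, followed by a
# multiple-testing (FDR) correction across all of them.
```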
So anyway, cachexia is muscle wasting, and muscle is made of protein, so if you're looking for a signature of it you should expect, minimally, that the focus would be largely on amino acid metabolism. That's essentially catabolism, and so that's certainly something that would show up; there's branched-chain amino acid metabolism, and then there's ammonia, which is how you get rid of protein nitrogen, and then a few other things that are perturbed, but there's a pretty clear story about which pathways are most affected. Based on this list we can see the number of hits, the number expected, and their p-values, so if we use 0.05 we can go from propanoate metabolism and above, which gives six major pathways that are modified, and we have some information about the false discovery rate; we talked a little about false positives already. From there we can also click on these entries, and it will take us to the SMPDB and show us some of the compounds that are modified and the roles they play in that particular pathway. Question: what is the false discovery rate and how should we view it? You want a low value, so below about 0.05.

Then there's single sample profiling. This is something that doctors would like to use; it's like biomarkers. Someone comes in, you've asked them to do a urine test, and here are their concentrations from the urine: is everything normal? This essentially looks at all of the values from this patient and sees which ones are abnormal. Here are the concentrations and here are the reference concentrations that have been compiled in the HMDB, and it compares them, for, I think, a male in this case, or whatever the appropriate reference group is, to see whether they're unusual, whether they're outside a particular range, and it flags those as possibly problematic. In this case it was 3NE, and this is the concentration range that is normally seen, from different published studies and standards, and this is the value seen for this particular patient, 93 micromolar; you can tell it's way above the average, so something's wrong here, there's too much of this compound coming out in the urine. There are references, and you can go back and check just to see how confident that reference range is. So that's another kind of check, a clinical style of checking rather than a pathway style of checking.

The third approach is QEA, quantitative enrichment analysis. In this case we can upload the whole data set, not just one patient but a whole group of patients, submit that, and this one identifies perturbations; once again we see amino acid metabolism, methionine metabolism, butanoate metabolism, and then propanoate metabolism, which we also saw before; these are the pathways that are perturbed across the full set. So this is using not one patient but many patients, not just a list of metabolites but a little Excel file, and it's identifying some of these pathways, with very low p-values and very low false discovery rates, as being highly perturbed, and it's essentially looking at the different matches. I think you'll have a chance to look at this, exploring the metabolic changes that are associated with cachexia; that's a question you get to answer by working through it. So we've clearly identified some perturbations to amino acid metabolism that happen with cachexia, and we've identified which ones are going up and which ones are going down.
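The single-sample-profiling step described above is essentially a lookup against reference ranges; a toy sketch of that comparison (all numbers invented, not real HMDB reference values):

```python
import pandas as pd

# Compare one patient's urine concentrations against reference ranges and flag
# anything outside the normal interval for follow-up.
reference = pd.DataFrame(
    {"low": [0.1, 5.0, 10.0], "high": [2.0, 50.0, 60.0]},
    index=["metabolite_A", "metabolite_B", "metabolite_C"],
)
patient = pd.Series({"metabolite_A": 0.5, "metabolite_B": 93.0, "metabolite_C": 30.0})

flags = (patient < reference["low"]) | (patient > reference["high"])
print(reference.assign(patient=patient, abnormal=flags))
```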
Now, can we go to pathways? We saw that we had some links to pathways, but we want to extend that MSEA analysis by looking at the pathway structures, and this part is a little more flexible: it's not just for humans, it deals with a lot of different organisms, humans but also E. coli and yeast and a few others, and these are pathways that were compiled from KEGG. So we can do metabolic pathway analysis, MetPA, using these techniques. Again we're just using an example set from those cancer patients, the same set we used for MSEA and the same sort of analysis; in this case auto-scaling, and we're not doing any row-wise normalization, though we could have, I suppose, because there are some dilution effects, so this is just like what we did before. What organism are we going to look at? Bacteria? No, these are not bacteria. Plants? No, these are humans, so we click on Homo sapiens. We'll use all the compounds, so you can do a global test, and we're going to do some pathway topology analysis.

So there's pathway enrichment and there's pathway topology. The topology part is something people have done with protein-protein interaction networks and gene interaction networks, and you can do the same thing with metabolites. Some metabolites play a very important role: they are hubs, they connect many different parts of the network. Glucose is a great hub, glycine is an important hub metabolite, choline is an important hub metabolite, where things branch off to many other areas. Then there are bottlenecks: a bottleneck is a node that many paths have to pass through to reach other nodes. Whether you're identifying the hubs or the bottlenecks, you can use graph theory to measure the features that we qualitatively call hubs and bottlenecks: degree centrality and betweenness centrality. Something that is a hub has high degree centrality, and something that is a bottleneck has high betweenness centrality, so these give you a quantitative measure.
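Those two measures are standard graph statistics; here's a toy sketch with networkx on an invented little network (not a real KEGG pathway):

```python
import networkx as nx

# Toy metabolic network (illustrative only): nodes are metabolites,
# edges are reactions connecting them.
G = nx.Graph([
    ("glucose", "G6P"), ("glucose", "glycine"), ("glucose", "choline"),
    ("glucose", "pyruvate"), ("pyruvate", "acetyl-CoA"),
    ("acetyl-CoA", "citrate"), ("citrate", "aKG"),
])

degree = nx.degree_centrality(G)        # high for hubs (many direct connections)
between = nx.betweenness_centrality(G)  # high for bottlenecks (on many shortest paths)

for node in G:
    print(f"{node:12s} degree = {degree[node]:.2f}  betweenness = {between[node]:.2f}")
```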
So here's the pathway visualization. This is the cachexia data, and the plot shows the impact of each pathway against the significance, the p-value, of that pathway. We can click on this one, which is evidently the most important pathway; there are many pathways, about 80 drawn here, and typically you go for the ones with the highest impact and the most significance. We click on it and it shows up as glycine, serine and threonine metabolism, which is what we found in the over-representation analysis, but it also identifies the number of compounds; these are the KEGG diagrams, and you can see the metabolites that were significantly altered highlighted. We're not seeing everything, because metabolomics doesn't have perfect coverage, but we can go in, click on these compounds and identify which ones were most different between those with cachexia and the controls, and we can see that in the case of serine, those with cachexia are somewhat lower than the controls. That's a box plot; it shows the mean and then the spread, the variation about it. We can also go a little further, and these are the results for all of the pathways: all the hits, the tables of statistics, their impact values and so on.

So that is a short synopsis of MetaboAnalyst. I think you could easily spend a day exploring everything. We didn't do things like k-means clustering, we didn't use SOMs, we didn't use SVMs or random forests (you're free to try those out), we didn't do time series analysis, although we used the four states of the cattle feeding as a proxy for time, there are things called two-factor analyses, we didn't do some of the data quality checking which you could do, and we didn't look at peak searching. This is just some output that Jeff compiled of the types of time series analysis that can be done and have been done with MetaboAnalyst, certain types of graphs, and then there's some data quality checking. As I say, this is where we're looking at batch effects: batch one, batch two, batch three, and here's batch four, with the black set being the quality controls that were spiked in. You can see that they're tightly grouped; even this one is tightly grouped, so the quality controls alone would have said this was fine, but the whole average has been shifted, so we would still have to adjust for that. With the quality control checks it's possible to look for batch-to-batch changes, and I think this is very important in a lot of metabolomics studies; it can cause a few heartaches and headaches. So that's the synopsis of MetaboAnalyst.