through everything. So it's an introduction to MetaboAnalyst, and then you guys are getting the rest of the afternoon to work on MetaboAnalyst with the data that you generated, and as I say, towards the end, if people are interested, I could do a very short 25-minute talk on the future of metabolomics if you want it. So what we're going to do is first of all introduce you to MetaboAnalyst. This is a software tool that's been around since 2009, and Jeff Xia, who is here, is the guy who has written essentially all of it, and it is now in version 3. So we're going to get familiar with it, and then we're going to look at how you could analyze NMR, GC-MS and LC-MS metabolomic data. Now I'm just, maybe I should just, I'm wondering if I clicked on the wrong one, hang on as I double-check here, because I've got two things. So I'm just bringing up the right lecture here. So we're going to go through the standard data analysis workflow. We're going to look at issues related to data integrity checking, outlier detection, quality control, normalization, scaling, and then we're going to dive into MetaboAnalyst. So this is the example that we've given, that you've seen before, about how a typical biological metabolomics experiment will be done. You have treatment and control, or untreated and treated. Within the treated you'll have many biological replicates, within the controls you'll have many biological replicates, and some people will also take technical replicates. They might take double blood samples, or they might take a single blood sample and then aliquot it into several aliquots for follow-up duplicate or redundant studies. But regardless, you're generating lots and lots of data, where you're measuring hundreds of compounds or features, or thousands of features, per sample. We talked about the two different routes to metabolomics before: the chemometric, untargeted, profiling methods, where identification isn't a major focus, and then we talked about the other one, which is quantitative slash targeted (or even untargeted), where feature identification and quantitation is central. So in the traditional, older-style chemometric method there's a lot of data integrity checking, but that's also true with the targeted and quantitative methods. Now, where they start to differ is that the spectral alignment or binning, which you guys did in XCMS, becomes very, very important; that's the first thing you do in a chemometric method. In the targeted method, the first step you do after the data integrity check is the compound identification and quantification; that's what you guys did in Bayesil and also GC-AutoFit. There's a next step, which is the data normalization. This is the step that converts sort of skewed data distributions into normal Gaussian distributions. That's done with both the chemometric and targeted methods. There's the data quality control check and outlier removal; that's also done in both chemometric and targeted methods. Then there's the data reduction and analysis, that's a PCA to try and reduce dimensions, and then you start generating pathway lists and interpretation. In the chemometric route, the compound identification is moved to the very end, because by then you've presumably identified what is important. And I think, again, people who did the XCMS exercise will have seen that the last thing that you see is this table which then shows you your putative compound identifications. But often those are identified as being the most significant features, and then it's sort of up to you to figure out what the compound is.
And as you saw with XCMS, you get anywhere from 10 to 50 different possibilities. So that's the difference between the two workflows, and you guys now have experience in both. So now we'll dive into this issue of data integrity, data quality. One of the challenges with LC-MS and GC-MS is that they have a high number of these false positive peaks. These are either noise peaks, they could be fragmentation or neutral-loss fragments in LC-MS, or they could be different derivatives that happen with GC-MS. You'll get issues with adducts, both with GC and LC. You'll have isotope peaks. There are also these ionization issues that can happen, and then there's just standard noise that shows up. NMR, interestingly, doesn't have those problems, because it's such a relatively insensitive technique. Anyway, identifying those features and adducts is something that you can handle by using replicate studies, where you have the same sample, a technical replicate, injected, and then the adduct calculators that we talked about the other day, which you also saw in XCMS, that showed different adducts corresponding to the same mass. We also saw issues of data and spectral alignment. That's part of that workflow that we talked about. It's very important for LC, not so much for GC, because there isn't as much variation. But you can see in this example here where we have two LC runs where there's a systematic shift, and then you can essentially shift things, realign them, and now you get the red aligning with the blue. So XCMS does that, MZmine does this, ChromA does it, and the methods are usually these time-warping algorithms. Again, this is a bit of a review. You can also bin samples, and this is an older technique that was done because of essentially data storage limitations on older computers, but it's still legitimate, and it's a way of just carving up spectra that are very data-rich or peak-rich into regions. So these could be retention time regions, so this could be time, or it could be NMR spectra, where it would be on a ppm scale. Another aspect that we talked about as well, in the statistics lecture actually, is scaling. Scaling things up or down, sort of to match, to deal with dilutions, which is a problem particularly with urine samples. That normalizing or scaling can be done to total integrated area, it can be done to some kind of internal standard, so creatinine is often used as a standard in urine. You can use specific gravity; weight or volume can also help if you're working with solid samples like tissues. It depends on the sample, it depends on the circumstances that you're working with, and so it's really a matter of having some intelligent design. Scaling can also be done to a certain feature, in addition to or instead of total area, total peak height or whatever. Sometimes you'll see in these examples, where there are some largely identical spectra showing up, that you've got this one giant feature which either you may want to get rid of or handle as an outlier. In terms of normalizing things to make sure that the distribution follows a normal Gaussian shape, this is where you can also use things like log transformation, which we talked about, and then there's Pareto scaling or range scaling, and again those are things that handle either intensity or distribution, and as I say, normalization can mean two things.
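To make the row-wise scaling and the Gaussian-transformation ideas concrete, here is a minimal Python sketch (not MetaboAnalyst's actual code) of sum normalization, log transformation, Pareto scaling and autoscaling on a samples-by-metabolites concentration matrix; the numbers and variable names are purely illustrative.

```python
import numpy as np

# Toy concentration matrix: rows = samples, columns = metabolites.
X = np.array([[12.0, 340.0, 5.1],
              [ 8.5, 410.0, 4.4],
              [15.2, 290.0, 6.3]])

# Row-wise normalization: scale each sample to the same total signal
# (one way to compensate for dilution differences, e.g. in urine).
row_sums = X.sum(axis=1, keepdims=True)
X_norm = X / row_sums * row_sums.mean()

# Log transformation: pulls long-tailed, skewed concentration
# distributions toward something more Gaussian-looking.
X_log = np.log(X_norm)

# Column-wise Pareto scaling: mean-center each metabolite and divide by
# the square root of its standard deviation, so a few large, highly
# variable features do not completely dominate the analysis.
means = X_log.mean(axis=0)
stds = X_log.std(axis=0, ddof=1)
X_pareto = (X_log - means) / np.sqrt(stds)

# Autoscaling (unit variance) would divide by stds instead:
X_auto = (X_log - means) / stds
```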
Quality control, outlier removal, data reduction: we saw some examples of outlier removal and data filtering. We get rid of solvent peaks in NMR, and we can do some noise filtering, which is also often done in mass spec, GC, LC, and even in NMR. And then the outlier removal is something that you don't simply take an eraser to; it's something that you have to justify in your methods, as to why those points were removed or why you chose to do that. So again, these are all parts of the steps that I showed in the second slide of this series. And then the next steps were the dimensional reduction and feature selection, which we went through when the statistics were done, and then clustering, which was the hierarchical or K-means clustering. So that's a quick overview of the steps that are done. Now I'm going to show how this is done typically in MetaboAnalyst. This is a web server, the website is given here, and it's now in version 3. It was designed to allow people to do online metabolomic data analysis and reduction for LC-MS, GC-MS, and NMR. Maybe just a quick question: how many people have used MetaboAnalyst before? One, two, three, four, five, six, seven, okay. So maybe a third. The history is that it was first introduced in 2009, and it allowed multivariate and univariate processing: ANOVA, t-tests, PCA, PLS-DA. In 2012 there was a big enhancement, which allowed it to look at essentially pathways and to integrate some other tools that had initially been developed on their own but were then further integrated. What we found is that shortly after the introduction of version 2, we had a huge spike in use. We went from a few thousand users a month to, what is it, about 50,000 users, which was sort of like what happened yesterday when everyone hit the same servers. So Jeff spent a good chunk of 2014 rewriting everything so it would be able to handle the heavy use, and also moving it to a bunch of mirror servers. Things also improved because R has improved, with a number of new functions, and there are new tools for the web as well. We've also added a biomarker analysis tool and power analysis, which allows you to calculate how many samples you need. And then it's also been integrated with gene expression data, which allows you to go across to other omics fields. So each time we make an improvement, we get another few thousand users added, so it's back to the drawing board, trying to make it even faster. So we're going to go through data processing and data reduction. We're going to go into functional enrichment analysis and metabolic pathway analysis. The functional enrichment is sometimes called metabolite set enrichment analysis. We'll talk a little bit about power analysis and sample size estimation. We'll go into biomarker analysis. We will lightly touch on integrated analysis. So there are several modules in MetaboAnalyst, eight in total, and you can choose or click on those to jump into them. The general path that you use in MetaboAnalyst is first to do some pre-processing of the data: cleaning it up, fixing it up, looking at it. Then we get into this normalization thing, which includes both scaling as well as making it more Gaussian. Then you get into the data analysis, which is the multivariate statistics. And then you get into the data interpretation, which is the pathway or metabolite set enrichment analysis, and hunting around, linking to databases and PubMed, reading more about what you found. So this is another way of looking at it, a flow chart that Jeff has used, and I think it was used in the last publications.
So, some aspects: the raw data could be peaks, it could be spectra, it could be concentration tables. And we'll be using primarily concentration tables, not your raw spectra. Then there's that pre-processing step, which is marked in red. So there's the name mapping: if you've got the concentration tables, it means you've got concentrations of metabolites, so you want to make sure you're using the right names. So you don't have a misspelled leucine, or you haven't chosen some obscure synonym for a compound like hydroxyisobutyric acid. There's a data integrity check: do these values actually consist of numbers, or do they contain the words NA or "not a number"? You may have to make sure that these things actually have real values. In the case of peak lists, there's peak selection and peak alignment, but that's something you guys won't have to worry about today. After you've done the integrity check, there's the data filtering, to get rid of perhaps either noise or spurious peaks. And then there's scaling and log transformation for normalization. That's the pre-processing step. And then you have the choices within those modules to do statistical analyses; that's the multivariate part. Time series analysis, we won't do that today. Biomarker analysis, metabolite set enrichment analysis, and pathway analysis are others we'll look at. And then we'll briefly touch on power analysis. The integrated pathway analysis, which integrates metabolite sets with genomic information, is another one that we won't have time to get to today. So if you type in the URL for MetaboAnalyst, this is what you get. It's structured: there are updates, some citation information, and there's a left column which gives you some things that you can highlight or select. But really, to get started, you have to click on the welcome page, "click here to start". This is actually one of the more confusing things about MetaboAnalyst, because it's relatively well hidden; everyone's expecting to dive in. We might have to add a flashing icon that tells you to click here to start. Anyway, if you do start, you can actually dive into some example data sets that you can work with, and there are a number of examples that you can play with. And in fact, we will be playing with an example data set, and the one that we'll be working with is a set of concentration tables collected for cows that were being fed different levels of grain in their diet, at least for this first batch. So in this early step, that's the red part in our flow chart, the data processing and preprocessing. When you're doing data processing, you're basically trying to convert your raw data into data matrices or tables. And as I said, the one that we'll be working with today is concentration tables, because that's what you guys have largely generated from Bayesil or GC-AutoFit. And even from your XCMS data, you have maybe not absolute concentrations, but you have tentative metabolite identifications and relative concentrations. But MetaboAnalyst can also work from peak lists, it can work from spectral bins, and it can also work from raw spectral data. Yes? Yeah, I don't know, did that fall out of version 3 now? You won't have an example, I don't think there's an example set there, but you have to choose, browse, and select a file. So just going back then: in terms of the data we've identified, in this case it was a table of concentrations from cows on different diets. We've selected a set. We've indicated what type of format it is.
And we also indicate where the samples are, whether they're listed in rows or in columns. So just for those who missed it, Jeff was saying that if you wanted to upload your own raw data for MS, you could just click on that zip file option and upload the MS spectra. So we could upload the files, or, as I say, if you wanted, in this case using the example data, which is what we're using, you could go to the test data. And this is the test data that we'll actually be using for the example I'm illustrating. So, two different routes. What you guys will be doing with your data is uploading data in this window, because you're not going to be using the example data. But for the example I'm using for this lecture, we're just going to use this data set and sort of step through. So this data set, as I said, is from cattle, dairy cattle, and they're given different proportions of cereal grain. Cattle are grass-loving animals; they were not designed to eat grain, or at least not very much, maybe 1%. And we actually give cattle grain partly to allow them to produce more milk, or also to fatten them up just before they're slaughtered. There's real concern in this area, because in fact very high levels of grain have been found to cause a lot of stress and a lot of diseases in the animals, and they were interested in why, what's leading to that. So in this case, they were fed different proportions of barley grain: 0% of their feed, 15%, 30%, and 45%, which is very, very high. And then you can collect the rumen fluid. Cattle have giant stomachs, multiple stomachs, and they have ruminal fluid, and this is how they're able to consume cellulose-rich material and convert it into energy. And so we used NMR spectroscopy in this case to analyze the data, and we had concentrations for about 45 to 50 metabolites. So once we have our data and we've selected it, the first thing that we should do is a data integrity check. Now in this particular case we're going to skip it, partly because NMR usually generates high-integrity data, but if you're doing GC-MS or LC-MS data, you should check. And what it'll do, in fact as it reads the data in, is check to see whether samples were in rows and features were in columns, check whether the data was sufficiently well formatted, and identify whether things are numeric or not and whether there are some missing values in it. And if it all checks through, then, as I say, you can skip doing any missing value estimations or other fixes that you would otherwise need to do. Yes? So the question is about missing values, which is something you struggle with in your own data sets: should you replace a missing value with a very small value, which is probably the lower limit of the instrument? Sometimes the instrument just misses it somehow. Or should you average, or is there a recommendation on what we should use for missing values? So we standardly take the lower limit of detection divided by two, and that's what we substitute in for everything, and that seems to work pretty well. It's also a question of how much, what proportion, of missing values you have. So if you have a metabolite where, you know, 60 to 80% of its values are missing, you might as well just delete the metabolite; it's just adding noise. So the proportion of missing values for a given compound is also a thing you have to consider. But when we've got, you know, 10 or 20% missing values, then it's that lower limit of detection divided by two.
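A rough illustration of the kind of integrity check and missing-value handling just described (a generic Python/pandas sketch, not MetaboAnalyst's own code; the file name, the 50% cutoff and the LOD estimate are made-up assumptions):

```python
import pandas as pd

# Hypothetical concentration table: rows = samples, columns = metabolites,
# with a "Group" column holding the class labels (0, 15, 30, 45% grain).
df = pd.read_csv("cow_rumen_concentrations.csv", index_col=0)
groups = df.pop("Group")

# Integrity check: coerce everything to numbers, so entries like "NA",
# "NaN" or a stray typo become missing values we can count and inspect.
values = df.apply(pd.to_numeric, errors="coerce")
print("non-numeric cells:", int((values.isna() & df.notna()).sum().sum()))

# Missing-value handling: drop metabolites that are mostly missing,
# then fill the rest with an LOD/2 stand-in (here approximated as half
# the smallest observed value for that metabolite).
missing_fraction = values.isna().mean()
values = values.loc[:, missing_fraction <= 0.5]
values = values.fillna(values.min() / 2.0)
print(values.head())
```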
Okay, so in this case we have samples that are in rows, that's horizontal, and compounds, metabolites, in columns. So we have different options in terms of this normalizing or scaling. In this one, we've chosen to create a pooled average sample from a group. And then in terms of the data transformation, we could have done a log transformation, but our hint or suggestion is, and typically this is true with NMR data, that it's fairly normally distributed. And then we'd like to see if there's any sort of scaling that we should be doing to adjust for concentration differences between different things. So in this case, we've chosen autoscaling. These are things where, actually, you're free to go back and forth a little bit to see what really works best. So, as I say, once we've got this set, we've uploaded it, or in this case we've taken the available data, we have a data matrix: samples are in rows, compounds and concentrations are in columns. And so there's row-wise normalization, column-wise normalization, and combined normalization. The row-wise normalization that we're doing is to make, in this case, compound concentrations comparable between samples, to deal with dilution effects. In this case, where the samples are in rows and compounds are in columns, the column-wise normalization is to make things move towards a normal distribution; that's for the metabolite concentrations. And so we talked about the importance of having normal or Gaussian features, and again, the different types of log transformation, or types of transformation, that you use to do that. So here's an example where you can see the distribution of metabolite concentrations follows what almost looks like an extreme, or maybe even almost an exponential, distribution. That's the thing on the left. If we now do a log transformation, for this one as an example, you can see that the distributions of the concentrations now look Gaussian. See how the peak on the right, which is drawn at the bottom, has a nice bell shape to it. So this is an example of how it's sometimes very, very important to do that kind of transformation. In some cases you don't have to do it, and the only way you can know is to look at your data. So if your data pops up and it looks like what we see on the left, then you should go back to this data normalization window and say, okay, I want to do a log transformation. Try it. Let's say it still doesn't work. Okay, I'll try another one; it has two or three options. So those things can be done to help essentially make your data more compatible with t-tests, ANOVA, PCA, and everything else. So this is a very important step, and a number of people will just skip right over it. They won't even think, or iterate back and forth, to see what gives them the best distribution. Yes? Have we looked at cases where you don't normalize, or normalize in different ways, and observed how sensitive the results are to that transformation? We have a little bit, but we haven't done a really systematic study, and part of it is that it's sort of up to people and their data. But yes, you will get changes if you've chosen different ways of normalizing or converting it into a Gaussian-like distribution. So this is an important one. And if the data is, as in this case, very nicely distributed, then you've found the sweet spot and things should be quite robust.
But if you still have something that looks multimodal, or just looks like a square curve, something might be seriously wrong with your data, and obviously whatever happens downstream is probably not going to be much better. So, as I said, this is something you can't know a priori, so this is why you have to do a bit of back and forth; there's an interactive element. And this is why it's also important to be very clear about your protocol and how you've done this, so that people can reproduce it. This visual inspection, saying "this looks nicely normal", is still kind of an intuitive thing. There are some formal tests that can assess the normality of a distribution, but we don't use those routinely, and that's partly because I think people's eyes are usually better than the formulas sometimes. So after you've done that normalization, the Gaussian log transformation and the scaling so that things are relatively consistent, you want to look at your data a little further, and this is dealing with the noise, the outliers and other things that may be problematic. To find an outlier, again, computer programs don't do a good job, so your eyes are best. Sometimes outliers are actually corrected when you do these normalization and log transformations. In other cases it's just obvious, like you typed in three zeroes instead of one zero, or you forgot to put the decimal place in, and that way you can correct it. Or, if you know that something went wrong with the instrument, then you might as well just remove that outlier, and we just talked about the example of a metabolite where most of the values are missing: you might as well just get rid of it, because there's no real information there. And then there's noise reduction, which is usually more important for raw spectra and peak list sets, which is something we won't really have to deal with here. So what might an outlier look like? Well, sometimes, if you've carried on and done a PCA analysis, and here we have red and green, there's a nice decent cluster here, and then you've got this green point that sits way off in left field; that is probably an outlier that you didn't pick up. Or you could have done hierarchical clustering, and it says most everything is kind of light red, light blue, and then you see this thing that's basically black, a black streak running all the way across. Something went wrong there; maybe everything was multiplied by a thousand accidentally. And so it's a matter of going back to your data and either removing it or making sure that you didn't multiply everything by a thousand. So these are ways, sort of late in the process, that you can detect outliers, but in other cases it might just be obvious to your eyes when you look at the tables. So there's an editor that allows you to deal with that issue of editing things; you could also not use the editor and just go back to your CSV file and edit it there. Then there's a data filtering step, which again is more for noise reduction with lists of peaks; we don't generally have to deal with this in the case of concentration data. And in terms of identifying noise, there are some general rules, particularly with peaks and peak lists, about what's generally noise and what's an uninformative feature. These are things that have very, very low intensity and things that have very, very low variance. And usually the low-variance threshold is a good way of identifying noise.
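To illustrate the low-intensity / low-variance filtering rule just mentioned, here is a small generic Python sketch with made-up thresholds (MetaboAnalyst applies its own criteria, for example interquartile-range based filters):

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy peak intensity matrix: rows = samples, columns = features.
X = rng.lognormal(mean=3.0, sigma=1.0, size=(20, 200))
# Make the first 50 features near-constant "noise" features.
X[:, :50] = rng.normal(loc=5.0, scale=0.05, size=(20, 50))

mean_intensity = X.mean(axis=0)
variance = X.var(axis=0)

# Keep features above a minimal mean intensity and whose variance is
# not in the bottom 25% (near-constant features are uninformative).
intense_enough = mean_intensity > np.percentile(mean_intensity, 10)
variable_enough = variance > np.percentile(variance, 25)
keep = intense_enough & variable_enough

X_filtered = X[:, keep]
print(f"Kept {keep.sum()} of {X.shape[1]} features")
```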
So these are the cleaning stages, and it's something that again requires a bit of time. It's not a matter of blindly uploading and blindly pressing buttons. It does mean that, especially with the data you guys have collected, you're going to be spending maybe 5 to 10 minutes just double-checking your data and looking at it, maybe longer. So once you've gone through those first steps, which are sort of the red things, then you can start doing the data reduction and statistical analysis. With data reduction and statistical analysis we're trying to identify important features and interesting patterns between the different phenotypes, in this case cattle fed different proportions of cereal grain; we want to try and do some predictions or classifications, and to do these things we'll be using three or four different tools: ANOVA, multivariate analysis, and clustering. These are available in MetaboAnalyst, and this is the menu that you have, so there are univariate tests, there are multivariate tests, and then there's the cluster analysis; these are marked with these arrows here. Now, why are we doing ANOVA and not t-tests? Well, it's because we're dealing with four populations: 0, 15, 30 and 45. This is where you have to remember a little bit of what we talked about earlier today, that this is not case and control, it's four populations, and with ANOVA we're simply trying to identify which ones are different from, essentially, 0% and everything else. So that's one-way ANOVA; we could have done two-, three- or four-way ANOVA if we wished, but this is just the one. So we click and upload the data and you'll see an interactive plot here, and you'll also see four clusters with box-and-whisker plots, red, green, blue, turquoise, indicating, in this case, I think this is propionic acid, I can't read it, that is in this case particularly different. And this plot is interactive: you can click a spot which in this case has a significantly different value in terms of the p-value, so it's 0.0001 or whatever, so it's quite different between the 0% group and the others. We can also view this as a table, so again there's a region which you can click, and when you click on that you can actually see these plots, and you can look at other compounds on this list to see how different they are and what the results are with respect to ANOVA. Again, the interactivity is important, as well as the graphs, which are also clickable. So in this case, with this set of data, you could explore it, and I'd actually encourage some of you to go and use these things, just as I've done; I'm skipping through because I'm not showing it live, but you can do this and use it as sort of your first tutorial. So look at which compounds are most different or most similar between the four different cattle groups. You can also go to the correlation link and actually generate a heat map to look at the compound correlations; this is the type of heat map that's automatically generated, and you can save that as a PNG or PDF file, and if you're doing that you can select the type of file, quality and image density.
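A rough Python sketch of the one-way ANOVA step described above, run per metabolite across the four feeding groups, followed by a Benjamini-Hochberg false-discovery-rate correction; the data are simulated and this is not MetaboAnalyst's own code:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
groups = ["0%", "15%", "30%", "45%"]
# Simulated concentrations of one metabolite for 8 animals per group,
# with a deliberate upward trend across grain levels.
data = {g: rng.normal(loc=1.0 + 0.5 * i, scale=0.3, size=8)
        for i, g in enumerate(groups)}

f_stat, p_value = stats.f_oneway(*data.values())
print(f"one-way ANOVA: F = {f_stat:.2f}, p = {p_value:.2e}")

# With many metabolites you would collect one p-value each and then
# correct for multiple testing, e.g. Benjamini-Hochberg FDR:
p_values = np.array([p_value, 0.04, 0.20, 0.0005])   # toy list
order = np.argsort(p_values)
ranks = np.arange(1, len(p_values) + 1)
fdr = np.minimum.accumulate(
    (p_values[order] * len(p_values) / ranks)[::-1])[::-1]
print("FDR-adjusted p-values (sorted):", np.round(fdr, 4))
```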
In some cases, if you're looking at two populations, or especially more than two populations, you might want to see whether there are certain trends. Now, it could be groups or populations, or it could even be over time, so beginning, middle, end, three points. And so MetaboAnalyst has a pattern-hunting utility to look for certain trends. We might have a linear trend, where one particular metabolite just keeps on increasing with 0, 15, 30, 45%, or you might find a metabolite that goes up and then drops down, or the reverse. So these are patterns, and with pattern matching you can try and look for things that match these sorts of predetermined patterns, which might be linear, periodic, or square-wave based, and this, as I say, is usually when you've got three, four, five or six groups. So with that, you can actually look to see which metabolites match a chosen pattern, and which ones match it most closely. In this case we're looking for a linear pattern where they go up 1, 2, 3, 4 across those four groups, and what we see is that in this case endotoxin and glucose are strongly linearly correlated, but then we have some other ones, like ethanol and formate, which do not seem to have any linear trends, and then we have a negative linear trend with 3-PP, which is just the opposite. So this is allowing us to look for patterns that might be informative, and it's again useful when you've got three or four or five populations, or are looking at three or four or five different time periods. So that was essentially doing analysis of variance; there are other options we could have used. Now we'll look at principal component analysis, and in this case we're simply asking: do these things separate? We can look at a 2D or 3D PCA scores plot, we can look at the loadings plot, and we could also look at it in 3D, where we're now looking at three principal components. So here's what it looks like. We've labeled things just to make it a little clearer, but you can see that there is some separation in this PCA plot, and we're looking at four groups and they're colored, so the ellipsoids cover most of the points in these groups. And we can kind of see a trend where they separate from the top right corner to the lower left corner; that's sort of the trend in terms of the separation. If we look at the loadings plot, then we can look at those two trends, the top right and lower left corners, and the metabolites that sit out in those corners are the things that are driving the separation. So we'll click on some of these points, and in fact we see that 3-PP is one of them, and then I suspect glucose is another that's driving the separation. So that's the 2D view. We could do this in a 3D scores plot, and this is interactive: use your mouse to click and drag and rotate, and you can mouse over things to see some of the sample and metabolite names and so on. This is relatively new for MetaboAnalyst; these are the enhancements that have happened over the last year or two, with new visualization tools and improvements to R and to the web systems that are available. So we see separation; that's a good sign that usually says, well, let's go to the next step to see if we can get enhanced separation.
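As a generic illustration of the PCA scores/loadings idea just described (simulated data, scikit-learn; not what MetaboAnalyst runs internally):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
n_per_group, n_metabolites = 10, 40
grain = np.repeat([0, 15, 30, 45], n_per_group)

# Simulated concentration matrix with a couple of metabolites that
# track the grain level, so the groups separate along one direction.
X = rng.normal(size=(grain.size, n_metabolites))
X[:, 0] += grain / 15.0          # "glucose"-like upward trend
X[:, 1] -= grain / 20.0          # "3-PP"-like opposite trend

X_scaled = StandardScaler().fit_transform(X)   # autoscaling
pca = PCA(n_components=3).fit(X_scaled)
scores = pca.transform(X_scaled)               # points in the scores plot
loadings = pca.components_.T                   # each metabolite's contribution

print("explained variance:", np.round(pca.explained_variance_ratio_, 2))
print("PC1 loading of metabolite 0 (trend up):", round(loadings[0, 0], 2))
print("PC1 loading of metabolite 1 (trend down):", round(loadings[1, 0], 2))
```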
This is where PLS-DA is useful, and remember the cautions about using it, because it's a powerful and easily abused function. So in this case we've got PLS-DA; things are obviously labeled because we've marked them with a label, and as I said, PLS-DA sort of takes the original PCA axes and rotates them in a weird way to maximize the separation, and we can look at the different values: the Q-squared, the R-squared, the VIP plots and so on. So here is what we get with this PLS-DA. If you recall what the PCA plot looked like, the separation was not great but evident; now with this PLS-DA plot you can see quite significant separation, so we've rotated things in a convenient way that maximizes that separation. We can then look at the Q-squared and R-squared values, and I can't see this clearly, but both of them look to be greater than 0.7, so this suggests that the PLS-DA plot is robust; the separations that we're seeing are real. What's driving the separation? Well, not unexpectedly, we see 3-PP, endotoxin and glucose, which are the same things that were driving the separation we originally saw in the PCA plot. So this is the VIP plot, and it's a little different from the ones I was showing you before, which had two colors; this is the one that's indicating high to low regions, and you can basically see the pattern in these four groups. If you see a nice rainbow color that goes from red to orange to yellow to green, that shows a specific trend; in others you might see red, red, green, green, which might show another, different type of trend between the groups. So this is just showing the trend, but it's also identifying a cutoff in terms of the VIP scores; I think these are two and three and four, so they're fairly significant. As I say, the PLS-DA is simply reiterating what we probably initially saw in the PCA, but it's making it a little more obvious. Now, if R-squared and Q-squared are greater than 0.7, great; the suggestion is that if they're greater than 0.5, that's good. But I prefer to do permutation analysis whenever I'm doing PLS-DA, and so in this case we've done the permutation analysis. It's 2,000 I think, I can't read the number, is it only 100? Maybe it's 100. So it runs 100 times, calculates all these things, and you can see, based on the distribution, that the separation for the correctly labeled data is way, way far away from all the others, so this clearly shows that this is a very significant and robust PLS-DA result. You can set the numbers; you can go up to about 2,000, but that's a lot, and I think the advice from Jeff would probably be to choose a lower number first, and if it doesn't look significant, choose a higher number to see if you can get better statistics. Yes? That's some kind of value that, I think, combines Q-squared and R-squared together, whereas accuracy is just based on prediction, how many times you predict the right class; that's another thing. Well, it can tell you both kinds of values. That's a good point, it gives both kinds of values, and it's just whatever you're comfortable with. People using PLS-DA usually use Q-squared and R-squared, but people doing machine learning actually prefer accuracy, how many times they predicted right, and most times they follow the same pattern. So I guess it would be almost equivalent to ROC curve performance, sensitivities and specificities: as the number of components you're using in the model increases, the accuracy is probably close to what the area under the ROC curve would be, so you can think of it sort of like that. And so here it's just using one component to try and distinguish between the different cattle groups, and then if you use two components or three components or four components it progressively gets better. Can I say something? Yeah. The thing with that many components is that when you get too many components, performance can improve, but the Q-squared value actually dropped slightly in that figure. That means it's probably overfitting, based on Q-squared, but based on accuracy it seems to still be improving. They don't agree with each other all the time, but it's based on whatever you're comfortable with, because there are two different cultures, statistics and machine learning, so we give you the choice.
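A small, hedged sketch of the permutation-test idea for PLS-DA, using scikit-learn's PLSRegression on one-hot class labels as a stand-in for the PLS-DA implementation MetaboAnalyst actually uses (data simulated, metric simplified to an R-squared style separation score):

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(3)
n_per_group = 10
y = np.repeat([0, 1, 2, 3], n_per_group)          # four feeding groups
X = rng.normal(size=(y.size, 30))
X[:, 0] += y                                       # one informative metabolite

def class_separation(X, y, n_components=2):
    """Fit PLS on one-hot labels and return R^2 as a separation score."""
    Y = np.eye(4)[y]                               # one-hot encode the classes
    pls = PLSRegression(n_components=n_components).fit(X, Y)
    return pls.score(X, Y)

observed = class_separation(X, y)

# Permutation test: shuffle the class labels many times and see how
# often a random labeling separates as well as the true one.
n_perm = 100
perm_scores = np.array([class_separation(X, rng.permutation(y))
                        for _ in range(n_perm)])
p_value = (np.sum(perm_scores >= observed) + 1) / (n_perm + 1)
print(f"observed separation: {observed:.2f}, permutation p = {p_value:.3f}")
```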
I know with some software, once you have the Q-squared, some people just let the software decide. Yes, here it allows you to specify: by default it tests from 1, 2, 3, 4, 5 components, but you can try to increase that, and once the performance stops increasing you can see the best model, for example three components, and so on. So 3-PP, endotoxin and glucose are the ones that seem to be most important, and then you can see that there's not much to distinguish between, I guess, alanine and methylamine and the glucose, so the performance isn't enhanced a whole lot more as you go beyond three or four components. So that's the PLS-DA; we've seen the PCA, and then we can do a little bit more in terms of the heat maps or clustering. Yes, any ROC curves on this? Yes, but not with this particular example that we're using. So under the cluster analysis we can do the heat map. We'll use the hierarchical clustering one, which is generally preferred, and it allows you to look at the behavior of metabolites, and you can ask questions: which ones have a low concentration in the 0 and 15% groups but increase in the 30 and 45%? Which compound is the only one that's significantly increased in the 45% group? Those are things that, again, a visualization tool like a heat map allows you to look at. Why you might ask those questions, who knows, but these are examples of things that you might do. And so clicking on these panels allows you to generate that; you can cluster on the rows, or order by class labels, whatever. This is what it looks like. Looks like we've got a pretty good color range here, but you can see the red, green, blue, turquoise above, so those are the four populations, and then you can see the clustering on the metabolites down below. Again, you can cluster the heat map in different ways, both above and below, or on the top and on the side. So again, it's just a different way of looking at data; it can be informative depending on how you want to look at it. Some people really prefer this kind of view over, say, PCA or PLS-DA, but it's still a matter of choice, and it's about giving people that choice. So what we've been doing is a sample run, and most of it would take much less time than I've taken in terms of explaining things, but we're obviously trying to make sure you can understand what's going on. So we've done most of the multivariate analysis, and MetaboAnalyst has been tracking all of this information, all the plots, all the graphs that we've generated. So what's really nice about MetaboAnalyst is that it's actually kept records of all of those things, and you can print off or save a PDF summary of what you've done and what you've found. So these are the results and the things that have been generated, and you can get this full PDF report, and if we look at it, this is what it looks like. So it's written your paper for you, and you just have to submit it off to Nature and you're finished. So that's an example analysis with MetaboAnalyst. So for your question: the latest one is what's going to be put in the report. If you do the heat map 10 times, only the last one will stay there, because the others get overwritten; otherwise the report would be very long and you wouldn't know which was which. So before you click download, make sure your last operation is the one you want. Does that help?
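A generic sketch of the hierarchical-clustering heat map idea (SciPy linkage on simulated data; MetaboAnalyst's own heat map is produced in R, so this only illustrates the reordering that makes the colored blocks appear):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, leaves_list

rng = np.random.default_rng(4)
# Simulated autoscaled data: rows = 12 samples, columns = 8 metabolites.
X = rng.normal(size=(12, 8))
X[6:, :3] += 2.0      # half the samples share elevated metabolites 0-2

# Cluster samples (rows) and metabolites (columns) separately, the same
# idea as the two dendrograms drawn on a clustered heat map.
row_order = leaves_list(linkage(X, method="ward"))
col_order = leaves_list(linkage(X.T, method="ward"))

# Reordering the matrix this way is what makes blocks of similar
# samples/metabolites appear as contiguous colored patches.
heatmap_matrix = X[np.ix_(row_order, col_order)]
print("sample order:", row_order)
print("metabolite order:", col_order)
```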
Okay, so we've done the statistical analysis; we can also do an enrichment analysis, which is one of the other options that are available. This is based on a concept called metabolite set enrichment analysis, or MSEA, and that's based on something that's been around for gene expression analysis for almost a decade, the so-called gene set enrichment analysis. There are three options in MSEA: over-representation analysis, single sample profiling, and quantitative enrichment analysis. To do this sort of thing you actually have to have a number of libraries, your databases. GSEA has very large libraries, because a large community contributes; metabolomics is much smaller in terms of community, but MSEA has a whole bunch of predefined metabolite sets, pathway sets and disease sets that have been generated partly from metabolomics databases such as the HMDB, and also from some data mining that Jeff has done. So it's really to see if there are some biologically meaningful groups of metabolites that are significant; we're looking at pathways, disease and localization. Because all this information has been mined from human data, MSEA is really restricted to human metabolomic data, but it probably could be converted or translated to mammalian data pretty well, except that some of the concentration values may not match for mice and rats. So with MSEA you can actually just have a bunch of metabolite names, no concentrations, just metabolites. These are ones that somehow you've identified, by relative concentrations, or that are just there or not there by some simple assay, and that allows you to do over-representation analysis. Alternatively, you could have a single patient, in this case metabolite names and their concentrations, so this is a single sample. Say you want to have a metabolomics test, just like you'd go to 23andMe and have a SNP test, and here are 110 metabolites from your urine sample: what's normal and what's abnormal? That's what SSP does. Now, the next one is if you've got a cohort study, so the whole room is being studied, and it might be cases and controls, but in this one it's a more extensive study where you've got not only metabolite names and concentrations but also a whole bunch of people in a clinical test. So, three types: ORA, SSP and QEA, and they need different types of input. ORA needs really simple input and is a modestly powerful technique, SSP is for your own personalised assessment and interpretation, and QEA is for a large-scale study of multiple people. For this we've chosen a study that was done a few years back by our group, looking at lung and colon cancer patients who developed cachexia. This is a form of muscle wasting, and if any of you know people who've had late-stage cancer, they look very, very thin; that's called cachexia. It happens for certain types of cancers, not for all, and if it does develop, the outcomes are much worse. We still don't know why it happens, and for some people who have lung cancer, cachexia never develops, and those people actually live quite long and reasonably morbidity-free lives, but for those in whom it does develop, lives are significantly shortened. So we'd like to know more about the metabolic basis of cachexia, and this is why the study was done. So you can select that sample set, it's an example data set, and once we select it we can choose the enrichment analysis, and in this case we can upload a compound list. As I say, ORA is a weaker analysis, so here's just the list of compounds, and we wanted to try over-representation analysis.
For over-representation analysis to work, it has to have some name standardization, and in this case one of the compounds was misspelled and it wasn't able to find anything in KEGG or HMDB, so it flags it and says, what have you done, is this a real metabolite? Once it's identified that, we can try and fix isoleucine and spell it correctly, and once that's done it suggests things, and we can check it. Then we can choose what sort of metabolite library we want to compare this list of metabolites to: we can look at a pathway library, a SNP-associated library, predicted metabolite sets, or location-based sets. In this case we've chosen a pathway library, and based on those metabolites that were altered, or at least our initial set of metabolites that seemed to be significant by whatever criterion we chose, we see that a number of pathways come up. So glycine and serine metabolism is altered, tryptophan metabolism is altered, protein biosynthesis is altered, phenylalanine and tyrosine metabolism is altered, methionine metabolism is altered, ammonia recycling is altered; these are all significantly changed based on their representation in known pathways. And this is actually quite striking, because it's quite consistent with what we know about the metabolism associated with cachexia, so this is being picked up. Then we can dive in and see what's actually there. With these table values we can see the false discovery rate, so this is the FDR-corrected value; compare that to the p-value, which is not corrected for false discovery. Remember, we're doing many comparisons, so these are statistically robust measures of what's significant. We can go a little further, and after we've clicked on these things we can look at a pathway that's associated with this particular case, this phenylalanine and tyrosine metabolism, and this is linked to SMPDB. So this is a phenylalanine and tyrosine pathway which highlights what's there, where the alterations occur, largely in the liver, and where the metabolites go. So that's ORA; it's a fairly simple-minded analysis.
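Under the hood, over-representation analysis of this kind usually comes down to a hypergeometric (or Fisher's exact) test: how many of your significant metabolites fall in a given pathway versus how many you would expect by chance. A hedged, generic sketch with made-up numbers (not MetaboAnalyst's code or its actual library sizes):

```python
from scipy.stats import hypergeom

# Assumed illustrative numbers:
N = 1500   # metabolites in the whole reference library
K = 25     # metabolites belonging to one pathway (e.g. a serine pathway)
n = 40     # metabolites in our "significant" input list
k = 6      # of our significant metabolites that fall in that pathway

# P(observing k or more pathway members in a random draw of n)
p_value = hypergeom.sf(k - 1, N, K, n)
print(f"ORA p-value for this pathway: {p_value:.2e}")

# In practice you repeat this for every metabolite set in the library
# and then apply an FDR correction across all the resulting p-values.
```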
Single sample profiling is, as I say, something that you would do if you were a physician and you wanted to characterize someone, or if you happened to want to know what's happening with your own metabolites. So it's just for you, or for a single person, and here's a list of their metabolites and their concentrations: is there anything abnormal? This is the typical thing that would be done in a urine test, except that in clinical urine tests they only measure seven things, whereas this is typically reading out, I don't know, 100 or so. All this does is compare the measured concentrations to normal values, and there's very extensive data on normal values for blood and urine in humans, so it's just running through and saying, what does it look like, is it within the normal range, and it's conducting those comparisons. You can look at it in a little more detail, and you can also have that plotted out against other known studies, showing where your value sits, and you can certainly see that in this case, for a lot of the results for this person, the value is way high. What does it mean? Well, that's sort of what it'll comment about, or there may be reference information there. A step further goes beyond just a single person: now you're looking at a cohort, and this is where you use QEA. This would be for a large-scale study, again for humans, more clinical, and again there are various options that you can select and then you can submit it. So we're taking all 77 patient samples and just running them through, and the result is not unlike what we get with ORA, but it identifies some other things in a bit more detail. It's identifying issues with galactose metabolism, which didn't pop up before, it's also identifying issues with insulin signaling, but it's also still showing issues with some of the branched-chain amino acid and essential amino acid metabolism that's altered in cachexia. And again there are graphs that you can view, it can go down several levels, and you can also match up to the different pathways through SMPDB. So MSEA is intended for human studies; it's not for plants, not for microbes. It could work reasonably well with some other mammalian systems, but it's really oriented towards clinical work. Question? Yeah, so again, the question is: is there a way you can identify metabolites that show up very, very frequently as being important in disease indications, in nearly every study? That's a bit of a challenge. We're working on a database called MarkerDB; we're trying to track biomarkers as published in the literature for a whole bunch of conditions, and to try and measure the incidence of those compounds, to see if they're just something that always shows up. One example that does show up almost all the time is carnitines and acylcarnitines, and these are actually markers of inflammation and white blood cell activity. If you're sick, usually your white blood cells are active, and that could be for anything, so what you're seeing with elevated acylcarnitines is essentially high activity of white blood cells. It's not a very specific marker, but it is a marker that something is going on, that your body's not happy. I know that in our early days we kept on finding acylcarnitines and we were very excited that this was a specific set of markers. Now it turns out that there are some classes of acylcarnitines that seem to be specific for certain conditions, so as a broad category acylcarnitines are not informative, but specific acylcarnitines may point to some very unique or specific conditions. So yeah, it's still evolving; I think people need to address that question more thoroughly. Okay, I'm going to jump to another module. We've done statistics, we've done enrichment analysis, and now we can look into pathway analysis. Pathway analysis isn't restricted to humans; it's actually supported for a number of model organisms: mice, fruit flies, Arabidopsis, E.
coli, yeast. The pathway analysis is able to work because it builds on the KEGG pathways, which cover a large number of organisms. So pathway analysis goes a step beyond metabolite set enrichment analysis, because it's able to look at other organisms, but it's also a way of looking at the biology, the pathways. Now I'm going to put in a caveat here, and this is a very important one for a lot of people, and it's fundamentally a weakness in metabolomics overall. Almost all pathway databases that are out there, and KEGG in particular, focus on catabolism and anabolism, so the breakdown or construction of other metabolites. That's our classical view of metabolism, but metabolites are far more important than that, and what's not depicted in most pathways is the signaling role that metabolites have. You can think of glucose as being one of the most important signaling molecules in the body, and if you look at glucose in KEGG, it simply shows it in glycolysis. But glucose activates all kinds of things, it targets many, many proteins, and it's tightly controlled by many systems and proteins as well; none of those are depicted in KEGG, and only a few of them are in SMPDB. Acylcarnitines are something that I mentioned: these are the byproducts of white blood cell activity, yet go to KEGG and there's no mention of immune function or white blood cell activity; they're simply used in beta-oxidation of fatty acids. So again, it doesn't link to a physiologically important pathway or process. Leucine and isoleucine are very important branched-chain amino acids: those amino acids target mTOR, and their critical function is essentially to act as insulin analogs. Insulin leads to all kinds of signaling events and all kinds of physiological processes. Do you see that pathway in KEGG? No. So when there are elevations in branched-chain amino acids, people will simply interpret them through KEGG and say, oh well, it's involved in synthesizing branched-chain amino acids, and here are some other compounds. It tells you nothing. So this is a fundamental weakness of pathway analysis, of pathways in general, and of pathway databases, and until metabolomics gets to the state where the pathways it offers provide some of this physiologically important information, like leucine and isoleucine acting as insulin analogs, or glucose being an important signaling molecule that targets 500 other proteins, until that's there, people are going to have some very, very shallow interpretations of metabolite data. And it's still the way it is: people just keep on saying, I found isoleucine, cool, that's amino acid synthesis, it's useful for making muscle, end of story. As I say, that's a useless result. And those pathways are important not just for humans; all mammals probably largely function the same way, as does any system that has B cells and T cells, so those things go all the way down to almost any multicellular organism. Pathways in terms of signaling for unicellular organisms are also lacking: the operon pathways in E. coli, do you find them in KEGG? No. Do you find second messenger signaling and quorum sensing signaling in KEGG?
No. So these are all examples of fundamental metabolite measurements where we don't have the pathways that are absolutely vital to understanding how systems work. Okay, I'm going to have to speed up here, but it's certainly a pet peeve of mine, and one that I think you guys, who are young and full of energy, need to think about. So in this case we're going to use the same lung and colon cancer set to do pathway analysis. We selected it, we choose the pathway analysis, we upload our data as before, and because it's data with concentrations we have to do a little bit of data normalization, just like what we did with the PCA. So we could do some log transformation or autoscaling or other things, and this is how we've chosen to do it here, and again, it's through trial and error that we found this works best. Then we can choose our pathway library, so we can look at mouse or parasites or plants; in this case, obviously, we're working with humans, so we choose human. Then we can do a couple of analyses: we can look at network topology analysis, we can use all of the pathways, and we can use both assessments of pathway enrichment and pathway topology. We can measure things by relative betweenness or out-degree centrality, and for the enrichment analysis we can use a global test or a global ANCOVA test. So in the case of topology, position matters, and this is this idea of hubs and bottlenecks. If you're thinking of these sorts of clusters or pathways here, something marked in blue represents a major bottleneck and the red ones represent hubs. This is a form of graph theory, and it's been fairly well developed for the last 10 years, and it's been used in mathematics for decades, but it's a way of assessing positions in pathways. So the hubs have high degree centrality, and the bottlenecks have what we call high betweenness centrality; that's sort of a quantitative way of talking about hubs and bottlenecks. So that's the topology, and then you can go to sort of the pathway enrichment analysis, and you can get these kinds of plots with pathway impact and log p in terms of significance. This produces a sort of modestly correlated graph that goes from light yellow to dark red, from one corner to the top corner, in terms of the importance of the pathways, so usually the things that are off in the far top corner are the ones that are most important. Based on pathway analysis, glycine and serine metabolism was the most important for this one, so it's a little different from what we got with our metabolite set enrichment analysis, but again, this is dealing with pathways and thinking of pathways in this way with concentration data. You can zoom in, you can click on things, it provides graphs. This measure of pathway impact combines the log fold change of the differentially expressed metabolites, the statistical significance, the set of pathway genes and proteins, and the topology, so it's all sort of combined together, which is a little different from the metabolite set enrichment, and it uses both topology and representation analysis. You can get tables, and you can link this both to the KEGG databases and, in this case, because it's human, also to the Small Molecule Pathway Database, or SMPDB. Statistics are there, false discovery rate information is there, so it's again quite statistically robust.
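To make the hubs-versus-bottlenecks idea above concrete, here is a small sketch using the networkx graph library on a toy metabolic network (the network itself is made up; MetaboAnalyst computes similar degree and betweenness centrality measures on real KEGG pathway graphs):

```python
import networkx as nx

# Toy directed "pathway": A feeds a hub H, which fans out to several
# products; B is a bottleneck joining two otherwise separate branches.
G = nx.DiGraph([
    ("A", "H"), ("H", "P1"), ("H", "P2"), ("H", "P3"),
    ("P1", "B"), ("X", "B"), ("B", "Y"), ("Y", "Z"),
])

# Hubs: nodes with many connections (high out-degree centrality here).
out_degree = dict(G.out_degree())

# Bottlenecks: nodes that many shortest paths must pass through
# (high betweenness centrality).
betweenness = nx.betweenness_centrality(G)

print("out-degree:", out_degree)
print("betweenness:", {n: round(v, 2) for n, v in betweenness.items()})
```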
When we're doing biomarker analysis, there are often two things that people worry about. Well, in this case I clicked power analysis; maybe I should have clicked the biomarker analysis. So the idea is to find biomarkers using ROC curves. We learned about ROC curves, and we want a process where we maximize the area under the curve and minimize the number of metabolites. This is similar to that Q-squared/accuracy plot, or the discussion we had earlier, where we saw that after adding more than three or four metabolites the accuracy really isn't improving and the Q-squared starts falling. It's the same sort of thing with biomarker analysis: we're trying to choose a model, and it could be a PLS-DA model, it could be an SVM model, or something else that classifies things, and that classifier is typically a multivariate model. Some people have really good intuition about what the biomarker should be, so the biomarker analysis in this ROC curve tool also allows you to do a manual analysis; you can also look at one variable at a time and see which one is contributing as a discriminating factor. So in this case we're not going to be using the cows, and we're not going to be using the cancer set; we're going to choose another panel of 90 patients, expectant mothers at three months of pregnancy, and in this case we're trying to identify which ones went on to develop preeclampsia and which ones had normal pregnancies. So this is actually trying to come up with a ROC curve to predict preeclampsia. Just like with the PCA analysis, we have to do a data integrity check to see if everything is okay, and yes it was, so we can skip things. Then we can do some data normalization and scaling. In this case it has a lot of both NMR and, I think, MS data, so we have to do a log transformation to make it look better, and again, this is what it looks like before; notice this highly skewed, almost exponential-looking distribution. Then you do the log transformation and it looks very much like a nice normal distribution, so now we've got it sorted out. Then we can get into the realm of doing this multivariate ROC curve analysis. What it's going to do is sample from the dozens and dozens of metabolites for these 90 different patients, and it's going to try and choose a set of metabolites that discriminates between those who developed preeclampsia and those who didn't. So it's going to use some machine learning methods, I think this one uses an SVM, it's going to pick out which ones are best, and it's going to try different models. In this case we get ROC curves, even with just two components, that have an area under the curve of 97%, and then if you use up to 10 components you get up to 98%, maybe even another half a percent, and then if you start using too many, your performance starts falling. So there's an optimum, and roughly, if you want to maximize benefit in terms of a minimal number of measurements, somewhere around 3 to 5 compounds gives you your best performance. What are those compounds? Again, you can look at that. You can also assess it, essentially by doing an SVM (it could have done a PLS-DA), but you want to have a measure of reliability, and this plots out sort of the maximum and minimum performance for this type of model and gives you a spread or estimate, and this is not done enough in ROC curve analysis, which I think is important to do. From there you can identify the significant features, which moves you into the VIP plot, and in the case here, what distinguishes people who will develop preeclampsia is four or five metabolites: glycerol, hydroxybutyrate, choline and acetate are the ones that are most significant.
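A rough sketch of the cross-validated ROC/AUC idea behind this kind of biomarker analysis (scikit-learn, simulated two-class data; the real module does repeated cross-validation and feature ranking on top of this, so this only shows the core calculation):

```python
import numpy as np
from sklearn.model_selection import cross_val_predict, StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import roc_auc_score, roc_curve

rng = np.random.default_rng(5)
n_patients, n_metabolites = 90, 60
y = rng.integers(0, 2, size=n_patients)            # 1 = developed preeclampsia
X = rng.normal(size=(n_patients, n_metabolites))
X[y == 1, :4] += 1.0                               # four informative metabolites

# Linear SVM with autoscaling; decision scores estimated by stratified
# cross-validation so the AUC is not optimistically biased.
model = make_pipeline(StandardScaler(), SVC(kernel="linear"))
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_predict(model, X, y, cv=cv, method="decision_function")

auc = roc_auc_score(y, scores)
fpr, tpr, _ = roc_curve(y, scores)
print(f"cross-validated AUC = {auc:.2f} over {len(fpr)} ROC points")
```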
In some cases they're elevated, in others they're decreased. So this again allows you to come up with both a model and a predictive tool to classify, identify or predict which individuals will develop this disease based on their blood samples. So that was the biomarker analysis. Then the power analysis, which was mislabeled on the other slide (I'm not sure why that slide gets corrupted), is what we can look at next. This is the thing that people use for determining what size of study, in terms of the number of patients, should be used, or whether the study that you've already designed has the power to detect a real effect, to identify these important differences. It's again a quantifiable thing, where you say: do you want to have an 80% chance of coming up with something that's statistically significant, or a 90%, or a 99% chance? Or, if I've got 20 mice and that's all I have, how powerful is the study, how confident can I be in my results? So statistical power depends on the sample size, it depends on your choice of alpha, say 0.05, and it depends on the effect size. You can increase the power by increasing the sample size or increasing the effect size; if you make your significance criterion stricter, that decreases the power of the study. There are different criteria, such as false discovery rates or p-values, that you can use, so you can make choices. In essence, to do a power study you basically have to have some pilot data, and if you've got some initial pilot data, then you can start figuring out, based on the curve, how much more data you need to get a much more powerful sample. So in this case we've generated some pilot data, and based on the pilot data, the curve of power versus sample size says that we need about 60 samples per group to get 80% power. This again suggests what you will need to do in terms of designing a future study, based on the pilot study, to make a more statistically robust assay. Power calculations are typically used for validation studies; they're done after you've done your discovery. Most times, if you're doing something for the first time, you can't do a power calculation, and if someone asks you to do so, tell them they're nuts; it's just not possible. So we haven't covered everything here. We haven't looked at K-means clustering or self-organizing feature maps, we didn't get into random forests or much discussion of SVMs, and we didn't look at time series data or metabolite integration, but these are actually very powerful tools. In time series analysis there are some examples here of what's possible, whether it's Venn diagrams or other kinds of cool plots. The integrated pathway analysis, which combines both genomic or transcriptomic and metabolomic data, is part of a package that Jeff developed called INMEX, and I think there are a couple of you who are interested in combining those sorts of things; unfortunately we don't have the time and the data sets to really do that, but you're certainly welcome to explore that with MetaboAnalyst. Okay, so I think that covers an overview of MetaboAnalyst. What we're hoping you guys will do over the next few hours, whether you want to start during lunch or not, but certainly during the lab, is to explore these things. There are lots of tools and data sets that you could try, but many of you are probably anxious to work on some of the data sets you generated yesterday, and the idea here was just to show you those pieces.
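As a hedged illustration of the power-versus-sample-size curve described above (a simple two-group t-test power calculation with statsmodels; the effect size is a made-up "pilot" value, and MetaboAnalyst's power module additionally handles FDR across many metabolites):

```python
import numpy as np
from statsmodels.stats.power import TTestIndPower

# Assumed pilot estimate of the effect size (Cohen's d) for one metabolite.
effect_size = 0.5
alpha = 0.05
analysis = TTestIndPower()

# How many samples per group are needed for 80% power?
n_needed = analysis.solve_power(effect_size=effect_size,
                                alpha=alpha, power=0.80)
print(f"~{np.ceil(n_needed):.0f} samples per group for 80% power")

# The power-versus-sample-size curve: power at a range of group sizes.
for n in (10, 20, 40, 60, 80):
    p = analysis.power(effect_size=effect_size, nobs1=n, alpha=alpha)
    print(f"n = {n:3d} per group -> power = {p:.2f}")
```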
Those are the pieces that you would typically want to use, and there are different paths that you may choose to follow, different things that may provide you with more or less information. Most of the sample sets that we've worked with, at least two, were human studies, and one other was a mouse study, so based on that it kind of limits what sort of paths you can choose in terms of the analysis schema. And as I said, you don't have to work exclusively with the data we generated yesterday; you could work with the example data sets, and as you'll see in the later labs, there'll be some questions you can tackle or things you can explore to help you better understand MetaboAnalyst. Okay, so are there any questions in the last minute before lunch?