So I'll present MetaboAnalyst. How many people have used it before? A few, good. For the others, this will be your first time. MetaboAnalyst is Jeff's invention. He started it when he was only this tall and has been working on it for most of his adult life. I'm fortunate to actually be able to present it, but I'm presenting probably from the perspective of a user, and also as Jeff's former supervisor. So we had certain ideas that we wanted to see. The idea is to understand a little bit more about the metabolomics data analysis workflow. We'll quickly go through that. We'll look at the issues of data integrity and quality control, checking the ideas of normalization and scaling, which Jeff talked about. And then we'll look at MetaboAnalyst. So in metabolomics, we're often working with both biological replicates and technical replicates. It's the same with genomics, proteomics, and transcriptomics, whether we're dealing with mouse models or humans or plants or even microbial systems. The biological replicates in this example are just the different mice. They look sort of the same; they're inbred. But often, to make sure that things are done correctly, you will do technical replicates. So you might collect two blood samples, or you might take a single blood sample and split it into three different sets or five different sets and so on. Having both the technical and the biological replicates is important. The technical replicates allow you to assess the performance of your instruments. The biological replicates give you sufficient statistics to assess whether there are differences between control and treatment, or healthy and unhealthy, or whatever. QC is really important in the experimental side of metabolomics, and having quality control samples (something that might be completely different, or an artificially constructed sample, or one that's put in every 10th run on your mass spec or your NMR or GC-MS) is also important. Some people will pool samples to create a generic collection, which also serves as a quality control run. And I think these are important. People mentioned that they'd like to have more of a discussion on the process of metabolomics sample prep and experimental design. That in and of itself could easily be a full day, but I'm happy to field questions as people have them. So that's partly the general design, and then we also talked about the general concept of either targeted or untargeted metabolomics and what we'll be doing in MetaboAnalyst. It supports both. Whether it's untargeted or targeted, there's a slightly different workflow in terms of what gets done. The untargeted approach always starts with the integrity checks, same with the targeted one. In the chemometric or untargeted approach, we'll be doing spectral alignment or binning; that was the XCMS module that we looked at, for those of you who were around. In the targeted approach, it's compound identification and quantification; that was GC-AutoFit and Bayesil. So that's where they fundamentally differ. Then they both do data normalization. Whether you've just got bins or whether you've actually got compounds and concentrations, you still do some sort of normalization. You'll do a quality check, QC is what we usually call it, and some outlier removal. Inevitably, when you're measuring many things at once, there will be some glitches in the instruments or in the sample, and this helps sort some of those things out.
And then we do the data reduction and analysis. That's what Jeff was showing you in terms of the different t-tests, the ANOVA tests, the PCA and PLSDA. The last thing in the chemometric or untargeted method is the compound ID, because by then you've simplified your problem and identified a few peaks that are relevant, and then you go back and match them. In the targeted one, you've already done that. So in some respects, the targeted approach has fewer steps in the overall process. So what is the data integrity, data quality issue, and how do we address it? The bottom line, and we've talked about this before, is that LC and GC have lots of false positives. These are peaks that are essentially noise. Some of them we talked about when we did that reduction, where we worried about adducts and isotopomers and a whole bunch of other variations. But there's also just the junk that sticks to columns. There's the mysterious stuff coming from fragmentations that we don't understand, or in the case of GC-MS, some strange derivatization products. NMR is a little simpler, partly because it's not as sensitive. But what people will do is collect multiple samples. Another trick is to collect the same sample over multiple days, just to see how things drift and change. If you still see the same things in the same spectrum on the same instrument over five days, it's probably there. But if it keeps appearing and disappearing, as many tend to do, and as you maybe see here, then you can say this is not a real peak; it's noise. Same thing with replicates. You can run them repeatedly, very quickly, or over the same course of time or the same day, again looking for things. Running blanks is another way of checking for noise. So these are the QC or data integrity checks that we can typically do. We've talked about spectral alignment. That was part of the XCMS exercise; unfortunately, it wasn't working yesterday. But it is commonly done for LC-MS and for GC-MS. And this is, again, because columns, whether gas or liquid chromatography columns, vary over time and use, and the parameters, even the flow rates, cannot be perfectly controlled. So there will be shifts. Through using things like time warping or other approaches, you can get the blue to align with the red. That's available in XCMS, MZmine, and ChromA. In the world of NMR, binning has also been used as part of the untargeted approach (it can also be done in mass spec). That was largely due to limitations in computer memory and storage. But in many respects, people now just take all the data; they don't even bother binning. Or they'll do peak picking and just compile all the peaks. So the term normalization is sometimes confusing. Some people view it as a term for scaling: you scale or normalize to a certain value or a certain ratio. Normalization also means converting something into a normal distribution, which is a totally different meaning, and it is confusing when people use the same term for both. Here we'll just say data scaling. This is something Jeff talked a little about, but it's how to essentially address this problem here, where you're seeing two different spectra. This happens to be NMR, but it could also be mass spec or an HPLC chromatogram, and you ask: are these identical? You can see the peaks in the same positions, but they seem to be higher in one case and lower in the other.
So there are different ways to scale things, to make sure that you're in fact comparing apples to apples and not apples to oranges, or that you're measuring with a meter stick and not a yard stick. Some of these approaches use the total integrated area; this is particularly important for urine analysis, and there are similar challenges with, say, fecal water. There are other approaches using internal standards, sample weight, sample volume, or the probabilistic quotient method, which is a more complex one that seems to help. It depends on both the sample and the circumstances. Some people will choose to scale to a specific feature; in the case of urine, people scale to creatinine. It's not always the best, but it's been used for a long, long time. The other things people can do are more complicated again. Jeff talked about auto scaling, Pareto scaling, and range scaling. Those all typically have to be done fairly early in the process, as I say, just to ensure you're comparing apples to apples. Other things related to data analysis, as I said, are outlier removal and QC. We saw this in examples when we were looking at NMR and Bayesil, where we had to remove the water peak, that's H2O, which is a giant signal. We also saw this where we filtered out noise and subtracted the blank in GC-AutoFit. But there are also situations where you clearly see a peak that wasn't there before and disappears later, or one that's just off the wall or off the charts. Those are outliers, but in many cases you have to rationalize why you're removing them. Especially with graduate students, you'll see that the best way to remove an outlier is to just take an eraser, and it just disappears. But that's not the way to do it. I mean, that physically removes it, but you also have to justify it. Providing a justification or a set of justifications in your paper or methods protocol is fine, but doing it without justification gets you into a pretty touchy area with science and fraud and other things. The other part is that after you've done your outlier removal, there's dimensional reduction, and we'll delve into the PCA and PLSDA features and the clustering that's done. So these components, quality control, data integrity, outlier removal, data reduction, all of those can be performed in MetaboAnalyst, and that's why we're going to talk about it. This is something that's been under development for seven years at least; more like eight, I think, now. What's unique about it is that it's a web server. Historically, in the early days of metabolomics, the only way you could do a lot of this stuff was with commercial programs you had to download. SIMCA, I think, was the primary one; in fact, many people still use it to do a lot of the multivariate statistics. Many other programs have since appeared. Those of you who tried a little bit with XCMS will have seen that it does some multivariate statistics as part of its analysis. But the appeal of MetaboAnalyst was that it's quite fast and that it didn't necessarily get stuck at the data upload and download, which I think you saw with XCMS, where you're dealing with gigabytes or terabytes going through the web. So it sits a little bit downstream. You've at least done a little bit of processing on your instrument, but now you're interested in just doing the analysis and the interpretation.
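For readers who want to see the arithmetic behind those sample-wise options, here is a minimal Python sketch of two of them, total-area normalization and the probabilistic quotient method, assuming a plain samples-by-metabolites matrix. The numbers and variable names are invented for illustration; this is not MetaboAnalyst's own code (the server runs R behind the scenes).

```python
import numpy as np

# Hypothetical samples-by-metabolites intensity matrix (4 samples, 5 features).
X = np.array([
    [10.0, 5.0, 2.0,  8.0, 1.0],
    [20.0, 9.0, 4.5, 15.0, 2.2],   # roughly a 2x concentrated twin of sample 1
    [11.0, 4.0, 2.5,  7.5, 0.9],
    [ 9.0, 6.0, 1.8,  8.5, 1.1],
])

# Total-area normalization: divide each sample by its summed signal,
# so every row integrates to the same total (a common choice for urine).
X_area = X / X.sum(axis=1, keepdims=True)

# Probabilistic quotient normalization: estimate each sample's dilution
# as the median ratio to a reference spectrum (here the median spectrum,
# standing in for a pooled-average reference), then divide it out.
reference = np.median(X, axis=0)
dilution = np.median(X / reference, axis=1, keepdims=True)
X_pqn = X / dilution

print(dilution.ravel())   # sample 2 should show a dilution factor near 2
```

The practical difference is that total-area normalization gets thrown off when one huge peak dominates the sum, while the quotient method keys on the typical ratio and is more robust to that.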
So it started in probably about 2008 as part of Jeff's PhD project, and we published a version that had a lot of multivariate and univariate statistics. But by 2012 we'd had a lot of requests from people who wanted to go further than just simple multivariate statistics. They wanted to do functional analysis, pathway analysis, and metabolite set enrichment, which is something that Jeff developed. And, as usual, the tools that had been around in 2008 were by then obsolete, so we had to do a massive rewrite to bring MetaboAnalyst current with the new technologies. That also gave substantially better performance and a lot more interactivity. We also added power analysis; I think people have asked a couple of questions about that, and it's becoming quite common. And integration with gene expression analysis, and biomarker analysis with the ROC curves that Jeff talked about. So it's expanded from basically four modules to, I think, eight now. We're going to go through a sort of simulation here, or step you through some of the components again. For people who are familiar with MetaboAnalyst, this might just be a quick refresher; for those who've never seen MetaboAnalyst, it will give you, I guess, an overview, so that when you dive in, you'll be able to follow things. Now, we've printed off, and I think, Jeff or Ann, you guys got those. So how many people have the printed documents for the MetaboAnalyst tutorial? Raise your hands, very high. So there's three, one, two, three. And then there's these two. You'll actually need them. You can either work from your computer or you can bug some of your friends. But these are very detailed. There's also a tutorial that is done, or doable, through slides. Some of you learn better by just doing and not reading any instruction manual whatsoever. That's fine. We're going to talk about eight things: raw data processing, data reduction, functional enrichment analysis or metabolite set enrichment analysis, pathway analysis, power analysis, biomarker analysis, and integrative analysis. We aren't going to dive into a lot of detail for some of them, but for others we will. So here are the eight modules that are in MetaboAnalyst. There are little text explanations and tabs that talk about them: statistical analysis, pathway analysis, enrichment analysis, time series analysis, power analysis, integrated analysis. I can't even read the other part, so you can read it. But then there are these steps, four major steps, within a number of these modules. There's the data preprocessing, the part where we talk about QC and integrity, and data normalization/scaling. Then the data analysis, which is the PLSDA and OPLSDA and PCA and clustering and everything else. And then the data interpretation, which is the metabolite set enrichment analysis and the pathway analysis, with integration of gene expression data. Another way of looking at it is this sort of flow chart, separating the types of data that you can get, whether it's NMR, MS, or GC-MS raw spectra, peak tables, or tables of concentrations. There are essentially four different types of input. Then there's the initial data analysis and integrity check, marked in red. There's name mapping, which is important; many people have different names for the same compound, and there are issues with misspellings that show up, or people dropping E's from certain compound names. There's the data integrity check.
Some aspects, if we're just getting peak lists, there will be peak alignment and peak detection. There are the filtering steps as well, and the scaling, transformation, and normalization. So those are the other parts in red, and then there's the other stuff, multicolored purple and orange and green and gray, that we'll get into. So if you go to MetaboAnalyst, and you can usually just type the name into Google and it'll take you there (we've got links as well), you'll see it's structured in a fairly standard way. There are different data sets that you can choose, and there's information about different types of data formats; like XCMS, it allows a lot of different formats, and Jeff has been very responsive to people's needs. There's a panel on the left side that allows you to navigate, and a front panel that tells you the latest news, what's been updated, what's been added. Clicking on the left panel lets you take a look at some of the example data sets. A few people get confused, and I still have to convince Jeff to make the 'click here to start' larger; a lot of people will stare at this and just not know how to get past the front page. Anyway, if you click on the data formats, you'll find a bunch of sample data sets, and these are the ones you'll be using for some of your tutorials or practice exercises. It talks about the formats and explains them a little further, and you can go and specifically download them, and there are others that you can upload from the website itself. So we're going to dive into the metabolomic data processing, that initial part that we talked about in red here, and today we're going to look primarily at concentration tables. As I say, you can look at peak lists, spectral bins, and raw spectra, and they're supported, but they're slower in some cases to process, and in some cases people are moving away from them. This is sort of my own personal emphasis, but I think, as I say, the community is driving a move towards more targeted analysis in metabolomics. Getting absolute concentrations and working with those makes your life so much easier and your publications so much more robust. So we're going to convert a variety of data formats into data matrices so that they can be used for statistical analysis, and as I say, we're starting with the concentration tables. To do this, we'll start off with the beginning module. If you click here to start, that little red or dark red button that people still miss, you jump into this panel display. It shows eight different panels. The top left one is statistical analysis. Once you click that, it takes you straight to a panel window that allows you to upload your data. So if you have previously prepared data, the GC-MS data that we did yesterday, the NMR data we did yesterday, or some of the example data that you can get either from the wiki or right here in MetaboAnalyst, you can enter or select that file name. Alternatively, if you scroll down the window, there's a whole bunch of example data sets that we have, and these are just for people to try out. We've found this, I think, to be really, really useful. A lot of people have learned a lot from just clicking on the different types of data. There are NMR data sets, mass spec data sets, peak-picked ones, concentration tables, all of them.
So you can see what they look like, just to make sure the format is right and that you haven't done things a little screwy. In this case, we're clicking on one particular data set, which is from cows. We're looking at dairy cattle. In North America, we tend to feed dairy cattle lots of grain, whereas in Australia and New Zealand and England and elsewhere, they feed them lots of grass. It's been of interest because, basically, cows in Australia, New Zealand, and England are much healthier than the cows in North America, and the thinking is that it has a lot to do with how much grain we feed them. So they started looking at dairy cattle and feeding them different proportions of grain: one group was all grass, then about 15%, 30%, and 45% grain. Then they stick a needle into the cow's stomach and draw out the rumen fluid. This is the fluid in the multi-chambered stomach of cows that works miracles; it converts grass to energy and milk and meat and everything else. In this case, NMR was used, so we got quantitative data. And as I said, the hypothesis was that high grain would cause metabolic stress to the cows. So we upload the data; we've selected it, and it gives you a report on how things look. It says how many samples it found. It checks whether things are numeric, so if you start putting in Greek letters, it will pick up on some of that and say something's wrong. When you're dealing with large data sets, that often happens; a lot of people make typos, or they shift columns, or forget something. So this is not perfect, but it tries to pick up the obvious errors. If it looks OK, and in this case it does, you can just skip the next step; you don't have to go back and modify things. But it allows you to change things. So in this case, we're going to go through this thing called data normalization, which is the combination of both scaling, which is comparing apples to apples, and turning things into a normal distribution. It's all in one. Now, in this case, the samples are in rows, that's the horizontal direction: sample one, sample two, sample three, sample four. And the columns are your compounds, you know, alanine and citrate and valine. That's the structure. You can have it the reverse way, but in this case that's what we've done. Then there are these two options, where we've got sample normalization, and then data transformation and data scaling. We can choose a couple of options. In this case, we're using normalization by a reference sample, and we're clicking to create a pooled average sample. For the data transformation, we're choosing none; we're not doing a log transformation here. And then we've chosen auto scaling to make sure that everything is roughly on the same scale, if you want. So samples are rows; compounds, which could also be peaks or bins, are columns. As you've seen in that structure, we have the sample normalization, the data transformation, which can be some of the log options, and the data scaling. Normalization of a sample matters when we're dealing with dilution effects. In the case of rumen fluid, we may be dealing with dilution: a cow may have been drinking a lot of water, or it may not have been, so the rumen concentration potentially goes up and down. With data transformation and data scaling, the other two, we're trying to make each variable, in this case each compound column, comparable to the others.
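As a concrete illustration of the transformation and scaling choices being described here, the sketch below applies a log transform, then auto scaling and Pareto scaling, to made-up log-normal data (not the cow data set). The names are mine; the formulas are the standard ones.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical concentrations spanning orders of magnitude (log-normal):
# exactly the situation where a log transform helps most.
X = rng.lognormal(mean=3.0, sigma=1.5, size=(20, 6))

# Transformation: compress the dynamic range so the overall
# distribution looks closer to a bell curve.
X_log = np.log10(X)

# Auto scaling: mean-center each compound column and divide by its
# standard deviation, putting every compound on the same unitless scale.
X_auto = (X_log - X_log.mean(axis=0)) / X_log.std(axis=0, ddof=1)

# Pareto scaling: divide by the square root of the standard deviation
# instead; it shrinks large fold changes less aggressively.
X_pareto = (X_log - X_log.mean(axis=0)) / np.sqrt(X_log.std(axis=0, ddof=1))

# The eyeball check, as in the before/after plot: the pooled
# distribution should look roughly normal after transformation.
print(np.percentile(X_auto, [5, 50, 95]).round(2))
```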
So we're trying to create a normal distribution. I think, as Jeff highlighted, and I'll emphasize again, we can't really do these statistics unless things are pretty close to a normal distribution. That's the basis of ANOVA, and even of elements of PCA and PLSDA. Most things that we measure in living systems seem to follow normal distributions, but sometimes they don't. In some cases, it's possible to convert that abnormal distribution into a normal one using things like log transformations or cube root transformations. So that's the transformation, and it matters especially when we've got things measured over many orders of magnitude, which can often happen with metabolites. Then there's the scaling, which is auto scaling, range scaling, Pareto scaling; Jeff talked a little bit about that already. And again, these help. So on the left is what your raw data look like. In this case, you can see that there are a lot of compounds with concentrations that are very, very low. This is the concentration axis, and everything is down near zero. But then there are three or four metabolites, acetate, butyrate, choline, and LPS or endotoxin, which are quite a bit higher. In some cases, these are measured by different techniques, maybe not on the same instrument, so you're merging data together, and you get a very skewed distribution. How do you fix that? Well, that's what this normalization does. It converts that skewed distribution, through log scaling or cube root scaling, into a normal distribution. You can see here that through that transformation, everything has been brought from near zero to something close to the middle, and if we look at the distribution of concentrations, it looks like a bell curve. So that's really important, and it's where a lot of people struggle, I think, in using MetaboAnalyst. MetaboAnalyst is designed to allow you to iterate, so you can try a transformation, or not, and see what it does. In most cases, it doesn't have to be perfect. But if the skewed version is what you're getting, and that's what you're going to use for the rest of your run, you're heading into a disaster. You have to try to get it to look something like the plot on the right, and many of the transformations will do that. There's no single right or wrong choice, but you should be able to distinguish between the left and the right. If you can't, then it's... well, time for more remedial statistics, I think. So, as I say, you can't know a priori what the best one is. Many people just stick with one, and it seems to work for them. But check. Look at the graph at the bottom; that's the most important part. You can interact with it, go back and forth; don't be afraid to. As I said, for this example, the particular transformation using the parameters marked here works just fine. But I bet some of you could use different settings and get something that looks almost the same, and that's okay too. There's no one right way of doing it, as long as you get that kind of normal-like distribution. Yes, you're judging it visually. So this is where humans have to do it. We've tried to, I guess, automate it so that it could iterate through all these options, and I suppose we could. But I think people want to have some control over their data and say, you know, this is what I've chosen, this is what I like, this looks more normal to me than the other one. It's a tough pattern recognition problem.
A human brain is probably the best for this one, so this is why we still leave it up to people to do it. Okay. So after you've done the scaling and the normalization, and it is important, you want to look to see if there's any noise or outliers still left. Again, we could potentially do this automatically, but I think humans are best at understanding, and even explaining, what's gone wrong. So this is where you do visual inspection and look to see if some of this is sorted out. In some cases, when you've done a normalization, you've actually made the outliers go away. There are also issues of noise reduction, especially when you're dealing with spectral bins and peak lists. We're not dealing with that here, but people who do typically have to do this noise reduction. It's similar to that module where I said, you know, here's a bunch of features, now let's do this separation, let's look at the isotopes, let's remove these things. He went from 15,000 features to 2,500 features. That's noise reduction. There are different ways of doing it, different approaches. You can do it within MetaboAnalyst or outside of MetaboAnalyst. So what does an outlier look like? Sometimes, if you've happily moved along and not really looked at your data, you'll get this. You'll see a green cluster and a red cluster, and that's looking pretty cool, and then you see this thing way, way out. That's P80. That's an outlier. You can also see it in a heat map or hierarchical clustering map, where everything is a mix of, you know, single dots of orange, red, and blue, but then you get this dark line running right across. That's also a way of detecting an outlier. So what's gone wrong? Ideally, you should have a bit of an idea. Maybe it's a typo. Maybe you remember someone spilling something on that sample, or it wasn't measured immediately. If you can rationalize it, good. If you can't, you're probably stuck with it. Anyway, there is an editor, the equivalent of the eraser, that allows you to make your outlier go away. At least you've been able to pinpoint it, and you can get back to analyzing. There are also some data filtering tools on offer. In this case, they're essentially trying to take out variables that are uninformative. In some cases, there may be things that are below the limit of detection, or that seem low but register as zeros, because that's what the instrument kicks out. The way many systems look at that is that zero is a real value, and therefore it gets included in the statistics. In fact, you don't want to use it; it should be treated as 'not available' so it's not part of the statistics. So in some cases, you will remove those things. There are guides on what can be done and how to remove uninformative or skewing kinds of data that you don't want in there. Again, people can do this before they dive into MetaboAnalyst, but in many cases people don't realize it until they're right in the middle of a MetaboAnalyst run. So noise and uninformative features typically have very low variance or very, very low intensities, and there are different options you can use for them. For the data we're working with, this set of cow metabolites from the rumen analysis, you don't have to. So that's some of the filtering. There's the data filtering, the scaling, the normalization. If you don't do these well, then you're in trouble.
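A minimal sketch of the two filtering ideas just mentioned, treating instrument zeros as missing and dropping near-constant features, again on invented numbers with an arbitrary cutoff:

```python
import numpy as np

X = np.array([
    [0.0, 5.1, 2.0, 100.0],
    [3.1, 4.9, 8.0,  50.0],
    [2.9, 5.0, 5.0, 150.0],
])

# Treat instrument zeros as "not available" rather than real values,
# so they don't get averaged into the statistics.
X_na = np.where(X == 0.0, np.nan, X)

# Drop uninformative features: columns whose relative standard
# deviation is tiny carry no signal for group comparisons.
# The 5% cutoff is arbitrary, purely for illustration.
rsd = np.nanstd(X_na, axis=0, ddof=1) / np.nanmean(X_na, axis=0)
keep = rsd > 0.05
X_filtered = X_na[:, keep]
print(keep)   # the two near-constant columns get dropped
```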
It's more of a problem with untargeted data; it's less of a problem with targeted data. So we've now gone through that red section, and we can dive into the more interesting stuff, the fun stuff: how to identify important features, how to detect patterns, how to look at differences between phenotypes, how to classify. We're going to use ANOVA, we're going to look at multivariate analysis, and we're going to look at some clustering. After you've uploaded your data and cleaned it up, you're in a position, in the statistical analysis steps, to do univariate analysis, multivariate analysis, significant feature identification, clustering, and classification. Within each of those, there are different options, which are clickable or highlightable depending on what type of data you have. In this case, we're not going to do a t-test; we're going to do ANOVA, because we're not dealing with two populations. We're dealing with four populations, four different diets: one all grass, one with 15% grain, one with 30%, one with 45%. We're trying to identify metabolites that are different between all of the groups, or between the control, which is 0% grain, and everything else. So we can go straight to the ANOVA choice, and this is the plot that we get. There are different things you'll see in this plot. We click on ANOVA; it's a one-way analysis, and this, I think, is the propionic acid example. Typically what you get is a plot of the p-values scaled as a negative log. The dashed line is the 0.05 cutoff that's typically used, and the things in red are the interesting things. If we click on this red one, the most significant, 10 to the minus 8 probably, that's the little graph we get, and it shows how it's very, very different between the 0% grass-fed group and the 30% and 45% groups, and somewhat different at the 15% level. We can also click on these little icons; one is a painter's palette and another is an Excel table. If we click on the table icon, we get a list of all the compounds that were measured by NMR. They're listed here, with the p-value, the log p-value, the false discovery rate, and then some information about which groups are most different in the pairwise comparisons. We can click on the ones that are hyperlinked and see a graph, in one case showing the original data plot, and then there's the box plot, which is probably preferred. So we can go up and down and explore what's different and which pairs or combinations are different; all that information is in the table there, all done for you and presented. At this stage, it still needs some interactive exploration. It's not going to write your paper for you. You need to look at the data and ask what's interesting here. You can click on the correlation link to generate a heat map and also look at some of the pairwise clusters or correlations. So this is a heat map, and for the longest time, I think MetaboAnalyst was about the only website that actually generated interactive heat maps. And you don't have to be doing metabolomics, either: if you're doing gene expression analysis, you can make heat maps with MetaboAnalyst too, and you can do it for proteomics. It doesn't care what your data is, so you can get nice heat maps through MetaboAnalyst.
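Here is roughly what is happening behind that ANOVA table, sketched in Python with simulated data: a one-way ANOVA per metabolite across the four diet groups, followed by a Benjamini-Hochberg false discovery rate correction. None of this is MetaboAnalyst code; it's just the same statistics spelled out.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
diet = np.repeat([0, 15, 30, 45], 6)      # four diets, six cows each (made up)
n_feat = 5
X = rng.normal(size=(24, n_feat))
X[:, 0] += diet / 15.0                    # plant one real diet effect

# One-way ANOVA per metabolite across the four diet groups.
pvals = np.array([
    stats.f_oneway(*(X[diet == g, j] for g in (0, 15, 30, 45))).pvalue
    for j in range(n_feat)
])

# Benjamini-Hochberg FDR: scale each sorted p-value by n/rank, then
# enforce monotonicity running from the largest p-value down.
order = np.argsort(pvals)
scaled = pvals[order] * n_feat / np.arange(1, n_feat + 1)
fdr = np.empty(n_feat)
fdr[order] = np.minimum.accumulate(scaled[::-1])[::-1]

for j in range(n_feat):
    print(f"metabolite {j}: p = {pvals[j]:.2e}, FDR = {fdr[j]:.2e}")
```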
So we click ANOVA, and then we just click the correlation item right under the ANOVA panel, and boom, this is your correlation analysis. You can see little clusters that seem to be highly correlated and ones that are not correlated. And you can play around; it's done some of the clustering. You can also choose a different coloring scheme. Some people like red-green, but those who are red-green colorblind don't. So these are, again, all options you can choose, along with the type of clustering that's done. This is largely a default one. You can generate it as a high-resolution image and make it part of your paper or your poster: you choose how many pixels and the format you want, and then just click submit. So that's a bit of the ANOVA analysis and its interactivity, and that's the heat map analysis that can come from it. You can see how it's basically point and click. It's all running R in the background, but using a lot of visualization tools through the web. If we're looking at more than two groups, as we are with this cow study, we can also look for patterns. This is something that was developed or added to MetaboAnalyst for temporal trends, but also, as in this case, for categorical trends. So you could have an axis of time here, or, in this case, it's 0, 15, 30, 45. We might want to see whether, as you go up in grain concentration, there's a linear trend, or whether it goes down, or whether there's something that drops quickly and then rises again. Again, this requires some interactivity. Maybe you've got a hypothesis about something, and you want to look for that. Obviously, with just four categories, you can't really look for periodic trends, but we can certainly look for linear trends, for data that has three or more groups. So we can choose these trends, and this is a profile where we're looking at going from 1, 2, 3, 4, which is a linear trend: category one, category two, category three, category four, is it going up or down? We're looking at the correlation coefficient in this case. So we run it through, and this is the pattern matching result. We have a trend going up, and you see that as you increase the grain, you see an increase in endotoxin, or LPS. As you increase the grain content, you see an increase in glucose, and that makes sense, because these are very starch-rich feeds. We also see an increase in methylamine and cadaverine and a whole bunch of other amines, which actually cause a lot of stress to the cows and are one of the reasons this diet is a problem. We can look at the correlation coefficient and see that endotoxin has a correlation coefficient of probably about 0.8 or 0.9, so it's very linear. Others around here are pretty weak. We can also see that 3-PP has a totally negative correlation coefficient, probably even stronger than the LPS or endotoxin. So we're basically most interested in the extremes here, where something is highly correlated or highly anti-correlated. Now, you could have looked at the graphs and clicked away at things, and you probably would have gotten an idea too. But this gives a quantitative assessment.
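The pattern matching itself is nothing more exotic than correlating every metabolite against a template profile. A sketch, with invented data and two planted trends:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
level = np.repeat([1, 2, 3, 4], 6)   # ordinal codes for the 0/15/30/45% diets
X = rng.normal(size=(24, 4))
X[:, 0] += 0.8 * level               # a metabolite rising with grain
X[:, 1] -= 0.8 * level               # one falling with grain

# Template matching: correlate each metabolite with the 1-2-3-4 profile.
# A custom template (say 1-2-1-2 for an up-down pattern) works the same way.
for j in range(X.shape[1]):
    r, p = stats.pearsonr(level, X[:, j])
    print(f"metabolite {j}: r = {r:+.2f}, p = {p:.1e}")
```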
Yes? For the pattern matching: if you don't have a time- or concentration-dependent variable across your samples, but instead different conditions, could you create a pattern, if you were expecting one kind of change with one condition and a different one with another, not just a linear pattern? So you can choose, as I said, from these predefined pattern sets, or a custom profile. You can say, I want to see a sinusoidal one; okay, it'll look for those. So yes, it's very customizable. Okay, so we've done ANOVA, we've done some pattern analysis, we've done some heat mapping. Now we can go from the univariate analysis to the multivariate analysis, and this is the PCA and PLSDA. Again, as Jeff mentioned, and as highlighted in the earlier statistical talk, what you're trying to do with PCA is essentially look at the data to see if there are separations. This is kind of the acid test of whether there's something real in your groups and whether there are trends. We're going to look at the scores plot and the loadings plot. The scores plot shows the most significant principal components, in this case as a two-dimensional scores plot. Then we'll look at the loadings plot to see which compounds are contributing to that separation. And if we look at a PCA plot in three dimensions, we'll be able to see the three most significant principal components. Unfortunately, we can't do a four-dimensional plot, which would let you look at four different principal components. There are different tabs for viewing. If we click just PCA, a few seconds later this graph pops up. It colors things and automatically generates ellipses around the clusters for you. We can see four different colors, and I think we can see decent separation between three of them here. And you can play around with the visualization. This, as it says in the tab, is the two-dimensional scores plot. We can then click the loadings plot for the PCA, and this displays the individual compounds that are largely driving the separation. If we look at the scores plot, you can see there's a trend going this way, a diagonal trend in this case. Sometimes you'll get trends that go horizontally, sometimes they go the other way. But the separation trend here runs along the diagonal. So when we look at the loadings plot, we want to look at the things at the opposite ends of that trend to find the ones most responsible for the separation. The compounds in the corners at the ends of that diagonal are the metabolites that are actually driving it. And we can click on them. If we click on one of them, I guess this one pops up: 3-PP, the propionic acid. We saw that one before; that was the one that also had the strongest trend and the most significant p-value in the ANOVA. This is mirrored in what we're seeing in the PCA loadings plot. The endotoxin will also be up in the corner, and then, what else did we have? Glucose. Those will also be the drivers that are separating things. We can go from two dimensions to three dimensions, so now we're looking at 3D, three principal components. And this is interactive; you can take your mouse and spin things around. Sometimes a three-dimensional scores plot is more informative. It's not always; sometimes it's hard to see. I don't know.
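For orientation, the scores, loadings, and variance-explained numbers being described map onto a PCA call like this sketch (scikit-learn, invented two-group data):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
X = rng.normal(size=(24, 10))
X[:12] += 1.5                        # two hypothetical groups of 12 samples

pca = PCA(n_components=3)
scores = pca.fit_transform(X)        # per-sample coordinates: the scores plot
loadings = pca.components_.T         # per-metabolite weights: the loadings plot

# The percentages printed on the plot axes are the variance explained per PC.
print(pca.explained_variance_ratio_.round(2))

# Metabolites at the extremes of a loading axis drive that separation.
print(np.argsort(np.abs(loadings[:, 0]))[-3:])
```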
Is it possible to generate an ellipsoid around these yet, or no? So, at least with the 2D you get the nice ellipses, and with these you at least get the nice interactive viewing. What we're seeing in the PCA plot is that there's some separation, some distinction. It tells us that our hypothesis, that there are metabolic perturbations from a grain diet, is there. It doesn't necessarily prove that more grain is bad for cows, but we know a little bit about the trending molecules: trimethylamine and dimethylamine are not great for cows. So let's go to the next one. Sorry, one question: do you get a significance score with the PCA? So there's the R-squared and Q-squared, which we generally use for the PLSDA. I'm trying to remember whether we also get that here; the significance is largely for the PLSDA. Choosing how many principal components you would go to? Yes, that's right. Generally, with the PCA axes, there should be some indication of how much of the variance is explained. So principal component 1 is 17% and principal component 2 is 14%. Not great. If you end up with principal components of 50% and 45%, that's really good; if you've got 99% and 1%, there's something kind of screwy. So this is a case where roughly 30-35% of the variance is being explained by the two principal components, which is good but not great. So the PLSDA is what you use second. Many people jump immediately to PLSDA without PCA, and that's not a good idea. If you saw a PCA plot that was just one big mass, and you couldn't see for the life of you that there was any separation, PLSDA may generate something separable, but that's more likely an artifact. And in fact, if you do the permutation analysis with the PLSDA, you should probably also find that it's spurious. But the fact that we are seeing a separation, and principal components explaining about 30-35%, suggests that our PLSDA will probably work out pretty well. So we choose the option, and as Jeff mentioned, PLSDA really tries to maximize the separation from a PCA plot. We're going to look at the PLS scores plot, we're going to look at the Q-squared and R-squared values, and we're also going to use the VIP to identify some of the more important metabolites. So in this case, we're looking at the PLSDA plot. Whereas the PCA plot had this trend, but it wasn't really obvious, the PLSDA has four very distinct clusters. It has pulled things apart nicely, which is good. But then we want to ask: is this an artifact? Is this simply PLSDA doing its thing and fooling us all? So we go on and click on the next tab, which is the cross-validation. If we use too many components in the fit, we get overfitting, and then it's essentially false.
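That overfitting check can be spelled out in a few lines. The sketch below computes R-squared from the fitted PLS model and Q-squared from cross-validated predictions, then runs a label-permutation test; it's an illustration of the logic with simulated data, not the exact procedure MetaboAnalyst runs.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(4)
X = rng.normal(size=(24, 10))
y = np.repeat([0.0, 1.0], 12)
X[y == 1] += 1.0                       # a hypothetical group difference

def q2_r2(X, y, n_comp):
    pls = PLSRegression(n_components=n_comp).fit(X, y)
    press = np.sum((y - cross_val_predict(pls, X, y, cv=7).ravel()) ** 2)
    rss = np.sum((y - pls.predict(X).ravel()) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    return 1 - press / tss, 1 - rss / tss   # Q2 (cross-validated), R2 (fit)

q2, r2 = q2_r2(X, y, n_comp=2)

# Permutation test: shuffle the class labels many times. A real model
# should sit far above the distribution of shuffled-label Q2 values.
perm = np.array([q2_r2(X, rng.permutation(y), 2)[0] for _ in range(100)])
pval = (np.sum(perm >= q2) + 1) / (len(perm) + 1)
print(f"Q2 = {q2:.2f}, R2 = {r2:.2f}, permutation p ~ {pval:.2f}")
```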
So in this case, this plots out the number of components against the overall accuracy and the R-squared and Q-squared values. You can see it trends like a log curve, so it flattens out. This is the number of components it's using, and the R-squared and Q-squared values are all above 0.7, which is very good, and they basically flatten out after about three components. So this is an evaluation of the model: it's not using hundreds of components, three components seem to work, and R-squared and Q-squared are above 0.7 (even above 0.5 is often good). So this checks out. Yes, we don't want 20 variables or 20 components; essentially, to get a good R-squared or Q-squared value, you want to minimize the number of components. It's a similar rule with ROC curves and things like that: keep the model simple, otherwise you're overfitting. The saying, I think, is that with five parameters you can fit an elephant; that's what statisticians say. So we can go to the next one, which is also within the PLSDA tab. We've looked at the scores plot and the cross-validation; the next tab is important features, and this produces a VIP, variable importance in projection, plot. This plots the metabolites that played an important role in the separation for the PLSDA, and as it turns out, also for the PCA. You've got four different colors, which represent the four different groups or clusters, and you have some numbers here; it's showing 15. What we're seeing again is 3-PP, the propionic acid; we're seeing endotoxin, LPS; we're seeing glucose. We're also seeing that opposite trend. Remember that 3-PP, if we looked at the correlation coefficient, had this incredibly negative correlation coefficient, and we saw that endotoxin had this incredibly positive one. The VIP plot doesn't care whether it's positive or negative; these are two very important ones. In the color coding, we see that one trend goes from red to green and the other goes from green to red. Typically there are some numbers, which are unfortunately cut off here, but you typically want VIP values better than 2, or better than 1.5. So you can make a cutoff and say, anything above 1.5 I'm going to keep; this is really good. Did the numeric scale at the bottom print off there, or is it gone? Anyway, there is a horizontal numeric scale that gives you a sort of cutoff value. Equally important is the validation, or permutation, test. Jeff went through permutation with you, and you can see this is a plot where it ran the test with different shuffled models; in this case it did a hundred permutations rather than a thousand. But the real model here is so far away from the rest of the distribution that you can say this is a very, very statistically significant result. If you wanted to run a thousand or ten thousand, I guess it could do that, but at this point it's so obvious you don't need to rerun it. In proteomics, with Mascot, this would be the same sort of thing as a Mascot score; it's just off the charts. So at this stage, everything's looking very good. We can now go into some cluster analysis. We've already done a little bit of heat mapping with the ANOVA, but these are just some examples. We can ask questions about how things are clustering: which metabolites have low concentrations in one group versus another, and which are significant.
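The heat maps in this section are ordinary hierarchical clustering on both axes. A one-call sketch with seaborn (invented block-structured data; the diverging red-blue palette is the colorblind-friendlier alternative mentioned earlier):

```python
import numpy as np
import seaborn as sns

rng = np.random.default_rng(5)
X = rng.normal(size=(24, 12))
X[:12, :4] += 2.0                    # a block of structure to cluster on

# Ward-linkage hierarchical clustering of both samples and metabolites,
# drawn as a heat map with dendrograms on each axis.
g = sns.clustermap(X, method="ward", metric="euclidean",
                   cmap="RdBu_r", standard_scale=1)
g.savefig("heatmap.png", dpi=150)
```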
We're also looking at the behavior of certain individual metabolites. Again, using the left panel, we can navigate down and rerun the PCA and PLSDA; we can see EBAM, SAM, dendrograms; and we can go to the heat map. Clicking on that one with the default parameters gives us this red-green heat map. It's a little hard to see, but you can see that it's clustered into the four groups, 0%, 15%, 30%, 45%, at the top, and then we've got the metabolites distributed through there. There's not a really obvious pattern that I can see, but it's just an example of what you can get with a certain heat map. Yes, question? In that case, wouldn't it look like that? That's right; well, there is a trend, we do know that it goes from 0% to 15% and up, it's just that I can't see it here. I think if we had a blue-white-red scheme, it might have been a little more obvious; it's not very clear with the red-green one. But this is a chance for you to play around with the data and see if you can make it look nicer or reorganize it. Again, that's the interactivity; there are different and sometimes better ways of doing it. So we've done the multivariate. Yes? So that seems like it would be unusual, to get a good Q-squared and R-squared but a failing permutation test? Yes; the best test is the permutation test, that's the one that's most reliable. R-squared and Q-squared are sort of an invention that SIMCA came up with. The permutation test shuffles the data, and if we have just a few samples, there are only so many orderings to try, so there are not enough samples to get meaningful values; that's the limit. I think the limit is around 12 each, probably 5 or 8 at minimum. Because for cell line samples, sometimes you just do a few biological replicates. If you can do more, certainly do more. I'm not sure if there's a setting; you can talk to Jeff and ask him to change some things on this, but that's the limit. If you've got nine samples and you're doing a permutation test of a thousand or whatever, it's just going to give you garbage. You have a choice in how many times you permute, and there's obviously going to be a limit to how many distinct permutations you can do with a small sample size. If you're over-permuting, I think you'll get garbage, but if you permute relative to the appropriate sample size, at least you get some measure. With a small sample size, though, you may never get something below 0.05; you might get 0.1 or something, I don't know. So at this stage, you can actually download a lot of the results you've just produced. You can download individual items, or you can download the report that's been written for you, and it actually does a pretty good job of reporting things. People can take that sometimes as a preliminary report, or use it to help build their actual paper. We've only got about half an hour here, so I'm just going to move through some of the next items. We'll look at the enrichment analysis, that's module number two. This is based on essentially the idea of gene set enrichment analysis. Has anyone heard of GSEA before?
So if you've worked in transcriptomics, that's been around for a while. Jeff came up with the idea of applying it to metabolites, and it is quite powerful and quite widely used now. There is an independent website, MSEA, which we still maintain. It has over-representation analysis, single sample profiling, and quantitative enrichment analysis. For any of this to work, you have to have a large predefined set of pathways, diseases, and metabolite sets. It's been a while since we've updated it; I think we still want to do some more updates. But it is quite useful. What you're trying to do is interpret: you're trying to get some biological information, interpretation in terms of pathways, diseases, sometimes genetic variations or localization. Right now it's designed for humans, because we're building on the Human Metabolome Database. It should in principle work for all mammals; it may not work in terms of absolute concentrations, since what's healthy for humans may not be healthy for a mouse, but it's not far off. It won't work for fish, though, and it won't work for plants or microbes. So in MSEA there are three types of analysis: over-representation analysis, single sample profiling, and quantitative enrichment analysis, ORA, SSP, QEA. We'll show you a little of how that's done, because we don't have a lot of time. In this case, we're not working with cows; we're going to work with humans. We're going to select a sample set of individuals who have lung cancer and colon cancer, some of whom developed cachexia, that's the muscle wasting. Not everyone with cancer develops it, but if you do, it's bad news; life expectancy is greatly reduced. As yet there's little understanding of why cachexia develops, and almost nothing they know how to do about it. The intent of this study was to see if you could predict which people would develop cachexia before it happened, and maybe try to have an intervention. And in fact, you can predict cachexia through metabolomics. So here we are, looking at MSEA. We've uploaded our sample set and chosen the enrichment analysis, and we can look at our compound list. We're just using a list of metabolite names that have KEGG matches. This is considered a weak analysis; we're not using concentrations, we're simply saying, here are the things we are seeing. Because we have names of compounds that we've identified, we have to watch out for misspellings. In this case there was a typo: isoleucine was spelled without an EU. So it's highlighted the problem with the typing, and then it allows you to go through and correct that naming problem; it makes the best guess, and you can match it to either PubChem, HMDB, or KEGG. Once you've fixed any problems, and in most cases you don't have to, you can choose your metabolite set library. There are a bunch of libraries: pathway-associated ones, disease-associated ones, some specific to blood, some specific to urine or CSF, some associated with SNPs (there are a number of those studies coming out), and others related to dysfunctional enzymes. So these are different ones you can try, and as I say, in principle it can also work with other mammals. You click the one you want; we chose the top one, and this is the result. In this case it's looking at pathways, and what's being disturbed in cachexia is glycine, serine and threonine metabolism, protein biosynthesis, and phenylalanine metabolism. In fact, we see a lot of amino acid metabolism being disturbed in cachexia.
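Under the hood, over-representation analysis of this kind is usually a hypergeometric test per metabolite set: given how many of your significant compounds landed in a pathway, how surprising is that overlap by chance? A toy sketch with invented counts:

```python
from scipy import stats

# Toy over-representation test for one hypothetical pathway.
N = 1000   # metabolites in the reference library (e.g. all KEGG-mapped ones)
K = 30     # of those, members of the pathway being tested
n = 50     # metabolites flagged as significant in the study
k = 8      # flagged metabolites that fall inside this pathway

# Probability of seeing k or more pathway members in n random draws:
# the hypergeometric survival function.
p = stats.hypergeom.sf(k - 1, N, K, n)
print(f"ORA p-value: {p:.2e}")   # repeat per pathway, then FDR-correct
```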
That partly makes sense, because cachexia is a muscle loss process; it's largely a function of cancer, and of cancer cells trying to grab as many essential amino acids as possible. But it also reflects the perturbation we see in glycine and serine metabolism in cancer generally. So it's reflecting both the process of muscle wasting and some of the processes involved in cancer. So that's the plot; things marked in red are more significant. You can also flip around and view this as a table and look at the p-values and false discovery rates, how significant it all is. So again, both the color scheme and the actual numeric values let you get to where you're wanting to go. You can also click through to the pathway databases, the Small Molecule Pathway Database, SMPDB, which allows you to explore this a little more. Here we're looking at phenylalanine and tyrosine metabolism, and this allows you to see what's written about it; there are some references, and you can understand how it's related, in this case, to liver function. So that's the simplest and most primitive one, over-representation analysis. The next one is actually probably more appealing to people, and this is the single sample profiling. This is basically the way that doctors think: what's your readout for glucose, is it above 6 millimolar in blood, or is your creatinine level above X? This is how they make decisions about whether there's diabetes or kidney failure or other things like that. So in this case, it's not just a list of compounds; it's a list of compounds and their concentrations. What we've done with the HMDB is tabulate all of the normal and abnormal concentrations, so it flags things as being, you know, abnormally high or abnormally low. It's looking at the reference concentration range, reporting the measured concentrations, and then indicating whether those things are within range or out of range. You can then look at these things in a little more detail and see where this sample sits. So here's the reference range from the different studies, and here's where this person is: way off scale, so something's wrong. That's the single sample profiling. It's probably underutilized. We've been re-tabulating a whole bunch of concentration ranges for the last couple of months, so we'll probably try to update that. In some respects, this is what people would like to have for precision medicine in metabolomics. Quantitative enrichment analysis is maybe the most powerful one, and I guess it's the preferred one. You again select different things: a list, typically with concentrations, and some kind of group label. And these are the results you'll get, the same kind of presentation as we saw before with the over-representation analysis. You can click on things, you can get false discovery rates. It's somewhat similar, but if you do click on things, you'll get more detailed graphs than you get with ORA. I'm kind of rushing through because we're a little behind time, but these are things you'll be able to explore in more detail with your own data sets. Pathway analysis is separate from enrichment analysis, and the idea here is to look at pathway structures and to help with pathway visualization. Unlike MSEA, which is limited to humans, pathway analysis covers 21 model organisms, so you can look at humans and mice, but you can also look at fruit flies and yeast and E. coli.
Right now it uses the KEGG pathways. We're hoping to migrate to the SMPDB pathway sets, because we now have model organism data for most of these. Again, KEGG is great, but it doesn't cover that many pathways; that's one reason we're trying to migrate to the SMPDB, the Small Molecule Pathway Database, which covers disease and signaling pathways. In this particular example, we're using the same set, the cachexia group. We uploaded the data from urine, selected the pathway analysis module, and uploaded the data; you can either paste it in or upload a file. This is somewhat similar to the statistical analysis we did before, so you do have to do some data normalization: scaling, auto scaling, log transformations, or whatever else is needed. You can see how things are marked here and what was used. From there, you choose what sort of organism you're looking at. In the case of humans, we'll choose the mammals and check off Homo sapiens. So this is the pathway library we're choosing from, one of the 21, and then we're going to be able to do some topological analysis of the pathways. In network topology analysis, you're basically looking at whether a metabolite is a hub or a spoke, the same sort of thing that's done with genes or proteins: are there hub proteins or spoke proteins? Generally, I think people are finding now that the hubs are very, very important; hubs are highly connected. Through the topology analysis, you can also start identifying bottlenecks, or the shortest paths between different nodes. There are terms from graph theory, degree centrality and betweenness centrality, that you can use: the bottlenecks sort of have this high betweenness centrality, and the hubs have high degree centrality. So you can plot some of this out, with the pathway impact on one axis and the negative log of the p-value on the other, and you can see the things that are darkest red. In this case, the glycine and serine pathway, which we previously identified using over-representation analysis and the other tools, has the most significance and the greatest impact in terms of its centrality, or role, and the frequency with which its metabolites show up. If you click on the KEGG pathway, shown here in a simplified version, these are all the metabolites involved in glycine and serine metabolism. We don't have full coverage, it was only NMR, but you can see almost 10, I think, in there. So that's significantly over-represented: it's involved in a number of pathways, and based on the structure of that topology, you can see which metabolites might be more hub-like and which might be more spoke-like. Yes? Is there directionality in these diagrams, given that a lot of the reactions are reversible? That is a consideration, but you can see you've got arrows. It's very basic, but that's partly why the arrows are drawn, to show some semblance of directionality, if you want. So you can see where some nodes are potentially hub-like or bottlenecks, or where the flux and flow would typically come from. And in this plot, just like when you're looking at the PCA, you're looking for things higher up; those are the most significant ones, and the ones down here, by circle size and color, are much less significant. Clicking on the circles pops up the pathways, and then clicking on the individual metabolites, you can see the differences plotted out in a box-and-whisker plot.
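The two centrality terms have compact definitions, and a toy graph makes them concrete. This sketch uses networkx on an invented metabolite network; the node names are placeholders, not the actual KEGG pathway:

```python
import networkx as nx

# A toy metabolic graph: nodes are metabolites, edges are reactions.
G = nx.Graph()
G.add_edges_from([
    ("glycine", "serine"), ("serine", "pyruvate"), ("serine", "cysteine"),
    ("glycine", "threonine"), ("pyruvate", "alanine"), ("pyruvate", "lactate"),
])

# Degree centrality: how many neighbors a node has (hub-ness).
print(nx.degree_centrality(G))

# Betweenness centrality: how often a node sits on the shortest paths
# between other nodes (bottleneck-ness).
print(nx.betweenness_centrality(G))
```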
So again, it's very interactive, and it converts the data for you. The pathway impact includes the log fold change, the significance, and the topology, all combined into one score, and in essence it's using that overrepresentation analysis underneath. You can convert things into tables and click on things to see more detail; false discovery rates and p-values are provided.

In the last 15 minutes here, we're just going to look at biomarkers, the biomarker module. There's a clear trend in metabolomics, particularly in applications to humans but also in veterinary work and animal systems, and we talked about it before: because we can generally measure metabolites quantitatively, they are really, really useful as biomarkers. It means the biomarker you measure here is just as valid as one measured in the States, in South America, or in Africa, and that's a historical reason why clinical chemistry is so widely used in diagnosis, prognosis, and monitoring. It works for animals, and lots of examples of biomarkers are showing up in veterinary medicine as well; it works for plants too. So I think these are really useful.

What this part of MetaboAnalyst does is calculate the ROC curves, and it will do it in three different ways: using univariate markers, using multivariate markers, and then there's a manual approach. What you're trying to do, as Jeff mentioned, is maximize the area under the curve while minimizing the number of metabolites. It's the same sort of thing we were doing with R-squared and Q-squared, maximizing those values for accuracy.

In this case we've got a data set we can load. It's actually a set of mothers at three months of pregnancy; half went on to develop preeclampsia at six months, the others had normal pregnancies. Preeclampsia is a high blood pressure condition, and it's one of the leading causes of infant mortality in the developed world. If you can predict or identify the individuals who are going to develop preeclampsia, there are some really simple interventions; aspirin, and Viagra actually, can prevent preeclampsia. So if you can identify these people at three months, you can put them on a prophylactic treatment so they never develop it. A nice blood test to detect preeclampsia early would therefore be very helpful.

We uploaded the data and did a data integrity check, just as we did with the classic statistical analysis. We had to do some scaling and normalization; with this data we did a log transformation. As we saw before, the raw data has this skewed kind of shape, but after the appropriate log transformation it takes on a nice-looking normal distribution. It took a little bit of time to figure out which options were best: the first attempt looked very skewed again, the second looked a little better, and with the third we got a good combination of normalization, scaling, and transformation.

Now we go to the ROC curve analysis, and we've chosen one of the three approaches, the multivariate one. Just click that, wait a few seconds, and boom, this is the ROC curve you get. It presents sensitivity on one axis and specificity on the other, and what we're seeing is that with as few as two, three, or four metabolites we can get an area under the ROC curve of almost 98%, which is kind of amazing. So this does suggest that preeclampsia is predictable by metabolomics. You can take the model a little further: here we're using a support vector machine, and we're going to compare all the models for this.
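To give a rough, self-contained illustration of the multivariate ROC idea, here is a sketch in Python with scikit-learn on synthetic data. This is not MetaboAnalyst's implementation, and the sample and feature counts are made up; the point is just the pattern of scoring a classifier with cross-validation and reading off the curve and its area.

```python
# Rough sketch of a multivariate ROC analysis with a linear SVM on
# synthetic data; sample and feature counts are made up for illustration.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_predict
from sklearn.svm import SVC
from sklearn.metrics import roc_curve, roc_auc_score

# Stand-in for a metabolite concentration table: 120 "patients", 20 features.
X, y = make_classification(n_samples=120, n_features=20,
                           n_informative=4, random_state=0)

# Cross-validated decision scores, so the ROC curve is not just
# memorizing the training data.
clf = SVC(kernel="linear")
scores = cross_val_predict(clf, X, y, cv=5, method="decision_function")

fpr, tpr, _ = roc_curve(y, scores)   # 1 - specificity vs. sensitivity
print(f"cross-validated AUC = {roc_auc_score(y, scores):.3f}")
```

Using cross-validated scores matters: an AUC computed on the same samples the SVM was trained on would be optimistically biased, which is exactly what the confidence interval discussion that follows is guarding against.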
It will also calculate a 95% confidence interval based on the number of samples. This particular model that we chose uses just two metabolites and has a confidence interval between 91 and 99%. For ROC curve analysis, this is typically what people want to see: not just the clean curve, but the confidence band based on your sample size. If we had 10,000 people, the confidence interval would probably be as thin as the line; if we only had 15 patients, it would probably cover most of the plot. This is where more samples help improve your confidence.

We can go a little further. Here we just dove straight in; we didn't do any statistical analysis first, we just wanted to know whether we could see a decent separation and get a good ROC curve. That's exciting, so let's see which metabolites are playing a role here. In this case we see that glycerol, hydroxybutyrate, choline, and acetate all have a very significant role in this disease. How and why is still a bit of a mystery, but it seems to be related somehow to lipid metabolism. We could go back and explore this using metabolite set enrichment analysis and other tools to see if we might figure out what we're seeing. But for the people we were working with, they just wanted a marker, because they wanted to be able to build a test.

Power analysis, in our last five minutes here. This is increasingly an issue for people in large-scale studies, whether with patients, animals, plants and trees, or even microbes: how many samples, how many technical replicates, how many biological replicates? In genomics, especially in GWAS studies, they ask this question all the time, and if you can't give a good answer you don't get your grant, or you get kicked out of your department. Power analysis is also becoming important in metabolomics, because a metabolomic study can be expensive, and where do you draw the line, rather than just collecting and collecting and collecting?

Now, the problem with power analysis is that you generally have to have pilot data. If someone presents a power analysis without any pilot data, they don't know what they're talking about, because you need some preliminary data to be able to do a power calculation. You can cheat, though: you can look for an equivalent study. If we were doing preeclampsia and wanting to look at blood, we might look for something in the literature where someone was studying, say, high blood pressure in older men, 30 and 30, and based on that data extrapolate to pregnant women to predict what it would be for preeclampsia. That's what people do a lot anyway.

If you have some preliminary or pilot data, you can do power analysis using MetaboAnalyst, and it's not just for metabolites; you can do this for anything. Just like with the heat map function in MetaboAnalyst, you could dive right in with proteomics, genomics, whatever. With power analysis, a power of 0.8 means there's an 80% chance of ending up with a statistically significant treatment effect. So essentially you can give a quantitative number saying how powerful your study is, based on the design, the number of patients, and the type of features you're looking for.
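As a sketch of the kind of calculation a power module performs, here is a hedged example in Python using statsmodels for a simple two-group t-test. The effect size and alpha are hypothetical stand-ins for values you would estimate from pilot data, so the printed sample sizes are illustrative only and won't match the example that follows.

```python
# Hedged sketch of a power calculation for a two-group comparison,
# using statsmodels; the effect size and alpha below are hypothetical.
import numpy as np
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
effect_size = 0.5   # assumed Cohen's d, as if estimated from pilot data
alpha = 0.05        # significance criterion; it doesn't have to be 0.05

# Solve for the per-group sample size needed at 80% and 90% power.
for target in (0.8, 0.9):
    n = analysis.solve_power(effect_size=effect_size, alpha=alpha,
                             power=target)
    print(f"power {target}: about {int(np.ceil(n))} subjects per group")

# The module's plot is essentially this curve, power vs. sample size,
# evaluated over a range (e.g. n = 3 to 1,000) using your pilot data.
ns = np.arange(3, 1001)
curve = analysis.power(effect_size=effect_size, nobs1=ns, alpha=alpha)
print(f"power at n = 30 per group: {curve[27]:.2f}")
```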
Just a quick question: for your pilot data, how many replicates would you need to be able to do the power analysis? It's really hard to know. The example I often give is that around 1910 or 1912, whenever it was, Garrod described alkaptonuria, one of the inborn errors of metabolism, and he had just one patient. But the power was sufficiently strong, because with PKU and alkaptonuria the compound is just so off the map in terms of its presence. That's often what's done for inborn errors of metabolism: they usually have one, maybe two, patients, the power analysis is sufficient, and a country-wide test can be instituted. For other conditions you're looking at more subtle things, and that means you have to have more samples, so it is sometimes hard to know. But a rule of thumb for just about everything is a sample size of 25 or 30. This is one of the reasons why classes in schools have between 25 and 30 students: it gives you a sufficient number to determine a reasonable distribution. There's a reason class sizes are the way they are, and it was largely driven by statistics, to help distinguish the A students from the D students and so on.

Sample size, effect size, significance criterion: those are all things you can choose, and the criterion doesn't always have to be 0.05 or 0.1 or whatever. As you increase your sample size, you increase the power; as you decrease your significance criterion, you also decrease the power. So with that pilot data, as we said, you use whatever p-value or FDR suits you; you may have a budget, you may have limits on what you can do. If you upload your pilot data, you can calculate the power for sample sizes ranging from 3 to 1,000. For this example, based on the pilot data that was uploaded or synthetically generated, you see a plot of power against sample sizes from 0 to 200. If you want the 80% cutoff, you can just read along that line: basically you need about 55 to 60 people to achieve a power of 0.8, and if you wanted 0.9, it looks like you'd need about 140 or so. That's the plot; it's a fairly simple log-shaped curve, but it uses your pilot data, or the hypothetical pilot data you may have gotten from another study, to guide you.

There's not enough time to cover everything in this presentation. There are lots of different clustering methods, lots of different classification methods that use machine learning, like support vector machines; there's time-series data, two-factor analysis, and integrative pathway analysis with both genes and metabolites. Here are just some interesting shots showing what the time-series analysis looks like. It's similar to the pattern analysis we talked about, but it has other cool plots and ways of looking at the data that Jeff has been working on. He's really trying to move metabolomics into more of a systems biology perspective, integrating gene expression data with this, and in principle you could probably map protein expression or proteomics data in as well. So I think it is now exactly 12:30, and we'll wrap up here.