So, we'll try to have a sound and light and firework show just to keep this MetaboAnalyst module engaging. I think Jeff has done a really nice job of introducing statistics to you, and a lot of what MetaboAnalyst is about is taking the statistical tools you've just learned about and making them very easy to handle and very easy to use, basically by pointing and clicking. Most of today will be devoted to MetaboAnalyst: I'll introduce it to you, and then your lab for most of the rest of the day will be focused on using it. We'll talk a little bit about the standard data analysis workflow for MetaboAnalyst so you can understand the basic layout, because it is structured and intended to be sequential, to follow or guide you through a standard data analysis workflow. We're going to talk about some of the things involved with data integrity checking, detecting outliers, QC or quality control, normalization, transformation, scaling, and centering, and then the multivariate statistics, and we'll use several examples to show how the data analysis can be done. In addition to this introduction, I think we sent out a chapter from Current Protocols in Bioinformatics that describes MetaboAnalyst. Did anyone see that or download it? I don't know if we got that text out, but anyway, it goes screenshot by screenshot and page by page through MetaboAnalyst. I think we were going to make that available as pre-reading, but we'll try to make it available for you for the lab as soon as possible.

In a typical metabolomics experiment you will work with both biological replicates and technical replicates. So if you're working with cases and controls, as an example, the pink and blue cases and controls, there might be 40 cases and 40 controls; you don't usually do an experiment with one case and one control. Those 40 cases and 40 controls are biological replicates. You'll also occasionally do things called technical replicates, and this is often part of your quality control and quality assurance process, where in some cases you take a couple of extra samples and rerun them at the beginning of the day or at the end of the day, or sometimes remeasure them. Some people routinely do technical replicates for everything just to ensure there's good reproducibility. This is particularly true for untargeted metabolomics, where quality control is really crucial. For targeted metabolomics you generally don't need to do that many technical replicates, although it is a good idea to have some QC samples. The biological replicates give you the numbers to get some statistical assurance, to get good P values, and to address the issues of false discovery.

As we talked about yesterday, there are basically two routes for metabolomics: the targeted approach and the non-targeted, or chemometric, or profiling approach. We went through both techniques yesterday, where we did targeted GC-MS, targeted NMR, and then untargeted LC-MS. With the data workflow, and I mentioned this before and I'll repeat it, there are differences between the two. For untargeted metabolomics, you first look at the quality of the data: you check to see that everything was measured and that there are actually real values in your data matrix.
Then you do all kinds of spectral binning and alignment (that's what we did with XCMS), some normalization, some quality control and outlier removal, then the statistics, data reduction and data analysis, and then, when all is said and done, you try to identify your features. Targeted methods do some quick data integrity checks at the very beginning, then the compound identification and quantification, then the data normalization and QC, and then the data reduction and analysis. So there is a difference in order, and basically one route does identification at the beginning and the other does identification at the end.

So what is data integrity and quality checking? In the case of LC-MS and GC-MS, we talked about this before, and we went through it in real time with the XCMS demo: there are lots of false positive peaks. We went through the issue where you see adducts, neutral losses, isotope patterns, and other things that will produce false positives. NMR doesn't have that problem, so we didn't dwell on it with the NMR analysis. To process these things and deal with those false positives, we went through some examples; there are tools on many mass spec instruments that help facilitate that, and there are also tools online to do it. We also went through the process of data alignment and spectral alignment; we saw how it's done, or typically done, for LC-MS and can be done for GC-MS. There is some need for spectral alignment with NMR as well, which is a result of pH variations; again, we saw some programs that do that, and most of the alignment tools are based on what's called time warping.

There are also processes, not so common now but still used in some groups and some projects, where spectra are binned, meaning they are broken up into bins. This is to reduce the size of the data matrix. In this case we're cutting the spectrum up into, I don't know, about 15 different bins, and in each bin we basically measure the overall area under the peaks and give it a position: a chemical shift, retention time or mass-to-charge value. That has essentially simplified what might have been 3,000 data points into, in this case, 14 bins. So this is a simple data reduction method; you'll see it used in some cases for data within MetaboAnalyst, and a number of groups still use it.
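As a rough illustration of what that binning boils down to, here is a minimal sketch using hypothetical NumPy arrays (not any particular spectrum, and not MetaboAnalyst's actual code):

```python
import numpy as np

# Hypothetical spectrum: ~3,000 points on a chemical-shift axis (ppm).
ppm = np.linspace(10.0, 0.0, 3000)
intensity = np.random.rand(3000)          # stand-in for real spectral intensities

bin_width = 0.7                           # ppm per bin (arbitrary choice here)
edges = np.arange(0.0, 10.0 + bin_width, bin_width)
bin_ids = np.digitize(ppm, edges)

# Sum the intensity falling in each bin (a simple proxy for "area under the peaks")
# and record the bin's average position on the axis.
binned = np.array([intensity[bin_ids == b].sum() for b in np.unique(bin_ids)])
centers = np.array([ppm[bin_ids == b].mean() for b in np.unique(bin_ids)])
print(len(binned), "bins instead of", len(ppm), "data points")
```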
Another thing that Jeff mentioned comes into the area of what we call data normalization, scaling, centering and transformation, and it's really confusing, I think, for everyone, because the same words mean different things to different people. I'll apologize right now, because sometimes I will slip into talking about normalizing when it's really scaling or transforming. Normalization in the world of physics and math often means adjusting things so that they have a minimum and maximum between zero and one, or zero and a hundred, but that can also be called scaling; you just adjust things so that everything is on a different unit of measurement. Normalization in statistics actually means converting something that had a skewed distribution into a normal or Gaussian distribution, so it's a fundamentally different definition. So this is where there's a problem, and people also use the term transformation to mean the same thing, converting a skewed distribution into a normal or Gaussian distribution. These terms are confusing and are used in the literature rather haphazardly, and again, they mean different things in different disciplines and different fields.

An example where you have to deal with normalization, scaling and transformation is when you've got samples that are diluted. This can happen with urine; it doesn't happen with blood; it happens with saliva, with cell culture media, and it can happen with cell cultures. It has to do with the fact that, as in this example, there's a concentrated sample and a dilute sample: you can see the signals are high in one and low in the other. So how do we adjust for dilution effects? There are different ways of doing that. We might integrate the total area of all the peaks and say we want roughly the same area across samples. In the case of urine, people will measure the refractive index or the osmolarity or osmolality, or they'll calculate the specific gravity. All of those are ways of normalizing, or scaling, or adjusting, so that the dilution effects are consistent across all samples. You can easily end up in a situation where the cases have very dilute urine and the controls have very concentrated urine, and all you're detecting is that fact, which often has nothing to do with the condition; it might just be that the cases were very thirsty and the controls weren't. You can also put in internal standards to adjust for this: in NMR we often put in a DSS standard, which helps us adjust things if need be. People may weigh a sample, or measure its volume, to make sure things are consistent. This is probably the number one problem in metabolomics: people not properly adjusting for dilution effects, so it's very important. And there are mathematical methods, including probabilistic quotient methods, that are also used. What you do and how you do it depends on the biofluid, on your capabilities within the lab, and on how the samples have been prepared or given to you.

There are also situations where samples look largely the same, but some compound or several compounds are 10 to 100 to 1,000 times higher in a certain sample. Here's one where this compound is perhaps 100 times greater than it is in all the other samples. You might call this an outlier, but for several people, or animals, or plants, it can be way off the scale, and this is what skews your distributions a lot. When you have a skewed distribution, that's when you start using these log transformations, or auto-scaling, or Pareto scaling, or probabilistic quotient methods, or range scaling, and that helps bring things in so they look like normal distributions. So you can see where I'm mixing terms again: sometimes scaling, normalization or transformation means converting something so that it has a normal distribution, and in other cases I'm saying normalization, scaling or transformation to mean adjusting for dilution.
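To make the dilution-adjustment side of this concrete, here is a minimal sketch of total-area (constant sum) normalization and a probabilistic quotient style correction, assuming a NumPy matrix X with samples in rows and features in columns; this is illustrative only, not MetaboAnalyst's actual code:

```python
import numpy as np

def total_area_normalize(X):
    """Scale each sample (row) so its summed signal is the same for every sample."""
    return X / X.sum(axis=1, keepdims=True)

def pqn_normalize(X):
    """Probabilistic quotient normalization: divide each sample by the median ratio
    of its features to a reference spectrum (here the median across samples).
    Assumes the reference has no zero-valued features."""
    X = total_area_normalize(X)                 # a common first step
    reference = np.median(X, axis=0)
    quotients = X / reference
    dilution = np.median(quotients, axis=1, keepdims=True)
    return X / dilution

# e.g. X = np.loadtxt("peaks.csv", delimiter=",")   # samples in rows, features in columns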
That's a problem of semantics and terminology which is unfortunate, but hopefully when we say normalization, scaling, transformation and centering, we can put all of these things together to say: okay, we're just trying to make our data a little cleaner, a little more normal, and also adjusting for quality control.

There's another phenomenon, which is filtering, and filtering is not so much scaling or normalization; it's literally removing data. In many cases, removing data is legitimate. Getting rid of a water peak in NMR is legitimate. Getting rid of noise peaks in mass spec is legitimate. Getting rid of those false positives that are essentially part of the column matrix coming off GC or LC runs is legitimate. But it still needs some justification; you have to explain why. Unfortunately, lots of people think removal of data is standard practice: just take an eraser, remove the outlier, and it's gone, and that's all you need to do. But you ultimately have to explain why.

For my data, actually, we ran about 100 metabolites from more than 100 patients. So if we find that some data is missing, no data at all, can we just remove it? No, not necessarily. This is where people have to be aware that if there is no data, it may just mean that in some cases the value is below the limit of detection. If you've identified that this is alanine, or something like that, it's probably there: it's a compound that's essential to life. So removing alanine from all of your analyses because, in a few cases, you didn't have detectable measurements for it is erroneous, and it could lead you to some false conclusions. Typically what people do is take the lower limit of detection and divide it by two, and put in essentially a fake or synthetic value, so that that metabolite can still be considered as part of the statistics and the analysis. Now, there's a point where you have too many samples below the limit of detection, and that could be around 20 to 30%. At that point you've reached a stage where probably the instrument isn't sensitive enough, the assay isn't effective enough, and so it's probably wise to remove that compound entirely from the analysis. But that needs justification: you need to explain explicitly what you've done and why you've done it. If you don't say why, people are going to look at your raw data and say, you know, what were you thinking? What were you doing?

In my case it's 6%. 6% of the data, so that is not so much. That's right. But the problem is, if I include this data — my data is actually LC-MS — that metabolite will not always be detectable. And if something is not detectable, and we design some diagnostic panel around it, that won't be possible. No, it is certainly possible. When you've got 6%, that's legitimate, and you can use that metabolite as a biomarker. Obviously you'd like to try to make your assay a little more sensitive over time, but if it's an effective biomarker, or part of an effective biomarker panel, at 94% detection, then having something below the limit of detection and giving it a value of half the detection limit is completely legitimate, and you can include it as a biomarker. It's commonly done, even qualitatively, in inborn errors of metabolism: there are things they look for that are absent, and they say, I can't find it, therefore this person has this disease. The system can't detect it, but that's the test; if you can't see it, that's part of the test. So anyway, we could go on and on about that, and we can talk about it more in the lab.
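Coming back to the below-detection-limit handling just described, here is a minimal sketch of a missingness filter plus half-minimum imputation (a stand-in for LOD/2), using pandas on a hypothetical concentration table; it is illustrative, not MetaboAnalyst's exact procedure:

```python
import pandas as pd

def handle_missing(df, max_missing_fraction=0.3):
    """Drop compounds missing in more than ~20-30% of samples, then replace the
    remaining missing values with half of each compound's lowest detected value."""
    keep = df.columns[df.isna().mean() <= max_missing_fraction]
    filled = df[keep].copy()
    for col in keep:
        filled[col] = filled[col].fillna(filled[col].min() / 2.0)
    return filled

# df = pd.read_csv("concentrations.csv", index_col=0)  # samples in rows, metabolites in columns
# clean = handle_missing(df)
```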
So data filtering, as I said, is another place where people make many mistakes. Dimension reduction was the PCA and PLS-DA that Jeff talked about, and we also talked a little bit about clustering, which helps with similarity detection. All of these tools and techniques that we introduced have been incorporated into MetaboAnalyst.

Maybe before going into this: how many people have actually used MetaboAnalyst? Two, three, four, five — so about half or two-thirds. Just for context, MetaboAnalyst is actually the most widely used analysis tool for metabolomics. It has, I think, an average of six or seven thousand users per day, every day, and roughly a third of all papers published in the field of metabolomics use MetaboAnalyst. It was developed by Jeff while he was in my lab, and now he's continuing to develop it in his lab at McGill. It's a web server, and that was the novel part about it: it took metabolomics analysis away from custom software that you paid lots of money for and put it online. The second thing it did was make the analysis very easy and very visual. It's also very general: it works for LC-MS and GC-MS and NMR. It's particularly suited to targeted metabolomics, but you can put any type of data, including untargeted data, into MetaboAnalyst and it will work. The first version came out in 2009, and that introduced most of the major statistical techniques Jeff explained; it introduced a lot of the color plots and it introduced the workflow. Version two added a number of other tools for higher-level interpretation: metabolite set enrichment, pathway analysis and ROC curve analysis. And then the 2015 version enhanced things much more: it improved overall performance and added power analysis and integration with gene expression, and so on. That's the current version, and it's always being improved, with Jeff working on additions and my lab also doing a few things to update it.

So what we'll do here is give an overview, and then you'll have a chance in the lab for a few hours this afternoon to give it a try. We'll look at raw data processing; we'll look at what's called data reduction or dimension reduction and statistical analysis; we'll look at enrichment analysis, pathway analysis, power analysis and biomarker analysis; and we really won't have time to get into integrative analysis. MetaboAnalyst is divided into eight major modules, and it's too small for me to see, but there's statistical analysis, pathway analysis, enrichment analysis, targeted pathway analysis, integrated pathway analysis, and so on. Each of these can be selected very easily, and they are all part of a general workflow that takes you through data preprocessing, data normalization, data reduction and data analysis — that's the statistics — and then data or pathway interpretation. And they're all different; they may sound the same, but in fact they are distinct, and this is sort of embodied here, so think about the modules.
The blue boxes at the top represent different types of input: that could be GC- or LC-MS raw spectra, it could be peak lists, it could be bins, which we just talked about, or peak tables, or it can be concentrations. As I say, generally, if you can get relative or absolute concentration data, that's the most efficient, because it's a small data set and it's much more informative. MetaboAnalyst will do a variety of things: it will do name mapping and name normalization (we'll see some examples of that), it will check data integrity, and it will also, if necessary, deal with peak alignment or peak detection. All the data, no matter whether it's LC, GC, NMR, raw spectra, peak lists or whatever, will go through some sort of data-filtering step, and then you, as the user, have to do the data normalization, transformation and scaling; we'll talk about that. From there, you can do a variety of multivariate statistical analyses, you can do time series analysis (we won't talk about that), biomarker analysis, power analysis, pathway analysis, metabolite set enrichment analysis, integrated pathway analysis, and then other utility functions that can also deal with batch effects and batch corrections. Again, we won't have time to cover those in today's discussion, but you can try them in the lab.

If you go to MetaboAnalyst — just type the name anywhere and it will take you to the website — it looks like this. I think the toughest part about using MetaboAnalyst is finding out how to start it: you have to look in this tiny little spot that says "click here to start." I've been asking Jeff for almost six years to make the font bigger. On the side is a list of hyperlinks that lets you navigate through the system; that's on the left, with the arrow. Jeff keeps a steady update of what's new, what's been added, and what has changed, and then there's the citation information below that. If you click on the data formats link, you'll see some example data sets, and this has been really important; this is something you will hopefully use. You can also use the data you generated yesterday, your own data, if you want. But if you just want to get a feel for this, there's a whole range of example data sets coming from our lab that you can use and download. Some of them are in different styles or formats: some are compound concentrations, some are raw spectra, some are binned, some are whatever. So you can download them if you wish, or you can use the interactive tool, which, as you get in, also has example data sets for each of the specific analyses. Again, that's very useful for getting a feel for how to use MetaboAnalyst.

So let's get into some metabolomic data processing. You've found the MetaboAnalyst site, and what you're going to do initially is convert the raw data into a matrix that's suitable for statistical analysis. In most cases, if you're using your own data, it's going to be in an Excel format or a comma-separated value (CSV) format; it has to be in a table. You can have concentration tables, peak lists, bin lists or raw spectra. We're going to focus, just for simplicity, on the targeted analysis, so these are concentration tables; they can be absolute concentrations or relative concentrations.
So even the XCMS data you generated yesterday could be put in this format, because it would be peak values or integrated values. It has to have a label and a number. Sure — multiple times, bringing hundreds of computers down. So yes, since we're going to have 30 people using this at once, we want to try to keep it fairly free.

Anyway, if you click "click here to enter," this is the first screen you're going to see, and it's actually going to be eight panels; I can only show six. From here, what you do is click on statistical analysis, in the top left corner, which should be the natural thing to do because you all read from top to bottom and from left to right. Once you click that panel, what you'll see is this screen here. There's a little window on the left, which is hyperlinked; it's a process window that takes you through the steps: step one, step two, step three, four, five and six. That first step is upload, and you have two options. One is to upload a tab-delimited file; the other is to upload a zip file. You can use the files you downloaded or the files you generated yesterday, as you wish, or, if you scroll down that window a little further, there's a second option, which is "try our test data." If you're just trying to get a feel for MetaboAnalyst, this is the best way to do it: scroll down and click on one of these test data sets. I'm going to use this one, test data set number two, which will be loaded automatically; it's already sitting on the server, so you don't have to use your computer to upload it. But as I say, there are two options: take the data you generated yesterday and put it in, or take the data that's provided there, free for you to try.

The data I'm using in this example is from dairy cattle that were fed different proportions of cereal grain, basically barley. In Canada, and actually in the U.S. as well, most cattle are fed grain to supplement their diet, but cattle in, say, Ireland, or many parts of Europe, Australia and New Zealand, are fed mostly grass. And there's a difference that has been noticed in terms of the health and longevity of the cows: certainly as you increase the grain proportion, the cows get increasingly uncomfortable and also seem to have a number of disorders. So the researchers wanted to figure out what is happening when you increase grain in cattle. Grain is very energy-rich, and the assumption is that if you can get cattle to eat lots of grain they grow quicker. They appear to, but obviously at some cost to their general health. So they decided to look at the rumen; this is the gut fluid, if you want, in cattle. They're ungulates: they have these multi-chambered stomachs, and there are a few gallons of rumen fluid in each cow, and this processes the grass into fuel, food and methane. We used NMR to analyze it.

The first thing that happens, once we've uploaded that data set, is that it immediately takes us to this screen here, which is doing a data integrity check. It looks to see what data is in which column and which row: samples are in rows, and features, meaning the compounds, are in columns. It will tell you whether the file is comma-separated or not. If you've messed up the file format, it will complain, and then you can go back and reformat your file. It will also look to see whether some things are empty, whether there are missing values.
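Just to make the expected layout concrete, a concentration table for this kind of study might look something like the following; the sample names, group labels and numbers here are made up for illustration, not the actual rumen data:

```
Sample,Label,Glucose,Alanine,Lactate,3-Phenylpropionate
cow_01,0%,5.2,0.41,1.3,0.08
cow_02,0%,4.9,0.38,1.1,0.07
cow_09,45%,8.7,0.29,2.4,0.19
cow_10,45%,9.1,0.33,2.6,0.22
```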
In this case the data is quite clean, so we accept all of the small adjustments that the data integrity check makes, and we say skip. We now have a list of cows in rows and all of the compounds, or compound concentrations, in columns. There are probably 50 or 60 chemicals and, I don't know, about 7 or 8 cows for each of the different diets, fed 0%, 15%, 30% and 45% cereal grain.

The next step is probably one of the more confusing parts for anyone doing statistics, and this is the scaling, transformation and normalization step. This is trying to deal with the fact that, yes, there are probably some dilution effects with the rumen: sometimes people didn't get as much rumen fluid as they wanted, or the needle aspiration didn't do a great job. Then there are also some compounds that are 100 to 1,000 times higher in concentration than all the other compounds. So we're trying to deal with dilution effects, and we're trying to deal with skewed distributions of metabolites, and you have different options for normalization, scaling and transformation. You can think of it as row-wise normalization and column-wise normalization, or normalization with respect to the compounds and normalization with respect to the different cattle, which may have been dilute or not so dilute. In this case, we've chosen some specific settings: we chose normalization by a pooled sample, which helps deal with some of the dilution effects; we didn't do any log transformation; but we did decide to do auto-scaling, which helps center the data a little bit and also improves the distribution. We could have tried clicking a whole bunch of these options until we got something that looked decent, and we'll show you what you're trying to achieve.

With data normalization, as I say, you've got the samples in rows and the compounds, peaks or bins in columns with their corresponding concentrations, and you can do row-wise normalization, column-wise normalization, or both combined. With the rows, we're just trying to deal with the dilution effect; that's row-wise normalization, the top one. The column-wise normalization is to try to make things normally distributed; that's the log transformation, the Pareto scaling, the auto-scaling and so on. If we didn't touch the data at all, what we would have is the plot on the left, which shows the compounds and their concentration ranges as box plots. You can see that most of the concentrations are really small, 10 to 50 micromolar, and then there are a few that are up in the order of a millimolar, 10 to 100 times higher than any of the others. This creates a very skewed distribution. Jeff showed you an example of a skewed distribution; this one is really skewed: everything is lined up on the left and a few are way off to the right. By using the scaling functions, doing both row and column normalization and centering, we've now changed the data. Instead of looking really skewed, it looks like a bell curve; it's Gaussian. It's plotted in the same kind of box plot on the right, and you can see everything is visible, unlike the plot on the left: each compound has a range, a standard deviation, and you can see the quartiles. And when you plot out the entire distribution, which is shown at the bottom, you can see this bell-shaped curve.
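For reference, the column-wise options just mentioned boil down to something like the following, where X is a samples-by-metabolites NumPy array; this is only a sketch, and MetaboAnalyst's own implementation may differ in its details:

```python
import numpy as np

def log_transform(X, pseudo=1e-9):
    """Log transformation; the tiny offset guards against zero values."""
    return np.log10(X + pseudo)

def auto_scale(X):
    """Auto-scaling (unit variance): mean-center each column and divide by its SD."""
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

def pareto_scale(X):
    """Pareto scaling: mean-center and divide by the square root of the SD,
    a gentler option that keeps some of the original magnitude differences."""
    sd = X.std(axis=0, ddof=1)
    return (X - X.mean(axis=0)) / np.sqrt(sd)
```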
So this is critical for doing statistics. As I said, if you didn't do this, you would be dealing with non-normalized data, and essentially all of your calculations would be grossly in error. Statistics was not designed to work with skewed distributions; it was designed to work with normal distributions.

Yes? How do you obtain a better distribution of your data — do you have any suggestions, or a flowchart, for how to play with these different possibilities? Not really; usually many combinations will work just fine, so there isn't one thing you always have to do. I think if you're dealing with urine or some other sample, you might want to try normalizing to some external measure — this is something you have to build into your experimental design — such as creatinine or something like that. If you're working with cell samples, you should normalize to the dry or wet weight of the samples. If you're working with fecal or stool samples, again, the tendency now is to normalize to the wet weight of material. For fruits and vegetables, people work with wet weight. So that's one way of normalizing or scaling. Then, when you're dealing with concentrations, log scaling works really well most of the time, and auto-scaling works really well most of the time. What Jeff has done recently is make this interactive. In the old days it was kind of tedious, and if you guessed wrong you'd spend hours going backwards, but now, in, I don't know, 30 seconds, you can see whether your first guesses worked or not. Assessing whether this looks normal is really up to your eye; this is where statistics becomes a bit of an art. I usually say statistics is the mathematics of intuition: it just formalizes, in a mathematical framework, what we might intuitively see. And this is another example of intuition: does this look Gaussian to you? Some of you would say no, it's not enough, I want to do some more. Okay, you can; you're then working with a slightly different distribution, and depending on what you've done to normalize and scale, you will get slightly different results — not profoundly different, but slightly different.

Yes? Do you recommend this visual interpretation — is it enough, or would you run a normality or homoscedasticity test? You could; it's just that we don't have that in MetaboAnalyst. Some people have done that, and this is something you could bug Jeff to add. Yeah, I like that. Actually, I asked about this, because I started using Box-Cox to work out the right transformation, and was told no, no, biologically you cannot just do that; you have to find a justification for normalizing this way, or for transforming with square roots or whatever. So there has to be a reason for doing it when you do it. Yeah — I understand the concern. The way the normalization works here is that whichever method you select is applied across the whole data set; we don't treat individual features differently, each with its own normalization. If you really wanted software that allowed you to apply a different algorithm to different features, that would become very complex.
You can do it, but we really don't recommend it for this kind of data set. Also keep in mind the direction of the normalization: the column-wise transformations work on a feature across the samples, not on all the features within one sample. If you think about a single sample, it contains all kinds of metabolites. Some, like the sugars, are very high in abundance, tens of thousands of times higher, and a lot are low in abundance, several orders of magnitude apart, so you really shouldn't expect them to be normally distributed across the metabolites within one sample, and normally they're not; there's no biological reason to force that. But a particular metabolite, measured across the population, can reasonably be expected to be normally distributed, and that's what the transformation is aimed at.

So I think, in the interest of time, we'll have to move on, because we're already behind and need to get caught up. It's an interesting issue, but I think Jeff's given a very good answer. That said, we've got a new version of MetaboAnalyst, version 3.1 or whatever, that allows you to do this interactively, and it is important to do it visually.

After you've done this scaling, transformation and normalization, you want to look at the data a little more closely and do a bit more quality-control checking, and that is to look for outliers. In some cases people do this visually when they first prepare their data matrix; some may have noticed something while they were collecting their data, but many don't. There's also a case for doing some noise reduction, especially when you're dealing with raw spectral data from mass spec, GC-MS or LC-MS. So what you can do is use some PCA analysis, or some heat map analysis, at the very beginning to see whether there's an outlier or two. This is an example where we've done principal component analysis and there are, in this case, two clusters, a red and a green cluster, and then there's something way off to the right, marked with the arrow. Or, in the case of the heat map, we're seeing a heat map of metabolites for the different groups, the red and the green group, and everything is kind of gray, blue, slightly red, and then you see this dark streak all the way across the heat map. That's another way of detecting an outlier. This could be a case where, in this particular example, it was the wrong sample: maybe instead of urine it was plasma and someone mixed them up. Maybe it was highly dilute or highly concentrated because someone added too much diluent, or concentrated it unknowingly. But this is the kind of thing that will mess up your data. It's a mistake, and ideally you'd like to understand how the mistake was made; if you can, then at least you can justify removing it. You can remove data using a data editor, or you can just go to your Excel spreadsheet and remove the data there. But if you're already into MetaboAnalyst, you can navigate down the navigation panel, go into the data editor, and do that editing.
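A rough sketch of that kind of PCA-based outlier screen, using scikit-learn on an already-normalized matrix, might look like this; the distance cutoff here is just one plausible choice, not MetaboAnalyst's rule:

```python
import numpy as np
from sklearn.decomposition import PCA

def flag_pca_outliers(X_scaled, n_components=2, n_sd=3.0):
    """Flag samples whose PCA scores sit unusually far from the center of the plot.
    X_scaled is assumed to be the normalized/scaled matrix with samples in rows."""
    scores = PCA(n_components=n_components).fit_transform(X_scaled)
    dist = np.sqrt((scores ** 2).sum(axis=1))      # distance from the scores-plot origin
    cutoff = dist.mean() + n_sd * dist.std()
    return np.where(dist > cutoff)[0]              # indices of suspect samples

# suspects = flag_pca_outliers(X_scaled)
```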
You can also do some noise reduction. This data filtering step is to get rid of some of the additional data that maybe you didn't remove as part of your XCMS processing, and there are a couple of options for it. Typically, you're looking for things that have very, very low intensities, near the lower limit of detection; things that have essentially no variance, the same value over and over again in every single sample, which probably indicates a contaminant that washed through; or things that are incredibly high in one sample, incredibly low in another, incredibly high in the next, with very little repeatability. Those are features that tell you something is wrong with that particular peak. That is very problematic with LC-MS data, it can be problematic with GC-MS data, and it's not a problem with NMR data. So those are some of the tricks and tools available in MetaboAnalyst to get rid of data that is uninformative noise.

Okay, so that's cleaning up your data and getting it normalized, which is probably the most important step, and then you're essentially ready for the data reduction and the statistical analysis. The data reduction and statistical analysis allow you to find the interesting features, the significantly changed peaks or concentrations; they let you look for certain patterns or trends; you can assess differences between phenotypes; you can classify and you can predict. These include a number of the ones Jeff mentioned: the ANOVA techniques and the multivariate statistics, and we'll also look at clustering. Once you've completed your scaling, and if you didn't have to do any further data cleanup, you now have these options in the statistics module. There are about a dozen different options, and the arrows mark some of the ones we'll try: we'll look at analysis of variance, PCA, PLS-DA, and then heat maps.

The first one we'll do is ANOVA, and the reason we're doing ANOVA is that we're not looking at cases and controls; we're looking at four different dairy cow groups, so n is greater than 2: 0%, 15%, 30% and 45%. We're trying to identify the metabolites that differ between all the groups, or just between the 0% group and all the others that were fed grain, and there are different types of ANOVA you can do. Simply by pressing the ANOVA button on the left, you get this: it takes a second or two and produces a scatter plot at the bottom. The scatter plot shows the negative log (base 10) of the p-value, so anything above that dashed line is relatively significant, and the points way up high are very, very significant. You can click on these dots, and when you do, a little box plot appears in the upper right. In this case it's 3-phenylpropionic acid, 3-PP, and it shows the concentration differences at 0, 15, 30 and 45%; you can see there's a pretty obvious difference, and it's pretty significant. You can go a little further, scrolling down and looking at the plots, and you can click on individual compounds; here we've clicked on uracil. In that table it gives you a p-value as well as the false discovery rate, the FDR, and that's usually of more interest because we're looking at multiple variables; as a rule, you'd like FDRs below 0.05. Again, it shows some plots, and there are some obvious trends and obvious differences, so these are significant: uracil is significantly different among the four groups.
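As a sketch of what the ANOVA-plus-FDR step amounts to, assuming a normalized matrix X and a vector of group labels (names hypothetical, using SciPy and statsmodels; illustrative only, not MetaboAnalyst's code):

```python
import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

def anova_with_fdr(X, groups, alpha=0.05):
    """One-way ANOVA per metabolite (column of X), followed by
    Benjamini-Hochberg FDR correction across all metabolites."""
    labels = np.unique(groups)
    pvals = np.array([
        stats.f_oneway(*[X[groups == g, j] for g in labels]).pvalue
        for j in range(X.shape[1])
    ])
    reject, fdr, _, _ = multipletests(pvals, alpha=alpha, method="fdr_bh")
    return pvals, fdr, reject          # significant features: fdr < alpha

# pvals, fdr, significant = anova_with_fdr(X, groups)
```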
So you can basically explore, and this is what I think is really useful about MetaboAnalyst: it allows you to browse through. You can see how different certain things are; you can see whether there are trends, whether one group differs from all the others, or whether there's a linear trend (zero is less than 15, which is less than 30, which is less than 45), or whether they're all the same. You can also go from the analysis of these box plots to another view, the correlation link, which is presented as a heat map, to see how certain metabolites vary between the different groups. It's a symmetric plot, which is what you expect, and you look at things in the red zone or the blue zone to distinguish them. You can also generate not just the on-screen HTML image but a high-resolution image, if you need one for a paper or a poster. Again, you're navigating on the left side fairly easily to do some of these things, and everything is point and click, so there's no coding. Once you've chosen that, you can download the image to your computer as a PNG, a vector graphic, or a PDF, as you wish. So that's ANOVA.

Another thing you can do, which is a little different from ANOVA, is to look at patterns and pattern analysis, especially when you're dealing with more than two groups; you might have 10 or 15 groups, and in this case we're dealing with four, so we use something called Pattern Hunter. This shows some possible patterns. There's a linear pattern where, going from 0 to 15 to 30 to 45%, the metabolite increases. You could also have a situation where at zero the metabolite is low, at 15 and 30 it's high, and at 45 it's low again, or you could have the mirror image. Those are patterns. So Pattern Hunter is an option, again navigated on the left side, and we can choose a pattern. Here we've chosen a pattern designated 1, 2, 3, 4, which is the linear one: low, medium, high. In essence, all we're going to do is look for linear correlations. We've chosen our simple pattern based on the four groups, pressed submit, and a second or two later this pops up: a graph showing the linear correlation coefficient, the Pearson correlation, for these different metabolites. Some climb with grain concentration and some do the opposite. In this case, we can see that endotoxin and glucose climb; they have the strongest linear correlations, about 0.6 to 0.7, so they increase with grain concentration, whereas 3-phenylpropionate and isobutyrate drop quite significantly, showing the opposite trend. You can choose where to call something significant; you might say, I want a correlation greater than 0.5 or 0.6, and you can say which metabolite best matches the pattern. So that's ANOVA, and that's pattern analysis within ANOVA.
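Before moving on, here's a minimal sketch of the Pattern Hunter idea: correlate each metabolite against a template such as 1-2-3-4 for the 0/15/30/45% groups (the variable and group names here are hypothetical, and this is only an illustration of the logic):

```python
import numpy as np
from scipy import stats

def pattern_hunter(X, group_levels, template):
    """Pearson-correlate each metabolite (column of X) against a template pattern.
    group_levels gives the group of each sample; template maps group -> pattern value."""
    target = np.array([template[g] for g in group_levels], dtype=float)
    results = []
    for j in range(X.shape[1]):
        r, p = stats.pearsonr(X[:, j], target)
        results.append((j, r, p))
    # strong negative correlations at one end, strong positive at the other
    return sorted(results, key=lambda item: item[1])

# results = pattern_hunter(X, groups, {"0%": 1, "15%": 2, "30%": 3, "45%": 4})
```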
The next thing you can do is principal component analysis, and this is the first part of our multivariate statistics. We click on the PCA link and we're going to look to see whether the four groups separate. We'll look at a scores plot and a loadings plot; there are different plots, and Jeff chatted about them a little. You can also look at these plots in 2D and 3D, and there are a variety of viewing options. So again, you just click the PCA link and a few seconds later this is what you see. It has already done the PCA analysis and created the ellipsoids that circle the different conditions, 0, 15, 30 and 45%, all colored. If there weren't any coloring, it would probably look like a big mess. So this gives you some visualization to say, yes, there is a distinction; it's not profound, but certainly the 0% differs substantially from the 45%, and there's a bit of difference among the groups in between. You can see some tabs up at the top, and you can navigate with them: you can look at the scores plot, the overview, a 3D plot, or the loadings plot, and by clicking on the tabs you can move back and forth.

If we click on the loadings plot, we see this graph, which is not as commonly shown because it's not typically as colorful, but what you see in this loadings plot is, first, a scatter of points, and there's a trend. If we go back, there's a left-to-right trend between the 0% and the 45% groups, the bluish-turquoise one and the red one. So the trend in our scores plot runs from the bottom left corner to the top right corner, and when we look at the loadings plot, it's the points at the bottom left corner and the top right corner that are driving the separation. Based on that trend, we click on the top right and the bottom left to see which compounds are driving it, and lo and behold, one of the top ones is 3-PP, that phenylpropionic acid; again, this reiterates what we saw with the ANOVA. We can go back and say, okay, 2D plots are nice, but 3D plots are cooler, so we can click on the 3D plot, which also has a JavaScript visualization so you can rotate it and see how the reds are separated from the dark blues and the greens. In some cases the separations are quite profound in the 3D plots when they aren't so obvious in the 2D plots, so it really helps to do this.

Okay, so that's the PCA. We are seeing some separation; it's not great, but it's evident, and we're also seeing some drivers pushing these groups apart, certain compounds that come up consistently. Let's go to the next bigger gun in our arsenal, and that's the PLS-DA; that's the option just below. We use this to essentially try to maximize or optimize the separation we saw with the PCA; it's doing some linear regression to achieve that, and as I said, and as I think Jeff pointed out, PLS-DA can be abused.
It's a classification technique as opposed to just a separation technique: it uses the class information, and it can essentially become over-trained. This is where you want to look at the Q-squared and R-squared values, which tell you whether the model is real or not, or look at the VIP scores to see whether things are significant: any time the VIP is greater than one (I usually use 1.2 as a cutoff), that's significant. So you click PLS-DA, and again, a few seconds later, this is what you get. In this case you can see the separation is stronger: all of the ellipsoids are distinct now, they aren't really overlapping as they were with the PCA, and they still show roughly the same left-to-right trend. You have the same sort of options, where you can click on the scores, some data validation, a 3D plot, and so on.

We can evaluate it by clicking on one of the tabs, which gives us our R-squared and Q-squared. These are sort of mystery statistics, I think, that were invented and that everyone trusts for reasons I don't know, but in any case, if the R-squared and Q-squared are greater than about 0.7, then your PLS-DA is good. In this case, a three-component model gives you an R-squared and Q-squared that are sufficiently good, and that's highlighted; that's essentially a stable model, and it's quite robust. We also look at the variable importance plot, or VIP plot, and this identifies the compounds that drive the separation. Once again, topping the list is 3-PP, and we also see things like glucose and lipopolysaccharide, or endotoxin, that drive the separation. So the PLS-DA in this case is robust; it's identifying many of the compounds we saw through ANOVA and through PCA, and that's a good sign. If you were getting completely different answers each time, that would suggest either that there's no real significance in the data you're seeing, or that you hadn't normalized things properly. So, as I say, one validation that your work is sound is to see that whether it's ANOVA, PCA or PLS-DA, you're getting basically the same answers.

You can also, as Jeff pointed out, go beyond R-squared and Q-squared and do what I think is a better technique, which is permutation testing. This will run 1,000 permutations, and you can see the distribution here: most of the models that were generated have scores way down on the left, and then the one that was actually the real one sits way out to the right. So it's highly significant, with a p-value much smaller than 0.001 against the permutations; it's clearly a robust model that's been generated.

The last thing we'll do in this example (you could do many others) is look at heat maps. This is the hierarchical clustering approach, and it's another way of looking at multivariate data. Some people prefer it because it's perhaps a little more visual, a little closer to how people think of microarrays or protein chips, and it allows you to look at the behavior of certain metabolites: which ones are low in the 0% group but increased in the 30 and 45% groups, and so on. With the heat map, you just click on the heat map option on the left side. There are some default parameters, most of which you can leave as they are, but you can change the color scheme as well. Here's one with the black-red-green scheme, but if you're red-green colorblind this probably looks all gray, so you can use the yellow-blue scheme and so on. At the top you can see how it has grouped the four different feeds, 0, 15, 30 and 45% grain, and then the metabolites that are clustered. There isn't a really distinct trend, at least that I can see with this, but again, every person has their own preference and choice.
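For the permutation testing mentioned a moment ago, the logic is roughly the following; this is a two-class sketch using scikit-learn's PLS regression, and MetaboAnalyst's implementation differs in details such as the exact statistic used:

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

def plsda_permutation_test(X, y, n_components=3, n_perm=1000, seed=0):
    """Compare the fit of a PLS-DA-style model on the real class labels (y coded
    numerically, e.g. 0/1) against fits on randomly shuffled labels."""
    rng = np.random.default_rng(seed)

    def fit_score(labels):
        pls = PLSRegression(n_components=n_components).fit(X, labels)
        return pls.score(X, labels)             # R^2 of the class prediction

    observed = fit_score(y)
    permuted = np.array([fit_score(rng.permutation(y)) for _ in range(n_perm)])
    # empirical p-value: how often a random labelling does at least as well
    p = (np.sum(permuted >= observed) + 1) / (n_perm + 1)
    return observed, p

# r2, p = plsda_permutation_test(X, y)
```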
So we've gone through ANOVA, we've gone through PCA, we've gone through PLS-DA, we've looked at Pattern Hunter, and we've done a few other tricks. All of these have involved clicking, saving, clicking again, and making some adjustments. What's been happening while you've been doing this, and what I think a lot of people appreciate, is that MetaboAnalyst keeps track of all of the graphs, all the plots, and all the parameters you've chosen, so at this stage you can ask it to write your report: you just press this, and there's your paper. If the PDF isn't generated automatically, it just needs a moment to be generated on the server. We used to just have staff write the report and send it to you, but not anymore.

So anyway, this is important: it has to do with traceability, provenance and reproducibility. It tracks your workflow, and a lot of people are trying to develop workflows precisely so there is that level of reproducibility. If you do this ad hoc with a whole bunch of different programs — I used Excel for this, another program over here, another one over there — perhaps your paper or write-up or thesis will describe it in some detail, but I think the key point is to capture all the parameters you used, and having that record is very important, so that if someone wanted to reproduce your work, ideally you'd have that parameter file available. It might be worthwhile, I don't know, for Jeff to consider keeping a record of those parameter sets, so that they could ultimately be archived and people could actually download them. We have thought about keeping a record of the parameters so people can rerun the same analysis; we've discussed it, and I think it's a good suggestion. I think more and more, with this crisis of reproducibility in science, this is one of the things people are really starting to push for: you can have that parameter set, and that would presumably be something that could be deposited into MetaboLights, and it could also be identified as a file in your supplementary material if you're publishing; at least that guarantees reproducibility of the data analysis.

Okay, we're going to move into the next part. So that's statistics; that's where 90% of the activity in MetaboAnalyst is, but probably the most useful stuff is in the other 10%. That's because I think people are so obsessed with PLS-DA and PCA. One of the real gems of MetaboAnalyst is the tool called enrichment analysis, and this is a concept that Jeff developed from gene set enrichment analysis, which is well known. We were trying to come up with a name for it, and we just called it metabolite set enrichment analysis, and it has sort of taken on a life of its own. It's divided into three different techniques: one is over-representation analysis, another is called single sample profiling, and the other is quantitative enrichment analysis, which is probably the closest to what GSEA is about. To do enrichment analysis you need to have really large data libraries, and these are ones we've assembled through HMDB and other data-mining exercises over the last number of years, though I don't know when the last time we updated them was.
Okay, so it's been a couple of years, and some of the data sets are a little older; I think with the update to HMDB we'll probably update more of them. But some of them are quite unique, and there are quite a number of them available. The idea here is not just to say this is statistically significant, this is different; it's to ask, what is 3-PP, and is it important in some kind of pathway? So it's trying to link things primarily to pathways, to diseases, and to the localization of some of the pathways: some are only membrane-bound, some are only found in the liver, and so on. Right now, because gene-centered enrichment and metabolite-centered enrichment require huge amounts of data archiving, the only organism we can support is humans. If someone wants to do it for E. coli or fish, feel free, but this has been an enormous effort, and not just from our group — from many groups, across millions of dollars — just to get the data sets.

So, as I said, there are three types of analysis. For over-representation analysis you just need the metabolite names, so that's not asking much. For single sample profiling you can take a single person, a single patient, and ask whether they are sick or healthy; all this is doing is comparing their metabolites with the normal concentration ranges. And then quantitative enrichment analysis takes a larger sample, where you've got lists and concentrations of metabolites for many people. So you can look at it this way: there's ORA, which is the simplest or weakest form of analysis; there's SSP, which is for individual patients (it could be you, it could be your friend); and there's quantitative enrichment analysis, which is typically done as a medical study with multiple patients. And again, this is patient-centric, human-centric, because humans are the only organism for which we can get this kind of detailed data.

As an example, we took some individuals who had lung and colon cancer and were suffering from a condition called cachexia, which is muscle wasting. This is really the major killer in cancer; people call it dying from tumor burden, but it is the number one reason why people die from cancer, and in some cases it appears to be preventable, and perhaps there are nutritional interventions that can help mitigate it. On the previous slide where you mentioned compound concentrations — do we actually need the real concentrations? If you have cases versus controls and it's more like relative data, does that not work? For the over-representation analysis you could use it, because if something is over-represented it's really just the compound names that matter, but in the other cases the concentrations are critical, because all the reference data in HMDB is in real concentration units.

So here we're doing over-representation analysis with the lung cancer data set. We simply click on the MSEA option and in this case choose ORA. It uses lists, particularly ones found in KEGG; it's a weaker analysis, because it's essentially just looking at word or term frequency. Ideally we'd like to move this from KEGG into HMDB or SMPDB, because the data there is much richer. This is one of those situations where, if you're using a list, you want to make sure you're using the proper names, and in this particular list someone had misspelled isoleucine — I think they forgot to put the E in. So there is a name normalization, or name standardization, step actually built into MetaboAnalyst, because we've spent many years getting compound names standardized in HMDB. It looks at your name list and says, maybe there's a problem here: you've misspelled isoleucine, and you accept the correction.
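For what it's worth, an over-representation test like ORA essentially reduces to a hypergeometric calculation along these lines; the numbers in the usage comment are hypothetical, and the real tool layers library lookups and multiple-testing correction on top:

```python
from scipy import stats

def ora_pvalue(n_background, n_in_set, n_selected, n_hits):
    """P(X >= n_hits) when drawing n_selected metabolites from a background of
    n_background, of which n_in_set belong to the pathway/set of interest."""
    return stats.hypergeom.sf(n_hits - 1, n_background, n_in_set, n_selected)

# e.g. 1000 metabolites in the library, 20 in 'protein biosynthesis',
# 50 significant metabolites uploaded, 6 of them falling in that set:
# p = ora_pvalue(1000, 20, 50, 6)
```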
Then, as I said, the real key to this MSEA is these eight or nine different libraries. Some are pathway-specific libraries, some are disease-specific libraries, some are SNP-specific libraries. Unfortunately, a lot of people just choose the top option and say, I've done it; that's probably not the most interesting one to use, and it may depend on the specific application you want to look at. In this case we just chose the top default. Again, these are people suffering from cancer and muscle wasting, and the first few sets we see are basically protein metabolism and amino acid metabolism, which are all messed up in these individuals; that's highlighted by the red ranking, and then things fall off in terms of their relative importance.

What is the meaning of p here — is it the fold enrichment, or the p-value? It's essentially the p-value: 10 to the minus 1, 10 to the minus 2, 10 to the minus 3, so the lower the p-value, the more significant, and the darker the red color. If you scroll down there's a table which gives you some more details, including the FDR, the p-value, the expected value, and the total number of hits based on the frequency of those terms. The expect value, I think, is calculated in the same way that BLAST expect values are, but Jeff might be able to give you a better idea; I don't use it, I just use the FDR, and I think that's obviously the more relevant one. If we use the FDR, what should the cutoff be? Again, it's somewhat arbitrary, but most people use an FDR of 0.05; you could use an alpha value of 0.01, or 0.1, you just need to explain why you've chosen it. I ask because I've seen 0.25 used in enrichment analysis. I guess if you want to do that — this is where there's obviously lots of debate. 0.05 was never actually derived from anything; it just sort of became a standard, and people will die on a sword over 0.06 versus 0.04, which is silly. If you want to use 0.25, fine; all you need to do is explain it and say, based on how other people have done it, this is how I've chosen to do it. You're just trying to reduce the frequency of false discoveries, so if you can live with a 5% rate of false discovery, or a 25% rate, that's fine. But what about metabolites that aren't recognized? Say I have six metabolites that should hit a pathway, but only two are matched, so four aren't counted, and in that case the FDR shown is 0.12 or something; if the other four were included, the FDR might come out around 0.005. Can I make that argument? I'm not sure, and given our time constraints, because we have to be done in a few minutes, I won't have time to answer that now; we can talk about it at lunch if you want.

In terms of just looking at the results, this simply gives you a little more detail. You can click on things and it will give you a certain view; you can also select the pathways and explore them in more detail, since they are linked to the SMPDB pathway set. Now, single sample profiling. As I said, this is basically the equivalent of what a physician would do: it's one person. We've chosen one sample in this case, not all of them, so it's one person, and we're just looking at their metabolites.
Now, single sample profiling: as I said, this is basically the equivalent of what a physician would do. It's one person, so we've chosen one sample in this case, not all of them, and we're just looking at that person's metabolites; there are about 50 or 60 listed, and we're just seeing what's high and what's low. HMDB has large sets of reference metabolite concentration values from multiple studies. We're looking at 3-inning in this case, and if we click on it for more detail you can see that there are four reported studies of its concentration in urine: three are pretty tight and a fourth is somewhat off from the others, but the concentration for this person is way above normal. They're dumping large amounts of it in their urine, normalized to creatinine, so this suggests the person has something wrong, probably related to cachexia and probably related to amino acid metabolism.

We can go further, which as I said is more typical of MSEA, and look at a population rather than an individual. In this case we're looking at the whole set of cachexic and non-cachexic individuals, and the result is similar to what we saw with ORA, the overrepresentation analysis: you have a plot which ranges from yellow to red, with p-values ranging from minuscule to very significant, and you have the table. In this case it's identifying collections of pathway-related metabolites; here it's pyruvate metabolism, and the metabolites involved, including ascorbic acid, pyruvate and lactic acid, are quite significant and very significantly different for people with cachexia. So this is a way of seeing or identifying pathways that are perturbed, in this case for people who have cachexia and cancer. It can be done on a pathway basis, a disease basis, a SNP basis; any of the nine different libraries contained in MSEA could have been explored in more detail.

Now, this has already suggested some pathways that are different for people with cancer, so we could actually go to the real pathway analysis module and look to see what is different for them; MSEA gave us some hints. Instead of just using pathways we could have looked at locational differences of metabolites, or a few other things, but this module is strictly for pathways and it's perhaps a little more detailed. The pathway analysis is not restricted to humans; it works for 21 model organisms, so humans are included but you can also work with mice or rats, Drosophila, Arabidopsis, E. coli and yeast. Right now we're just using the KEGG pathways, but we're trying to migrate to the SMPDB pathways that we introduced you to, because they're much more comprehensive and generally more complete than the KEGG pathways.

In this case, just for example, we're using the same human data set as before, the lung and colon cancer patients, and we're looking at urine. So we click on pathway analysis and choose the data set, but to do pathway analysis we have to do a little bit more than we did with metabolite set enrichment analysis; we actually have to do the same, or a similar, kind of normalization. The optimal parameters for this one are given here: we referenced it to a specific reference sample, we did auto scaling, and things normalized nicely; I'm not going to show the picture because it looks normal after this.
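Since auto scaling came up again here, a tiny sketch of what it does to a concentration table: each metabolite column is mean-centred and divided by its standard deviation, so every metabolite ends up on a comparable scale. The matrix below is made up; Metaboanalyst does this step for you.

```python
import numpy as np

# invented concentration matrix: rows = samples, columns = metabolites
X = np.array([[10.0, 200.0, 0.5],
              [12.0, 180.0, 0.7],
              [ 9.0, 260.0, 0.4],
              [11.0, 220.0, 0.9]])

# auto scaling: mean-centre each metabolite, then divide by its standard deviation
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

print(X_scaled.round(2))   # every column now has mean ~0 and unit variance
```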
Then we can choose our pathway libraries. The input is human data, so we don't choose E. coli, we choose human; make sure you choose the right organism. Here are the options, all 21 of them, and at this stage you can start doing some of your pathway analysis. There are several options, and they're explained in more detail in the Current Protocols in Bioinformatics chapter, which is on the web now, so people can read it if they wish.

A question about the options: do you need the compound names? Yes, you would want the names. You can have arbitrary names, but if you have too many of them it won't find any pathways, because you're matching against KEGG. So if you've got 80% of the compounds with names and 20% that are unidentified, then maybe you'll still be able to get some pathway analysis; it's just not going to map the unknowns to anything. In the end you've got to have a good number of compound names, and you need some relative or absolute concentration data that helps distinguish things.

We've chosen the default settings; they're actually generally pretty good. The point is that different pathways have different levels of complexity and connectivity, and people have identified what we call hubs, the red circles being examples of hubs, and things like bottlenecks, the blue circle being an example of a bottleneck. In the world of graph theory we talk about degree centrality and betweenness centrality, which are ways of measuring these things: something that is a bottleneck has high betweenness centrality, and something that is a hub has high degree centrality. If you're a topologist, that's cool, but what most people typically want to do is just plot out these types of plots. These are circle intensity plots, plotting the pathway impact against the negative log p, so the significance; the higher up and further to the upper right, the more important, and the bigger the circle, the more metabolites are involved and the greater the statistical significance. We've got something way over in the top right corner, and that one is glycine and serine metabolism, so that pathway gets highlighted.

A question: what is the meaning of pathway impact? There's no formal cutoff for pathway impact; the description and the calculation are, I think, in the original paper, and I don't remember the exact formula, but Jeff might be able to explain the thinking behind it. Ultimately the rule of thumb is to choose the things in the upper right corner; they're the ones that are more significant. In this case you've got about eight or nine metabolites in this particular pathway identified as being significantly different, which is why the circle is a little larger; if we only had one or two metabolites, the circle would be smaller. It has a Google Maps-like browsing function, so you can zoom in, and if you zoom in you can also click on specific metabolites. You'll notice that these only give KEGG identifiers, not the actual names, so if you want to know the name you have to click on the identifier; a pop-up box is presented showing the actual metabolite as well as the difference between the cases and the controls, the cachexic and the non-cachexic individuals.
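Going back to the hubs and bottlenecks for a moment: for anyone curious about those topology terms, here is a small illustrative sketch using the networkx library on an invented toy network (not a real KEGG pathway). A hub scores high on degree centrality, while a bridging node scores high on betweenness centrality relative to its degree.

```python
import networkx as nx

# a made-up toy pathway: A is a hub, D is a bridge joining two clusters
G = nx.Graph()
G.add_edges_from([
    ("A", "B"), ("A", "C"), ("A", "D"), ("A", "E"),  # A touches many nodes (hub)
    ("D", "F"),                                      # D bridges to the F-G-H side
    ("F", "G"), ("F", "H"), ("G", "H"),
])

degree = nx.degree_centrality(G)        # hubs: many direct neighbours
between = nx.betweenness_centrality(G)  # bottlenecks: on many shortest paths

for node in sorted(G.nodes):
    print(f"{node}: degree = {degree[node]:.2f}, betweenness = {between[node]:.2f}")
```

In this toy graph D has only two neighbours but sits on every path between the two halves of the network, which is exactly the bottleneck idea.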
A question on the pop-ups: is that the normalized data or the original data? This one is normalized. Some plots, I think, will show the original data, but others, like this one, will definitely show the adjusted, normalized, scaled, transformed data. Is there a way to turn that off? Because if I'm reviewing a paper, I'd probably want to go back to the original data as well. Ultimately they would look the same in terms of the distribution; it's just that the numbers would be different. So you'd be able to generate a box plot, probably not through Metaboanalyst but with other tools, just for the particular compound you want to show, and make a nice plot of it, because I would not use the pop-up box plots in a paper; they're just too low resolution. If I found something really cool that I wanted to emphasize and put in my paper, I would use a separate tool, but you would get something that looks a lot like this; it just has a different y-axis because of the scaling. If you want to reproduce the original one, just don't use this pop-up; it's there for convenience so you can see it, but it's not meant for your final figure.

Okay, so in terms of a more detailed explanation of the pathway impact: it's sort of a fuzzy method. It includes the log fold change, the differentially abundant metabolites, the statistical significance, and the topology, and it combines it all into one measure. But it is a rank score, as Jeff said, so it's really just allowing you to say this one is more important than that one. You can also get a tabular result, and as before everything has a column and many things are hyperlinked: entries are linked to the pathways, both in KEGG and in the Small Molecule Pathway Database, and they carry the false discovery rates as well as the p-values and other statistics.

Another analysis you can do is biomarker analysis, which is further down, and this is particularly useful given that many people are starting to use metabolites as biomarkers. Jeff talked about receiver operating characteristic curves, or ROC curves. What you're trying to do with biomarkers is maximize your area under the curve, maximize the AUC, and minimize the number of metabolites in the biomarker panel. Many people doing biomarkers in transcriptomics, proteomics and metabolomics completely miss this concept, and they'll propose or publish a biomarker panel with 200 genes or 50 proteins. No one in their right mind is ever going to commercialize or use that kind of test, because it would cost thousands and thousands of dollars; most commercially approved tests use one, two or three markers, and that's just for practical purposes. So you want to minimize the number of molecules, whether metabolites, proteins or genes, but maximize the area under the curve.

The biomarker analysis module has three different sub-modules: one is for single-marker analysis, a univariate approach; there's a multivariate one; and then there are some people who believe they have a better idea than the computer, so they manually choose certain marker sets, and in fact sometimes that works better. So you have a simple module, a simple click-and-paste sort of thing, where you choose your data set. In this case we're uploading one: a set of 90 patients, women, expectant mothers at three months who gave blood samples. Forty-five of those went on to develop preeclampsia, which is a high blood pressure condition that is quite risky for the infant and the mother, and the other 45 had normal pregnancies. The aim was to find a predictive marker for preeclampsia, so that once a person finds out they're pregnant they could have a quick blood test to determine whether they're at risk; there are some very simple prophylactic strategies, basically an aspirin a day will prevent the preeclampsia if you are prone to it.
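Before walking through that example, here is a rough sketch of the kind of calculation that sits behind a multivariate ROC evaluation: a linear support vector machine scored by cross-validated AUC. The two-metabolite data below are simulated, and the real module does considerably more (feature ranking, permutation testing, confidence bands), so treat this purely as an illustration.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# simulate 45 controls and 45 cases for two hypothetical metabolites;
# the cases are shifted upward so the panel carries real signal
controls = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(45, 2))
cases    = rng.normal(loc=[1.2, 0.8], scale=1.0, size=(45, 2))
X = np.vstack([controls, cases])
y = np.array([0] * 45 + [1] * 45)

# linear SVM scored by 10-fold cross-validated class probabilities
model = SVC(kernel="linear", probability=True)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
probs = cross_val_predict(model, X, y, cv=cv, method="predict_proba")[:, 1]

print(f"cross-validated AUC for this two-metabolite panel: {roc_auc_score(y, probs):.3f}")
```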
So with this we upload the data just as before, we do a data integrity check and everything checks out fine, as before with the cow samples, except these samples are from women. We do a data normalization just like we did before: on the left we see a highly skewed distribution, on the right we see a nicely normalized distribution that looks very Gaussian, so this is good, and we can start doing some statistics. This is where we do the biomarker evaluation. We want to look for a couple of biomarkers, not just one, so we're going to do the multivariate analysis; we click that option, and a few seconds later this is the ROC curve. If you've seen pictures of ROC curves, this is an exceptional one, and it shows different colours for different numbers of metabolites: there's a ROC curve with two metabolites, a ROC curve with three, with five, with ten, and really the one with two doesn't differ a whole lot from the ones with ten or twenty. So at this stage you could say a biomarker panel of two or three metabolites will probably be sufficient to predict which women in early pregnancy will develop preeclampsia.

You can go a little further and try to assess the model; this is an SVM, a support vector machine model, and you can get an estimate of the possible range of performance. With ROC curves you have to remember that you can get highly optimal ones and not-so-optimal ones, so this is essentially doing a permutation test: it generates a range, marked by the purple or light blue band, and the middle line represents the average, which gives an area under the curve of about 0.94. So we've got a great ROC curve. What are the couple of metabolites that are so helpful? At this stage you can scroll a little further and view things, and what comes out through the VIP plot is that it's glycerol and, I think, 3-hydroxybutyrate that are substantially different. The VIP scores show several metabolites that could probably be used, and if I were just looking at the ranking I'd probably choose the top four as the best ones. But if you're developing a test you'd probably want something that's easy to measure and consistent, so you might say, well, glycerol is a pain to measure but hydroxybutyrate isn't, so I'd use that one plus one other. This is the sort of thinking you have to apply when you're designing biomarkers: maybe the best one isn't the easiest to measure, but the second and third best ones are, and those become your biomarker panel.

Another technique that's often used, and often asked about, especially by people writing grants but also by people trying to design experiments, is called power analysis. This is common in medical grants, and it's also common in veterinary studies, where they have to determine how many samples to collect and at what point to say that a study is too small, too big, or just the right size. You're generally wanting to find a condition where you can choose the number of samples or patients such that you have a power of about 0.8; that's sort of the consensus cutoff, and 0.8 means you've got an 80% chance of ending up with a statistically significant result.
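As a rough illustration of the calculation behind a power estimate, here is a sketch using statsmodels for a simple two-group comparison; the effect size of 0.7 is an invented number standing in for what a pilot study would give you, and the real power analysis module works metabolite-by-metabolite with FDR control rather than on a single test, so this only shows the basic idea.

```python
from statsmodels.stats.power import TTestIndPower

# hypothetical effect size (Cohen's d) estimated from a small pilot study
effect_size = 0.7

analysis = TTestIndPower()
n_per_group = analysis.solve_power(effect_size=effect_size,
                                   alpha=0.05,       # significance level
                                   power=0.80,       # 80% chance of detecting the effect
                                   alternative="two-sided")

print(f"~{n_per_group:.0f} samples per group needed for 80% power at d = {effect_size}")
```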
Some people may want 90%, some may want 99%; again, it's up to you, but the consensus that a lot of people are happy with is about 80%. So if that's the case and you're trying to look at a sample of cows or people or plants, how many do you need to test? Most of us just do this very arbitrarily and say, well, someone gave me 70 samples, that's what I'll use, and it might be that's far too few, or maybe it's far too many. That is the point of power analysis: it lets you design your study, but to really do this you have to have some preliminary data. To get a more powerful study, larger sample sizes are better; that's just the rule, that's obvious. There's also a thing called the effect size, which reflects how pronounced the markers, the chemicals, the genes or the proteins you're looking for are: how different are they between the controls and the cases? That's where you actually need some preliminary results, and this is the whole point of the power calculation: you need at least a pilot study to be able to determine it. So with pilot data, a few samples, maybe 10, you look at it and you choose a criterion; most people use 0.05, or you might use a false discovery rate of 0.05 or 0.1, and then you want to figure out, what's my power and how big does my sample have to be? Once you've uploaded a small pilot set, you run it through this power analysis curve; say we uploaded 10 samples, what this says is that in order to get sufficient power, to get to 80%, you're going to need about 60 samples each of cases and controls. Now, this is one where there's a relatively strong effect size, but there are many examples where you're going to need a thousand samples to be able to distinguish things; in the case of GWAS the numbers are often around 10,000 to 100,000. Fortunately, in metabolomics most people are able to get things quite significantly distinct with a few hundred to, at most, maybe a couple of thousand samples.

So we're out of time, I'm four minutes over, so we're not going to cover some of the other things that are available: clustering, classification, time series, two-factor analysis. These are some examples that are shown in the CPIB chapter that you can download, and this is an example of the integrated pathway analysis.