So welcome back, everyone. We're on to our second lecture. These are our standard slides, so Creative Commons, and then we're going to be focusing on metabolite identification and annotation for about the next hour and a half. We're going to be looking at the three main platforms in metabolomics: NMR, GC-MS, and LC-MS. From our little survey we found that there are about five or six of you doing NMR, six or seven doing GC-MS, and the vast majority of you doing, or planning to do, LC-MS. Then we're going to learn about some of the MS searches and databases, which mostly relates to LC-MS. So let's dive in. You saw these slides before, with this idea of annotation: whether it's NMR spectra or mass spectra, you have a bunch of peaks and you don't know what those peaks mean. Metabolite annotation is all about identifying those peaks, either producing a picture like what's shown above with labeled peaks, or a table like the one shown below with the names of the compounds and their concentrations, or their relative intensities or relative concentrations. And obviously the idea is to have more than just one or two; you'd like to have dozens, hundreds, hopefully one day even thousands that you can annotate. If you compare metabolomics with things like genomics and proteomics, those fields have long had web servers where you can take a DNA sequence, go to something called BLAST, which some of you have heard about, and search that sequence against a database of genomic sequences, which then identifies that gene for you. You can also use modifications of that to get transcript abundance, for those of you doing RNA-seq; it's the same sort of thing. If you do proteomics, you can take gel data or LC-MS/MS spectra that you've collected from your proteomics experiment, go to something called Mascot or other tools, type in all of your peak intensities, press go, and out come the identified proteins. That can also be used to get concentration data if you have intensity information. Historically, with metabolomics, it wasn't that easy. You couldn't take your chromatogram, upload it to a website called metaboloBLAST or something, and have it come out with your metabolite IDs. It's still kind of difficult, but I think we're going to show you some software over the next day or two that actually does do that relatively quickly now. Historically, though, the problem with metabolomics was that there just weren't these web servers for instant identification or instant quantification. There's a quote from Donald Rumsfeld back in about 2002, where he said there are known unknowns; that is, we know there are some things we do not know. But there are also unknown unknowns, the ones we don't know we don't know. It's since become a bit of a meme, with unknown knowns and known unknown unknowns and so on. But that's not the situation with metabolites in metabolomics. The compounds we identify are, at the time we collect our data, just peaks, but they are knowable and known: they have reference spectra and a lot of information about them. That's largely what we work with in metabolomics, whether it's untargeted or targeted. Unknown unknowns are a lot harder.
This is where you have to do computer-aided structure elucidation, or a lot of synthetic work. Most people, over the course of their career, may only identify one or two unknown unknowns, and some of these can take several years to identify. Natural product chemists do this a little more often. But I can tell you, as somebody who's been doing metabolomics for 20 years, I have yet to identify a true unknown unknown. The vast majority of our time is spent identifying known unknowns: things that have reference spectra, but that we just didn't recognize at the time we collected our data. I hope that makes sense. So to identify known unknowns, we basically use spectral deconvolution. It's most commonly applied to NMR, but deconvolution is also done in GC-MS, LC-MS, and MS/MS. Spectral deconvolution is essentially taking a spectrum of a mixture of compounds, or peaks, and comparing that mixture of peaks with peaks from pure reference compounds, where those reference compounds are in some kind of pre-compiled database. There's a huge effort going on in the metabolomics, chemistry, and natural products communities to create reference databases containing spectra of well-purified single compounds for NMR, GC-MS, and LC-MS or MS/MS. And doing spectral deconvolution in many cases allows you to not only identify but also quantify. Let's start with an NMR example, which is actually how I got into metabolomics 20, I guess 22, years ago. I was dealing with a mixture of compounds, and you can see a whole bunch of peaks at the top in blue: doublets and triplets and singlets. What you have to do is have a reference set of known compounds, so compounds A, B, and C in red, green, and purple. Those have characteristic spectra. If you know a little bit about NMR, you know that one compound does not equal one peak; one compound usually equals 10 or 12 peaks or so. So compound A is a couple of doublets, a triplet, and a couple of singlets, and similarly for compounds B and C. You can see that in some cases the doublets overlap, and the net effect is that the actual mixture spectrum is the sum of all three. It turns out in this spectrum that all three compounds are at equal concentration, and the intensities reflect that; the height of the peaks is partly related to the number of hydrogen atoms, or protons. You can see how these red, green, and purple peaks add up to the blue peak. So we can go one way, which is adding them together, but the real challenge is the reverse: to go from the top spectrum and figure out that it's actually made of those three compounds, and not four or five or one. Deconvolution is what's called an inverse problem, and it's a difficult problem to solve. In NMR, there's a software company that was started about 20 years ago called Chenomx. That was back when the word metabolomics wasn't around, so it's a kind of made-up name, but its software does spectral annotation for NMR. What you can see on the big screen is an NMR spectrum, and in the lower corner is the full spectrum ranging from zero to 10 ppm, with a water peak in it. This view is looking at around 3.73 to 3.82 ppm. The lines here correspond to the spectrum, and inside you can see this quartet that fits very nicely to some of the peaks in the spectrum.
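As a quick aside before continuing with the Chenomx example: the inverse problem can be made concrete in a few lines of code. This is a minimal sketch, not what Chenomx, BATMAN, or Bayesil actually do — it fakes three reference spectra with Lorentzian lines and recovers the mixture coefficients with non-negative least squares. Real fitters also have to handle peak shifts and line-width changes, not just heights, and all the data here are made up.

```python
# Minimal sketch: spectral deconvolution as an inverse problem.
# Three hypothetical reference spectra are summed into a "mixture",
# then non-negative least squares recovers the relative amounts.
import numpy as np
from scipy.optimize import nnls

ppm = np.linspace(0.0, 10.0, 2000)          # chemical shift axis

def lorentzian(center, width=0.01):
    """Simple Lorentzian line shape at a given ppm position."""
    return width**2 / ((ppm - center)**2 + width**2)

# Reference spectra for three made-up compounds (A, B, C);
# note that A and C overlap at 3.78 ppm, as in the slide
refs = np.array([
    lorentzian(1.48) + lorentzian(3.78),     # "compound A"
    lorentzian(1.33) + lorentzian(4.11),     # "compound B"
    lorentzian(1.92) + lorentzian(3.78),     # "compound C"
])

# Observed mixture: equal concentrations plus a little noise
mixture = refs.sum(axis=0) + np.random.normal(0, 0.005, ppm.size)

# Solve for non-negative concentrations, one per reference compound
conc, residual = nnls(refs.T, mixture)
print(conc)   # approximately [1, 1, 1]
```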
Back to the Chenomx screen: what's shown in the table are the compounds identified from the peaks, and their concentrations. So this software is deconvoluting the NMR spectrum. It's allowing you to identify, in this case, the alanine peaks — that quartet near 3.78 ppm belongs to alanine — and it's given us the concentration as well. So what the Chenomx software, which is commercial, does for NMR is this: you process the NMR spectrum manually. You phase it, you remove the water signal, you do what's called baseline correction to flatten out the signal, you manually reference things, and you normalize the peak shapes so they're symmetric. Then you fit the spectrum to a library of about 400 reference spectra using a kind of guess-and-check. You say, well, this kind of looks like it could be alanine, or I know alanine is supposed to be here; you click on the alanine signal and up it pops, and you find that it's a little bit off, so you use your mouse to drag and shift it until it actually lines up. You do that over and over again as you analyze the spectrum. A person who's trained in this takes about 20 to 30 minutes to finish a spectrum. Someone who's never done it, as we've found the few times we've tried it with people, takes two or three hours to fit a single spectrum of a mixture like blood. And different people have different ways of doing baseline correction, chemical shift referencing, and phasing, which means it's prone to error. So it's slow and prone to error, which is a bit of a problem, but this was one of the very first metabolomics software tools to come out; as I say, it's almost 20 years old. Bruker has developed software called AMIX, which does something similar to the Chenomx software. They've also developed NMR screening instruments called the JuiceScreener and the WineScreener — automatic tools for analyzing juice and wine components — and they now have one for spectral analysis of LDL and HDL lipoproteins. Imperial College has come up with a tool called BATMAN, which does automatic analysis of NMR spectra. And my group has been working on two others: one called Bayesil, which we'll learn about today, and another called MagMet, for magnetic resonance (NMR) metabolomics. Both of these are automatic tools for spectral deconvolution; they do all the things you do with Chenomx, but automated. When you automate things, obviously it's fast: computers can work night and day, and they work faster than you do. It means precision, sensitivity, and recall are really high. There's no variation between one person and another; a computer makes the same mistakes, but it makes them the same way every time. You can run it all the time; computers don't get tired of dragging, clicking, and dropping. You don't have user bias or user error. And it turns out that computers can sometimes detect things that humans can't; sometimes we have our own biases, and sometimes our eyes aren't as good as we thought they were. So NMR is currently the only fully automated approach to doing metabolomics, thanks to software like BATMAN, the JuiceScreener, and Bayesil. We want to introduce this to you as part of the course, in part to highlight the fact that there are techniques that can fully automate metabolomics, and groups doing automated NMR-based metabolomics are able to analyze tens of thousands, even hundreds of thousands, of samples.
So if throughput is something you need, NMR is currently the best approach. Of course, it's not as sensitive, and you need a fair bit of sample volume. But right now a company called Nightingale, based in Finland, is analyzing all the samples in the UK Biobank using NMR-based metabolomics. BATMAN, as I said, is a software tool and open-source project developed by Tim Ebbels at Imperial College. It uses Bayesian deconvolution to identify metabolites in one-dimensional spectra. The first publications came in 2012, and there hasn't been a whole lot of development since about 2015. The difficulty with BATMAN is that most of the data processing has to be done manually, and that actually takes most of the time: you still have to do baseline correction and referencing and phasing, and that varies a lot from person to person. So what we decided to do was try to develop something fully automated, something that would do all of the phasing, chemical shift referencing, water removal, baseline correction, compound identification, and compound quantification at once. That was harder than we thought, but it's been running now for about five or six years and it works pretty well, and it's now been converted to a web tool. This is what you will use, and it's very accurate. It uses something similar to hidden Markov models; some of you might have heard of those. Hidden Markov models are a form of probabilistic graphical modeling, and they're used in speech recognition: if you've used Siri or Alexa or Google Home, those use hidden Markov models to recognize voice patterns. We treat the NMR spectrum kind of like a voice pattern; it ends up fitting peaks and shifting intensities the way people do. And it has to know something up front, just like Siri has to know you're speaking English; if you speak to it in Chinese, it won't do very well. So if you tell Bayesil that you're looking at serum as opposed to urine, or a plant extract, it has an idea of what compounds it should typically find. That's not a whole lot of additional information, so it does most of the work computationally. This is an example of the fitting that Bayesil does. In one case we're looking at a spectrum with 90 compounds; the blue-green line is the actual spectrum, and all those colored peaks below it are the fit. If you tried to do that by hand, it would take a long time and be very hard to do. It can also do it for 150 compounds, which is also shown below. So this is the strength of automatic deconvolution: the computer doesn't get tired, you do, and the computer can see patterns where you can't. Here's an example comparing Bayesil versus manual analysis. This is someone who spent about 45 minutes fitting the spectrum; the black is the actual spectrum and the red is the fit. They used Chenomx, and they clicked and dragged and shifted. Below that is Bayesil, again red versus black, and if you look very closely you can see the fits are almost identical. In some places one might be a little better. But the point is that this was done by the computer: you could refit it again and again and get exactly the same answer.
If you ask the same person to analyze the same spectrum, they'll probably get a different answer every single time. And of course the computer can do this over and over, and you can run it on multiple processors, which makes it very, very fast. So Bayesil, as I said, has been converted to a website. Basil is obviously a plant, but spelled this way it's a nod to the Bayesian statistics used a little bit in the program. It's a website, it's been written up, and it's been used by a lot of people. It has a very strict protocol about how you need to run things; unfortunately, a lot of people never read it, so a lot of people end up fitting spectra in an inappropriate way, but that's life. So how does Bayesil work? Typically you start with a screen like this. You choose a file to upload your spectrum — this is like that idea we had of a BLAST for metabolomics. You can provide a name; call it "my spectrum" if you want. You tell it whether it's plasma or serum or cerebrospinal fluid. You tell it what type of NMR instrument you're working with — 500, 600, 700, or 800, the frequency, which is a measure of the size of the magnet. Then you indicate how much of the chemical shift referencing compound, called DSS, has been added, and you tell it whether you want the fast or the slow approach. Once you've filled in those things, you click go. In the first five seconds, it takes the NMR spectrum and does a Fourier transform, and this is what an NMR spectrum looks like when it's first transformed: pretty ugly, lots of peaks pointing up and down. The next thing it does is phase the spectrum, which gets all the peaks pointing up. Then it looks a little more like a real NMR spectrum, but you can also see that the baseline isn't great and there's a giant water peak. Phasing takes about 15 seconds. Then it removes the water, does the baseline correction, and figures out where the zero ppm, or DSS, signal is. So in about 30 seconds it has done all of the processing automatically, about three to five times faster than a human can, and it does it consistently — baseline correction in particular varies a lot from person to person; it's a little bit like art. At that point it does the automatic fitting, which takes three to four minutes after the pre-processing. In this spectrum you can see something black and something blue: black is the actual spectrum, blue is the fitted spectrum, and if you zoom in really close, you'll find that it matches essentially exactly. You see a giant peak at zero; that's the reference compound, DSS. Then you can see a few other peaks in the middle; those are sugars and amino acids and a few other things. It may look fairly sparse, but if you zoom in you'll actually see many hundreds of peaks, and all of those have been fit. The software has various tools. As it did the fitting, it was also identifying each compound, determining its concentration, and determining how confident it is. In this particular sample it could see hydroxybutyrate, acetic acid, betaine, carnitine, and it's measured the concentrations in micromolar.
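As a rough illustration of the steps Bayesil automates, here is a bare-bones numpy sketch — emphatically not Bayesil's actual code. A synthetic FID is Fourier transformed, zero-order phased by brute-force search, and baseline corrected with a polynomial. A real pipeline also needs vendor file readers, first-order phasing, water removal, and DSS referencing.

```python
# Bare-bones sketch of automated NMR processing on a synthetic FID:
# Fourier transform -> zero-order phasing -> baseline correction.
import numpy as np

t = np.linspace(0, 1, 8192)
fid = np.exp(2j * np.pi * 1200 * t) * np.exp(-t / 0.3)   # one decaying signal

# 1. Fourier transform: time-domain FID -> frequency-domain spectrum
spectrum = np.fft.fftshift(np.fft.fft(fid))

# 2. Zero-order phase correction: pick the angle that maximizes the
#    integral of the real (absorptive) part of the spectrum
angles = np.linspace(-np.pi, np.pi, 360)
best = max(angles, key=lambda a: (spectrum * np.exp(1j * a)).real.sum())
phased = (spectrum * np.exp(1j * best)).real

# 3. Crude baseline correction: fit a low-order polynomial to
#    signal-free regions (here, the spectrum edges) and subtract it
edges = np.r_[0:500, phased.size - 500:phased.size]
coef = np.polyfit(edges, phased[edges], 2)
flat = phased - np.polyval(coef, np.arange(phased.size))
```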
The accuracy you get in concentration measurements with Bayesil or MagMet is about 3 to 4% error, which is actually four or five times better than what you get with mass spec. This is an advantage of NMR: it's very accurate for quantification and very reproducible. The confidence score ranges from zero to 10; many compounds get a score of 10, some aren't as high, and in some cases it has specifically determined that a compound is absent, so it's confident it is not in that sample. Now, the Bayesil server that you'll work with or learn about today is limited to three types of biofluids: serum, plasma, and cerebrospinal fluid. MagMet, which we were hoping to show this year but which isn't quite ready, can work with a larger number of biofluids, including fecal water, and we're adapting it to work with beer and wine. NMR struggles to analyze urine, which is very complicated, so neither MagMet nor Bayesil handles urine. Bayesil is limited to NMR instruments that are 500 or 600 megahertz, though it can handle data from different manufacturers. As I said, you have to follow a protocol: you need to filter your sample through the specified filter, and you have to add DSS as the referencing compound; you can't use TSP. The tendency is that people almost never read the instructions, which makes it perform not as well as it's supposed to. It takes about five minutes and analyzes a single spectrum at a time. You'll have a chance to run Bayesil in the lab today; we'll give you some example spectra and you'll be able to click and play around with it. We've designed it as a web-based tool, and the version you have will actually allow you to analyze multiple spectra at the same time. Now, switching to GC-MS, which is obviously changing gears quite a bit. GC-MS spectra look a little like NMR spectra: there are peaks, and they're narrow, but instead of chemical shifts you see retention times or retention indices. Under a given peak there are sometimes two or three or four compounds, which is what's shown below, and the spectral deconvolution done in GC-MS is more about picking out those individual compounds under a single peak. Now, in GC-MS you have the chromatogram with its retention times, but you also have the MS spectra. So in the third dimension are these MS spectra, technically fragment spectra: in GC-MS you use electron impact, so you fragment things into small pieces. You might see the parent ion, sometimes you don't, but you'll see all these fragments, and the fragments carry a lot of information. By using the retention index — the Kovats retention index — as well as the actual peaks in these different spectra (the turquoise, red, and blue ones correspond to the three different compounds under this one particular peak), we compare those peaks to a database and look to see which ones match. You can see the matches circled, and those matches tell us essentially what each compound is, because those are reference spectra: they were collected, they're stored in a computer, and a match tells us with pretty high confidence what the compound is. So in GC-MS, you use electron ionization, or electron impact, and you get multiple peaks. Those peaks are partially predictable — not entirely, but they make sense to skilled mass spectrometrists.
And so if you see this particular characteristic pattern of 32, 31, 29, and 15, that probably tells you the molecule is methanol. In a mass spectrum — this one is probably more like an MS/MS spectrum — whether it's chemical ionization or electron impact or ESI, you will see the molecular ion. Sometimes you'll get what are called adducts; in the case of GC-MS these can include the reagent gas, which is one reason you generally want to use helium, so you don't get an adduct — but sometimes things just form these adducts. Then you get the fragments, and the smaller molecular-weight ions further to the left. The mass, or m/z, starts at zero on the left and goes up to about 600 for most GC-MS spectra. Now, we talked a little about how GC-MS is done. There are different derivatization reagents that react with specific groups: methoxyamine reacts with ketones and is used a lot for sugars, and then there are TMS, TBDMS, and BSTFA, reagents that attach a trimethylsilyl group or a variation of it — in this case a methylsilyl group. Those usually react with hydroxyl or amino groups. You have to do these derivatization reactions separately; they take about an hour and things have to be heated. They will add different numbers of TMS or methoxyamine groups depending on how many hydroxyl, amino, or ketone groups there are. So the net result is that sometimes a compound isn't derivatized, and sometimes it's derivatized once, twice, or three times because there happened to be one, two, or three hydroxyl or amino groups. From one compound you can actually end up generating five, six, seven derivatized compounds, or combinations of those, which makes things a little confusing. If you were trying to look at a mixture of 100 compounds, you've now essentially got 600 compounds. But that's characteristic of GC-MS. It's widely used for identifying and quantifying amino acids, organic acids, sugars, and fatty acids. It's limited to molecular weights of 500 or 600 Daltons, so it doesn't cover the big molecules: it doesn't cover lipids or some of the larger metabolites. Gas chromatography, as we've mentioned before, is much better than liquid chromatography for resolution. The other thing is that GC-MS and EI-MS are much more standardized — people were smart about it and standardized things, and that has essentially made GC-MS almost automatable. We're going to show you a tool that makes GC-MS almost automatic, just like NMR. There's a database, the NIST database, that many people use, and a software tool called AMDIS that is frequently used for GC-MS. NIST, the National Institute of Standards and Technology, based in the Washington, D.C. area, has just released the 2020 MS database. It includes a lot of compound data, and if you're doing metabolomics I'd certainly recommend it. It has a lot of EI data, around 300,000 compounds, plus data from ion trap, triple quadrupole, and QTOF instruments, as well as lots of retention indices and retention times. So it's an extensive, comprehensive database that's very well maintained. It comes with software called AMDIS, and this is what it looks like. It's a fairly primitive user interface — they haven't modernized it in 15 years — but it allows you to compare what are called mirror spectra between the observed spectrum and the database spectrum.
And when you see matches like this in terms of the peaks and their intensities, you can be pretty confident that you've found the right compound. So this compound, decylbenzene in this example, has been identified. This is spectral matching; this is the sort of software you can use in GC-MS. AMDIS, which is bundled with the NIST software, stands for Automated Mass Spectral Deconvolution — hear that word, deconvolution? — and Identification System. Just like NMR spectra, GC-MS spectra have some noise, so it does some noise analysis and cleans things up. It does its own peak picking, which is always a challenge in mass spectrometry, and then it creates a model spectrum based on those peaks. From that model spectrum it does the deconvolution, asking: is there one compound here, or two, or three? It decides and separates them out. Then it compares those spectra to the library using something called a match factor. The match factor is basically a way of measuring similarity: it's a dot product, if you've heard of linear algebra, or a cosine score, A·B = |A||B| cos θ. It essentially matches the intensities of the experimental or query spectrum against the reference database spectrum, comparing the masses of the two, with a weighting applied. That produces a value between zero and one, which is multiplied by a thousand. So a perfect match factor is 1,000, a pretty good match factor is about 700, and a lot of people will say they've found a match at anything above 600. If you're doing GC-MS, you generally have to run a set of alkane standards — there are eight, nine, ten of them, ranging from 8 carbons up to 16 carbons — and that's used as a calibration set to determine your retention indices; it can also help with some of the quantification. You run a blank sample in GC-MS to sort out the contaminants, usually the solvent and some derivatization reagents; that's used to help clean up or denoise your spectrum. Then you run your sample or samples of interest exactly the same way you ran the blank. And because GC-MS is so reproducible, you really don't have to worry much about drift at all. So this is what your alkane standards will look like; there are nine of them here, with retention times from two minutes up to about 10 minutes in this run. You can see that octane, nonane, and so on — 8, 9, 10, 11 carbons — are spaced by roughly the same distance. You're going to be using a software tool called GC-AutoFit, but you could do the same thing with AMDIS. Once you've created a calibration file using the alkane standards, you can convert your retention times into retention indices, the Kovats retention indices. The database has retention indices and mass spectra, so by calibrating and adjusting your retention indices you can use that information to identify your compounds. You can also try to get rid of false positives by comparing your spectra to the blank. In this regard the AMDIS system is manual, but of course you've got tools that help facilitate it. So you saw the calibration standard that we ran.
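A quick aside on the retention index arithmetic: converting a retention time to a Kovats index is just interpolation against the alkane ladder. Here's a minimal sketch using the linear (temperature-programmed) form of the calculation; all the retention times are made-up illustrative values.

```python
# Converting retention times to retention indices against the alkane
# ladder, using linear interpolation between bracketing n-alkanes.
import numpy as np

# carbon number -> observed retention time (min) for alkane standards
alkanes = {8: 2.1, 9: 3.0, 10: 4.0, 11: 5.1, 12: 6.3,
           13: 7.4, 14: 8.5, 15: 9.3, 16: 10.0}

def retention_index(rt):
    """Interpolate a peak's RT between the two bracketing n-alkanes."""
    carbons = sorted(alkanes)
    times = [alkanes[c] for c in carbons]
    n = np.searchsorted(times, rt) - 1          # lower bracketing alkane
    n = min(max(n, 0), len(carbons) - 2)
    c_lo, c_hi = carbons[n], carbons[n + 1]
    t_lo, t_hi = times[n], times[n + 1]
    return 100 * (c_lo + (c_hi - c_lo) * (rt - t_lo) / (t_hi - t_lo))

print(retention_index(4.55))   # ~1050: elutes halfway between C10 and C11
```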
We upload that into AMDIS; there are various windows for doing that, and I'm not going to go into detail, but it's similar to what you would do in GC-AutoFit. After you've uploaded the calibration file, you can tell it to calibrate using that file. This adjusts everything so the retention times and retention indices are properly scaled, which obviously helps with compound identification. It's the NMR equivalent of chemical shift referencing, except it's adjusting your retention times. At this stage you can start using the manual tools to do the spectral deconvolution. What you can see is this white peak, which we've highlighted with the little red box, expanded out here. Inside the white peak there is a red peak, and below that a blue and a yellow peak. Each of these peaks also has a corresponding mass spectrum. By clicking on any of them we can see the peaks; in this case we're seeing a spectrum with something at 73, 59, and 172, for the compound or compounds coming off at 11.597 minutes. Looking again from this zoomed-in perspective, we've now selected the peak at 19.5 minutes, I guess, and we see the white peak and the blue, red, and yellow ones. The yellow isn't terribly important, but when we look at the spectrum corresponding to the red and the blue — this is actually the same compound — we see a fragment ion at 73 and a parent ion at 144. Then we can compare it to the reference spectrum in the NIST database, and you can see an exact match, the 144 and the 73, in terms of their intensities; we can also see fragments at 59 and 100. A match factor is calculated — we don't show it here, but 600 out of 1,000, or 60%, is the typical threshold to call a match, and this one probably has a match factor close to 900. So we've shown you, very briefly, how you would do a GC-MS analysis using the AMDIS software and the NIST resources. But it's still a very manual process, and so we've been working on making it automatic, with a tool called GC-AutoFit. There are other tools you can get from different manufacturers — ChromaTOF, AnalyzerPro, AMDIS; these have been compared, and there hasn't been much change except in the size of the databases. There are other GC-MS databases too: there's the Golm database, maintained in Germany, mostly focused on plants, and Oliver Fiehn has developed his own library, which he sells through LECO and Agilent. These are not quite as large as the NIST ones, but they are alternatives. As I said, AMDIS is a manual tool; it's kind of the equivalent of the Chenomx software. You have to pay money for the NIST databases, just as you have to pay for the Chenomx software, so we wanted to develop something free and open access, and GC-AutoFit is what we've come up with. Just like AMDIS, it requires you to provide the alkane standards, a blank GC-MS spectrum, and then your sample or samples. Just like AMDIS, but more automatically, it does the alignment and the peak identification, and it calculates peak intensities and, against reference standards, concentrations.
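Since the match factor keeps coming up: it's just a weighted cosine between two stick spectra, scaled to 1,000. The sketch below uses the classic Stein and Scott m³·I^0.6 weighting; NIST's and AMDIS's production scoring adds refinements on top of this, and the spectra here are rough, methanol-like values for illustration only.

```python
# The match factor as a weighted cosine between a query spectrum and
# a library spectrum, scaled to the 0-1000 range.
import numpy as np

def match_factor(query, library, mz_power=3.0, int_power=0.6):
    """query/library: dicts of {m/z: intensity}. Returns a 0-1000 score."""
    mzs = sorted(set(query) | set(library))
    q = np.array([query.get(m, 0.0) ** int_power * m ** mz_power for m in mzs])
    r = np.array([library.get(m, 0.0) ** int_power * m ** mz_power for m in mzs])
    cosine = q @ r / (np.linalg.norm(q) * np.linalg.norm(r))
    return round(1000 * cosine)

# A methanol-like EI spectrum matched against itself scores 1000;
# against a mismatched spectrum it scores far lower
methanol = {31: 100, 32: 66, 29: 64, 15: 37}
print(match_factor(methanol, methanol))           # 1000
print(match_factor(methanol, {31: 100, 45: 60}))  # much lower
```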
So GC-AutoFit does the compound identification and concentration work. It accepts a variety of file formats, it's actually faster than Bayesil, and it's slightly more accurate, actually. It's been optimized to work with a number of biofluids: it works quite well with urine, where NMR can't, and it also works with blood and saliva. Now, the disadvantage of GC-MS is that you have to do sample workup: derivatization, a little bit of chemistry. Unlike NMR, where you can just take your sample and immediately drop it in, you still have to do this extra chemical work, so in that regard GC-MS is not as automated as NMR; there's still a lot of manual work involved. For GC-AutoFit, you have to upload some files: the alkane standards in a format called mzXML, a blank sample labeled the same way, and then your sample files. There's standard conversion software for this, and ProteoWizard is probably the most popular for converting all the different file formats that every vendor produces; that's critical for being able to analyze the data. GC-AutoFit has a similar interface to Bayesil. You upload your files — your alkane standards, your blanks, and your samples — one at a time or as zipped files, by clicking Browse. To do the identification and quantification you need a library, and it comes with its own libraries for urine, serum, and saliva. We certainly recommend that people use the library it comes with, but that also means you have to run your spectra in a way that's very similar to how we run them in our lab — which is pretty standard. So here are the alkane standards, here are the sample spectra, and here is the blank spectrum. You can see the alkane standard has about 15 peaks; here's the blank, with some derivatization and solvent peaks in there; and here's the actual sample spectrum, where you can see the chromatogram, the base peak chromatogram. Once those are uploaded, the computer does the magic, and it produces a list not unlike what you saw with Bayesil. It identifies the compound: there's the name, the retention time, the quality of the fit, the intensity, the match factor, the area, and the actual concentrations. Some compounds are identified, some aren't, but there's your concentration information. And just like with Bayesil, you can see the spectrum, but now the peaks are labeled. It's not fitting the way you do with NMR; it's doing labeling, and this display is mostly there to tell you it's done. This is really where the meat is: the GC-AutoFit results. You can download the results in an Excel- or CSV-compatible format, and it gives you both the compound name and the HMDB identifier. You can also go through and look at the spectra if you like, or if you need them for publication purposes. I'm going to stop here. I've gone through NMR and GC-MS; are there any questions? Am I coming across clearly, or is it a little confusing? I just have a question. I work with waterfowl, so birds, and I'm looking to do fecal metabolomics. The thing with that is that they have excreta, so it's both feces and they have urates in there as well. Yeah. We were hoping to run NMR — like, freeze-dry them and run NMR.
Would I be able to use your Bayesil software with that, if it contains urates? No, you probably wouldn't. The thing is that duck or bird excreta will have quite a few different compounds. If you were looking at duck serum, you could probably run it, because it turns out serum is very similar between mammals and birds. But excreta is highly variable. As I say, we're doing one for human excreta, or feces, with MagMet, but even going from infants to adults there's quite a change, and then there's diet on top of that; it's not simple, and of course humans have a very different diet than ducks. And as you say, birds excrete essentially the equivalent of urine and feces together, so it's going to be pretty complicated. For NMR you would have to use the Chenomx approach, the manual approach. We can certainly help with that if you want some advice — we'd have to do it offline — but The Metabolomics Innovation Centre handles quite a few exotic samples and has a lot of tricks to help with that. Okay, thanks. Any other questions? All right, I'll assume everyone's following me. So I'm going to dive into MS now, and this will be the bulk of the talk: essentially, how we identify compounds in MS. This is the Metabolomics Society's MSI, the Metabolomics Standards Initiative. They have four levels right now, although they're adjusting them: level one, level two, level three, level four. Level one compounds have been positively identified, confirmed by a match to a known standard, and unless you have a large library of standards in your lab — most people don't — you won't reach level one. What most people get to is level two, which is called putatively identified, where you match to a reference spectrum in a library, whether it's METLIN or HMDB or MoNA or whatever. For GC-MS it may include retention time, but it's still considered a putative match. Unless you actually synthesize or buy the standard and demonstrate you get exactly the same result on exactly the same machine, you haven't reached level one. Level three is compounds identified as a compound class. It can still be a named compound, but it might be that you're identifying it only by the mass-to-charge, so there are usually multiple possible isomers; you can also use tools like ClassyFire to help classify them. And the last level is saying: it's a peak and I don't know what it is. For untargeted metabolomics, that's where you are for about 95% of your peaks. It's important to remember these levels, and we'll bring them up a few more times as we move on. Now, if we go to LC-MS, metabolite identification looks a lot like GC-MS. Instead of a GC chromatogram we have an LC chromatogram; we still have peaks, and sometimes several compounds under one peak. We have either a single MS spectrum, which might be just the parent ion, or the MS/MS spectra, which look a lot like GC-MS spectra. Then you have your library and you try to do your matching. So the same concepts we applied with AMDIS and the NIST database carry over to LC-MS metabolite identification. You can also look at LC-MS spectra in a three-dimensional view: mass-to-charge, retention time, and peak intensity. And as we've seen before with GC-MS, a single compound can generate multiple fragments.
So you have your adducts — sodium or potassium, or in negative ion mode chloride adducts. You might get ion pairs, two molecules bundled together, and singly, doubly, and triply charged species. There are in-source fragments that arise just through the ionization. Then you'll get other peaks coming from the carbon-13, deuterium, nitrogen-15, and chlorine isotope variants. So the resulting 3D spectrum is complicated. I've shown you examples of it; this is a more realistic one. The challenge is to work with those 3D plots, and here it's graphed as 2D plots: you can look at the extracted ion chromatogram slices and see the retention time, the mass-to-charge, and the intensity. Just as with AMDIS, you have to do peak identification, trying to identify the signals that correspond to individual compound ions. You're trying to go from what might be multiple peaks down to a single peak: one unique mass-to-charge and one retention time. So it's many-to-one — in the case of complex spectra, from tens of thousands of features down to a few thousand. Retention time matching is something we've already talked about: we covered the COW (correlation optimized warping) method, the XCMS methods, and some others. This is how you align retention times as you go from sample to sample or batch to batch, and that helps reduce the number of peaks. So peak matching and retention time correction can be done this way. You can also think of it in a more mathematical way, and this is something that's been developed really nicely in MetaboAnalystR to improve peak selection in LC-MS data. The first thing you do is peak matching across retention time, then retention time correction — that's what's being shown visually here. Then you redo your alignment and update the groups, matching again on both the ions' masses and retention times, and you iterate, repeating that process over and over. That iterative matching of retention times and mass peaks really helps clean up and simplify the peak identifications. So let's imagine we start with these three samples, three runs of mostly the same material — let's say it's blood serum. You have a mass-to-charge, a retention time, and an intensity. You can see a mass-to-charge of 389, a 389, and a 389, at the same retention time, with only the intensity varying; okay, that's just a different concentration. But here we get something at 126, and here's another one at 102 instead of 126 but at exactly the same retention time — that might be an adduct. Then here's another 126, also at a very similar retention time, and another 102, again at a similar retention time. So you can see some similarities across these samples. Converting this into a simple table, we've colored the entries: the ones that match very nicely are in yellow, the green ones also match in terms of mass and retention time, the pink ones match in terms of mass and retention time, and the cyan or blue ones are somewhat different. So we've been able to group by mass-to-charge, and now we can group by retention time. In this case we're sorting things out: here they are grouped by retention time in the yellow — 51, 51.9, 51.9, and 52.
So the green and the pink match nicely according to retention time; purple is a bit of an outlier, and the blue is an outlier too. Now that we've done that matching, we iterate, regrouping the pinks and purples and blues and greens. The final result is this: we now have a total of five mass-to-charge values, a grouping according to retention time, and where they occur in the samples. By creating this table and doing this iteration, it's possible to firm up which peaks are real and which aren't, which are adducts and which aren't, and which should be bundled or grouped together, reducing the initial set of what seemed like tens of thousands of peaks to a smaller number (there's a small code sketch of this grouping idea below). So LC-MS untargeted metabolomics is possible using a whole range of commercial programs — a lot of companies invest quite a bit of money in tools like MassHunter, Bruker's ProfileAnalysis, Sieve, and Progenesis QI. There are always new ones coming up, and I've kind of given up trying to chase them all down. I'm sure some of you have used these commercial ones, but what we want to focus on is the free options — one of the mandates of CBW is to make people aware of free software. Some of you have mentioned MS-DIAL, some of you have used MZmine or MZmine 2, and a lot of you have used XCMS, I'm sure, or XCMS Online. We're also going to be introducing you to MetaboAnalystR. The ones in red are the ones we'll highlight today. So, XCMS: very well known, widely used — we'll probably take a little survey at the end of the lecture to find out how many of you have used it. Gary Siuzdak's group developed it in 2006. It does peak picking, peak matching, and retention time alignment; it does batch processing; and it accepts a whole range of formats from a whole bunch of different instruments. A survey found it's actually the most-used metabolomics MS processing tool. Now, the challenge with the XCMS package is that it's an R program, something you download. If you know R and have written R code, it's probably something you're comfortable with, but not everyone here knows R, and not everyone's a programmer. There's also a huge number of parameters, and it's really a whole bunch of different programs and sub-programs written by different people; if you just use the defaults, you actually don't do very well, so there's a lot of learning and tutorials involved. If you're processing untargeted LC-MS data, the data files are huge, so you need a big computer with lots of RAM. Because of the problems with the offline XCMS, they've created the online version, a web version of XCMS. This is what it looks like, and this is the web address. You have to get an account and password, but once you do, you can get onto XCMS Online. It does everything XCMS does: the alignment, the scaling, the peak picking. It also goes further: it can do data reduction and feature selection via PCA and PLS-DA, so it does some of the things MetaboAnalyst does as well. It does metabolite identification, m/z matching, and MS/MS spectral matching, and it uses its own database, the METLIN database.
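As promised above, here's the grouping idea from the three-run example in code form — a minimal sketch only, with invented features and tolerances. Real aligners such as XCMS interleave this grouping with retention time correction and iterate the two steps.

```python
# Minimal sketch of feature grouping: features from different runs are
# merged when both m/z and retention time fall within tolerance.
def group_features(features, mz_tol=0.01, rt_tol=0.5):
    """features: list of (mz, rt, intensity). Returns a list of groups."""
    groups = []
    for mz, rt, inten in sorted(features):
        for g in groups:
            if abs(mz - g["mz"]) <= mz_tol and abs(rt - g["rt"]) <= rt_tol:
                g["members"].append((mz, rt, inten))
                n = len(g["members"])          # running mean of m/z and RT
                g["mz"] += (mz - g["mz"]) / n
                g["rt"] += (rt - g["rt"]) / n
                break
        else:
            groups.append({"mz": mz, "rt": rt, "members": [(mz, rt, inten)]})
    return groups

# Invented features echoing the 389 / 126 / 102 walkthrough
runs = [(389.002, 51.9, 3.1e5), (389.003, 51.9, 2.2e5), (389.001, 52.0, 4.0e5),
        (126.010, 51.8, 1.1e5), (102.008, 51.8, 0.9e5)]
for g in group_features(runs):
    print(round(g["mz"], 3), round(g["rt"], 1), len(g["members"]))
```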
Now, one caution: when you're doing untargeted LC-MS/MS, you are not quantifying. All you can get is a relative value — say, 786. That's all it is: 786. It's not micromolar or millimolar, it's just 786, the intensity. If you see something at 886, yes, the intensity is higher, but you can't confidently say the concentration is higher. That's a caution people need to be aware of when they do untargeted metabolomics. Hello, everyone. I'm going to present next, about 20 slides on LC-MS, especially using MetaboAnalyst. Compared to GC and NMR, LC-MS is not so standardized, so there are a lot of issues in processing raw spectra. As David mentioned, using reference standards helps a lot, but much of the time spectra are collected without much in the way of reference compounds. What you can really compare is the same compound, or the same peak, across different samples: for the same compound, the same ion suppression applies, because the same structure has a similar effect. You cannot compare intensities across different compounds, especially compounds of very different structure; you need absolute quantification for that. But for the same peak across different replicates, comparison is relatively safe. I mention this because we're going to use the peak intensity table for statistical analysis, and at least for the same compound it's more or less comparable. That's just to follow up on the question. So, LC-MS in my group: since about 10 years ago I've been using XCMS to do the spectra processing. It was the first open-source tool for this, and it's very powerful: you can upload your data, do all the peak picking and peak alignment, and do statistical analysis. But there are issues, as the previous slides already mentioned. One is parameters: the default parameters don't work well — the more we tried, the more we realized we cannot use the defaults. You need to specify the parameters properly, and they have a huge impact downstream. Also, LC-MS is often used for large-scale studies with autosamplers, so batch effects are very important, and you need to do batch effect correction. So LC-MS spectra processing is very challenging; it's a really big data problem. Here's a pipeline showing some commonly used tools: XCMS is the main thing, but alongside it you use a package called IPO to do the parameter optimization, run XCMS with the optimized parameters, then do batch correction, and finally use RAMClustR to group features and export the data for statistical analysis. It's a very complicated process, you need a powerful server, and it can take several days or weeks. That's what really motivated us to try to do a better job here, because it's a huge bottleneck. So here we'll discuss MetaboAnalystR. We have a web-based tool, MetaboAnalyst, that mainly does statistical data analysis and some functional analysis for targeted metabolomics data. If we want to do raw data processing as well, how do we do it? We start with our R package: it's based on XCMS for the peak picking, with peak annotation based on the CAMERA package; then we do some cleaning, the statistical analysis, and knowledge-based analysis with a tool called mummichog.
We can try that today, and most likely we'll introduce more tomorrow on how to do functional analysis directly from LC-MS peaks. So this is the main workflow, as shown before, but there are issues. The first is parameter optimization: how do we do it? There are about 15 or 16 parameters, and finding which combination works best is trial and error — really, really hard to do. If we use the IPO package, which is very well established, it can take several weeks; it's surprising how slow it can be, even on a good computer. So we had to build better tools for parameter optimization. The second issue is automated batch effect correction. There are multiple algorithms for this, and none of them always does best — ComBat and several other algorithms we tried; sometimes one works better, sometimes another. So that's another question: how do we do batch effect correction automatically? The third is how to get pathway activity directly from LC-MS peaks. We'll introduce more of this mummichog idea, which is the basis for that. The whole idea is that we cannot accurately identify compounds individually, but if a group of compounds shows consistent change — say 20 compounds, all involved in one pathway, all changing consistently — then even with random errors in the individual compound identifications, the group level is much more accurate. It's like making a random guess for one or two compounds versus having a whole group in the same pathway all change consistently: the chance of that happening by accident is much lower. Based on this concept, we may not be able to identify individual compounds, but we can be very confident about the pathway-level functional change (there's a small sketch of this idea just below). This is how we want to go directly from spectra to biology without doing accurate compound identification. So here we're going to introduce the MetaboAnalystR pre-processing tutorial. This is for the lab session in the next section, and probably tonight. This section was just developed in the past month, so there are some updates; I want to make sure you get up-to-date slides and don't get confused. The overall steps: you upload your data, and it does some standard checks to make sure the data satisfies the tool's requirements. Then you select parameters — of course, you can let the tool do the parameter optimization and select them automatically — then you submit your job and download the results. The key point is that after you submit a job, you need to wait: this is the time-consuming step. Unlike GC or NMR, raw spectra processing for LC-MS takes much, much longer, sometimes several hours. Here I'll just show you the steps on screen. You can see at the bottom that it's not the main server; it's the development server, which is much more powerful and not released to the public — we don't want to interfere with the current server, which is for everyone. The development server is a more powerful computer with the spectral processing hosted on it. Where this new spectra module lives is here: you see the spectra analysis at the bottom, and you'll see one, two, three. In your lab you'll see NMR with Bayesil, GC-AutoFit, and this LC-MS spectra processing.
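Before the upload walkthrough, here is that group-level argument as a simple over-representation test — a minimal sketch of the idea only. mummichog's actual algorithm works over putative m/z annotations with permutation testing; the counts below are invented for illustration.

```python
# If 12 of the 20 compounds in one pathway turn up among your
# significant hits, how likely is that by chance alone?
from scipy.stats import hypergeom

total_compounds = 2000     # compounds matchable in the reference set
significant = 150          # compounds hit by significant peaks
pathway_size = 20          # compounds in one pathway
hits = 12                  # of those, how many are significant

# P(X >= hits) under random sampling without replacement
p = hypergeom.sf(hits - 1, total_compounds, significant, pathway_size)
print(f"enrichment p-value: {p:.2e}")   # vanishingly small
```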
This is where you get in. To upload the spectra — and this is something we just updated yesterday — one thing we realized is that if you upload large data in one go, the connection can reset; sometimes it resets, sometimes not. So the best way to upload large data is file by file. In our tests, files under about 100 megabytes all uploaded successfully; over 200 megabytes, the connection sometimes has issues. Here you put all your spectra in a zip file, shown as mzxml.zip. The spectra need to be in an open data format; so far we support mzXML, mzData, and mzML. They also need to be in centroid mode. Profile mode is not supported so far, because profile data is much larger and takes much longer to process, and we've found centroid mode actually does very well in large-scale processing (there's a small snippet below for checking this on your own files). These steps are also the ones highly recommended by XCMS Online. So if you want to upload your data, convert it to an open format and to centroid mode; you can use ProteoWizard to do that, and it's well documented. How many spectra can you upload? So far, about 120 spectra, which is a pretty good number — I guess most of you probably don't have that many to upload. You also need to give it a metadata file, which is basically the group information: which samples are QC, which are healthy, which are control, which are disease. It will do the labeling automatically, and in the end you get your peak intensity table and can do statistical analysis. Here it shows you where to upload your samples, one by one, together with your metadata file; you can click here for an example showing how to label it. After everything is checked, you can click proceed. At the bottom there are several example files you can explore. During the lab you're encouraged to use the first one because it's fast: you can see the result directly without waiting two hours. So you upload your data, and next it checks your format — whether it's centroid or not. If it's not centroid, it won't let you go to the next step; that's why everything needs to be centroid. It also checks the group labels, and you can include or exclude samples and click next. Here's the main thing. We've tried our best to make it not overwhelmingly complex, but still allow you some control if you really want to do things manually. For example, for the LC-MS platform setting, we have about 12 or 15 different platforms. You can use the default parameters, just as in XCMS, or you can choose auto-optimization: if you do, it will test the parameter combinations and choose the best for you. These are all the parameters, showing what it looks like if you want to override the defaults and specify them manually. If you use auto, the computer decides for you, and you just click to submit the job. This one is going to take a while. After you submit, the next page goes to your job status. At the beginning it will be in a queue; the computer on our development server can take about nine jobs simultaneously.
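On the centroid requirement mentioned above: you can check a file before uploading it. This sketch assumes the pyteomics library, where a spectrum's cvParams appear as dictionary keys; the file name is hypothetical, and the check is only as reliable as the metadata written by the converter.

```python
# Check whether an mzML file declares centroided spectra
# (assumes the pyteomics library is installed: pip install pyteomics)
from pyteomics import mzml

def looks_centroided(path, n_spectra=10):
    """Inspect the first few spectra for the 'centroid spectrum' cvParam."""
    with mzml.read(path) as reader:
        for i, spectrum in enumerate(reader):
            if i >= n_spectra:
                break
            if "centroid spectrum" not in spectrum:
                return False
    return True

print(looks_centroided("sample01.mzML"))   # hypothetical file name
```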
With more than nine jobs, yours waits in the queue, and once a previous job finishes, you get bumped up. We have to control this because it's the time-consuming part; otherwise the server would crash. Once your job is running, it goes through the process very quickly, at least for the example data — you can see one, two, three. It gives you some text information about what's being done on the back end, essentially like your R terminal, because sometimes you're not sure what's going on. Here it's basically telling you sample one is processed, sample two is processed, it's doing the peak alignment, it's on step two — it's trying to communicate what's happening. You can choose to wait here; it probably takes half an hour to an hour. The server is working quite well because we've tried to optimize it and get things done fast, but if you have a really large sample set, just don't wait. You can create a job URL: click, and it gives you a link you can save somewhere on your computer, then check back at night to see the result. For large jobs you don't want to sit and wait. In the future we'll probably ask for your email account and send you an email, but so far we just give you a link — we don't know who you are, so we cannot send an email, but you can always check whether it's done or not. So assume that after half an hour or 45 minutes it's done; you can then proceed, and you'll see a summary: the peak intensity box plots and the base peak profiles. You can see, for each sample, how many peaks and how many missing peaks there are. When we do the peak alignment, we know certain peaks should be there, but not every sample necessarily has that particular peak, so we know it's missing; and sometimes the retention times have also shifted a lot, so you can see the differences and similarities across the m/z range. You can click on individual samples to see them. We'll probably add more information here to make the output more informative; this is just the basic summary. Finally, if you go here, you can download everything — the annotated peaks, the filtered peaks, and a record of what's been done. Once you have this peak intensity table, you can do almost all the statistical analysis, including tomorrow's mummichog material. This is the result table if you want to see it: the peaks, the sample names, and the group labels. This is a large data set, and we'll probably share it with you to explore. So that's our approach. Now, why are we doing this? We really wanted to make sure we improve on XCMS, that the results are better than the defaults. So here we compare against the defaults on XCMS Online, using a standard mixture where we know what's expected inside and which peaks are more likely to be true. With the XCMS Online default parameters, it detects about 16,000 peaks, of which about 382 are true peaks. You can really see how much noise there is, but this is the reality: doing LC-MS with high-resolution MS, you get an enormous number of features, and a lot of them are noise. That's very clear in this example.
If you use IPO, which can run for a week or two, you see many more true peaks, but the noise also increases. Another tool, called AutoTuner, is very fast compared with IPO, but on the other hand its performance is not as good: it increases the true-peak count, but not as much as IPO does. With MetaboAnalyst's built-in optimization, we increase the true peaks without increasing the noise much; you can see the noise is only slightly higher than the default, while the true-peak count is much higher. For peak identification and quantification, we also optimize the parameters. And how do you judge whether a peak is good? You look for a Gaussian peak shape and for linearity. By these empirical rules, the peaks we pick look better than the default peaks. So that's what I wanted to show. David, I'm handing back to you; I think that takes us back to the databases. Okay, thanks very much, Jeff, and I'll try to wrap up quickly here because we're a little over time again. With NMR and GC-MS we talked about how they're good for certain classes of molecules; LC-MS is particularly good for more hydrophobic molecules than, say, NMR. It's obviously good for lipids, and lipidomics is very popular; it can also be used for fatty acids and organic acids, though those require different columns. To really identify compounds, you have to use MS/MS data, and if you can get retention times, ideally with some internal standards, that helps both validate and semi-quantify. Now, the different approaches: some groups around the world still use high-accuracy mass matching alone. That's dubious, but if that's all you can get, that's what you work with. The preferred way, just as with EI-MS in GC-MS, is MS/MS matching. If you're doing simple matches to m/z values or molecular weights, you can go to a variety of databases. There's ChEBI, Chemical Entities of Biological Interest, with around 70,000 molecules; PubChem, with 70 million; ChemSpider, with 40 million; and HMDB, with about 110,000 compounds. They all support molecular-weight searches. There are also more sophisticated searches by MS/MS: the NIST database supports that, and so do METLIN, MassBank, and MoNA. There's also another tool called CFM-ID, which we'll talk a little about. Now, when you're working with mass spec data, as Jeff highlighted and as I've mentioned before, with ESI you're going to have salt adducts, neutral losses, in-source fragmentation, and multiply charged species: lots of what we'll call noisy peaks that need to be dealt with. As Jeff was highlighting with his example of 16,000 or 18,000 peaks simplifying to around 700 actual compounds, you lose 80 to 90% of the peaks you detect in an untargeted mass spec run. So you want to distinguish those adducts, multiply charged species, and in-source fragments from the parent ions, or group them back onto the parent ions. Here's an example of a mass spectrum with sodium adducts, where there's an addition of about 22 Daltons to the typical M+H peak. So essentially there are extra peaks, and sometimes the adduct peaks are much, much more intense than the protonated peaks.
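To make the peak-quality idea above concrete, here is a minimal sketch of one such empirical rule: fit a Gaussian to an extracted-ion-chromatogram peak and use the goodness of fit as a shape score. This is a generic illustration of the idea, not the actual scoring code used inside MetaboAnalyst or XCMS.

```python
# Fit a Gaussian to an EIC peak and use R^2 as a shape score; values
# near 1 suggest a clean, well-formed peak. Generic illustration only.
import numpy as np
from scipy.optimize import curve_fit

def gaussian(t, height, center, width):
    return height * np.exp(-((t - center) ** 2) / (2 * width ** 2))

def gaussian_shape_score(rt, intensity):
    """Return R^2 of a Gaussian fit to one chromatographic peak."""
    p0 = [intensity.max(), rt[np.argmax(intensity)], (rt[-1] - rt[0]) / 4]
    params, _ = curve_fit(gaussian, rt, intensity, p0=p0, maxfev=5000)
    residuals = intensity - gaussian(rt, *params)
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((intensity - intensity.mean()) ** 2)
    return 1 - ss_res / ss_tot

# toy example: a noisy but roughly Gaussian peak should score close to 1
np.random.seed(0)
rt = np.linspace(0, 30, 61)
peak = gaussian(rt, 1e5, 15, 3) + np.random.normal(0, 2e3, rt.size)
print(round(gaussian_shape_score(rt, peak), 3))
```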
And again, this just reflects the fact that things ionize differently, so intensity doesn't tell you concentration. When you're deconvoluting, you can see the base peak chromatogram at the top, then you extract the extracted ion chromatogram, and from the extracted ion chromatogram you can see the mass spectrum. This can be done in two dimensions, as we're seeing here, or in three dimensions. In the extracted ion chromatogram for this particular molecule, we see three clear peaks: the protonated form, one with a sodium adduct, and one with two sodium adducts. All three peaks belong to the single compound at 525.08, and we're trying to consolidate those multiple peaks into that single peak at 525.08. There are a variety of tables people have compiled of the adducts you'll often see, depending on the solvent, the salts, and the behavior of the molecule. You can get more than just a sodium adduct: depending on whether you're working in solvents with ammonia or formic acid, whether there's potassium, sodium, or chloride around, and whether you're in positive or negative mode, you can also get dimers, which is where you see two M's in the adduct formula. You can see potassium adducts, the removal of hydrogen, the addition of hydrogen, the addition of methanol, even lithium. All these adducts are possible for the same molecule, so one molecule can give rise to something like 30 possible peaks. That's the kind of multiplication you can see in a given untargeted study. Oliver Fiehn, who works at UC Davis, and whom many of you have probably heard of, has an adduct table and adduct calculator. It lists even more adducts, including acetonitrile as well as methanol, isopropanol, and DMSO, along with ammonium, sodium, and potassium adducts, and those are just for positive ion mode; in negative mode there'd be others. And then there are the multiply charged species and the dimers, the 2M forms. So again, just to emphasize the complications you see with LC-MS, there's also the process of neutral loss, where things fragment, and depending on where the molecule is cut or broken, you'll see certain fragments and not others. Here the nominal parent is 122 Da, but we don't see, say, the 45 Da piece or the 17 Da piece, because those are lost as neutrals; instead we just see 77 and 105, the charged fragments. A number of databases are actually designed to handle and predict adducts, ion pairs, and multiply charged species; METLIN, for instance, can handle multiply charged species and neutral-loss species. Because of this issue of so many extra masses, searching purely by the mass-to-charge ratio you think you're seeing can generate a lot of false positives. So, as Jeff highlighted with the MetaboAnalyst examples, and as we've shown with a few others, pre-processing LC-MS data involves removing and consolidating adducts, consolidating multiply charged species, removing or identifying fragments, neutral losses, breakdown products, and any rearrangements, and removing or consolidating isotope peaks. There's also the process of removing noise peaks: using sample blanks, technical replicates, or quality controls, and some people will serially dilute samples and look for dilution trends; if a feature doesn't show the dilution effect, you can assume it's noise too.
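To make the adduct arithmetic concrete, here is a small Python sketch that, given a neutral monoisotopic mass, computes the m/z values at which some common positive-mode species would appear. The mass shifts are standard literature values; this is a toy version of the kind of adduct calculator mentioned above, not the Fiehn lab tool itself.

```python
# Compute expected m/z for common positive-mode adducts, a dimer, and
# a doubly charged species, given a neutral monoisotopic mass M.
PROTON = 1.007276  # mass of a proton in Da

ADDUCTS = {             # name: (mass shift in Da, multiplier on M, charge)
    "[M+H]+":   (PROTON,     1, 1),
    "[M+Na]+":  (22.989218,  1, 1),
    "[M+K]+":   (38.963158,  1, 1),
    "[M+NH4]+": (18.033823,  1, 1),
    "[2M+H]+":  (PROTON,     2, 1),   # dimer
    "[M+2H]2+": (2 * PROTON, 1, 2),   # doubly charged
}

def adduct_mz(neutral_mass):
    """Return {adduct name: expected m/z} for one neutral mass."""
    return {name: (n * neutral_mass + shift) / z
            for name, (shift, n, z) in ADDUCTS.items()}

# example: glucose, neutral monoisotopic mass 180.0634 Da
for name, mz in adduct_mz(180.0634).items():
    print(f"{name:10s} {mz:9.4f}")
```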
These are the tricks, or processes, that all have to be done when you're working with untargeted LC-MS runs, to reduce those 15,000 or 18,000 features down to a countable number. This is the typical workflow, where you can use tools like MZmine, MetFusion, R, or XCMS. With just a single positive-mode run, getting 15,000 to 20,000 features is not unusual. If you remove or merge the adducts, that knocks it down by about 20%. If you consolidate the multiply charged species, that reduces it by another 20%. If you remove or consolidate the neutral losses, that knocks it down by another 20%. If you handle the isotope peaks properly, that cuts it by more than half. And if you then remove noise peaks, that knocks things down by another 15 to 20%. So for a typical positive-ion run, you might go from some 15,000 or 16,000 features down to perhaps 2,500 actual peaks in positive mode; negative mode is generally not as sensitive, so you'll typically see about half that number. You saw this with Jeff's specific example of the defined mixture, where they went from 18,000 features to around 700 identified compounds, and even those probably still included some additional peaks. So this is the level of filtering you typically have to do. Now, at this stage you have a lot of mass-to-charge ratios and some peaks. Depending on the type of mass spectrometer, if you're using Orbitraps, TOFs, or QTOFs, you can get enough information to make a stab at a level-3 identification. You can try to convert the mass-to-charge value into a formula, and a formula defines a class of compounds: you can look it up in databases and say it has to be one of these 20 compounds. You need the accurate monoisotopic mass and some estimate of the error. There's a server for this; I should have checked, but at least last year it was working, and it should still be up this year. It's maintained at Aberystwyth University in Wales, and it generates nice molecular formulas. It's a simple server: you type in the accurate mass and the tolerance, and you can apply what are called the seven golden rules, from Oliver Fiehn's group, about the kind of elemental composition you'd expect compounds to have, and it generates a formula. Once you have the molecular formulas, you can go off to large databases like PubChem or ChemSpider and identify the actual compounds that match those formulas. From a mass alone, you might have a hundred possible molecules. With a molecular formula, that may shrink to on the order of 30 possible molecules. And if you go to a database, that might shrink that m/z value down to maybe 10 possible molecules. So going from mass-to-charge to formula to something in a known database progressively reduces the total possibilities. You can even clean it up a little further, because there are formula filters designed to use information about atom types, bond valency, and rules about atomic composition and bonding restrictions, and those are built in to help reduce false positives and produce more realistic molecular formulas. These are what's called the seven golden rules, which have been around for six or seven years, maybe longer now, and which Oliver Fiehn's group developed. They take the accurate mass along with the isotope peaks that you see.
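Here is a back-of-the-envelope sketch of that filtering cascade, using the approximate percentages quoted above. The fractions are rough, so the final count is indicative rather than exact.

```python
# Compounding feature reduction for a typical positive-mode run, using
# the rough percentages quoted in the lecture (indicative only).
features = 16_000                           # typical starting feature count
steps = [
    ("merge adducts",          0.20),
    ("merge multiply charged", 0.20),
    ("merge neutral losses",   0.20),
    ("merge isotope peaks",    0.55),       # "more than half"
    ("remove noise peaks",     0.18),       # 15-20%
]
for name, frac in steps:
    features = int(features * (1 - frac))
    print(f"after {name:24s}: ~{features:,} features")
# ends near the "perhaps 2,500 actual peaks" figure quoted above
```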
Those extra three or four small peaks associated with your extracted ion chromatogram, the ones that differ by one Dalton (these are not the adducts), are the isotope peaks, and their intensities are very important. If you have that isotope-pattern information, you can put it into tools like the Aberystwyth server, Fiehn's golden rules, or some of the commercial software, and they'll generate a pretty good guess at what the formula should be. There is more information in the molecular formula than in the mass alone. If you apply a molecular formula filter, and this is an example from a Bruker instrument, it generates possible formulas and possible matches, and in this case the one that is C24H15F3N5O4P is likely your best match. With that formula you can now go and search databases like ChemSpider or HMDB or whatever. The mass alone, 525.0808, had at least seven or eight possible mass matches, but the formula filter narrows it down to just one formula, and that one formula probably matches about half a dozen compounds, whereas the mass alone might give you 30 or 40 matches. That's the advantage of using formula filters to narrow down your search. And this graph illustrates that when you're using formulas, you can narrow things down with standard chemical rules and the seven golden rules. You start with eight billion possible elemental compositions; the seven golden rules shrink that to 600 million. If you then search the formulas against PubChem, you get down further still; this was back when PubChem only had 10 million compounds. And you can search even more tightly and say: I know I'm only looking at this class of compounds from this known organism. If you can narrow it down like that, then in some cases the formula alone will give you a unique hit. Here are some statistics on the frequency of formulas: as you go to larger and larger molecules, the number of possible molecular-weight isomers grows very large, and the number of possible formulas likewise increases with molecular weight. But applying the seven golden rules and other filters, saying, for instance, I know I'm looking at natural products, so there shouldn't be any fluorine, or I know this compound contains only carbon, hydrogen, and oxygen because there's no sulfur in the system, narrows things down. This also shows how having very accurate masses helps. If you don't have isotopic information, you depend very heavily on having really accurate mass; this is Orbitrap, FT-MS-type data. If you can measure smaller molecules that accurately, you can get the right formula; for larger molecules you need even higher accuracy to keep the number of formulas small. And that's without using isotopes: if you have the isotope information, and you can measure those peaks accurately, you get even further narrowing of the number of formulas. Even with a mass spec at only three ppm accuracy, if you include the isotopic abundances, you can go from a thousand possible formulas down to 18.
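For flavor, here is a toy sketch of mass-to-formula searching with one crude plausibility filter, in the spirit of (but far simpler than) the seven golden rules: enumerate CHNO formulas, keep those within a ppm tolerance of the measured neutral mass, and discard implausible H/C ratios. The element limits and ratio window are arbitrary illustration choices, not the published rule set.

```python
# Brute-force CHNO formula search with a ppm window and a crude H/C
# ratio filter -- a toy illustration, not the published golden rules.
from itertools import product

MASSES = {"C": 12.0, "H": 1.007825, "N": 14.003074, "O": 15.994915}

def formula_candidates(neutral_mass, ppm=5, max_atoms=(20, 40, 6, 10)):
    tol = neutral_mass * ppm / 1e6
    hits = []
    for c, h, n, o in product(*(range(m + 1) for m in max_atoms)):
        if c == 0:
            continue
        mass = (c * MASSES["C"] + h * MASSES["H"]
                + n * MASSES["N"] + o * MASSES["O"])
        if abs(mass - neutral_mass) > tol:
            continue
        if not 0.2 <= h / c <= 3.1:   # crude H/C plausibility window
            continue
        hits.append((f"C{c}H{h}N{n}O{o}", mass))
    return hits

# example: 180.0634 Da should return C6H12O6 among very few candidates
for formula, mass in formula_candidates(180.0634):
    print(formula, round(mass, 4))
```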
So high-resolution mass, combined with isotopic information and filtering rules like the seven golden rules, can in some cases get you to a completely unique molecular formula. It's still not a confirmation, because you don't have the authentic standard, and it's still based purely on a formula, which is computational, so you could be wrong. This is just highlighting again how using the isotopic abundances lets you determine, in this case, the correct molecular formula for a rather complicated molecule. And this is a real case: they looked at solanine and were able to get sufficient mass accuracy and sufficient isotopic-abundance accuracy to pin down the formula unambiguously, which then allowed them to determine that this had to be the compound, which they then confirmed with an authentic standard. Now, a real problem is that a lot of the databases, like PubChem, ChEBI, even METLIN and the NIST database, mix non-metabolites in with metabolites. The NIST database, for instance, contains exotic explosive compounds; they're very common there. PubChem has a lot of essentially theoretical compounds created for chemical screening libraries, things that have never left the lab and will never be found in a plant or a bird or in humans. Others include pure buffer compounds, like Tris and Trizma base, which are not really useful. This leads to examples, which I'll highlight again, where people report finding exotic drugs like cocaine in rats; it's because some drug happens to get a mass match, and it just isn't plausible. So if you know something about the source organism, that can greatly limit the number of compounds you're matching against, and that can make a real difference when you're doing formula matching. We'll talk about some of these organism-specific and application-specific databases. I mentioned before that quantitation in LC-MS is not frequently done, and most untargeted techniques give no solid indication of quantitation: there's some relative measure, but it is not absolutely quantitative. Even if you knew exactly how much of a reference or quality-control compound you added to your sample, that still won't give you absolute quantitation. To do absolute quantitation, you have to spike in isotope-labeled standards. Those are expensive, and you have to make sure the isotopic standard is the same chemical as the analyte, or very nearly so. Typically, quantitation by mass spec is done on triple-quadrupole instruments, though you can do it on QTOFs and Orbitraps as well, using something like SRM, or multiple-reaction-monitoring mode. Multiple-reaction monitoring, or single-reaction monitoring, works with these isotopic standards by looking at a specific precursor ion and a specific product ion. So you have to have fragmentation; you have to have something that lets you say: I'm looking at this parent ion, 121, and I'm looking for these product ions. Here's a list of parent ions and product ions, and those parent ions and product ions, or at least the pair of them, have to be unique.
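As a minimal sketch of the stable-isotope-dilution quantitation described here (and picked up again in the next paragraph): the analyte's MRM peak area is compared with that of a co-eluting isotope-labeled internal standard spiked in at a known concentration. This single-point version is illustrative; real workflows typically use a multi-point calibration curve.

```python
# Single-point isotope-dilution quantitation: concentration from the
# analyte / internal-standard MRM peak-area ratio. Illustrative sketch.
def quantify(analyte_area, istd_area, istd_conc_uM, response_factor=1.0):
    """Estimated analyte concentration (uM) from the area ratio."""
    return (analyte_area / istd_area) * istd_conc_uM * response_factor

# example: analyte peak is 2.4x the labeled standard spiked at 5 uM
print(quantify(analyte_area=1.2e6, istd_area=5.0e5, istd_conc_uM=5.0))  # 12.0
```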
And so when you're doing quantification, you've got to make sure those transitions are identifiable and uniquely attributable to the molecule; then the intensities of those peaks, calibrated against the isotopic standards, are used to quantify things. So although our focus has been mostly on untargeted metabolomics, there are targeted techniques, and a number of labs and companies have started creating targeted metabolomics kits. Biocrates is one of them, with the p180, p400, and p500 kits; Shimadzu has kits; and other groups are producing them as well. The advantage of targeted, quantitative metabolomics is that it is incredibly fast. We talked about weeks of processing time for untargeted metabolomics data, just on the computer side. With targeted metabolomics, it's often possible to process 80 samples in 24 hours: data collection, data processing, absolute quantification, and full identification. It's semi-automatic, and approaching fully automatic. So, just like GC-MS with GC-AutoFit, or NMR with Bayesil, with targeted kits like these you can really fly through a lot of samples and get a lot of quantitative information very quickly. This is an example of the measurable concentration range: with the kit systems you can go from 10 nanomolar or lower up to concentrations as high as 7 to 10 millimolar. So essentially a million-fold concentration range is detectable and quantifiable in these targeted MS-based systems, which is quite impressive. As I say, if you want to do high-throughput metabolomics, trending towards automation or semi-automation, this is the way to go. We're going to take a break now, but for those of you who are going to be involved in the lab, we want to make sure you can download these data files, because this is what you're going to need for the lab.