Okay, anyways, as Michelle said, I hope you guys had at least the hands-on experience. A couple of you actually did remarkably well in racing through some of the samples using the Chenomx software. So kudos to you for learning this and getting acquainted with it. So this is part two of metabolite ID and annotation. And where it's now going to go is from, as Michelle said, this idea of the manual approach to some of the other automated approaches in NMR. And then we're going to drift into GC-MS, and then we're going to talk more about MS methods, mass spectrometry: identifying compounds by mass, molecular weight, formula generation, and other ways of identifying unknowns or known unknowns using largely mass spectrometry techniques. So it's important to really understand that what we did just before lunch, and what we're going to be doing here, is the idea of taking a spectrum, NMR, GC-MS, LC-MS, MS/MS, DIMS, whatever, and producing a list. The list is going to have compound names, and it's going to have some number associated with those compound names. It could be a relative concentration, it could be an absolute concentration. So we're not binning, we're not grouping, we're actually naming and quantifying. And that's all that's needed, actually, to do any of the PLS-DA, PCA, and other multivariate statistical techniques that we're going to talk about mostly tomorrow. So this is the input that we're trying to generate. Now we did spectral deconvolution; you guys had the experience of looking at a real biofluid sample, and you had a chance, at least a few of you, to figure out where acetone was, and DSS, and alanine, and maybe even citrate or a couple of other ones. And we used the Chenomx software, and it's quite user friendly. It's actually had lots of years of development, and I think lots of people agree that it is particularly user friendly. There are other alternatives, and I'm going to race through a few of them. I've already mentioned TopSpin and AMIX, which are made by Bruker. It's a commercial package. We can't afford it, so we're not using it. Then there's MetaboMiner. That's freeware that Jeff, your TA, actually wrote. There are also tools that are available through the Human Metabolome Database. There are other tools available through Wisconsin-Madison's BioMagResBank. So MetaboMiner was developed a few years back, and the link might be a little out of date, but it's still around. Anyways, it's downloadable software, and it was actually developed to look at 2D NMR. So you guys were looking at one-dimensional NMR spectra, but just as there's 2D gel electrophoresis and 2D chromatography, you can do two-dimensional NMR, and it actually allows you to see many, many more compounds. And it works with things that are called TOCSY, Total Correlation Spectroscopy, as well as HSQC spectra. So it has its own library, just like Chenomx has its own library. It has sublibraries that are specific for certain biofluids; just like you guys had a list of compounds to work with for the CSF sample, this also knows what's supposed to be in CSF or plasma or urine. Jeff did some really neat things, which other people have sort of adopted since then, because he was a few years ahead of his time. He developed the idea of minimal signature peaks, peak pairs or clusters that allow you to identify things, and it uses a couple of other tricks to make sure that compounds can be identified.
In principle, and it's largely agreed by almost all spectroscopists, if you can look at two-dimensional NMR spectra, it is more reliable than trying to do one-dimensional NMR spectra. The annotation, the identification, is much better. Unfortunately, 2D spectra take a long time to collect. So the spectra you guys were looking at took about three to five minutes to collect. To collect a TOCSY or HSQC spectrum would take several hours, and so if you're trying to do 500 samples, it's a lot of time. The other thing is that two-dimensional NMR data is not as quantitative as 1D NMR data. So 2D data is for identification, but not so much for quantification. Anyways, we did a number of tests, and if Jeff were here, he could tell you how long it took him to do these tests. But this is essentially semi-automated, so you don't have to do all of the phasing you guys were worrying about and all the other stuff. So you can just sort of upload this. And it was hovering around about 90% correct in terms of its precision and recall, looking at both TOCSY and HSQC, and it was identifying 30 to 35 compounds in a sample. So that actually got people intrigued with the idea that it might be possible to automate what you guys just spent your morning doing. So that's one example, and as I say, MetaboMiner is still used; it's sort of cited as one of the first examples where we had this opportunity to do semi-automation. Now NMR compound ID is also supported by the Human Metabolome Database. It allows you to type in lists of chemical shifts, and it'll look them up in its collection of NMR spectra; it has a collection of about a thousand compounds that have been analyzed and assigned. It also has a whole bunch of predicted NMR spectra as well. And so if you type in a bunch of chemical shifts, just like what you saw, so you could have just used Chenomx or any regular software, just identified here's a peak at 1.72 ppm, here's another peak at 1.86 ppm, here's another one at 1.92 ppm, and typed those in. HMDB would be able to look through its library of chemical shifts and putatively provide you a list with some compounds. Now this is not as good or as reliable as Chenomx. It is not as good or as reliable as the other software which I'll mention. But it is sort of a point of last resort, if you're just trying to identify something that you're uncertain of. If you're working with a reasonably pure compound that you've isolated and you're trying to figure out what it might be, this is also something you can use, and that's probably where it's had its greatest utility. The University of Wisconsin also developed an NMR suite and set of databases as part of their BioMagResBank initiative. This is led by John Markley. And although they've sort of pulled away from NMR-based metabolomics the last few years, they did produce some good software. So one is an NMR analysis package written in R (rNMR). You guys had a bit of your tutorials on R, and there are lots of techniques and tricks available through R, both for data analysis and even graphing. So they've done this, and it actually was intended to help with metabolomic data processing. I'm not sure how far it's gone. I was talking with the developer I think a year ago, and he was planning on upgrading it and making it a little more useful. But it's intended, much like our MetaboMiner, largely for two-dimensional NMR analysis.
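To make that chemical-shift lookup idea concrete, here is a minimal sketch of what an HMDB-style shift search does under the hood: pair each reference peak with an observed peak within a tolerance. The reference shifts and tolerance here are illustrative values, not actual HMDB data:

```python
# Minimal sketch of a chemical-shift library lookup (illustrative data only).
REFERENCE_SHIFTS = {
    # compound: characteristic 1H chemical shifts (ppm)
    "citrate": [2.54, 2.66],
    "alanine": [1.47, 3.77],
    "lactate": [1.32, 4.11],
}

def match_compounds(observed, tolerance=0.03):
    """Score each library compound by the fraction of its reference
    peaks that can be paired with an observed peak within tolerance."""
    hits = {}
    for compound, ref_peaks in REFERENCE_SHIFTS.items():
        matched = sum(
            any(abs(obs - ref) <= tolerance for obs in observed)
            for ref in ref_peaks
        )
        hits[compound] = matched / len(ref_peaks)
    # return candidates sorted by score, best first
    return sorted(hits.items(), key=lambda kv: -kv[1])

print(match_compounds([1.33, 2.55, 2.65, 4.10]))
# citrate and lactate score 1.0; alanine scores 0.0
```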
Just like the HMDB, the folks in Wisconsin have created another thing called the BMRB peak server, where you can type in a whole bunch of chemical shifts and then look up their library of compounds to see if they can find any matches. They have a different library than what's in the Human Metabolome Database. Some of them overlap; there's maybe about 60% overlap. A lot of the compounds in the BioMagResBank are actually plant metabolites. And so those are useful, especially if you are doing plant metabolomics. And as I say, if you can't find one in HMDB, you can probably go into the BMRB, and vice versa. So they have both proton and carbon HSQC data. So these are, as I say, services; they're not blockbusters. But there are two things that have become, I guess I would call, blockbusters for NMR. One is a package called BATMAN. This is produced by Tim Ebbels' group out of Imperial College. It's also written in R, and it stands for Bayesian AuTomated Metabolite Analyser for NMR. You can take the B and the A and pull out a T somewhere and eventually get BATMAN out of it. This was the first software that really claimed to do automated metabolite identification for NMR. Now, it didn't do the phasing, didn't do the water correction, doesn't do the baseline correction. But the profiling that you guys had to do, that's the part it would do. And so some of you tried the auto-fit in Chenomx; this is sort of like auto-fit, except it does it more automatically. Problem is, it takes about nine hours to analyze a spectrum. So even if you guys are just learning, you could still work faster than BATMAN. So it's not so much an advance in the sense that, I mean, yes, maybe if you had a big supercomputer, it would be faster. But it's just not quite practical in that regard. Why does it take so long? It's the computer just grinding away, trying to fit a single spectrum. If you give it a very simple spectrum with about a dozen compounds in it, it can do it in about 20 minutes. And if you can be quite selective beforehand, so you go through the spectrum and say, I just want this, I just want this, so you could take the CSF spectrum, you could also do it in maybe half an hour, or at least the computer could. But by then, you're doing so much manual work, because you've already had to do the phasing and all the baseline correction, and then you're also isolating these pieces, that it doesn't really pay off. And I think they admit to that, but the concept is there, the idea that you can automate it. So, yes, question. Sorry, was that for one spectrum? One spectrum. So we've been working on a thing called Bayesil. And this is actually on the web, and if you guys wanted to flip out your laptops, you could actually go to the website. Has it been published yet? We just submitted the paper; I think today or maybe tomorrow is when it's supposed to go in. This one's a lot faster. So what would have taken BATMAN nine hours, this one takes about three minutes. It's very accurate. And the other thing is everything is automated. So all of the baseline correction, all the phasing, all of the water removal, reference identification, all of that is automated. That took a long time to figure out. The other part was to try and do the fitting very quickly. So a skilled person could have analyzed the CSF sample that you guys looked at in about 15 minutes. And the idea was, if a human can do it, surely a computer should be able to do it just as fast.
And so the idea is to mimic what a human typically does. And the way that we do things is we pick the low-hanging fruit first, and then the higher fruit later. And that's a pretty practical way of doing it. But you can't code that in easily for global fits. The way to do this was to use something called a hidden Markov model, or probabilistic graphical model. So HMMs are things that are used in pattern recognition, speech recognition. And that is essentially what was used, plus a few other tricks. And the net result is it fits the way an expert tends to fit. The other thing is that you can't just give it carte blanche. You have to give it a hint, just like we gave you guys. You have to tell it it's looking at blood, or you have to tell it you're looking at CSF, or saliva or whatever. And it knows what compounds are typically found in those biofluids. So once that higher knowledge is there, and that's part of the Bayesian component, then it can do the rest. And as I say, it does it very, very quickly. So just like you, it does the processing and the profiling. So if you run the website, and there are many examples on there, it'll come up with an initially crummy-looking spectrum. It's all out of phase. Then it will phase it, so you can see the phasing is done. Then it will remove the water; as you can see at the bottom, the water is gone. It'll also identify the zero-point position, so it's got the zero ppm. It's also fixed up the baseline and all the other things that you guys took a while to do. And then at that stage, it can start deconvoluting. So the top is what a human would do, about 60 minutes; the bottom is the machine. If you're running on a faster computer like this one, rather than the server, it's about 30 seconds to a minute. The server is about, I think, three minutes or whatever. So that's automated. And I think, as Michelle said, to appreciate it: if we just simply said, oh, here's the automation, everything's simple, you wouldn't understand how much work this has taken or appreciate what the computer does. This has actually taken us 10 years to develop. Yes? Can you say that again, please? You're asking why you should have it search through a library of, say, 200 compounds or something like that. So the reason is you want to minimize the number of compounds that it's looking at. It's a case of you've got a few thousand peaks that you're trying to fit. And if this is 200 compounds, then it's 8,000 times 200 combinations. If it's 50 compounds and 8,000 peaks, then it's a smaller number of combinations to look at. So it reduces the problem and speeds it up, in that case by a factor of four. But there's also the issue that there's a huge amount of overlap. So we have tried very hard, but we can't get this system to work for urine. Urine is too complicated. There are about 150 to 200 compounds in urine. It can work for simpler spectra, for CSF, cell extracts, saliva, fecal water, anything with about 70 compounds or less. Beyond that, it's too complicated. There are some tricks, higher-field instruments, more powerful computers, other ideas that we have. And so maybe a year from now we'll have that one figured out. It was our goal to get urine done, but it just didn't work, at least not at this stage. It's extremely difficult computationally. Yes, you could potentially put a urine sample into this and just tell it it's serum, and it'll try its best and fit those 50 compounds that it's expecting. And it would probably do okay.
The thing about urine and some of the other matrices, especially in NMR, is you have to be very certain about your pH, and you also have to try and remove, particularly in urine, divalent metals, magnesium, calcium. They cause some very unexpected shifts in peaks, and so the best route is to add a small amount of EDTA to the sample. So those are protocol issues, and if those aren't handled, it becomes very difficult for the computer to do it. Humans, we can do that. We can look for patterns that are shifted; but a computer, as I say, is pretty bad at straightening pictures on the wall. So there are still advantages with automation. This is why the effort, this is why we spent so long at it, this is why we're kind of happy with what we're getting, at least so far. This is why other groups are doing this; there's a group in Finland that has recently published another one along these lines. So it's a lot faster, 30 to 60 times faster. It's pretty accurate. The point is it's something where you can press a button, batch load, go home, sleep, and the answers, or the paper, are written for you the next morning. The other point, as Carolina mentioned, is we don't have this differential bias. As I was watching you do your baseline fits and your phasing, every one of you had slightly different spectra. Some were blowing up the noise all the way across the screen, and others could just barely see the water peak. So some were zooming in and others were sort of saying, oh, this looks good enough for me. So there are different levels of what we call good. Automation reduces that bias, and potentially user errors. And we have found that it actually picks up things that we miss. And we know this because we ran the example and said, oh, this must be wrong; then we went back and found, no, the computer was right, we were wrong. And that's happened more than a few times. So this is the advantage of actually having a computer do that sort of thing. Okay, I'm going to switch from NMR to GC-MS. And I think this is maybe where people will perk up again; at least this is more familiar to most of you. So we've seen the different chromatograms that are produced by LC-MS and GC-MS. A total ion chromatogram for GC-MS is typically shown in the top left corner there. And we'll see some peaks. We'll typically see 50, 60, 70 peaks that are clearly identifiable in any given GC-MS run. What we tend to forget is that any one of those peaks can be a single pure compound, but more often than not, several compounds. And that's illustrated below, where we'll see one peak in black but three other peaks that sum together to produce the one large black peak, which says that in fact there are three compounds under this apparent single peak. Furthermore, we can collect the MS spectra from these. And so each of them has a unique mass spectrum associated with it: a blue mass spectrum, a red mass spectrum, and a turquoise mass spectrum. So what we have to do in GC-MS is not only deconvolute the peak, but then do the spectral comparison, just like we did in Chenomx. So you could imagine that these are now equivalent to the NMR spectra you were working with, except now these are mass spectra. And just like Chenomx has its own library of 450 spectra to look up, here in GC-MS you'll have a large library of mass spectra. And the point is to compare the spectra to your database.
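As an aside, here is a minimal sketch of that deconvolution idea: treat the observed, overlapping peak as a non-negative mixture of candidate component spectra and solve for the contributions. The spectra below are toy vectors binned on a shared m/z axis, not real library data, and real deconvolution software does considerably more:

```python
# Minimal sketch: the observed "one big peak" modelled as a non-negative
# mixture of known component spectra (toy data on a common m/z axis).
import numpy as np
from scipy.optimize import nnls

# columns = library spectra of three candidate compounds (toy intensities)
library = np.array([
    [10,  0,  2],
    [ 5,  8,  0],
    [ 0,  4,  9],
    [ 1,  0,  6],
], dtype=float)

# simulate an observed spectrum: 2 parts compound 1, 1 part compound 3
observed = library @ np.array([2.0, 0.0, 1.0])

# non-negative least squares recovers the mixing coefficients
coefficients, residual = nnls(library, observed)
print(coefficients)  # ~[2. 0. 1.]
```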
So we've got three colored spectra, and we're going to compare them against the black library spectra on the far right. And if you look closely, the matching is very good. You can see the very top spectrum matches identically with the blue one, the middle one matches the red one, and the bottom spectrum in our library matches the turquoise one. In this case these library spectra are from pure reference compounds, so we know what they are. The top one might have been alanine, the middle one might have been adenine, and the last one might have been citric acid, I don't know. So we've identified your compounds. So in GC-MS we use electron ionization, EI, or electron impact ionization, different names. And the net result is we fragment molecules into component ions. So even almost the simplest molecule, methanol, generates almost half a dozen different peaks, each corresponding to a fragment. And because we conduct electron impact ionization with a standard protocol, standard voltage, standard flow rate, standard electrodes, everything standard, it means the spectra are highly reproducible. That means the libraries are valid. Same thing with the Chenomx NMR libraries: highly reproducible, valid for every compound, every instrument. So in a standard mass spectrum from a GC instrument, the largest mass, the right-most peak, is typically your molecular ion. And then to the left of that you'll have the fragment ions. So this is very much like an MS/MS spectrum, which we'll talk about later. If you use not just EI but chemical ionization, you can also sometimes end up with a gas adduct, an adduct ion. In that case, that ion is shifted to the right. So GC-MS is either EI or, in some cases, chemical ionization, but the vast majority is EI. So just remember there's the molecular ion, which is usually the highest molecular weight, and then there are all these fragments that you detect afterwards. Now when I was talking about GC-MS this morning, I mentioned that we have to derivatize most compounds. We have to make them volatile so they will fly. The way that we standardly do this is by adding TMS, or there are other ones like TBDMS, or the compound that Silas's group uses. Carolina, is it MSTFA or what's the one? MCF. Okay. So there are a number of routes that allow you to make compounds evaporate at modestly high temperatures. These reagents will react with different groups: hydroxyl groups, amine groups, also ketone groups. And the net result is that you have changed the chemical character of your compound. So if you had glutamic acid or citric acid or something like that, you're going to get TMS glomming onto the molecule, and maybe you'll get two TMSs, or three TMSs, or one, two, or three TBDMSs, or maybe you'll get a methoxime addition. So the compound that's actually being analyzed is not the original metabolite. You have to remember that, and you have to look at the parent ion mass with that modification. Derivatization is not clean, not 100%. And so in GC-MS, you'll often end up with several peaks for the same compound: one with one TMS, another with two TMSs, a third with three TMSs. But it's still the same parent compound, so you don't want to get confused by that. GC-MS is used a lot. In fact, it was the original technique, I think, when metabolomics first sort of dawned on the world in 1999; several groups in Germany were using it. It's suited for compounds that are less than about 500 Daltons.
It never was really designed for really large molecules. People use it a lot for amino acids and fatty acids. You can detect a number of sugars with GC-MS. But obviously it's not good for lipids, because they're too big, and so you have to use some different techniques if you want to characterize lipids by GC-MS. I mentioned this before, and I'll mention it again: gas chromatography is better than liquid chromatography. Far more reproducible, far more precise, higher plate counts. That's why we can rely on retention indices and report them and use them around the world. Now, the thing about GC-MS is that we use a standardized ionization mode. So it means that all the EI spectra that have been collected around the world are comparable. So when people analyze GC-MS data, they can use the same concept that we just did with Chenomx, which is nice. You can deconvolute; it's standardized; it doesn't really matter about the system. And instead of Chenomx, the tools are called AMDIS and NIST. So what's that? NIST is the National Institute of Standards and Technology. It's maintained in the US, and over the last 30 or 40 years they've been collecting spectra and archiving them. And it's a huge number of spectra: a quarter million EI spectra in the NIST 11 database. They've also done other things. They've done QTOF, triple quad, and ion trap mass spec for a large number of compounds. They've collected retention index values, again, almost a quarter million RI values for many different compounds. Now, a lot of the entries are different TMS derivatives; it's not as if they've found 200,000 metabolites or something like that. This is one TMS derivative, two TMS derivatives. They also cover other derivatization techniques. So if we're talking about the original compounds, it's maybe on the order of, I don't know, 20 or 30,000 compounds. The other thing to remember is that a lot of these compounds in NIST are not metabolites. They're synthetic compounds. They're pollutants, toxins, exotic synthetics. So it's not necessarily going to be your number one reference for metabolomic data. It wasn't intended to be, but it's still a rich, rich resource. You can use the NIST software to search for compounds, and it'll give you masses. It'll give you names of the compounds. You can look at how some of the mass spectra are matching. It's quite rich in terms of the information. And then the other part of the NIST equation is this tool called AMDIS. AMDIS is an acronym for Automated Mass Spectral Deconvolution and Identification System. It's quite old. The concept is quite old. But it does a number of the things that we saw in Chenomx. And the reason why we're not going to use this is, again, it costs money. And we would have had to have bought, I don't know, $18,000 worth of software for you guys, which I can't afford. So I'm just going to tell you how it sort of works. How many of you have ever used AMDIS? One? Okay. And it actually costs more than $1,000. But it does do things just like Chenomx. So you can do some background noise analysis. It identifies which ones are real peaks; we do this by eye with Chenomx, and Bayesil does peak identification on its own. AMDIS also does its own spectral deconvolution. So it takes some of these big lumpy peaks, pulls out the mass spectra from them to generate what are called clean or model spectra, and it presents options for you to say, I identify this compound. So it's semi-automated.
You still have to manually assess things, again, not unlike what you were doing with Chenomx. It uses a thing called a match factor. This is how it quantitatively reports the degree of similarity between the mass spectrum that it has deconvolved, or pulled out from your GC-MS data, and the reference database. So we'll go back here. By eye, you can say, yes, that one's similar, that one's similar, that one's similar. But how similar? Is it 98%? 62%? You want to quantify that. And that's what AMDIS does, through this match factor. It's like a score. It's their score, yeah. And it depends on which version, but it's out of 1,000. So 960 is very good; 55 is very bad. Anyways, it's a dot product of the query (QRY) and the reference mass and intensity values. So it just tries to match each of the mass values and the intensity values between the query and the reference. It's mathematically what our eyes do when we're matching spectra. And it produces a value between 0 and 1, which you multiply by 1,000 to get the overall score. So if you want to do GC-MS, there are a few things you typically have to do. The first thing you typically do with your GC-MS instrument, at the beginning of the week or even at the beginning of the day, is run some alkane standards. These are the calibration standards that allow you to determine the standard retention indices, RI. You can buy these commercially; they're pretty standard, and you can inject them into your GC-MS. And that allows you to get references so that everything will have exactly the same, or the appropriate, retention index. The other thing you want to do, because in GC-MS you're dealing with derivatizations, chemicals, messy stuff, is to run a blank sample. This represents just the solvent and the derivatization agent, the only other things that have presumably been added. It will also generate a bunch of peaks; these are things that you want to subtract from your normal GC-MS spectrum. And then you run your sample of interest. That's the one that's been derivatized, and you run it through. So you have reference standards for calibration, you have a blank sample to get rid of the noise, just like when you do a blank run in UV spectroscopy, and then the sample of interest. The alkane standards give you a set of reference points, retention times. These might run between, say, two and ten minutes; you can use longer alkane series and different running conditions that could go for up to half an hour. But they serve as reference points that allow you to say that this time on this machine and this column corresponds to this time on another occasion for these compounds, which then gives you this normalized retention index. That retention index is universal, or nearly so. So once you have a calibration file with your retention indices, you take your sample data, the one that you've just run, and recalibrate it with your retention index calibration file. That gives you the retention index for every compound in your urine or your blood or your plant sample. So you now have peaks with proper retention indices. And then you can start running NIST and AMDIS to start identifying and matching. You can also get rid of some of the false positives by comparing the blank against your spectrum, and that saves you time.
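To make the retention-index arithmetic concrete, here is a minimal sketch of the calibration step, assuming a temperature-programmed run (linear interpolation between the two bracketing alkanes, in the van den Dool and Kratz style). The alkane retention times are made up for illustration:

```python
# Minimal sketch of retention-index calibration from n-alkane standards.
# carbon number of each alkane standard -> observed retention time (min)
ALKANES = {10: 2.1, 12: 3.4, 14: 4.9, 16: 6.6, 18: 8.5, 20: 10.6}

def retention_index(rt):
    """Convert a retention time into a retention index by linear
    interpolation between the two bracketing alkane standards."""
    carbons = sorted(ALKANES)
    for n_lo, n_hi in zip(carbons, carbons[1:]):
        t_lo, t_hi = ALKANES[n_lo], ALKANES[n_hi]
        if t_lo <= rt <= t_hi:
            return 100 * (n_lo + (n_hi - n_lo) * (rt - t_lo) / (t_hi - t_lo))
    raise ValueError("retention time outside the calibrated range")

print(retention_index(5.75))  # halfway between C14 and C16 -> 1500.0
```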
So with AMDIS, this is how you would create your CAL file with your alkane standards. Again, I can't show it live because we can't afford the license. And once that calibration standard has been run, then you take your urine or blood or plant sample, and it will calibrate it, generating all of those values with proper retention indices. The NIST search, then, is one where you're letting AMDIS, with your NIST database, identify peaks from noise and deconvolute the peaks. So what's marked here in red is a peak. Each peak corresponds to a retention time, or retention index. And under that peak, which is marked in white, are three colored peaks: a red peak, a yellow peak, and a blue peak. And they have, of course, distinct masses: 78, 58, 172. So that allows you, in essence, to sort out the spectrum. As you zoom in a little more, we can actually see this; what it's doing is picking out the most abundant ones. We're seeing 73 in this case, and 144, clearly the most abundant. So they must be part of the same spectrum. With the AMDIS software, you can then calculate the match factor to see whether this particular spectrum matches anything in the database. And in this case, it actually matches the amino acid valine. And we can be more confident by comparing this retention index to the retention index that's already in AMDIS, and maybe it's 19.65. So we've got a nice match with the retention index, and we've got a very nice match with the spectrum. The 218, 203, 156, 144, 133, 73 are all features, and we can be pretty confident, based on the match of the retention index, which is in NIST, and the match to the mass spectrum, the match factor, that this must be valine. Now note that valine is indicated, but it's actually a trimethylsilyl valine. So it's been modified. NIST has cataloged the compounds not just as valine, but as all the TMS derivatives. Does that make sense in terms of how we're matching and identifying? So these essentially correspond to masses, or...? Okay, you take the 73 mass, and then you add the different points. So 73 blue, 144 red, and then there's a yellow at 130. Looks like... oh, those are the times, sorry. So we've taken this red peak, where I marked the retention time, which was 19.6 or something, and then we've blown it up and deconvolved at that peak. And it turns out it's largely just one mass spectrum, but there's one other one in yellow, very tiny, which isn't valine or anything else; it's just too weak. It might be the peak that's just to the right of it a little bit, or there's another compound. So in this case, it looks like it's mostly pure valine, and not much else. It could have been more complicated, and we could have had a few more compounds hidden underneath it. Question: as long as it's a 60% match factor or above, is that good confidence, or is there, like, a better-confidence range and a not-so-confident range? Yeah. Some people will use 60%; that's what they recommend. Generally, in our lab, we usually use, like, 80%, or 800. There are cases where there's a really nice match to the retention index and maybe only a 65% match on the mass spectrum. But you say, okay, two pieces of evidence are pretty good. Obviously, you'd like to have a perfect match on the retention index and a perfect match on the mass spec.
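Here is a minimal sketch of that match-factor calculation: a normalized dot product of query and reference peak intensities, scaled to 0-1000. The real NIST/AMDIS algorithm adds mass weighting and other refinements, and the spectra below are toy values, not real library entries:

```python
# Minimal sketch of a NIST/AMDIS-style match factor (toy data).
import math

def match_factor(query, reference):
    """Normalized dot product of two EI spectra, scaled to 0-1000."""
    # align the two peak lists on the union of their (integer) m/z values
    mz_values = sorted(set(query) | set(reference))
    q = [query.get(mz, 0.0) for mz in mz_values]
    r = [reference.get(mz, 0.0) for mz in mz_values]
    dot = sum(a * b for a, b in zip(q, r))
    norm = math.sqrt(sum(a * a for a in q)) * math.sqrt(sum(b * b for b in r))
    return round(1000 * dot / norm) if norm else 0

# toy spectra: {m/z: intensity}
deconvolved = {73: 999, 144: 650, 218: 120, 133: 80}
library_candidate = {73: 980, 144: 700, 218: 150, 133: 90, 100: 40}
print(match_factor(deconvolved, library_candidate))  # ~998, a strong match
```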
But there are issues here where the reference spectra were collected on different instruments with different sampling rates, some of the spectra that AMDIS or NIST have in their reference libraries aren't great, and then there are other subtleties with contaminants and how AMDIS has deconvolved things, perhaps imperfectly. So you'll have a couple of things that will prevent you from getting that perfect match. Colin? Yeah. So that's from one sample. Do you gain any confidence in looking at all of your samples and saying, okay, six different samples, and you've got a consistent match? Yeah, that's it, exactly. And I think this is just like with the Chenomx exercise that you guys did. Once you've done one, then you start doing others, or if you're looking at, essentially, blood, you'll have a working list. And we'll also have a retention index for a lot of those compounds. So typically, in our operation, if we've got a brand new sample type, say worm blood or something, something we've never seen before, we'll spend a long time, several days, just trying to make sure we can characterize all of it. We'll read up on the literature, see if anyone else has got some data. We try to have a written list, usually handwritten, with retention indices sorted out for some of these compounds. And yes, once we know that, then it becomes much more routine, and you have much more confidence about whether 60% is acceptable or not. But this is an issue for people who just sort of dive in and say, oh, it doesn't matter which sample, we'll go from plants to some microbial thing and boom, boom, boom. No. Each of these takes a fair bit of time. In our laboratory, we have standard lists for all of the major sample types that we've ever analyzed. So there's a reference list that we can look up so that we know what to look for. And I think that's critical for doing good work in a lab. So a lot of GC-MS work is manual. I mean, yes, AMDIS is semi-automatic, but it's a lot of work. And we've had examples in our lab where we were trying to analyze a thousand samples by AMDIS and NIST, and a lot of people went insane. So we went into trying to come up with an auto-fit system. There are other tools; other people use things other than AMDIS. AnalyzerPro and ChromaTOF are two examples. Has anyone used these or tried to use them? What's your preference? You like ChromaTOF? Anyways, there was an evaluation where they compared AMDIS, AnalyzerPro, and ChromaTOF. I think in this one, they claimed AnalyzerPro won, but it was six years ago, so some of them have gotten better. The AMDIS one hasn't changed. They've put it out there, and they just, I don't think, care to change it. But it was so far ahead of the field when it first came out, I think they just got kind of lazy. There are other databases out there that are much more oriented towards metabolomics. Oliver Fiehn has developed the FiehnLib database. A lot of that was sold to Leco and Agilent, but it's also, I think, sort of accessible now through his websites. And then the Golm database is an open-access one, and it's maintained in Germany. So, the GC-AutoFit software is something that was developed by one of our staff, Gam Suhahn. And it takes, you know, the sample, it takes the blank, it takes the alkane standards, just like we talked about before with AMDIS and NIST. But it will take multiple spectra, so with multiple spectra it can do auto-alignments. You guys will learn about this with XCMS tomorrow.
It can do, you know, the retention time calculation. It'll do the peak integration, which relates to the area. It'll also calculate the concentrations. It can work with a whole bunch of file formats that are pretty standard for GC-MS. And it takes anywhere from a few seconds to a minute, so on average about 20 seconds per spectrum, depending on how frequently the data has been sampled and how long the spectral run was. And with that, you can identify between 45 and 70 compounds. So that's about as good as what you guys were doing with, say, NMR. And the accuracy is up around 95%. It doesn't work for everything; just like with Bayesil for NMR and these other things, it has to be oriented to common biofluids or common samples, sort of trained or optimized for them. So this one largely works with human samples, but it's also not too hard to move it to other types of systems. Again, this hasn't been published, but probably next year we'll have a lab where we'll actually go through this one, just as we will probably have a Bayesil lab, which we'll go through with NMR. So, as I mentioned, there are these other alternatives to AMDIS, like ChromaTOF and AnalyzerPro, and other databases like the Golm database. The Golm one is really important. It was one of the very first spectral databases made for metabolomics. They collected the data with GC-quadrupole instruments and GC-TOF instruments. They have mass spectra and retention index data for lots of metabolites. So, unlike NIST, which took everything, this is about metabolites. And it's a large number: 1,400. There are also other spectra that are linked to unknown analytes; a lot of them are still unknown, but at least it lets you say, I've got the same sort of thing in my sample as they do. Their formatted spectral data are compatible with NIST and with AMDIS, so you can upload their data and make it part of your NIST-AMDIS tool set. As I said, it's mostly plant metabolites, but there's a fair bit in common even between plants and humans and microbes. This is just a screenshot of the Golm database. They've changed it a lot. It used to be one of the worst and least user-friendly websites I'd ever seen, but finally, I think, they got someone who knew how to build a website. So there are tools that allow you to do searches. You can search for names, you can select by CAS number, you can choose molecular mass, other things. So that's a new addition. You can do spectral library searches, so here's an input spectrum in their format, and you can do a search with that. So whether it's the regular search or MS analysis, those are now much more accessible and much more usable. Oliver Fiehn's library, the FiehnLib GC-MS database: like Golm, it has retention index data; like Golm, it has MS data. It has about a thousand compounds, so about three-quarters of what Golm has. Very broad coverage, a bit more mammalian-centric or human-centric. And the data is of sufficiently high quality that a number of companies have used it and made it part of their central library. So it's important to remember that the Golm and FiehnLib databases are for metabolomics, whereas the NIST library was never intended for metabolomics; it just happens to have become the default. I think NIST continues to try and expand it and understands its role in metabolomics. So then there's LC-MS, and it has the same spectral issues as GC-MS; take both examples that I showed, they're largely the same. Same issue where one peak can often mean several compounds.
We look at the spectra underneath the peaks and we match them to reference spectra. The issue with LC-MS is we don't have this incredibly reliable index called the retention index. Retention time means almost nothing in LC-MS unless you're working in exactly the same lab, running exactly the same column, exactly the same method every day. But comparing between labs, comparing between databases, it's hopeless. Similarly, many of the ESI-MS mass spectra are also not comparable between systems; there's no standardized method for ionization or collision-induced dissociation. So that does make it a little more challenging. The Metabolomics Society came up with the Metabolomics Standards Initiative, and it proposed four different levels for metabolite identification. This system still persists, and it was largely put together under the motivation of people who were doing LC-MS; the people doing NMR didn't have much of a say, and the people doing GC-MS didn't have much of a say. The highest level they identified is a positively identified compound, where you match the compound to an authentic standard. The standard has to exist in your lab; you measure it, you run it on the same spectrometer under the same conditions, and you see identical peaks. The next level down is called a putatively identified compound, and that's where you match two pieces of evidence: say, an accurate mass to within 2 ppm, or an MS/MS spectrum, plus the retention time or retention index. The third level down is called a compound identified by mass, or putatively identified as a compound class. Now, this one they've had a lot of debate about. Some people just simply say, I can identify an organic acid, okay. But what this category has essentially become is the catch-all for anything that's identified by mass alone. And this is where about 98% of all metabolomics is sitting right now. Almost all of the compound descriptions are at what I would call this third category, where compounds are identified purely by the mass. We're not using retention index data, we're not using MS/MS data, and we're not using a standard. This is really problematic, and it's something that we're trying to teach you not to do. And then the lowest class is saying, I have a feature. So it's an unknown compound, and that also accounts for 90, 95% of the data that's being dumped into metabolomics databases or described, particularly by vendors, who say, I've got this wonderful new instrument and I can see 10,000 features or 15,000 features. As I'll show you later, about 80 to 90% of that is pure noise or garbage. There are many cases of genuinely unknown compounds that we just haven't identified, but that's a separate issue. So we need to be up here in the top two levels, and we need to avoid these lower ones. Unfortunately, most of metabolomics is stuck around this bottom end. Yes? Yeah. But there are cases where there are polar and non-polar compounds, and you can fairly easily distinguish between something that comes off in the void volume and something that is retained. So this definition is sort of evolving, and they want to rewrite some of it, partly to accommodate the fact that there are other modalities for detecting compounds, and the fact that there's been this unfortunate trend towards essentially identifying everything purely by the mass. Yeah. So my advice to you is: only report things you can positively identify. Stick with 200.
We're so anxious, I think, partly driven by vendors, partly driven by certain groups, to push towards claiming we've seen all these things. Our own experience, when we've taken quote-unquote untargeted data, working with compounds that had been identified purely by mass: when we tried to validate them, they weren't there. And that's happened not just to us but to many, many groups. I think we can say unequivocally it's a failure. It's a failure in the approach. There's no point continuing with that. I think what we need to do as a community is work towards getting more authentic standards, but also just be satisfied with identifying 200, 500, maybe a thousand compounds. Be satisfied with quantifying them as well. If you can quantify and identify even just 200 compounds, you're doing about 10 times better than what most people do in proteomics. It's just that we haven't, as a community, in proteomics, metabolomics, transcriptomics, done a really good job of thinking about quantitation. And yes, in metabolomics we can be frustrated by the fact that we're not identifying thousands, but if we positively identify them, then we can have much more confidence about the biology we're thinking about, about the biomarkers we're claiming, about the discoveries we're eventually making. I think there's a tremendous opportunity for people to expand these libraries, to prepare more authentic compounds as a community effort. I mean, it's partly how we sequenced a lot of the human genome; it was a community effort for many, many years. It's how we sequenced most bacteria: community effort. And we just haven't thought about that for a long time because of these mega-projects. So I think it is important. If we want to expand, if we want to go beyond a thousand, then if every one of us characterized one compound each, that's 18 new compounds for the community; another thousand scientists, that's another thousand compounds for the community. So LC-MS, which differs from GC-MS: usually you'll find that it's better for things like lipids, hydrophobic molecules; people have been quite successful with larger compounds; you can see the bases, you can see amino acids, some of the things that aren't so obvious for GC-MS. To do the identification properly, you need MS data, you ideally want MS/MS data, you also want retention time information, other pieces, and ideally you want the authentic standard or internal standard. What people, as I say, still continue to do, and I'll show you how to do it but not recommend it, is mass matching to figure out what's there. It's more than nothing, but it's not something to hang your hat on. So, different tools, different resources: ChEBI, PubChem, ChemSpider, HMDB. All of these support molecular weight searches. PubChem has an interface that isn't the most convenient, but you can search for certain masses, from 89.000 to 89.099 amu; that's how you'd code your search in PubChem. You press go, and boom, you get 400 answers. Now, PubChem is our largest collection of compounds. It's one that a lot of people use, but I think somewhat naively: only 1% of the compounds in PubChem are natural products. Everything else is synthetic. That means 99% of the stuff shouldn't be in your sample. Problem is, you don't know what's synthetic and what's natural unless you're a very good chemist or have an amazing memory. Some people can look at the names and say, yeah, that looks like it's a real compound, it's got to be a natural product. No, it's not very obvious. What's that? Yeah, well, their intent wasn't really to serve the metabolomics community; it was really to serve the chemistry community.
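Here is a toy version of the PubChem-style mass-window lookup just described: filter a compound list by a monoisotopic mass range. The four masses are real monoisotopic values, but the four-entry "database" is obviously illustrative:

```python
# Toy mass-window search over a tiny illustrative compound list.
COMPOUNDS = [
    ("alanine",  89.04768),
    ("lactate",  90.03169),
    ("glycine",  75.03203),
    ("pyruvate", 88.01604),
]

def mass_search(low, high):
    """Return the names of compounds whose monoisotopic mass is in range."""
    return [name for name, mass in COMPOUNDS if low <= mass <= high]

print(mass_search(89.000, 89.099))  # ['alanine']
```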
So there are people that are trying to see if they can winnow down PubChem and identify natural products, partly by similarity to other known natural products. Some of the information based on the source library a compound came from can kind of give you a hint that it might be a natural product. So they do have a natural product subset, I think, that they've flagged, so you could grab that one and say, okay, that's my natural product set. But some of these could be really exotic plant compounds found only in obscure desert areas in the southern Sahara, so you're not going to find them in any of us. So again, it's kind of useless. And especially some really strange medicinal plants where there are lots of really cool compounds, but again, no one eats them, so you're probably not finding them in our bodies. So this is an issue, and it's a bit unfortunate, because a lot of people still go to PubChem as their primary source for metabolomics, and it was never intended for that purpose. On the other hand, ChEBI, which stands for Chemical Entities of Biological Interest, was somewhat developed for metabolomics. So instead of having 50 or 60 million compounds, it has 38,000 compounds. But there's nothing in ChEBI that says it's specifically for metabolomics; it's just anything of biological interest. So they'll have some interesting buffer compounds, they'll have some interesting exotic toxins, because they happened to be on the front cover of Science. They'll include things because that's the topic of the week or the compound of the day. So again, it's not specifically designed for metabolomics, but obviously these compounds are typically more biological. Through ChEBI you can search by weight, and it gives you a range, so it's actually a little more intuitive than PubChem for doing the search. You can also do more advanced MS searches, where it's not just molecular weight: you're actually looking for molecular weight ranges, you might be able to look for positive, negative, or neutral ions, you might be able to search MS/MS. So HMDB, MassBank, METLIN, and also the NIST resource have that capability. These are really intended more for people who know a little bit about mass spec and are working with real mass spec data. Now, this particular slide is outdated, and I think Allison can probably give me a more current one, because the Human Metabolome Database has gone through a bit of an upgrade in the last week. But it does have the capacity to do searches with masses, to select what kind of ions and adducts you want to look at, to select subsets within the database, to select the biofluids that you happen to be looking at. So this makes it a fairly powerful approach, because it's able to include what are called adducts. These are things you pick up when you do an LC-MS run: sodium or potassium adducts, or chloride or ammonium adducts, depending on whether you're in positive or negative ion mode. They sort of contaminate your spectrum, but they are real species, and they'll have slightly different masses than what you would expect. With that ability to handle adducts, what you're dealing with is not just 40,000 compounds but perhaps up to 400,000 masses, which can be discouraging, because instead of just giving you one answer, it can give you perhaps a dozen or more. The system allows you to select from a bunch of different databases, or at least it used to; I'm not sure if you guys changed that. What happened was that HMDB went through an
upgrade last year, and a lot of the mass-searching capabilities that were there disappeared, and so we're just bringing those back now and enhancing them. It also allows you to do some mixture deconvolution: if you have a bunch of masses that you've seen, you can enter them, just like you entered a bunch of chemical shifts, and then potentially identify candidate compounds. So in addition to the mass search, it also has MS/MS spectra. We've got several thousand experimental MS/MS spectra, collected on triple quads, Q-TOFs, and other instruments, at different impact energies. So this is where you're trying to match a complicated tandem mass spectrum to an existing compound. Different spectrometers will produce different intensities, and they'll also produce different ion fragments. But one thing that has emerged as a bit of a pleasant surprise is that, almost regardless of the instrument, you will still see the same pattern of masses; not intensities, but the same pattern of masses. Which suggests that it is possible to identify a compound through tandem MS/MS matching, not unlike what we're doing with EI-MS in GC-MS. So that's, as I say, a bit of good news. It was a surprise, but now that people are starting to measure all these different compounds on all these different platforms, Thermo, Waters, Q-trap, ion trap, FTMS, we're seeing the similarity. It's not perfect, but it's often good enough to be pretty confident. And we're also moving towards having multiple collision energies, low, medium, and high; typically it's 20, 40, 60 that we tend to use as a community. Unlike the MS search, this is designed for searching for one compound at a time, so you produce a list of peaks; this is making use of experimental data. You can go one more, and this is something that's just come out last week, I guess. This is using prediction, where we predict tandem mass spectra for a compound; this is CFM-ID. There are some vendors that do this, but it's done in a fairly primitive way. Allison was involved in building the interface for this one, so she can tell you all about it. So what it is is a web server, and it has a large library: every compound in the HMDB has had its tandem mass spectra predicted, so that's 40,000 compounds. We've also done the same thing for KEGG, so that's about 30,000 compounds; they don't all overlap, KEGG has a lot of microbial and plant metabolites. But if you have a tandem mass spectrum you've collected for something that you've isolated or been able to collect, then you can compare it to the predicted ones and see if there's a decent match. Now, predicted mass spectra are not perfect; there are flaws. But the prediction algorithm in this case uses machine learning, and it's so far the most accurate one that we have seen. It has several options. The one that most people would be interested in is the compound identification option. If you click on that, it simulates the collision process. It imagines all the fragments, that's right, like the b and y ions, so it can predict those; it does peptides, it also does lipids. It's also learned things like McLafferty rearrangements; it's learned a bunch of other things. So it can take many compounds, predict how they should fragment, predict some of their intensities, and then it also computes sort of a match factor, using the Jaccard score, to identify which compounds match most closely. That's right, that's right. So, this is an example, but you can enter your set of masses and their intensities.
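As a rough illustration of that Jaccard-style scoring, here is a sketch that treats the observed and predicted fragment lists as sets of binned m/z values and scores their overlap. This is just the idea, not the actual CFM-ID scoring code, and the peak lists are made up:

```python
# Minimal sketch of a Jaccard-style spectrum comparison (toy peak lists).
def binned(peaks, tol=0.01):
    """Round each m/z to a bin of width tol so near-equal masses compare equal."""
    return {round(mz / tol) for mz in peaks}

def jaccard(observed, predicted, tol=0.01):
    a, b = binned(observed, tol), binned(predicted, tol)
    return len(a & b) / len(a | b)

observed_ms2  = [59.012, 89.023, 147.067]
predicted_ms2 = [59.014, 89.024, 101.040, 147.068]
print(jaccard(observed_ms2, predicted_ms2))  # 3 shared bins of 4 total -> 0.75
```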
You can choose how many hits you want to record, and you can choose a mass tolerance, whatever is typical for how accurate your mass spectrometer is, whether it's a triple quad or a Q-TOF or whatever. And then you press go, and about 15 seconds later your answer is revealed. It shows the mass spectrum, it shows the level of matching, it shows the options it gives you, its overall score. It's all interactive and viewable. This tool is being migrated into HMDB; I don't know if that's quite done, but as I say, you guys have been given the website, and you can use that tool. The paper's online now; I think it should be quite useful for the community. METLIN is another great resource, and it has a very nice interface; I think Allison and you guys were trying to build a similar interface for the mass matching. What's nice about the METLIN resource is it allows you to pick and choose which adducts you want. You can click on two, you can click on ten, you can click on all of them. You can check which mode you want. And so you enter your mass, your tolerance, positive mode, and choose the type of adducts, and then just press go; it will look up its database of spectra. And there are, I think, about 8,000, getting up to 11,000 spectra; 8,500 are peptides, so about 3,000 really correspond to what we call real metabolites. So take an example, submit, and it will produce a list with the compounds, names, structures, links, and occasionally some of them will have links to their tandem mass spectra. So METLIN not only supports pure mass searching, where it's just a single mass, but, like the HMDB, tandem mass searching, where you upload a peak list and press go, and it will look through its experimental list of, as I said, about 3,000 real spectra. Now, the limitation of course with METLIN is that it only searches its known spectra; it doesn't have predicted spectra. The ideal system would allow you to search both: if the real spectra are there, it would take them, and if the real spectra aren't there, it would take the predicted ones. So CFM-ID only has predicted, METLIN only has real, HMDB only has real. But there is a website called MetFusion which has actually put the two together, the real spectra and predicted spectra. Now, MetFusion's predictions aren't that great, but it's a cool concept, and a lot of people like the concept and are using it and have had some success. The problem is that, I think, MetFusion ended up using PubChem, which was kind of useless, because those compounds, as I say, almost none of them are metabolites. Is METLIN not free?
No, it's free. MetFusion's free. All the ones I'm mentioning, except for the NIST-AMDIS one, are free, and that's what we're sort of trying to emphasize in this course, just to make sure that people can try and access the free stuff. Free stuff. So I hope in another week or two HMDB will actually have those capabilities, being similar to MetFusion in that regard. So I mentioned this already: the fact is, with LC-MS we do see adducts, and in terms of the numbers, 80 to 90 percent of the peaks that we see in LC-MS are actually junk data: adducts, noise, multiply charged peaks, or other things. So adducts, as I said before, are cases where the parent compound picks up a sodium or some other ion, because there happens to be a negative or a positive charge, depending on whether you're working in positive or negative ion mode. And so these are extra peaks. So for this one compound we would see six peaks: these are the adducts, and then the isotope peaks. So if you didn't know that these were isotope peaks, or if you didn't know there was an adduct, you'd count a number of compounds, but it's just one. And this is a typical scenario, where people claim, I've got 15,000 features. No: divide that by about 10, and that represents the number of compounds. So this is a list of some of the adducts that people see. You can see the potassium, sodium, lithium combinations, methanol, some with hydrogen, some with potassium, some with two sodiums. You can get even more extensive ones; this is Oliver Fiehn's adduct table, and it goes on and on. So these are all possibilities that can actually happen with a single compound. Now, normally you wouldn't see all these adducts, because people don't run samples with both sodium and potassium and acetonitrile and methanol and everything else. But this does show you the range of possibilities that are there, leading to extra peaks. Yes? That one would only arise if you are running electrospray with methanol as your solvent, or acetonitrile as your solvent. So if you don't have that as your solvent, then that will never happen. If you are running with it, it's rare, but evidently it has happened. So it's hard to know which ones are going to show up. Certainly the potassium and sodium ones are very, very common. But in terms of which molecules form which adducts, we don't know. That's something that would have to be studied through databases, statistics, machine learning. It's a worthwhile thing; I think it would really help if we could figure out which compounds are more prone. I think ones that are very strongly charged, things that have phosphate or sulfate, are much more likely to form sodium adducts, versus something like a sugar that's just got a hydroxyl group or something like that. So the polarity is one indicator of how frequently they will form adducts, or how many types of adducts. I think there's intuition; people who have been doing mass spec for a long time typically will know which ones to look for more frequently, and certainly if we could put that knowledge into some kind of program, then I think everyone else would benefit from that. It's a little bit like that, it is. That's right. That's right, yeah. There's a nice resource called MZedDB that's maintained out of Aberystwyth in Wales. A lot of people don't know about it, but one of the pioneers in metabolomics, John Draper, set it up, and they have some nice tools and utilities. I like using this database and this web server a lot. It's a little bit like METLIN, but it uses a different set of data. So you can put in a compound where you just have the molecular formula, and then it calculates all the possible masses that may be consistent with the various adduct forms.
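Here is a minimal sketch of what an adduct calculator like that does: given a neutral monoisotopic mass, compute the m/z you'd expect for a handful of common singly charged adducts. The mass shifts are standard values, but this short table is only a small subset of the full adduct lists on those sites:

```python
# Expected m/z values for common singly charged adducts of a neutral
# compound, given its monoisotopic mass (shifts include the electron mass).
ADDUCT_SHIFTS = {
    "[M+H]+":   +1.007276,
    "[M+Na]+":  +22.989218,
    "[M+K]+":   +38.963158,
    "[M+NH4]+": +18.033823,
    "[M-H]-":   -1.007276,
    "[M+Cl]-":  +34.969402,
}

def adduct_mz(neutral_mass):
    """Return {adduct name: expected m/z} for a neutral monoisotopic mass."""
    return {name: neutral_mass + shift for name, shift in ADDUCT_SHIFTS.items()}

# glucose, monoisotopic mass 180.06339
for name, mz in adduct_mz(180.06339).items():
    print(f"{name:9} {mz:.4f}")  # e.g. [M+H]+ 181.0707, [M+Na]+ 203.0526
```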
In addition to adducts, other things happen too. Again, just as was mentioned for proteomics, where you have modifications such as acetylation and methylation on proteins, molecules in mass spec can undergo neutral losses, where they lose a hydroxyl or a water or something like that. These neutral losses are in addition to the adducts, so on top of the six peaks we saw for the other compound, you can have all these neutral losses, which produce another set of six peaks. So for one compound, simply passing through electrospray, you still get a bunch of extra peaks, and being able to identify these neutral losses also helps distinguish real compounds from noise. METLIN, HMDB, and MZedDB can handle adducts. METLIN and MZedDB can also deal with multiply charged species; I think HMDB does that too, or should. METLIN, though, is very good at handling some neutral-loss species, which neither MZedDB nor HMDB do.

Nevertheless, when you do any kind of simple mass search or mass-range search, given all these adducts and neutral losses, you're going to end up with lots of hits and therefore lots of false positives. So how can you sort this out? There is software you can buy, now part of many tools, to remove adducts and multiply charged species. There are software tools in many instrument vendors' systems to help remove fragments, the neutral-loss and simple breakdown products. There are tools for removing and consolidating isotope peaks, and there are methods for removing noise, like running a sample blank, and/or simply testing whether technical replicates give you the same peaks over and over again. The remarkable thing is that when people do these technical replicates, they still find about a 20% difference in peaks, so there's 20% noise, if you want, on top of the noise from the blank sample. So vendors will tell you about a great new instrument that can routinely detect 15,000 features, and some people doing metabolomics will say the same thing. But if you go through these processes: deal with all the adducts and get rid of those, and you might drop by perhaps 3,000 features; remove the multiply charged species, which you could also count as adducts, and you're down to 10,000; deal with the neutral losses, which are essentially the same compound that just happened to lose a water or a hydroxyl group, and you're down to 8,000; remove all the isotope peaks and you're down to maybe 3,000; remove the noise and you're down to 2,500 features. That's a more typical and realistic value, so about 1 in 6 are real peaks, roughly 15 to 20%. Then people will do this for negative mode: they might also get 10,000 features, whittle it down, and end up with maybe 1,500 negative-mode peaks. The overlap between the two modes is maybe around 50%, so the net result is that maybe 25,000 features might represent 3,000 to 3,500 compounds at most. There are tools to help with this, some from vendors, but also freeware tools like MZmine and MAGMa, and I think MetFusion does it as well. So that's an important step, an important process.
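Here is a toy Python sketch of that winnowing idea: walk the peaks from low to high m/z and flag any peak that sits at a known fixed offset (an adduct swap, a water-loss pair, a 13C isotope spacing) from a peak already seen. Real tools like MZmine also use retention time, peak shape, and intensity correlation; the offsets and peaks here are a minimal illustrative subset.

```python
# Mass offsets (Da) that can relate two peaks of the same compound.
RELATED_OFFSETS = {
    "Na-for-H adduct swap": 21.981944,   # [M+Na]+ minus [M+H]+
    "K-for-H adduct swap":  37.955882,   # [M+K]+ minus [M+H]+
    "water-loss pair":      18.010565,   # [M+H]+ minus [M+H-H2O]+
    "13C isotope spacing":   1.003355,   # M+1 isotope peak
}

def ppm_match(a, b, tol_ppm=10.0):
    return abs(a - b) <= tol_ppm * 1e-6 * max(a, b)

def collapse_features(mzs):
    """Return (kept, explained): peaks kept as distinct vs. flagged as
    related to an earlier peak. Keeps the lowest-mass member of each
    group as the representative, which is a simplification."""
    kept, explained, seen = [], [], []
    for mz in sorted(mzs):
        reason = next((name
                       for base in seen
                       for name, off in RELATED_OFFSETS.items()
                       if ppm_match(mz, base + off)), None)
        seen.append(mz)
        (explained if reason else kept).append(mz)
    return kept, explained

# Five features that are really one compound plus its satellites
peaks = [163.0601, 181.0707, 182.0740, 203.0526, 219.0265]
kept, explained = collapse_features(peaks)
print("distinct compounds:", kept)
print("collapsed features:", explained)
```

Run on these five invented peaks, four collapse onto one representative: the same five-to-one shrinkage, feature count to compound count, that the lecture numbers describe at scale.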
Another thing we need to do is work with high mass accuracy. You guys have seen this table before, but this is one of the really important breakthroughs that's happening because of technology: we now have very, very accurate mass spectrometers, and the technology is becoming much more accessible. When you have very accurate mass data, you can actually calculate molecular formulas. So that's that third class of compound identification we talked about: we can get you a class of compound, maybe not the identical compound, but we've got the mass, so we should be able to get the formula. The formula doesn't identify the compound, but it gives you an idea of what it is, or what it could potentially be. There are tools for this: MWTwin, which I think is commercial, where you can enter masses; HiCam, another one where you can enter the mass and it will generate potential formulas; and the freeware one I just mentioned, MZedDB. That one actually does formula generation using some of the Seven Golden Rules that Oliver Fiehn described, so it's again another really useful hidden gem that most people don't know about.

If you know the molecular formula, then you can enter that formula into a database like PubChem or others. So from the accurate mass you go to the molecular formula, and then you can start searching. The idea is that the molecular formula, if you've done it properly, has narrowed things down: it's used chemical information, not just the mass information, so it has restricted things further, and the search by molecular formula is much tighter once you've done that filtering from mass to formula. PubChem supports it, with the usual caveat that some of the things you're going to hit probably aren't real natural products; ChEBI supports it; HMDB supports it; many do. As I said, the molecular formula calculators use a bunch of intelligent rules: ideas about the restrictions on single, double, and triple bonding given the atom types and atom numbers, plus some information about possible structures and topologies that they can build in. This set is called the Seven Golden Rules, and Oliver Fiehn described them about six or seven years ago. He actually has a software package that does this, which you can download as an Excel spreadsheet, and as I said it's the one that has also migrated into MZedDB. It would also be helpful, I think, if it were in HMDB. Hint, hint.

Now, I guess we're down to the last few minutes, and I've got maybe another five minutes of slides. I don't know if people are willing to sit around; use the stickers? What's that? Okay, red sticker please if you're okay with going for another five minutes of slides. Okay, alright. I know we're thirsty and famished. So the point is that molecular formulas allow you to shrink the space. One estimate: if we just looked at all compounds containing carbon, hydrogen, nitrogen, sulfur, oxygen, and phosphorus, which covers the vast majority of standard organic compounds, and that are under 2,000 Daltons, we know there are at least 8 billion compounds that could be created that way, that are valid and follow the general structural rules. If you use the filtering process of the Seven Golden Rules or MZedDB, we shrink that down by a factor of about 15. The ones we actually know are a much smaller set, around 700,000, and if we restrict to natural products, we shrink down to about 50,000 compounds that would be potential candidates from that subset. So using both the molecular formula rules and the fact that the compound has to be natural shrinks your space by a huge amount, and the searches don't have to be as hopelessly large as some people think. That's why, at least for CFM-ID and many other things, we're choosing only natural products to search for; no point complicating it.
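As a rough illustration of what these formula generators do, here is a minimal brute-force CHNOPS search in Python: enumerate element counts, keep formulas whose monoisotopic mass lands within a ppm window of the target, and apply a crude valence sanity check. The real Seven Golden Rules add more filters (element-ratio limits, isotope-pattern checks, and so on); this is just the skeleton of the idea, and the element ranges are arbitrary for the example.

```python
from itertools import product

# Monoisotopic masses of the CHNOPS elements (Da)
MASS = {"C": 12.0, "H": 1.00782503, "N": 14.00307401,
        "O": 15.99491462, "P": 30.97376151, "S": 31.97207069}

def fmt(counts):
    """Pretty-print a formula, skipping zero counts."""
    return "".join(f"{e}{n if n > 1 else ''}" for e, n in counts if n)

def formulas(target, tol_ppm=1.0):
    """Brute-force CHNOPS formulas within tol_ppm of a neutral mass."""
    tol = target * tol_ppm * 1e-6
    hits = []
    for c in range(1, int(target // MASS["C"]) + 1):
        for n, o, p, s in product(range(6), range(12), range(3), range(3)):
            base = (c * MASS["C"] + n * MASS["N"] + o * MASS["O"]
                    + p * MASS["P"] + s * MASS["S"])
            h = round((target - base) / MASS["H"])   # hydrogens to make up the rest
            if h < 0:
                continue
            mass = base + h * MASS["H"]
            if abs(mass - target) > tol:
                continue
            rdbe = c - h / 2 + n / 2 + 1             # rings plus double bonds
            if rdbe < 0 or h > 2 * c + n + 2:        # crude valence limits
                continue
            hits.append((fmt([("C", c), ("H", h), ("N", n),
                              ("O", o), ("P", p), ("S", s)]), mass))
    return hits

# Example: the neutral monoisotopic mass of glucose
for f, m in formulas(180.06339):
    print(f, f"{m:.5f}")
```

At a 1 ppm tolerance this small search space collapses to essentially a single candidate, C6H12O6, which is the point being made about accurate mass: the tighter the mass window, the fewer formulas survive.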
Also calculated in that same paper were the frequency distributions for different formulas: with a small molecular weight there are far fewer formulas, and with a large molecular weight many more. We can see this growing roughly linearly with molecular weight, from about 20 possible formulas at 200 Daltons to something like 70 or 80 at 300 Daltons. So bigger molecules, more choice. The other thing that was pointed out is that if you include isotopic abundance, which you can with highly accurate mass spectrometers, along with just good mass accuracy, you can hugely decrease your search space. So instead of having hundreds of molecular formulas at, say, the 10 ppm level, at the 1 ppm level, which is typical of Orbitraps, the number of formulas shrinks, and then if you use the isotopic abundance as well you're down to a number of formulas you can almost count on one hand. Huge wins if you're using the information from high-resolution mass spec to restrict your formulas. This is just an example where a more highly resolved spectrum lets you see the isotopic abundance, and in this way, even for a fairly large molecule, you get the precise formula, and only one formula.

Then there are things called isomers, like leucine and isoleucine, or methyl-2-pentanone and 2-hexanone: molecules with the same molecular weight and the same atomic composition, which also sometimes give you identical MS spectra, and it's sometimes hard to distinguish them. That's just a fact. There are tools that will generate different isomers and help you choose between them; this is for dealing with completely unknown compounds. We still really don't have a good idea of how many different isomers there possibly are, and that's a scary issue: millions are listed here, while the number actually known, listed on the far right, typically represents only one percent or a fraction of that. As I said before, a lot of the databases mix non-metabolites with metabolites, plant metabolites with animal metabolites, microbial metabolites with drugs, buffers with other reagents, and that makes things really problematic. There are a lot of papers where people have gotten completely nutty hits because they just didn't choose the right database. If you know which organism you're looking at, use that information and go to an organism-specific database. If you know you're looking for drugs, look at the drug databases. If you know you're looking at food components or phytochemicals, look at KNApSAcK or other plant-specific databases.

Other ways of getting around this: rather than simple mass matching, people are moving toward chemoselective labeling, others are using very targeted mass spectrometry with kits, and others are using computer-aided mass matching. I won't go into all of these, since we're short on time, but this is a really elegant approach that was developed by Liang Li, and there are other groups around the world doing similar kinds of things. It's similar to iTRAQ for proteomics, using a heavy label and a light label. You're labeling or chemically modifying compounds, in this case not with trimethylsilane but with a 13C-labeled dansyl chloride that reacts with active groups at high efficiency. You have a heavy label and a light label, and just like with iTRAQ you can isolate compounds, look for paired peaks, and measure the intensities, so you can actually quantify them, and you can also be certain of which compounds are real and which are fake, false positives. This was done five years ago, and they were able to detect about 120 compounds, or I guess 90 that they both identified and quantified, and they could get down to 30 nanomolar.
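The paired-peak logic is simple enough to sketch in Python. Assuming a heavy/light tag mass difference of about 2.00671 Da (two 13C atoms, the kind of spacing a 13C2-labeled dansyl tag would give), a real labeled compound shows up as two peaks at exactly that spacing, and the intensity ratio gives relative quantification; anything unpaired is suspect. The peak values below are invented for illustration.

```python
TAG_DELTA = 2.00671  # heavy-minus-light tag mass difference (Da), two 13C atoms

def ppm_match(a, b, tol_ppm=10.0):
    return abs(a - b) <= tol_ppm * 1e-6 * max(a, b)

def find_pairs(peaks):
    """peaks: list of (mz, intensity). Returns (light, heavy, ratio) trios."""
    pairs = []
    for mz1, i1 in peaks:
        for mz2, i2 in peaks:
            if ppm_match(mz2, mz1 + TAG_DELTA):
                # light/heavy intensity ratio gives relative quantification
                pairs.append((mz1, mz2, i1 / i2))
    return pairs

peaks = [(412.1213, 8.0e5), (414.1280, 7.6e5),   # a real labeled compound pair
         (389.0455, 2.1e5)]                      # unpaired peak: likely noise
for light, heavy, ratio in find_pairs(peaks):
    print(f"pair {light:.4f}/{heavy:.4f}, light/heavy ratio {ratio:.2f}")
```

The unpaired peak at 389 simply never matches, which is how this chemistry turns 15,000 raw features into the roughly 1,500 trustworthy ones mentioned next.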
Now, they can also do this with carboxyl labeling. Derivatization, just like for GCMS, has some advantages with LCMS: you can convert a compound that was invisible to UV into something that is visible, since dansyl chloride is very UV-absorptive, and you get much better ionization efficiency, with intensities increased by a factor of 10. Because the tags are actually hydrophobic, you can do C18 HPLC with them and get great separation. And you can do the quantification without having authentic standards for every single compound; you just need one chemical standard, which is the dansyl chloride. As I said, it allows you to distinguish real peaks from fake peaks, and that's why, instead of seeing 15,000 features, with this method you'll typically see around 1,500 real peaks. There are kits now being sold for doing targeted mass spec. They use selected reaction monitoring and multiple reaction monitoring, and they include deuterated or 13C-labeled standards, so you can be absolutely certain of the compounds you're identifying, and you can also be very certain about their quantification. Hong mentioned she's developing kits that way as well. So this is happening in many labs in metabolomics, using the same idea of SRMs and MRMs, making authentic standards, using little 96-well trays and simple mass spec techniques, and it's very effective. This is an example, again, measuring from 10 nanomolar to almost 10 millimolar using these kits; they're reproducible and very consistent.

The other part we talked about was the idea of identifying unknowns, or completely unknown compounds, and this is a technique called computer-aided structure elucidation. One approach tries to predict metabolites from known metabolites, which is called top-down, and the other essentially assembles metabolites from molecular fragments, which is called bottom-up. The top-down approach is actually feasible. You can take the metabolites in KEGG or HMDB or FooDB or whatever and do in silico transformations: you can predict their sulfate, hydroxyl, and glucuronide variations, the phase I and phase II transformations, the microbial transformations. So you can go from maybe 10,000 endogenous metabolites to about 400,000 metabolic transformations, and from there you can predict the masses, the formulas, even the tandem MS. Now you've created a synthetic database of potential metabolites, and you can compare your compound's spectra against those synthetic metabolites. This is done with a database called MyCompoundID. It computationally metabolized all the compounds in HMDB, and the net result was about 400,000 theoretical metabolites. You can go to the database and run it, and it lists all the different reactions it considers. It doesn't have the structures; it simply created the masses using a little bit of chemical wizardry. If you do the search you'll get a whole bunch of hits, and you can do the searches just for unmetabolized compounds, so zero reactions, or one reaction, or two reactions; two reactions generates a synthetic list of about 4 million compounds. This has been really useful: people have found that the number of hits they get from MS studies goes up by a factor of 3 or 4, which is huge.
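The top-down idea is easy enough to sketch as well. Here is a minimal Python version: take a few seed metabolite masses, expand them with common phase I and phase II mass shifts for up to two reaction steps, and match an observed mass against the expanded library. The seed compounds, reaction set, and tolerance are illustrative only; MyCompoundID itself uses a much larger reaction set applied to all of HMDB.

```python
# Common metabolic transformation mass shifts (Da); an illustrative subset.
REACTIONS = {
    "hydroxylation (+O)":        15.994915,
    "methylation (+CH2)":        14.015650,
    "sulfation (+SO3)":          79.956815,
    "glucuronidation (+C6H8O6)": 176.032088,
    "dehydrogenation (-H2)":     -2.015650,
}

# Two hypothetical seed metabolites with their neutral monoisotopic masses.
SEEDS = {"tyrosine": 181.073893, "dopamine": 153.078979}

def expanded_library(seeds, max_steps=2):
    """Apply up to max_steps reactions in sequence to each seed mass."""
    library = dict(seeds)
    frontier = dict(seeds)
    for _ in range(max_steps):
        nxt = {f"{name} + {rxn}": mass + shift
               for name, mass in frontier.items()
               for rxn, shift in REACTIONS.items()}
        library.update(nxt)
        frontier = nxt
    return library

def match(observed, library, tol_ppm=5.0):
    """All library entries within tol_ppm of an observed neutral mass."""
    tol = observed * tol_ppm * 1e-6
    return [(n, m) for n, m in library.items() if abs(m - observed) <= tol]

# An observed mass that turns out to look like hydroxylated dopamine
for name, mass in match(169.073894, expanded_library(SEEDS)):
    print(name, f"{mass:.5f}")
```

Two seeds and five reactions already give a 60-odd entry library after two steps, which is the same combinatorial growth that takes 10,000 endogenous metabolites to 400,000 predicted ones, and to about 4 million with two reactions.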
The other approach is the bottom-up approach: we know a little bit about chemistry, so let's create synthetic molecules by computer and see if we can also predict their properties. This is the traditional computer-aided structure elucidation method; it's how we actually determine the structures of novel compounds using NMR and mass spec. It's very difficult. There are only a few programs that do this well, and only one company has a viable one that works, so I'm not going to talk about it, because it really isn't feasible. So with that, we are done, and I think we can take a 30-minute break.