Okay, I think it's started — at least it says it's started. Welcome back. Hopefully you've gotten acquainted and nourished, and now you're awake for at least another half hour. We've done metabolite annotation, or compound identification, using NMR, and now we're going to move to the second phase — call it metabolite identification and annotation, part two. This is all about trying to generate lists. We worked with the Chenomx software, which you'll use a bit more as part of the assignment, but I want to show you the other tools that are available — it's not the only one. Then we'll move into other fields: GC-MS, where you can also do deconvolution, and LC-MS, so we'll learn about those. With each of these methods, all we're interested in doing, as I said, is going from spectra to lists. That's the important thing. I'll also close off with how we can do compound identification with high-resolution mass spec — formula generation, and a little bit of unknown compound identification. So here's the goal, the central focus of today: to go from a spectrum, or multiple spectra, to lists — lists of metabolites, preferably with concentrations, but potentially also relative concentrations. We saw how this is done with NMR: you were doing deconvolution, and some people were wondering, why are we doing this phasing? Why this baseline correction? The analogy I gave was this: imagine we took a photo like we did today, but of each of us individually, and imagine the camera was badly out of focus and, furthermore, the development process produced a very distorted image. If we printed off one of those photos and asked, who is this a picture of? — if it's really out of focus and badly distorted, most of us probably couldn't figure out who it is, even with just the 23 or 24 people in this room.
If the photo is in sharp focus and not distorted, then it's much easier to identify who it is — the facial recognition in our brains is more effective. The process of baseline correcting and phasing was all about making the spectrum less distorted, putting it into focus, so you could compare it to a reference photo: the spectrum of a very pure compound that had been perfectly phased and perfectly baseline corrected. That's what our reference spectra are, and we can only do that deconvolution if the mixture spectrum is also perfectly phased and perfectly baseline corrected. So that's what you were doing. Now, with Chenomx, I think some of us appreciated that it's fairly manually intensive — a fair bit of work. So there are other tools. There's AMIX, which is produced by Bruker. There are other tools that are somewhat automated: autoFit is one, BATMAN is another — one was published two years ago, the other just recently. There's software specific for 2D NMR, which one of your TAs wrote. There are databases that support spectral matching, there are tools in Japan, and there's downloadable software for spectral matching, and I'll go through some of them. As I said, you saw how tedious and manual this is when you do it by hand, so it would be nice to do it automatically, and software is starting to come out. This one was developed, but unfortunately the code has not been released. It uses a compound library: given a spectrum, it identifies the DSS reference peak, calibrates things, and then looks at the various peak clusters just like you were doing, except it does that automatically. It then uses a non-linear optimizer to determine what fits best and to handle the noise. So in this example the black line is the actual spectrum and the red line is the calculated spectrum.
This is one where someone did manual fitting, and it takes anywhere from half an hour to an hour to do all of that fitting. You can see it fits pretty well — there are some problems down in the low-signal, noisy regions, but in those cases the compounds hadn't been formally identified. Here's the automatic fit from autoFit, and there's not much difference, except it's a lot faster. Part of this is that you have to appreciate how hard it is to do manually before you can appreciate how nice it is to have it automated. Automation obviously speeds things up, and measures of precision and recall are very, very high. If you move to a computer system, the result doesn't vary from individual to individual — and different individuals perform differently. We could run a test today: ask everyone here to complete their analysis, and what would you end up with? The odds are that 25 people would produce 25 different answers. The answers might differ just in concentrations rather than identifications, but probably at least half of you would identify some compounds that others wouldn't. By moving to a computer-based system you get something reproducible — even if the computer is wrong, it's reproducibly wrong, and that matters whenever you do statistics. It also avoids user bias, which can persist and grow over time, and it can pick up and deconvolute things that people consistently get confused by. That's another advantage of automation. So you can take real spectra from CSF and urine — or simulated versions: you put a composition in, create what you think is urine or cerebrospinal fluid, change the concentrations, change some of the compounds that are in there, and then run it through this autoFit.
With synthetic urine and synthetic CSF, the fit is essentially perfect. With real CSF, real urine, real serum, the fit is no longer perfect, and that has to do with the fact that humans are imperfect: we're comparing what a human determined against what a computer determined, and we can't necessarily tell which is correct. Is the computer right? Is the human right? Who's wrong? The correlations are still very high — 0.98, 0.99 — but there are some misidentifications, and that's still an issue. Concentrations are also very highly correlated; again, human experts match pretty well with what the software is able to do. As I said, this was published two years ago, but unfortunately the code base has not been released — it actually belongs to Chenomx, so maybe one day they'll make it commercial. There are freeware packages too. For dealing with mixtures, Jeff developed a software tool, MetaboMiner, to let people work with two-dimensional NMR. Some of you may not have heard of 2D NMR, but it's a standard way of collecting spectra: instead of the one-dimensional spectra you were looking at, you see peaks not along a line but in a plane, viewed from overhead the way you'd look at a topographic map. This is a TOCSY spectrum of, I think, urine. The tool has a variety of features for displaying, visualizing, and marking off metabolites, and it has a large library of TOCSY spectra — about 225 — and a large library of carbon-13 HSQC spectra. The key thing for this tool, and for autoFit as well, is prior knowledge: what fluid are you looking at, and what is its likely composition? That gives you a biological constraint as well as a mathematical constraint, and when you have both constraints the problem becomes largely solvable.
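Precision and recall, the performance measures mentioned above, are easy to compute for a compound-identification run once you have a ground-truth mixture to compare against. A minimal sketch — the compound names are made up for illustration:

```python
def precision_recall(identified, truth):
    """Precision/recall for a compound-identification run.

    identified: set of compound names the software reported.
    truth: set of compounds actually present (e.g. in a synthetic cocktail).
    """
    tp = len(identified & truth)  # true positives: correctly identified
    precision = tp / len(identified) if identified else 0.0
    recall = tp / len(truth) if truth else 0.0
    return precision, recall

# Illustrative example: 3 of 4 reported compounds are correct,
# and 3 of the 4 compounds actually present were found.
found = {"alanine", "lactate", "glucose", "citrate"}
actual = {"alanine", "lactate", "glucose", "valine"}
print(precision_recall(found, actual))  # (0.75, 0.75)
```

This only works when the true composition is known, which is exactly why the synthetic-cocktail tests described here are needed to benchmark these tools.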
So MetaboMiner allows you to do automatic processing, and also some semi-automatic compound identification — it looks for things called minimal signature peaks — and people can also do direct annotation. It's been tested on a bunch of artificial, synthetic cocktails: the compounds that are identifiable given the accuracy and sensitivity of the instrument, the number of compounds identified, and the percent that were correct, both for TOCSY and for HSQC — for a synthetic cocktail and for blood plasma or serum — along with measures of recall and precision, which are statistical measures of performance. So it's not perfect, and it is somewhat pH dependent. That's because the reference spectra weren't collected at every pH; they were all typically collected around pH 7. So if you start analyzing a sample that's very acidic or very basic, the method won't work too well. But overall the performance is actually quite good, and given that the software is free, that's a pretty good deal. There are other approaches. You can use databases that are publicly available. One is called HMDB, the Human Metabolome Database. In this case, if you have chemical shifts that you've identified — and you can do this with any standard software — you just produce a list of chemical shifts, feed that list into the query tool in HMDB, type in the numbers, and it will identify likely candidates among the compounds that may be there. There's also software called PRIMe, produced by RIKEN in Japan — a similar approach, but with a much smaller database. Again, you can paste in a bunch of chemical shifts, press submit, and it will give you a list of possible matches. So this isn't fitting: it's not going to give you quantitation, and it will give you possible alternatives.
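To make that shift-query idea concrete, here's a toy version of matching observed 1H chemical shifts against a small reference table. The tolerance, the reference shift values, and the scoring rule are all illustrative assumptions — this is the general idea, not HMDB's actual algorithm:

```python
def shift_match(observed, reference_db, tol=0.02):
    """Score candidate compounds by matching observed 1H shifts (ppm).

    observed: list of chemical shifts picked from the mixture spectrum.
    reference_db: dict mapping compound name -> list of reference shifts.
    Each compound is scored by the fraction of its reference peaks that
    lie within `tol` ppm of some observed shift.
    """
    results = {}
    for compound, ref_shifts in reference_db.items():
        hits = sum(
            any(abs(obs - ref) <= tol for obs in observed)
            for ref in ref_shifts
        )
        results[compound] = hits / len(ref_shifts)
    # best candidates first
    return sorted(results.items(), key=lambda kv: -kv[1])

# Hypothetical mini-database; shift values are approximate textbook
# positions, used here only for illustration.
db = {
    "lactate": [1.33, 4.11],
    "alanine": [1.48, 3.78],
    "acetate": [1.92],
}
observed = [1.33, 1.92, 4.10]
print(shift_match(observed, db))
```

Like the HMDB and PRIMe tools described above, this returns ranked candidates, not a fit — there's no quantitation, just a list of plausible matches.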
So neither the HMDB approach nor PRIMe is intended to do what, say, Chenomx or autoFit do. MetaboMiner, for its part, identifies but does not quantify — that's a different issue. Doing two-dimensional NMR allows you to reduce some of the spectral redundancy, but it's very hard to quantify with 2D NMR, and that has to do with spin relaxation issues in collecting 2D spectra. There's a package you can download called rNMR — it's written in R, but it has a nice user interface for both processing and annotating spectra. It does a little of what MetaboMiner does; in fact, it was recently modified and apparently now uses most of MetaboMiner's algorithms to help with compound identification. The BioMagResBank (BMRB), which is maintained in Madison, Wisconsin, has a large collection of NMR spectra — about the same size as HMDB's — and it also lets you type in peaks and search by proton, carbon, or HSQC peaks to see if it can match something in its database, which can give you a suggestion of what the compound might be. The last one I mentioned was BATMAN, developed at Imperial College by Tim Ebbels and colleagues. It uses a Bayesian method — that's where the 'BA' comes from. Bayesian statistics lets you incorporate prior knowledge, and it also lets you do some exclusive-or type operations if you want, and they've used this program to perform automatic fitting of NMR spectra, rather like the autoFit program I talked about before. It doesn't do quantitation, and it's fairly limited in scale — it can fit maybe 20 or 25 compounds — but it's appealing, and they continue to work a lot on it, because I think this is where people would like to go: they'd like to make these systems automated.
So that fills in, I guess, the holes around NMR. As I say, you don't have to use the Chenomx software, but hopefully this puts it in context: some tools are for quantitation, some for identification; some are for 1D, some for 2D; some for carbon, some for proton; some are free, some aren't. I'm going to switch now to metabolite quantification by GC-MS. The general concept is this: in gas chromatography mass spectrometry we have a chromatogram with peaks — we've seen a bunch of those before — and what we typically do is analyze a peak. Now, as I mentioned, chromatographic separation is imperfect, and often under a single peak there may be three or four compounds with essentially the same migration time. So there's a red peak, a blue peak, a turquoise peak — each of those is a separate compound, but together they produce a single chromatographic peak. They also have separate mass spectra — pretend these are electron ionization mass spectra — and each of these mass spectra tells us there are three compounds, while the integrated area under them tells us relative concentration, if you want. But what are these compounds? This is where we do spectral deconvolution again. We look to see whether this spectrum matches anything in a database — the idea is to build a large database of EI mass spectra, just as we had a large database of proton NMR spectra — and you can scan through and see which reference looks most similar. This one looks the most similar, so if we know the compound that generated that reference, we've identified this compound, and because we know the peak area, we can estimate the relative or absolute concentration. Same thing with the next one: we look through this database of seven or eight compounds and ask, does anything match? Yes, this does. So now we've identified another compound.
So that is spectral deconvolution, for mass spec this time — a little different conceptually, but similar overall. In the case of electron ionization, we saw this before with methanol: the spectrum is typically characterized by multiple peaks because the molecule fragments into multiple small pieces. Here's a more realistic spectrum with multiple peaks. Here is the molecular ion, or parent ion. Occasionally you'll see a larger ion because there's, say, a chloride adduct attached, but not often. So this is the parent ion, these are the fragments — that's a GC-MS electron ionization mass spectrum, and we can run it up against the database. Now, the important thing to remember with GC-MS is that these are typically not the pure compounds; they are derivatized compounds. Volatiles — the aromatic compounds responsible for smells — don't have to be derivatized; you can feed those directly into the GC-MS. But most metabolites we look at from tissues and biofluid samples are not volatile, so we derivatize them with TMS. TMS reacts with hydroxyl and amine groups. There are other variations of TMS, such as TBDMS, and there are also methoximes, which react with certain groups like ketones. Each of these reagents can react at one, two, three, four, or five positions, and each addition adds a certain mass. So what you see is the metabolite plus TMS, or the metabolite plus methoxime. As I mentioned before, GC-MS is typically used to look at the more water-soluble compounds — amino acids, organic acids, sugars, which LC-MS doesn't do a very good job with — as well as fatty acids. It's also limited to relatively low molecular weight compounds, 500 to 600 Daltons, so it misses the lipids, which are typically very big. And gas chromatography, as we mentioned before, is very reproducible, with higher plate counts and higher resolution.
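Since each derivatization step adds a fixed, known mass, you can predict what a derivatized metabolite should weigh. A minimal sketch, using the standard monoisotopic mass shifts for TMS and methoxime derivatization (the glucose example is just illustrative):

```python
# Monoisotopic mass shifts for common GC-MS derivatizations.
# TMS replaces an active H (on -OH, -NH, -SH, -COOH) with Si(CH3)3;
# methoximation converts a ketone/aldehyde C=O into C=N-OCH3.
TMS = 72.0395    # +Si(CH3)3, -H
MEOX = 29.0266   # +N-OCH3, -O

def derivatized_mass(mono_mass, n_tms=0, n_meox=0):
    """Predicted monoisotopic mass after n_tms TMS additions
    and n_meox methoximations."""
    return mono_mass + n_tms * TMS + n_meox * MEOX

# Glucose (C6H12O6, monoisotopic 180.0634) with 5 TMS groups and
# 1 methoxime — the usual MeOX/TMS product seen in GC-MS.
print(round(derivatized_mass(180.0634, n_tms=5, n_meox=1), 4))
```

The predicted value comes out near 569.29 Da, which is why spectra in GC-MS libraries correspond to these derivatized masses rather than the underivatized metabolite.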
So it's a good form of chromatography. The mass spec we use in gas chromatography is also more consistent, more universally applicable, and more standardized than the soft ionization methods used in LC-MS, so the databases you use with GC-MS are actually more useful. Typically, when we want to identify compounds — identifying the known unknowns — most people use the combination of the AMDIS software and the NIST database, from the National Institute of Standards and Technology. The NIST database — it's at version 11 now — has a lot of spectra: a quarter of a million. That's far more than the ~450 in Chenomx or the ~225 in MetaboMiner, and it corresponds to about 212,000 compounds. It also has MS data from ion traps, Q-TOFs, and triple quads for about 3,700 and 4,600 compounds respectively, and additionally it has retention index values for a lot of compounds — about 20,000. So this is the single largest resource of mass spectral data available. Unfortunately, most of the compounds are not metabolites and will probably never be seen in any living system. That's a little frustrating: as large as it is, it's not quite as useful as it might seem at first blush. It's been populated over decades — NIST gets samples sent to them, and a lot of them, as I say, are non-biological: toxins, poisons, pollutants, and things like that. Some are food additives. But a lot of the compounds are really things you'd never see outside a lab, and the ones you would see are very low-abundance poisons, often below the detection limit for GC-MS. So in the early days everyone flocked to this database thinking they'd get all kinds of hits, but they don't — not as many as they'd hoped. There is a tool for searching, and they have structures and mass spec data for all of these compounds.
These are some of the mass spectral searching tools that are part of the NIST database. The other thing you have to couple the database with is the software called AMDIS, which stands for Automated Mass Spectral Deconvolution and Identification System. AMDIS does a lot of what you were doing with the NMR: it deals with the noise, identifies the peaks, does the deconvolution, and does the compound identification, and it's semi-automatic — again, a bit like the Chenomx software. As I said, the idea with your exercise was to show one approach, but it's really very similar regardless of the type of spectroscopy: they're all challenged with identifying compounds, cleaning things up, reducing noise, stabilizing the background or baseline, and then deconvoluting. With the Chenomx software, you looked at the subtraction line and whether the green line zeroed out — that was your visual match factor. AMDIS has a computational match factor, which is essentially a normalized dot product. The masses and their intensities are treated as the components of a vector, and you take the dot product of the observed mass spectrum with the reference mass spectrum, normalize it by the magnitudes of the reference and the query, and scale by a thousand. A perfect match factor is therefore one thousand, and generally a match factor above about seven hundred is considered good enough — the compound is real — though different labs use different cutoffs. So how do you do GC-MS identification? I highlighted this before: the first thing is to prepare a set of standards — you need your reference standards.
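That normalized dot product can be sketched in a few lines. This is a simplified cosine-similarity version of the scoring idea — the actual NIST/AMDIS scoring additionally applies m/z-dependent weighting, so treat this purely as an illustration:

```python
import math

def match_factor(query, reference):
    """Cosine similarity between two EI mass spectra, scaled to 1000.

    query / reference: dicts mapping integer m/z -> intensity.
    Missing m/z values are treated as zero intensity.
    """
    mzs = sorted(set(query) | set(reference))
    q = [query.get(mz, 0.0) for mz in mzs]
    r = [reference.get(mz, 0.0) for mz in mzs]
    dot = sum(a * b for a, b in zip(q, r))
    norm = math.sqrt(sum(a * a for a in q)) * math.sqrt(sum(b * b for b in r))
    return 0.0 if norm == 0 else 1000.0 * dot / norm

# Identical spectra score 1000; above ~700 is typically "good enough".
spec = {73: 100.0, 144: 60.0, 218: 20.0}  # made-up fragment pattern
print(round(match_factor(spec, spec)))  # 1000
```

Note the score is scale-invariant — doubling every intensity in the query leaves the match factor unchanged, which is why it measures pattern similarity rather than concentration.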
Those are alkanes — C7, C8, C9, and so on — that span a range of elution times and a range of sizes. That's your calibration standard. Also, before you start a GC-MS run, you should always run a blank, just to make sure that whatever is coming off the column isn't going to throw you off — and usually something does, because there's always stuff that sticks. That identifies your background noise. Then, using the same conditions as the blank, you run your sample. Here's the external alkane sample: you can see some contaminants that may be part of the alkane mixture, or material that comes off the column as it ages, but you can see each of the alkanes from octane to hexadecane eluting, ranging from about two minutes to about 10 or 12 minutes on this particular run with this particular column. This then lets you convert all of your retention times to retention indices — RT to RI. Once you have your calibration file, a CAL file, that's what gets used when you run the AMDIS software: you analyze your subsequent data using that calibration file so you can calculate the retention indices. Once you've done that, you can start searching the NIST database, using those match factors to see what matches — both in terms of match factor value and sometimes in terms of retention index. You usually also want to make sure you're not picking up blanks or noise, so you check that none of the peaks you've identified also appear in a blank. So: here's your calibration run with its set of eight or ten standards; you convert it to a CAL file; and once it's a CAL file, you upload it into AMDIS so that it adjusts your actual GC-MS run — let's say this is urine — so that everything is reduced to the correct retention index, if you want.
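The RT-to-RI conversion itself is just linear interpolation between the two bracketing alkanes. A minimal sketch, assuming the temperature-programmed (van den Dool) form of the retention index; the calibration retention times below are made up:

```python
def retention_index(rt, alkane_rts):
    """Convert a retention time to a linear retention index.

    alkane_rts: dict mapping alkane carbon number -> retention time
    (minutes), taken from the external alkane calibration run.
    By convention each n-alkane defines RI = 100 * carbon number,
    and samples are interpolated between the bracketing alkanes.
    """
    carbons = sorted(alkane_rts)
    for lo, hi in zip(carbons, carbons[1:]):
        t_lo, t_hi = alkane_rts[lo], alkane_rts[hi]
        if t_lo <= rt <= t_hi:
            return 100 * (lo + (rt - t_lo) / (t_hi - t_lo) * (hi - lo))
    raise ValueError("retention time outside the calibrated alkane range")

# Hypothetical calibration: C8 elutes at 2.0 min, C16 at 12.0 min.
cal = {8: 2.0, 9: 3.0, 10: 4.2, 12: 6.8, 16: 12.0}
print(retention_index(3.6, cal))  # halfway between C9 and C10 -> 950.0
```

This is why the alkane run matters: retention times drift with the column and the program, but the interpolated index stays comparable across runs and across labs.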
At this stage you can start looking at individual peaks and seeing what's in the mass spectrum under each of those peaks. This is when you start the database search. I'm not sure how visible it is, but we've highlighted a peak in red, and that peak is marked in white. Under that peak there are three other peaks — a red peak, a yellow peak, and a blue peak — marked with their parent ion values: one at 172, one at 173, and I can't make out what the yellow one is. This is an example of what I was saying: in GC-MS there's often what you think is a single peak, but it's not a single compound — it's potentially several. And this is what the AMDIS software has been able to deconvolute, by looking not only at the chromatogram but also at the mass spectra it's collected. Here we've zoomed in a little further on the same peak that was barely visible before, and we can see the white, red, blue, and yellow peaks. Now we want to see whether we can match this peak — this particular spectrum with its two most abundant values; we'll call it the peak spectrum — against the reference spectra. In this case you can do an automatic search using the dot product, and it returns the highest match. The match factor was 840, or 84%, and you can see that the spectra look essentially identical. So in this case we've been able to match this peak to valine. We don't have a lab with AMDIS, in part because it costs money, and buying it for 22 people would break the bank. But there are other tools — AMDIS is not the only software, nor is NIST the only database. Recently — well, not that recently — people evaluated the performance of AMDIS and compared it to tools called AnalyzerPro and ChromaTOF, and interestingly, AMDIS didn't do the best. I can't remember which did.
I think AnalyzerPro may have done the best. Likewise, people are sometimes frustrated with the NIST database because so many of its compounds aren't ones we typically find in biological systems, so a few other databases have been created: the Golm Metabolome Database; the FiehnLib database from Oliver Fiehn's lab, which is sold by LECO and Agilent; and HMDB, which at different times in its life has had reference GC-MS spectra. The Golm database is in Germany, prepared by the Max Planck Institute, and it focuses specifically on plant metabolites — they've collected about 1,400 — but many plant metabolites are also mammalian and microbial, so it has some general use. They provide not only the mass spec data, for typical quadrupole and TOF GC-MS instruments, but also the retention index, and the retention index is really, really important. Because they've collected on different platforms and different instrument configurations, they have about eight spectra for each metabolite. Their libraries are actually compatible with AMDIS and NIST, so you can add them into the NIST system and analyze things there, and the database supports web searches as well. This is a screenshot of the database: you can look at it, search through it, view the mass spectra, and get lists of peaks and the other information they've collected. So it's very extensive and very well maintained. FiehnLib — Oliver Fiehn's group — has collected lots of GC-MS data for many, many samples. They have about a thousand compounds, all below the standard mass cutoff for GC-MS, covering lots of metabolites, and they've done it for both quadrupoles and TOFs, so there are lots of spectra in there. It's primarily a commercial database, so if you buy an Agilent GC-MS you can typically get it bundled. Now, there were a few questions about GC-MS and how the process is done.
I didn't want to go into a lot of detail, because in some respects it's similar to what you just did in your NMR exercise — obviously the software interface is different, but conceptually there's a lot of similarity. No, what it's looking at is the full fragment spectrum: you'll see the major peak and then the fragment peaks. In the valine case, there was a peak at 144 — I'm not sure whether that's the silylated form or not; I think that's just the parent ion — and then there's a fragment that comes out at 73 Daltons. It picked those out as the two main peaks of the spectrum and matched them, along with a couple of other tiny, less intense fragments, to the reference valine spectrum already in the database. There's a lot going on in the background of the AMDIS software that I don't think anyone really knows in detail. There's been a paper describing it, so everyone knows roughly how its dot product function works — and a lot of people use that same concept even for LC-MS and other things — but how it actually pulls out the peaks, and decides which ones belong together, is a somewhat mysterious process inside their software. What they strongly recommend is the alkane mixture, because that calibrates everything; it's a standard thing, and standardization is really critical for consistency in any field, whether it's genomics, proteomics, or metabolomics. GC-MS has been framed by standardization for several decades, and that's good: GC-MS was used routinely in clinical chemistry and analytical chemistry even in the 1970s, and it was critical to get that standardized so that every lab doing clinical chemistry would get the same results on the same samples. They pushed very hard on this in the analytical and clinical chemistry community, so metabolomics newcomers shouldn't try to change a good thing. We should really stick with those standards and
follow them as closely as we possibly can, because all of the databases and all of the retention indices were structured on those standards. Ideally, yes — if you're going to quantify, you want the actual standards: you run the standards and run calibration curves to get your quantitation under the same, identical conditions. You'd run your quantitation standards perhaps shortly after your retention time calibration standards, but typically you can use the same standards for several months — you don't have to rerun them every single day; the values should stay consistent if the instrument is being well maintained. What's that? If the column is stable, yes — and eventually, after a few months, you might have to rerun your quantitation. Other questions? Okay, so we're going to do LC-MS, and in this case, as you'll notice, the picture looks almost identical to the GC-MS one — in fact it is identical, because fundamentally the process really is the same. Obviously the databases may be different, and in many cases it's not just two or three compounds — it could be half a dozen metabolites or more under a single peak, or dozens — but conceptually it's still the same thing: you have a chromatogram (an LC chromatogram rather than a GC one), under each peak there are multiple compounds, and you do matching. For metabolite identification, this is something the Metabolomics Society and the Metabolomics Standards Initiative have been advocating: there are different levels of identification. The lowest level is the unknown compounds — these are just peaks: I don't know what it is; it has a retention time and this kind of spectrum, but I don't know what it is. You can go up another level, where you might say, based on its molecular weight, that this compound is so large and so hydrophobic it has to be some kind of lipid,
so you can identify things by their compound class — lipid, sugar, amino acid, something like that. A third level is putatively identified compounds, and this third level is where most of us are: it's the case where we've matched the mass spectrum — and maybe the retention index and tandem mass spectrum, or the EI-MS spectrum — to something in a database. Most of the matches you get, even in GC-MS with AMDIS and NIST, are still at this putatively identified level. The highest level, positively identified compounds, is when you can match to a known standard — you've actually run the standard, you've spiked in the standard. That's the case for MS; NMR is a little different, because in NMR you have not only peak positions but also peak intensities and peak patterns, so there's enough information in an NMR spectrum, in most cases, to qualify at the highest level of identification. But mass spec — because it's often just a single mass, and because the reproducibility of intensities is highly variable, the reproducibility of patterns is highly variable, and even the reproducibility of retention times is highly variable — mostly falls at this level 3. Now, LC-MS, unlike GC-MS, is really good for lipids, fatty acids, and generally hydrophobic molecules. You can pick up amino acids and organic acids, but because most people run reversed-phase columns, it often has a bias towards more non-polar molecules than GC-MS or NMR. Ideally, to be confident of your spectral identification, you want both a mass match and a tandem MS/MS match. Many people simply stop at the mass match and say 'I'm done', and that level, I would say, is often somewhere between here and here — it's not very reliable. Often, when people go back with more powerful targeted methods to check whether those things were really there, more often than not they weren't, and this is, I think, where
we'll get metabolomics into a lot of trouble: very lightweight identification. Simply matching by mass is not the way to do metabolomics. Ideally, you also want standards if you can get them — and you don't have to have the exact compounds; you can actually match by classes of compounds using selected reaction monitoring or multiple reaction monitoring, a technique that's well used, well understood, and works very, very well. The compound ID approach most people still use — and, as I say, I frown on it, and I think more and more people are frowning on it — is the pure, single mass match; more and more, people demand evidence from tandem MS matching and from authentic standards. At the simplest level, if you just have a molecular weight — a parent ion mass, say — you can go to a variety of databases. I think most of you have heard of these: certainly everyone's probably heard of PubChem; there's ChEBI, there's ChemSpider, there's also HMDB. Each of these databases has a different role, and it's important to understand them. The largest database is PubChem, with ChemSpider a close second — 35 million and 28 million compounds. But remember what I told you: 99.99% of the compounds in there have never left the laboratory, and therefore they cannot be in any metabolome, ever. So unless you know which compounds are metabolites and which ones aren't, you're going to get lots of false leads. It's really unfortunate that people feel 'the larger the database, the better off I am' — the result is a lot of garbage being published, because people are getting seriously spurious hits against these collections. Another option is ChEBI — Chemical Entities of Biological Interest. This database only has about 35,000 compounds, but these are biological compounds. ChEBI covers plants, animals, microbes, and mammals, so it's pretty diverse, and it includes drugs, toxins, and poisons — but it isn't organism specific. So if you get a hit on, say, trehalose,
which is a plant sugar it shouldn't be in a say a bacterium if you get a hit on sphingomyelin that shouldn't be in yeast because they don't have nerves however that doesn't distinguish this so ideally what you want to have is some association where metabolite to an organism so HMDB is an example of an organism specific metabolite database so there's about 40,000 compounds all of which have been confirmed to be in humans now it doesn't cover every compound that has ever been in humans but right now it's our best estimate so yeah there are a few compounds in PubChem that are in humans that are not in HMDB but it's getting to be fewer and fewer on the other hand there's actually several thousand perhaps close to 10,000 compounds that are in HMDB that are not in PubChem so you can do molecular weight searches through PubChem under advanced search you can type in a molecular weight range and it will give you a list of compounds so if we typed in a range in this case you've got 473 hits with this molecular weight range from 891.1 to 891.5 can't read it .5 so here's a whole bunch of hits the compounds these have matching molecular weights so that's a simple mass search Kebby you can do the same thing you can draw a structure but you also can do mass searches you can search through that and get hits so lots of resources have mass matching but that's really, really weak it's not what you should ideally do you would ideally want to do more advanced searches where you're listing not only perhaps a single mass but maybe the tandem mass spectra or the EIMS spectra and so in that regard there's the NIST database we've talked about and some of the other ones there's the Metlin database which is widely used by people in the mass spec community HMDB and mass bank are other examples that also have archival MS data or MSMS data so you can put in at least these databases allowing you to search by molecular weight which is not so good molecular weight ranges not so good but you can do 
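A molecular-weight range search of the kind PubChem runs is easy to picture in code. This is a minimal sketch, assuming a toy list of compounds with monoisotopic masses; a real database indexes millions of entries and this is not PubChem's actual implementation:

```python
# Toy compound list: (name, neutral monoisotopic mass in Da).
# Illustrative values only, not a real database.
compounds = [
    ("glucose", 180.0634),
    ("trehalose", 342.1162),
    ("sphingomyelin (d18:1/16:0)", 702.5676),
    ("cholesterol", 386.3549),
]

def mass_range_search(db, lo, hi):
    """Return all (name, mass) entries whose mass falls inside [lo, hi]."""
    return [(name, m) for name, m in db if lo <= m <= hi]

hits = mass_range_search(compounds, 340.0, 390.0)
# Both trehalose and cholesterol fall in this window, which is exactly
# why a bare mass-range search is such weak evidence on its own.
```

Note how even a 50 Da window pulls in chemically unrelated compounds, which is the lecture's point about mass-only matching.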
But you can do more refined searches, where you specify neutral, positive, or negative ion mode, or submit tandem MS data and see what matches based on the dot-product score these databases use. These searches are more powerful; unfortunately, the databases don't have large collections of MS/MS data. Here's the MS compound search in HMDB. You can type in a list of masses, which may include adducts and doubly charged species: just the list of masses you have, anywhere from tens to hundreds to thousands, and it will produce a list of possible or probable hits, along with their adducts, adduct masses, and a measure of the quality of each hit, drawn from the roughly 40,000 compounds known in humans. So if you know you're working on a human or mammalian system, that's probably pretty good. The effect of all the adducts, though, is that there are maybe 400,000 to half a million predicted masses in the database; it generates around 30 different adduct types. It also integrates data from other databases like FooDB, T3DB, and DrugBank. It's potentially useful, though not something I would use a lot myself for mixture deconvolution when you've got multiple compounds. HMDB also has about 1,000 experimental tandem MS spectra, collected at three different collision energies on a triple quadrupole. We've compared these to ion trap instruments and the fragment patterns are about the same, so they seem generally valid. So you can also compare your tandem MS spectrum to this collection: if you've got a spectrum and want to confirm a match, you can do that sort of search as well. METLIN has very similar protocols and methods to HMDB: you can put in masses, select positive, negative, or neutral ion, choose which types of adducts you want, and search for metabolites, and it produces lists. METLIN is nominally mammalian.
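The dot-product comparison these databases run for tandem MS matching can be sketched as a cosine similarity over matched peaks. This is a simplified illustration (naive greedy peak pairing, no intensity weighting), not any database's actual scoring code:

```python
import math

def cosine_score(spec_a, spec_b, tol=0.01):
    """Cosine (dot-product) similarity between two peak lists.

    Each spectrum is a list of (m/z, intensity) pairs. Peaks are matched
    greedily within an m/z tolerance; unmatched peaks contribute nothing
    to the numerator but still count in the normalization.
    """
    score, used = 0.0, set()
    for mz_a, int_a in spec_a:
        for j, (mz_b, int_b) in enumerate(spec_b):
            if j not in used and abs(mz_a - mz_b) <= tol:
                score += int_a * int_b
                used.add(j)
                break
    norm_a = math.sqrt(sum(i * i for _, i in spec_a))
    norm_b = math.sqrt(sum(i * i for _, i in spec_b))
    return score / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical query spectrum vs. a close library match:
query = [(85.03, 100.0), (127.04, 40.0)]
ref = [(85.03, 95.0), (127.04, 45.0), (180.06, 5.0)]
similarity = cosine_score(query, ref)  # close to, but below, 1.0
```

Identical spectra score 1.0 and disjoint spectra score 0.0, which is why a high dot product against a curated library is far stronger evidence than a mass match.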
But it covers lots of other things, plants, microbes, and so on, so it's really more a general mass spec database of metabolites. You can also search tandem MS spectra, and rather than typing in a list of peaks you can upload XML formats (mzXML, mzData) and it will read them and run the search, which saves you some time typing in or selecting peaks.

Now, on identification of metabolites by LC-ESI-MS: the very nature of electrospray means you typically produce lots of salt adducts, you also get what are called neutral losses, and in some cases multiply charged species. These generate lots of extra peaks. I think I had written down 50%, but these days it's typically about 80% of the peaks in an LC-ESI-MS spectrum that are these extras. You can call them noise, because they're not the parent ion you're really looking for. So your challenge is to recognize those adducts and multiply charged species and group them together, so you can isolate the parent ion, which is the one carrying most of the information. Here's an ESI spectrum: here's the parent ion, then some of its isotopomers, and here's an ion that's a sodium adduct, essentially 22 Daltons higher, with its own isotopomers. The adduct peak is more intense, so it's the one you might be tempted to use, but it is not the compound; it's the sodium adduct, and the other one is the parent ion. Sodium adducts are very abundant; depending on the number of charges you can have something with two sodium adducts or a single one. It depends on the chemical nature of the molecule, on your solvent, and on how you've prepared your biofluid, or what your biofluid typically is. So these adducts, which we've mentioned before, some of them are listed here: sodium, doubly charged species, losses of water or formate, potassium, doubly protonated species. All of these create extra peaks. They aren't derivatives like we were talking about in GC-MS, but they are extra peaks that can confuse compound identification.

Oliver Fiehn created a very large adduct table, posted on his website, but it probably represents only a fraction of the known adducts; just last week I saw a much larger table with many, many more adducts that have been seen in real systems. Some depend on the solvent you're using, but others form naturally and spontaneously because you're dealing with a biological matrix. There's a nice database called MZedDB, has anyone heard of this? It's maintained at Aberystwyth, in Wales, and it does some nice adduct calculations: you can select certain types of adducts based on your solution conditions. It also does a lot more, and it has a lot of plant metabolites, which makes it somewhat unique in the metabolomics world. You can type in a compound formula, in this case glucose, and it lists all possible adducts and their masses for that compound. So if you were expecting glucose in your spectrum, maybe you'll see it, maybe you won't, but these are all the possible peaks it could generate, in addition to all of the isotopomers. Another thing that happens in mass spec, especially ESI, is neutral losses. This is fragmentation that happens spontaneously during ionization, possibly from collisions with gases in the system; you also get neutral losses in electron ionization mass spec. These are common fragmentation patterns, and they're also useful in tandem mass spectrometry for identifying compounds. So a number of databases let you predict adducts, a number can handle and predict ion pairs and multiply charged species, and some can handle the neutral-loss species. If you only search by mass ranges, you get lots of false positives.
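An MZedDB-style adduct calculation comes down to simple arithmetic on the neutral mass. Here's a sketch using standard monoisotopic shift values for a handful of common ESI adducts; the real tables cover dozens more, and this selection is purely illustrative:

```python
PROTON = 1.007276  # mass of H+ in Da

# Common ESI adducts: name -> (mass shift in Da, charge).
# Standard monoisotopic values; a small illustrative subset.
ADDUCTS = {
    "[M+H]+": (PROTON, 1),
    "[M+Na]+": (22.989218, 1),
    "[M+K]+": (38.963158, 1),
    "[M+2H]2+": (2 * PROTON, 2),
    "[M-H]-": (-PROTON, 1),
    "[M+HCOO]-": (44.998201, 1),  # formate adduct
}

def adduct_mz(neutral_mass):
    """m/z each adduct of a neutral monoisotopic mass would appear at."""
    return {name: round((neutral_mass + shift) / z, 4)
            for name, (shift, z) in ADDUCTS.items()}

glucose = adduct_mz(180.0634)  # neutral monoisotopic mass of glucose
```

The [M+Na]+ peak lands about 22 Da above [M+H]+, exactly the sodium-adduct spacing described in the spectrum above.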
The more you can refine the peak list, and the more spectral information you can use, the better and more reliable your hits are. Now, these slides are not in your notes; they were added last night or the day before, and they're about how we can resolve these complications, so if you want to take notes, pull out your pens. There is software for handling LC-MS spectra, usually sold by the vendor, and there are also packages you can download for free. The idea is to eliminate these extra peaks, either by removing them or by consolidating them. Consolidation means saying "these 21 peaks all belong to adducts of glucose, so I'll merge them into glucose"; removal means "these 21 peaks are adducts of glucose, I'll delete them, because I only want the parent ion." Same with the multiply charged species: merge them into a single compound peak, or delete them. The fragments, things arising from neutral losses or spontaneous breakdown during sample prep, you also want to identify and remove. The more sensitive instruments give you many, many isotopomer peaks, those little peaks trailing off at higher masses; you can remove them or consolidate them into the parent monoisotopic mass peak. And then there's noise: noise from the column, noise from the instrument. The way people typically remove noise in LC-MS these days is with technical replicates, where a peak has to show up in 2 out of 3 or 3 out of 4 samples, otherwise it's considered noise. Better still is a dilution series, which you can combine with technical replicates: if a peak shows up in 2 out of 3 or 3 out of 4 replicates, and you also see it diluting at the same rate you diluted the sample, that confirms it's a real peak.
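The replicate-and-dilution filtering just described can be sketched as two simple predicates. The thresholds here (2-of-3 presence, 30% relative tolerance on the dilution ratio) are illustrative choices, not standard values:

```python
def present_in_replicates(intensities, min_hits=2, floor=0.0):
    """True if the feature appears (intensity above floor) in enough replicates."""
    return sum(1 for i in intensities if i > floor) >= min_hits

def tracks_dilution(full, diluted, factor, tolerance=0.3):
    """True if diluted/full is within `tolerance` of the expected 1/factor."""
    if full <= 0:
        return False
    expected = 1.0 / factor
    return abs(diluted / full - expected) <= tolerance * expected

# A feature seen in 3/3 replicates whose 1:2 dilution roughly halves it:
feature_real = present_in_replicates([1050, 980, 1100]) and tracks_dilution(1000.0, 520.0, 2)

# A sporadic feature (1/3 replicates) that also ignores dilution -> noise:
feature_noise = present_in_replicates([30, 0, 0]) and tracks_dilution(1000.0, 990.0, 2)
```

Real pipelines apply these tests per feature across the whole aligned peak table; the logic per feature is no more complicated than this.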
So there's a lot of cleanup you have to do with LC-MS spectra, a lot. Just to give you a sense of scale: typically people will say "I've just collected my spectrum on my FTMS and I've got 15,000 features," and the immediate message, for many people even at last week's conference, is "I've got 15,000 compounds." No. What you have to do is start removing all the adduct peaks, the sodium and the formate and everything else, and that knocks it down from 15,000 to 12,000. Remove the multiply charged species: down from 12,000 to 10,000. Remove the neutral losses and fragment peaks: typically down from 10,000 to 8,000. Remove the isotope peaks, and with a sensitive system that's a lot of peaks, at least 2 and sometimes 3 per large peak: down from 8,000 to 3,000. Then remove the noise, whatever was in the blank, didn't match your dilution series, or didn't reproduce across 3 or 4 technical replicates. The result is that you may have started with 15,000, but more typically you're down to about 2,500 features, and even then a lot of those are probably not real compounds. So for positive mode alone you might end up with about 2,500 real peaks; we still call them features because we can't necessarily identify them. Then when you run negative mode, there are issues with sensitivity and how the ions fly, and you generally get about 50% or 60% of what you see in positive mode. So people may claim 15,000 features in positive mode and 12,000 in negative mode, 27,000, how wonderful; in the end, after cleanup, they'll typically be down to around 4,000. There are tools that help with this process, not only those sold by vendors but also some freeware. It would be nice to work through it here, but teaching you how to analyze something with 15,000 features would easily kill the two days, so we're not going to do it.

Once you've got everything cleaned up and most of your parent ions isolated, you get to do some of the fun stuff: the identification. High mass accuracy, which you've seen before, allows you to identify compounds, particularly with Orbitrap and FTMS instruments. One thing you can do quite routinely with sufficiently high-accuracy MS data is generate the molecular formula. It doesn't tell you the compound, but it certainly narrows things down; that's what we might call a level-2 classification, the compound class. You need an accurate mass and an error limit you can measure from your instrument. There are commercial packages where you type in an accurate mass and a mass error and they return the set of viable formulas: they don't identify the compound, but they identify a probable class. HighChem has its own molecular formula generator (I think some of their tools are starting to move toward freeware, but it's still mostly commercial): type in your accurate mass and mass tolerance and it gives you a list of formulas that fit. But there's freeware too. MZedDB, which I mentioned before, also does this, and it uses Oliver Fiehn's paper, the Seven Golden Rules. You can choose your elemental composition: in most cases, metabolites are just carbon, hydrogen, oxygen, nitrogen, and sulfur, maybe a little phosphorus, and rarely fluorine, bromine, chlorine, or potassium. So you can eliminate a lot of those, and if you use just those five elements, it greatly reduces the possible molecular formulas. Once you have a molecular formula, you can go to the other databases: you could go to PubChem, type in your formula, and start getting hits, but remember, most of those hits are synthetic molecules that have never been seen outside the lab. ChEBI also supports formula searches, and since those are biological compounds, its hits are generally more relevant. HMDB does formula searches as well.
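Formula generation from an accurate mass is, at heart, a constrained search. Here's a brute-force sketch over C/H/N/O/S, the five elements mentioned above, with invented count limits and a 5 ppm tolerance; real generators prune the search far more cleverly:

```python
from itertools import product

# Monoisotopic atomic masses in Da (standard values).
MASS = {"C": 12.0, "H": 1.007825, "N": 14.003074, "O": 15.994915, "S": 31.972071}

def formulas_for(target, ppm=5.0, max_counts=(20, 40, 5, 10, 2)):
    """Enumerate CHNOS formulas within a ppm window of a neutral mass.

    max_counts gives the (C, H, N, O, S) upper bounds; these are
    illustrative, not chemically derived. Zero counts are kept in the
    formula string for simplicity (e.g. "C6H12N0O6S0").
    """
    tol = target * ppm / 1e6
    hits = []
    for c, h, n, o, s in product(*(range(m + 1) for m in max_counts)):
        mass = (c * MASS["C"] + h * MASS["H"] + n * MASS["N"]
                + o * MASS["O"] + s * MASS["S"])
        if abs(mass - target) <= tol:
            hits.append(f"C{c}H{h}N{n}O{o}S{s}")
    return hits

# Neutral monoisotopic mass of glucose, C6H12O6:
candidates = formulas_for(180.06339)
```

Even this naive search shows the lecture's point: restricting the element set to CHNOS already collapses the candidate list dramatically compared to allowing halogens and metals.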
Now, if you use additional information, particularly the isotopic abundances, that's the data we originally threw out because it isn't the parent ion, but if you bring that isotopomer abundance information back in, it helps a lot. You might recall how we looked at the profile for chlorobenzene, how you saw those six different isotopomers with different intensities and positions. That's useful information: you could find a molecule with exactly the same molecular weight as chlorobenzene, but you would never find one with the same isotopic abundance pattern, because of the unique abundance of chlorine-37. There are other constraints you can use too, not only isotopic abundance but what are called chemical bonding restrictions. You can always write down a formula, say C6H5, but is that a viable chemical? Using these rules, no, it isn't. There are also compositional restrictions: limiting to carbon, hydrogen, oxygen, nitrogen, and so on restricts both the formulas and the chemicals that can be generated. This is embodied in Oliver Fiehn's paper from a few years back, the Seven Golden Rules, and it's implemented in MZedDB; it's also implemented on Fiehn's website, where they have a macro written in Excel that does the same sort of thing. You can download the Excel macro, but it's probably easier to go through the MZedDB website. This is just an example of the effect: with no restrictions, simply enumerating molecular formulas, you could get an enormous number of compositions, in this case 8 million, but the Seven Golden Rules give you a 12- or 13-fold reduction. If you restrict to databases of known chemicals it's even smaller, and if you restrict to natural products it's tiny. So this is a great way of restricting the formula space.
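Two of the kinds of restrictions just mentioned, element-ratio bounds in the spirit of the Seven Golden Rules and a bonding sanity check, can be sketched like this. The ratio windows below are approximate versions of the published ranges, and the bonding test is a simplified ring-plus-double-bond (RDBE) check, so treat this as an illustration rather than a faithful implementation of the paper:

```python
def plausible(c, h, n, o, s):
    """Crude plausibility filter for a neutral CHNOS formula.

    Element-ratio windows approximate the Seven Golden Rules ranges;
    formulas without carbon are out of scope here by construction.
    """
    if c == 0:
        return False
    ratios_ok = (0.2 <= h / c <= 3.1 and n / c <= 1.3
                 and o / c <= 1.2 and s / c <= 0.8)
    # Simplified valence check: for an even-electron neutral CHNOS
    # molecule, 2C + 2 + N - H must be even and non-negative so the
    # ring-plus-double-bond count is a non-negative integer.
    rdbe_numerator = 2 * c + 2 + n - h
    rdbe_ok = rdbe_numerator % 2 == 0 and rdbe_numerator >= 0
    return ratios_ok and rdbe_ok
```

Glucose (C6H12O6) passes, while the lecture's C6H5 example fails the RDBE integrality test, as it should: C6H5 is a radical, not a stable neutral molecule.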
As you go up in molecular weight, you obviously have more and more viable formulas; the curve climbs, and by molecular weights of 700 or 800 there are potentially so many formulas you could generate that it almost gets ridiculous. A number of groups have run the calculations: if you increase your mass accuracy, or the precision, you substantially restrict the number of possible formulas, and if you then use isotopic abundance, it gets even smaller. These are mass accuracies you can get even with TOF instruments; that other level is what you'd get with an FTMS and six million dollars. But with a half-million-dollar TOF instrument, using the isotopic abundance and reasonably good mass accuracy, you've nailed it: you've identified the molecular formula, and that's not even restricting to PubChem, that's against all possible chemical formulas. Here's an example where people actually used this to help identify a compound: they figured out the parent-ion mass, looked at the isotopic abundance on a higher-resolution (though not really high-resolution) instrument, and with the high-accuracy mass they could identify the compound uniquely. Another challenge in mass spectrometry is identifying isomers. Lots of compounds have identical molecular weights and molecular formulas but fundamentally different structures. If you only used the parent-ion mass, you could not distinguish them; but if you use retention time, or software that generates mass fragments, you can see that the fragment patterns distinguish those isomers. The parent ion alone starts failing here, which is why you tend to need other pieces of information: retention time or retention index, tandem mass spectrometry, or these isomer generators, which will generate possible structures.
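The chlorine example is easy to make concrete: roughly 24% of natural chlorine is 37Cl, so each chlorine atom produces a distinctive M+2 peak, and the full pattern for n chlorines is binomial. A small sketch (abundances rounded; real pattern calculators also fold in 13C, 34S, and the other isotopes):

```python
from math import comb

# Approximate natural abundances of 35Cl and 37Cl.
P35, P37 = 0.758, 0.242

def cl_pattern(n_cl):
    """Relative intensities of the M, M+2, ..., M+2n peaks due to chlorine,
    normalized so the tallest peak is 1.0 (binomial distribution over the
    number of 37Cl atoms incorporated)."""
    probs = [comb(n_cl, k) * P35 ** (n_cl - k) * P37 ** k
             for k in range(n_cl + 1)]
    base = max(probs)
    return [round(p / base, 3) for p in probs]

mono = cl_pattern(1)  # chlorobenzene-like: strong M plus ~32% M+2
```

No chlorine-free formula of the same nominal mass can reproduce that ~1 : 0.32 M/M+2 signature, which is exactly why isotopic abundance nails down compositions that mass alone cannot.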
Those let you ask: is this reasonable, would I expect these kinds of fragments, is this something that should be there? We don't have a good idea of how many chemical isomers are possible, but it could be very, very large. So this is another challenge, and it's unique to mass spectrometry, because we can't distinguish isomers as easily as you can with NMR. A fourth issue is that many of the large databases, particularly PubChem and NIST, contain non-metabolites, and others mix plant metabolites with human metabolites, or microbial with plant metabolites, or drugs, and you don't find drugs in most of these organisms; as far as I know, bacteria don't take drugs. These things can really, seriously mess you up. If anyone's looked at the EcoCyc database, they might be aware that about half the compounds listed in it are not in E. coli. As I say, this leads to many of what I call silly hits, and a lot of those still get published, where people simply used a mass match, took the first PubChem hit, and said "this is what I got." So if you know something about the source organism, use that information. If you know it's a plant, a microbe, or a human, use the specific databases, and there are, I'll tell you about them later, databases that are very specific to drugs, to food, or to plants, like MZedDB and NAPS.

There are other approaches people are starting to use, and one that's becoming increasingly popular in mass spectrometry is called chemoselective labeling; it's been used for a long time in proteomics. Is there anyone here who uses selective labeling in their metabolomic studies? That can certainly be used, yes; it helps when things are isotopically labeled and you know what's been labeled, whether you're using carbon or nitrogen or deuterium. There are also kits, which I've mentioned before; Biocrates is one group that produces kits for quantitation. And then there are techniques called computer-aided structure elucidation, or CASE, methods. Here's an example of chemoselective labeling using carbon-13-labeled dansyl chloride: this is heavy dansyl chloride, and this is light dansyl chloride. Dansyl chloride reacts with a variety of groups on compounds, primarily amines and hydroxyls, so you've now stuck a carbon label onto the substance; you've also stuck a big dansyl group onto your metabolite, which has a couple of benefits. By doing heavy and light labeling, carbon-13 versus carbon-12, you can look for paired peaks, one a mass unit greater, with identical elution times. This is the dansyl chloride, and this is an example of a dansylated compound with its carbon-12 and carbon-13 labels. The technique was developed by Liang Li at the University of Alberta, but several other groups in the US and in Europe are doing similar things with different labeling agents. It lets you do the labeling externally, so you don't have to feed anything; in the case of humans, you can't feed them 13C material, so this is a way of labeling post hoc. But it also lets you quantify, which is rarely done in mass spectrometry the way you're able to here, because of the carbon labeling; it's another example, as I say, of the importance of quantitation. Here are some examples where quantitation was done in urine, measuring from as low as 30 nanomolar up to 2.5 millimolar by mass spectrometry: they spiked in a bunch of standards to identify about 100 compounds and quantify them. That's actually still the world record in mass spectrometry for simultaneous compound identification and quantification. The world record for identification alone is somewhere around 350 compounds.
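The heavy/light paired-peak search at the heart of these labeling methods amounts to scanning for co-eluting features separated by the label's mass shift. A sketch, assuming a single-13C tag (shift 1.00336 Da); reagents carrying more labels just change the shift, and the tolerances and feature values here are invented for illustration:

```python
C13_SHIFT = 1.00336  # mass difference of one 13C vs 12C, in Da

def find_label_pairs(features, shift=C13_SHIFT, mz_tol=0.005, rt_tol=0.05):
    """Find (light, heavy) index pairs among (m/z, retention time) features.

    A pair qualifies when the two features co-elute within rt_tol and
    their m/z difference matches the label shift within mz_tol.
    """
    pairs = []
    for i, (mz_i, rt_i) in enumerate(features):
        for j, (mz_j, rt_j) in enumerate(features):
            if (abs(rt_i - rt_j) <= rt_tol
                    and abs((mz_j - mz_i) - shift) <= mz_tol):
                pairs.append((i, j))
    return pairs

# Two co-eluting features ~1.003 Da apart form a light/heavy pair;
# the third feature elutes elsewhere and is left alone.
feats = [(416.151, 5.20), (417.154, 5.21), (300.100, 7.80)]
pairs = find_label_pairs(feats)
```

Because only genuinely labeled metabolites show up as such pairs, this same scan doubles as the noise filter mentioned later: unpaired features can be discarded.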
But none of those were quantified. Now, when you derivatize compounds with this dansyl chloride method, you convert non-UV-active compounds into UV- and fluorescently-detectable ones, so you can actually quantify them independently of the mass spec. Another thing about derivatizing with these tags, and it doesn't have to be dansyl chloride, people around North America and Europe are using various ones, is that it improves the ionization efficiency, so you get better detection limits. Likewise, the purification is much simpler: the molecules are generally much more hydrophobic, so you get much better separations. You get quantification, which as I say is important, and because of all the other bonuses, you increase the number of compounds you routinely detect. It also lets you identify the real peaks, so you don't have to spend so much time denoising your spectra, and potentially it could lead to automation of LC-MS.

Another approach is the Biocrates kit. This is a commercial kit that works on a specific instrument, an AB SCIEX QTRAP (the QTRAP 4000 or QTRAP 5500). It can identify about 160 compounds under ideal circumstances in blood or urine, and there's a modified version that can detect up to 180. Arguably it still holds the record for combined quantitation and identification, but the lipids it identifies are not pure compounds, they're compound classes, so it still falls short of what Dr. Li's group accomplished. What the Biocrates kit does is make use of something that's been used for many, many years: multiple reaction monitoring, or single reaction monitoring. These are ways of fragmenting a selected ion and quantifying it from the fragments that come off that molecule; by measuring the intensity of the fragment peaks and comparing them to an isotopic standard, you can precisely quantify how much is there. It's an old technique, but you can largely automate it using modern mass spec software. If you run the Biocrates kits, this is an example of the readout: you get compounds, some amino acids, some carnitines, some phosphatidylcholine groups (those aren't uniquely identified), and sphingomyelins, with concentrations ranging from 10 nanomolar to 7 millimolar. So again, an example of quantitation by mass spec.

We've focused on matching so far, but there's also the question of how you identify an unknown unknown: the things that don't match our libraries, that aren't in the databases, that aren't in the mass spectral libraries. How do you do that? This is what computer-aided structure elucidation is about: dealing with new, novel, unknown compounds. There's a top-down approach and a bottom-up approach, and to do either of these, you typically have to combine NMR with mass spec. You can get a long way with mass spec alone, but not quite all the way. So what are these top-down and bottom-up approaches? The top-down approach is to take your known metabolites, 20,000 endogenous, 40,000 human, 100,000 plant metabolites, whatever, and predict what kind of chemistry they might undergo. One way to do that is biotransformation prediction: there's software that will virtually run molecules through your liver.
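At the mass level, this "virtual liver" idea reduces to applying known transformation mass shifts to each starting metabolite. A hedged sketch using a few standard phase I/phase II monoisotopic deltas; real predictors like the ones described also apply structural rules about where a transformation can occur, and p-cresol here is just an illustrative substrate:

```python
# Common biotransformation mass shifts (monoisotopic, in Da).
# Standard deltas; the selection is a small illustrative subset.
PHASE_SHIFTS = {
    "hydroxylation (phase I, +O)": 15.994915,
    "glucuronidation (phase II, +C6H8O6)": 176.032088,
    "sulfation (phase II, +SO3)": 79.956815,
    "acetylation (phase II, +C2H2O)": 42.010565,
}

def predict_products(name, mass):
    """Predicted masses of simple transformation products of one metabolite."""
    return {f"{name} + {rxn}": round(mass + delta, 4)
            for rxn, delta in PHASE_SHIFTS.items()}

# e.g. p-cresol (C7H8O, neutral monoisotopic mass 108.0575):
products = predict_products("p-cresol", 108.0575)
```

Run over a library of 20,000 known metabolites with a fuller shift table, this is how the predicted-compound collections of 200,000 to 300,000 entries mentioned next are generated, and an observed accurate mass can then be matched against them.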
It predicts the phase I and phase II metabolites, along with a variety of other transformations that microbes perform on metabolites. So from your starting set of 20,000 compounds, you might climb to 200,000 or 300,000 predicted compounds. They don't exist in PubChem; they're predicted compounds. You can then predict their spectra, the NMR, the MS/MS, the GC-MS, and once you have this collection of predicted spectra, you can compare your observed spectrum and ask: does this match? If there's a good match, it's a hypothesis that you need to validate, but at least it gives you something to work from. And if you validate it and the answer is yes, that's a case where you've identified a novel compound. Several groups have actually succeeded in doing this recently, particularly with food compounds, polyphenols. There's another database, developed by Li and our group, called MyCompoundID, and it does this sort of virtual generation of phase I and phase II metabolites, transformed molecules, so you can search against those compounds. Many endogenous compounds in our bodies also go through those phase I and phase II transformations. This is just an example of the website; some of the "adducts" it generates are actually transformations, so it handles both adducts and glucuronides and other variations.

Then there's the bottom-up method for computer-aided structure elucidation. The idea here is to assemble structures more or less from scratch, knowing the general features of most metabolites and the common functional groups, indoles, imidazoles, cholesterol ring systems, things like that, and from there to start assembling virtual compounds; it's synthetic chemistry on the computer. You can generate a whole list of viable metabolites that seem to follow some general rules about metabolism, and then do exactly the same thing: predict the NMR, the GC-MS and LC-MS spectra, and the tandem MS spectra from this virtual set of compounds, and see if anything matches. So that's bottom-up, trying to assemble chemistry, whereas top-down takes the known chemistry and metabolizes it virtually, using a synthetic liver or kidney. These are approaches to dealing with unknown compounds in a computational way. Of course, the other approach is just good old analytical chemistry, where you look at the NMR and think about it, do some mass spec and think about it, draw out some structures and think about it; generally, the amount of time it takes to truly identify a novel compound that way is about three to four years per compound. So that's it for our session.