And what we're actually going to be starting now is the second part of identification and annotation. So this is again the idea of going from spectra to lists. We talked a lot about NMR, but I'm going to talk a bit more about the other techniques, like GC-MS and LC-MS. The point about metabolite annotation is that the spectrum could be any type of spectrum, but we want to produce a two-column list: metabolite identifier and a concentration or relative concentration. That, formally, is called annotation. We did one exercise where we deconvoluted by NMR spectroscopy, and I think you now have a better appreciation of what we were doing and what exactly this picture means. I know a couple of people were asking whether there are freeware alternatives we could use. In some cases there are, in some cases there aren't. AMIX is a commercial product, and so is the Chenomx one. I'll tell you about AutoFit, a project we've been working on for a while. It's a software tool that Jeff wrote, actually, and it's free; we're going to talk about how you could use it. Then there's something that's maintained, or was maintained, at RIKEN. Lots of things have gone on at RIKEN recently, so I'm not sure if they're still active. Madison also maintained something, and Europe has been working on something as well. What you saw, or did, was a manual fitting process. Even though everyone was working on the identical spectrum and trying to do identical curve fits, if we'd looked at all of your data and compiled what you saved or downloaded, I think we'd see some variability. Some people may have misidentified compounds, some people may not have fit things, and some people probably didn't do the baseline correction as well as they could have or should have. Again, training and experience are among the things you try to teach people.
But in principle, if you wanted consistency, you should try to make this fully automatic. That's something we've been trying to do for the last few years. The concept was this: take your spectrum, upload your library, but automatically locate the DSS, automatically calibrate the pH, and automatically figure out the height and intensity of the DSS, which you had to drag up and down by hand. Then take the information about where those peaks are and do what's called a chi-squared fit, minimizing (observed minus expected) squared. From that, the program adjusts intensities, positions, all of those variables you were adjusting by hand, to produce a fitted spectrum. You didn't have 60 minutes, but if you had, you would have been able to generate a fit, perhaps like this one. You can see, well, maybe this is a baseline issue, but the red is matching the black pretty well. It takes a while, and it takes a lot of skill. With the AutoFit program we developed, same spectrum, this is the fit here. If you look closely, it's pretty much identical; in fact, the superposition and overlay are all there. So instead of 60 minutes and lots of heartache and headache, this was one minute, done on a computer. It wouldn't matter where you were or how much skill you had: it would generate the same result. That gives you consistency. Now, it could be consistently wrong, but it's consistent, and that's important whenever you're doing any kind of analytical technique. At least having that, and being able to make that kind of inspection and say, yeah, that's darn good, done: that's usually sufficient. If you can automate, it's faster, which gives you high throughput. If you can get the precision and recall up around 95%, great. If you can do this overnight while you're asleep instead of being awake, that's great.
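The chi-squared fit described above can be sketched in code. This is a minimal, hypothetical illustration, not the actual AutoFit algorithm: the observed spectrum is modelled as a weighted sum of library component line shapes, and the weights (the concentrations) are solved by minimizing the sum of (observed minus expected) squared. The two Lorentzian "library" signatures and their peak positions are invented for the example.

```python
# Minimal sketch of a chi-squared (least-squares) spectral fit: model the
# observed spectrum as c0*componentA + c1*componentB and solve for the
# weights that minimize sum((observed - expected)^2).

def lorentzian(x, center, width=0.005):
    return width**2 / ((x - center)**2 + width**2)

def component_a(x):  # hypothetical singlet at 2.0 ppm
    return lorentzian(x, 2.0)

def component_b(x):  # hypothetical singlet at 3.2 ppm
    return lorentzian(x, 3.2)

grid = [i * 0.001 for i in range(1500, 3500)]  # 1.5 to 3.5 ppm axis
A = [[component_a(x), component_b(x)] for x in grid]

# Simulated "observed" spectrum: 3 parts A plus 1 part B
observed = [3.0 * a + 1.0 * b for a, b in A]

# Solve the 2x2 normal equations (A^T A) c = A^T y for the weights c
ata = [[sum(r[i] * r[j] for r in A) for j in range(2)] for i in range(2)]
aty = [sum(r[i] * y for r, y in zip(A, observed)) for i in range(2)]
det = ata[0][0] * ata[1][1] - ata[0][1] * ata[1][0]
c0 = (aty[0] * ata[1][1] - ata[0][1] * aty[1]) / det
c1 = (ata[0][0] * aty[1] - ata[1][0] * aty[0]) / det
print(round(c0, 3), round(c1, 3))  # recovers the weights 3.0 and 1.0
```

A real fitter also varies peak positions, widths, and baseline, which makes the problem nonlinear, but the fitted concentrations ultimately come out of exactly this kind of minimization.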
But this is the most important part: it gives you consistency and reproducibility. You don't have user bias, you don't have user errors. And this is the fundamental problem, not only in metabolomics but also in proteomics and in transcriptomics. The only field where it's really not an issue is base calling in sequencing, where they've had to go through the same process of curve fitting and matching, and they've got those programs sorted out; there's a bunch of them now. Because it's computerized, it can also occasionally detect signals that you, as a human, can't easily pick up. These are some examples where we created synthetic cerebrospinal fluid and synthetic urine. In this case, that's assuming you've got a really good spectrometer and someone shimming it and running it very well. When we run those, since we know exactly the composition, we get perfect fits. Now, with real CSF, in some cases we don't know the exact composition, because there are some other things in there that no one knows. And the spectrometer has to be tuned, right? There's noise; you saw the noise in the peaks. So we don't get perfect correlations, but we get 96% with an r-squared of 0.98. Here's synthetic urine, very high. Here we're looking at serum. This was published last year, in a paper we produced. So it's possible. We're tweaking it some more, and next year, if you were to take this course, we would probably use AutoFit. Of course, it wouldn't be a half-hour or hour lab anymore; it would probably be about a one-minute lab. But that's the point. This is an example for NMR. Jeff is working on something like this for GC-MS, and I think the same thing needs to be done for LC-MS. So that's one alternative, where we're trying to fit one-dimensional NMR spectra. In NMR spectroscopy you can also do two-dimensional NMR, and it would take a little while to explain it.
But just trust me, it's an approach that's standardly used now; it's been around for 30 years or more to spread out peaks in two dimensions. You get spectra with names like TOCSY, NOESY, COSY, and ROESY, and these are commonly collected. In fact, reference versions of these spectra have been collected: we've collected them, Madison, Wisconsin has collected them, and Jeff has compiled them into a library. So he has his spectral library, and he's put it into a program called MetaboMiner. This is free. If you can collect a two-dimensional spectrum of your extract, biosolid, whatever, his program will semi-automatically identify the compounds. Now, it won't quantify, but it will identify the compounds in there. The library Jeff has put together has 225 TOCSY spectra and almost 500 nitrogen-proton or carbon-proton HSQC spectra. As I said, the trick for this to work was to know what is in cerebrospinal fluid, plasma, and urine. I think he also did a cell extract; I'd have to check, but I think there were four or five different specific biofluids. Anyway, what Jeff did was come up with fingerprints, minimal signature peaks. He looks at peak lists, and he also has these combined. So you can do this process: you can do the compound identification, and you can do this annotation step. This was published, and we checked different things. We looked at serum, we looked at defined cocktails, we looked at using TOCSYs, we looked at HSQCs. We also played around with the pH. Almost everyone now settles at a pH of about 7.2, but when we were doing this, people wanted to see what would happen if you changed the pH. Anyway, the performance isn't as good as AutoFit's, but it's good enough, and it's automatic or semi-automatic. Recall and precision are in the high 80s and low 90s, so it's quite accurate.
Given that it's free and that it doesn't take a lot of time, I think it's an example of the things people are trying to do to make spectral annotation faster, easier, and more reliable. It might be time, because it's been a while, for Jeff to revisit this: we now know a lot more about what should be in some of these biofluids, because we've done much more careful assays. TOCSY stands for total correlation spectroscopy; it's a method of collecting proton-proton 2D NMR, and it looks at the correlations between coupled protons in a compound. HSQC is heteronuclear single quantum coherence spectroscopy, and that measures carbon and proton chemical shifts: carbon in one dimension and proton shifts in the other. Another approach is using reference spectral databases. You'll learn a little this afternoon about the HMDB, the Human Metabolome Database; the website is here. It has spectral reference libraries, about 1,000 collected spectra, and you can type in peak lists. So if you've got your list of peaks and their positions, you can just type them in, and it will search through its libraries and identify probable hits. You don't have to do the peak fitting and curve fitting that you did; you just need to list your peaks. Now, it's not going to quantify things, and it's not going to tell you what fit where, so in some cases this only gives you a suggestion. The advantage of this over Chenomx is that the library here is 1,000 compounds; the Chenomx library is 450. RIKEN in Japan produced this web server; I haven't checked on it recently. Up to about a year and a half ago, RIKEN had this huge NMR facility, the world's largest, and then suddenly Japan just shut it down. It didn't have anything to do with the reactor.
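The peak-list search just described can be sketched very simply. This is a hedged illustration of the idea, not the HMDB's actual scoring scheme: compare a query list of 1H chemical shifts against each reference compound's peak list, and rank candidates by the fraction of reference peaks matched within a tolerance. The three reference entries and their shifts are illustrative values, not taken from the HMDB.

```python
# Toy peak-list lookup: rank reference compounds by how many of their
# reference peaks (in ppm) have a query peak within a tolerance window.

LIBRARY = {
    "lactate": [1.32, 4.10],   # illustrative shifts, not HMDB values
    "alanine": [1.47, 3.77],
    "acetate": [1.91],
}

def match_score(query, reference, tol=0.03):
    """Fraction of reference peaks with a query peak within tol ppm."""
    hits = sum(any(abs(q - r) <= tol for q in query) for r in reference)
    return hits / len(reference)

query_peaks = [1.33, 4.11, 1.92]
ranked = sorted(LIBRARY, key=lambda name: -match_score(query_peaks, LIBRARY[name]))
for name in ranked:
    print(name, match_score(query_peaks, LIBRARY[name]))
```

Note that this only suggests identities, exactly as said above: with no intensities and no fitting, two compounds sharing a peak region can both score well, which is why it stops short of quantification.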
They just said, you guys are taking up too much money and too much time, and they literally shut it down. So I don't know if they still maintain this server, but they published things, and they were an amazing facility, doing much more, much better than anyone else. I think everyone's still kind of reeling over why they were shut down. KEGG is still operational in the sense that it's available for academics, but if you want to download it, I think there's a fee. What's that? Yeah, OK, so you still have to register; I think part of it is free for academics. Batch downloads they don't allow at all, or you still have to register to do the batch download. Most people are just doing individual files, so they wouldn't notice much of a difference with KEGG. Here is Wisconsin. Madison, Wisconsin has been running a metabolomics group for a number of years, and they've produced rNMR, which is open-source software for NMR data analysis. They also have a peak server, very similar to the one I talked about for the HMDB: you can type in some chemical shifts and it will identify a compound that corresponds to those shifts. CCPN is a UK/European Union effort to do metabolomics and other things, and they've produced a site. They're kind of a mysterious organization: they serve primarily Europe, they just don't want to have anything to do with North America, and they don't publish very much, so you either have to know about it or be in the circle to find out what's going on. This is what they had; I haven't checked it recently, but they were kind of a Johnny-come-lately. They came into metabolomics about three years after everyone else was well advanced. So that's a summary of the freeware options you can use for NMR-based metabolomics.
Personally, I'm most excited about the AutoFit concept, because I've seen it work a number of times on different problems, and it's moved along very nicely. I think it's a concept that will apply, and will be used, in many other types of analytical work, including things like GC-MS. So we're going to shift over to GC-MS and LC-MS and discuss compound annotation there. Now, you'll notice that we don't have enough time in the two days to talk about the non-targeted approach to metabolomics, which is: collect lots of LC-MS spectra, normalize, calibrate, deal with, you know, XCMS formatting, and deal with adducts. That could easily occupy two full days, and when we were putting this course together we realized we just weren't going to have the time or the flexibility. So if you were looking for that, my apologies, but we cannot do it in the time we have. Instead we're looking at this one approach, one that is increasingly being adopted by many people and one that I'd encourage you to try to follow. This is the idea: here is a chromatogram. We've seen a few pictures of GC-MS chromatograms. What you may not know, or may not remember, is that under any individual peak there are, in many cases, multiple compounds, nicely hidden, and they sum to look almost like one peak. Within one of those peaks you will have different mass spectra; these are the electron impact, or EI-MS, spectra, and they are fingerprints. The blue peak has this blue spectrum, the red peak has this red spectrum, the turquoise peak has this turquoise spectrum. What you want is a spectral library. Chenomx had 450, the HMDB has 1,000, and there are spectral libraries of about 10,000 compounds. What you then do is see if you can match these spectral fingerprints with anything in the library. You can see with your own eyes that this one matches that one pretty well.
This one matches this one pretty well, and this one matches this one. Now, this library would have names corresponding to these compounds, so by doing that match, you've identified your compound, and by doing some integration, you've been able to semi-quantify your compound as well. Yes? [Audience question.] Yeah. Now, this one is obviously a little more challenging; it's more schematic than reality. Typically there would be some differentiation. Essentially what you're trying to do is identify your mass peaks from here, and see that some of them become more intense and some less intense as you scan across the peak. So the deconvolution is done more at the mass level than at the chromatographic peak level. If you can see three well-defined peaks, you can pull them apart. This is schematic, and it illustrates just about the worst possible case you could imagine. [Audience:] But in that case, wouldn't your fragment ions also overlap, since you're fragmenting everything at the same time? That's right, yeah, they would. So this is where you'd essentially be scanning over time, hopefully seeing something more distinct here, something more distinct here, something more distinct there, and then trying to deconvolve it a little bit. That deconvolution is what's buried in some of these software programs, like AMDIS. So again, we're talking about electron impact, at least with GC-MS, and electron impact fragments molecules into smaller pieces. Some of them are very bizarre molecules that only exist transiently. This is an important thing to remember, because mass spec does create bizarre molecules, and you don't want to be fooled by them. There are many examples of people being fooled by unusual peaks. So this is the fingerprint for methanol.
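The deconvolution idea just described, that co-eluting ions are separated at the mass level rather than the peak level, can be sketched very roughly. This is not the actual AMDIS algorithm, which models peak shapes and noise; it's a simplified toy showing the core intuition: ions whose extracted-ion traces reach their maximum at the same scan probably belong to the same hidden component. The ion traces below are made up for illustration.

```python
# Toy deconvolution: group m/z values whose extracted-ion traces peak at the
# same scan; co-maximizing ions likely belong to one co-eluting component.

traces = {            # m/z -> intensity at each of 5 scans (invented data)
    73:  [1, 8, 3, 1, 0],   # apex at scan 1
    147: [0, 9, 2, 0, 0],   # apex at scan 1 -> same component as m/z 73
    59:  [0, 1, 2, 7, 1],   # apex at scan 3 -> a second hidden component
}

def group_by_apex(traces):
    components = {}
    for mz, trace in traces.items():
        apex = max(range(len(trace)), key=lambda i: trace[i])
        components.setdefault(apex, []).append(mz)
    return components

print(group_by_apex(traces))  # {1: [73, 147], 3: [59]}
```

Real deconvolution software additionally correlates full peak shapes, not just apex positions, which is what lets it separate components whose maxima nearly coincide.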
Because the standard conditions used in EI-MS are the same, it doesn't matter which instrument, which lab, or which country: you're going to get essentially the same spectrum. So here's another MS spectrum. You may see the fragment ions here, and here's the parent ion. Typically, the resolution is only one mass unit, so these aren't the highest-resolution instruments. You can go to GC-TOF and get very high resolution, and that actually is generally much more helpful. Now, when you're looking at a GC spectrum, you're not going to see the pure compound, because remember, we derivatize the compounds, with TMS, or TBDMS, or methoxyamine. So some of these compounds will have silyl groups: the hydroxyl and amine groups will carry trimethylsilyl groups. Depending on the number of groups, you're going to have mass increments of 72 for each TMS, 114 for each TBDMS, and 29 for methoximation. So these spectra aren't going to correspond exactly to the compound in urine or blood or whatever; they're going to correspond to the derivatives. And sometimes you're going to see more than one derivative for one compound, which is another complicating factor with GC-MS. Given the way we derivatize things for GC-MS, and given the nature of the mass spectrometers, people have found they're really good at working with amino acids, organic acids, some sugars, fatty acids, and anything with a molecular weight less than 500. Big molecules, more than 500, are almost invisible, rarely seen by GC-MS. That's a limitation: we're not looking at nucleosides, steroids aren't too easily seen, and a bunch of other things we would like to see just can't be seen by GC-MS. On the other hand, GC as a chromatography method has much higher resolution, a higher plate count, and much better reproducibility than LC. Also, because EI is standardized around the world, EI spectra are standardized and comparable.
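The increment arithmetic above is worth making concrete, because the mass you hunt for in the library is the derivative's mass, not the free metabolite's. This is a back-of-the-envelope sketch using the nominal increments quoted in the lecture (72 per TMS, 114 per TBDMS, 29 per methoximation); valine, with nominal mass 117 and two TMS-reactive sites, is an assumed example, and real libraries work with exact rather than integer masses.

```python
# Nominal mass of a derivatized compound = base mass + sum of the
# derivatization increments quoted above (integer masses only).

INCREMENTS = {"TMS": 72, "TBDMS": 114, "MeOX": 29}

def derivative_mass(base_mass, groups):
    """groups: e.g. {"TMS": 2} for a compound picking up two TMS groups."""
    return base_mass + sum(INCREMENTS[g] * n for g, n in groups.items())

# Valine (nominal mass 117) silylated on the amine and the carboxyl:
print(derivative_mass(117, {"TMS": 2}))  # 261
```

This is also why one metabolite can show up as several chromatographic peaks: partial derivatization gives a family of masses, one per combination of attached groups.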
So the same library can be used by anyone around the world. The route that people have adopted, more so in North America, is to use a program called AMDIS, which does the deconvolution, together with the database produced by the National Institute of Standards and Technology, NIST. The NIST 11 database has 240,000 EI spectra from 212,000 compounds. They also have QTOF, triple-quad, and ion trap spectra, which are things people may not be aware of. That's a huge resource. Then there are the retention index values. There are lots of RI values, but not for 200,000 compounds, just for about 20,000, so about one-tenth, which is disappointing, because these are really, really useful, really important for confirming a compound. But they're always updating; the database is constantly being updated and improved. You can look through that database and search out compounds; they'll have information on where each came from, characteristic masses, and the RI value, or Kovats index. Here's a sample of what the EI spectrum looks like. So this is the database, and if you want, you can search through it. They sell the database, yes. What's that? It's $4,000 now for the most recent one. What if you don't buy it? That's right, then you don't get access. I mean, it costs a lot to maintain and put together. I know people are usually used to freeware, but this database represents 30 years of effort by thousands of scientists, so $4,000 is not a lot to pay for the amount of work that's gone into it. Now, the database is nice, but it's kind of useless unless you have a way of integrating it with your GC-MS spectra. The way you integrate it is through the AMDIS software: Automated Mass Spectral Deconvolution and Identification System, that's what AMDIS stands for.
And it does some of the things you just saw with, say, Chenomx. There's a noise analysis, which is part of the spectral processing we talked about. There's peak detection, which they call component perception. There's deconvolution, which is what you were doing as you were matching your spectra: it generates a model spectrum for each component. And then there's the compound identification process. So these are qualitatively the same sorts of things we were doing with the Chenomx one. Now, the trick for AMDIS, and for any semi-automatic deconvolution software, is how you match your fingerprints. The FBI has a fingerprint matching service, the RCMP has a fingerprint matching service, and AMDIS has a fingerprint matching service. The matching is done through a match factor, which measures the similarity of your spectra to the ones in its database. This is the formula for the match factor: it essentially compares the intensities of your query to the intensities in the database, weighted by the masses, the m/z values. It's a factor that ranges up to 1,000. A perfect match is 1,000, an OK match is maybe 700 or 800, and a lousy match is 500. Again, the cutoffs are your choice, so it's not absolutely automatic, but over the years people have identified good cutoffs. So that's how you match spectra: the match factor. How do you determine your retention indices? When you're running a GC-MS experiment, you have to begin either the day or the set of runs with a set of standards, your alkane standards. Usually there are eight, nine, or ten of them; they usually start with octane and go up to hexadecane or heptadecane. These are your calibration standards to help determine your retention times. So you might run this external standard, and then you will typically also run a blank.
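The match factor described above can be sketched as a weighted dot-product (cosine) similarity scaled to 0-1000. This is a hedged approximation of the NIST-style calculation, not the exact formula: the real NIST weighting uses slightly different mass and intensity exponents, and the example spectra below are invented, not real library entries.

```python
# Approximate NIST-style match factor: cosine similarity between query and
# library spectra, each peak weighted by m/z times sqrt(intensity),
# scaled so a perfect match scores 1000.
import math

def weighted(spectrum):
    # spectrum: {mz: intensity}; weight each peak by mz * sqrt(intensity)
    return {mz: mz * math.sqrt(i) for mz, i in spectrum.items()}

def match_factor(query, library):
    q, l = weighted(query), weighted(library)
    dot = sum(q.get(m, 0.0) * l.get(m, 0.0) for m in set(q) | set(l))
    norm = math.sqrt(sum(v * v for v in q.values())) * \
           math.sqrt(sum(v * v for v in l.values()))
    return round(1000 * dot / norm)

library_entry = {72: 100, 55: 30, 29: 20}   # hypothetical reference spectrum
query_good    = {72: 95, 55: 35, 29: 18}    # near-identical query
query_bad     = {91: 100, 65: 40}           # unrelated query, no shared peaks
print(match_factor(query_good, library_entry))  # close to 1000
print(match_factor(query_bad, library_entry))   # 0
```

With this shape of formula, the cutoffs mentioned in the lecture make sense: identical spectra hit 1000, genuinely related spectra land in the 700-900s, and spectra sharing no fragments score near zero.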
Or you just run your solvent, to make sure that whatever's coming off the gas or the column is detectable and removable. Then, finally, after you've run this calibration standard, you can run your sample. GC-MS and LC-MS have a lot of issues with quality control and quality assurance, because, A, they're so sensitive, and B, they can pick things up, and you want to make sure you run things identically. You need to start and calibrate the runs each morning. Sometimes people put calibrants and quality checks in every 10 samples, which is a good idea. NMR is different: it doesn't have to deal with the same quality control issues, partly because the instrumentation is much more stable than GC-MS and LC-MS, and partly because it's fully quantitative. But it's also less sensitive, and that's another reason you can get away with some of that with NMR. So here's your calibration standard, which you run at the beginning of the day, typically, and here are the seven or eight standards. You can see where they migrate, and this allows you to convert your retention times to retention indices, or Kovats indices. That's your calibration file; it gives you your retention indices. Then you take your sample file together with that calibration file to get your RIs. Once you've got your RIs adjusted, you can start searching your NIST database for matches and displaying results. Then you can sometimes get rid of false positives by looking at your blank spectrum, which, as I say, will pull up material that was stuck to the column, stuff that might be in the gas lines, whatever: false positives. So here's your calibration file, and since we don't have the software running here, just trust me that once you've pulled this up, you can calibrate; you now know what your retention indices are. This is done in the AMDIS software.
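The retention-time-to-retention-index conversion from the alkane calibration run can be sketched as a linear interpolation between bracketing n-alkanes, which is the usual convention for temperature-programmed GC (van den Dool and Kratz). The alkane retention times below are invented for illustration, not real calibration data.

```python
# Convert a retention time to a retention index by linear interpolation
# between the two bracketing n-alkane standards: an alkane with n carbons
# defines RI = 100 * n by convention.

alkanes = {8: 2.10, 9: 3.45, 10: 4.90, 11: 6.40, 12: 7.95}  # carbons -> min

def retention_index(rt):
    carbons = sorted(alkanes)
    for lo, hi in zip(carbons, carbons[1:]):
        t_lo, t_hi = alkanes[lo], alkanes[hi]
        if t_lo <= rt <= t_hi:
            return 100 * (lo + (rt - t_lo) / (t_hi - t_lo))
    raise ValueError("retention time outside calibrated alkane range")

# A peak eluting halfway between C10 (4.90 min) and C11 (6.40 min):
print(round(retention_index(5.65)))  # 1050
```

This is why the alkane ladder has to be rerun each day: the alkane retention times drift with the column and the instrument, but the indices computed against that day's ladder stay comparable.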
So that was your reference file. Now you can take the actual spectrum of your cell extract, and AMDIS allows you to calibrate those retention times to retention indices. So you calibrate it. That was like the chemical shift referencing; this is the same thing for GC-MS. Now you go to your library of spectra. It's kind of hidden, but let's choose a peak; we'll choose this one, which has a retention time of 11.597. So we've highlighted this peak here, and what the AMDIS software has done is deconvolve it into characteristic peaks: there are essentially three here, I think, a red peak, a yellow peak, and a blue peak. The red peak has a mass of 73, the yellow peak has a mass of 59, and the blue peak has a mass of 172. So these are three mass peaks, or mass spectra, that can be deconvolved from a single GC-MS peak. Now you can click along all of the peaks in this chromatogram, where some automatic peak picking has been done and it's identified maybe 75 or 100 different peaks. You can just click through here, and each of these will show you a partially deconvolved spectrum, and you can identify what seem to be the molecular ion peaks for these things and some of the partial spectra seen here. So we can zoom in. This one should have been the 11.597 peak, but it's shifted now. Anyway, here are our peaks, and from here we can see that 73 and 144, oh, this is a new one now, OK, 73 and 144 are actually the most abundant ones. What we do now with this particular spectrum is search it against the spectral library, and it runs through the 212,000 compounds; in addition to 73 and 144 we're seeing 133, 114, and 159, and we get a match factor of about 840 to this library spectrum. That spectrum, as it turns out, is valine. It was collected on a pure valine sample back in 1983 or whenever, and that's what was stored.
And you can actually compare, and your notes are on here, and you can see the match is amazingly good: every peak matches, the intensities all match, it's brilliant. So the match factor is correspondingly very high, 840 out of 1,000. The rule of thumb for match factors is that if you can get above about 600, the compound is probably real. Again, we don't have time to go through AMDIS, and we don't have the money to buy copies of AMDIS and NIST for everyone here. But NIST and AMDIS are widely used, and there are other programs and tools available: AnalyzerPro from SpectralWorks, or ChromaTOF from LECO. People did a comparison between the three, and I think AnalyzerPro may actually have come out on top. In addition to the NIST databases (NIST 08 was from 2008, NIST 11 from 2011), there's the Golm database. Has anyone heard of that one? This is one that was built mostly on plants, but it's open access, it's free, and it's quite extensive. Oliver Fiehn has produced a library, and that's now sold through LECO and Agilent. And then I'm putting in the HMDB; we have had GC-MS spectra in the HMDB at different times, but they keep disappearing, so I don't know. Anyway, the Golm database has both GC-quad and GC-TOF spectra; it has the mass spectra, and it also has the retention indices. It's not hundreds of thousands of compounds, but these are metabolites. The NIST database, I should mention, wasn't designed for metabolomics. In fact, a lot of it was done for environmental chemicals and environmental sensing, pesticides, and a lot of other stuff. So a lot of the hits people naively get in NIST are matches to something that just isn't there. Maybe they got a match factor of 600-something, but it's some exotic man-made chemical that's only found in that lab and nowhere else. So having a reference library that is targeted to metabolites is important, and that's what the Golm database does.
They know their biochemistry: they found the compounds, or synthesized them, and these are things you will find in living organisms. For the 1,400 metabolites, they also have lots of spectra from the TOF. They're compatible with the NIST and AMDIS software, so you can upload them directly into your NIST library. Golm is mostly focused on plants, but metabolically, humans and plants aren't that different, so it's still useful, and the same goes for microbes. And it's compatible with the other kinds of software. So this is the Golm database, if you want to take a look at it after the lecture during the break. They always change their look a little bit, but it's there, and you can just Google Golm and it'll generally take you there. You can explore some of their spectra: they produce plots of the mass spectrum, fragments and intensities, with descriptions of how each was collected and which platform was used. So it's very accessible. The Fiehn lab, Oliver Fiehn's, has about 2,000 EI-MS spectra, and they also have RI data, for both quadrupole and TOF instruments, very much like the Golm compounds, again keeping below 500 Da, with lots of different types of compounds. This is a commercial database, so again you have to pay money to get it. So that's a quick survey of how you can do metabolite identification by GC-MS. I didn't get into quantification; that's another layer on top of it. There are internal calibration curves, you have to have authentic standards, sometimes people use isotopic labels: it's a lot of work. But that is one of the things people will try to do. So, LC-MS: it's basically the same idea. This could be a GC chromatogram or it could be an LC chromatogram. Under each of these peaks you will find a set of spectra, and in this case these are MS/MS spectra, which correspond to a fingerprint. Just like we did with GC-MS, we try to match the MS/MS spectra to a library of known MS/MS spectra.
Now, the problem with LC-MS is that there are about 100 different platforms and instruments, so there's no standard way of producing reference spectra. GC-MS has just one way of doing it. With LC-MS, there are ion traps and triple quads and Orbitraps, different ionization energies, different ionization gases, different concentrations, different configurations. So these spectra will differ from model to model, instrument to instrument, platform to platform, lab to lab. If you want to create a library, the library usually ends up being local to your lab, which is a real limitation. Now, in mass spectrometry, and even in NMR and in HPLC, there are different levels of annotation or identification: arguably five, but we'll just deal with four. The best, highest level, the one you want to attain, is a compound that is positively and firmly identified. How do you know if you've achieved that level? Well, in mass spectrometry, the only way is to have the standard: that compound has to have been spiked in to prove that it really is there. Simply saying I have a nice mass match and a nice retention time match is not sufficient. It's the equivalent of saying I have one peak in the NMR spectrum and it matches in intensity and position. Is it the compound? No, we don't have enough evidence. If you have two peaks in the NMR spectrum, that's pretty good; if you have 20 peaks, you're essentially 100% sure. So you need to spike in. In NMR, if we see only a single peak, we spike in. In mass spec, you typically only have a single peak, so you have to confirm with a known standard. That's a positive identification, the top level. The next level down is a putatively identified compound, and this is the one most people attain: they've matched to a mass spectrum and they've matched to a retention time. So that could be the AMDIS match.
It could be something done with the local software you get with your instrument. It could also be a mass, or an MS/MS spectrum, plus a retention time. And the matches in MS are not intensity matches; recall that we were matching intensities in the NMR spectrum. You can't match to intensity in mass spectra, so you've lost a piece of information. So even if you can get MS/MS spectra, and even if you can get an accurate retention time or retention index, it's still only putative. But so many people think this is the end goal, and it's not. This has been pointed out many times in papers written by the societies and by the mavens of metabolomics: most people are quite happy to sit here, when you really should be up there. Yes? Technically it would still be putative. If you can do MRM, where you actually have the standard and you can quantify, then you've done everything: you've quantified and confirmed, and you're up to here. Now, people have debated whether NMR matching sits here or here, and the consensus is that it's generally here, because you're in fact matching horizontal and vertical positions, and you're usually matching many pieces of information together. With MRMs you can still potentially get overlapping transitions, and obviously there's the isomer problem as well, which can sometimes give similar results for what is actually a different compound. NMR can almost always distinguish isomers; the only exception is maybe an R versus an S version, and that's not much of a difference. Below that, some people can identify compound classes. Sometimes this is done by derivatizing compounds, so it's derivatized thiols, or derivatized hydroxyls, or something like that: it's an alcohol, it's a thiol. So you can get it by compound class; most people don't do that. And then there are the unknowns, which in MS-based metabolomics account for about 90% of the signals.
So LC-MS is best for a wide range, but mostly more non-polar compounds — hydrophobic molecules. And generally, to get a good solid ID you need the MS/MS data along with retention, and to achieve that level four you need an internal standard. Even the most accurate masses are still not quite good enough, but accurate masses are better than inaccurate masses. So what can you do to help identify compounds, to achieve at least a level two or level three identification? Well, there's a variety of sources that have metabolite and mass-searching tools. ChEBI — how many people have heard of ChEBI? Just one. So this is maintained at the EBI, and it's the Chemical Entities of Biological Interest. PubChem — how many of you have heard of PubChem? More of you, okay. So this is the largest collection of compounds on the planet, but only 0.1% are metabolites; a lot of the other things have nothing to do with metabolism. ChemSpider — has anyone heard of ChemSpider? Okay, so the curation in ChemSpider is generally better than in PubChem. And then the HMDB, which is very specific to, in this case, human metabolites, but also includes a lot of microbial metabolites that are found in the mouse or the gut. So it's a much smaller database — only 8,000 compounds, as opposed to 80 million in PubChem, 25,000 in ChEBI, and 25 million in ChemSpider. Most people, oddly enough, tend to feel that it's better to choose a large database like this than a small database like this. That's wrong, because what happens when you're doing mass searches is you get all kinds of false positives that have nothing to do with metabolism. Anyway, as long as you know something about biology and metabolism, you can use PubChem, because I think it's still a very, very useful resource, and you can go to the advanced search and search by molecular weight.
So if you know the parent ion or molecular ion mass, type that in, give it a little bit of a range, and you'll get a whole bunch of hits — 473 hits here — and you can start scanning through to see what might make sense. This is just from a single mass; you don't have a retention index, so you're just guessing at this stage. ChEBI is a smaller database, about 28,000 compounds. These are biological compounds, although some of them are drugs, some are from plants, some are exotic things — it's a little odd what they put in — but you can search by mass here, give it a range, and you can draw structures in. You can do that with PubChem as well. Run the search and you get a hit. So this is if you just have the parent ion mass — that's assuming you just have a quadrupole or a CI instrument. But what if you can do MS/MS? So there are tools like NIST, there's METLIN — how many have heard of METLIN? MassBank — has anyone heard of that? And then the HMDB. Again, this one is not restricted to metabolites, this one is, this one is not, this one is. So you can get lots of false positives here, and you can get false positives here. This is a pretty good set — actually, it's a very good set now that, I believe, Agilent and others have been paying them to develop it. So you can do weights and weight ranges, but you can also do positive ion, negative ion, neutral peak searches, and MS/MS data searches. So this is more powerful. In the HMDB, as an example, you can go to the MS/MS search, type in a list of peaks that you may have seen in your spectrum, and it will go through and — in this case, some of them might be adducts, some might be ion forms, they might be doubly or singly or triply charged — it will deconvolute some of those possibilities and, based on that, give you estimates of what those molecules might be.
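The single-mass search just described boils down to a tolerance-window lookup. Here is a minimal sketch; the tiny candidate list and its monoisotopic masses are illustrative stand-ins for a real database like PubChem or HMDB:

```python
# Sketch of a parent-ion mass search: given an observed neutral mass and a
# tolerance window, return every candidate whose monoisotopic mass falls
# inside the window. The mini "database" below is illustrative only.

CANDIDATES = {
    "glucose":  180.06339,
    "fructose": 180.06339,   # isomer: same mass, different structure
    "citrate":  192.02700,
    "alanine":   89.04768,
}

def mass_search(observed_mass, tol_da=0.01):
    """Return candidate names within +/- tol_da of the observed mass."""
    return sorted(name for name, m in CANDIDATES.items()
                  if abs(m - observed_mass) <= tol_da)

hits = mass_search(180.063, tol_da=0.01)
# Both hexose isomers match -- a single mass cannot distinguish them,
# which is exactly why this is only a putative identification.
```

Note how the two hexoses come back together: mass alone never resolves isomers, which is the "you're just guessing at this stage" point above.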
So from that known set of metabolites, it has generated a whole bunch of variants — adducts, multiply charged ions. And in addition, although I'm not sure if it's still active right now, it was allowing support for DrugBank and, I think, T3DB. And then you can play around with different ionization modes. So it was also designed to deal with mixtures rather than just pure compounds. Yeah — the masses, are those based on predictions or actual reactions? Yeah, it sort of explodes them. I think we could have made it smarter by looking at what the charge and ionization states could be and making some predictions from that. It's hard in some cases; you almost have to look at each compound individually to decide whether it will be positively or negatively ionized, or whether a given adduct is possible or not. So yeah, we simply let it explode and calculate all possibilities. No — in terms of the MS/MS spectra, in T3DB there are about a thousand reference compounds that we've run through a triple quad. But we found that the triple quad spectra, run the way we ran them, actually matched extremely well to ion traps — linear ion traps. So it's sort of like having two for one: if you collect on an ion trap, these spectra will be valid, and if you collect on a triple quad, these will be valid. So you can choose different collection and collision energies — medium, low, high — and a tolerance, and then, in this case, identify a single compound at a time. So that's the HMDB. METLIN supports a mass search. You can enter a mass, specify whether it's negative, positive, or neutral, and select whether you want to deal with different adducts — so it does this explosion of all possible adducts. Then you just press find, and it will search through its collection of metabolites and give you hits that match your original query. So that's the parent ion mass.
There's also the MS/MS search. In this case, you can submit data not just as peak lists but in a variety of different forms — an mzXML file, which is fairly long and extensive — and it will parse through it, check it against its database, calculate a match factor, and determine what's there. So I mentioned this before: with electrospray ionization you do produce these different types of ions, which lead to different types of peaks, which fool a lot of people a lot of the time. These are called adducts — sodium, lithium, potassium adducts. There are things like neutral losses that will occur. You'll also get multiple charging: single, double, triple, quadruple charging. Probably more than 50% — probably around 60 to 75% — of the signals arise from these sources. Also, with higher-resolution instruments, you'll see the isotopomers, which are, again, smaller peaks that show up. So an essential challenge in mass spectrometry for metabolomics is to figure out the real peaks from these not-so-real peaks. So this is just an example of a sodium adduct. Here's the real mass, the real molecular ion; these are the isotopic peaks. So you have to know that this is the peak and not this one. And here's a sodium adduct — you can see that there's an addition of 22 — and these are the different combinations where you'll get sodium pairing with a negatively charged anion. Question: the adducts are fixed jumps, but can they be used quantitatively as well? If you get M plus an adduct in one sample and M plus an adduct in another sample, could you figure out which one they are? Yeah, you could. The problem is you need to be able to figure them out first. If we just gave you this, and I didn't tell you this was there, this could equally be another metabolite entirely.
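The sodium-adduct jump of about 22 Da just mentioned generalizes: every adduct type is a fixed mass shift plus a charge state. A hedged sketch of that arithmetic; the table is a small illustrative subset of the roughly 30 adducts commonly tabulated:

```python
# Adduct m/z arithmetic: from one neutral monoisotopic mass, enumerate the
# m/z values common adducts and charge states would produce.

PROTON = 1.007276  # proton mass in Da

ADDUCTS = {                  # name: (mass shift in Da, charge)
    "[M+H]+":   (PROTON,      1),
    "[M+Na]+":  (22.989218,   1),
    "[M+K]+":   (38.963158,   1),
    "[M+2H]2+": (2 * PROTON,  2),
    "[M-H]-":   (-PROTON,     1),
}

def adduct_mz(neutral_mass):
    """Return {adduct: m/z} for a neutral monoisotopic mass."""
    return {name: round((neutral_mass + shift) / z, 4)
            for name, (shift, z) in ADDUCTS.items()}

mz = adduct_mz(180.06339)   # glucose
# The [M+Na]+ peak sits ~21.98 Da above [M+H]+ -- the "addition of 22"
# jump described for the sodium adduct.
```

This "explosion" of one mass into many candidate m/z values is exactly what tools like the HMDB and METLIN searches do under the hood, just with a much longer adduct table.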
And there's no way you can prove whether it's right or wrong, actually — whether it's an adduct or not. That's an essential challenge in mass spec. Again, if you have the authentic sample or compound, then you can really check that. But if you don't, is that two different compounds, or is it just the adduct? So these are some common adducts. As I say, there are different rules that people have produced, and it depends on the solvents you're running and on the salts that are available or found — sodium, potassium, chlorides — and whether it's positively or negatively charged. Oliver Fiehn produced an adduct table, which is also pretty useful, and roughly there are about 30 adducts that people generally identify between positive and negative ions. Has anyone heard of MZedDB? This is produced by a group in Wales at the University of Aberystwyth. They're a really excellent research group, and they actually pioneered a lot of metabolomics — it's just that no one knows about them. Anyway, they produced this adduct calculator, and they've also produced another resource. Again, you can type in a molecular formula, specify what sort of adducts you want, and it will calculate the various adducts for that compound and their masses. So if we wanted to know what sort of adducts we would get for glucose, we type that in, and this will give us the possibilities. It partly handles the chemical nature of the molecule — it's not just saying everything; it does look a little at the chemical characteristics. But it's a prediction; these aren't adducts that have been validated. So those are adducts. This, by contrast, is a neutral loss fragment — this is fragmentation, where the compound is breaking up. Benzoic acid has a molecular weight of 122, but for benzoic acid we could see 105 and we could see 77. These are neutral loss fragments that can occur. It's one compound, and we can't predict that those fragments are necessarily going to be there.
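The benzoic acid example above (122 giving 105 and 77) is just subtraction of common neutral-loss masses. A small sketch, using nominal integer masses for readability:

```python
# Predicting neutral-loss fragments: subtract common loss masses from the
# parent mass. Nominal (integer) masses are used here for clarity.

NEUTRAL_LOSSES = {"OH": 17, "H2O": 18, "CO": 28, "CO2": 44, "COOH": 45}

def predicted_fragments(parent_mass, losses=NEUTRAL_LOSSES):
    """Return {loss: fragment mass} for each candidate neutral loss."""
    return {name: parent_mass - m for name, m in losses.items()}

benzoic = predicted_fragments(122)
# Loss of OH gives 105 and loss of COOH gives 77 -- the two fragments
# mentioned for benzoic acid. Whether they actually appear depends on the
# compound and the collision energy, which is the point being made above.
```
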
And someone could say, oh, this is an adduct — it's 28, so is that potassium or something? Okay, that's an adduct: how can you prove I'm right or wrong? Or is it a neutral loss: how do you prove I'm right or wrong? This is a challenge. So some tools can predict the adducts, some can predict multiply charged species, some can predict neutral loss species — METLIN is particularly good, and MZedDB is particularly good. But just searching by MS can lead to lots of false positives. Yes — a question about adducts: if you run multiple experiments under the same conditions, do the adducts generally form in roughly the same proportions, so that comparisons are reasonable? Well, yeah — how reproducible can you be? A lot of people go to extraordinary efforts to try and make the MS runs and the MS instrument as reproducible as possible. There's a thing called the matrix effect, and it has a lot to do with how you prepared the sample and how it ran through. As long as there aren't huge perturbations to the sample prep, things are presumably supposed to be relatively reproducible. But it's also known that if you're looking at some disease versus control, something is perturbed — the disease clearly has something that is going to change the matrix. And that matrix effect will also have effects on other ions. So how much? You just don't know, unless you do a quantitative assessment where you substitute the actual compound, do MRMs or isotopic substitution, and be certain of whether the concentration actually changed or whether it's really a matrix effect. The adducts — yeah, they're still flying and they still mostly behave, but sometimes the disease changes the sodium level or the potassium level or some of the other things, so you're going to get a different adduct pattern between control and disease.
So I think mass spectrometry gives you a qualitative insight. And if you want to go that next step, by using MRM and isotopically labeled standards, then you can make it quantitative. And once you've made it quantitative, all those complications disappear. You don't have to worry about them, because you've calibrated and you have a quantity, and the quantity is comparable from instrument to instrument, lab to lab, country to country. And it's a theme I'll come back to over and over again: if you can make and report metabolite levels in terms of micromolar or millimolar or nanomolar, your life is so much easier. All the controls you have to worry about, all the issues — everything is gone, because you have your reference, you've standardized, you've calibrated. You're 100% confident. You're at level four identification — to get level four, you technically have to quantify; you've got an absolute concentration. Don't worry about your adducts, don't worry about your neutral losses: you've identified it because it was there. So, going back a little bit, we saw this picture before and we talked about high mass accuracy. When you get to really, really accurate masses, you are typically using these types of instruments. This is a million dollars; this is half a million dollars. But you can now type in a mass and actually get a molecular formula. Sometimes, if you're lucky, the molecular formula is so unique it gives you the structure. And if you can also include some information about the error and the isotopic mass distribution, you can really narrow things down. So there are tools you can buy. MWTwin allows you to type in a molecular weight to five significant digits, and here are the possible molecular formulas. HighChem is another commercial one: type in a molecular weight and a tolerance, and here are possible molecular formulas. And MZedDB, which I mentioned before — this is not commercial; this is a free web server.
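The "error" fed into tools like these is usually a ppm window, and the arithmetic is worth seeing once. A quick sketch, using glucose's monoisotopic mass as the worked number:

```python
# Mass-accuracy arithmetic: error in ppm = (obs - theo) / theo * 1e6.
# At 5 ppm, the window around a 180 Da ion is under +/- 0.001 Da.

def ppm_error(observed, theoretical):
    return (observed - theoretical) / theoretical * 1e6

def within_ppm(observed, theoretical, ppm=5.0):
    return abs(ppm_error(observed, theoretical)) <= ppm

GLUCOSE = 180.06339  # monoisotopic mass of C6H12O6
ok  = within_ppm(180.0634, GLUCOSE)   # ~0.06 ppm off: inside the window
bad = within_ppm(180.0652, GLUCOSE)   # ~10 ppm off: rejected at 5 ppm
```

The tighter this window, the fewer formulas survive, which is why the very expensive, very accurate instruments pay for themselves in identification.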
And you can type in a molecular weight and a tolerance, you can even play around with the elemental composition, and then you can use a set of rules called the Seven Golden Rules, developed by Oliver Fiehn's group, and this will also give you a molecular formula. You can also use PubChem: once you've found your molecular formula, you can try and find the compound. So once you've got that formula — what's out there? This gives you a search. Same sort of thing with ChEBI: you've got your molecular formula, what's out there? Type it in. So getting the molecular formula is part of the way, but a molecular formula still gives you lots of options — 400 to 600 to 700 different compounds. So that by itself isn't really solving things. You can use additional information, like the isotopic abundance. You can use what we know about bonding restrictions — atoms can only take a certain number of bonds, a certain number of carbons, a certain number of hydrogens, what's allowed. Atomic compositional data: you're not going to find a lot of fluorinated compounds in humans, and you're not going to find a lot of brominated compounds, so you can eliminate a lot of these things. And then there's also the information we have about hypothesized structures. So that's the Seven Golden Rules idea that Oliver Fiehn developed, and it essentially allows you to take all those hundreds of possible structures coming from your formula down to, in some cases, a very specific structure. It's a filtering thing, and the seven rules are given here — some of which may mean something to some chemists and mean nothing to most of us. You can actually download their software — it's an Excel-based spreadsheet — and type things in. What it does is reduce the chemical space hugely: from billions, to millions, to — if you just look at natural products — a space of perhaps 50,000 or less. The size of a molecule also has an impact on the number of possible formulas.
So small molecules have fewer possible formulas; large molecules have more. If you have very accurate masses, you have far fewer formulas. So if you can get to the very best FT-MS instrument, with 0.1 ppm mass accuracy, then — with some isotope abundance data, or even without — you can get down to essentially one molecular formula. Including isotopic abundance, even at fairly low mass accuracy, also reduces the number of molecular formulas you typically see. But once you get up to very large molecules, around 900 Daltons, even the most accurate mass spectrometer can only get you down to 32. The isotopic abundance, though — if your instrument is a lower-quality QTRAP or QTOF, that helps a lot; it's almost like having an FT-MS. So this is just an example with mass data: the parent ion, and then also the isotopic abundance from that parent ion. Feeding that in allows you to identify this particular compound exclusively. Of course, there's the problem with isomers. If you only have the mass spectral data for the parent ion, you wouldn't be able to distinguish these two. The MS/MS data can partly distinguish them — you can see this 71 is not here — so that can help. There are programs that can help generate different isomers. Still, the challenge is the sheer number of possible molecular isomers, and it's scary in terms of the possible size. I brought this up before: a lot of the databases, especially the really, really big ones, mix non-metabolites — buffers, poisons, synthetic chemicals, pesticides, and herbicides — with standard metabolites. And people report hits to these exotic molecules when there's no way they could possibly be in the system. So, as I say, it produces these silly hits. What you want to do is limit the search to organism-specific databases, so at least you aren't going to get those silly hits coming from the mass alone.
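The narrowing effect of mass accuracy on formula counts can be seen in a brute-force sketch: enumerate small CHNO formulas and keep those inside the ppm window. Real tools layer valence and isotope-pattern filters (e.g. the Seven Golden Rules) on top of this; the element ranges below are deliberately small to keep the sketch fast:

```python
# Brute-force molecular-formula search over CHNO: keep every formula whose
# monoisotopic mass lands within a ppm window of the target mass.
from itertools import product

MONO = {"C": 12.0, "H": 1.007825, "N": 14.003074, "O": 15.994915}

def candidate_formulas(target, ppm=5.0, max_atoms=(10, 20, 4, 8)):
    tol = target * ppm / 1e6
    hits = []
    for c, h, n, o in product(*(range(m + 1) for m in max_atoms)):
        mass = c * MONO["C"] + h * MONO["H"] + n * MONO["N"] + o * MONO["O"]
        if c > 0 and abs(mass - target) <= tol:
            parts = [f"{s}{k}" for s, k in
                     (("C", c), ("H", h), ("N", n), ("O", o)) if k]
            hits.append("".join(parts))
    return hits

# Tighter windows leave fewer candidate formulas, as described above:
tight = candidate_formulas(180.06339, ppm=5.0)    # glucose's mass
loose = candidate_formulas(180.06339, ppm=500.0)
```

Note that even a unique formula, C6H12O6, still leaves the isomer problem: glucose, fructose, galactose, and many others all share it.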
So, as I said, a large portion of the MS metabolomics community does follow this approach, and there are lots of techniques and tools available, as we've covered. But there are alternatives to mass filtering and mass matching. One is chemoselective labeling, which is done by several groups; this eliminates a lot of the signals and a lot of the false positives. You can also use kits that are quantitative, or you can use techniques called computer-aided structure elucidation, which combine mass spec, mass spec fragmentation, and sometimes NMR to formally identify compounds. So, chemoselective labeling — we have someone here from Liang Li's group, and there were several groups that pioneered this idea — the idea is to take samples and then use a reactive compound. You can use dansyl chloride, benzyl chloride, or benzoyl chloride, and a variety of other reagents that react specifically with amines, hydroxyl groups, or carboxylates. These things react, and they do two things. First, they react very rapidly, and they label what used to be a hydrophilic molecule and convert it to a hydrophobic molecule; they also attach a fluorescent or UV-sensitive probe. And you can make them carbon-12 or carbon-13, so now you have created an isotopic variant. Once you've done that — this is similar to, if you're familiar with proteomics, the iTRAQ and ICAT methods — you have heavy and light. If you've labeled heavy and light, you can actually quantify. If the labeling efficiency is the same, and it should be essentially identical, you can then compare samples and quantify. So you can positively identify compounds and quantify them, as long as you have essentially a standard mix containing these labeled reference compounds. It saves on isotopic synthesis: you just buy the compounds off the shelf and label them with this reagent.
It takes 10 minutes for each labeling reaction; array them out in a 96-well plate, five or ten of those 96-well plates, and by the end of the day you have your whole library of isotopically labeled standards. It's dirt cheap and it's very effective. Using this approach — this is dansyl chloride, so it's fluorescent — you can detect, and even quantify, without having to use a mass spectrometer at all. But you can have the carbon-13-labeled one and then compare: you'll see C12 and C13, so you've got double the peaks and you can be sure these are real compounds. You can compare the C12 and C13 intensities, so you can quantify. And because the metabolite is labeled with a hydrophobic tag, it comes off slowly in the HPLC instead of eluting right at the front in one massive lump, so you get great separations. The other bonus is that these ionize really, really well, so your intensity is increased, sometimes by a factor of 100 or more. Using this technique, the Li lab was able to identify and confirm hundreds to thousands of peaks in urine, and they were able to quantify — from 30 nanomolar to millimolar levels — almost 100 compounds. So this is quantitative mass spec. They didn't have to do a lot of isotopic synthesis: they synthesized one compound, reacted it with easily purchased library compounds, and were able to do quantitation over a huge range. So those are some of the advantages. That's chemoselective labeling. There are other types of labeling approaches — other labs have introduced this idea. You can convert non-UV-active compounds to UV-active ones so you can quantify by UV or fluorescence. You get great efficiency and improved detection. You can do affinity purification, you can do LC in reverse phase, you can do isotopic quantification, and you not only increase the number of compounds detected but can determine whether they're real or not. So you get rid of the adduct question.
You get rid of the neutral loss question, and you can start doing quantitation. There are kits sold now which don't use the derivatization method, but which do have isotopically labeled versions of the compounds and use multiple reaction monitoring. These can be run on largely QTRAP-type or triple-quad instruments, and they will generate quantitative measures for between 160 and 180 compounds. So again: don't worry about the adducts, don't worry about the neutral losses, don't worry about the noise. It's a targeted method that quantifies those compounds, and it's mass spec based. We've talked about multiple reaction monitoring — it's a way of looking at how things fragment and watching for the very specific peaks that correspond to a specific class of compound. And if you have some kind of isotopically labeled form, you can use that as a reference to quantify. It's been done to quantify proteins in proteomics, and it has worked very well for many years in drug discovery and drug metabolism. So it's a set kit: it only gives you those 20 amino acids, the biogenic amines, a bunch of phospholipids — and they do have a steroid kit, which gives, I think, 19 steroids. It's a 96-well format, so it's not just for one sample; it's for about 80 samples. It takes about 24 hours to run, but it's basically automatic: load it up — the prep does take a bit of time — then walk away, come back a day later, and you've got the quantitation data. And this is some quantitation data that we measured. This is for urine; we've done it for CSF, for serum, for rumen, for all kinds of things. So it's a very effective mass-spec-based method, and we really like it because it's very high throughput. We've cross-checked it very carefully against many different methods — NMR, GC-MS, ELISA-type methods — and the numbers always come out very accurate.
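Both the 12C/13C labeling described earlier and these MRM kits rest on the same internal-standard arithmetic: spike in a labeled version at a known concentration, and the analyte concentration follows from the light/heavy peak ratio. A minimal sketch; the intensities are invented numbers for illustration only:

```python
# Isotope-labeled internal-standard quantification: the analyte (light)
# and its labeled standard (heavy) co-elute and ionize essentially
# identically, so concentration scales with the intensity ratio.

def quantify(light_intensity, heavy_intensity, standard_conc_uM):
    """Analyte concentration (uM) from light/heavy peak intensities."""
    return light_intensity / heavy_intensity * standard_conc_uM

# Light peak twice as intense as the 50 uM heavy standard -> 100 uM.
conc = quantify(2.0e6, 1.0e6, standard_conc_uM=50.0)
```

This is why the adduct and neutral-loss questions disappear once you quantify this way: the standard suffers the same matrix and ionization effects as the analyte, so the ratio cancels them out.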
But yeah, I mean, some of the compounds are in there because they're easy and cheap. If I were designing it, there are probably 20 compounds I would swap in for 20 others that are there. So the last part here — because we're over time again — is computer-aided structure elucidation. This is a way of determining structures that was pioneered by Chris Steinbeck at the EBI, but there's now a Toronto company, ACD, which probably has the world's best computer-aided structure elucidation software. And there are different approaches for doing it. The top-down one is to take known metabolites and predict possible variants through metabolic transformation — there's software that does that — then, from those transformed, predicted theoretical metabolites, predict their spectra, and see if those spectra match anything you're seeing. Sometimes that works, and there's reason to think that compounds transformed this way do exist; we just don't know how abundant they are. There's also the bottom-up approach, which is to think about molecules in terms of structures — using a variety of neural networks and genetic algorithms — and in the end piece together different possible compounds, predict their spectra, and compare the observed spectra with the predicted spectra. So here you're simply saying: this is what I know about metabolism, this is what I know about chemistry; let's see if I can make some compounds that are chemically and physically realistic, and let's see if I can create some spectra that would match those compounds — do they exist in my data? So again, this is trying to identify those unknown unknowns. And again, this is something that could easily take two days of lectures and talks.
But this is just to point out that this is one of the approaches people are pursuing, where the idea of predicting spectra allows them to potentially identify these unknowns.