Okay, so I think we've dealt with our questions, so we'll dive into metabolite identification and quantification. This picks up where I left off in the last couple of slides of the last module, which is quantitative metabolomics. We call it targeted, but it can also be untargeted; either way, it's still quantitative. We're going to learn about spectral deconvolution, or spectral assignment, and we're going to use it for NMR, for GC-MS, and then also for LC-MS. And then we're going to look at a variety of mass spectral databases and database searches. The goal in metabolomics is to go from spectra, whether GC-MS, LC-MS, NMR, whatever, to lists of compounds and their concentrations. The concentrations can be relative or absolute. My own preference is absolute concentrations, and that's certainly a growing trend in the field of metabolomics. Historically, the problem with metabolomics was that it lagged behind the fields of genomics and proteomics. If you had a genome sequence, you could very quickly translate the genome into protein sequences, and then you could just do a BLAST search and basically identify all of your genes and proteins. Sequence alignments and sequence comparisons could identify functions, names, all kinds of things, and you could do that in a matter of minutes. In the world of proteomics, if you've got MS and MS/MS data, you can submit your unknowns to the Mascot server and very quickly get protein identifications. So public software has existed in both genomics and proteomics for a long time to identify your unknowns: submit a sequence, submit a mass spectrum, get your answers. In the world of metabolomics, for a long, long time, there just wasn't that tool.
Basically, people would go to the library, pull out books published in the 1960s, compare their spectra to whatever was there, and hope they could find what they were looking for. That's been an issue with metabolomics, and probably one of the reasons it was the last to join the triumvirate of omics. In the next bit we'll show how that's now changed, and show you some of the tools for doing it. But there's also this issue of what's become the "unknown unknown" problem, partly picking up from Donald Rumsfeld, if people remember his quote: there are known unknowns, things we know we do not know, but there are also unknown unknowns, the ones we don't know we don't know. In metabolomics, there are lots of situations where the peaks are there, but I can't really tell from a retention time, or even from a collection of m/z values or chemical shifts, what the compound is. But by submitting it to a database, these databases or software tools can identify it, because the compound was previously identified and sits in a database; it has gone from unknown to known. Then there are the more challenging cases: compounds for which you've collected spectra, but they match nothing in any of the known databases. The mass doesn't match anything, the NMR spectrum doesn't match anything, the retention time doesn't match anything, or some combination of all three. In those cases, for the unknown unknowns, you have to use computer-aided structure elucidation, or CASE, methods, which we'll briefly talk about. But what we're mostly looking at is the known unknowns. That is, someone at some point in history has actually identified and purified the compound and collected spectra for it.
So your task is really just to see whether it's in this particular biosample or not, and what its concentration may be. For those known unknowns, we do something called spectral deconvolution, or spectral assignment. All you're doing is matching peaks to a known set of peaks from pure compounds using a pre-compiled database. Someone has had to collect these pure compounds, run literally thousands of spectra, tabulate them, put them online or in books, and make them searchable. But that's been done, it's being actively pursued, and it works for NMR, GC-MS, LC-MS, and MS/MS data. That identification of the known unknowns is what I call quantitative metabolomics. People can call it targeted, but it's also untargeted, because in the case of NMR and GC-MS you're not really targeting. So that's the theme for today, at least for Module 2. In the case of metabolite deconvolution by NMR, this is essentially what you're doing. You're given a mixture, which is at the top, and if I hadn't shown you the other components, the question is: is that mixture one compound, two compounds, three compounds? Looking at it, there are, I think, six spectral clusters, so you might say maybe it's six compounds; or, thinking in the world of GC-MS, you might say there are 20-odd compounds. What you can see is that the top spectrum is in fact the sum of three spectra, from compounds A, B, and C. And not only can you see that it's a sum, you can also get an estimate of the concentrations: compound A equals compound B in concentration, and compound C probably also equals compound B. So there are three different compounds at equimolar concentrations. In some cases the peaks overlap, with exactly the same chemical shifts; in other cases they are unique. This is relatively simple to do with three compounds in a simple mixture.
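The idea of decomposing a mixture spectrum into a sum of pure-compound spectra can be sketched as a least-squares problem. Everything below is illustrative: the "reference spectra" are made-up intensity vectors, not real NMR data.

```python
import numpy as np

# Hypothetical reference spectra for pure compounds A, B and C, each
# sampled at the same five chemical-shift positions (made-up values).
refs = np.array([
    [1.0, 0.0, 2.0, 0.0, 0.0],   # compound A
    [0.0, 3.0, 0.0, 1.0, 0.0],   # compound B
    [0.5, 0.0, 0.0, 0.0, 2.0],   # compound C
])

# Observed mixture: all three compounds at equal (1.0) concentration.
mixture = refs[0] + refs[1] + refs[2]

# Solve mixture ≈ refs.T @ conc in the least-squares sense; the
# coefficients are the estimated relative concentrations.
conc, *_ = np.linalg.lstsq(refs.T, mixture, rcond=None)
print(conc.round(3))   # ~ [1. 1. 1.]
```

Real deconvolution tools also have to handle peak-position shifts, overlap, and baseline distortions, which a plain least-squares fit like this does not.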
It's much more difficult with 50 or 100 or 1,000 compounds in a complex mixture. But that's the challenge. Now, there are tools and software programs that allow you to do this. There's a company called Chenomx, based in Edmonton, whose software does spectral deconvolution with NMR. The software itself is something you can download and run for free, at least for a month, I think. It allows you to manually process the NMR spectra. This is where you do the Fourier transform, converting what's called the free induction decay into an NMR spectrum. Then you phase it with the software they provide. You can remove the water signal, which is usually very prominent in an NMR spectrum. It will perform baseline correction, which is something you have to do with MS and LC methods, and also with NMR. It will do referencing, identifying the chemical shift reference, and it will normalize the peaks; again, you have to do that manually. And then you fit the spectrum: you take the reference spectra and say, I think this peak looks like, say, leucine, so you pull out the leucine reference with a mouse click and it will fit that, and you can shift it around. It's actually very easy to learn and use, and in past years we would teach people how to use it. But we found it was taking a long time. The tendency is that it takes about half an hour to an hour for a person to fit a spectrum that's relatively simple but has maybe 40 compounds. And if we had 30 of you doing the fitting, we would get 30 different fits, even though it's the same mixture and same solution, just because people have different styles of baseline correction, fitting, and phasing, all of which can lead to misinterpretation. However, there are other tools.
Bruker has a software system called AMIX, not really as friendly or usable as the Chenomx one, but with some automatic tools. Bruker has also developed an entire NMR system and software, selling for about a million dollars, for analyzing juice, and you can get another one that analyzes wine. These will automatically identify the compounds in juice samples and in wine, about 40 compounds each. There's a tool developed by Tim Ebbels' group at Imperial College called BATMAN, which uses a Bayesian method to automatically assign or characterize NMR spectra. BATMAN typically takes about 8 to 10 hours for a sample mixture of about 20 compounds, so it's not exactly fast, and not really amenable to a course like this where we've got 30 people. What we'll talk about, and show you in the lab, is a tool called Bayesil, which was developed in my lab in collaboration with Russ Greiner in Computing Science at the University of Alberta. Why do we want automation? Well, it's fast: instead of 30 minutes a sample, you could potentially do one minute per sample. And it's consistent: it gives identical results regardless of which sample is run, and it does exactly the same thing every time. If it's wrong, it makes the same mistake every time; but if it's right, it's correct every time. Obviously, if you've got hundreds or thousands of spectra, it's something you can leave running on its own; you don't need teams of people staring at spectra or working on computers night and day, which is actually what happens. And because it's a computer, it can also see signals that are not obvious to us unless you're a real expert. So those are some of the real strengths of automation. Yes; generally, what you want to do with the Chenomx software is start with the original FID, because Chenomx guides you through the processing it expects you to do.
And it actually does a really nice job; in fact, it generally does a better job than the NMR instruments themselves. It has much better baseline correction, phasing, and water suppression than the instrument software. It's best to take the FID, which makes the result somewhat independent of whatever processing someone else has done. Yes, this is all 1D NMR. There are some tools that allow you to do analysis with 2D, but the problem with 2D is that it's not quantitative: it's good for identification, but peak intensity varies considerably with 2D NMR. That's why we're focusing on 1D here. I mentioned BATMAN; you can download and install the program. Jason here has spent much of his life trying to install it and run it. Some cool ideas, it's just incredibly slow. So we'll be using Bayesil today. What's nice about Bayesil is that it's web-based, so you don't have to download or install anything, and you don't have to worry about your operating system. Everyone now has web access, I think, so you should all be able to use it. We've tested it on many samples; we use it in my own lab, which is a core lab, so we've looked at hundreds of samples, and it's quite accurate. It uses what are called probabilistic graphical models. Most of you have never heard of them; they're like hidden Markov models, which some of you might have heard of. These are commonly used in speech recognition. If you've ever used speech recognition like Siri, those tools analyze spectra and look for identifiable patterns, and that's how they recognize words. Here, we're making the same approach recognize compounds; it's a similar type of problem. It functions in an inferential way, using inference techniques, which is what humans also use when they fit spectra. We watched how people fit spectra, and that's sort of embedded into Bayesil.
The reason it works, and the reason fitting complex spectra is feasible even for humans, is that you have to know what the spectrum is of. Is this blood? Urine? Cerebrospinal fluid? Saliva? Hemolymph? Plant sap? If you know what it is, you also have a good idea of its composition, give or take about 10 compounds. If you have that prior knowledge, which is where the Bayesian concept comes in, it helps with the fitting; it makes it a convergent problem. What we've also done with Bayesil is make it fully automated: it automatically phases, automatically sets the chemical shift reference, automatically removes the water signal and corrects the baseline, and then does all the deconvolution. Taking that away from the human gives it consistency, and that's critical, especially if you want to move this sort of thing into the clinic, which is one long-term plan. If you can make things automatic, then you can move them into general practice for things like screening, for industrial work, or for clinical work. These are some examples of compounds and spectra Bayesil has fit. The description of Bayesil actually just came out a couple of weeks ago, so we had to submit the slides before it appeared. Anyway, you can see how complicated things can be, but also how well they're fit: compare the green line to the black line, or the red line to the black line. This is what Bayesil is able to fit. Here it is working on cerebrospinal fluid, and in this case the fit was done by a human; it took an expert about 45 minutes, and you can compare the red to the black. Lots of peaks; there are about 45 compounds. Bayesil, on the web server, would take about five minutes for the same sample; running in parallel or on a really fast computer, it would take a minute or less.
So the website, which is also, I guess, the reference for the paper, is very simple. It was designed by Jason Grant, who's with us; if there are any problems, blame him. There are sample spectra you'll be working with in your lab: you'll upload spectra, and it will process and analyze them for you. It's an example of a trend already happening in metabolomics: things are moving more and more towards automation. Because, in some respects, why waste your time identifying compounds when the real interest is what these compounds do or mean, the biological interpretation? That's why you're pursuing your graduate degrees. So there is a trend, whether through software or through kits, to make metabolomics, at least the compound-identification part, simpler, faster, more robust, more automatic. If you're operating Bayesil, as I say, click on either example and press submit. Within the first five seconds it takes the free induction decay, or FID, and does a Fourier transform, converting it into a spectrum. That's what the spectrum looks like initially, which is really awful: peaks pointing up and down. That's called the unphased spectrum, and it's what a lot of NMR spectra look like. It has a giant water peak, the thing in the center, which covers all kinds of things, because we're collecting the sample in water. So that's after five seconds. About 15 seconds later, it has started phasing, so the peaks that were upside down are now right side up, and that giant unphased water peak is now mostly phased; it's starting to look like a pretty good NMR spectrum. About 30 seconds later, it has a completely flat baseline and is properly referenced.
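The FID-to-spectrum step described above is just a Fourier transform. Here's a minimal sketch with a simulated FID; the frequencies, decay constant, and spectral width are arbitrary illustration choices, not Bayesil parameters.

```python
import numpy as np

# Simulated FID: two resonances (100 Hz and 250 Hz) decaying with
# T2* ~ 0.3 s. All numbers here are made up for illustration.
sw = 1000.0                      # spectral width in Hz
n = 4096                         # number of complex points
t = np.arange(n) / sw            # acquisition time axis in seconds
fid = (np.exp(2j * np.pi * 100.0 * t)
       + np.exp(2j * np.pi * 250.0 * t)) * np.exp(-t / 0.3)

# Fourier transform: time-domain FID -> frequency-domain spectrum.
spectrum = np.fft.fftshift(np.fft.fft(fid))
freqs = np.fft.fftshift(np.fft.fftfreq(n, d=1.0 / sw))

# With this ideal FID the spectrum is already in phase, so the two
# tallest points of the real part sit at the resonance frequencies.
peaks = freqs[np.argsort(spectrum.real)[-2:]]
print(sorted(peaks.round()))   # ~ [100.0, 250.0]
```

A real FID also needs the phasing, baseline correction, and water removal the lecture describes; this sketch shows only the transform itself.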
The giant water signal is completely removed, so all the baseline correction, phasing, reference correction, and peak deconvolution is done, about 30 seconds in all. Then, over the next three to five minutes, it tries to identify every single peak in that spectrum, and this is what it produces. You'll see, again, a spectrum, which you can zoom into with a variety of tools, and it will display, just barely visible in the figure, a blue outline of what it has fit against the black actual spectrum. I think you'll see that just about every peak is identified. So in five minutes it's done what would probably have taken you an hour to fit manually. If you scroll down a little further, it provides the list: the compound, the HMDB identifier for the compound, the concentration of the compound, and a confidence score. A confidence of 10 means absolute certainty the compound is there; nine, very, very certain. Five says we're not that sure: it could be there or not, so it might be worthwhile looking at it in a little more detail. Anything below five is considered probably not there. It's a web server, so it can't do everything for you. What it currently handles is serum, plasma, and cerebrospinal fluid, from humans or rats or any mammal, probably even fish and frogs and lizards too, because these largely share the same expected composition. It doesn't work for urine, plant sap, wine, or juice, because you'd have to collect the reference spectra for those, and that's not trivial. It doesn't work for every kind of NMR instrument either: it doesn't work for 400, 700, 800, or 900 MHz, but it works for the common ones.
If you're going to use it outside this course, you should be really careful to prepare the samples exactly as described; with any analytical technique, you have to follow the protocol. What we found when we put this on the web was that almost immediately, I'd say, 90% of people didn't bother to read the protocol, so they'd upload anything and then complain that it wasn't working. So again, it helps to read. Right now it runs in a single-spectrum mode; there's also a bulk spectrum analysis mode that Jason has spent a long time working on. Again, if it doesn't work, blame him. So that's what we'll see in the lab. I'm going to switch now to metabolite identification by GC-MS, which is a little different. Recall that in GC-MS we separate by gas chromatography, and, as I said, under any given peak you'll typically find, if you're lucky, a single compound, but more often than not it may represent two or three or ten compounds. What we can see here is that we've taken this little peak and blown it up, and it's actually the sum of three peaks, and within each of those three peaks we get mass spectra. In some cases the mass spectra aren't totally pure, but hopefully some are purer than others, so you can deconvolute which ones belong to the red compound, which to the black, and which to the green. Once you've got those spectra, and these are the electron impact, or electron ionization, spectra, you look them up in a database of pure compounds, with MS spectra collected under identical conditions, 70 electron volt fragmentation energy, and you compare and try to match them. If you look at the red one very carefully, you'll see it matches the one that's circled; the blue one matches the one that's also circled; and the green one matches the other one that's circled.
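Matching a query EI spectrum against library spectra like this is usually scored with a dot-product similarity. This sketch is a simplified version: the official NIST match factor also applies square-root and m/z weighting to the intensities, which is omitted here, and the intensity vectors are invented.

```python
import numpy as np

def match_factor(query, library):
    """Toy spectral match factor: squared cosine similarity of two
    intensity vectors (aligned on the same m/z bins), scaled to 1000.
    The real NIST formula also weights intensities by m/z."""
    q = np.asarray(query, dtype=float)
    r = np.asarray(library, dtype=float)
    cos = np.dot(q, r) / (np.linalg.norm(q) * np.linalg.norm(r))
    return 1000.0 * cos ** 2

spec = [0.0, 10.0, 45.0, 100.0, 5.0]
print(round(match_factor(spec, spec), 1))   # ~ 1000.0: identical spectra
# Same intensities shuffled onto the wrong m/z bins score very low:
print(round(match_factor(spec, [100.0, 5.0, 0.0, 10.0, 45.0]), 1))   # ~ 11.0
```

The scaling to 1,000 for a perfect match mirrors how the match factor is reported by the NIST search software, as discussed below.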
So now you've identified the three compounds. Even though they had basically the same retention index, they had three different spectra, so you've identified them. As I say, the reason that's possible is that, at least with electron ionization or electron impact ionization, compounds fragment in exactly the same way when you put them under the 70 electron volt potential. That's why it's reproducible. If you used 65 electron volts or 100 electron volts, you'd get very different spectra and nothing would be identifiable. So again: follow the right protocol. What you'll typically see in an MS spectrum is the parent ion, or molecular ion; that's the compound at its molecular weight, plus one if it picks up an extra hydrogen. Occasionally, if there's a gas in there, an adduct may form, so you actually see a peak at higher molecular weight; in this case it represents an additional chloride ion that's been added. At the lower molecular weights, closer to the y-axis, you'll see a whole bunch of other peaks: those are the fragment ions, and that's where all the information is, because with GC-MS the resolution is very low, typically one Dalton, which isn't enough to identify a compound by molecular formula, but the fragment pattern has all the information. The intensities don't tell you nearly as much as the masses do. The other thing to remember is that in most cases GC-MS compounds are derivatized: they'll carry trimethylsilyl (TMS), methoxime, or TBDMS derivatives, sometimes decorated at two or three or four different places, so their masses are increased. Instead of seeing an adduct, you'll see a TMS derivative or a double TMS derivative or a triple methoxime derivative, whatever.
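The mass shifts from derivatization are easy to predict: each TMS group replaces an active hydrogen, adding roughly 72.04 Da. A small sketch using glycine as an example; the masses are approximate monoisotopic values and the function is purely illustrative.

```python
# Approximate monoisotopic masses (Da); treat these as illustrative.
H   = 1.00783
TMS = 73.04740            # the Si(CH3)3 group itself
TMS_SHIFT = TMS - H       # each TMS replaces one active hydrogen: ~ +72.04 Da

def derivatized_mass(parent_mass, n_tms):
    """Mass of a compound carrying n_tms trimethylsilyl groups."""
    return parent_mass + n_tms * TMS_SHIFT

glycine = 75.03203        # C2H5NO2, monoisotopic
for n in range(3):
    print(n, round(derivatized_mass(glycine, n), 3))
# 0 75.032
# 1 147.072
# 2 219.111
```

This is why, as the lecture notes next, one compound with several derivatizable groups can show up as several peaks, all at different masses.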
So the masses you get from GC-MS, including the parent ion, don't necessarily correspond to the pure original compound; they correspond to the derivatized compound. For this one compound, which actually had six derivatizable amine, hydroxyl, or acidic groups, you could end up with six different peaks, all with different masses. One compound, six peaks. NMR is typically used to identify hydrophilic, water-soluble compounds. GC-MS is also pretty good at identifying hydrophilic compounds like amino acids, very good with organic acids, and very good with fatty acids, which you generally can't detect by NMR, or with cholesterol in some cases. It has a mass limit, typically compounds less than 500 Daltons, so you're not going to detect really big molecules or large lipids with GC-MS. As I mentioned, gas chromatography is higher resolution, with more theoretical plates and better reproducibility than LC. The other thing that's happened in the GC-MS community is that they've standardized everything; unfortunately, that hasn't happened in the LC or LC-MS community. That standardization is what keeps EI spectra so comparable. What most people doing GC-MS work use now is a combination of a software tool called AMDIS and the NIST database, from the National Institute of Standards and Technology. I'm not sure whether NIST is up to version 12 yet, but this is version 11, and it has hundreds of thousands of electron impact spectra. It also has ion trap, Q-TOF, and triple quadrupole spectra, but because things aren't as standardized for those methods, that part of the database isn't as useful as the EI part. There are also retention index values for about 20,000 compounds in this database. Remember what I said: the retention index can actually be used to identify, or at least narrow down, what a compound is.
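A retention index places a compound on a roughly column-independent scale by interpolating between the retention times of n-alkane standards, with each alkane anchored at 100 times its carbon number. Here's a hedged sketch with made-up retention times, using the linear (van den Dool style) interpolation appropriate for temperature-programmed runs.

```python
def retention_index(rt, alkane_rts):
    """Retention index by linear interpolation between bracketing
    n-alkane standards. `alkane_rts` maps carbon number to observed
    retention time; the values used below are invented."""
    carbons = sorted(alkane_rts)
    for lo, hi in zip(carbons, carbons[1:]):
        t_lo, t_hi = alkane_rts[lo], alkane_rts[hi]
        if t_lo <= rt <= t_hi:
            return 100 * (lo + (hi - lo) * (rt - t_lo) / (t_hi - t_lo))
    raise ValueError("retention time outside the calibrated alkane range")

# Hypothetical alkane calibration, C8 (octane) through C16 (hexadecane).
alkanes = {8: 4.0, 10: 7.5, 12: 11.0, 14: 14.2, 16: 17.0}
print(retention_index(9.25, alkanes))   # halfway between C10 and C12 -> 1100.0
```

Because the index is anchored to co-injected standards, it transfers between instruments far better than a raw retention time, which is why the NIST database can tabulate it for thousands of compounds.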
This is what the search software looks like. It's something you download and install, usually on a Windows machine. You can see the mass spectra, compound lists, scores, and probability matches, so it's quite informative, and you can see how the spectrum at the top in red matches the spectrum at the bottom in black, which is another visual cue about whether you've got a matching compound. Essential to how the NIST search works is a processing tool called the Automated Mass Spectral Deconvolution and Identification System, or AMDIS. It does what Bayesil does for NMR, to some extent: it identifies the background noise, it does peak picking, and it does spectral deconvolution, pulling the mass spectra out from under the GC peaks. Then, if connected to the NIST database, it does compound identification using a scoring technique called the match factor, or MF, which measures the similarity of the peaks in your query to the peaks in the database. It has a formal mathematical definition: it compares the intensity and mass of each peak, and it's essentially a dot product, if you remember your linear algebra, scaled by a factor of a thousand. A perfect match gives a match factor of 1,000; some people divide that to express it as a percentage, or as a number from 0 to 1, but as officially calculated, a perfect match is 1,000 and a complete mismatch is 0. If you're doing GC-MS and you want to identify, including by retention index, as well as quantify, you need a set of external standards. Usually the standards are n-alkanes, typically spanning octane to hexadecane, and these can be used to determine and calibrate your retention indices; if you actually know their exact concentrations, they can also be used to help, a little bit, with calibrating the
quantitation. Generally, you also want to run a blank sample, which lets you identify peaks that might arise from the solvent, the derivatization agents, or material still stuck in the column. A blank is really good to have, and often you subtract the blank from your spectrum. So you've got your calibration standards for the retention indices, you've got your blank sample, and then you run your sample of interest under the same conditions as the blank. This is what your external standard for retention index calibration typically looks like: eight or nine peaks there, nicely separated, and you can now scale everything so you've got very precise retention indices. With that alkane mixture you can start standardizing things and create a calibration file; that calibration file, with AMDIS and NIST, calculates retention indices, and then you can start searching for matches. You can also compare against the blank, coming back just to make sure you're not identifying any false positives. So if you're running AMDIS, you create a calibration file; this is what one typically looks like once it's run. Then you calibrate by calling up the calibration file, which changes all the retention times to retention indices, so now everything's normalized, and you can use the retention index information to help confirm an identification. Then you can start searching the NIST database using their software. Here you're seeing a peak, and there might be, in this case, three or four possible compounds; by clicking on the peaks, the red peak, the yellow peak, the blue peak, which is this curve, you can see what the mass spectra look like. This is for 11.597 minutes, or the corresponding retention index, and you can see what corresponds to the peak, see the location, and zero in. So here we're looking at the peak that's
marked in the big red box. You look inside, and you see a yellow, a red, and a blue curve summing to the observed peak. Then you look at the masses and the match factor, and you can see that under this peak we've got a good match factor, 84% or 840, and it matches to valine. There's maybe another, weaker peak, or another set of compounds, and that might be something else. So you can use the NIST and AMDIS approach, but there's also an automatic approach we'll show you later called GC-AutoFit. Instead of the AMDIS tools, people have developed other, commercial ones: AnalyzerPro from SpectralWorks and ChromaTOF from LECO; AMDIS, ChromaTOF, and AnalyzerPro were compared five or six years ago now. There are also databases other than NIST 11 or NIST 12: there's the Golm database, from one of the Max Planck Institutes in Germany, and Oliver Fiehn has produced a semi-commercial library as well. A lot of the processing done with AMDIS and NIST can actually be automated, and we've converted it into an automatic tool called GC-AutoFit. It's a web-based tool like Bayesil. You still have to collect some references: you have to collect the sample, the blank, and the alkane standards. But it does the retention index referencing, it does the peak identification that AMDIS does, and it does the peak integration, which AMDIS doesn't do, to calculate concentrations. It takes a bunch of different files, it's quite fast, much faster than Bayesil, and it can identify quite a few compounds. It doesn't work for everything, but it works for mammalian blood, urine, saliva, and CSF; if you don't follow the protocol, though, it won't work. So that's another lesson, or reminder. This is a quick outline of how you would prepare and run GC-MS files: you would have your alkane standards, and you can label them
depending on what format they're in. You'd have a blank file, and again you can label it whatever you wish, and then you have a set of sample files; if you're analyzing a bunch of runs, 10, 20, or 30, you can upload those. You may need to do some conversions: netCDF and mzXML are the required formats, and there is conversion software, either downloadable or associated with the instruments. Once you have those files, the sample files, they might be zipped, or you can upload them one at a time. So you identify your alkane standards (browse, select), your blank files (browse, select), your sample files (browse, select), and then you also need to choose your database. The database is a library that should match what you think you're analyzing: if you're going to analyze blood, don't choose the cerebrospinal fluid library; if you're going to analyze urine, don't choose a CSF library. There are different types of libraries, and certain types of standards you can choose to help with calibration, identification, and quantification. So here you've got your choice; in this case we'll be using an internal library (we won't be using your own, since I don't think any of you brought one), a library that lives on the server. In this case we're choosing urine, and we're choosing our calibration standard of cholesterol, which allows us to quantify things. Once you've uploaded everything, double-check your alkane standards to see that they look nice, and look at your samples to see that they look decent. Then it crunches away and kicks out a list, not unlike what you saw for Bayesil: a collection of peaks and retention indices and what the compounds are. It annotates them, and you can view them in a viewer similar to Bayesil's. You can also save the results as a CSV table, and of course you can view them, as we've just seen, in the spectral viewer. So that's an automated form of GC-MS, and we'll run through that
as well. Now, whether it's NMR, GC-MS, or LC-MS, it's important to be aware of the different levels of metabolite identification. In most cases you're striving for what we call a level 4, or at least a level 3, identification. At level 4 the compound is positively identified; that often means you have the authentic standard in hand to compare against, or, in the case of NMR, so many peaks in the spectrum match that you can be quite certain of the compound. With GC-MS, if you have a high number of matches to the fragment pattern plus the retention time, you can probably also call that a level 4. If you're matching only a parent ion mass, or a parent ion mass plus a retention time, it's a little more tentative, and that would be a level 3. That's where a lot of people leave things, and I think it's an unfortunate situation that's going to lead to many problems: a lot of people identify compounds purely by mass matching to the parent ion alone. That's a very weak identification, I would put it at more like a level 2, and it's more often wrong than right. If you want to positively identify a compound, we strongly suggest that you either identify it by a robust method like NMR, or by MS fragment-ion matching coupled with retention indices or retention times, or use the authentic standard. Level 1 is essentially the unknown, a feature or an unidentified peak. That isn't terribly useful when writing up a thesis, but in metabolomics there are lots of unknowns, so it's still worthwhile cataloguing them as unknowns. LC-MS is very similar to GC-MS; if you remember the picture I showed for GC-MS, conceptually it's the same thing. If you look at what's under an LC peak, there are typically several compounds, and several spectra may be there. They might not be as complex as the ones I'm showing here unless you're doing MS/MS. But even
with soft ionization you'll get some fragmentation. So there are a number of commercial tools people can use: Agilent MassHunter, Thermo SIEVE, Progenesis for Waters, ProfileAnalysis for Bruker. But there's also freeware available; XCMS and MZmine are the two most commonly used. How many people have ever used XCMS? Two, okay. How many have ever used MZmine? One, okay, two. So, how many have used commercial software for LC-MS metabolomics? One, two, three, four, five. Okay, so, neither here nor there, I just wanted to know what people are familiar with.

So we're going to be learning about XCMS Online. XCMS was a downloadable program, and it still is, and it was one of the first open-source tools for mass spectral processing. You can download XCMS and run it, and we used to offer a little lesson on how to do that; it does peak picking and peak matching. But now, instead of a program, it's available as a server, and servers are always easier to use than downloadable programs, so we're going to use XCMS Online, which is the server. It accepts a wide range of formats, many more than GC-AutoFit, because it has to deal with all the different manufacturers, and they haven't been able to agree on formats yet. Now, XCMS doesn't do metabolite identification, or at least not in the way we'd like it to, but it is linked to a database called the METLIN database, which you'll have an opportunity to use and explore.

So this is outlining how XCMS Online works. It basically takes a large collection of LC-MS data and gets the extracted ion chromatograms, and because retention times in LC are not very uniform, not like GC, they have to be aligned. So it does what's called non-linear alignment, or time-warping alignment: it shifts and adjusts peaks so everything matches. Once all those spectra, which were barely aligned initially, are fully aligned, then you can start running the actual XCMS analysis.
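To make the time-warping idea concrete, here is a minimal Python sketch of landmark-based retention time correction. It assumes you have already matched a few anchor peaks between a drifted run and a reference run; the `landmarks` pairs below are hypothetical numbers, and real tools like XCMS use more sophisticated non-linear methods (e.g., LOESS fitting), so treat this as an illustration of the concept only.

```python
from bisect import bisect_right

def warp_rt(rt, landmarks):
    """Map an observed retention time onto a reference time axis by
    piecewise-linear interpolation between matched landmark peaks."""
    obs = [o for o, _ in landmarks]   # observed RTs of the landmark peaks
    ref = [r for _, r in landmarks]   # where those same peaks sit in the reference run
    if rt <= obs[0]:
        return ref[0] + (rt - obs[0])     # extrapolate with unit slope
    if rt >= obs[-1]:
        return ref[-1] + (rt - obs[-1])
    i = bisect_right(obs, rt) - 1
    frac = (rt - obs[i]) / (obs[i + 1] - obs[i])
    return ref[i] + frac * (ref[i + 1] - ref[i])

# hypothetical landmarks: (rt in drifted run, rt in reference run)
landmarks = [(10.0, 10.0), (20.0, 22.0), (30.0, 31.0)]
corrected = warp_rt(15.0, landmarks)   # -> 16.0
```

A peak observed at 15.0 minutes in the drifted run maps to 16.0 minutes on the reference axis, because it sits halfway between two landmarks whose spacing was stretched; this is the "stretching out" correction mentioned above, as opposed to a simple linear shift.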
That's when it starts identifying peaks: it separates them, measures approximate amplitudes, separates signal from noise, and so on, and it generates a whole collection of m/z measurements. From there, if you've also run tandem MS, you can pull those spectra out as well, and so from either the mass spectra, if they're accurate masses, or from the tandem mass spectra, you can compare against the METLIN database, the HMDB, or other databases to identify your compounds. In terms of peak detection, it will try to combine peaks from the extracted ion chromatograms, and it does some filtering to make sure each peak is well detected. So there's a fair bit of processing; just as with NMR, the peaks are not always pretty, but peak identification is critical.

This is an example of what the alignment corrections look like. Initially, if you're running the same sample, in this case maybe six or seven times, you'll see that just from drift on the column, peaks come out at different times. But since it was exactly the same sample, they should have come out exactly overlapped. So what the XCMS software does, along with the commercial software tools, is this time-warping or non-linear alignment, making the appropriate shifts so all the peaks line up. In some cases it's just a simple linear shift; in other cases there's an actual stretching out that occurs. So: before, and then after.

XCMS has been compared to a number of other tools (SpecArray, MZmine, and others); that comparison was published a number of years ago, but overall it had good precision and recall, and all of those tools are now in their second and third iterations, so they're all doing pretty well in terms of identification. So, if you want to use XCMS you can download it, but today we're just going to have people sign up and use the online version. Now, unlike Bayesil and GC-AutoFit, which are pretty
fast and only run on a single CPU, XCMS is very compute-intensive and runs on a very large compute farm. So the scale of MS data processing is much more demanding computationally than it is for NMR or GC-MS.

This next part is just a very quick walkthrough; I'm not going to go through it in detail because you're actually going to see it in the lab. You go to the website, create a job, upload your data, upload your spectra; there are a couple of selection steps, and you can refer to these slides when you actually run it. Choose your parameters as a function of the type of platform you're using; we'll tell you which one you're supposed to use, but if you're running mass spec, they support a pretty wide range of platforms. Then you submit, and it gives you status updates reporting how far things have proceeded and whether results are ready to be viewed. It's not instantaneous: not one minute, not five minutes; I don't know how long for the samples today, maybe 20 minutes. We'll see if we can bring the whole system down.

Once the processing is complete, it has some wonderful spectral views, and we'll try to explore those in a little more detail, but what you're mostly interested in is the peak list table, which in this case lists masses, retention times, and intensities. The intensities give you a relative measure of quantity, not a precise measure, but that's the ultimate thing you're hoping for. So whether it's Bayesil, GC-AutoFit, or XCMS Online, all of them produce a list of metabolite names or identifiers, in this case m/z values, and relative or absolute concentrations. Now, in LC-MS, not all peaks are real, so peak picking is never perfect; the intensities are only relative, they are not concentration values, though they're still useful for doing statistics; and peaks aren't identified with compound names, so you still have to do some annotation afterwards. That can be a little challenging, and
that's where you use separate software, or use tandem MS data, to do matching and comparisons.

So, we've talked about how NMR is good for water-soluble compounds and GC-MS is good for organic acids and amino acids; LC-MS works well for lipids, bases, and fatty acids, generally better for hydrophobic molecules. To get a solid identification, to go from level 2 to level 3 or even level 4, you need retention time information, you need MS/MS data, you need accurate MS data, and ideally you need internal standards. So although LC-MS gives you lots and lots of information and lets you flag many potential compounds, it's non-trivial to get absolute quantitation or identification. So be warned: it's appealing, but not simple.

So how do you identify these compounds? If you've got accurate masses, there are database searches you can use. There's a database called ChEBI, Chemical Entities of Biological Interest, which is sort of the British equivalent of PubChem and a very interesting database. There's PubChem, which has some 30 million compounds, of which maybe only 0.1% are actually natural products or metabolites. A common mistake many people make is to search only PubChem because it's the biggest database; well, you're also going to get some pretty strange hits that are not at all biologically relevant and correspond to compounds that have never left the laboratory. And then there's the Human Metabolome Database, which I mentioned; it's mostly restricted to human or mammalian metabolites, but it actually covers a fairly large swath of metabolome space, even for plants, because people eat plants, and so plant metabolites end up in humans.

Here's a PubChem search: you can search by molecular weight range and it will list the hits. If we type in the range from 89 to 89.099, we get 408 hits, and you can browse through to see if they seem realistic or reasonable. In ChEBI you can also do a molecular weight search; those are mostly biological molecules, but they also include pollutants, drugs, and a few other bizarre things, and they cover mammals as well as plant, bacterial, and fungal metabolites.
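That molecular-weight-range lookup is easy to mimic in code, and doing so shows exactly why a parent-mass-only search is weak. The four-compound table below is a made-up, tiny stand-in for PubChem, though the monoisotopic masses are real values; note that alanine and sarcosine share the formula C3H7NO2, so a mass window alone cannot tell them apart.

```python
# hypothetical mini-database: (name, monoisotopic mass in Da)
compounds = [
    ("alanine",   89.04768),
    ("sarcosine", 89.04768),   # isomer of alanine: identical mass
    ("lactate",   90.03169),
    ("glycine",   75.03203),
]

def mass_search(mass_lo, mass_hi, db):
    """Return (sorted) names of all compounds whose mass falls in the window."""
    return sorted(name for name, m in db if mass_lo <= m <= mass_hi)

# the 89-89.099 window from the PubChem example
hits = mass_search(89.0, 89.099, compounds)
# -> ['alanine', 'sarcosine']: two isomers, indistinguishable by mass alone
```

Even in this toy database the window returns two equally valid hits, which is the level-2 ambiguity described above; in a 30-million-compound database the same window returns hundreds.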
You can also do more sophisticated searches there, not just by molecular weight but with an actual MS/MS or EI-MS spectrum: instead of putting in a single molecular weight, you put in multiple peaks, the fragment ions. We've already talked about the NIST database, and I've mentioned the METLIN database; I'll mention a couple of others that are also useful. They do the trivial molecular weight search, but they also support positive- and negative-ion searches, neutral mass searches, adduct searches, and MS/MS searching, so as a rule these are much more useful for metabolite identification.

One tool we've developed is a database you can search to compare MS/MS spectra; it contains calculated, or generated, MS/MS spectra for all the compounds in the HMDB and all the compounds in KEGG. The prediction tool is very accurate, so you can see whether your spectrum matches any of those, and that's one way of identifying compounds by mass spectral or fragment matching. You have three options: one option just takes a compound and predicts its MS/MS spectrum, but another option does compound identification, and for our purposes that's probably of the greatest interest. If you've chosen that, you enter the mass values and the intensities (there's an example you can use), and then you choose high, medium, or low fragmentation energies; that's up to you and is a function of the type of mass spectrometer you're working with. Once you've submitted it, it does a comparison and calibration, produces the hits, and you can see the spectral matches as blue and red lines.

So, as I mentioned, with mass spectrometry you'll often get things that are fragments or adducts: you can get neutral
loss species, multiply charged species, chloride adducts, sodium adducts, potassium adducts; in fact, a large fraction of the peaks you see in an LC-MS spectrum actually come from these so-called noise sources, and it may be even higher than that. The challenge, then, is to get rid of the fragments, to distinguish the adducts, the multiply charged species, and, for higher-resolution mass spectra, the isotopes, and to group them together.

Here's an example. We saw this with chlorine in GC-MS, but this is with sodium in ESI-MS. The actual parent mass of this compound is 951 Daltons, but you'll see another peak above it at 973. So if you're just doing mass matching and you didn't know there was an adduct, you would enter 973.287 into METLIN or HMDB or PubChem and you'd get a hit, but you'd be matching the wrong compound, because that peak was an isotope or an adduct. Adducts form on negatively charged groups: nitro groups, sulfate groups, anything with a strong negative charge. People, Oliver Fiehn and others, have created tables of the types of adduct masses you should generally look for, and by looking for pairs of peaks separated by 45 Daltons, 17 Daltons, 28 Daltons, 12 Daltons, and so on, you can identify things that are adducts, or singly, doubly, or triply charged species. So you can see that from just a single compound it's possible to end up with many, many different kinds of peaks, and that's why there are so many of these quote-unquote noise peaks that can be confounding and confusing. Oliver Fiehn has a very nice, comprehensive list of adducts, and alongside the adducts there are the neutral-loss fragments, spontaneous fragmentation events that happen in the ion source and create extra peaks. These are fragments you'll see even if you've never done MS/MS and have simply done a direct injection; things simply break up.
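Spotting adducts programmatically boils down to scanning the peak list for pairs separated by those characteristic mass differences. Here's a small Python sketch using the sodium example from the slide; the 0.005 Da tolerance and the 951.305 value (back-calculated from the 973.287 adduct peak) are illustrative assumptions, while the mass deltas for Na, K, and NH4 replacing a proton are standard values.

```python
# mass added when Na, K, or NH4 replaces the proton of [M+H]+ (standard values, Da)
ADDUCT_DELTAS = {
    "[M+Na]+ vs [M+H]+":  21.9819,
    "[M+K]+ vs [M+H]+":   37.9559,
    "[M+NH4]+ vs [M+H]+": 17.0265,
}

def find_adduct_pairs(mz_list, tol=0.005):
    """Flag peak pairs whose m/z difference matches a known adduct delta."""
    mz = sorted(mz_list)
    pairs = []
    for i in range(len(mz)):
        for j in range(i + 1, len(mz)):
            for label, delta in ADDUCT_DELTAS.items():
                if abs((mz[j] - mz[i]) - delta) <= tol:
                    pairs.append((mz[i], mz[j], label))
    return pairs

# the slide's sodium example: 973.287 is [M+Na]+, so [M+H]+ sits ~21.98 Da below
pairs = find_adduct_pairs([500.000, 951.305, 973.287])
```

Running this flags the 951.305/973.287 pair as a probable [M+H]+/[M+Na]+ couple, so the 973 peak can be consolidated with its parent rather than searched as a separate compound.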
So, different tools, databases, and online servers, METLIN, MZedDB, HMDB, handle and predict adducts; some are also able to predict dimer pairs and multiply charged species, and METLIN handles neutral-loss species. If you're only searching by MS and you don't worry about neutral losses and adducts, you're going to end up with lots and lots of mistakes, lots of false positives. So the software people use, partly embedded in XCMS Online but also in a lot of other packages, does the following: you remove and consolidate your adducts and your multiply charged species; you remove and consolidate your fragments, neutral losses, breakdown products, and any rearrangements; you remove and identify isotope peaks, which is called de-isotoping; and then you remove any noise peaks using the sample blanks. This is why it's always important to run a blank, and to check whether features in the technical replicates track with dilution; peaks that don't are indicators of random noise showing up. These are all the cleanup steps you have to do.

So if you had, say, a positive-mode spectrum and you're getting 15,000 features: remove the adducts and it's down to 12,000; remove the multiply charged species and you're down to 10,000; remove the neutral losses, down to 8,000; remove the isotopes, down to 3,000; remove the noise, down to 2,500. You've reduced it by a factor of 6, so the final spectrum might have 2,500 peaks that are real, and even then there are other issues; they might not all be real compounds. And then you do the negative-ion mode and you get just as many. The tools that help you with this are things like MZmine, a tool called MetFusion, MAGMa, and others, as well as commercial software. We're not going to have time in these sessions for you to get familiar with these and to process data with them, but again, I want to impress upon you that as rich as LC-MS data is, it's a lot of work to get compound identifications and to sort the noise from the real peaks, and it's not particularly quantitative, since we're just talking about relative peak intensities.
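One of those cleanup steps, de-isotoping, has a simple core idea: adjacent isotope peaks of a singly charged ion sit about 1.0034 Da apart, so you walk the sorted peak list and fold those neighbors into their monoisotopic peak. A minimal Python sketch follows; the 0.01 Da tolerance and the example peaks are assumptions, and real implementations also check charge state and the expected isotope intensity ratios.

```python
C13_SPACING = 1.00336  # mass gap between adjacent isotope peaks at charge 1 (13C - 12C)

def deisotope(peaks, tol=0.01):
    """Collapse singly charged isotope envelopes into their monoisotopic peak.
    peaks: list of (mz, intensity) tuples. A simplified sketch: real tools
    also verify charge state and that intensities decay plausibly."""
    peaks = sorted(peaks)
    keep, used = [], set()
    for i, (mz, inten) in enumerate(peaks):
        if i in used:
            continue
        keep.append((mz, inten))
        expect = mz + C13_SPACING          # where the first 13C peak should sit
        for j in range(i + 1, len(peaks)):
            if j in used:
                continue
            if abs(peaks[j][0] - expect) <= tol:
                used.add(j)                # absorb this isotope peak
                expect += C13_SPACING      # then look for the next one
            elif peaks[j][0] > expect + tol:
                break                      # past the envelope: stop scanning
    return keep

# a hypothetical envelope at m/z 180 plus an unrelated peak at 200
peaks = [(180.063, 1000.0), (181.066, 110.0), (182.070, 12.0), (200.100, 500.0)]
mono = deisotope(peaks)   # -> [(180.063, 1000.0), (200.1, 500.0)]
```

Four features collapse to two: the 181 and 182 peaks are recognized as the 13C satellites of the 180 peak and consolidated, which is exactly how the isotope-removal step above takes a feature count from 8,000 down to 3,000.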
A quick question from the audience: "Can you talk about the de-isotoping? When you get a mass spectrum of a compound you'll have its molecular ion along with its isotope peaks, so is it basically saying you don't have isotopes, or is that very software-dependent?" What it's really doing is taking the five or six isotope peaks that are visible and consolidating them, treating them not as six peaks but as one peak. But there's also the denoising angle: if I've got a very sensitive spectrometer and I'm not seeing any isotope peaks for a feature, it's probably noise, so that's another way of detecting noise. Again, though, it's a function of how good and how sensitive the spectrometer is and what the mass resolution is; in some cases de-isotoping isn't necessary because you don't have the resolution.

I know I'm kind of plowing through this; we've got a lot of material, and we've covered a lot, but we're just trying to highlight the different protocols used for identifying compounds: one for NMR, one for GC-MS, and you'll notice I'm spending most of my time on LC-MS, in part because it is very complex. So again, the strength of mass spec is that it gives a lot of information, and with the best, highest-resolution instruments you can get a lot of information to help identify compounds. If you can be absolutely confident that you've got the parent ion mass and you've been able to measure it to 0.1 ppm, you're well on your way to identifying the compound. If you've got that kind of resolution, down to millidaltons, you can use molecular formula generators: you just put in the mass to six decimal places and get a molecular formula. One example is a web tool called MZedDB, maintained in Aberystwyth in Wales; they have tools for generating molecular formulas. You just type in your mass; obviously your mass tolerance should usually be much smaller than 0.2
Daltons, and it will generate the molecular formula. Once you've got your molecular formula, you can do searches through PubChem, and the molecular formula is more useful than the mass, because with a formula you know, say, there can only be one chlorine, there can only be six carbons; you can rule out compounds that have fluorine or nitrogen or whatever. You can go a little further if you look not only at the isotope peaks but also at their abundances, along with rules about chemical bonding restrictions and presumed atomic composition. Generally, if you're looking at mice and rats, they don't take fluorinated drugs and they don't take brominated compounds, so you can exclude fluorine and bromine and other such elements from consideration. That way you can reduce the number of matches quite a bit, and this is something developed by Oliver Fiehn's group, called the Seven Golden Rules for formula filtering. You take the accurate mass, but by using other information, isotope intensities, consistency of bonds and valences, and so on, you can narrow things down quite a bit, both in terms of the formula and the possible structures.

This is shown here from back in the early days, when PubChem was much smaller: there are about 8 billion possible elemental compositions below 2,000 Daltons, and from those 8 billion you end up with about 600 million compounds that are highly probable using the Seven Golden Rules. If you then look at PubChem's 10 million compounds at the time, it reduces to something much smaller, and if you exclude the synthetic compounds in PubChem, you're basically left with about 50 to 100,000 natural compounds. This also highlighted the utility of working with databases built largely from natural products, rather than, say, the PubChem database, which covers anything that's ever been synthesized under the sun, a large percentage of which has never left a lab and therefore will never be found in any organism.
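A bare-bones formula generator of that kind is easy to sketch. The Python version below is restricted to C, H, N, and O and ignores valence rules entirely (real tools such as MZedDB's generator, or a Seven Golden Rules implementation, handle many more elements plus chemistry-based filtering), but it shows how a tight ppm window already collapses the candidate list: at 5 ppm, the monoisotopic mass of alanine returns a single CHNO formula.

```python
# monoisotopic atomic masses (Da); standard values
MASS = {"C": 12.0, "H": 1.007825, "N": 14.003074, "O": 15.994915}

def formulas_for_mass(target, ppm=5.0, max_atoms=(20, 40, 6, 10)):
    """Brute-force CHNO formula generation within a ppm mass window.
    A toy sketch: no valence or isotope-pattern filtering is applied."""
    tol = target * ppm / 1e6
    maxC, maxH, maxN, maxO = max_atoms
    hits = []
    for c in range(1, maxC + 1):
        for n in range(0, maxN + 1):
            for o in range(0, maxO + 1):
                base = c * MASS["C"] + n * MASS["N"] + o * MASS["O"]
                # solve for the hydrogen count directly instead of looping over it
                h = round((target - base) / MASS["H"])
                if 0 <= h <= maxH:
                    m = base + h * MASS["H"]
                    if abs(m - target) <= tol:
                        hits.append(f"C{c}H{h}N{n}O{o}")
    return hits

# neutral monoisotopic mass of alanine, C3H7NO2
candidates = formulas_for_mass(89.047679)   # -> ['C3H7N1O2']
```

With a 5 ppm window only one CHNO composition survives, which mirrors the point made above: improving mass accuracy from 10 ppm toward 0.1 ppm shrinks the formula list from hundreds toward one, although a formula still isn't a structure (alanine and sarcosine share this one).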
This is another plot from Oliver's paper, showing the large number of molecular formulas you can come up with based on the masses and mass ranges; obviously, as you increase the mass, there are more and more possible molecular formulas. But what's also pointed out in the paper is that as you improve the mass accuracy, going from 10 ppm down to 0.1 ppm, or improve your knowledge of the isotope abundances, you can very, very quickly narrow the number of possible formulas from hundreds down to just one. So that's a very powerful filter and can be very useful in identifying compounds, and this slide illustrates how mass plus isotope abundance gave exactly the right formula for a compound. Now, it doesn't give you the structure, but given this very, very large molecular weight, which could potentially be many, many things, the accurate isotope abundance information, together with mass accuracy that was not great but still good enough, produced the one precise formula that matched the compound.

So when you're searching through databases, it's important to keep a few things in mind. The PubChem database, as I said, is kind of useless for metabolomics unless you can confirm that a hit is a metabolite, which means having to check against a metabolite database, so it becomes a two-step process. ChEBI includes a lot of biological compounds, but it's not restricted to any species; it has metabolites from anything, as long as it's alive. METLIN originally was a human metabolite database, but it has now expanded into all kinds of plants and insects, so there's no organism of origin: you don't know what species these compounds come from. So a lot of people, and I'm seeing papers like this, say "I ran METLIN, I got my hits," and they were working on mice or rats and they get hits to all kinds of human drugs.
And the mice and rats are not on human drugs; it's because they're not worrying about the origin of the hits. Same sort of thing if you're doing plant metabolomics and you're getting all kinds of human drugs: that's not right. NIST has lots of pollutants and non-biological compounds, so those also have to be considered. So make sure you think about the organism and the source; if you can use organism-specific databases, your hits will generally be much more robust and accurate.

Now, we've seen that NMR is quantitative and GC-MS can be quantitative; LC-MS generally is not. Most studies are not absolutely quantitative; they give relative measures. To get absolute quantitation you actually need isotopic standards: deuterium, carbon-13, or nitrogen-15 labeled standards. You also have to have standards that are either the same as, or very similar to, the target chemical you're working with. To do that, you use what's called selected reaction monitoring or multiple reaction monitoring, SRM or MRM. Has anyone ever done SRM or MRM? Three, four, five, six. So that allows you to get quantification, and again, it uses a compound that's appropriately labeled. You select for that compound or its fragments, and once you have those fragments you can compare peaks and peak intensities, and because you know exactly how much of the isotopically labeled standard you've added, you can get precise quantitation by measuring the area under the curve. It's something you can do manually, which is a fair bit of work, or there are now automatic or semi-automatic ways of doing it. One of the best is a kit produced by a company called Biocrates, which has created all of these isotopic standards; it allows you to measure anywhere from 160 to 180 different compounds and get absolute quantitation.
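The arithmetic behind that isotope-dilution quantitation is worth seeing once. The sketch below assumes the labeled standard responds identically to the analyte (a response factor of 1.0), which is the idealization the technique relies on because the standard co-elutes and ionizes like the analyte; the peak areas and the 10 micromolar spike level are made-up numbers for illustration.

```python
def srm_concentration(area_analyte, area_standard, std_conc_uM):
    """Isotope-dilution quantitation: the labeled internal standard is spiked
    at a known concentration, so the analyte/standard peak-area ratio times
    that concentration gives the analyte concentration (response factor 1.0
    assumed, since the labeled standard behaves like the analyte)."""
    return (area_analyte / area_standard) * std_conc_uM

# e.g. analyte peak area 2.4e6 counts; labeled standard spiked at 10 uM
# gives a peak area of 1.2e6 counts -> 20 uM of analyte
conc_uM = srm_concentration(2.4e6, 1.2e6, 10.0)   # -> 20.0
```

In practice the areas come from integrating the SRM/MRM transition chromatograms, and kits like the Biocrates one mentioned above package the labeled standards and this calculation for you.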
And they have other kits for measuring bile acids, and I think there's another one coming out for steroids, and there's a trend now among many other organizations to start creating kits for quantitation via mass spec. If you ran such a kit, this is what you would typically see: concentrations of the metabolites and compounds identified, and you can see, with the remarkable sensitivity you get with MS, concentration ranges from 10 nanomolar to 7 millimolar, a nearly 10^6-fold difference in concentration. So if you do things right with mass spectrometry, with LC-MS, you can get levels of quantitation and identification that are better than GC-MS and better than NMR, but it's a lot of work, and if you don't do it right you can get some pretty bad results. I just want to emphasize that there are too many MS-based metabolomic studies appearing that are just not being carefully done, so don't be lax or careless when analyzing MS and MS/MS data. Okay, I think we're about due for lunch, actually.