So just repeating because we're trying to record this: the question is, are there different or better approaches for normalizing and scaling, given that some compounds are naturally high intensity and other compounds are naturally low intensity? Typically, even with compounds having naturally high or low intensities (that's an intrinsic feature of them, essentially how well they fly in the mass spec), it doesn't matter. You would still use the same uniform scaling approach. You want to apply it across all of the spectra for all of the samples; you don't want to apply one scaling to one side of your spectrum and a different scaling and normalization to the other side. It's got to be uniformly applied. It goes back to the point that you can't quantify in untargeted methods. If something is naturally more abundant, it doesn't necessarily mean that it has a higher concentration; it may just mean that it flies better or ionizes more easily. The only way you can quantify, and the only time you really should be worrying about quantification, is by actually putting in isotopic standards. People will add extraction references, or they will add extra compounds to their system, as a calibration and to help with the scaling. That's certainly reasonable and feasible to do. Some people will add drugs of some type, because these are not expected to occur in the sample, so you know what they are and you know that they're unique; but again, it also depends on your extraction protocols and whether it's a hydrophobic or hydrophilic extraction. But yes, scaling has to be applied across all regions of the spectra and to all spectra in the same way to be able to get things properly compared. Any other questions?
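As a sketch of that point, here is a minimal, hypothetical example of applying one normalization rule (total-area scaling) uniformly to every spectrum in a study, rather than treating different spectral regions or samples differently. The data here are made up; real pipelines would use binned intensities from the instrument.

```python
# Minimal sketch: apply the SAME normalization to every spectrum.
# Each "spectrum" is just a list of peak intensities; total-area
# normalization divides every intensity by the spectrum's summed area,
# so relative peak ratios within a spectrum are preserved.

def total_area_normalize(spectrum):
    """Scale one spectrum so its intensities sum to 1."""
    area = sum(spectrum)
    return [peak / area for peak in spectrum]

def normalize_all(spectra):
    """Apply the identical scaling rule to all spectra in the study."""
    return [total_area_normalize(s) for s in spectra]

# Two samples: everything 'flies' 10x better in sample B,
# but the same rule is applied to both.
samples = [[100.0, 50.0, 25.0], [1000.0, 500.0, 250.0]]
normalized = normalize_all(samples)
```

After normalization the two spectra are identical in relative terms, which illustrates the point: uniform scaling removes overall intensity differences between runs without touching the within-spectrum peak ratios, and it says nothing about absolute concentration.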
Okay, so we're going to go on to metabolite identification and quantification, and we're going to be looking at three general approaches to compound identification. We're going to look at NMR first, then GCMS second, and then LCMS. This is sort of a preamble to the lab that we're going to be doing after lunch, so this is getting you primed so that you'll have an idea of what we're doing. Then we'll also talk about searching through different databases for compound identification, especially with mass spec. So we talked last time about this idea of being able to go from spectra to lists. Formally this is called metabolite annotation. Your spectrum starts unannotated, and when you finish, your spectrum is annotated. The annotation could be just saying feature one, feature two, feature three; it could be saying that a peak is a specific compound; or it could be saying this is a specific compound and this is its intensity or its concentration. All of those are examples of annotation. This is just an example of an NMR spectrum that has been annotated, and then the resulting list of compounds. Now, when metabolomics began about 20 years ago, it was lagging behind a lot of other fields. Genomics had already been well developed and proteomics was evolving, and what had happened is that genomics and proteomics had online resources: genomics had GenBank, proteomics had tools like Mascot and BLAST, and the same with genomics, BLAST or BLASTX, which essentially allowed you to take sequence data or protein data, upload it onto the web, and instantly get your gene identifiers, your abundance of transcripts, your protein identifiers, even concentrations. But even as recently as 15 years ago in metabolomics, you'd have your nice GCMS or LCMS or NMR spectrum, and there was nothing you could upload it to.
There was no database, there was no BLAST, there was no Mascot, nothing that would give you your metabolite IDs and concentrations. So it's largely over the last 15 years that people have been focusing on developing exactly those tools, where you can upload your data, press go, and instantly your answer is there. Now, it's not yet there, but it's on the way, and we're going to try and show you some of those examples over the next few hours. So in metabolite annotation or metabolite identification you're dealing with two situations: one we call the known unknowns, and the second is called the unknown unknowns. The known unknowns are: here's my spectrum, what's in here? Unless you study spectra, whether it's mass spec, GC, or NMR, you're not going to be able to look at those peaks and instantly identify them. You have to use a technique called spectral deconvolution, where you have reference peaks that allow you to compare; the reference peaks are usually from pure compounds, and then you're able to align those reference peaks and say, yes, this must be the compound, and these are another couple of compounds here. Now, that is part of what is typically called targeted metabolomics, but it's also part of the metabolite annotation at the end of an untargeted metabolomic study as well. The other situation, which in the case of untargeted metabolomics or LCMS represents 98 or 99% of the peaks that you're seeing: these are the unknown unknowns. These are the ones that don't match to masses or values in HMDB or PubChem or anything, and for these you have to use a completely different technique, which is called computer-aided structure elucidation, or CASE. We're not going to talk about that, but it is something that can take months or years to determine the structure of a truly unknown unknown.
The terms known unknowns and unknown unknowns actually come from a speech that Donald Rumsfeld gave when he was the secretary of defense. It became a joke, actually, because of the way he stumbled over his descriptions, but it's still useful: there are known unknowns, that is, we know there are some things that we do not know; but there are also unknown unknowns, the ones we don't know that we don't know. So the unknown unknowns are still a fundamental challenge in metabolomics, but we're going to talk about the known unknowns. Spectral deconvolution is a technique that works for NMR, for GCMS, and for LCMS, so it's a general method, and the principle, as I say, is to basically match the peaks in your mixture, whether it's blood or urine or tree sap, to known peaks from pure, single-compound spectra. That means you have to have a pre-compiled database of those pure, clean spectra. On the right is an example of a spectrum that has been deconvolved: the black represents the actual spectrum, the red is the deconvolution, where you have done your best job to match, and you can see in some cases it didn't match very well, but this is deconvolution. So conceptually for NMR, and you could say the same is largely true for mass spec, you have a mixture, which is the top spectrum in blue, and you can see about a dozen peaks there, some tall, some short. In NMR we talk about doublets and triplets and singlets, and you can see a bunch of triplets and a couple of doublets and a couple of singlets. This is a mixture of three compounds, and you can see how, because compound A has one spectrum, that's the pure spectrum; compound B's pure spectrum is in green; and then there's a purple one, compound C. If you add all three spectra together you will get the top spectrum. The challenge in deconvolution is to say, here is my mixture, and to go the opposite way.
What individual compounds will produce this mixture spectrum? That's a little more challenging; it's called an inverse problem, and it potentially has multiple solutions. You could argue that all of those doublets could just be a bunch of singlets from different pure compounds that are yet unknown. But, as I say, you start with the top and you figure out, based on your library, what things best fit. It's like a jigsaw puzzle. In this case, again, it's the sum of the intensities and the locations that ultimately tells you whether you've got the correct combination of compounds. It not only identifies but also quantifies, saying that each of those is present at exactly one millimolar. So there are tools around for NMR that do spectral deconvolution. The first one that came out, around 1999 or so, was from Chenomx. It's still produced, and this is an example, zoomed in, where you can see a bunch of peaks from an NMR spectrum, and in yellow you can see one of the spectra from the reference compounds that fits to it; it also identifies, I believe, alanine as the matching compound there. On the lower right side you can see the full spectrum of this sample; this is, I believe, a urine spectrum. On the left you can see a picture of the structure of the matching compound, alanine. So it is a bit of drag and drop to try and associate the reference spectrum with the observed mixture spectrum. The NMR Suite produced by Chenomx has been around for almost 20 years. Users manually process the NMR spectrum: they do a Fourier transform, they do the phasing, and, as I was showing you examples of, they get rid of the water peak, which is very, very strong in NMR, about 110 molar protons from water. You do a baseline correction so everything looks flat, you normalize the peak shapes, and you do the chemical shift referencing.
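The core idea, that the mixture spectrum is a concentration-weighted sum of pure reference spectra, can be sketched as a tiny least-squares fit. This is an illustrative toy, not how Chenomx or Bayesil actually implement their fitting; the two reference compounds and their binned intensities are made up.

```python
# Toy spectral deconvolution: find concentrations c1, c2 such that
# mixture ~= c1 * ref1 + c2 * ref2, in the least-squares sense.
# Solved via the normal equations, with Cramer's rule for the 2x2 case.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def deconvolve_two(mixture, ref1, ref2):
    """Return (c1, c2) minimizing ||mixture - c1*ref1 - c2*ref2||^2."""
    a11, a12, a22 = dot(ref1, ref1), dot(ref1, ref2), dot(ref2, ref2)
    b1, b2 = dot(ref1, mixture), dot(ref2, mixture)
    det = a11 * a22 - a12 * a12
    c1 = (b1 * a22 - b2 * a12) / det
    c2 = (a11 * b2 - a12 * b1) / det
    return c1, c2

# Pure reference spectra (binned intensities) for two hypothetical compounds.
ref_a = [0.0, 1.0, 0.0, 2.0, 0.0]
ref_b = [1.0, 0.0, 1.0, 0.0, 0.0]
# A mixture made of 2 mM of A plus 3 mM of B.
mixture = [3.0, 2.0, 3.0, 4.0, 0.0]

c_a, c_b = deconvolve_two(mixture, ref_a, ref_b)  # recovers 2.0 and 3.0
```

Because the fit returns the scaling coefficients, the same step both identifies (which references are needed) and quantifies (how much of each), which is exactly the dual role of deconvolution described above. Real tools must also shift peak positions and handle overlapping, imperfect line shapes, which is what makes the problem hard.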
Then, once you've got your spectra looking nice, you have to do essentially guess and check. This is how most people solve an inverse problem. So you have a library of about 300 or 400 spectra, and you click on different compounds and drag and move those reference spectra to see if they fill in the peaks that you're observing. You're superimposing images. It's the same thing if you've done jigsaw puzzles: you'll take a piece, you'll see it kind of looks like it fits, and you'll move it around, except here it's done with a computer and mouse. So if people are trained in this, it'll take them about 20 to 40 minutes per spectrum. Now, we used to have people run this for this course, but we found that even if we gave people two hours to try the Chenomx software, most of them couldn't do it. It takes more than two hours to train; it can take up to three or four hours to get really good at it. So that led us to look at alternatives to try and teach people how you could do spectral deconvolution. There are commercial programs: Bruker produces one called AMIX, which is again similar to Chenomx, somewhat manual; they've modified their software so that it can now analyze juice and wine pretty automatically. Chenomx has also upgraded its software to make it more automatic than it used to be. Then there's some freeware that's been produced: one called BATMAN, produced at Imperial College London, and another called Bayesil, developed at the University of Alberta, which we're going to be using later today. When you have something that is totally automated, as opposed to guess and check, it obviously is faster: instead of taking up to four hours for people who didn't know what they were doing, or 30 to 40 minutes for people who are pretty skilled, it can be done in a couple of minutes. When something is done by a computer it's also very reliable: give it the same spectrum and it'll give exactly the same results.
If you gave each of you the same spectrum, we'd get 30 different results, and if we tried again on the same spectrum tomorrow, you'd also come up with 30 different results, all different from what you did today. Automation also allows you to run things overnight. People get tired. We used to do a lot of manual analysis; Mark has done it, Manoj has done it, and they've all aged tremendously from doing it. It's tiring, whereas if you let a computer run overnight you can get hundreds of spectra fit. And in some cases the computer is actually able to pick out things that people, because of their biases or lack of knowledge, cannot easily detect. So in some cases the computer does better than humans. This is the BATMAN home page; it's probably been updated somewhat. It uses a Bayesian technique to help with spectral peak fitting. There are other tools that have appeared more recently; I think rDolphin is one. But BATMAN was one of the early ones, and another early one was Bayesil. Bayesil is appealing because it's a web-based tool, so you don't have to download or install anything, and it's quite accurate. It uses what are called probabilistic graphical models, or hidden Markov models. If you use speech recognition with Google Home or Alexa or Siri, those use hidden Markov models to recognize your voice. They recognize complex patterns, and in many respects NMR or GC and LCMS spectra look a lot like voice signals. So what it does is fit and shift peaks, their intensity and position, the way a person would in a guess-and-check approach, like the Chenomx manual fitting. In order for it to work, just as Alexa or Siri or Google Home has to know that you're speaking English, in this case it needs to know which biofluid you're working with. If you tell it that you're analyzing blood when you're actually analyzing urine, it'll do very, very badly, because you've told it the wrong information.
So it needs to know what the composition of the biofluid is. But once that's given, it'll do automatic phasing, automatic chemical shift referencing, automatic water removal and baseline correction, and then it'll do all the automatic identification and quantification. So it can handle some pretty complex spectra. The top one here is an example where there are 90 different compounds in this particular sample, and you can see all the peaks beneath it in different colors, all of them fitting so that you get an almost perfect match to the observed spectrum. This is why it would take someone who's untrained many, many hours to get all of those peaks fit properly. That's 90 compounds, and you can imagine how much worse it is if you have to deal with 150. So lots of comparisons were done; they were needed to get it published. This is an example where a manual fit is shown at the top; that one took the individual about 40 to 45 minutes. Then the same spectrum was fit with Bayesil, which took about five minutes. And you can see that the matches are almost identical; comparing the red to the black, almost all the peaks fit quite well. So it's a website, and this is what the homepage looks like. Users can select which examples they want to use, or upload their raw data; you'll get a chance to do this later today. Once you start the operation, you have to provide some information: the instrument frequency, the type of reference standard you used, and how fast you want it to run, faster or slower; the slower setting is a little more accurate. Once you select those initial parameters, you can basically press go. Within about the first five seconds, the spectrum is Fourier transformed. This is what it initially looks like, and at that stage it looks pretty awful. This is after phasing: this is correcting the peaks so they go from dispersive to absorptive phase.
And you can start to see that it looks a little bit more like an NMR spectrum. That's about 15 seconds in. Thirty seconds later, it has done the baseline correction, it's removed the water, and it's done the chemical shift referencing. At that stage, it's ready to start the deconvolution. This is something that normally is done manually by most NMR people, but this system is able to do it automatically. The deconvolution will take about another three to four minutes, and in the end a spectrum like this is produced. If you look closely, you can see there's a black spectrum, but there's also a blue fit to the spectrum; the blue fitting is the deconvolution. You can zoom in and expand in both the x and y directions, or click on different regions, and see exactly how things have been fit. To your eyes and to my eyes, it looks like an exact match: every peak has been matched precisely, and in this spectrum there are actually several hundred peaks, not just the really obvious ones. Below that is a list of the compounds that were identified: acetic acid and betaine and carnitine and creatinine and citric acid. That's in one column, and their concentrations are down below. On the far right is how confident the fit is: a 10 means 10 out of 10, very confident. Lower confidence is around six or seven, and those might correspond to a single peak, which could or could not be the match, just as trying to use a single mass, the parent ion mass, is often not sufficient to identify a compound. But in NMR, most compounds have between three and 30 peaks, and when you match all the peaks, their chemical shifts, and their positions, sometimes you have very, very high confidence. This is a partial list, not the full list of the compounds identified. And everything was done automatically. Now, it doesn't work for everything.
It's for relatively simple biofluids; more complicated ones, like urine, it just can't handle. It's currently limited to 500 and 600 megahertz instruments for NMR; that's a measure of the magnet strength. Many people who use the online version don't read any of the instructions, so they'll collect their NMR spectra completely the wrong way, or on completely the wrong fluids, or at completely the wrong spectrometer strength, and then they'll try the fit and complain bitterly afterwards. So, as with anything in chemistry, read the instructions, follow them carefully, collect your spectra as directed, and then things will generally work. The web server can run in a single-spectrum mode; you guys will be using one that's designed for batch uploads. Now, Mark and Manoj have actually been working on rewriting Bayesil, and a new version will be coming out later this summer called MagMet. It's both faster and more accurate than Bayesil, but since we didn't quite finish in time, we'll be using the Bayesil server this time. If you come back next year, you can use the MagMet one. Now, GCMS is a different beast, and the way it's typically done is to work not with an NMR instrument but with a mass spec and a gas chromatograph. Typically you'll have a chromatogram, which is at the upper left. You might choose a peak, and within that peak you might find, in essence, three other peaks: many compounds happen to have the same elution time. The only way you can distinguish things that have the same elution or retention time, or retention index, is to actually collect their spectra. So within that retention time versus m/z map, you will see spectra, and in this case the MS spectra will be electron ionization ones, so there will be lots of peaks, lots of fragments. This particular small peak had three other compounds in it.
Those three compounds had three spectra. What you have to do for metabolite identification is compare the reference spectra in your library to the spectra you've got from those three peaks. So my reference library here has seven spectra, and I can compare visually, and you can too, and see that the top observed spectrum matches the top one of our reference library, the red one matches the one in the middle, which is circled in red, and the green one matches the bottom one, the seventh one in our database. If we know the structures or the names of those compounds, then we've identified them by spectral matching. So this is spectral deconvolution for GCMS, not too much different than it is for NMR. Now, in GCMS, or EIMS, where we use electron ionization, we have multiple peaks, not a single peak; it's anywhere from three to 10 to 20 peaks that we'll see. Here is a better, more realistic and more useful example of a fragment-ion spectrum. If you count the number of peaks here, I count about 15, some small, some big. We can see the parent ion, or molecular ion. We can see that it's generally a low-resolution technique; it's unit mass. And then we'll see the fragments, or fragment ions, and these are fragments from the parent ion. In some cases we'll see peaks that are even higher than the parent ion, and those may be certain adducts. Now, in many cases it's important to remember that in GCMS spectra the compounds are derivatized. We talked about trimethylsilyl, or TMS, and compounds will be decorated with it: if there's one TMS, it'll add 72 Daltons; if there are two TMS groups, it'll add 144 Daltons; and so on. There are other derivatization agents: TBDMS is one, and it'll add obviously more mass to the molecule, and methoxyamine, or MOX, is also a derivatization agent and will also add certain masses. So you have to remember that when you're looking at GCMS, most of the compounds are derivatized molecules. The masses do not match the pure compound.
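To make that mass bookkeeping concrete, here is a small sketch using the nominal +72 Daltons per TMS group quoted above. The glycine example is just illustrative, using nominal (integer) masses rather than exact monoisotopic ones.

```python
# Sketch of derivatization mass bookkeeping for GCMS: the observed mass
# must be compared against the derivatized mass, not the parent compound.
# Uses the nominal +72 Da per TMS group quoted in the lecture.

TMS_SHIFT = 72  # nominal Daltons added per trimethylsilyl group

def derivatized_mass(parent_mass, n_tms):
    """Nominal mass of a compound carrying n_tms TMS groups."""
    return parent_mass + n_tms * TMS_SHIFT

# Glycine has a nominal parent mass of 75 Da; with two TMS groups
# it appears in the GCMS spectrum at 75 + 144 = 219 Da.
glycine_2tms = derivatized_mass(75, 2)
```

The same idea extends to TBDMS or methoxyamine derivatives with their own mass shifts; the point is simply that library matching in GCMS is done against derivatized masses.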
They have to match the derivatized compound. Now, GCMS works for molecules that have a molecular weight less than about 500 or 600 Daltons, so that's a fairly severe limitation. It works really nicely for amino acids and organic acids and some sugars and some fatty acids, but it doesn't work for really hydrophobic molecules or for really heavy molecules. We talked about how its chromatographic resolution, the plate count and reproducibility, is much better than liquid chromatography. A key advantage of EIMS over LCMS is that it's highly standardized, so EI spectra collected on one instrument compared to another instrument, even from another generation, are still very, very similar. Because of that standardization, it is possible to make use of very, very large spectral libraries for GCMS. The largest library is the NIST database, from the National Institute of Standards and Technology in the US. It's commercially available, and they have releases like NIST 11, NIST 14, and I think NIST 17 or NIST 18 is out now; the number just represents the year, so NIST 14 was released in 2014 and NIST 11 in 2011. The NIST 14 database has almost 300,000 EI spectra for almost a quarter million compounds. Now, that's a little exaggerated, because they're counting all of the derivatized compounds; many compounds will have three to four derivatives, so the actual number of unique parent compounds is maybe a quarter of what they list. But it's a lot, a pretty impressive collection. They also have ion trap and QTOF mass spec libraries, and they have retention index values for about 80,000 compounds. The software looks like this. Has anyone ever used the NIST software? One, two, three? Not too many. I think you can tell it looks ancient; it hasn't really been updated a lot. But it does allow you to do searches, and it has compound identifiers.
Then you can compare these mirror spectra between your observed spectrum and the reference spectrum, the one in red versus the one in blue, which is the display in the middle of the window here. NIST makes use of software called AMDIS, which stands for Automated Mass Spectral Deconvolution and Identification System. Just like the NMR concept, it'll help identify peaks and distinguish signal from background noise. It'll pick out peaks, similar to what's done in Bayesil and also in MagMet, and then it'll do the deconvolution, which involves matching reference spectra to observed spectra and thereby identifying compounds through the library. Now, rather than giving a zero-to-10 confidence score, it produces something called a match factor. There is a formal definition of the match factor: it measures the similarity of the mass spectrum of the query to the mass spectrum in the reference database. You could use a match factor for NMR, for GCMS, or for LCMS. It's a dot product. If you have taken vector algebra or matrices, you'll remember dot and cross products; it's essentially matching the intensities of the two spectra, peak by peak, and then normalizing by the overall intensities. The match factor is scaled by a factor of 1000, so the highest possible match factor you can get is 1000 and the lowest, obviously, is zero. So if you're going to do GCMS spectral deconvolution, there are a few things you have to do. The first thing is to run a set of standards; they range from octane up through much longer chains. These are alkane standards, and they serve as your calibration standards, not only for intensity but also for retention. That's your external calibration standard, for quantitation and retention. Then you run a blank sample.
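Before moving on, the dot-product match factor just described can be sketched as a normalized cosine similarity scaled to 1000. Note this is a simplification: the real NIST match factor also applies m/z-dependent weighting to the intensities, which is omitted here.

```python
import math

# Simplified match factor: cosine similarity between two intensity
# vectors on the same m/z bins, scaled so a perfect match = 1000.
# (The actual NIST formula additionally weights intensities by m/z.)

def match_factor(query, reference):
    num = sum(q * r for q, r in zip(query, reference))
    den = (math.sqrt(sum(q * q for q in query)) *
           math.sqrt(sum(r * r for r in reference)))
    return 1000.0 * num / den if den else 0.0

# Hypothetical aligned intensity vectors for four m/z bins.
observed = [10.0, 50.0, 100.0, 5.0]
library  = [12.0, 48.0, 100.0, 4.0]
mf = match_factor(observed, library)  # close to, but below, 1000
```

Because the score is a normalized dot product, it only measures the pattern of relative intensities; the overall scale of either spectrum cancels out, which is why spectra from different instruments can still be compared.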
Anyone who has ever worked with a mass spec knows that you can inject essentially nothing and end up with lots of peaks. The blank sample allows you to sort out what those strange peaks are, often solvent or derivatization agents. Then you run the sample of interest under the same conditions, the same solution and temperature gradient, as the blank. Your external standards, which you run first, should be well separated, and if you know their concentrations, you can use their intensities to help calibrate concentration. That calibration file, the CAL file, will be used to calculate your retention indices but also to help calibrate the intensities. Once you have those retention indices from your alkane standards, they help narrow down your search with AMDIS. AMDIS will search through the NIST database, the two are coupled, and match and display the best matches. Through that matching, some of it automatic, some of it manual, it will also allow you to get rid of the false positives that are in the blank. So here is the AMDIS tool that allows you to create this calibration file. We're not going to run AMDIS; I'm just showing some screenshots. You can see the alkane standards as they're marked in there; it then reads them in and calculates the retention indices for the actual spectrum. So you're calibrating with those alkane standards, which then readjusts your spectrum, which, as you can see in white, has a whole bunch of small and large peaks scattered all through it. And so now the retention indices are properly calibrated. Once you've got things calibrated, you can look at individual peaks on the chromatogram. Now remember, a peak is not necessarily a pure compound; it is going to be a retention time and then a bunch of m/z values.
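The retention-index calculation against the alkane ladder is, in essence, linear interpolation between the two bracketing alkanes (this is the linear, temperature-programmed form of the Kovats retention index). A sketch with made-up retention times:

```python
# Linear (temperature-programmed) retention index: interpolate the
# analyte's retention time between the two bracketing n-alkanes.
# RI = 100 * (n + (t - t_n) / (t_(n+1) - t_n)), where n is the carbon
# number of the alkane eluting just before the analyte.

def retention_index(t_analyte, alkanes):
    """alkanes: dict mapping alkane carbon number -> retention time (min)."""
    carbons = sorted(alkanes)
    for lo, hi in zip(carbons, carbons[1:]):
        t_lo, t_hi = alkanes[lo], alkanes[hi]
        if t_lo <= t_analyte <= t_hi:
            return 100.0 * (lo + (t_analyte - t_lo) / (t_hi - t_lo))
    raise ValueError("analyte elutes outside the alkane ladder")

# Hypothetical ladder: C10 at 8.0 min, C11 at 10.0 min, C12 at 12.0 min.
ladder = {10: 8.0, 11: 10.0, 12: 12.0}
ri = retention_index(9.0, ladder)  # halfway between C10 and C11 -> 1050.0
```

Because the index is defined relative to the co-injected alkanes rather than to absolute time, it transfers between runs and instruments, which is exactly why AMDIS can use it to narrow down library searches.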
In this particular case we've chosen this peak, we've clicked on it, and then we can see a white, red, blue, and yellow set of peaks that are all in there. We'll zoom in a little more so you can see it. I've clicked on the peak here, and I can see a yellow, red, blue, and white trace. The white is the overall sum, and the blue, red, and yellow are the individual peaks. Clicking on those peaks, one has a mass of 73 Daltons, another has a mass of 144, and it turns out they both belong to the same spectrum, at roughly the same intensity. Then I can compare it, or AMDIS will compare it, to its library, and it finds an almost exact match, an 840 match factor (84%), to valine. So that peak marked with the red box, based on its match factor and the comparison to the deconvolved spectrum of pure valine, which you can see at the bottom, matches pretty well, and you can say with high confidence that that peak is valine and nothing else. The yellow one may not be valine; it might be some other small, low-abundance compound, and you may have to do a bit more deconvolution. So it's an interactive process: you're working with visual cues, you're clicking on things, you're assessing the value of each match factor. It takes time, and different people will get different results, just like with Chenomx as a manual approach. So there are also automatic approaches, and you guys will try one today called GC-AutoFit; I'll talk about that later. There are also other, more manual tools produced by other companies, AnalyzerPro, ChromaTOF, and AMDIS; these were compared about 10 years ago, and things haven't changed a whole lot. There are obviously alternatives to the NIST database too: NIST 08, NIST 11, NIST 14, and NIST 17 or 18 now; there's the Golm database; and Oliver Fiehn's group has developed a library called FiehnLib. Some of those are commercial.
GC-AutoFit is one that we developed, again for this course and also for our own work. The intention was to make it freely available over the web. So it's a web-based tool; it hasn't been published yet, but it's been kicking around for a couple of years. It follows the same protocol: you need three sets of spectra, one of the sample, one of a blank, and one of the alkane standards, your calibration standards. It does auto alignment, that's the calibration; it does peak identification, just like AMDIS; it does peak integration, just like AMDIS; but it also does concentration determination and compound identification. It'll take a bunch of different file types, NetCDF or mzXML, and I think we'll have to adapt it to mzML. It's faster than the NMR tool, and it can actually do more compounds. It's able to analyze urine and saliva; it works okay for blood, but it's particularly good for urine. Like the NMR method, you have to follow a protocol: if you don't follow the protocol it won't fit, and if you give it the wrong biofluid, say you tell it urine when in fact it's blood, it'll also do a poor job. So, as with anything, follow the rules. These are some files that are intended to help you with running the software when you do the lab later today. The alkane standards are marked with special names, the blank samples are marked with obvious names, and the sample files are what you will be provided with. If you have to convert, you can convert NetCDF to mzXML, and there are various tools, like ChemStation and ProteoWizard, that are freely available. So it's pretty simple to do. You have three files to upload, so you browse to them: the alkane standard, the blank, and the samples. You will have those files to download, and you'll be able to do this. If they're all zipped into a single file, you can also upload that single zipped file, which saves time as well.
Generally, people running GCMS will have their own library, and there are preferred libraries for different types of biofluids: some for serum, some for urine, some for saliva. In this case we're running urine, so we're choosing a particular urine library, telling it the type of biofluid it is, and whether there's a specific calibration. This is the calibration process: it takes the alkane standards and adjusts the retention times so everything is properly scaled and aligned. The blank spectrum is then applied so it can get rid of noise and junk peaks; those are removed as well. So now you've reduced your initial spectrum, which had rather odd retention times and noisy peaks, to something that is interpretable, and at that stage the deconvolution starts. It uses essentially the same viewing tool that is used in Bayesil: you see a little spectrum in the upper left corner, which is the full spectrum, and then you'll see peaks which you can click on to identify which compound is which. There's a comma-separated-value file which indicates the compound name, its retention time, intensity, and calculated concentration. As I said, there's the table view, which is what most people are interested in, the concentrations, and the spectrum view, which gives you some reassurance that the fitting has worked. Sometimes it may go completely sideways, but that's quite rare. The table values allow you to read off your concentrations. So just like Bayesil, you can upload the spectrum, press go, wait a few minutes, and you've got your compounds identified and their concentrations rolling out. Now, at this point it's probably useful to remind people about the levels of identification that are used in metabolomics, particularly as it relates to mass spectrometry. The Metabolomics Standards Initiative, from about 10 years ago, identified four levels of metabolite identification. The highest level is positively identified compounds.
Those have to be confirmed and matched to a known standard. So it means physically having the known standard in your freezer, and most of us never do that. Putatively identified compounds correspond to matching the mass and the retention time. That's the AMDIS approach, that's the GC-AutoFit approach — matching not just the mass spectrum, but also the retention time, or the tandem mass spectrum or the EI-MS spectrum plus retention time. The third category is compounds that are putatively identified by compound class. This is where most people actually are: they identify compounds just using a parent ion mass. So I've measured something at 172.1638, I look it up on PubChem, I find the first thing that hits my eyes, and I say that's my compound. That's how most people still do metabolomics, unfortunately. Other people may only be able to say, I can't really say what it is, I can just give you the mass and therefore I can calculate a molecular formula. So that's also considered a putative identification. And then there's level four, the totally unknown compounds, which in the case of LC-MS account for about 98% of the peaks. So again, to reiterate, for most of us the best we can do is this level two. Many people are still, unfortunately, doing level three. Almost no one is able to achieve level one. On the other hand, NMR is inherently level one, because you're inherently matching dozens of peaks — not only their positions but also their intensities — and so it can't be anything else. So NMR inherently gives you a higher level of metabolite identification. Any questions about that? All right. Now LC-MS. You'll notice the figure looks almost identical to the figure for GC-MS, and in many respects it is. We work with liquid chromatographic outputs; most peaks from an LC run will have multiple compounds buried under them, and we can collect the ESI-MS or MS/MS spectra for those. The combination of the ESI-MS and MS/MS spectra can then be matched against a library.
And that's how we ultimately identify the compounds. There are a lot of tools around to facilitate that — commercial ones from Agilent, Bruker, Thermo, Waters, SCIEX; all of these companies produce commercial tools. And then there's a variety of free options — XCMS, MZmine, and several others — all of which can help with compound identification. We're going to focus on XCMS, and specifically XCMS Online, partly because it's free. It's also one of the original tools for spectral processing. It does deconvolution, similar to AMDIS, similar to Bayesil, similar to GC-AutoFit. It does peak picking and peak matching, but it also does an extra thing, which is retention time alignment. That was one of the things I mentioned last time: you've got this collection of many spectra, and you have to try to correct for the fact that sometimes the column runs fast and sometimes it runs slow. You can get XCMS as a program, or you can access it through the web. It accepts lots of different data formats, and it's linked to a database called METLIN, which allows you to compare against reference spectra and match the compounds. So this is the general workflow for XCMS. It's an untargeted technique. The idea is to upload lots of LC-MS data — 10, 20, 100, or 1000 spectra. You get the extracted ion chromatograms, and from there you perform an alignment — this is the nonlinear alignment method. Once those are aligned, you can perform the peak picking, and then the mass measurements, spectral matching, and parent ion masses are ultimately used to identify the compounds. Peak identification in LC-MS is difficult — much harder than in GC-MS or NMR — and there are a variety of tools that have been developed. XCMS doesn't have the best peak picker; there are others that seem to do better. But this is an example of how the peak picking is done. Next is peak alignment and retention time correction.
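To give a feel for what retention time alignment does: XCMS itself fits a nonlinear warping, but the essential idea — map each run's time axis onto a common reference using landmark peaks seen in every run — can be sketched with a simple piecewise-linear correction. This is an illustrative simplification, not XCMS's actual algorithm, and all names here are hypothetical:

```python
from bisect import bisect_right

def align_rt(sample_rts, anchors_sample, anchors_ref):
    """Warp sample retention times onto a reference time scale by
    piecewise-linear interpolation between shared landmark peaks."""
    def warp(t):
        i = bisect_right(anchors_sample, t) - 1
        i = max(0, min(i, len(anchors_sample) - 2))   # clamp to a segment
        x0, x1 = anchors_sample[i], anchors_sample[i + 1]
        y0, y1 = anchors_ref[i], anchors_ref[i + 1]
        return y0 + (y1 - y0) * (t - x0) / (x1 - x0)
    return [warp(t) for t in sample_rts]

# in this toy example the column ran ~0.2 min slow mid-run, so the
# landmark at 5.2 min in the sample corresponds to 5.0 min in the reference
corrected = align_rt([1.0, 5.2, 9.0], [1.0, 5.2, 9.1], [1.0, 5.0, 9.0])
```

After warping, peaks from different runs that belong to the same compound land at (nearly) the same retention time and can be matched across samples.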
This is an example where you can see, in the top panel, how everything's slightly off, and then after alignment everything is nicely matched. Beyond these alignments, there's obviously also scaling that can or should be done. All of these steps are critical, as are peak identification and distinguishing peaks from noise. This was published about 10 years ago, and at the time XCMS was doing extremely well. As I said, there are other tools now that seem to outperform XCMS, and some of the older ones were also later modified. So it's a busy field in terms of the number of tools and the performance of the different algorithms. But in any case, we're just going to focus on XCMS today, because most of the other ones are not available over the web — most are either commercial, or they require lots and lots of installation challenges. How many people have used XCMS? Two, three, four, five, six, seven. How many people have used XCMS Online? Two, three, four, five, okay. So that's what we're going to use today. This requires a user account, and I don't know if people actually have one, or if we told them to make user accounts. We did not. Okay, lunch break: instead of eating lunch, you're going to be making user accounts and registering. Potentially, you could do this now if you want. Once you create a user account, there are several steps, and we're going to go through this in a bit more detail — this is just a preamble so that you're not too shocked when you actually have to do it. You're going to be choosing among four types of jobs: single, pairwise, multi-group and meta-XCMS. Then you're going to define your upload, define certain parameters, and then submit your job. Again, we're just stepping through this fairly quickly. So once you've chosen your job, then you upload your data; there are more slides that will walk you through this.
But again, it's fairly standard with web tools — click, press, click, press — and as long as you know where your files are, it's relatively painless. Once you've uploaded the data, you can submit the job, going step one, step two, step three. Once that's done, you're going to have to wait for notification. It could be 10, 15, 20 minutes for some of these things to be processed, and we will probably hit this server pretty hard, so it could take a while for some results to come back. What we will do is break you guys up into different groups so that we don't demolish the system. There are actually three sets of two tables each: the first set of tables will be for NMR, another will be assigned to GC-MS, another will be assigned to XCMS, and then you guys will flip and switch. Once you've received the email notification from your account, you can view the results and download them as needed. There are graphical summaries, which are quite impressive. There are also tables — comma-separated or Excel-like tables — which have very similar types of information to what you would have seen with GC-AutoFit: identifiers (not necessarily names of the compounds), and information on retention times, average intensity, individual peak intensities, mass, parent ion mass, and so on. Now, as I've highlighted before, when you're looking at LC-MS data there are lots of peaks — tens of thousands of peaks — and a lot of them are not necessarily real. The peak intensities that are recorded are only relative, so with XCMS you do not get concentration data, unlike with, say, GC-AutoFit or with Bayesil or Chenomx. Peaks also aren't identified with compound names, so in order to annotate the compounds you have to go to separate tools, say METLIN or the HMDB, and use either the MS or the tandem MS data. Now, we've talked about how NMR is good for hydrophilic compounds and how GC-MS is good for acids and, in some cases, amino acids.
LC-MS is best for lipids, for fatty acids and for generally hydrophobic molecules. You can get amino acids, you can get some hydrophilic compounds, but not as many. To really do proper identification with LC-MS-based metabolomics, you need both the MS data — the parent ion mass — and the tandem MS data, along with retention data. Ideally you'd also like to have some internal standards to validate that. If you have very, very high accuracy Orbitrap or FT-MS data, it can help narrow down what your compounds are, but it doesn't necessarily tell you what they are unless you've either got the authentic compound or a lot of MS/MS or MSn spectra from reference libraries. So, as I said, for many years — and unfortunately still today — a lot of people identify compounds purely by mass matching. They've found a thing at 172.1634; they'll upload it into ChEBI, they'll upload it into PubChem, they'll upload it into ChemSpider, they'll upload the mass into the HMDB, and simply ask, do I get a hit? You could do this at PubChem, which has 100-and-some million compounds. Here we are uploading a particular mass or mass range — anything between 89 and 89.1 Daltons is what I've put in — and what do I find? This particular list gives me 400-and-some matches. So I could choose the first one, or the fourth one, or the 24th one, and say that's my match, but that's not good science. I could use ChEBI and say, okay, this is more biological stuff — 99% of the compounds in PubChem are actually not found in living systems, whereas with ChEBI they are biologically relevant. ChEBI also has a mass search and you can give it a specific range, as does the HMDB, as does ChemSpider. But what is fundamentally important to remember is that if you're doing biology, look through biological databases. If you know what organism you're studying, look at the database specific to that organism. Trying to do a search and hit a match through PubChem is about 99% guaranteed to give you a false positive.
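The mass-matching search being criticized here is simple to express: take a measured mass, open a ppm tolerance window around it, and return everything in the database that falls inside. A minimal sketch, with a tiny hand-made stand-in for a real database like the HMDB or ChEBI (all names and the toy entries are illustrative):

```python
def ppm_window(mass, ppm):
    """Return the (low, high) mass window for a given ppm tolerance."""
    tol = mass * ppm / 1e6
    return mass - tol, mass + tol

def search_db(query_mass, ppm, db):
    """Naive mass-only lookup: db is a list of (name, monoisotopic_mass)."""
    lo, hi = ppm_window(query_mass, ppm)
    return [name for name, m in db if lo <= m <= hi]

# toy database with monoisotopic neutral masses
toy_db = [("alanine", 89.04768),       # C3H7NO2
          ("lactic acid", 90.03169),   # C3H6O3
          ("glycine", 75.03203)]       # C2H5NO2
hits = search_db(89.0477, 10, toy_db)  # 10 ppm window
```

With a three-compound database this looks harmless; against PubChem's hundred-million compounds the same window returns hundreds of candidates, which is exactly the false-positive problem the lecture is warning about.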
Now there are more advanced searches where you can search not just with a mass or mass range, but actually do spectral matches or account for adducts and adduct variants. We'll go through some of them. We've mentioned NIST, which has lots of spectral databases for QTOF and ion trap instruments; METLIN has a variety of spectra for QTOF and ion trap; there's also MassBank. And then we'll talk about another tool called CFM-ID. So it's not just simple mass searches, but parent ion plus peak list searches, as well as spectral matching. This is more proper, more correct, although there are caveats. I'm going to talk a little bit about a tool that was developed recently called CFM-ID. Has anyone heard of this? Okay, no one except Karen. This is a server that allows you to do compound identification from tandem mass spectra, or MS/MS data. It also predicts MS/MS spectra for known compounds, and it uses machine learning techniques. It's been adapted to GC-MS data as well. Currently it's the only tool out there that allows you to take a compound, draw it, and predict its EI-MS or MS/MS spectrum. It also has a large library of predicted spectra as well as known spectra, so people can upload a spectrum and ask what it matches. It can also annotate spectra with the individual fragment ions and tell you what those fragment ions are. So it has three options: spectral prediction, peak assignment, or compound identification. For the compound identification option, you upload the MS/MS spectrum of your compound of interest, pick and select things as you normally do, and indicate the collision energy. If you have spectra collected at multiple collision energies, that's usually a little more informative. And then it'll show its predicted spectrum. This is a little older view, but it now has the mirror view, with the observed spectrum on top and the predicted spectrum on the bottom. And then it gives you the list of compounds and their match factors based on that. Yes?
Yes, you can. It's specific to individual molecules rather than classes, but you will find common fragments from many things — you'll see benzene rings, and it predicts, I guess, McLafferty rearrangements and things like that. So it's used a lot by people now to do unknown identification. In fact, it's been used extensively in the HMDB to predict spectra for all of the compounds in the HMDB. And most mass spec companies are using the code now to generate MS/MS spectral predictions for their different instruments, from Orbitrap to ion trap to FT-ICR, depending on the collision energies they use. It can, yeah. It's not perfect — no one yet has something that predicts perfectly — but it's very accurate for some compounds, and there's a new version coming out which is very, very accurate for lipids. It usually gives you a hint of what's going on. Natural product chemists have used it a fair bit to do compound ID. Okay, any other questions? Hey, Francis, hi. [Question:] Some of these compounds — I saw some reference to KEGG and so forth — do they all have, like, taxonomic origins? Say you're looking at a human sample. Yeah, so there are organism-specific databases, and within the Human Metabolome Database they also indicate where a compound is from. In some cases — alanine, for example — a compound can be both endogenous and also produced by microbes, and you can't know how much or what portion is which. But there are others that are exclusively produced by microbes. Hippuric acid is exclusively produced by microbes, although it comes from polyphenols that you eat in your food — so it's sort of a mix of a food and a microbial metabolite. That information is encoded in a number of databases. The Human Metabolome Database has also been updated recently with a chemical ontology, similar to the Gene Ontology, and it provides information about provenance, health applications, and industrial or functional roles.
So all the compounds are being annotated with that chemical ontology now. But yes, the origins and pathways are still sometimes challenging, because in fact compounds are sometimes pooled from many sources. For those of you who haven't met Francis — Francis Ouellette — he's one of the founders of the CBW and in charge of the CBW program. David and I are the only two remaining founders. Yeah, all the others died of old age. No, it's true, I guess we've been around for a long time — the sign there, we think, is 20 years old and starting to show it. Anyways, I'll carry on, because I guess we probably have some hungry souls here. So when we're doing metabolite annotation by mass spec only, we can and will see things like salt adducts and neutral loss species and multiply charged species. These are extra peaks that show up in the mass spectrum, and these extra peaks are a lot of what we see. In some cases we mistakenly think they are real features; technically, they are noise. We often have to do a lot of work to try to distinguish those adducts and multiply charged species from the parent ions, or to merge them into single peaks. So what is an adduct? This is an example of a spectrum of a compound, or several compounds, where we're seeing sodium being attached. This particular lipid, I think, has a parent ion with a molecular weight of 951 Daltons; add sodium to it and you increase its molecular weight by about 22 Daltons — we've replaced a hydrogen (mass 1) with a sodium (mass 23), and 23 minus 1 is 22. And you can see these species will fly very nicely in a mass spectrometer. So it's easy to mistake a sodium adduct or potassium adduct or calcium adduct for the parent ion, and that makes things a little more confounding. In many cases you'll see multiple adducts, and so again you might think these are all different compounds. There's a variety of software tools that help sort this out, but it's not trivial.
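The adduct arithmetic described here — sodium replacing a proton shifts the peak by about 22 Da — is just table lookup and addition. A minimal sketch, assuming a small hand-entered table of common singly-charged adduct shifts (values are the standard monoisotopic shifts; the function name is hypothetical):

```python
# m/z shift relative to the neutral mass M, for singly charged ions
ADDUCT_SHIFTS = {
    "[M+H]+": 1.007276,     # add a proton
    "[M+Na]+": 22.989218,   # add Na+ (Na mass minus one electron)
    "[M+K]+": 38.963158,    # add K+
    "[M-H]-": -1.007276,    # lose a proton (negative mode)
}

def adduct_mz(neutral_mass, adduct):
    """Predicted m/z of a given adduct of a neutral compound."""
    return neutral_mass + ADDUCT_SHIFTS[adduct]

# the sodium and proton adducts of the same lipid differ by ~21.98 Da,
# which is the "replace H with Na" shift the lecture describes
delta = adduct_mz(951.0, "[M+Na]+") - adduct_mz(951.0, "[M+H]+")
```

Spotting pairs of co-eluting peaks separated by exactly these characteristic shifts is how deconvolution software recognizes that two "features" are really one compound.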
This deconvolution process typically tries to convert all of these adducts back to a single parent ion, to make things much simpler. In terms of common adducts, we can see a long list — a variety of them are provided. Some represent cations, some represent anions, some represent combinations of additions and subtractions. There are tables; Oliver Fiehn has produced a very popular table showing another long list of adducts and how much mass will be added or, in some cases, subtracted. There are also a variety of neutral loss fragments that will occur. These are essentially the removal of moieties from the molecule, leaving in some cases just a neutral piece, which is not detectable, along with the fragment, which is ionized and detectable. So these issues are not handled by simple mass searches; they have to be handled by specialized databases, and there's a variety of those available. METLIN, the HMDB and MZedDB are able to handle adducts; some are able to predict ion pairs and multiply charged species; some can also predict neutral loss species. When you're working with LC-MS data, particularly with the more sophisticated software which you guys will deal with — XCMS or MZmine or other tools — you will try to consolidate those adducts, and you'll try to consolidate the multiply charged species. You'll try to consolidate some of the fragments — the neutral loss fragments, the in-source fragments, breakdown products and rearrangements. You'll also try to consolidate the isotope peaks, all of those cascading smaller peaks, into a single peak. And you'll try to remove the blank or noise peaks, which we all know show up in all mass spec runs. This can be done through things like comparing between technical replicates, or doing a dilution series to see which peaks appear and disappear.
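The adduct-consolidation step can be sketched very simply: look for pairs of features that co-elute (same retention time) and whose m/z difference matches a known adduct spacing, and merge the adduct into its parent. Real tools handle many more adduct types, charge states and isotopes; this is only an illustrative toy, with all names and the example features hypothetical:

```python
def consolidate_adducts(features, rt_tol=0.05, mz_tol=0.005):
    """Merge co-eluting features whose m/z differences match known adduct
    spacings (e.g. [M+Na]+ sits 21.9819 Da above [M+H]+)."""
    NA_MINUS_H = 21.981942   # Na+ replacing H+
    K_MINUS_H = 37.955882    # K+ replacing H+
    feats = sorted(features, key=lambda f: f["mz"])
    kept, merged = [], set()
    for i, f in enumerate(feats):
        if i in merged:
            continue
        for j in range(i + 1, len(feats)):
            g = feats[j]
            if abs(g["rt"] - f["rt"]) > rt_tol:
                continue                          # not co-eluting
            d = g["mz"] - f["mz"]
            if abs(d - NA_MINUS_H) < mz_tol or abs(d - K_MINUS_H) < mz_tol:
                merged.add(j)                     # g is an adduct of f
        kept.append(f)
    return kept

feats = [{"mz": 180.0634, "rt": 3.10},   # [M+H]+ of a hypothetical compound
         {"mz": 202.0453, "rt": 3.11},   # its co-eluting [M+Na]+ partner
         {"mz": 250.5000, "rt": 7.90}]   # unrelated feature
cleaned = consolidate_adducts(feats)     # the sodium adduct gets merged away
```

Applying this kind of bookkeeping across adducts, charge states, isotopes and noise is what shrinks a raw feature list by the factor of six described next.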
So typically, if you start off with a spectacular-looking LC-MS chromatogram that gives you 15,000 features in positive mode and 10,000 features in negative mode, here's what happens when you start doing these cleanups: removing and consolidating adducts takes you from 15,000 to 12,000; consolidating the multiply charged species takes you from 15,000 down to 10,000; removing the neutral losses and the isotope peaks takes you from 15,000 down to 3,000; removing noise takes you from 15,000 down to 2,500. So the net result is that you reduce things by about a factor of six, in terms of what are real peaks. And typically negative ion mode is less sensitive, so usually there are not as many. So you can go from 25,000 or 27,000 features down to about 4,000 peaks that are confidently confirmed. But you still see people publishing and saying, yeah, I have 27,000 peaks — and when you see numbers like that, you immediately know they haven't done a proper cleanup. Once you have done the cleanup, then you can start trying to identify what those compounds are. And as I said, we can still use very, very high resolution mass spec, not necessarily to identify the compound, but to identify a class or at least a molecular formula. So there are tools called molecular formula generators, which will take a high-accuracy mass value — three or four decimal places — and generate a formula. MZedDB is one example; there are others out there that use the Seven Golden Rules, and the commercial packages offer this as well. And you can restrict things and say, well, I know that my compounds don't have fluorine in them, they don't have boron, so I can narrow it down to carbon, nitrogen, hydrogen, oxygen, maybe sulfur. With that information, it'll reduce the search space and narrow down what the molecular formula or formulas are. And then you can go and search against databases, not purely by mass, but by molecular formula.
And that gives you, I guess, a slightly better search option — but still not great. So I'm simply saying it's possible, but I strongly discourage it. Now you can go further: rather than just saying it's got to be carbon, hydrogen, nitrogen, if you make use of other bonding restrictions — the atomic composition, but also the isotopic abundances — you can do even better. This is something that grew out of work from Oliver Fiehn, the so-called Seven Golden Rules, which have been around for a while. These rules use the information in the spectrum, the high-accuracy masses, and information about what's allowed in chemistry. You can't have a compound with, say, one carbon and 201 hydrogens — it just can't physically bond. So those rules allow you to narrow down what is a reasonable formula and what's a feasible structure. These formula filters are available in commercial packages too — Bruker provides that sort of formula filter, and I'm sure Thermo and others do as well. So this gives a sense of the scale of what's possible: if you limit things to all compounds of less than 2,000 Daltons containing carbon, hydrogen, nitrogen, sulfur, oxygen and phosphorus, that's eight billion elemental compositions. If you apply the Seven Golden Rules, that shrinks it down by a factor of 12 or 13. If you then look at the isomers and formulas actually in PubChem, that shrinks it down to about 700,000, and if you restrict it further to the ones that are known, it's even smaller. As you increase the molecular weight, the number of possible compounds increases, so if you're looking at large molecules, it's technically more difficult to identify them purely by formula or by mass matching; if you're looking at small molecules, it is intrinsically easier. There's also some data showing how mass accuracy improves the performance for identifying, or zeroing in on, what the compounds are.
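A molecular formula generator of the kind described here can be sketched as a brute-force search over CHNOPS element counts, keeping compositions within the mass tolerance and rejecting chemically impossible hydrogen counts. This is a toy version — the real Seven Golden Rules apply several more filters (element ratios, isotope patterns) — and the names and count limits are illustrative:

```python
from itertools import product

# monoisotopic masses, Da
MASS = {"C": 12.0, "H": 1.007825, "N": 14.003074,
        "O": 15.994915, "S": 31.972071, "P": 30.973762}

def candidate_formulas(target, ppm=5, max_counts=(12, 24, 4, 8, 2, 2)):
    """Brute-force CHNOSP formula generator with a crude valence filter:
    rings-plus-double-bond equivalents must be a non-negative integer."""
    tol = target * ppm / 1e6
    elems = "CHNOSP"
    hits = []
    for counts in product(*(range(n + 1) for n in max_counts)):
        m = sum(c * MASS[e] for c, e in zip(counts, elems))
        if abs(m - target) > tol:
            continue
        c, h, n = counts[0], counts[1], counts[2]
        rdbe = c - h / 2 + n / 2 + 1          # ring/double-bond equivalents
        if rdbe >= 0 and rdbe == int(rdbe):   # reject impossible H counts
            hits.append(dict(zip(elems, counts)))
    return hits

# alanine's monoisotopic mass: C3H7NO2 should appear among the candidates
hits = candidate_formulas(89.04768)
```

Even with this crude filter, the candidate list at 5 ppm is short for a small molecule — which is the lecture's point that formula generation works much better at low molecular weight and high mass accuracy.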
So either you have an incredibly accurate mass spectrometer, at 1 or 0.1 ppm, or you use the isotopic abundances — then you can shrink things down quite considerably. If you have a low resolution mass spectrometer, you can see how the number of possible matches increases quite significantly. You can also see in this table that as the molecular weight increases, the possibilities vastly increase. Now, all of these are essentially highlighting the fact that it is dangerous — extremely dangerous — to identify compounds on the basis of mass alone. You can see the number of possible matches or possible molecules, and even if you use mass plus isotopic abundance, you still have lots of possibilities, and you still haven't gotten much beyond level 3 of the MSI standards — not even close to level 2. This is an example of a compound that was actually identified, ultimately, by using isotopic abundance and parent ion mass, but also lots of NMR and lots of MS/MS. Again, with high resolution mass spec you can see the isotopic abundances, which narrow down what the formula can and should be. And then by matching against what the known compounds for this particular plant could and should be, they were able to get a pretty good idea of what it was, which was ultimately confirmed. So this is an identification of a "known unknown" from a tomato, looking at the actual mass matches. Now, as I said, if you use databases improperly, you're going to end up with lots of mistakes. Many databases — especially PubChem, but also METLIN, also NIST — mix non-metabolites with metabolites. Others mix plant metabolites with animal metabolites, or drugs with buffers. So if you're looking at a pure system — I don't know, looking at pine needles — you should not see antidepressant drugs in it. Pines do not take drugs.
But if people are not smart about it, they will simply look for the best mass match and say, well, this is the first hit I found. There are lots of examples where people have found truly ridiculous matches, especially in mouse and rodent studies — again suggesting that the animals are all on antidepressants or something like that. The other thing people seem to forget or neglect is that if you know your organism, there are plenty of databases that are now organism-specific. The Human Metabolome Database is specific to humans. If you're analyzing drugs, you might as well look at a drug database. If you're looking at the E. coli metabolome, look at the E. coli Metabolome Database. If you're looking at yeast, look at the Yeast Metabolome Database. If you're looking at Arabidopsis, look at an Arabidopsis database or the KNApSAcK database. If you're studying foods and food products, look at FooDB. There's no point and no reason to start searching, say, PubChem when you know precisely what your organism or system is. Now, as I mentioned before, non-targeted or untargeted mass spec is not able to do quantification. The vast majority — 90% — of published MS studies are not quantified or quantifiable. In order to do quantification, you have to spend money: you actually have to buy isotopically labeled standards and spike them in, and in many cases people even have to synthesize them. Those isotope standards have to be identical or very similar to the compounds being measured. You use a technique called selected reaction monitoring or multiple reaction monitoring — SRM or MRM — to ensure not only that the compounds are identified but also that the quantification is good. This is standardly used in clinical triple-quadrupole mass spectrometry or clinical ion trap mass spectrometry. And this is an example of a compound in deuterated and non-deuterated forms, where the deuterated version is spiked in.
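The arithmetic behind quantifying against a spiked, isotopically labeled standard is simple ratio math: because the labeled analog ionizes essentially identically to the analyte, the analyte concentration is the peak-area ratio scaled by the known spike concentration. A minimal sketch (function and parameter names are hypothetical, and the optional response factor is an assumption for cases where the standard is only a close analog):

```python
def quantify(analyte_area, istd_area, istd_conc_uM, response_factor=1.0):
    """Isotope-dilution quantification: analyte concentration equals the
    analyte/internal-standard peak-area ratio times the spiked standard
    concentration. response_factor corrects for any small ionization
    difference (~1.0 for a co-eluting deuterated analog)."""
    return (analyte_area / istd_area) * istd_conc_uM * response_factor

# analyte peak area is twice the labeled standard's, with a 5 uM spike
conc = quantify(analyte_area=2.0e6, istd_area=1.0e6, istd_conc_uM=5.0)
```

Since both species experience the same extraction losses and matrix effects, the ratio cancels them out — which is why this is the only route to true quantification in MS-based metabolomics.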
The multiple reaction monitoring allows you to look specifically for the characteristic fragments of that molecule. Those fragments are known, and they are ensured to be unique — they don't overlap with anything — or sometimes it's a combination of fragments, to be certain that this is the only molecule that can be there. And then it's quantified by comparison against the isotopically labeled fragments. So there are a couple of quantification kits that allow you to do quantitative metabolomics by mass spec. A company called Biocrates has produced the p150, p180 and p400 kits; the p180 is quite popular. Has anyone ever used a Biocrates kit for mass spec? No one. Anyways, these are a great way for people to do quantitative mass spec, and they are quite popular in core labs in the US, Canada, Europe and Japan. They're not heavily used by smaller independent labs, which is a shame, because they are intended to make metabolomics very simple. So these are some examples. Typically with mass spec you can go down to 10 nanomolar levels, and up to high concentrations of about 10 millimolar. The lowest limit is about a thousand times better than what you can do by NMR, and the highest limit is about 10 times worse than what you can do by NMR. Anyways, it's still very impressive, still very reproducible, and a very simple way to get quantitative mass spec data. So we're going to wrap up now. If you want to start getting ready for lab number two, there are some data files that you can start downloading from the website or the wiki. You guys also have to try to get an account for XCMS Online. The downloading may take a little while because I think we've got a slow connection, and perhaps Anne might have some comments or cautions for us.