We're going to be talking about aspects of metabolite identification and quantification — this idea of going from spectra to lists. Again, we're going to talk about spectral deconvolution. We're going to look at NMR, GC-MS, and LC-MS, and this is a lead-up to the lab you're doing this afternoon. We'll also talk about databases and some of the techniques that are available now.

So what we're trying to do in metabolomics, at least initially, is something called metabolite annotation. It's not unlike gene annotation or protein annotation: the goal is to go from the raw data on your left to something annotated with labels, or a list that includes names and concentrations. Now, in genome annotation and protein annotation, the tools have been around for a long, long time. In genomics we have BLAST, we have GenBank and NCBI. You can take your raw DNA sequence, your RNA-seq or transcriptomic data, and very quickly, based on the sequence, identify genes; there are tools for measuring and calculating transcript abundance and so on. In proteomics, you have tools like Mascot, where you can upload your mass spectra and get your protein IDs, and other tools that can calculate relative concentration data based on peak abundance. So genomics and proteomics have tools like BLAST and Mascot and databases like GenBank and the proteomics databases. But in metabolomics, for the longest time, if you took an HPLC, GC-MS, LC-MS, or NMR spectrum and uploaded it, you couldn't do anything. You couldn't get your metabolite IDs and concentrations. There hasn't been a BLAST for metabolomics; there hasn't been a Mascot for metabolomics. And that has been, and arguably still continues to be, an issue for the field.

In metabolomics, we use the terms "known unknowns" and "unknown unknowns". Credit for this goes to a quote from Donald Rumsfeld, I think in 2001 — the quote is given here. He was talking about what they didn't know about Al-Qaeda in Afghanistan, and he kind of dug himself a deep hole talking about known knowns, known unknowns, and unknown unknowns. But the distinction is useful, in the sense that in metabolomics we're often dealing with a set of peaks, and in many cases those peaks have already been characterized. People have the structures; they know the mass; there is a reference. It's just that we haven't found the reference spectrum, or we haven't got the compound in our library. In those cases, if you built a sufficiently large, sufficiently comprehensive library, you could actually figure those things out. The unknown unknowns are the things that have never been characterized. They are not in the literature; no structure has ever been drawn on a blackboard or in a book that describes the molecule. Those are the truly novel ones, and they are what metabolomics people dream of finding. The problem is that there's probably only a handful of people in the world who have truly identified a completely novel compound, at least in the realm of metabolomics. There are natural product chemists who can usually put up a list on their wall of perhaps a dozen that they've identified in their lifetime. So truly novel compound identification is extremely difficult and time-consuming.
And what we now know — the few hundred thousand natural products listed in the Dictionary of Natural Products — represents the cumulative effort of tens of thousands of scientists over a hundred years. If you think about the number of scientists and the number of years, it translates to a few compounds per scientist over a lifetime. So it's hard. Ultimately, what we're trying to do in metabolomics is mostly identify the known unknowns, and that's what I'm going to talk about here; we'll also talk a little about the unknown unknowns.

The best way to deal with the known unknowns is a technique called spectral deconvolution, which is essentially to match the peaks, or known sets of peaks, against a database. You can do spectral deconvolution for NMR, GC-MS, LC-MS, MS/MS — you can even do it for FTIR, all of the techniques. When you do spectral deconvolution, to some extent it becomes a targeted technique, because you're looking for things that are known or are within your library. And if you do the deconvolution correctly, you can not only identify, you can also quantify. So that's an advantage.

We're going to talk about these techniques first for NMR. This is one of the original metabolomics techniques, so the history is longer, and this is how they conceived it. If you look at the blue spectrum at the top, this is actually a mixture of compounds, so it produces a bunch of peaks. In this case they're not overlapping; they're all separable. Now, if you didn't know any better, you might think that spectrum is one compound. But if someone tells you it's a mixture, then it's possible to look through a library of known spectra. In this case, let's say our library has three spectra in it. You can see how, by summing the purple, the green, and the red together, you can come up with the blue spectrum. That's the forward direction. The reverse — taking the blue spectrum and deconvolving it into those three compounds — is harder. That's an inverse problem, which is computationally challenging; even for humans it's hard. But you can see how these things sum together: going up is the mixing, and coming back down is the deconvolution. (A minimal sketch of this summing-and-fitting idea follows below.)

There is software that allows you to do this. There's a company based in Edmonton called Chenomx, and they developed one of the first software tools for spectral deconvolution. It's used quite widely — we used to teach a course with the software — and they have demos you can download that let you try this, so feel free to do that. What you're seeing here is an NMR spectrum, and you can see how this reddish collection of peaks is identifying the compound that's hidden under all the other peaks. You can also see a list below, which gives you the names of the compounds and, I think if you expand it, their concentrations — all the compounds found in this serum spectrum. There are about 50 compounds, and a person who's reasonably skilled can identify all 50 by deconvolution in about 30 to 40 minutes. And not only identify — quantify.
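To make that summing-and-fitting idea concrete, here is a minimal sketch in Python of deconvolution as a non-negative least-squares fit against a small library. The peak shapes and "library" here are simulated stand-ins, not real reference spectra; the chemical-shift positions are only approximate.

```python
import numpy as np
from scipy.optimize import nnls

# Simulate a ppm axis and three "library" spectra as Gaussian peaks
# (stand-ins for real reference spectra of lactate, acetate, alanine).
ppm = np.linspace(0, 5, 2500)

def peak(center, width=0.01):
    return np.exp(-((ppm - center) ** 2) / (2 * width ** 2))

refs = {
    "lactate": peak(1.33) + peak(4.11),   # approximate chemical shifts
    "acetate": peak(1.92),
    "alanine": peak(1.48) + peak(3.78),
}

# Forward direction: the observed mixture is a weighted sum of references.
true_conc = {"lactate": 2.0, "acetate": 0.5, "alanine": 1.2}
mixture = sum(c * refs[name] for name, c in true_conc.items())
mixture += np.random.normal(0, 0.01, ppm.size)          # detector noise

# Reverse direction (the inverse problem): recover the weights by
# non-negative least squares against the library.
A = np.column_stack(list(refs.values()))
weights, _ = nnls(A, mixture)
for name, w in zip(refs, weights):
    print(f"{name}: fitted {w:.2f} (true {true_conc[name]})")
```

The fitted weights are proportional to concentration, which is why deconvolution done correctly gives you quantification for free; real spectra also need peak-position adjustments, which is what makes the actual problem so much harder than this sketch.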
To use that for NMR, you have to do some manual work. You have to do the fixing steps I talked about: phasing, where you get the peaks pointing up in the right direction; transforming the data with a Fourier transform; removing the water signal, which is a frequency-deletion technique; performing baseline correction so things aren't shifted or wobbly; and referencing. Then you use this guess-and-check technique, which is maybe the old way you learned to do algebra. It's the same thing with NMR: I say this peak looks a lot like lactate, so I'll slide in a lactate reference and see if I can move it around; there are other lactate peaks, and you shift it up and down — it fits, or no, it doesn't fit; I'll try another one, maybe acetate, and put that one in. So it's guess and check, superposition, using your mouse to drag things around and scale them up and down. As I say, it takes 30 to 40 minutes if you're skilled. We used to have people try to do this, but we had problems — Jeff can probably attest to it. We'd typically only get people to fit or identify about three compounds in an hour-long lab session, and most people guessed wrong. So it wasn't great. But the people who have taken the time to learn it are actually quite proficient. It's just like learning chess: if you take the time, you can get pretty good at it.

There are other tools that have come along that try to make this more automatic. Bruker is an NMR company; they have a tool called AMIX, and they also have products that do juice and wine screening. So you can take your favorite wine and have it automatically analyzed — if you've got a million dollars to buy the instrument; the software is bundled free. It's actually used in Europe for determining the source of the wines and juices that are sold there. There is a wine screener, I think, in BC at Simon Fraser. And then there are other, more open-source efforts: one from Imperial College called BATMAN, which does automated deconvolution, and one called Bayesil, which you will learn today, also for automated deconvolution.

So, automated metabolomics — and this is what we're going to try to introduce you to — the point is that it's possible to automate a lot of the workflow in metabolomics. Instead of 30 or 40 minutes, it takes about a minute, so it's 30 to 60 times faster. In terms of performance — precision and recall — it seems to be about as good as an expert human. But because it's automatic, you can just go home and let it process dozens to hundreds of spectra overnight. When you let a computer do it, it's reproducible; even if it's wrong, it's reproducibly wrong. When a human does it, they might say, oh, I'm feeling this way today, so I'll do it this way — the reproducibility is not great. So bias and user error are largely eliminated. And in some cases, we found that the automated methods actually pick up things that people just ignore, either because they were trained to ignore them or because they're tired and choose to ignore them.

BATMAN, as I say, is a free software package that you can download and install for NMR metabolomics, and they've written papers on how to use it. But it's quite slow — actually slower than a human — so at some level there isn't a real benefit to it yet, and it's somewhat limited in the number of compounds it can detect. Because of those issues, we worked on another tool called Bayesil.
It's pronounced like the spice, but it uses Bayesian methods to help with the deconvolution, and it's web-based. As I said, it's very accurate. The Bayesian components use hidden Markov models — some of you may have heard of them; they're used in speech and language recognition, and also in sequence alignment. They're great for detecting patterns, and what you're doing here really is a pattern-fitting process. When we watch it perform, it actually seems to work very much the way a human does — it has learned, or it mimics, what humans do in terms of pattern recognition. It also requires some prior knowledge; again, that's part of the Bayesian approach. You have to tell it that this is a spectrum of blood, or cerebrospinal fluid, or a cell extract. If you tell it the wrong thing, it will do pretty badly — but it's the same if you tell a human that what they're analyzing is one thing when in fact it's something completely different. Aha, I fooled you. Just like a person, the computer will get confused. So it knows what it has to look for, and it will try to fit that.

The other thing that was really important was to automate all of the manual processes typically done in NMR. Manually, people phase, which is getting the peaks to point up. Manually, people reference, so they know where the zero mark is. Manually, people remove the water signal, and they manually perform baseline correction, which is kind of like straightening a crooked picture — and everyone has a slightly different view of what's straight. If you automate those things, then it's consistent. That, I think, was a critical step, and it's not in most other tools — any other tools — that we know of. Yes. It is. Humans are imperfect, and an expert will misidentify at a rate of about 5%.

So this is an example of a portion of an NMR spectrum. You can see maybe a few dozen peaks there, which is the overlaid composite spectrum, but underneath are the detected compounds, and you can see how they all sum up. When you're dealing with 90 or 150 compounds, it gets really complicated. And there's an expert threshold, just like the difference between a chess grandmaster and a chess amateur: certain people can do up to maybe 30 compounds, some can do up to 50, and there's maybe only a handful of people walking around the world who can do more than about 100, just because it becomes so complicated and they have to know so much. But this is what Bayesil is able to do.

This is from some earlier tests, where a human was doing the fitting. There's the black trace, which is the actual spectrum, and the red is the fit from that spectral deconvolution. It takes an expert about 40 to 45 minutes; if we gave it to you, it would be four or five days. The computer — this was five minutes then; it's running at about a minute now. So it's obviously much faster.

This is the website; you'll see it later today, and you can log in to it. We're going to be logging into a version that can accommodate about 25 jobs at once, and we'll take you through some of that operation. It's also a freely accessible, open website, so you don't have to log in to the regular version — the publicly available URL is free.
What you'll do is fill in things like this — it's too small for me to read here. You'll choose a spectrum. You'll put in how much of your reference compound there is; that's used to quantify. You have to choose what type of NMR you ran it on: there's 500 MHz, there's 600 MHz, and I think we've got a 700 MHz one, I hope, coming online this week. And then there are example data sets you can just run.

So if you clicked on an example to see how it performs: you first upload your spectrum, and it does a Fourier transformation. It takes the raw signal from the NMR, which is just this wavy collection of oscillations, and converts it from the time domain to the frequency domain. That gives you the spectrum — at this point you have distorted peaks all over the place, but the Fourier transform is done very quickly. Then Bayesil does the phasing: it tries to get the peaks all pointing up. You can see how at the top they're dispersive; at the bottom they're absorptive, is the term. You can also see, somewhere in the middle, where the water signal is — there's a distortion; you can see it dipping, and it's kind of messed up in the middle there. So it has to do a couple of other things: it fixes the water signal, redoes the baseline so things are more balanced, and finds the zero point, which is the referencing. That took about 30 seconds in this case; it's much faster now. (A toy sketch of those preprocessing steps, in code, follows below.)

So this is the final spectrum, done automatically by the software. Normally it's done by a human, and it varies: if we had the 20 of you trying to do this, every one of you would come up with a slightly different-looking spectrum, and if they're all different, the deconvolution will lead to different results. If you do it in an automated way, it's reproducible. It might not be perfect, but it's going to do it the same way every time for the next 50 years, and that makes it reliable.

About a minute later, it will have deconvolved that spectrum. You can see a faint blue trace — that's the fitting it has done. As I said, it would have taken you a couple of days; this was done in about a minute, and it has fit every single peak. If you scroll down a little further, here are all the compounds it identified, along with their concentrations and an estimate of the confidence. So this is automated metabolomics: you can go to an NMR instrument, it will automatically load your samples with an autosampler, hundreds at a time, all the spectra are collected, they can be fed directly into Bayesil individually or in batches, and you can just let it run and it will generate your list.
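Here's a toy sketch, in Python, of the kind of preprocessing Bayesil automates, run on a simulated free-induction decay rather than real spectrometer output. Real tools fit zero- and first-order phase terms and much more careful baselines; this only shows the shape of each step.

```python
import numpy as np

# Simulate an FID: two decaying complex oscillations (the time domain).
t = np.linspace(0, 1, 4096)
fid = (np.exp(2j * np.pi * 120 * t) +
       0.5 * np.exp(2j * np.pi * 300 * t)) * np.exp(-5 * t)

# Fourier transform: time domain -> frequency domain.
spectrum = np.fft.fftshift(np.fft.fft(fid))

# Zero-order phasing: rotate the spectrum so peaks are absorptive
# (pointing up). Here we just scan for the rotation that maximizes
# the summed real part; real software also fits a first-order term.
phases = np.linspace(-np.pi, np.pi, 361)
best = max(phases, key=lambda p: (spectrum * np.exp(1j * p)).real.sum())
phased = (spectrum * np.exp(1j * best)).real

# Crude baseline correction: subtract the median as a flat baseline.
phased -= np.median(phased)
print(f"applied phase {best:.3f} rad; tallest peak {phased.max():.1f}")
```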
Question? Yeah — there's a little bit of Bayesian randomness, so there will be some very subtle differences each time. It uses a seeding step, which is a bit randomized, to lay out where some of the peaks could be, but it should largely converge to the same answer. There may be subtle differences in the concentrations of very low-concentration compounds, but for the most part it's exactly reproducible; the fits and the numbers might differ by 0.1% or something like that. You can run it for a short period or for longer periods, and depending on how long you let it fit, you can end up with slightly different answers. They call it the slow and the fast version: if you flip back and forth between the slow and fast versions randomly, you'll end up with inconsistent data, so just use exactly the same method. Longer is better. Yeah — but some people don't have the time. More time is always better. It's sort of a minimization, if you want to think of it that way, an optimization: the longer you run it, the better it gets. But it's a very complex, high-dimensional optimization, and those almost never converge perfectly, so what you end up with is an approximation. Could you run it again and again on the same data? Yeah, you could. As I say, it will converge to essentially the same place as long as you're running the same algorithm.

We just have it as a web server. There's only one person who actually understands the theory behind it, and he's doing a postdoc at Carnegie Mellon, so we've tried to figure it out ourselves. Most of what we can do are subtle changes. It fits according to peak area, as opposed to peak height, and humans tend to fit to the peak height — that's been an issue for us. So we're rewriting it to see if we can get it to fit to the peak height, which is more intuitive, but that delves into a lot of complicated math.

Anyway, there are limitations with this approach. It works for simple biofluids: we've structured it for serum, plasma, and CSF, and that covers basically any mammalian fluid except urine. We've tried for three years to get it to work for urine. It doesn't; urine is much too complicated. It works at 500 and 600 MHz, and the 700 MHz version is basically done and should be up, I hope, in the next week or two. Those are the common NMR instruments, so it's fairly generic. Most of you have never done NMR, but if you have colleagues who do, they will have these types of instruments, so it's very doable. There are descriptions of how to prepare the sample. This is something we realized in the course of doing this: there are many ways people can prepare samples and many ways they can collect spectra in NMR, and you have to standardize that. If you don't, none of the automation will work — just as GC-MS has been standardized. So we've proposed a standard, which is pretty generic; many people already use it routinely, and the pulse sequence is very simple. As I say, there's been a speed-up, so it's down to more like two minutes instead of five to seven. The open website that everyone can access allows just a single spectrum at a time; you'll be using one that should be able to do batches. And then there's what I guess we'll call a commercial version — really more of a not-for-profit version — where you can get all the material to do the sample preparation bundled in a kit, plus software that allows much more interactive work. This one does everything automatically, but some people want to tweak things, and that's okay; that's what the other, semi-commercial version does. So that's Bayesil. We're going to try it later this afternoon.

Now we're going to talk about GC-MS and then LC-MS. Conceptually, GC-MS and LC-MS analysis is also a spectral deconvolution process.
You start off with a chromatogram, shown in the upper left corner. Under that chromatogram you're usually going to find one, two, three or more peaks — compounds — that sum to produce the observed peak. With EI, you get a fragmentation pattern, and from the extracted ions you'll see different EI-MS spectra — here a blue, a red, and a green one. Once you've pulled those out with whatever software you're using, you do a library analysis and deconvolve them. On the far right is your library, and the idea is to match your observed spectrum to a library spectrum. I think we only had one person doing it — how many people do GC-MS for their metabolomics? One, two, three — yes, okay.

In this case, the critical thing is obviously the database. The other thing to remember is that EI is a hard ionization method: it shatters your compounds to produce a spectrum that is not unlike an NMR spectrum, in that the peaks are a fingerprint, characteristic of that molecule and unique to it. There's typically a high-mass peak, which is your molecular ion, the parent ion. The fragmentation patterns are in some cases predictable, although it's hard for most compounds, and the intensities are essentially impossible to predict. Here's another GC-MS mass spectrum: we see the parent ion — sometimes things have adducts, but in most cases not — and then the fragment or daughter ions, whose positions tell you what the fragments are.

In many cases with GC-MS, the compounds are derivatized. Most commonly it's trimethylsilyl (TMS), but you also see TBDMS and methoxyamine. These are extra chemical groups that attach to hydroxyls and other reactive groups, and they add specific masses. In some cases, by looking at these mass increments, you can say: oh, this is a TMS derivative, or this has two or three TMS groups, or two TBDMS, or three methoxyamines. That helps, in some cases, to identify things. (There's a small sketch of this mass-increment idea at the end of this passage.)

Where NMR covers many polar metabolites, with GC-MS you will see polar metabolites, but it's often best used for organic acids and for volatiles. As we discussed for gas chromatography, the plate counts are much better than LC. EI-MS is standardized, which means you can use libraries of EI-MS spectra, and the route most people take to identify compounds is a software tool called AMDIS and a database from NIST, the National Institute of Standards and Technology. The NIST database is an amazing database — it trumps all other metabolomics spectral databases many times over. There are almost 300,000 electron impact spectra covering almost a quarter of a million compounds, compared to LC-MS databases, which typically have fewer than 12,000 unique compounds. So for GC-MS, EI-MS coverage is roughly 20 times better. It has ion trap data as well, and QTOF and triple-quadrupole data too — again, a very rich source. It costs some money, but it's relatively cheap overall. It also has retention index values, which for GC-MS are a very reproducible, very useful, orthogonal measure for identifying compounds. If LC-MS had this, life would be so much easier. But it's all there in GC-MS, and this is one reason I think it's a neglected but probably far more useful technique for metabolomics than most people realize.
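To illustrate the mass-increment idea just mentioned, here's a small Python sketch: each TMS derivatization replaces an active hydrogen with Si(CH3)3, a net monoisotopic gain of about 72.04 Da, so the shift between the underivatized and observed mass suggests how many TMS groups are attached. The glycine numbers are just an example.

```python
TMS_SHIFT = 72.0395  # -H, +Si(CH3)3: net monoisotopic mass gain per TMS group

def count_tms(underivatized_mass, observed_mass, tol=0.02):
    """Estimate the number of TMS groups from the observed mass shift."""
    shift = observed_mass - underivatized_mass
    n = round(shift / TMS_SHIFT)
    if n >= 0 and abs(shift - n * TMS_SHIFT) <= tol:
        return n
    return None  # shift doesn't correspond to a whole number of TMS groups

# Glycine (75.0320 Da) as a 2TMS derivative would appear near
# 75.0320 + 2 * 72.0395 = 219.1110 Da.
print(count_tms(75.0320, 219.1110))  # -> 2
```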
These are the software tools that couple with AMDIS, and these slides just show the interface you work with and the way you can identify and score whether you've got a similar or an exact match. The AMDIS/NIST software has a number of features that allow it to extract and identify things. It gets rid of background noise and compares peaks so it can tell which ones are noise and which are real. It generates a clean spectrum — that's the deconvolution, not unlike what we talked about for NMR. Then it goes through the library and uses what's called the match factor, which scores the similarity between spectra. So you have your experimental, extracted MS spectrum, and you have the library — your reference. The match factor is calculated as essentially a dot product over spectral intensity and position — how you compare vectors, if you remember dot products — scaled by a thousand. The best you can possibly get is a match factor of 1,000. Generally the cutoff is about 700; some people use 800 and above. So if you get a match of 700 or 800 or above, that's probably your compound.

There's a protocol for GC-MS. Typically, if you want to use retention index information and you want to quantify, you prepare an external set of standards — alkanes — to calibrate your retention times, so you can compute a retention index. You also typically have to run a blank sample, because of the derivatization step: there's always derivatization junk that shows up. The blank lets you pick up things that are stuck to your column and always coming off, as well as the chemical background that typically shows up. And then you run your samples of interest, obviously under the same conditions as the blank. So here's a set of alkane standards run through your GC; you can see how evenly spaced they are and the positions they come off at, and from that you can normalize your retention times to retention indices. Again, if only we could do this for LC-MS — but in GC-MS it's very feasible.

Once the calibration files are done, you calibrate, calculate or recalculate your RIs, and then you start searching your sample files against the database for matches. You may be able to get rid of some false positives by comparing against your blank. So this is the AMDIS protocol: here's your calibration file loaded up; here's your calibration — step two; step three is searching the NIST database after everything has been calibrated. And this is the interface AMDIS provides. The things marked in red, yellow, and blue indicate the positions of certain peaks, and what's marked in red is the particular peak that's come off the chromatogram. If we zoom in a little, we can see the peak that's been isolated with the AMDIS software. Zoom out and you can see the yellow, blue, and red traces, and below that you see what AMDIS is trying to match. Based on the positions of the masses — the m/z values, a 73 and a 144 — from the extracted chromatogram, it has calculated a match factor of 840, sometimes reported as a percentage, 84%, to the reference spectrum of valine. So, through the AMDIS software, this peak has now been identified as valine.
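Two pieces of arithmetic from that workflow, sketched in Python: the retention index interpolated from the alkane ladder, and a simplified cosine match factor scaled to 1,000. NIST's actual match factor adds m/z-dependent intensity weighting, so treat this as the idea, not the exact formula; the retention times and intensities below are made up.

```python
import numpy as np

def retention_index(t_x, alkane_times):
    """Retention index by linear interpolation between bracketing n-alkanes.
    alkane_times: {carbon_number: retention_time} for the alkane ladder."""
    carbons = sorted(alkane_times)
    for n, n1 in zip(carbons, carbons[1:]):
        t_n, t_n1 = alkane_times[n], alkane_times[n1]
        if t_n <= t_x <= t_n1:
            return 100 * n + 100 * (n1 - n) * (t_x - t_n) / (t_n1 - t_n)
    raise ValueError("retention time falls outside the alkane ladder")

def match_factor(query, library):
    """Simplified dot-product match factor, scaled so a perfect match = 1000.
    query/library: intensity vectors aligned on the same m/z bins."""
    q = np.sqrt(np.asarray(query, dtype=float))   # soften dominant peaks
    l = np.sqrt(np.asarray(library, dtype=float))
    cos = q @ l / (np.linalg.norm(q) * np.linalg.norm(l))
    return 1000 * cos

# Hypothetical: C10 and C12 alkanes elute at 5.0 and 7.0 min,
# unknown at 6.2 min -> RI = 1120.
print(retention_index(6.2, {10: 5.0, 12: 7.0}))
print(match_factor([100, 40, 5], [95, 45, 8]))   # similar spectra -> near 1000
```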
You could double-check that its retention index matches what's reported in the database — and it probably does. And if you've done some calibrations, you can also quantify it. So that's a 30,000-foot view of how to use AMDIS. It does let you identify and deconvolve, not unlike Bayesil, but it's not automatic — you have to click and choose and process things.

What we're going to show you is an automatic way to do what's done in AMDIS, and that's called GC-AutoFit. There are also other programs, commercial ones — AnalyzerPro and ChromaTOF, for example. AMDIS, AnalyzerPro, and ChromaTOF were compared about five or six years ago — seven now. And there are different databases you can use: NIST releases a new version every three or four years, so another one was due in 2017; there's the Golm database; and Oliver Fiehn has also developed a database for GC-MS.

As I say, you're going to use GC-AutoFit. Like Bayesil, it tries to do everything automatically, but it's aligned with how most people typically do GC-MS. It needs your sample; it needs your standard, which provides the retention index and also helps calibrate intensities; and it needs a blank. It does all the things you would normally do manually through AMDIS and NIST: auto-alignment, peak identification and peak picking, integration, and concentration calculation. It will take a bunch of different files. The run time, I think, is now less than 60 seconds — I'm not sure how long it took when you tried it, Jeff. Yeah, so batches take longer, but it's proportional to the number of files. In our hands it can identify up to about 119 or 120 compounds, and it's very accurate — actually more accurate than Bayesil. Yeah. So they've changed — we've changed it for this course. Okay, so write that down so you don't end up going to the wrong version. Again, we've done this because we're going to be dealing with a heavy load, so we've created mirror sites so we don't bring the main one down for the rest of the world.

As I said, there are three standard file types everyone needs: typically the alkane standard, a blank sample — usually recommended — and then obviously the samples you're analyzing. It supports various formats, and there's conversion software that lets people switch back and forth; again, this is just a function of different instruments producing different file formats. You can upload a single file, or a zipped file of a whole batch — that's why it can take longer — so you choose either a single spectrum or a zipped collection. Like Bayesil, you have to tell it what your sample is. If it's really serum but you said saliva, it'll do a bad job, so make sure it matches what it really is. In this case, you have serum, urine, saliva, and the collection you'll be using. People can also upload their own library for a specific biofluid or extract, and there's a set of calibration internal standards for quantification. It can do everything automatically, or you can stop at each stage just to make sure it's doing it right, so you can see what's been happening.
So you can look to see if your standards look right. It'll pop up your sample spectrum, just so you can check that it looks right. Then it crunches away for a few seconds and spits out the output. It marks off the peaks it has identified, identifies specific compounds, and provides some annotation. That can be exported in Excel or CSV format. In some cases it will see multiple peaks for one compound and merge them to calculate a summed concentration. You can look at the table, or at the spectrum, just to see whether it all makes sense to you.

So both Bayesil and GC-AutoFit are attempts to make this fully automated. In GC-MS, efforts at full automation were actually made back in the '70s, but they'd build it, and it would work on one computer platform that became obsolete within a year, and they just didn't bother doing it again. Surprisingly, the vendors don't offer automation, which is weird, and it's also weird that the folks who developed AMDIS and NIST haven't really tried to do it either. One of the reasons we're doing this automatically is that we found in this course that if we asked people to try running AMDIS, three hours later they had maybe fit one compound, and they'd say, well, what did I accomplish? The same with Chenomx — three hours later they'd fit three compounds. So you didn't really learn a whole lot. Part of the reason we've done the automation is the view that, in the end, compound identification should not be the most time-consuming part of metabolomics. The time should go to data interpretation — what you have trained to do, which is to interpret the data and understand the biochemistry, the biology, the physiology. It's also to make things standard and consistent, which has been another problem in metabolomics: there hasn't been a lot of consistency and there's been very little standardization.

Efforts toward standardization have come out of the Metabolomics Society, where they've proposed levels of metabolite identification. Level one is positively identified compounds, confirmed by a match to a known standard — an authentic compound — or a large collection of spectra that says this is definitely there. Most people don't actually achieve level one identification. Most people are at level two, which is identifying compounds based on a match to a spectrum — in some cases an m/z and a retention time, sometimes an EI-MS or an MS/MS plus a retention time. A large number of compounds are only putatively identified, assigned to a certain class — primarily lipids. And then something like 97 percent of features from LC-MS are in the unknown category.

So we'll jump now to LC-MS. If this figure looks almost identical to the GC-MS one, that's because in large part they are very similar. In most cases with LC-MS, people stop at the parent ion; they don't go on to get the MS/MS spectra. This figure assumes you've gone on to get not only the parent ion but also the MS/MS spectra — some sort of SWATH or other data-independent acquisition method. So again, you've got MS/MS fragmentation data, which allows you to do database comparisons and alignments. There are a lot of commercial tools for this, unlike GC-MS and unlike NMR — every manufacturer has their own.
Every manufacturer wants you to use their own. So, how many of you use LC-MS for your metabolomics? The majority. And how many use commercial tools for it? A small number. There are free options. How many of you use XCMS for your LC-MS work? Only one, two. How many have ever used MZmine — ever heard of it? There are a few others appearing, but we're going to use XCMS — although I see the website went down today. That's a problem with online tools, especially with MS and MS/MS software. We're limited: we can't buy the commercial packages and install them on everyone's computer, so we're largely trying to use open-access online tools. XCMS is one that worked okay last year and was working fine until a day ago; we're at the mercy of the UC San Diego supercomputers.

XCMS is an open-source tool. It does a lot of what you need to do with mass spec data: peak picking, matching, retention time alignment. You can download it as an R package with a command-line interface, and XCMS Online is the web server version. It accepts a whole bunch of different formats, and it's linked to a well-known database called METLIN, which lets you identify compounds. So it's very much both a chemometric and a targeted method. You take a whole bunch of spectra, align them — a non-linear retention time alignment — and then pick out the features. There's a large peak-picking component to XCMS: it decides which peaks are real and which are not, and assesses them. Then, from the extracted ion chromatograms, it identifies masses and does compound identification primarily on the parent ion mass, though it will also attempt identification by MS/MS matching.

The tough computational part is the peak detection. Different programs have developed different algorithms, and there's no one universal answer. Ironically, the best peak picker is the human eye — we always trump computers. But there are tools and tricks people have learned over the years: certain types of filtering and deconvolution, and certain published algorithms. This is really central to XCMS's success, the peak detection. (A toy illustration of the peak-picking step is sketched below.) The other thing that happens with liquid chromatography is peak alignment: you've got a whole collection of chromatograms, and you want everything aligned so you know where things are. XCMS does this quite well, and that's another reason it became popular. You can see the unaligned chromatograms at the top, and then it does the time warping to get things nicely aligned. When it was compared — this is seven years ago or more — to a number of other tools, XAlign and MZmine and msInspect and so on, it performed better than any of them. More recent tools released by Oliver Fiehn's group have actually outperformed XCMS, so there's a bit of an arms race between the groups to keep improving things, which is good for everyone.
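A toy illustration, in Python, of the peak-picking step on one extracted ion chromatogram. Real pickers such as XCMS's use wavelet transforms and noise models; this just shows the kind of height, width, and prominence thresholds those tools expose as parameters, on simulated data.

```python
import numpy as np
from scipy.signal import find_peaks

# Simulate an extracted ion chromatogram: two peaks on a noisy baseline.
rt = np.linspace(0, 600, 3000)                      # retention time, seconds
eic = (1000 * np.exp(-((rt - 180) ** 2) / 50)
       + 400 * np.exp(-((rt - 420) ** 2) / 80)
       + np.random.normal(0, 10, rt.size))          # detector noise

# Pick peaks with minimum height, width (in scans), and prominence --
# the same kinds of thresholds real peak pickers expose as parameters.
idx, props = find_peaks(eic, height=100, width=5, prominence=50)
for i in idx:
    print(f"peak at {rt[i]:.1f} s, apex intensity {eic[i]:.0f}")
```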
This is XCMS Online. If you use Firefox it works; if you use Chrome it doesn't — and today it doesn't work at all. But this is how you'd get onto the website. It's based out of Scripps and uses, I think, clusters at the San Diego Supercomputer Center. You start a job, and you can do different types — single jobs, pairwise jobs, meta and multi-group jobs — depending on whether you're looking at one group, comparing groups, or whatever you want to do. That's relatively simple. You upload your data — if there are two groups or two data sets, you upload them in two different panels. These slides are just stepping through how you would upload the spectra, if things were working.

Now, the challenge with LC-MS for an online server is that the data files are ridiculously large. By contrast, GC-MS and NMR files are ridiculously small, so NMR and GC-MS are quite amenable to online, rapid processing — you can get answers almost in real time. I think in all likelihood we should expect XCMS Online to disappear, partly because of the load, but also because the file sizes are getting so large they can't handle them; the web can't handle them. And it's ironic, because I don't think there's all that much more real information in LC-MS files.

After you've uploaded your data, you can set a variety of parameters — in this case the data set came from a single-quadrupole instrument with HPLC. There are lots of different file formats. Then you let it run. Last year we overloaded the site, I think, so it took anywhere from half an hour to a couple of hours for people to get results, but on a quiet day you can get your job back reasonably quickly, and then you can start clicking through the panels. It's a really nicely designed website, just limited, as I say, by this bottleneck of huge data files. You can download your results in an XML or Excel-type format, and you've got your masses, peaks, identifiers, retention times, average intensities, and individual peak intensities. So this gives you your metabolite list.

Now, the problem is that there are still a lot of false positives in XCMS — in most of these tools. The other point is that this is not giving you concentration data; it's relative intensity or relative concentration. In some cases the peaks aren't identified — in fact the vast majority of them aren't — so typically you'll have to identify or annotate compounds using other software or through more manual comparisons with MS/MS data.

For LC-MS, because you primarily use reversed-phase columns, generally the more hydrophobic molecules are picked up: the more hydrophobic amino acids, organic acids, and fatty acids. It's great for lipid analysis. If you can derivatize your compounds, as many people are starting to do for isotopic labeling and quantification, that also enhances LC-MS. You really need both the parent ion and the MS/MS data to identify things, and retention time can also help. If you have very high mass accuracy, you can generally get reasonably good suggestive matches for what the compounds are. With high mass accuracy — generally 1 ppm or better, so an Orbitrap, FT-MS, or a really high-quality Q-TOF — you can do mass searches through databases of compounds. A lot of people like to do this, and a lot of people identify compounds just by parent ion searching. How many people use that approach for compound identification?
A lot depends on the databases you work with and the precision, but at that stage you're only at what we call level three identification. The largest database of chemical compounds is PubChem — about 83 million compounds — and it has a mass search tool where you can give a range or an exact mass and see what pops out. Obviously, if you're doing mass searches, you can't search on the positive or negative ion directly; you have to get the neutral molecule mass. Likewise with LC-MS, you have to make sure you're searching a neutral mass, not an adduct. (There's a small helper sketching this conversion below.) So we search, and this is what PubChem gives — in some cases it lists a couple of matches. We can be more specific: we could go to a database like ChEBI, which has about 45,000 compounds.

Now, there's a problem with using resources like PubChem or ChemSpider, which is that fewer than 1% of the compounds in PubChem have ever left the laboratory. 99% of the chemicals in PubChem are essentially synthesized compounds made for proof of principle, or intermediates in a synthesis. They should never be found in a biological sample. So by searching through PubChem, you're increasing your odds of a false positive by a factor of about a hundred. It's important to remember, when you're doing metabolomics searching — matching by mass-to-charge or by MS/MS — to choose an appropriate database. This is a problem you see over and over again: people searching the largest database they can find just to get a hit, and putting something down just to say they've got something. That's not great science.

The other thing to do: if you're looking at a human sample, use a database of human metabolites; if you're looking at an E. coli sample, use a database of E. coli metabolites. If there's no database for your exact organism — say it's a rat or a mouse — a human database is probably okay, because they're all mammals and mammalian metabolism is highly conserved. Also watch out for mixed databases that combine everything. ChEBI combines everything — these are chemicals of biological interest. Some are from plants, some are microbial, some are from sponges and algae and things found at the bottom of the ocean. They're biologically interesting, but they're not things you're going to find in mouse models or frogs or whatever. So again, think about what your sample is.

You can go a little further and search not just compound databases but mass spectral databases — collections of archived reference spectra. This is for MS/MS searching, as opposed to parent ion searching. METLIN has a large collection of spectral resources; NIST, as we went through, has a large collection; and there are MassBank and MoNA. Has anyone heard of MoNA? It's the MassBank of North America, which is actually a consolidation of the mass banks in Europe, Japan, and elsewhere — they've taken just about every free mass spec resource and put it into MoNA. And then there's CFM-ID, which we'll talk about a little later. With these, you can search by mass or by m/z, you can look at adducts, and you can search peak lists for MS/MS. This is a much more powerful approach to mass spectral matching.
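Coming back to the neutral-mass point above, here's a small Python helper that converts an observed adduct m/z to the neutral monoisotopic mass before a database search. The adduct shifts are standard monoisotopic values, but the table is deliberately partial and the example m/z is just illustrative.

```python
PROTON = 1.007276  # monoisotopic mass of a proton, Da

ADDUCTS = {                       # name: (mass shift, |charge|)
    "[M+H]+":   (+PROTON, 1),
    "[M+Na]+":  (+22.989218, 1),
    "[M-H]-":   (-PROTON, 1),
    "[M+2H]2+": (+2 * PROTON, 2),
}

def neutral_mass(mz, adduct):
    """Invert m/z = (M + shift) / |z| to recover the neutral mass M."""
    shift, z = ADDUCTS[adduct]
    return mz * z - shift

# An [M+H]+ ion observed at m/z 181.0707 implies a neutral mass of
# ~180.0634 -- which a biological database would match to a hexose.
print(neutral_mass(181.0707, "[M+H]+"))
```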
So how do you identify unknowns in mass spectrometry? This applies to both EI-MS and LC-MS, and there's quite a bit of effort now to come up with ways of predicting mass spectra. There are a couple of commercial tools, and there's a freeware tool called CFM-ID, developed in 2014. It's the only open-source tool right now that predicts MS/MS spectra — and, as of just a couple of weeks ago, EI-MS spectra for GC-MS. That's important: spectral prediction is different from compound prediction. CFM-ID actually models the fragmentation and predicts the fragments and their intensities. There are other approaches that take a spectrum and say, aha, I think it's this compound — which is very useful, but those don't predict the spectrum to do it; they just figure out what the fragments might be and what fits existing data. If you have tools that can actually predict spectra, that can help with novel compound identification. A couple of papers published in the last few months have used CFM-ID for unknown identification from LC-MS. The EI-MS work is in review right now, so we'll see how popular it becomes. But this is essentially the best route we can see for doing unknown unknown identification.

It also lets you identify compounds: it's been run over a bunch of compound databases — KEGG, HMDB — and has predicted mass spectra for all of them, so those exist as libraries of predicted spectra. MoNA also has a library of predicted spectra for about, I don't know, 60,000 lipids. So this idea is increasingly used: predict spectra, put the predicted spectra online, and search against them to help identify the sort of unknown unknowns where no authentic compound exists or may ever be synthesized — but because we know the structure, or the likely structure, we can predict the mass spectrum.

What level would that take you to — level two at least? Probably two-plus; it can't be level one, because you don't have the real compound. But if you've got the parent ion, the molecular formula, a good match between your observed MS/MS and the predicted MS/MS, and maybe a retention time, I would call that level three — or level two, rather. Yeah. And could you do this for structures taken from a particular organism? Well, what you're dealing with here is a list of all known structures. HMDB has 42,000 compounds, and only about 3,000 of those have authentic MS/MS spectra — but with this, we can do the other 39,000. They're not going to be perfect, but that's the status right now: only a couple of percent of known compounds have ever had their MS or MS/MS spectra collected, and that's the way it's going to be for the next 10, 20, 100 years; we're never going to complete that task. So, as a community, we have to be able to predict very accurately what those spectra should be, to get those matches. This is an attempt — it's not perfect, but over time, as the libraries improve and the machine learning develops, it should get progressively better. In the case of lipids, the prediction is very, very accurate. It also does very well with peptides and a few other classes. There are certain classes where, if we have enough spectra and the fragmentation behavior is simple enough, it's as good as having the real spectrum.
And then, theoretically, someone could provide a site where you check your spectra against what's right for your organism. That's right, yeah. Anyway, it's pretty simple to use: you choose what you want — compound identification or spectral prediction; there are three options — and, as I say, it now supports EI-MS as well as MS/MS. You upload your data — again, I can't see the screen well enough, but you fill that in, click, and go — and it produces the spectrum and then the match. The format has changed, so the match is shown as a mirror image of the observed spectrum: you can see the blue and the red lines matching up, it gives you scores, and you can see whether it's a sufficiently good match.

With LC-MS, there's always the challenge of dealing with salt adducts, neutral-loss species, and multiply charged species. A typical run may have 50,000 features, and really what you're trying to do is eliminate or consolidate most of them. Technically, I would call these noise — they're largely uninformative, although they can still be useful for structure determination. You need to be able to distinguish those adducts and multiply charged species from the parent ions, and there's a variety of software that can do that. Adduct formation depends very much on the chemical structure — whether you have negatively or positively charged groups — and whether sodium or chloride sticks on. These are the common adducts you'll find in LC-MS or direct-injection MS, depending on the solvents you're working with. Adduct tables can get very large: Oliver Fiehn's group has one on their wiki listing a whole bunch of adduct forms that can be seen. Obviously, not every one will appear in a given sample, because it's very solvent- and salt-dependent. Each adduct produces both a change in charge and a change in mass, so given a chemical formula or structure, you can predict what adducts and what masses could show up. There are also neutral-loss fragments, which lead to additional peaks from essentially the same compound — sorting those out is another part of the challenge.

So different tools are available: MZedDB, METLIN, and HMDB can handle and predict adducts; some can predict ion pairs and multiply charged species; some can deal with neutral-loss species. Obviously, if you just search by mass-to-charge, you'll get lots of false positives; getting the MS/MS spectrum improves things, and so does removing the adducts, consolidating multiply charged species, consolidating the neutral losses, getting rid of fragments, and resolving and merging the isotope peaks. If you do all that, you get a tremendous simplification, and this is a typical picture. Here's a positive ion mode LC-MS run: it's not unusual to see 15,000 features. Remove the adducts and that reduces to 12,000; reduce the multiply charged species and you're down to 10,000 or 8,000; neutral losses drop some more; removing the isotope peaks gets you down to 3,000; and in the final spectrum perhaps 2,500 peaks are considered real. If you repeat that in negative mode, you generally get fewer — maybe around 1,500. (The sketch below illustrates one such filtering pass.)
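Here's a sketch in Python of one such filtering pass: among co-eluting features, flag pairs whose m/z difference matches a known adduct, isotope, or neutral-loss spacing, so they can be collapsed onto a single parent feature. The tolerances and the feature list are illustrative only.

```python
import itertools

KNOWN_DELTAS = {                    # monoisotopic mass differences, Da
    "Na vs H adduct": 21.981944,    # [M+Na]+ minus [M+H]+
    "13C isotope":    1.003355,     # per unit charge
    "water loss":     18.010565,    # common neutral loss
}

def related_features(features, rt_tol=2.0, mz_tol=0.005):
    """features: list of (mz, rt); returns pairs explained by a known delta."""
    hits = []
    for (mz1, rt1), (mz2, rt2) in itertools.combinations(features, 2):
        if abs(rt1 - rt2) > rt_tol:
            continue                 # must co-elute to come from one compound
        for name, delta in KNOWN_DELTAS.items():
            if abs(abs(mz1 - mz2) - delta) < mz_tol:
                hits.append(((mz1, rt1), (mz2, rt2), name))
    return hits

feats = [(181.0707, 120.1), (203.0526, 120.2), (182.0740, 120.1)]
for a, b, why in related_features(feats):
    print(a, b, "->", why)
```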
So from a set that might total 25,000 features — 15,000 positive, 10,000 negative — proper filtering might leave you with only about 4,000 true features. But it's challenging to do, and different tools — MZmine, MetFusion, MAGMa — help with it.

Now, the other thing, especially with very high-resolution instruments like the Orbitrap and FT-MS or some of the better Q-TOFs: once you've consolidated things, you can pull out a parent ion mass fairly easily, and then it's possible to use both the parent ion mass and the isotope pattern of that peak to calculate the molecular formula. That gives you a level three or level four identification. There's a web server that's been around for a while called MZedDB, based in Wales, that uses essentially the seven golden rules developed by Oliver Fiehn to calculate molecular formulas. The Seven Golden Rules tool itself is available from the Fiehn group, though I think it's an Excel program. You can also find compounds by molecular formula — you can do that in PubChem and many other tools — but again, the same caveat: only one percent of the compounds in PubChem are actually biologically relevant.

If you can use accurate mass as well as isotopic patterns — remember those patterns we showed with chlorobenzene and other compounds — plus certain rules about how chemicals must bond, valence rules, the Lewis and Senior rules, you can greatly reduce both the number of possible formulas and the likely compound set. That's the principle behind Oliver Fiehn's seven golden rules. How many people have heard of the seven golden rules? Okay, not many. After today, all of you will have heard of them. This has been a really important development for mass spectrometry and metabolomics: it allows you, at least for small molecules, to get fairly reasonable formulas and to reduce the number of possibilities. In the case of unknown unknowns, this is often all we can do.

If you just consider molecules made of carbon, hydrogen, nitrogen, sulfur, oxygen, and phosphorus, with a molecular weight cutoff of 2,000 Daltons, there are about 8 billion possible chemical formulas. Apply a few more rules, and that 8 billion reduces to about 600 million. Remember that PubChem has 83 million compounds, and PubChem includes a lot of compounds with chlorine, fluorine, bromine, and so on — so the formula space is still far larger than the known compound space. And then there are the isomers. This figure shows you, from the space in gray, to the space in red, to the tiny dot, to the invisible dot, the size of the chemical space you're dealing with: formulas alone, versus the number of compounds we actually have structures for, versus the even tinier number with actual mass spectra. This other figure plots the frequency of molecular formulas against mass: small molecules have a small number of possible formulas, large molecules a very large number, and it climbs steadily. The larger your molecule, the more distinct formulas — and the more distinct structures — can fit it. With better mass accuracy, the number of possible formulas is greatly reduced, as this illustrates. (Below is a brute-force sketch of that formula-generation idea.)
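Below is a brute-force sketch, in Python, of formula generation under a ppm tolerance, with a crude ring-plus-double-bond check standing in for the kind of heuristics the seven golden rules formalize. It's restricted to C, H, N, and O and to small search ranges; real implementations are far more complete.

```python
MASS = {"C": 12.0, "H": 1.007825, "N": 14.003074, "O": 15.994915}

def candidate_formulas(neutral_mass, ppm=5.0):
    """Enumerate CHNO formulas whose monoisotopic mass is within ppm."""
    tol = neutral_mass * ppm / 1e6
    hits = []
    for c in range(1, 40):
        for n in range(0, 10):
            for o in range(0, 20):
                # Whatever mass remains must be made up of hydrogens.
                rest = neutral_mass - (c * MASS["C"] + n * MASS["N"] + o * MASS["O"])
                h = round(rest / MASS["H"])
                if h < 0:
                    continue
                m = c * MASS["C"] + h * MASS["H"] + n * MASS["N"] + o * MASS["O"]
                rdbe = c - h / 2 + n / 2 + 1      # ring + double-bond equivalents
                if abs(m - neutral_mass) <= tol and 0 <= rdbe <= 20:
                    hits.append((f"C{c}H{h}N{n}O{o}", round(m, 4)))
    return hits

# Glucose's neutral mass, 180.0634 Da: C6H12O6 should be among the few hits.
print(candidate_formulas(180.0634))
```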
So if you've only got 10 or 5 ppm accuracy, as you climb in molecular size, you don't get many unique hits. If you're just using the molecular formula and the parent ion mass, at today's resolution of about 1 ppm, by the time you get up to about 500 Daltons you're looking at around 20 possible formulas. But if you use the isotopic abundances as well, and you have good resolution on those, you can drop that set of 20 down to a few. Adding that extra information — the isotope patterns or isotope abundances — really makes a difference in zeroing in on the molecular formula, and that's what some of these programs let you do. This slide shows a very large molecule where the isotope abundances from this particular mass spec allowed them to resolve the formula.

As I mentioned before, the question of which mass databases and which chemical databases you use is really important. A lot of people have just been using things like PubChem and ChemSpider, getting nonsensical hits, and reporting them. If you know the source organism, use that information to limit your search space: it profoundly limits what you are likely to see and greatly improves the hits. If your formula search is against 10,000 or 20,000 compounds instead of 83 million, you can be almost certain of a unique match. And if you don't see anything, it's better to conclude it's simply an unknown unknown rather than scraping the bottom of the barrel and reporting nonsense. So use organism-specific or biofluid-specific databases — that greatly improves your reliability and your odds of having matches you can confirm.

The other thing, at least with LC-MS versus what I was showing you for NMR and GC-MS, which can be and routinely are quantitative: most LC-MS studies are not quantitative. To get absolute quantitation, you typically have to use isotopic standards — either the authentic compound or something close to it — though there are other tricks, like selective isotopic labeling, that get around this. You can use techniques called reaction monitoring — single reaction monitoring, multiple reaction monitoring — which give you correct compound identification and are also used to help with compound quantitation. That's actually being used in commercial kits now. One of them is the Biocrates kit — I don't know how many of you have used or heard of those. One, two, three. This is a really neat concept, because it makes metabolomics much more routine. If you've ever done molecular biology, everything is done in kits now; in the old days you had to clone your own DNA polymerase and your own restriction enzymes, and it would take months. Now you just buy the kit, and you can do hundreds of these things. This is the same concept. With it, you can measure metabolites at concentrations from as low as 10 nanomolar up to about 10 millimolar — a huge concentration range — and about 180 different metabolites. You can process up to 100 samples in about a day. So it's very efficient, very quantitative, and it includes a lot of the isotopic standards you need for that quantification. (The arithmetic behind that kind of internal-standard quantification is sketched below.)
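The arithmetic behind that kind of isotope-dilution quantification is simple enough to sketch in a few lines of Python; the compound, peak areas, and spike concentration below are made up for illustration.

```python
def concentration(analyte_area, is_area, is_conc, response_factor=1.0):
    """Analyte concentration from the ratio of its peak area to that of a
    co-eluting labeled internal standard (IS) spiked at a known level."""
    return (analyte_area / is_area) * is_conc / response_factor

# Hypothetical: d3-labeled leucine spiked at 10 uM; the analyte peak is
# 2.4x the internal standard peak -> ~24 uM leucine in the sample.
print(concentration(analyte_area=240_000, is_area=100_000, is_conc=10.0))
```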
There are other kits emerging. The Metabolomics Innovation Centre has a few available online, and I think Shimadzu is starting to sell kits. It's a trend in metabolomics: it's very targeted, but it gives you quantitation, it gives you consistency, it becomes routine, and the data are reproducible from one year to the next. There's a lot of QC involved in preparing these kits. And if you want to move metabolomics from a curiosity science to something that's actually useful in the clinic and in the field, this is something that's going to have to be done. So, I think it's time for our lunch break. We'll wrap up there.