We're going to go now to the next part, which picks up where I left off when we were talking about NMR. What we're going to try and do is get into the business of working with some of the data and trying to go from that raw spectral data, whether it's GCMS, LCMS, NMR, or LC, to those lists. So it's metabolite identification and quantification. Now, unless you work in the lab, this is something you wouldn't typically do. But as I say, as bioinformaticians, or as people who want to look at the data, you need to know a little bit about where it comes from, because if you don't, you don't really appreciate the data and, in some cases, how hard it is to measure and how reliable or unreliable it is.
So as I said, there are these two routes that people have historically taken. The very first method, which Jeremy Nicholson, the founding father of metabolomics, really advocated, is chemometric profiling. It's an approach that actually works quite well for mass spectrometry. Measure many, many profiles, presumably of near-identical sources, animals, plants, whatever. Do very careful normalization, correction and everything else. Compare them, look at the statistical differences, and then, after seeing where those differences are, go back to the spectrum and identify them. And then the other approach, which is more recent and is the one we'll be using here, is to look at the peaks and ask: what are those peaks, and how much of each compound is associated with those peaks? And this is even what Nicholson's group does more and more now. So I'm not the only one. In fact, this is a trend for the whole field: go first to try and quantify, identify, target, and get as much as you can that's firm, solid and certain.
So to go from that stage where you had all of those peaks to the list of metabolites and IDs, that's historically been the problem with metabolomics. In genomics or transcriptomics or proteomics, there are tools like BLAST or MASCOT, web-based tools that basically allow you to feed in these things and get lists of gene identifiers, transcript abundance, protein identifiers, and protein abundance. Historically there wasn't a tool to do this sort of thing for metabolites, and that was the real limitation. And when people are thinking about metabolite identification, they're often talking about compounds that are really well known, but also compounds that are unknown to us at first, where eventually we do know what they are, and then there will also be compounds that we just have no idea about and never will have an idea what they are.
This terminology of known unknowns and unknown unknowns is actually used a fair bit now in metabolomics, and it came from a quote from Donald Rumsfeld, who used to be the U.S. Secretary of Defense. He was giving a rambling presentation right after the Afghan war started against the Taliban, and people were asking, what do you know, and what don't you know, and where's Osama hiding, and everything else. And so he had this quote: there are known unknowns, that is to say we know there are some things we do not know, but there are also unknown unknowns, the ones that we don't know that we don't know. And this is sort of the case with metabolites. We don't have a way of sequencing all metabolites. We will see peaks that are unknown. We don't know whether they're real compounds. We don't know if we'll ever figure out what they are. But in many cases, we will see peaks, and when we initially see them, they are unknown to us. But it turns out that if we spike in or add in or fit, these peaks actually do fit something, and the compounds are known.
They were known by someone who studied this compound in Botswana, or they were known to someone who was a Russian scientist working in some bunker, but they were known. We just didn't know they were known. And so that's another situation. This is a universe of compounds that we don't know about, but they've been studied to death. And that's something that has been a real effort, I think, in metabolomics: to try and get at that stuff that's hidden in the literature. These are known, and they can be identified. And then there are the other ones which are just completely unidentifiable, the unknown unknowns.
So to identify the known unknowns, what we would like to have is a way of deconvoluting these spectra, to try and fit the fingerprints. It's like having a database of all the fingerprints of all the criminals. If a criminal isn't a known criminal, then you can't identify them. If you had the fingerprints of everyone, then you could identify them. So what you really want is a fingerprint database of everyone, or of every compound. And you can use fingerprint databases, as I hinted before. You can use them in GCMS. You can actually use them in LCMS. You can use them in MSMS. And you can use them in NMR. As long as someone has maintained a very large reference database of those spectral fingerprints, you can identify these known unknowns. These, as I said, are unknown to you, but they're known by the people who built these libraries. So this technique is called targeted or quantitative metabolomics.
So here's an example of spectral deconvolution. Here's a spectrum you have of, say, a biofluid or a Plasmodium extract or whatever. It's a bunch of peaks. There are, you know, 10 or 15 peaks here. I think you can see that if we deconvoluted or pulled apart the spectrum, we could potentially see three compounds.
Now if we had these three compounds in our spectral library, out of hundreds, we could pull these three out and we'd say, okay, here's compound C, compound B, compound A. If I add them together and assume that they're at equal concentrations, can I reproduce this spectrum? And I think you can see that I can. So here's the red peak, which matches this one. Here's the purple peak that matches this one. Here are the green and the red peaks: the two add together and now I have something twice as high. This red peak matches that one. These two peaks line up, so I get twice as high. And then these two, I get twice as high. So it reproduces the observed spectrum exactly. It therefore tells me exactly which compounds are there. It tells me exactly what concentration or relative concentration is there.
Now this is kind of unique to NMR, because NMR produces multiple peaks for a single compound, whereas GCMS, chromatographically, typically produces just one peak. Now within that one peak, if you get the EI spectrum, you'll get multiple peaks. And so then you compare those multiple peaks to the reference spectrum, and now you can identify it. So again it's matching, spectral matching. With MSMS you do the same. You can look at those fingerprints and again, chromatographically it was one peak, but the MSMS spectrum produced multiple mass peaks. Check that, and if it matches, you can be quite certain that that compound is what you've seen. So the concept we're introducing here, spectral deconvolution, is the same. It works for all the methods.
We're going to be doing it for NMR because we could get the free software. There are a couple of companies that produce this sort of spectral deconvolution software. There's one called Chenomx, and another, Bruker, which produces AMIX. We're using the Chenomx one because it's Canadian, it's compatible with lots of different instruments, and it's pretty simple to use. And as I said, the key thing is that you have to have a library of reference spectra.
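To make the "add the reference spectra together and see if they reproduce the mixture" idea concrete, here is a minimal sketch in plain Python. It is not how Chenomx or AMIX work internally; it simply fits the observed spectrum as a weighted sum of library spectra by least squares. The compound names A, B and C, the peak positions, and the line width are all made up for illustration, and the sketch does not force the weights to be non-negative.

```python
# Toy spectral deconvolution: fit a mixture as a weighted sum of reference spectra.
# Compound names, peak positions and widths are hypothetical.

def lorentzian(x, center, width, height=1.0):
    """Lorentzian line shape with the given center, half-width and height."""
    return height * width**2 / ((x - center) ** 2 + width**2)

def spectrum(peaks, grid):
    """Sum of Lorentzian peaks [(center_ppm, height), ...] evaluated on a ppm grid."""
    return [sum(lorentzian(x, c, 0.01, h) for c, h in peaks) for x in grid]

def solve(a, b):
    """Solve a small linear system a @ x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def fit_mixture(observed, references):
    """Least-squares weights for observed ~ sum_k w_k * reference_k (normal equations)."""
    names = list(references)
    gram = [[sum(p * q for p, q in zip(references[i], references[j])) for j in names]
            for i in names]
    rhs = [sum(p * q for p, q in zip(references[i], observed)) for i in names]
    return dict(zip(names, solve(gram, rhs)))

grid = [i * 0.002 for i in range(2001)]  # 0 to 4 ppm
library = {
    "A": spectrum([(1.0, 1.0), (2.5, 1.0), (3.2, 1.0)], grid),
    "B": spectrum([(1.5, 1.0), (2.5, 1.0)], grid),
    "C": spectrum([(0.8, 1.0), (3.2, 1.0), (3.6, 1.0)], grid),
}
# The "observed" mixture: equal amounts of A, B and C, so shared peaks are twice as high.
observed = [a + b + c for a, b, c in zip(library["A"], library["B"], library["C"])]
weights = fit_mixture(observed, library)
print(weights)  # relative amount of each library compound in the mixture
```

Because the toy mixture was built from equal amounts of the three compounds, the fitted weights come out essentially equal. With a real spectrum the weights would be scaled against a standard of known concentration to turn them into concentrations.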
So NMR is kind of fortunate, because it's not sensitive and so you don't have to have an infinitely large library. They've got a library of about 450 compounds, and it'll basically cover everything that you're ever likely to see in any extract or biofluid. With mass spectrometry, because it's so sensitive, you'd need a library of about 10,000 to be reasonably effective. So that's a little more imposing, and that's also why we're not going to do this with mass spec in the short time we have.
Do you have any open-source software, something free or in Bioconductor, to do the same job? To do the same sort of fitting?
Well, there's something that Jeff wrote called MetaboMiner, which is open source, and it does this for two-dimensional NMR. There are some other tools we'll talk about with METLIN, and I think there are some labs that have local software that they've developed for mass spec or GCMS, and some of it is instrument specific. So I don't know, Jen, do you guys have one set that you've been working on?
Yeah, so anyways, I think you guys were sent instructions some time back, although past experience tells me usually only about half the people follow the instructions. You were supposed to download the software from the Chenomx website. It's demo software, and it should be something like version 7.1. How many people, maybe I'll say, how many did not download? This is amazing. How many are too embarrassed to say they didn't download?
Okay, so the software is divided into two parts. There's one which is called the Processor, and this is the thing that makes the spectra look nice. Whether it's NMR or GCMS or LCMS, you still have to do this sort of thing. Some instruments and manufacturers are more automated, but it's still an important thing. So in NMR there's a thing called phasing, which is trying to balance the shapes.
There's peak deletion, or water deletion, because everything we're working with in NMR and metabolomics is usually in water, and water is about 110 molar in hydrogen, so that's a lot when everything else is on the order of micromolar. So it's a huge peak that you have to get rid of. There's baseline correction, because, as I say, things are tilted and twisted, and you do this for mass spec and HPLC too. And then there's the referencing and reference deconvolution. Again, you do this sort of referencing when you get retention indices or retention times. In this case, we're referencing to the chemical shift. We're making sure that everything is in the right region. So this, or something similar to this, is what you will do in almost any kind of metabolomic spectroscopy, mass spec, NMR, whatever.
So once you do this processing, and you guys will find this a little tedious, the next part is maybe a little more interesting or fun. Now you have a nice-looking spectrum, and you're going to try and figure out what compounds are in there. This is where you use that reference library of 450 spectra, and you're going to try and see which compounds should or could be in that spectrum. Now the example spectrum we're using, Jeff, what is it? Do you know what it is? So it's cerebrospinal fluid. So this is, you know, a real fluid taken from the back of someone's spine, where we forced them down and pulled their spinal fluid out. Some graduate students. And now we're measuring their spectrum, and you'll try and identify what those compounds are. So you're going to be doing that spectral deconvolution, matching reference spectra to the observed mixture.
Yes?
Is what the Processor does also called, I mean, more commonly called, pre-processing?
Yeah, yeah, yeah. So that's pre-processing. So this is a walk-through, and perhaps a little word of warning: they've changed their interface a little bit since the last time we took the screenshots.
So it might not be exactly the same. So you'll start your program, and as I say, I'm just walking you through, and then you're going to do this yourself. But you can use the sheets, and you may have some questions that we can try to answer. You'll launch it, and you'll select and open the spectrum. That's the CSF one that should be in the file set that you have. It'll ask you to confirm your parameters. It automatically knows these parameters, so you'll just basically say yes. Then you're going to phase the spectrum. I'll explain phasing. You're going to reference the spectrum. You're going to remove the water peak. You're going to do your baseline correction. And then at the end, we're going to maybe do the reference deconvolution, or it'll actually do it itself.
So if you launch the program, what you'll typically see is an almost blank window like this. And then you go to File and you'll open your spectrum. Now, you shouldn't be doing this while I'm doing it, because I think you might get into trouble and I'm not going to be able to help you. So I'm just going to step you through. What you should be able to see is the name of the spectrum, and it should be something like this, CSF 10 NOESY 1. And you're going to just click it, just like you're opening a Microsoft Word file or whatever from the file folder. You won't see anything yet, except you'll see the import query. It'll ask you what your chemical shift reference is, and whether it's going to automatically figure out the pH. It might ask about zero filling, line broadening, phasing. Everything that's already checked is sort of auto, so you're not going to do anything. I mean, you'll have time to play around with this afterwards, but initially you just want to stick with the defaults and click OK.
So what should happen then is that that window, which was originally blank, will suddenly brighten up and you'll see a spectrum appearing. And then you'll see these things coloring in at the top. What you're going to have to do then is take your mouse, hopefully people have a mouse, and, just like you're drawing, you'll highlight this region of the spectrum, kind of a narrow box, and it will expand the spectrum. Now, this is what you should see, or something like this. So this is human cerebrospinal fluid, from a spinal tap. What you're seeing here is the water peak. You can see it's really warped and distorted. And it's very big, so it messes everything up. You can also see that there are some other peaks. This is imidazole. And you can see that this is out of phase. So there are things that, instead of being nicely Lorentzian or Gaussian shaped, are distorted or warped. It's also distorted or warped here. So your task with the Processor is to try and fix this, to make sure that things look nice. And as I said, this one is urine, but it's, you know, warped and out of phase. What you're trying to do is use these tools for referencing, phasing, water deletion and baseline correction to get something that looks nice and flat, where everything looks Lorentzian or Gaussian. And at that stage it's ready to be profiled.
So you've worked with File to open things. And now you have this option to do processing. Under Processing there are a couple of tools, and this might be where it's a little different. But you'll have choices of phasing, baseline correction, line broadening, and peak deletion, which I think is now what they use instead of water deletion. But it'll be something that looks very much like that. So the first thing you'll do is phase it. Phasing is just taking the peaks that were twisted or bent and making them all look Gaussian or Lorentzian.
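For readers who want to see what phasing does numerically, here is a minimal sketch of zero-order phase correction. The assumptions are mine, not the software's: each spectral point is a complex number whose real part should be the nicely shaped absorption peak, a phase error rotates real and imaginary parts into each other, and the "auto phase" here is a crude brute-force search that picks the rotation angle giving the largest summed real signal. The line position, width and the 40-degree error are made up.

```python
import cmath
import math

def complex_line(x, center, width):
    """Complex Lorentzian: real part is the absorption shape, imaginary the dispersion."""
    return 1.0 / complex(1.0, (x - center) / width)

def apply_phase(points, angle_deg):
    """Rotate every complex point by the same angle (zero-order phase)."""
    rotation = cmath.exp(1j * math.radians(angle_deg))
    return [p * rotation for p in points]

def auto_phase(points, step_deg=1.0):
    """Brute-force search for the angle that maximizes the summed real part."""
    best_angle, best_score = 0.0, float("-inf")
    for k in range(int(round(360 / step_deg))):
        angle = k * step_deg
        score = sum(p.real for p in apply_phase(points, angle))
        if score > best_score:
            best_angle, best_score = angle, score
    return best_angle

grid = [i * 0.002 for i in range(1001)]                      # 0 to 2 ppm
in_phase = [complex_line(x, 1.0, 0.02) for x in grid]       # one clean peak at 1 ppm
distorted = apply_phase(in_phase, 40.0)                      # simulate a 40 degree phase error
angle = auto_phase(distorted)                                # angle that undoes the error
corrected = apply_phase(distorted, angle)
print(angle, min(p.real for p in distorted), min(p.real for p in corrected))
```

The distorted peak dips below zero on one side, which is the warped shape described in the talk. After the correction the real part is a symmetric, all-positive peak again. Real spectra usually also need a frequency-dependent (first-order) term, which this sketch leaves out.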
And usually all you need to do is tell it to auto phase. If anyone's ever done NMR, you can actually do this manual phasing, and you can do finer and finer phasing, and you can shift things around and click on regions. Usually you click on the left side and the right side at different times and see if you can get them balanced. But that's what NMR experts do. In this case we'll just let the computer do it with auto phase. This slide sort of explains phasing. It's a mixture of sine and cosine, imaginary and real numbers. It's an electronics thing and I'm not going to get into it, but this is what you're doing. You're sorting out the signal to make sure that it looks nice.
Referencing: you want to make sure that your DSS peak is found. I think the newest version actually finds it automatically. It should find it, so you may not even have to do this. But this is to say it finds your DSS and it references to it, just like your retention index references things. This will give you... What's that? DSS is, I think, 4,4-dimethyl-4-silapentane-1-sulfonic acid. It's a chemical shift standard that's added to all samples. And that peak, which resonates at 0.000 ppm, is the reference. It allows everyone on the planet to reference to the same chemical shifts.
So you've phased, you've referenced. Now you want to remove that giant water peak. So this is your NMR spectrum. This is water. You can't see anything except water here. And so there's a water deletion or peak deletion tool. Typically what will happen is that it will mark off a little region around the water. And then you can broaden it or narrow it. You can zoom in. We didn't zoom in here. But once you've defined where you want to remove the water peak, this giant peak is gone. And now you can see the signals arising from the metabolites in CSF. There are about 40 compounds in CSF. So the water removal makes a big difference. So we've phased, we've referenced, we've done the water removal.
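The referencing and water-deletion steps just described can be sketched in a few lines. This is an illustration under my own assumptions, not the vendor's procedure: the DSS peak is taken to be the tallest point within half a ppm of zero, the water peak is placed near 4.7 ppm, and the 4.5 to 5.0 ppm deletion window, the peak heights and the function names are all hypothetical.

```python
def lorentzian(x, center, width, height):
    """Lorentzian line shape."""
    return height * width**2 / ((x - center) ** 2 + width**2)

def reference_to_dss(ppm, intensity, window=(-0.5, 0.5)):
    """Shift the ppm axis so the tallest peak inside the window sits at 0.000 ppm."""
    in_window = [(y, x) for x, y in zip(ppm, intensity) if window[0] <= x <= window[1]]
    _, dss_position = max(in_window)
    return [x - dss_position for x in ppm]

def delete_region(ppm, intensity, low, high):
    """Zero the intensities between low and high ppm, e.g. around the water peak."""
    return [0.0 if low <= x <= high else y for x, y in zip(ppm, intensity)]

# Synthetic spectrum that is mis-referenced by 0.03 ppm:
# DSS at 0.03, a small metabolite peak at 1.94, and a huge water peak at 4.73.
ppm = [-1.0 + i * 0.001 for i in range(7001)]
intensity = [
    lorentzian(x, 0.03, 0.005, 1.0)
    + lorentzian(x, 1.94, 0.005, 0.5)
    + lorentzian(x, 4.73, 0.01, 500.0)
    for x in ppm
]

referenced = reference_to_dss(ppm, intensity)            # DSS now at 0.000 ppm
cleaned = delete_region(referenced, intensity, 4.5, 5.0)  # water window removed
print(max(intensity), max(cleaned))
```

After referencing, the small metabolite peak lands at 1.91 ppm, and after deletion the tallest remaining signal is hundreds of times smaller than the water peak was. The broad tails of the water peak outside the deleted window are still there, which is one reason baseline correction comes next.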
Now we're just trying to make sure that the baseline is perfectly flat and even all the way across. Sometimes things warp. Hopefully you won't have a problem with this, but you want to make sure that the baseline isn't warped. Baseline correction is done for HPLC, GCMS, LCMS. So it's universal. This will use a spline fit to do the baseline correction, to make sure things aren't too warped. And again, it should be mostly automatic. You'll see points where it's tried to identify where the baseline is. And sometimes you can see where it's a little warped here, where it's trying to fit this spline curve. And you can kind of drag and drop. If you've ever done these Bézier curves, I think in PowerPoint, this is the same sort of thing. You can add extra points to try and make things curve a little more precisely. And so you'll actually have a perfectly flat baseline.
So this baseline is based on the manual draw rather than an algorithm?
Well, it tries to do it automatically. And then usually humans, it turns out, are better at seeing the baseline. So this is tweaking it afterwards. If you did everything automatically, and just clicked auto, auto, auto all through these things, you guys should probably have something that works okay. I guess we'll see. If not, you know, it depends. In about every third spectrum, someone usually has to do a little bit of tweaking.
So at this stage, and again, you probably don't even have to do the reference deconvolution anymore, because I think that's now done automatically with the spectrum. But at this point, you'll identify the DSS peak, which is the peak that is your reference. And you want to make sure that it fits very nicely. So this is the one at zero. It's an isolated peak. And you want to make sure the fit has the same shape. And that's where we've deconvoluted the DSS. At that stage, you're done with the processing or pre-processing step.
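As a rough picture of baseline correction with draggable anchor points, here is a small sketch. The talk describes a spline fit; to keep the example short and dependency-free I have swapped in straight-line interpolation between anchor points, which is a simplification and not what the software does. The tilted baseline, the single peak and the anchor positions are invented for the example.

```python
def lorentzian(x, center, width, height):
    """Lorentzian line shape."""
    return height * width**2 / ((x - center) ** 2 + width**2)

def interpolate_baseline(anchors, ppm):
    """Piecewise-linear baseline through (ppm, intensity) anchor points.

    A simplified stand-in for the spline fit mentioned in the talk.
    """
    pts = sorted(anchors)
    baseline = []
    for x in ppm:
        if x <= pts[0][0]:
            baseline.append(pts[0][1])
        elif x >= pts[-1][0]:
            baseline.append(pts[-1][1])
        else:
            for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
                if x0 <= x <= x1:
                    t = (x - x0) / (x1 - x0)
                    baseline.append(y0 + t * (y1 - y0))
                    break
    return baseline

# Synthetic spectrum: one peak at 2.0 ppm sitting on a tilted baseline.
ppm = [i * 0.01 for i in range(401)]  # 0 to 4 ppm
observed = [0.1 + 0.05 * x + lorentzian(x, 2.0, 0.01, 1.0) for x in ppm]

# Anchor points placed in peak-free regions (indices chosen by eye in this toy case).
anchors = [(ppm[i], observed[i]) for i in (0, 100, 300, 400)]
baseline = interpolate_baseline(anchors, ppm)
corrected = [y - b for y, b in zip(observed, baseline)]
print(corrected[50], corrected[200])  # flat region, then the peak top
```

Moving an anchor or adding one in a badly fitted region corresponds to the drag-and-drop tweaking described above. If an anchor is placed on a real peak, the correction will eat into that peak, which is why a human check helps.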
And from within that same program, you can go up, it's probably just behind this window here, and about there will be a button which will launch you into the Profiler. So the Profiler will then launch. You don't even have to do File, Open, because the spectrum is already open. It'll also probably already recognize that the sample was collected at 500 megahertz. And then at this stage, you'll just start playing around with some of the peaks that you're going to try and fit. So this is the stage which is more interesting: you're trying to fit your spectrum.
So as I said, once you've launched the Profiler, which is this button here that you click on, you should see a green-looking spectrum. You should see that it's identified as 500 megahertz. And then you should probably see a list of compounds, 450 possible compounds. And then you start to do a fit. At this stage, in this little box here, you can type in DSS. As I said, the library is already going to be selected, but you type in DSS here. And what it'll do is automatically go to its library and identify where the peaks for that compound should be. It turns out that DSS, even though it's one compound, has four sets of peaks. And those four sets actually translate to something on the order of 20 individual peaks of different intensities. It's going to make a best guess of where that DSS is and how it fits. And you're going to take that peak and try and drag it up or down to fit exactly the shape you're seeing. As you're dragging it, it's actually going to tell you what the concentration is. That number is going to be changing as you drag it up or down. The true concentration should be 250 micromolar. You're going to look at that first peak, which is the one at zero ppm. And then you're going to look at the other peak clusters. So this is the one at zero. This is another one at 0.8. This is another one at 1.7.
And you're going to click on these little corners, which mark the peak clusters, and it'll take you automatically to each one. And you're going to try and drag the peaks so that the fit actually matches the curve. You can see where the blue part is sort of fitting. You can see here that the baseline correction wasn't good. The baseline is too high here. So you'd probably have to go back and do some baseline correction to bring it down. But we're fitting that noisy stuff with this sort of perfect peak. And you can see where the peaks match. This also illustrates the fact that, as I said, there are maybe 20 different peaks corresponding to this one compound. And they have distinct shapes, distinct intensities, distinct positions. If all of these fit, then you can be 100% certain that that compound is there, because there's nothing else in the world that produces the same characteristic patterns and shapes and positions. By fitting the peaks so that the areas match, you can also be very, very confident of the concentration, to within about 5% of the actual value.
So we've done that for DSS. But then you can start typing other compounds, like myo-inositol and serine and alanine and glycine, into these tables here. And you can see if these peaks fit. You can then drag and drop or shift. As you're doing that, what's happening is that the concentration of the compound, as well as the identity of the compound, is being tabulated. As I say, CSF has about 40 compounds. So over the course of half an hour to an hour, if you're really good, you'll be able to fit these 40-odd compounds. Now I don't know if we've given you a list of the 40 compounds you should expect to see. We have published that online. Jeff might have a list of some of them. Some of you can just sort of have fun and see what you'll see.
So here's an example. Acetate is a compound that you should find in CSF. You just type acetate in this little box here, and it'll show a peak at 1.91.
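The link between matched peak areas and concentration can be written out as a small worked example. This is the generic area-ratio calculation for proton NMR, offered as a sketch rather than as the Profiler's actual fitting method: each peak area is divided by the number of protons that produce it and compared with the DSS singlet at 0 ppm, whose nine methyl protons correspond to the 250 micromolar quoted in the talk. The acetate singlet at 1.91 ppm comes from three methyl protons. The synthetic peak heights, the line width and the integration windows are assumptions chosen so that the toy spectrum corresponds to 100 micromolar acetate.

```python
def lorentzian(x, center, width, height):
    """Lorentzian line shape."""
    return height * width**2 / ((x - center) ** 2 + width**2)

def integrate(ppm, intensity, low, high):
    """Trapezoid-rule area of the spectrum between low and high ppm."""
    area = 0.0
    for i in range(len(ppm) - 1):
        if low <= ppm[i] and ppm[i + 1] <= high:
            area += 0.5 * (intensity[i] + intensity[i + 1]) * (ppm[i + 1] - ppm[i])
    return area

def concentration_from_area(area, n_protons, ref_area, ref_protons, ref_conc):
    """Per-proton area relative to the reference, scaled by the reference concentration."""
    return (area / n_protons) / (ref_area / ref_protons) * ref_conc

# Synthetic spectrum: DSS singlet (9 protons) at 0 ppm, acetate singlet (3 protons) at 1.91 ppm.
ppm = [-0.5 + i * 0.0005 for i in range(6001)]  # -0.5 to 2.5 ppm
intensity = [
    lorentzian(x, 0.0, 0.002, 9.0) + lorentzian(x, 1.91, 0.002, 1.2) for x in ppm
]

dss_area = integrate(ppm, intensity, -0.1, 0.1)
acetate_area = integrate(ppm, intensity, 1.81, 2.01)
acetate_uM = concentration_from_area(acetate_area, 3, dss_area, 9, 250.0)
print(round(acetate_uM, 1))  # micromolar acetate in the toy spectrum
```

In a crowded real spectrum the peaks overlap, so simple integration windows are not enough, which is why the fitting against library line shapes described in the talk is used instead.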
Then you can drag this blue peak up until it fits exactly. And as you do that, it'll give you the precise concentration of acetate. You could type in alanine, and it'll give you at least two clusters of peaks, one at 3.8 and one at 1.5. You go to the one at 1.5, and you'll drag and shift until you get a nice fit. Then you'll click on the 3.8 value, and it'll also try and fit that. Over time, more and more of your green spectrum gets filled up with red fits, and you hope to eventually identify all of the compounds. By the time you've identified the compounds, you'll also have gotten a complete list of the concentrations. There's a range on them, probably, because there's about a 5% error in terms of measurement. So that's another example of a fit.
Once you've finished fitting, you can export that data to a table, an Excel table. And now you've got your list of metabolites. So that's from spectra to lists. Whether it's NMR or GCMS or LCMS or HPLC, that's what you'd like to do: get a list of compounds, or identifiers, and concentrations. They could be absolute concentrations, as here, or relative concentrations. But it's basically just a two-column table. And you do that for each sample. So if you've got 100 samples, that's two columns for each of 100 samples. That's the data set that you try to generate.
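The end product described here, a two-column table per sample that gets stacked into one data set, might look something like the following sketch. The sample names, compound lists and concentration values are invented purely to show the shape of the data, and the column names are my own choice rather than the export format of any particular program.

```python
import csv
import io

def write_two_column_table(concentrations, handle):
    """Write one sample's compound list as a two-column CSV table."""
    writer = csv.writer(handle)
    writer.writerow(["compound", "concentration_uM"])
    for name in sorted(concentrations):
        writer.writerow([name, concentrations[name]])

def combine_samples(samples):
    """Merge per-sample tables into one matrix: a row per sample, a column per compound."""
    compounds = sorted({c for table in samples.values() for c in table})
    header = ["sample"] + compounds
    rows = [[s] + [samples[s].get(c, "") for c in compounds] for s in sorted(samples)]
    return header, rows

# Hypothetical profiling results for two samples (values are made up).
samples = {
    "sample_01": {"acetate": 58.0, "alanine": 41.0, "glycine": 7.5},
    "sample_02": {"acetate": 61.0, "alanine": 39.5},
}

buffer = io.StringIO()
write_two_column_table(samples["sample_01"], buffer)
print(buffer.getvalue())

header, rows = combine_samples(samples)
print(header)
for row in rows:
    print(row)
```

A compound that was not fitted in a given sample is left blank here rather than set to zero, since "not detected" and "zero" are not the same thing. How to handle those gaps is a choice for the downstream statistics.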