Hello everyone. Today is the first session. I will be giving a lecture on how to use XCMS for untargeted metabolomics. This is a standard slide. How many of you have used R before? Okay. And how many of you have already used XCMS in your research?

XCMS is an R package. It's very popular and very flexible. The first purpose of this session is to get familiar with the R programming language. Bioconductor is a large project that comprises a lot of R packages for bioinformatics. It's useful for many applications, including microarray analysis, which is what it was mainly developed for, and now metabolomics and proteomics. So learning R is one objective: the basic R commands, and familiarizing yourself with how to interact with R from the command line.

The second goal is to learn about XCMS and how to deal with different input formats. XCMS is very flexible, and today we're only going to cover the basics. For more advanced and more customized analysis, you should go to the forum. We also posted tutorials, and you should check those for the more advanced material. It would take almost the whole day if we really went into that detail.

The third objective: raw spectra are very big, and a lot of people want to upload raw spectra to MetaboAnalyst, which is another very popular piece of software for statistical analysis. But because the files are so big, you cannot do it. One option is to first process your spectra and convert them into a peak list table, which is more compatible, so you can upload it to MetaboAnalyst. So those are the three objectives for this session.

We already covered a bit about what R is and what Bioconductor is. If you know about Perl and BioPerl, this is quite similar. But the focus in R is statistical analysis and data visualization. Perl is especially popular.
For example, Perl is quite good at text processing and sequence analysis. But in terms of statistics and visualization, R is just superb. I don't see any programming language that competes with it on that front. So it's really worth it to start learning it and spend more time with it. It will benefit you a lot in your data analysis and in any other project where you need to analyze data.

As for the limitations, a lot of beginners complain that it's hard to learn. If you know a little about Java, the Java tooling is very neat and there's a lot of programming support: if you make a syntax error, the editor highlights it, tells you it's wrong, and suggests the correct way. With R, you either type directly at the command line, and if you get it wrong you just get an error, or you write a script in a text file, and there isn't much good editor support for highlighting or completing your commands. So it's kind of hard for beginners. The other thing is that R is mainly command line driven. Some packages let you use a GUI, but I think being command line driven is intentional, because it's flexible and you see the result immediately. So it is a hurdle, but once you're over it, you'll enjoy it. That is my personal experience with R.

Now we come to XCMS. I think it was first released in 2005 or 2006, and it was about the first open source software for processing LC-MS spectra. It's called XCMS because it can easily be adapted for GC-MS as well, hence the X. But in this course we're going to use it for what it was initially designed for, LC-MS spectra analysis. At that time it was probably the only one that was open source and in R, so a lot of people started to improve on it and asked the original author to add more functions.
So there's a lot of activity going on, and there's a big forum and an e-mail list where people discuss things. When it was released in 2005, it had basic support for mzXML and mzData: parsing, peak picking, alignment, and basic analysis. Then, I think two years later, another algorithm was developed by Steffen Neumann's group, adding centWave feature detection for high-resolution mass spectra. One year later, tandem MS support was added. Further on came OBI-Warp, a more advanced alignment. For today, we're only going to cover the first one, the basic one. For the other features, I'll just briefly comment on which parameters you would set. The spectra we're using today are for the first one. If you have high-resolution or MS/MS spectra, you probably need to change parameters and check the manual. There are detailed manuals that say: if you have this type of data, follow this path. That's more advanced, and we're not going to cover it today.

This is an overview of how to use XCMS in an untargeted metabolomics approach. First, you have your LC-MS spectra. This is what you probably see from your chromatography. XCMS first extracts the ion chromatograms. Then you do a nonlinear alignment to line the ion chromatograms up. It looks like this. After that, you can do a simple built-in statistical test, such as a t-test, to find the significant features. Today we're also going to cover how to save the result to a peak list table, which you can upload to MetaboAnalyst for more advanced multivariate statistics. For this session, we'll just briefly cover how to use the built-in t-test to get the most different peaks. After you get these peaks, you go back to your experiment. You can see how the intensities of these feature peaks compare between the two groups of samples, and you see the most promising ones.
You can then do fragmentation, MS/MS, and try to identify them. XCMS does support tandem MS, but this is just the whole flow. We're going to go through part of that: from your spectra to the differentially expressed peaks.

Now we come to the programming part. If you want to follow along, that's fine. We can go through the slides first, then you can do it on your own, and I can walk around and help you out. This is a note on the code I wrote on the slides. A pound sign (#) means it's a comment, so R will ignore it. The arrow (>) on the left side is the prompt; what's to its right is the start of the command. When you type, you don't type the arrow, only what comes after it. I hope all of you have already installed the required packages.

Can I make a suggestion that we just follow along in the slides? There aren't very many slides. You explain what each line of code does and then we do it. That's been my experience in R, that you get the most out of it that way. You scribble down your notes in the code right now, and I can hand out even further code. Is that okay?

Yeah. The total is probably 20 lines of code. I'm going to go line by line and explain what each does and what the expected output is. Make sure you try it. If you see errors, bring them up and we'll figure them out. We won't go into the variety of advanced options. Just focus on these lines of code; this is very basic. If you want to do something more advanced, you can always start from here. Just make sure you understand what each step is doing and which parameters are available if your data type is different.

The prerequisite for this session is that you have installed XCMS. We also have a test data set. It was previously shipped together with XCMS, but recently it became a separate package. It contains 12 LC-MS spectra in CDF format.
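As a minimal sketch of the prerequisite check, assuming a current Bioconductor setup where the `BiocManager` package has replaced the older `biocLite` script, and assuming the test data package meant here is `faahKO`. The variable names are illustrative only. The lines starting with `#` are comments, as described in the note on the slide notation.

```r
# Packages this session relies on: xcms itself and the example data package.
required <- c("xcms", "faahKO")

# requireNamespace() returns TRUE if a package can be loaded, FALSE otherwise.
is_installed <- vapply(required, requireNamespace, logical(1), quietly = TRUE)
not_installed <- required[!is_installed]

if (length(not_installed) > 0) {
  message("Missing packages: ", paste(not_installed, collapse = ", "))
  # To install them, uncomment the two lines below (needs internet access):
  # install.packages("BiocManager")
  # BiocManager::install(not_installed)
} else {
  message("All required packages are installed.")
}
```

The install lines are left commented out so that running the snippet never changes your R library without you deciding to.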
Using the Bioconductor install script, biocLite, you can install XCMS and then install the data package. I hope you already did that.

Now, this is an overview of the commands we are going to cover. The first one is about input. XCMS accepts open formats: mostly NetCDF, mzXML and mzData. Those are the three formats XCMS accepts. For a lot of the manufacturer-specific formats, I think most vendors give you the option to save as CDF or AIA format, and then you can feed that to XCMS.

Once you have your data saved in a folder, you can read it into XCMS. The command is xcmsSet. It also does the peak picking, and it takes a while. If you're following the tutorial, you'll see a lot of output printed on the screen while it's busy. Depending on how big the files are and what the resolution of the spectra is, it will probably take a while.

After all the peaks are read in, you get an object, called an xcmsSet object. The next step is to group all these peaks based on retention time. After you group the peaks, the peaks in one group are considered to come from the same compound. You can then recalculate and update the retention times, and then regroup. It's an iterative process. You can do it multiple times until you see no further changes.

After you read in and align the spectra, the computer has an idea of where all the peaks are. It can revisit the spectra, focusing on the areas where a peak is expected, look harder, and try to identify the missing peaks. This is a very nice feature. Otherwise you're going to have a lot of missing data in your peak table. So this helps reduce that.
After we fill in the missing peaks, we can do some basic statistical analysis. This is a simple t-test adjusted for multiple testing, with something like the false discovery rate or Bonferroni. And when you see some peaks of interest, you can visualize them based on their retention time and mass. Those are the steps we're going to cover in the next few slides.

These are almost all the commands you're going to use. It's not a lot. We're going to go over them one by one. The commands are highlighted in groups, in line with the previous slide. This one is about reading the spectra and picking peaks.

Sorry, can you say which one is reading it?

Yeah, I'll briefly comment. Here you load the XCMS library. Here you read the path to your files. This is the installed package, and here is the path. When you have your own files, not from a package, you should specify the directory of your folder. Here you list all the file names with their absolute paths. Now you start reading all the files into XCMS and, at the same time, picking peaks. This step takes a while. If you have a lot of files, it takes quite long.

After reading, this is the grouping based on retention time. After grouping, you do the retention time correction. And this is the regrouping with the updated retention times. This process can be iterated until nothing changes. This step, after everything is set, tries to fill in the missing peaks by revisiting the regions where the previous steps decided real peaks are. And with this one command you get the intensities of all the peaks, which you can save as a peak intensity table. This extra step puts your class labels together with your peak intensity table.
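The sequence of commands just walked through can be sketched roughly as follows. This assumes the classic `xcmsSet` interface of the `xcms` package and the `faahKO` example data; newer `xcms` releases favor a different interface, so treat this as a sketch of the workflow rather than the exact code on the slides. The `bw = 10` value for regrouping is an illustrative choice, and the whole pipeline is skipped if the packages are not installed.

```r
# Run the pipeline only if both packages are available.
have_pkgs <- requireNamespace("xcms", quietly = TRUE) &&
  requireNamespace("faahKO", quietly = TRUE)

if (have_pkgs) {
  library(xcms)

  # Path to the CDF files shipped inside the faahKO data package.
  cdfpath  <- system.file("cdf", package = "faahKO")
  # All file names with their full paths.
  cdffiles <- list.files(cdfpath, recursive = TRUE, full.names = TRUE)

  xset  <- xcmsSet(cdffiles)       # read the files and pick peaks (slow)
  xset  <- group(xset)             # group peaks across samples
  xset2 <- retcor(xset)            # retention time correction
  xset2 <- group(xset2, bw = 10)   # regroup with corrected retention times
  xset3 <- fillPeaks(xset2)        # revisit raw data to fill missing peaks

  # Peak intensity table: one row per peak group, one column per sample.
  peak_table <- groupval(xset3, "medret", "into")
  # Class label of each sample, taken from the sub-folder names.
  classes <- sampclass(xset3)
  print(dim(peak_table))
  print(table(classes))
}
```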
So you have your peak list table together with your labels: mutant or not, disease or healthy. Then it has the right format, and you can save it as a .csv file. This file can be used directly in MetaboAnalyst, which we cover in module 7. So these last two commands are just for the purpose of MetaboAnalyst. But we're also going to cover how to do the statistical analysis within XCMS and visualize the result. That will be the last few slides.

This is what the format of a NetCDF file looks like. Most of you probably never bother to open one and look. But the majority of your machines should have the option to save to this format. It is a very popular and open format. XCMS will parse it and get things ready. You don't need to open and view it; it's a very large file. Just so you know, NetCDF is sometimes also called CDF. That is what's used in this tutorial: the files are actually saved as .cdf. Another commonly used name is AIA. So you can probably just rename the file if you have trouble.

Here is the command to use if the files are not from our tutorial, where we use the package. If you have your own spectra, save them in a folder under the current directory and name it, say, myspectra. This command will then read all the files inside that folder. After doing this, if you type cdffiles, you will see all the file names with their absolute paths. This is the input for XCMS, so it knows all the files and their paths. This is for your own data, not for this session. This session uses the files from the package, as I showed before.

When we have the data ready, the first command we give is xcmsSet, which reads and parses the data and detects the features, or peaks. LC-MS data is two dimensional: one dimension is mass, the other is retention time.
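For your own data, the file listing step might look like the sketch below. The folder name `myspectra` and the two placeholder files are hypothetical; they are created in a temporary directory here only so the snippet runs on any machine. With real data you would point `spectra_dir` at your own folder instead.

```r
# Hypothetical folder holding your spectra (a temp folder for this demo).
spectra_dir <- file.path(tempdir(), "myspectra")
dir.create(spectra_dir, showWarnings = FALSE)

# Two empty placeholder files standing in for real NetCDF spectra.
invisible(file.create(file.path(spectra_dir, c("sample1.cdf", "sample2.cdf"))))

# List every .cdf file in the folder (and sub-folders) with its full path.
cdffiles <- list.files(spectra_dir, pattern = "\\.cdf$",
                       recursive = TRUE, full.names = TRUE,
                       ignore.case = TRUE)
print(cdffiles)
```

The resulting character vector of paths is what you would hand to `xcmsSet` in place of the package files.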
What XCMS does first is slice the whole spectrum along the mass dimension. If you look from the top, you'll see it cut into small slices like this. Then it works on each slice. This is one slice along retention time, the slice you saw here before. Some of the slices are obviously from the same compound. Based on your parameters, it will merge these slices, not necessarily two, possibly several. If it determines they come from the same compound, it merges them into one chromatogram. The next step is called the filter. It uses the second derivative of a Gaussian to calculate the boundaries of the peak, and uses those boundaries to calculate the peak area, the intensity. That is what happens under the hood.

Oh, sorry. Do you have any issues with isotopes being picked up? You're binning in this case at 0.1. But can you bin up to like two or a few daltons?

Yeah, you can try. Sometimes, if you see errors or the peaks don't make sense, you have to readjust the parameters. Sometimes if you change a parameter, you get a totally different result. It depends on your column and on your resolution. If you use the same parameters for GC and LC, you won't necessarily get a result that makes sense. So I think the first thing when using XCMS is to find the optimized parameters, the ones that give the most reasonable result, and then stay with them. This is a very difficult step. You get the peaks, visualize them, compare with what you got from the machine, and decide: okay, this is the most reasonable one. Then stick with that. Between different machines and different vendors, the parameters can be quite different. People post the code they use for a given type of machine, and other people copy it, try it, and post back.
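To make the slicing and filtering idea concrete, here is a small base R illustration on simulated data. It is not the XCMS implementation, only a sketch of the same two ideas: summing intensities into m/z slices of width 0.1, and filtering one slice with the negative second derivative of a Gaussian to locate a peak and its boundaries. All numbers and variable names are made up for the example.

```r
# --- Part 1: slice the mass dimension into bins of width 0.1 ---
mz        <- c(300.02, 300.07, 300.14, 300.18, 301.51)
intensity <- c(10, 20, 5, 15, 30)
step      <- 0.1
slice     <- floor(mz / step)               # slice index of each data point
binned    <- tapply(intensity, slice, sum)  # total intensity per slice

# --- Part 2: filter one simulated slice along retention time ---
rt    <- seq(0, 300, by = 1)                # retention time in seconds
fwhm  <- 30                                 # expected peak width at half maximum
sigma <- fwhm / 2.3548                      # Gaussian sigma from FWHM
# Simulated chromatogram: one Gaussian peak at 150 s on a flat baseline.
signal <- 1000 * exp(-(rt - 150)^2 / (2 * sigma^2)) + 50

# Filter kernel: negative second derivative of a Gaussian (up to scale).
k      <- seq(-60, 60, by = 1)
kernel <- (1 - k^2 / sigma^2) * exp(-k^2 / (2 * sigma^2))

# Centered convolution; the first and last 60 points come out as NA.
filtered <- stats::filter(signal, kernel, sides = 2)

# The apex is where the filtered signal is largest; the boundaries are
# the nearest points on each side where the filtered signal turns negative.
apex_idx <- which.max(filtered)
negative <- which(filtered < 0)
left_idx  <- max(negative[negative < apex_idx])
right_idx <- min(negative[negative > apex_idx])

apex_rt   <- rt[apex_idx]
peak_area <- sum(signal[left_idx:right_idx])  # crude area between boundaries
print(c(left = rt[left_idx], apex = apex_rt, right = rt[right_idx]))
```

Because the kernel sums to roughly zero, the flat baseline almost disappears after filtering, which is why the peak stands out.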
So there are a lot of very good tips online. For today's exercise, we just use the defaults for the bandwidth, for how much to combine, and for how to calculate. That will work fine. For higher resolution data, you probably need to change the parameters. There's a quite helpful forum to give you directions, or just send me an e-mail if you have trouble.

Now we go to the code for the commands we discussed. First you get the path to the files. Then you read the peaks, and while it's reading, you get this output on the command line. This takes a while. For the tutorial, we have a very small data set, and it probably finishes in two minutes, depending on how fast your computer is.

Here are some tips on important parameters. If you know, for example, that the initial elution part of the spectrum is not of interest to you, you can restrict the scan range and exclude certain regions. So if your focus is more targeted and you're only interested in a particular retention time range, you put it here. This one is the full width at half maximum, in seconds; it specifies the peak width you see on the GC or LC. The default, I think for LC, is 30 seconds. If you're using GC, probably four or five seconds. For UPLC, I don't know, maybe 12 to 15. You really need to try it out on your machine to get it optimized. But these initial guesses will help you start and see whether you get some result. If you change it and get more peaks, or the peaks are meaningful, then go with that.

For today, these are hidden: if you don't give them, the default parameters are used. You don't need to touch them, but if in the future you don't get a good result, these are the parameters you probably want to play with. And this one is the method for picking the peaks.
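The parameters mentioned here could be passed explicitly as in the sketch below, again assuming the classic `xcmsSet` interface and the `faahKO` example files. The values shown are the defaults discussed above (matched filter, 30 second peak width, 0.1 slice width); the `peak_params` list is just an illustrative way to keep the settings in one place, and the commented-out `scanrange` values are hypothetical.

```r
# Peak picking settings collected in one place so they are easy to tune.
peak_params <- list(
  method = "matchedFilter",  # peak picking method
  fwhm   = 30,               # expected peak width at half maximum, in seconds
  step   = 0.1               # width of each m/z slice
)
# To restrict the analysis to part of the run, a scan range could be added,
# for example: peak_params$scanrange <- c(1, 500)   # hypothetical scan numbers

have_pkgs <- requireNamespace("xcms", quietly = TRUE) &&
  requireNamespace("faahKO", quietly = TRUE)

if (have_pkgs) {
  library(xcms)
  cdffiles <- list.files(system.file("cdf", package = "faahKO"),
                         recursive = TRUE, full.names = TRUE)
  # Same as xcmsSet(cdffiles, method = "matchedFilter", fwhm = 30, step = 0.1).
  xset <- do.call(xcmsSet, c(list(files = cdffiles), peak_params))
  print(xset)
}
```

For GC or UPLC data you would start by lowering `fwhm` toward the peak widths you actually observe, then compare the picked peaks against the raw chromatogram.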
One method I briefly mentioned is centWave. It is for high-resolution data, probably TOF or Orbitrap. We're not going to use it here. It really takes time.

After we read all the files and pick the peaks from each spectrum, the retention time calculated for a peak will be slightly different in each spectrum, due to retention time drift. If you overlay all the spectra and visualize them, you will see something like this. This actually looks nice, but some of them are probably shifted more. Because you want to compare peaks, you always want peaks from the same compound to be compared with each other. So you want to align them and make sure that the peaks you compare from one spectrum to another are really the same thing. You don't want to compare peaks from compound A with peaks from compound B; that is going to give you a wrong result. So the goal is to go from this unaligned picture to the aligned peaks.

The goal is to match these peaks across all the samples. Then, based on these peaks, you update the retention time. Basically, you rewrite the retention times for each spectrum after you group, so that matching peaks have exactly the same retention time. Then when you align and compare them, you compare at exactly the same location. It is an iterative process. This is the command; I already mentioned it before. You can run it once or twice, read the output, and see whether it still updates.

For example, with 12 spectra, each one will probably have a slight drift as it runs through the column, so they won't align perfectly with each other. You want to align them so they overlap completely, like a single spectrum. Basically, you want them to have exactly the same retention time when you compare them.
Do you have a target and then move or shift the other ones to get the alignment, or are the peaks warped?

That's a very good question, actually. What XCMS does is a nonlinear retention time correction. What you're talking about, warping to align, is another approach. With XCMS, you give it a tolerance, basically the bandwidth for the retention time. It groups the peaks and then updates the retention times. It starts from there and then tries to update. Internally it's called nonlinear, but it's not the warping you're talking about. It's much faster and more memory efficient. Overall warping is another approach, but it requires a huge amount of memory. To tell the truth, I don't know exactly how it works, but this is one of the major successes of XCMS: it does this faster and with a reasonable result. So it's nonlinear.

Are you changing the area of the peaks?

No, it shouldn't. No, I don't think so.

So, after we do the alignment, we can do it multiple times, and if the output shows no more updates, it's fine. For our test samples, you probably run it once or twice and it will be quite stable. This is a typical output from the retention time alignment. These are all 12 samples, and for each sample a different retention time drift is calculated. This is the summary, and this shows overall how it has been corrected.
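The group, correct, regroup cycle described above might look like the following, assuming the classic `xcmsSet` interface and the `faahKO` example data. The `family`, `plottype` and `bw` settings are illustrative choices, not necessarily the ones on the slides. The `plottype = "mdevden"` option draws the per-sample retention time deviation plot discussed above, and the code is skipped when the packages are missing.

```r
have_pkgs <- requireNamespace("xcms", quietly = TRUE) &&
  requireNamespace("faahKO", quietly = TRUE)

if (have_pkgs) {
  library(xcms)
  cdffiles <- list.files(system.file("cdf", package = "faahKO"),
                         recursive = TRUE, full.names = TRUE)

  xset <- xcmsSet(cdffiles)   # read files and pick peaks
  xset <- group(xset)         # first grouping across samples

  # Nonlinear retention time correction based on the peak groups;
  # the plot shows the estimated drift of each sample.
  xset2 <- retcor(xset, family = "symmetric", plottype = "mdevden")

  # Regroup with the corrected times, using a tighter bandwidth.
  xset2 <- group(xset2, bw = 10)

  # The two steps above can be repeated if the correction still changes much.
  print(xset2)
}
```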
Now, after we read the peaks and align the peaks, we fill in the missing peaks. This is a very important step. If you just align the peaks and get the peak table, you're going to notice that a lot of peaks are not there, just missing. But if you visualize your spectra manually on your machine, you see something there. You always want to get the maximum information from your spectra, and XCMS can do it in a more automatic way.

This is how it does it. After peak picking, retention time alignment and the retention time update, all the peaks are aligned. At this point, from the majority of samples, it knows where it should expect to see a peak and where the boundaries are. So it rescans that particular region with a more relaxed threshold and tries to identify peaks there. Since you already know this is a place where you're likely to see a peak, false positives are minimized. So it's a quite reasonable approach.

Basically, it rescans the raw spectra to fill in the gaps. It's just a simple command. It redoes the scanning in the specific regions of the missing peaks, the peaks that are missing in some spectra but found in the majority of the others. It revisits these regions of the spectra and tries to fill them in.

You may notice some warnings. I know some of you have complained about them. The warning is fine: a peak found in one spectrum may lie in a region where another spectrum has already ended, so XCMS complains that it couldn't find it in the original spectrum and just ignores it. This is normal. If you set the scan range at the beginning so that all spectra fall within that range, you won't see it. Otherwise, just ignore the warning.

Now we are going to get the peak list with the intensity, retention time and mass. If you look at the peaks from this
object, with this command you see the first 10 rows. You will see the peaks: the m/z with its min and max, the retention time with its min and max, and here the intensity. There is the original intensity, and there is the maximum intensity. These are all from sample 1. If you listed everything from sample 1 to sample 12, it would be very long.

XCMS has a built-in function for differential analysis, which is a two-sample t-test, and it ranks the peaks by p-value. It's really easy once you have the data object, the one returned by the fill-peaks command above. You pass it to diffreport here. These are the two class labels: one is wild type, one is knockout. This does a t-test with regard to these two classes. This is the report; you can save it and open it with Excel. But we can see what's inside: these are the top three rows and all the columns. There's fold change, t statistic and p-value, and there's the m/z with its min and max, and the sample counts. You see here that this peak is found in all 12 samples. For the second one, the peak is found in only 7 samples: all 6 knockouts but only one wild type. So this is of some interest, and we can do a visualization later and see whether it's true or not.

This is a very simple statistical analysis, just a t-test. You can do more advanced multivariate analysis and visualization using MetaboAnalyst. To do that, you need to format the output to be compatible with what MetaboAnalyst requires. If you've used multivariate analysis before, this is a very general format: you want all your peaks, or features, in rows and the samples in columns. Because you want to do the analysis, you need to let the program know the group labels. Although the data already have the samples and
features, you need to add the group labels to tell which samples are from knockout and which are from wild type. You can save the table and insert the labels manually using Excel. But for this exercise, you can get the sample labels, called phenoData, from the XCMS object. Then you combine these labels with the peak intensity table and save it as a CSV file. This file can be used directly for multivariate analysis. You can try it later in module 7.

After that command, if you saved it to mypeaktable.csv and you open it in Excel, you're going to see a format like this. These are all your sample names: ko15, ko16, ko18. This is the class group label; this is KO, in this case the knockout mutant. Here are the peaks. We don't know what they are, because this is untargeted. We just get all the peak locations, identified by their mass and retention time, and the values. The values are the integrated original raw peak areas; this is what we use. So this is not absolute quantification, just relative abundance. You use the abundance to do the statistical analysis.

A lot of peaks can be false positives or low quality. Even though XCMS tries to do its best, it's better to visualize the peaks and check their quality, whether they're real or not. You can do that with XCMS. This is the object returned from the previous analysis. If you give the command called groups, it will give you all the peak groups. If you want to find peaks in a particular retention time and mass range, you can explicitly ask for that: here, retention time greater than 2600 seconds but less than 2700, and peaks that show up in all 12 samples. You can change that, and you may get multiple hits. The last one is
basically saying I just want the first hit. Then you get the EIC for this index, and you plot it. Now you get your peaks like this, colored by class label: one class is red, the other is black. You can change these parameters for whatever peaks you find interesting. Because you know what these peaks look like and where they're located, you can always adjust this to narrow it down and visualize it.

I hope that covers all the basics: how to read your peaks, align your peaks, visualize your peaks, and do basic statistical analysis. XCMS is very flexible, and because it's open source it has been adapted to work in very different scenarios. For a more detailed discussion of what's been covered, with more theoretical background, you can go to this manual. I also found the forum quite useful. It's quite active; people upload their code and discuss their issues and applications in novel fields, and the authors and some senior users try to help as best they can. So go there and read what's been done. If you think your issue is novel and has never been touched, post your question and people will try to help you.

Keep in mind that XCMS can also do tandem MS, and there's a recent addition where you combine multiple experiments in one analysis. Finally, if you use XCMS and you're happy with it and have your parameters right, but you don't want to run it yourself, you can upload your files to what's called XCMS Online. They have a huge server. You upload your files there and it returns the result. It suggests the best parameters for your files; you select them and click, and it does the job for you.

Okay, it's done. Now you can practice and follow the example: download the code, copy and paste, and see what you get. If you have questions, I'm glad to
answer any of them.
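For practice, the closing steps of the walkthrough (the t-test report, the labeled CSV export, and the EIC plot) can be sketched as follows. This assumes the classic `xcmsSet` interface and the `faahKO` example data with its `WT` and `KO` classes. The helper `add_class_row` is hypothetical and written only for this example, and the file name follows the `mypeaktable.csv` mentioned in the lecture.

```r
# Hypothetical helper: put the class labels as the first row above the
# peak intensities (features in rows, samples in columns).
add_class_row <- function(peak_table, labels) {
  stopifnot(length(labels) == ncol(peak_table))
  rbind(Label = as.character(labels), peak_table)
}

have_pkgs <- requireNamespace("xcms", quietly = TRUE) &&
  requireNamespace("faahKO", quietly = TRUE)

if (have_pkgs) {
  library(xcms)
  cdffiles <- list.files(system.file("cdf", package = "faahKO"),
                         recursive = TRUE, full.names = TRUE)
  xset  <- group(xcmsSet(cdffiles))        # read, pick peaks, group
  xset  <- group(retcor(xset), bw = 10)    # correct retention times, regroup
  xset3 <- fillPeaks(xset)                 # fill in missing peaks

  # 1. Built-in two-class t-test, ranked by p-value.
  reporttab <- diffreport(xset3, "WT", "KO")
  print(head(reporttab, 3))

  # 2. Peak table with class labels, saved for upload to MetaboAnalyst.
  peak_table <- groupval(xset3, "medret", "into")
  labeled    <- add_class_row(peak_table, sampclass(xset3))
  write.csv(labeled, file = "mypeaktable.csv")

  # 3. Extracted ion chromatogram for the first peak group between
  #    2600 and 2700 seconds that was found in all 12 samples.
  grp  <- groups(xset3)
  hits <- which(grp[, "rtmed"] > 2600 & grp[, "rtmed"] < 2700 &
                  grp[, "npeaks"] == 12)
  if (length(hits) > 0) {
    eic <- getEIC(xset3, groupidx = hits[1], rt = "corrected")
    plot(eic, xset3, groupidx = 1)
  }
}
```

Changing the retention time window or the `npeaks` condition in step 3 is how you would home in on other peaks of interest from the report.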