Hi everyone, welcome to the proteomics hands-on training. In this demonstration, I will show how to use MaxQuant and MSstats for the analysis of a label-free skin cancer tissue cohort data set. I will use the European Galaxy server, and whichever server you use, I really recommend using the training material inside the Galaxy server. You can find it here at this graduation hat symbol. When you click on it, you end up on the training website and can navigate to the MaxQuant and MSstats training page. This has many advantages, which we will see during the training.
We will start uploading the data immediately because it might take some time, and I will explain a bit more about our training while we wait for the files to upload. Part of our files is deposited on Zenodo, and we can copy the links to them by clicking on this copy symbol here. Then we go to Upload Data, paste the links in here, and press Start and Close. We can give our history a name: MaxQuant and MSstats label-free proteomics training.
We also have data in the PRIDE database, which is an open repository for proteomics raw data. As we use a real-life data set, we copy the raw data from there. We can do this with this button again, go to Upload Data, and paste in these links. It's important that the files have the right file type. To make sure they end up with the right format, we can set the type here to thermo.raw. Now all of these files automatically get the type thermo.raw, which is the right type in Galaxy for Thermo raw files. We can start this upload as well; it will take a little longer.
Then we make the names a bit nicer. I click on this pencil button and remove the link part here, so that the file is called Protein Database FASTA. This is a FASTA file with all human protein sequences.
When we click on the eye button to open the beginning of the file, we can see the header line, which contains the UniProt accession, the protein name, and a bit more information. After it comes the amino acid sequence. Then comes the header of the second protein, followed by its amino acid sequence, and so on. We will use this file for MaxQuant, which does the identification and quantification of proteins.
Then we have an annotation file. I also remove the beginning of the name here. This file is important for the MSstats analysis. It contains the experimental design of the experiment. It always needs to have five columns when we use it with MaxQuant output. The first column contains the raw file name, and it's important that this name matches exactly the name of the raw files in the MaxQuant output. We will see later that in our MaxQuant result, these are the file names that we see here in column one.
Then we need to give the condition. These are the groups that we would like to compare. In our case we have two types of squamous cell carcinoma. We have a metastasizing squamous cell carcinoma, a cancer that occurs in elderly people after a lot of UV light exposure. And we have a very special type of squamous cell carcinoma, the RDEB squamous cell carcinoma. RDEB stands for recessive dystrophic epidermolysis bullosa, a very rare genetic skin blistering disease. These patients carry collagen VII mutations, which lead to deficient collagen VII. One of the long-term complications of this rare disease is cutaneous squamous cell carcinoma, and it is quite an aggressive type of cutaneous squamous cell carcinoma. We have six samples here, which is already quite a lot because it's a rare disease. We only got so many samples because the tissues were formalin-fixed and paraffin-embedded (FFPE).
They are thus very easy to store at room temperature for long periods. In the third column we have the biological replicates. We have 19 different entries here, which means that we have 19 different patients. Because we have label-free data without fractionation, we also have 19 runs in the mass spectrometer. The isotope label type column is only necessary for MaxQuant input in MSstats. Because we use input from MaxQuant for MSstats, we need this column, and because we don't have any heavy spike-ins, we label everything as light.
For MSstats we also need this contrast, or comparison matrix, file. It indicates which conditions we would like to compare. In our case we only have two different conditions, and therefore we only have one line here with the comparison of the metastasized versus the RDEB cutaneous squamous cell carcinoma. It's important that the names in the header line match exactly the condition names as they were written in the annotation file. The first entry is only the comparison name, but it makes sense to name it according to the order of the comparison. We compare the metastasized, which is indicated by 1, to the RDEB, which is indicated by minus 1. So it makes sense that our comparison name has the same order: metastasized versus RDEB cutaneous squamous cell carcinoma. If there were more conditions, we would need to put every condition into a separate column. If we are not interested in a condition for a given comparison, we can just use zeros for the conditions that should not be part of that comparison.
Okay, these are quite a lot of files, so the upload might take a while, but I will already try to rename them. Here it's important to rename them correctly, because these names later have to match the file names in the annotation file. If you do this with your own data, it's not so much of an issue, because you would probably run MaxQuant first.
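
As a rough illustration of the two files described above, here is a minimal Python sketch that builds a small annotation table and a one-line comparison matrix. The column names (`Raw.file`, `Condition`, `BioReplicate`, `Run`, `IsotopeLabelType`, `names`), the value `L` for light, and the sample names are assumptions made for this example; check them against the training material before using them with real data.

```python
import csv
import io

# Assumed column names for an MSstats annotation file built for MaxQuant input.
ANNOTATION_COLUMNS = ["Raw.file", "Condition", "BioReplicate", "Run", "IsotopeLabelType"]


def build_annotation(samples):
    """samples: list of (raw_file_name, condition) tuples, one per patient and run."""
    rows = []
    for number, (raw_file, condition) in enumerate(samples, start=1):
        rows.append({
            "Raw.file": raw_file,     # must match the MaxQuant raw file name exactly
            "Condition": condition,
            "BioReplicate": number,   # one patient per sample
            "Run": number,            # label-free, no fractionation: one run per patient
            "IsotopeLabelType": "L",  # no heavy spike-ins, so everything is light
        })
    return rows


def build_contrast(name, conditions, numerator, denominator):
    """One comparison row: 1 for the numerator, -1 for the denominator, 0 otherwise."""
    weights = {condition: 0 for condition in conditions}
    weights[numerator] = 1
    weights[denominator] = -1
    return {"names": name, **weights}


def to_tsv(rows, columns):
    """Serialize a list of dicts as tab-separated text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, delimiter="\t", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical sample names, for illustration only.
samples = [("met_1", "metastasized"), ("met_2", "metastasized"), ("rdeb_1", "RDEB")]
annotation = build_annotation(samples)
contrast = build_contrast("metastasized-RDEB", ["metastasized", "RDEB"], "metastasized", "RDEB")

print(to_tsv(annotation, ANNOTATION_COLUMNS))
print(to_tsv([contrast], ["names", "metastasized", "RDEB"]))
```

With more than two conditions, `build_contrast` leaves every condition that is not part of the comparison at zero, which mirrors the rule described above.
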
Then, after the MaxQuant run, you would set up your annotation file and check that the names in the annotation file match the names that came out of MaxQuant. In our case it's also not super important, because the MaxQuant run will take several hours. We won't wait for it to finish; we just take the MaxQuant results, which are also deposited on Zenodo, and continue with those. So if you make a mistake here, it's not a big deal, because we will not continue with the MaxQuant results we get from here but with the MaxQuant results from Zenodo.
Because we have so many files, it can be a bit easier to store them in a collection so that we have fewer entries in our history. That's what I'm going to do after renaming them. You could also skip this first step, because the MaxQuant results are provided and all those raw files consume quite a lot of space in your history. So you could go through this MaxQuant step just theoretically. If you want to learn more about MaxQuant, there's a beginner's training on MaxQuant for the analysis of label-free data, where a lot of the MaxQuant parameters are explained in more detail. I highly recommend going back to it if you are new to MaxQuant.
Galaxy keeps complaining that the metadata might not be changed yet because the file is not yet uploaded, but it should work here.
To create a collection of all this raw data, I click on this icon here, Operations on Multiple Datasets, select all, and then deselect the three files that are not raw files. Then, for all selected, I build a dataset list. I can now enter a name for the collection; I just call it raw files. I'm happy to hide all the original elements, because then I have fewer entries in my history, and I click Create List. Now it looks much cleaner. All 19 files are in this collection, and when I click on it, I can see each of them.
When I click here, I get rid of the checkboxes in front of the elements. Now we start with the first tool, which is MaxQuant. There are two versions of MaxQuant in Galaxy. With one of them, we can only specify the parameters in a MaxQuant mqpar file. With this tool, we can set the parameters in the typical graphical user interface.
We have to choose which format our raw files have. Thermo raw is already selected. It would also be possible to do the analysis on open standard formats like mzXML and mzML. However, they have to follow a very specific format to be compatible with MaxQuant, so it's not always easy to get the data into the right open file format.
First we have to select the FASTA file; that's already done for us, as it is the first file in the history. We can leave the parse rules as they are. They specify which parts of the FASTA protein headers are kept and how they are split up to appear later in our output files.
MaxQuant does identification and quantification, so here we have many parameters for the search options. We set the minimum unique peptides to one. This means that we are only interested in proteins that have at least one unique peptide, that is, a peptide that belongs only to this protein and is not shared with other proteins. Then we set match between runs to yes. This is quite helpful for larger label-free data sets, because we can transfer identifications between runs: when the same feature is present in another run but was not picked for fragmentation, and so did not become an identification, it can get its identification from another file just by transferring the identity of the same feature.
Now we have options for protein quantification. However, they are not so important, because MSstats starts again from the peptide level of the MaxQuant outputs and ignores all the protein-level data. And here we now have to choose our raw files.
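
To make the idea of parse rules more concrete, here is a small Python sketch that splits a UniProt-style FASTA header into its parts. The regular expression and the example header are assumptions made for this illustration; they are not the parse rules MaxQuant itself uses.

```python
import re

# Assumed header layout: >database|ACCESSION|ENTRY_NAME description OS=organism ...
HEADER_RE = re.compile(
    r"^>(?P<db>[^|]+)\|(?P<accession>[^|]+)\|(?P<entry>\S+)\s+(?P<description>.*)$"
)


def parse_header(line):
    """Split one FASTA header line into database, accession, entry name and protein name."""
    match = HEADER_RE.match(line.strip())
    if match is None:
        raise ValueError(f"Header does not follow the assumed layout: {line!r}")
    fields = match.groupdict()
    # The protein name is the description up to the first ' OS=' field.
    fields["protein_name"] = fields["description"].split(" OS=")[0]
    return fields


# Hypothetical header, for illustration only.
example = ">sp|P00000|EXAMPLE_HUMAN Example protein OS=Homo sapiens OX=9606 GN=EXMPL PE=1 SV=1"
parsed = parse_header(example)
print(parsed["accession"], "|", parsed["entry"], "|", parsed["protein_name"])
```
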
A parameter group means that we could choose different analysis strategies for different input files. But we would like to put all files into one parameter group and analyze them all with the same parameters. Because we have a collection, we click here on dataset collections, and now we can select our history entry 23 with the raw files. We keep the missed cleavages and the modifications as they are, with trypsin as the enzyme.
In theory, we could set the quantitation method to label-free. But it's not really necessary because, as I said, MSstats will ignore the protein-level quantification. All that the label-free quantification option does is give us more accurate, normalized protein-level quantification, which MSstats will ignore later on anyway. So we don't need to select it here. But we would like to generate a quality control report with the PTXQC functionality. The two output files that we need as input for MSstats are the protein groups file and the evidence file.
I can now start MaxQuant, but it's not really helpful or necessary because, as I said before, this is a real-life data set. It's not a cropped-down data set; it's a real tissue cohort, and so the MaxQuant analysis will take several hours. If you don't want to wait that long, you can just go into the training. Here, after the MaxQuant step, we have this box, and in the box we have the MaxQuant results. We can copy the links, upload them, and continue with these files.
In the evidence file we have the feature-level data. In the protein groups file we have the protein-level data. And in the PTXQC output we have the quality report, with a lot of different plots that visualize the quality and the properties of the data. Normally PTXQC is a separate R package, but we have implemented it directly in the MaxQuant Galaxy tool, because it's made for MaxQuant and it's really helpful for getting a quick overview of the data. We can also rename the files.
Okay. Before we continue, we will have a quick look at the data set, so we need to wait until the files are ready. Now the files are loaded into Galaxy. To open the PDF, I enable the Scratchbook and then click on the view button.
Here we already have an overview of a lot of different parameters and how they perform in the different files. Don't be afraid if something is red, or even if many things are red; the cutoffs do not fit all experiments and all types of mass spectrometers. We have an overview of the files and the parameters from MaxQuant. We get something like a PCA plot here.
And this is a really important plot: these are the contaminants. MaxQuant adds potential contaminant protein sequences to the FASTA database. We see here that up to 60% of our intensity derives from potential contaminants. But when we take a closer look at these contaminants, many of them are human skin proteins. Because we have analyzed human skin, we expect these proteins to be part of our sample and not a contamination. So in the next step, we will filter our files in a way that keeps these potential contaminants for the statistical analysis in MSstats.
Then we have some intensity distributions. We have quite a few missed cleavages here. There are several mass spec parameters that show how the mass spectrometer performed, and how well match between runs worked, which looks good. Everything that is red is the number of IDs that were transferred via the match-between-runs function. Then comes the mass error. Here we see that the number of identified MS2 spectra is rated as bad. But for FFPE tissues this is actually quite a common pattern. Then we see the peptide and protein identifications, and this looks quite okay. It's between 1,500 and 2,000 from small FFPE tissues, so that's totally fine.
Okay, then we have the protein groups output. Every line is one protein group, so we have 2,621 protein groups here.
We can inspect it in more detail. Here is all the protein-level information: how many peptides we had in which group, whether these were unique or shared peptides, and how many there were per sample. This is quite a large file with many columns: peptides per sample and unique peptides per sample. Then we have sequence coverage, molecular weight, sequence length, the type of identification for each sample, sequence coverage for each sample, intensities for each sample, and MS/MS count per sample.
We are also looking for the potential contaminant column; here it is. A plus in this column means that the protein group is a potential contaminant. In its first step, MSstats removes all protein groups that have a plus here and excludes them from any statistical analysis. Because we expect that these are not contaminants but actually proteins from our skin sample, in the next steps we will remove only the non-human contaminants and keep all human potential contaminants. Unfortunately the header is currently not displayed in Galaxy, but this should be column number 118. We need this information later to know in which column we have to remove the pluses from the human protein groups.
The third file is the evidence file. It has roughly 240,000 lines, and each line represents a feature, that is, a peptide in a certain charge state with certain modifications. We have the sequences and modifications. And here is the raw file column. If you do your own experiment, it's important that the raw file name, exactly as written in this column, is the raw file name in the first column of your annotation file for MSstats. This is the key by which MSstats matches the metadata from the annotation file to the proteomics data in the MaxQuant output files. Then we have mass errors, retention time length, and intensities.
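
When the header is not displayed, the column number can also be looked up programmatically. The following Python sketch finds the 1-based position of a column in a tab-separated header line and checks that the raw file names in an annotation match those in the evidence table. The toy header and file names are assumptions for illustration; a real evidence file has many more columns.

```python
def column_number(header_line, column_name, sep="\t"):
    """Return the 1-based position of column_name in a delimited header line."""
    names = header_line.rstrip("\n").split(sep)
    return names.index(column_name) + 1


def unmatched_raw_files(evidence_lines, annotation_names, sep="\t"):
    """Return raw file names found in the evidence table but missing from the annotation."""
    raw_col = column_number(evidence_lines[0], "Raw file", sep) - 1
    seen = {line.rstrip("\n").split(sep)[raw_col] for line in evidence_lines[1:]}
    return sorted(seen - set(annotation_names))


# Toy evidence table with hypothetical content, for illustration only.
evidence = [
    "Sequence\tRaw file\tIntensity\tPotential contaminant",
    "PEPTIDEK\tmet_1\t1000\t",
    "SAMPLER\trdeb_1\t2000\t+",
]

print(column_number(evidence[0], "Potential contaminant"))
print(unmatched_raw_files(evidence, ["met_1"]))
```

An empty list from `unmatched_raw_files` would indicate that every raw file name in the evidence table has a counterpart in the annotation.
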
And here we have the potential contaminant column; this should be column 54. Later we also filter at the feature level and remove the potential contaminant plus sign from all human proteins that were labeled as potential contaminants. This is what we are doing now.
First we remove every protein and feature that is not human, because we consider these to be the real contaminants. To do this we use the Select tool. We can either search for it here, but it might be a bit tricky to find, so we go to the training material. Let's just reload. Here in the MSstats analysis section we can click on the tool, and it will appear in the right tool version. I can already copy the pattern we are looking for before I click on the tool, so I copy the select pattern here. It means that we are looking for any lines that match either HUMAN or Majority. HUMAN should match all the human proteins, which we therefore keep. The word Majority should only appear in the header line, so this is a way to keep our header line in the Select tool.
Then we do the same for the evidence file. In the evidence file the word Sequence appears in the header line. We need to select the evidence file from Zenodo here. With this step we also remove the features that derive from non-human proteins.
Now, for the human proteins, which are the ones that we would like to keep, we need to remove the plus sign in the potential contaminant column. To do so we use the Replace tool. The protein groups are in our first file, file 30, and in the protein groups it should be column 118. In this column we look for the plus sign, and to remove it we replace it with nothing. Then we repeat the same for the features in the evidence file. This is the second select output, and in the evidence file it should be column 54.
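
The select and replace steps can be sketched in a few lines of Python. This is a simplified stand-in for the Galaxy tools, not their implementation: the regular expression syntax `(HUMAN)|(Majority)`, the exact-match replacement of `+`, and the three-column toy table are assumptions made for this example.

```python
import re


def select_lines(lines, pattern):
    """Keep only lines that match the regular expression (like a grep)."""
    regex = re.compile(pattern)
    return [line for line in lines if regex.search(line)]


def clear_flag(lines, column, flag="+", sep="\t"):
    """Empty the given 1-based column wherever it contains exactly the flag."""
    cleaned = []
    for line in lines:
        fields = line.split(sep)
        if fields[column - 1] == flag:
            fields[column - 1] = ""
        cleaned.append(sep.join(fields))
    return cleaned


# Toy protein groups table with hypothetical entries; the real file has the
# contaminant flag in column 118, here it is column 3.
protein_groups = [
    "Majority protein IDs\tIntensity\tPotential contaminant",
    "sp|P00000|EXAMPLE_HUMAN\t100\t+",
    "sp|P00001|EXAMPLE_BOVIN\t50\t+",
    "sp|P00002|OTHER_HUMAN\t70\t",
]

human_only = select_lines(protein_groups, r"(HUMAN)|(Majority)")
msstats_input = clear_flag(human_only, column=3)
for line in msstats_input:
    print(line)
```

The non-human entry is dropped by the select step, and the remaining human entry loses its plus sign, so a downstream contaminant filter would no longer remove it.
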
Again we look for the plus sign, and we remove it by replacing it with an empty field. To keep track, we can rename these files. The first file was the protein groups, and we now name it protein groups input for MSstats. This file contains all the protein groups; they should all be human proteins, but they should no longer be labeled as potential contaminants, because then MSstats would remove them. The protein groups are only needed in MSstats for the protein inference; their quantities or abundances are ignored by MSstats. Then we also rename the other file: this is the evidence input for MSstats.
Now everything is prepared for MSstats. We can either search for MSstats here or click on it directly in the training material. There are two MSstats tools. MSstatsTMT is for labeled data, and we use the plain MSstats version, which is for label-free data.
We first need to specify the input source; we have MaxQuant as input. Now we need to provide the evidence and the protein groups files. Of course we could use the ones that we obtained directly from MaxQuant, but this would remove all those potential contaminant proteins. So here we use our filtered and modified inputs, the evidence and the protein groups. We select the annotation file; it was here at the beginning. We select the leading razor protein column as the ID column. By clicking on the MaxQuant to MSstats format options, we get more filtering options for how the MaxQuant input is treated when it is converted into an MSstats-compatible table. The only thing we change here is that we remove the proteins that have only one feature.
Then, a bit hidden in the data process options, there are a lot of processing parameters. We also choose the run-level data as an output, as well as the sample quantification matrix. I don't think we need the raw data.
Here you can see all the processing steps that happen before the statistical analysis. We do a log2 transformation. We equalize the medians. We fill in the rows that are not complete, and so on. Here we impute the missing values. We leave all those data processing options as they are.
In the plotting options we select the QC plot, but we do not generate it for every single protein, only an overview plot that summarizes all the proteins. To make it a bit nicer, we can go into the advanced visualization parameters, where there is a parameter for the angle of the labels, and we set it to zero.
Then we come to the statistical analysis. To do a group comparison, we need to say yes to compare groups and select the comparison matrix. We obtain the comparison result as an output, and to get a visualization we also choose the volcano plot. To make the volcano plot nicer, we go again into the advanced visualization parameters, set 1.5 as the fold change cutoff, and choose not to display protein names in the volcano plot. And that's it. This will take a few minutes, so it might be a good point for a little break.
Okay, now the MSstats run is done and we can inspect the output files. The first file is the log file, which captures information about the data analysis steps and potential warnings. Here we can also find the number of proteins and of peptides per protein.
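
To give a feeling for the first two processing steps, the sketch below applies a log2 transformation and then shifts each run so that all run medians are equal. This is a deliberately simplified illustration under my own assumptions (plain lists of intensities per run, the median of the run medians as the common target); it is not how MSstats implements its processing.

```python
import math
from statistics import median


def log2_transform(runs):
    """Log2-transform the positive intensities of every run."""
    return {run: [math.log2(value) for value in values if value > 0]
            for run, values in runs.items()}


def equalize_medians(log_runs):
    """Shift each run so that every run median equals the median of all run medians."""
    run_medians = {run: median(values) for run, values in log_runs.items()}
    target = median(run_medians.values())
    return {run: [value - run_medians[run] + target for value in values]
            for run, values in log_runs.items()}


# Hypothetical intensities for two runs, for illustration only.
raw = {"run_a": [2, 4, 8], "run_b": [8, 16, 32]}
logged = log2_transform(raw)
normalized = equalize_medians(logged)

for run, values in normalized.items():
    print(run, values, "median:", median(values))
```

After this step both toy runs have the same median, which corresponds to the box plot medians lining up in the QC plot.
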
Then we have the processed data file, which is a tab-separated file with a lot of lines because it contains feature-level data. We see, for example, the peptide sequence and the protein it belongs to, and all the metadata is attached to the features. We see an intensity, which is the raw intensity, and an abundance value, which is the log-transformed and normalized intensity. We can also see whether missing value imputation was performed for this peptide.
The features are then summarized inside MSstats into protein-level abundances, and this information is in the run-level data. That's again a tab-separated file. It has somewhat fewer lines, but still a lot, because it has one line for each protein in each file or sample. On this protein level we again have the protein name, the intensities, and the number of features, so in the next step we will calculate how many features a protein had on average. The file also contains the missing value percentages and again all the metadata. Another follow-up step is that we look at the file names and count, for each sample or file, how many proteins were included in the statistical analysis. We will look at this later.
Now we run the summary statistics on the run-level data. We use column 4, which holds the number of features per protein, and with the summary statistics we can see how many features were grouped into one protein on average. Then we can already start counting the proteins per run. For this we use the Datamash tool. The file information was in column 8, so we go to the run-level data and say that we want to group by column 8. We can confirm this here: column 8 holds the different file names. We have a header line, which we want to keep, and we need to sort the input so that all lines with the same file name are really summarized together. With the count operation we just count how many lines are summarized for each file name, and this gives us the number of proteins per file.
The summary statistics are already done; they can be computed on any numerical column. We find a mean of six features per protein, and the median is three features per protein, so there are probably one or two proteins that have a lot of features. Indeed, the maximum is 271 features summarized into one protein. In the Datamash step we have now summarized our files. The output has 20 lines: one header line and 19 different files or samples. For each file we have counted how many lines had this entry, which corresponds to the number of proteins. We can now do a quick visualization of this data with a bar diagram, where we could set more parameters. If we hover the mouse over the bars, we see that the RDEB cSCC 4 sample has the lowest number of proteins.
Okay, let's continue with the MSstats outputs. We have a QC plot; I might need the Scratchbook to open it. It shows the distribution of protein intensities per sample. The medians all look like they are on the same line, which is a good sign, and the box plots are relatively equal. In theory we could also have generated this per protein; then the PDF would have more than 2,000 pages, but we could have inspected the data for each protein individually. For now this is enough: we know that the data is relatively normally distributed after the log transformation and normalization.
Then we have a file called the sample quantification matrix. It has the protein ID in the first column and then, for every sample, the abundance values in a separate column, so we can look up the abundance per protein for each sample. We have something similar here, with the average value per condition for each protein. This information, especially the sample quantification matrix, could be used in any other statistical or visualization tool for further analysis steps.
Then comes the most important table, the comparison result table. Here we again have the protein ID in the first column and then our comparison name; we only had one comparison, so there is the same label in every line. Then we have the log2 fold change, which tells us how the abundances differ between the two conditions. This is why it's important that the name of your comparison fits the actual order in which you compare the two conditions: a negative value here means that the abundance was higher in RDEB, while a positive value means that the abundance was higher in the metastasized samples. The most important column is the adjusted p-value, and we will filter on this one in a moment.
Before that, we have a last MSstats output, the volcano plot. It has the log2 fold change on the x-axis and the negative log10 of the p-value on the y-axis, which means that lower p-values, that is, more significant ones, are higher up in this plot. So higher values are better here.
Okay, now we go on with the MSstats comparison result, which contains the statistical outcome. The adjusted p-value column is column 8. Before we do the filtering, we can clean up the ID a bit. If we look at the ID here, the actual UniProt accession sits between the two pipes, but before and after it there is more information that we currently do not need. So in our first step we remove all this information. We take the comparison result as input and column 1, and we want to remove this pattern. Because the pipe character is a special character, we need to put a backslash before it. We want to remove the pattern, so we put nothing into the replace field. We also want to remove the part after the second pipe, so we enter the pipe and then say that we want to remove everything afterwards, which is the dot and the star, and again we replace it with nothing.
From this file we now keep only the significant proteins, so we click on the Filter tool. There are so many filter tools that it is really helpful to use the training inside the Galaxy server, because we just need to click on the tool and know that we have the right tool in the right version. We use the replaced file; the adjusted p-value was in column 8, and we would like to keep adjusted p-values below 0.05. We have one header line; here it is. Before we continue, we rename this file to significant proteins.
Then we filter once for log2 fold changes above 0.58. These are proteins that are higher in the metastasized condition, with at least 50 percent more abundance than in the RDEB condition. We use the Filter tool again on the significant proteins; we can already copy the condition. So we filter the significant proteins for the ones that have a higher fold change in metastasized squamous cell carcinoma. Then we do the same for values below minus 0.58, which gives the proteins that are higher in RDEB. So we run the same filtering step again on the significant proteins and this time keep only these. We rename and label the files so that we don't get too confused by all these filtering steps: these are the metastasized proteins and these are the RDEB proteins. We can also give them a tag. We click on the tag symbol, and if we start the tag with a hashtag, it even gets propagated to downstream files. This one becomes the metastasized tag and this one the RDEB tag.
We can already see that we have 134 lines here and 14 lines for RDEB. But if we take a closer look, we can see that there is still one protein with a p-value of 0, because it is missing in one condition. For now we will remove these proteins. They can potentially be interesting, so it might be worthwhile to look into proteins that are completely absent in one condition but present in the other, but for now we will remove them. We do this by requiring that the p-value in column 8 is higher than zero. Because we already filtered for values smaller than 0.05, this keeps everything above zero and below 0.05. We do this for both the RDEB and the metastasized proteins, so I can click on multiple datasets and select both. Oops, it should be column 8.
Now we can continue with the Cut tool so that we only keep the IDs. We keep only the first column, which contains the IDs, and for both filtered files we cut out only this first column. Then we have the significant protein IDs for RDEB and metastasized cutaneous squamous cell carcinoma.
We would now like to combine them with the original protein abundances. To do so we also clean up the protein IDs of the sample quantification matrix. We do a similar step as before, but this time we choose the sample quantification matrix. In column 1 we would like to remove the part before the first pipe, and we would also like to remove the second pipe and everything after it. Now the IDs in the sample quantification matrix should look the same as here, and then we can actually join the files. This can be done with the Join tool. Because the two joins have exactly the same parameters and only a different input file, once the metastasized and once the RDEB cut file, we can again run steps six and seven at once. We use the replaced quantification matrix and column 1, which has only the UniProt accessions, and for the second file we use multiple datasets and select the cut RDEB and the cut metastasized file, also with column 1. Both files have a header line, and we keep only the entries that occur in both files. This gives us, for every UniProt accession of the significantly regulated proteins, the abundances in every sample.
Next we will visualize them with the heatmap tool. We don't give this a plot name. We can again put in both joined files, we don't need clustering, and we scale by row. You can see that the tags are always carried along here. Sometimes they get a bit hidden, but once you click on the tag icon you can always see them, and then you know which heatmap or which file belongs to which of the two conditions. You can also check the numbers: we have 13 lines in the RDEB file and 86 lines in the metastasized file. And here is the heatmap. We can immediately see that for the proteins upregulated in RDEB, there are definitely higher intensities in the RDEB samples than in the metastasized samples. It probably looks very similar here; yes, we can also clearly see that here the intensities are higher in metastasized than in RDEB.
As a last step, we want to find out which proteins are behind these identifiers. There are different ways to do this; we will use the UniProt tool to retrieve the protein names. We can again do it in parallel for both the metastasized and the RDEB file. We select that we would like to retrieve our entries by UniProt accession, and we obtain them as a FASTA file. Because a FASTA file is not so easy to read for many proteins, in a last step we will convert the FASTA file into a tabular file with the FASTA-to-Tabular tool. We can again run the same procedure for both files, and we split the title into two columns.
Here we can look at one of these files. By using the UniProt accession we receive the full FASTA entry, which also includes the protein name, and that is what we are interested in now. Because this is hard to read with all the sequences in between, we use the FASTA-to-Tabular output, and here we have it nicely sorted: the first column is the accession, the second column starts with the protein name, and the third column holds the sequence.
We can now have a look at these proteins. We are in the file that contains the proteins that are upregulated in the metastasized squamous cell carcinoma, and there are a few proteins that were also found in the original study, for example the X-ray repair cross-complementing protein 6 and the serum amyloid P-component. We have 12 proteins that are more abundant in the RDEB squamous cell carcinoma compared to the metastasized one. Here we see the tag again; if it doesn't appear, we just click on the tag icon. It's quite obvious that there are a lot of collagens here, here, here, and here. This is also typical for RDEB itself: because of the missing collagen VII, other collagens are upregulated, probably trying to compensate for the missing collagen VII. For one protein that we have here, collagen XIV, an immunofluorescence staining was also done in the original study, which we can see here. We see that in RDEB the intensity is much higher than in the metastasized cutaneous squamous cell carcinoma.
So this was the training. Inside the training material there are more boxes that are important when you would like to run the training on your own files, for example on how to set up the annotation file and the comparison matrix. I hope that was helpful for you. Thank you for being here, and I hope you enjoyed the training. Bye!
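
For readers who want to retrace the post-processing of the comparison result outside Galaxy, here is a minimal Python sketch of the ID clean-up, the significance filter, the split by fold change direction, and the join with the sample quantification matrix. The dictionary keys (`Protein`, `log2FC`, `adj_pvalue`), the protein IDs, and all numbers are hypothetical; the cutoffs follow the values used in the training (adjusted p-value below 0.05 and above 0, fold change cutoff 1.5).

```python
import math
import re

# A fold change of 1.5 corresponds to a log2 fold change of about 0.58.
LOG2FC_CUTOFF = round(math.log2(1.5), 2)


def clean_id(protein_id):
    """Reduce 'db|ACCESSION|ENTRY_NAME' to 'ACCESSION'."""
    without_prefix = re.sub(r"^[^|]*\|", "", protein_id)  # drop text up to the first pipe
    return re.sub(r"\|.*$", "", without_prefix)           # drop the next pipe and the rest


def split_significant(rows, alpha=0.05, cutoff=LOG2FC_CUTOFF):
    """Return accessions that are significantly higher or lower in the first condition."""
    higher, lower = [], []
    for row in rows:
        # Keep 0 < adjusted p-value < alpha; a p-value of 0 marks proteins
        # that are missing in one condition and are excluded here.
        if not 0 < row["adj_pvalue"] < alpha:
            continue
        if row["log2FC"] > cutoff:
            higher.append(clean_id(row["Protein"]))
        elif row["log2FC"] < -cutoff:
            lower.append(clean_id(row["Protein"]))
    return higher, lower


def join_abundances(accessions, matrix):
    """Keep only accessions that occur in both the ID list and the abundance matrix."""
    return {acc: matrix[acc] for acc in accessions if acc in matrix}


# Hypothetical comparison result and abundance matrix, for illustration only.
comparison = [
    {"Protein": "sp|P00001|A_HUMAN", "log2FC": 1.2, "adj_pvalue": 0.01},
    {"Protein": "sp|P00002|B_HUMAN", "log2FC": -0.9, "adj_pvalue": 0.03},
    {"Protein": "sp|P00003|C_HUMAN", "log2FC": 2.0, "adj_pvalue": 0.0},
    {"Protein": "sp|P00004|D_HUMAN", "log2FC": 0.3, "adj_pvalue": 0.001},
    {"Protein": "sp|P00005|E_HUMAN", "log2FC": 1.5, "adj_pvalue": 0.2},
]
abundances = {clean_id("sp|P00001|A_HUMAN"): [20.1, 19.8], "P00002": [15.2, 17.9]}

metastasized_ids, rdeb_ids = split_significant(comparison)
print(LOG2FC_CUTOFF, metastasized_ids, rdeb_ids)
print(join_abundances(metastasized_ids, abundances))
```
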