Hi everyone, and welcome to the proteomics hands-on training. In this training we will learn how to use MaxQuant and MSstats for the analysis of a label-free tissue cohort data set. The data comes from skin tissue samples of 19 patients who have different types of tumors. One group consists of metastasizing cutaneous squamous cell carcinoma and the other one consists of RDEB cutaneous squamous cell carcinoma, and we abbreviate cutaneous squamous cell carcinoma as cSCC here. So it's a type of skin cancer, and our objective for this training is to learn how to use MaxQuant and MSstats for the analysis of such a real label-free tissue cohort data set.

So we will start by uploading the data. Part of the data is deposited on Zenodo. We can copy everything and transfer it here to the upload button, choose paste/fetch, and then paste the links to these three files. We press start and close. If you are not familiar with Galaxy yet, please look at the Galaxy beginners lectures and hands-on training.

We then continue with uploading the raw files. Here on this hat icon you have a direct link to the Galaxy training material site. That is where we were before, so we can also copy the links from here. This is raw data deposited in the PRIDE repository, a repository that hosts publicly shared proteomics raw data. We have 19 raw data files; they take really long to load and they will also take a long time in the MaxQuant run. That's why we will also load the results of the MaxQuant run, which you can find at the end of the hands-on MaxQuant analysis box. So we will already get the MaxQuant results and upload them as well.

Because of the Zenodo link, the files now have a weird name, so I will rename each of these files. To do so, I click on the pencil and then remove the beginning of the file name. So we have here a protein database. This is an annotation file that we need for MSstats.
And then we have a comparison matrix that we need for MSstats. The raw files will take quite a while to load, so we can already look at these files here. This is a human protein database with 20,000 entries. You remember from the theory lecture that this first line is the header line, and it contains the UniProt ID, more IDs, and the name of the protein. This is followed by the actual amino acid sequence, and here the second protein starts.

Now the other two files. The annotation file is important because for the statistical analysis we want to compare our two conditions. We have the metastasizing tumor type and the RDEB tumor type, and we would like to compare the files of both groups to find differentially abundant proteins between the two groups. So it's really crucial how this annotation file is set up, and many of the MSstats errors happen here. The raw file name has to match the name in the raw file column of the evidence file. We can check that later. It's very important that this name matches what is written in the evidence file, the output of MaxQuant. Otherwise, this metadata cannot be attached to the raw file names. Then we have the condition. This is a binary comparison here; we have only two groups. The replicate column indicates that from each patient we have one sample processed. So we have 19 files and we also have 19 different patients. And because it's label-free, the 19 samples were measured in 19 different runs. When we want to perform MSstats after a MaxQuant analysis, it's important that we also add an isotope label type column to the annotation file. But this is in all cases only L, which stands for light, because we don't have any heavy spike-in peptides.

The third file is also needed for the MSstats analysis. It's called a comparison matrix or table, and in this table we specify which comparison we want to perform on our groups. In our case, there's only one possibility because we only have two groups.
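As a side note, the name matching described for the annotation file can be checked outside Galaxy before MSstats is started. The following is only a minimal sketch: the column names (`Raw.file`, `Condition`, `BioReplicate`, `Run`, `IsotopeLabelType`), the sample names, and the condition labels are assumptions for illustration and are not copied from the training files.

```python
import csv
import io

# Hypothetical miniature annotation table. Column names, sample names and
# condition labels are assumptions for illustration only.
ANNOTATION_TSV = (
    "Raw.file\tCondition\tBioReplicate\tRun\tIsotopeLabelType\n"
    "sample_01\tmetastasized\t1\t1\tL\n"
    "sample_02\tmetastasized\t2\t2\tL\n"
    "sample_03\tRDEB\t3\t3\tL\n"
)


def read_tsv(text):
    """Parse tab-separated text into a list of dicts keyed by the header."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def unmatched_raw_files(annotation_rows, evidence_raw_files):
    """Return names found only in the annotation and names found only in the evidence."""
    annotated = {row["Raw.file"] for row in annotation_rows}
    observed = set(evidence_raw_files)
    return sorted(annotated - observed), sorted(observed - annotated)


annotation = read_tsv(ANNOTATION_TSV)
# Stand-in for the distinct values of the raw file column in the evidence file.
evidence_names = ["sample_01", "sample_02", "Sample_03"]  # capital S on purpose

only_in_annotation, only_in_evidence = unmatched_raw_files(annotation, evidence_names)
print(only_in_annotation)
print(only_in_evidence)
```

In this toy example both lists are non-empty because of a single capital letter, which is exactly the kind of mismatch that prevents the metadata from being attached to the runs.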
In the comparison matrix, we compare the metastasized condition versus the RDEB condition. It's important to write the names here exactly as they were written in the annotation file so that they can be matched. The name of the comparison can be anything; it doesn't matter. I just named it to say that we want to compare the metastasized versus the RDEB cancer type. The condition that comes first gets a one, and the condition that it is compared against gets a minus one. If there were more conditions, we could leave them at zero so that we first compare only those two conditions. Then we could add another row with a new comparison, leave these conditions at zero and, for example, compare two other conditions. But here we only have two, so this table is quite short.

You can see that the raw data is still downloading. This is not typical for Galaxy trainings. Often we use cropped and very small but descriptive data files in Galaxy trainings. But in this training we use a real cohort, and we didn't crop the files, in order to get a proper statistical analysis. Therefore we don't have time in the training to wait for the MaxQuant run to finish, and so you can already get the results of the MaxQuant run on Zenodo. This is the evidence output. I rename it again. Then we have the protein groups, save, and last, the quality control report.

We need the raw files in a collection later on, so we could already try that. No, let's first rename. The renaming might not work well if the data is still uploading, but we might be lucky here. It can happen that, if a dataset is still yellow, it will not keep the new name.

So we have only six samples of the RDEB condition, but that's actually quite a lot. RDEB stands for recessive dystrophic epidermolysis bullosa. This is a rare disease in which patients have a defect in the COL7A1 gene, so they cannot produce functional collagen VII. And collagen VII anchors the epidermis to the dermis.
If this protein is not functional, the problem is that the two skin layers rub against each other, and this leads to blisters and inflammation. One of the long-term outcomes of the disease is the development of such a squamous cell carcinoma, which is then relatively aggressive. That we have six patients here is already quite a high number, because it's such a rare disease. We were lucky because the proteomics experiment was performed on formalin-fixed, paraffin-embedded tissues, and they can be stored at room temperature for many years. So for this experiment we could also make use of tissues that were collected a long time ago. The metastasizing cSCC is also not so common, because most cutaneous squamous cell carcinomas are quite benign, or at least they don't often metastasize. But in this case we were lucky to get several metastasizing cSCCs. These are sporadic cSCCs, so there's no genetic defect. They are most often caused by UV light, normal sunlight exposure over many, many years. The aim here is to find differentially abundant proteins between these two relatively aggressive cSCC types, which have quite different origins.

So we want to have the raw files in a collection; then it's easier to handle them in the history. I click on "operations on multiple datasets" and then select all, but not these three files and not these three files. Then I build a dataset list. This is "raw files". I hide the original elements and create the list. If I click here again, the boxes go away, and now it looks way nicer. So here's my collection with the 19 raw files, and if I click on it, here they are. And now they are green. On the European Galaxy server, the upload immediately assigns the right format. The format has to be thermo.raw.
In case this went wrong on your Galaxy instance, you can change it manually by clicking the pencil again, then clicking on the datatype tab, looking for thermo.raw, and pressing "change datatype". You would then need to do this for each of the 19 files. We can also give the history a name. So this is the MaxQuant MSstats label-free training.

And now we are ready to look into the MaxQuant tool. MaxQuant is pretty powerful. It can not only identify peptides and proteins, it can also quantify them, and it allows many different quantification methods. So label-free, but also many types of labels are supported. In our case, we can pretty much leave the default parameters. Here we have to choose our input type. It's possible to also load mzML and mzXML, the open standard formats, but MaxQuant requires them to be of quite a specific type, so it might be a bit tricky to get the right subtype of these XML files. So we will use the raw data in the thermo.raw file format. And here our FASTA file with the protein sequences is already pre-selected as the FASTA file input. We leave the parse rule, which determines how we split up the header of the FASTA file and which parts we keep. We don't have any fractions or PTMs, so we don't need a template.

We are quite happy with all these parameters here. We apply an FDR of 1% on the peptide spectrum match and protein level. The protein quantification is performed again in MSstats. MSstats will only use the peptide-level identifications of MaxQuant and then performs the protein quantification, or summarization, by itself. So the parameters for the protein-level quantification here don't matter that much. And here we now have the actual input that we need to choose, because in a parameter group it's possible to specify different parameters for different files. But in our case, we want to have the same parameters for all files.
In order to select the collection, we need to click here to choose the dataset collection. And now it says it's not available. That might happen if there is one dataset which was not correctly recognized, but it looks good. So I just reload MaxQuant; we left everything the same anyway. And it still says we don't have a collection. This one is wrongly recognized. This is interesting; normally it either works for everything or for nothing. So we change the datatype here. This was right, right, right. This was wrong. This is wrong. This is right, right. Okay, the others seem to be fine. So for whatever reason, there were three datasets that were not recognized correctly. Now I can go back to MaxQuant and see if we have it. Yes, now here it is, collection number 45.

We also leave all these parameters as they are. If you want to learn more about the MaxQuant parameters and their meaning, please check the beginners MaxQuant training. There's a whole section explaining the most important parameters of MaxQuant. The samples were digested with trypsin, so we have selected this. And here we could choose label-free quantification, but this actually only adds a normalization step on the protein level. Because we work with the peptide-level data in MSstats anyway, it does not matter, so we can leave the quantitation method as "none".

Included in the MaxQuant wrapper is the PTXQC functionality. It generates an automated QC report with many different and very helpful plots that describe the dataset and also, indirectly, the instrument performance. In the last step, we need to select the outputs. There are several options. For MSstats, we need the protein groups and the evidence file. We could hit start now, but the analysis takes several hours, so it's not really worth clicking the button because it will only finish many hours later.
To continue with the training, we already have the PTXQC report, the protein groups, and the evidence file downloaded from Zenodo. So we can look into them and just assume that this would also be the output of this MaxQuant run. I open the PDF with the scratchbook. This is an overview of different parameters and how they perform in the different files. Don't worry too much about red lines here. Often the cutoffs do not fit all types of mass spectrometers and all types of experiments. Many of the parameters that MaxQuant has used are written down here, and we get several overview figures.

This is an important figure. MaxQuant automatically appends potential contaminant sequences to the FASTA file. In this cohort, quite a lot of these potential contaminant proteins were found, and in some samples up to 60% of the intensity can be attributed to potential contaminants. But it's important to think about what this means and which types of contaminants are in there. Many of the potential contaminants are of course human proteins with which the experimenter could contaminate the sample, such as a piece of skin or a hair. So all the skin proteins are in the contaminant list. But here we have analyzed human skin, and therefore we expect the typical skin proteins in our sample. So in the later analysis, we will assume that the human skin proteins are actually from our sample and are not contaminants. There are a lot more plots, which you can explore.

I will go back, so I close this. Let's also have a look at the protein output. If we click on the file, we can see that we have 2,635 lines. One line is the header line, so there are 2,634 protein entries in the file. Here's the beginning of the file. We can see that some IDs are already labeled as contaminants.
That's why we have the CON prefix in front of the UniProt ID. Then there's a lot of information about peptides and unique peptides, their mass, and so on. Here are the intensities of the proteins, but we will not really care about them. What we need is this column here. Every entry that has a plus is considered a potential contaminant, and what MSstats will later do is remove all the proteins that have a plus here, that is, all potential contaminants. So what we will do next is remove the contaminants that are not human, because they actually are contaminants; they are not expected in our sample. Then we will replace the pluses that remain here with nothing. This removes the plus signs from this column, and then MSstats will not recognize that we still have a few protein IDs that come from potential contaminants. So this was column 118.

There's also such a column in the evidence file. Here we have the feature information: peptide sequences that have different charges and come from different samples. What I said before was that in the annotation file for MSstats we need to put the right name, and these here are actually the names that we need to put in. So the raw file column of the evidence file determines how the files should be named in the annotation file of MSstats. Then we also look for the contaminant column. Here it is; it's column 54. At least here at the top, there are no contaminants. So we need to remember that in column 54 we need to replace the pluses with nothing in order to remove them.

We will start by keeping only the human proteins, because all non-human proteins are probably contaminants. We will do this with the select tool. There are many different varieties of the select tool, so it's quite tricky to find the right one.
Sometimes it's also easier to just look into the different categories, because the search does not always work so well. So it would probably be under "filter and sort" or "text manipulation". No, it's not here. So the easiest option, if you're on a Galaxy server that supports that feature, is to click again on the hat icon and go to the training material. Then you can directly click on the tool and it will be selected for you. I can also already copy the pattern that we use. I click the select tool, and here it is. It's called "Select lines that match an expression".

We keep only lines of the dataset that contain either the word "human", because these are the human proteins, or "majority". "Majority" is just there so that we keep the header line, because the word "human" is not written in the header line, and here in the second column the header is called "majority". So if we filter the protein groups, which is number 24, with the majority option, we keep lines that match either "human" or "majority".

We do a similar thing with the evidence file. We also want to remove non-human proteins, which are probably contaminants, from the evidence file. We do it again with the select tool. I can click again on the training, copy the pattern that we're looking for, and press the select tool. This time we use the evidence file, and the evidence file contains the word "sequence" in the header line. Here it is; the first column is called "sequence". So we can keep the header by also selecting "sequence".

So now our dataset has lost several lines of proteins. These were non-human proteins that came from the contaminant database, and they are removed now. But we still have contaminants here in our protein list, although they are only potential contaminants. Here we can already see that this is probably a keratin, and we would like to keep it.
But because this protein has a plus in the potential contaminant column, we need to remove the plus, and this we can do with the replace tool, the "Replace text in a specific column" tool. We start with the select on data 24; that was the protein groups file. If you remember, we have seen the plus sign for potential contaminants in column 118. We look for the pattern plus and replace it with nothing. This is how we remove it.

And we do the same again. We can use the rerun button. So this is the rerun button. With this button you can first of all see the parameters that you chose in the previous run, but it also allows you to use the same parameters and the same tool for another run. So I only need to change the input here. We want number 47, the select on the evidence file. Column 118 does not exist in the evidence file; there the contaminants were in column 54. Again, we find the plus and remove it by replacing it with nothing.

Now we are ready for the MSstats analysis. If you're not sure you can remember which file is which with just "protein groups" and "evidence", you can also rename them and say this is the manipulated protein groups file, and this will be the manipulated evidence file. With these files we are now ready to go to the statistical analysis.

So we look for the MSstats tool. There are two different tools: the MSstats tool is for label-free data, and MSstatsTMT is for TMT or iTRAQ labeled experiments. So here we choose the MSstats tool for label-free data. Now we need to choose where our data comes from, and this time the data comes from MaxQuant. We could directly load the original evidence file from MaxQuant here. But then we would still have the non-human contaminants inside, and MSstats would automatically remove our human potential contaminants, which we don't consider as contaminants.
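The two clean-up steps described above (keep only human entries plus the header, then blank the plus signs in the potential contaminant column) can also be sketched outside Galaxy. This is a minimal illustration, not the Galaxy tools themselves: the four-column table, the protein identifiers, the pattern `HUMAN|Majority`, and the column index are all assumptions made for this example.

```python
import re

# Hypothetical four-column excerpt of a protein groups table. The identifiers
# are placeholders, not real accessions from the training data.
TABLE = (
    "Protein IDs\tMajority protein IDs\tIntensity\tPotential contaminant\n"
    "sp|A00001|PROT1_HUMAN\tsp|A00001|PROT1_HUMAN\t100\t\n"
    "CON__A00002\tCON__sp|A00002|KERA_HUMAN\t250\t+\n"
    "CON__A00003\tCON__sp|A00003|TRYP_PIG\t900\t+\n"
)


def select_lines(lines, pattern):
    """Keep only the lines that match the regular expression."""
    regex = re.compile(pattern)
    return [line for line in lines if regex.search(line)]


def remove_plus(lines, column_index):
    """Replace every '+' in one column with nothing; other columns stay untouched."""
    cleaned = []
    for line in lines:
        cells = line.split("\t")
        cells[column_index] = re.sub(r"\+", "", cells[column_index])
        cleaned.append("\t".join(cells))
    return cleaned


lines = TABLE.splitlines()
# "Majority" only serves to keep the header line, which does not contain "HUMAN".
human_only = select_lines(lines, r"HUMAN|Majority")
manipulated = remove_plus(human_only, column_index=3)

for line in manipulated:
    print(line)
```

In this toy table the non-human entry is dropped, while the human keratin-like entry stays and loses its plus sign, so a downstream contaminant filter would no longer remove it.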
So in MSstats we use the filtered, or manipulated, evidence file and the manipulated protein groups file, so that we make sure that we keep the human skin proteins. Here we select the annotation file, and here we can choose to use the leading razor protein column. That might give slightly fewer IDs. Then there are some transformation options for how the data is converted into an MSstats-compatible format, and here we just select "remove the proteins which have only one peptide and charge". The data process options are all good as they are. Here we just add the sample quantification matrix table, and we can also add the run-level data. Here some processing options are hidden. For the condition plot we set a zero-degree label angle, and we only choose one. Yes, we only generate a QC plot for all proteins. Otherwise we would generate a plot for each protein, and we have more than 1,000 proteins, so that's too much information, which we don't need now.

Last, we would like to do a group comparison. Yes. If we select this, we need a comparison matrix, and here it is. We load the comparison matrix. We leave the outputs. We don't even need a comparison plot, which is only helpful if one has more than one comparison. Again we can adjust the plotting options for the volcano plot here. We set a fold change of 1.5 as a cutoff, and we don't want to display the protein names in the volcano plot. Let's check that we have everything. Yes, this looks good. So we can start MaxQuant... MSstats, sorry. This is MSstats now. This takes a few minutes, so it might be a good point for a short break.

Okay, so let's have a look again at the MSstats parameters. That was everything needed for the statistical analysis. The first part is only a conversion of the MaxQuant output files into an MSstats-compatible format.
In this conversion step it's really important that the annotation file contains the right annotations and the right file names, because that's how the metadata in the annotation file is mapped to the run names in the MaxQuant output. The second part is the data processing. You might have noticed that we didn't change any parameter here. The processing consists of a log2 transformation, median normalization, feature selection, and missing value imputation. Then the feature intensities are summarized into protein intensities with the TMP method (Tukey's median polish). So these outputs all come from the feature and processing level. Only afterwards do we compare the two conditions as defined in our comparison matrix. So only the comparison result table and the volcano plot show the actual statistical modeling results.

So let's have a look at the MSstats output. First we have a log file. This is just a text file that captures information about the data analysis steps, warnings, and so on. If you click on one of these, we also see the number of proteins and the number of peptides per protein that were in this dataset. The processed data is a list of features, so we have the peptide and the charge state. For the features we have a lot of data, including the metadata from the annotation file: the group they belong to, the subject they belong to, and the run they belong to. Then for each feature we have an intensity and an abundance value. The abundance is the log-transformed and normalized intensity value. In the next step, the abundances of these features were summarized to generate an overall abundance for each protein in each run. That's the data that we can find in the run-level information. Here we now have a summary of protein intensities per run.
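To make the first two processing steps mentioned above more concrete, here is a minimal sketch of a log2 transformation followed by a median normalization across runs. It is a simplified stand-in written for this illustration, not the code MSstats runs; the run names and intensity values are hypothetical, and feature selection, imputation, and the TMP summarization are not shown.

```python
import math
from statistics import median

# Hypothetical feature intensities for two runs; run_B was measured
# with roughly twice the overall signal of run_A.
intensities = {
    "run_A": [1000.0, 4000.0, 16000.0],
    "run_B": [2000.0, 8000.0, 32000.0],
}


def log2_transform(runs):
    """Convert raw intensities to log2 abundances, run by run."""
    return {run: [math.log2(value) for value in values] for run, values in runs.items()}


def equalize_medians(log_runs):
    """Shift every run so that all run medians sit on the same reference value."""
    run_medians = {run: median(values) for run, values in log_runs.items()}
    reference = median(run_medians.values())
    return {
        run: [value - run_medians[run] + reference for value in values]
        for run, values in log_runs.items()
    }


log_abundances = log2_transform(intensities)
normalized = equalize_medians(log_abundances)

for run, values in normalized.items():
    print(run, [round(value, 3) for value in values])
```

After the shift, both runs have the same median, which is the property that the QC plot of a well-normalized data set is expected to show.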
So in the run-level data we have the intensities, and we will now calculate the distribution of the number of features per protein and run to see how many features were used on average to quantify a protein. We will also look at this column with the run identifiers and count how often each unique run entry appears, in order to find out how many features were present in which run.

To do this we use the Datamash tool. Here it is. This is a really helpful tool for performing summarizations on text files. We want to run it on the run-level data, and we need to know the column we are interested in. We can still check this here in the small window: the run name is in column eight. We do have a header line here, we want to print the header line, and we would like to sort the input. With "count" we just count the number of lines that are summarized. The other thing we would like to know is the number of features per protein. So we perform simple summary statistics, again on the run-level data, and this time on column four, the number of features per protein and run. So we change this to c4 and run this step.

In the meantime, we can explore more of the MSstats output files. The next file is a QC plot. It only wants to open like this. This plot shows the protein abundances of all proteins for each sample. We see that the medians here are pretty much on one line, which is a good sign, and the data is also relatively normally distributed. You could also generate this plot for each protein, but then the PDF file would have many, many plots. That's why we chose to only get the summary for all proteins.

Then there's the sample quantification matrix. This gives us the quantities for each sample and for each protein. So we have 19 columns here, one column per sample, and we get an intensity, or abundance, value for each protein. And then the same is here per condition.
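The two Datamash summaries described above (counting lines per run, and summary statistics on the number of features) can be sketched as follows. This is only an illustration with hypothetical run-level rows; it does not call Datamash, and the tuple layout is an assumption made for this example rather than the real column order of the run-level table.

```python
from collections import Counter
from statistics import mean, median

# Hypothetical run-level rows: (protein, run, number of features used).
run_level = [
    ("A00001", "run_A", 3),
    ("A00002", "run_A", 7),
    ("A00001", "run_B", 2),
    ("A00003", "run_B", 5),
    ("A00002", "run_B", 3),
]

# Equivalent of grouping by the run column and counting the lines per group.
lines_per_run = Counter(run for _protein, run, _features in run_level)

# Equivalent of simple summary statistics on the features column.
feature_counts = [features for _protein, _run, features in run_level]
summary = {
    "min": min(feature_counts),
    "median": median(feature_counts),
    "mean": mean(feature_counts),
    "max": max(feature_counts),
}

for run in sorted(lines_per_run):
    print(run, lines_per_run[run])
print(summary)
```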
In the condition-level table, for each protein we get a summary of the intensities for each condition. The comparison result is then the actual statistical table. It contains the protein name and the label; in our case, we only have the comparison metastasized versus RDEB. Then there is the log2 fold change, and the most important column here is the adjusted p-value column. There's also an issue column. Here is an example of a protein that was not measured at all in one condition, and therefore the issue column says that the protein is missing in one condition. Later we will filter this table for the adjusted p-value and the log2 fold change in order to find the significantly differentially abundant proteins.

The last output file we have here is the volcano plot. Here the p-value is plotted in such a way that smaller p-values give higher values in the plot, so it's easier to interpret. On the x-axis we have the log2 fold change. In the MSstats tool, we said that we would like to draw the volcano plot with a fold change cutoff of 1.5. If we take the log2 of it, it's 0.58, and 0.58 and minus 0.58 are the dashed lines here. Everything in the upper right part and in red is a protein that is up-regulated in the metastasized condition. And here in this part are the proteins that are down-regulated, which means that they are up-regulated in the RDEB condition.

In the meantime, we have a result for our summary statistics. This is the summary of the number of features per protein and run. We can see that in the median, three features are summarized into a protein, or on average around five. The Datamash output now gives us an overview of how many lines the run-level data had for each file, and because one line was one feature, this corresponds to the number of features. So here we have the numbers for each file. But we could also visualize this by clicking here on the visualize button.
And then choose the bar diagram. We can give it a name, so it's "number of features per sample". Here we can choose the data that we want, and we only need to change this here. Now we can interactively browse to see which file is which bar. We can save this, and you will find the plot if you go to User and then Visualizations. That's where your plots are stored.

Okay, so now let's continue with the actual statistical result here. We would like to filter for proteins that have an adjusted p-value below 0.05 and a log2 fold change above 0.58 or below minus 0.58. Furthermore, we would like to make the protein ID a bit easier to read, so we only keep the actual UniProt accession number and remove this part before and this part afterwards.

We start with this. We do it with the replace tool, "Replace text in a specific column". We would like to use the comparison result and column one. We want to get rid of this first part. But because the pipe sign has another meaning in a regular expression, we need to use a backslash to actually find our pipe. And because we want to remove it, we do not replace it with anything. Now we do the same with the right part. Again, we want to find a pipe and then everything that comes afterwards. And again, we want to remove it, so we don't need to put anything in here.

Next we can already start the filter tool. Clicking here is a safe way to find it. We want to run the filter tool on our replaced file here. We start by filtering for an adjusted p-value below 0.05, so it's column eight below 0.05, and we skip one header line. Now we can directly click the rerun button. We want to use the same tool again, but this time on number 61, which contains only the low p-values. Now we filter for the log2 fold change, which is column three, and we only keep the proteins that have a log2 fold change above 0.58. And we repeat the procedure.
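The ID clean-up and the two filter steps described above can be illustrated with a short sketch. Everything specific in it is an assumption for this example: the two regular expressions are one possible way to strip the parts before and after the accession, the dictionary keys stand in for columns one, three, and eight, and the proteins and values are hypothetical.

```python
import math
import re

# A fold change of 1.5 corresponds to about 0.58 on the log2 scale.
LOG2_FC_CUTOFF = math.log2(1.5)


def strip_to_accession(protein_id):
    """Remove everything up to the first pipe, then the second pipe and what follows."""
    left_removed = re.sub(r"^[^|]*\|", "", protein_id)  # the pipe must be escaped
    return re.sub(r"\|.*$", "", left_removed)


# Hypothetical excerpt of a comparison result table.
rows = [
    {"Protein": "sp|A00001|PROT1_HUMAN", "log2FC": 1.20, "adj.pvalue": 0.010},
    {"Protein": "sp|A00002|PROT2_HUMAN", "log2FC": -0.90, "adj.pvalue": 0.030},
    {"Protein": "sp|A00003|PROT3_HUMAN", "log2FC": 0.30, "adj.pvalue": 0.001},
    {"Protein": "sp|A00004|PROT4_HUMAN", "log2FC": 2.00, "adj.pvalue": 0.400},
]

cleaned = [dict(row, Protein=strip_to_accession(row["Protein"])) for row in rows]

# First filter: adjusted p-value below 0.05.
significant = [row for row in cleaned if row["adj.pvalue"] < 0.05]
# Second filter, run once per direction on the p-value-filtered rows.
up = [row["Protein"] for row in significant if row["log2FC"] > LOG2_FC_CUTOFF]
down = [row["Protein"] for row in significant if row["log2FC"] < -LOG2_FC_CUTOFF]

print(round(LOG2_FC_CUTOFF, 2))
print(up, down)
```

The third hypothetical protein is significant but falls inside the fold change window, and the fourth has a large fold change but fails the p-value filter, so neither ends up in the up or down list.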
To repeat the procedure, we again use the p-value-filtered dataset, but this time we switch the direction and look for the down-regulated proteins. We can now count the number of lines. We started with 1,297, and we have 123 that have a p-value below 0.05. And we now have eight lines that have a negative fold change. So these are the proteins that are higher in the RDEB condition. Once this other job is finished, we will know the proteins that are higher in the metastasized condition.

We will continue with both datasets, and therefore we give them a tag. When we put a hashtag in front of the tag, the tag is kept: every file derived from this file will then carry this tag. And here I think we can only do it once the job is finished. All right, here it is. Maybe it survives. You can also use a tag without the hashtag, but if you use a hashtag, the same tag is propagated through your history, and that makes it easier to track the many files that we will generate in the next steps.

That job is finished, and now we can look at the results. We see that the adjusted p-value is below 0.05 for all proteins. But many of the proteins have an adjusted p-value of zero because they are missing in one condition. So in one condition, no feature was identified and quantified for this protein. These proteins might also be of interest, so the first follow-up that we do is on the proteins that are missing in one condition. To obtain them, we will now filter on the adjusted p-value and keep only the proteins that have a zero here. This is still column 8. We can use the filter tool again. We would like to use the metastasized up-regulated proteins first, and we say that column 8 should be exactly zero. We still have a header line. Now we would only like to keep the protein IDs, so we use the cut tool on the last file that we filtered, and we keep only column 1. The other values are not important anymore.
In this ID list, we just look at the proteins that all have in common that they were not detected in the RDEB condition, but were detected in the metastasized condition. We repeat the same for the RDEB dataset. So from file 63, which has the RDEB tag, we only keep the proteins with an adjusted p-value of zero. For the next step, I use the rerun button again because it's way faster. We cut the filtered dataset and keep only the ID column. Here we can already see the ID column. It still has a header line, and because we would now like to combine both ID lists into one file, we need to remove this header line. This can be done with the "Remove beginning" tool. So from this RDEB ID list, we remove the first line.

Then we can combine them, and this can be done with the concatenate tool. We concatenate file 65, the metastasized IDs, with file 68, which contains the RDEB IDs without a header line, so we can directly attach them here. In the RDEB file, there was only one protein that was detected in the RDEB condition but not in the metastasized condition. Now that we have removed the beginning, only one line is left, and this line is attached to file 65. So we should obtain 81 lines now, and these 81 lines contain the protein IDs that were missing in one condition.

What is interesting now is to see in how many samples each protein was actually found. To do this, we go back to the sample quantification table from MSstats. There was the sample quantification matrix, and here we have an intensity value for each protein and each sample. We would like to get these values now for our proteins that turned out to be missing in one condition. Because in our ID list we have only kept everything between the two pipes, we now also need to extract only these IDs in the matrix, in order to make it possible to automatically join both files on the IDs and obtain the quantifications. So we will do the replace step. Where is it?
We have done it before. So we will rerun this step, just this time on another file. Everything else we can keep as it is, because we want to remove the part before the ID and everything after it. The input is just not the run-level data but the sample quantification matrix. After doing this, there's a tool called join, which automatically joins two files according to shared information in one column. We can already select the tool. In this tool, we can even decide whether we would like to keep the IDs that are in both files, the IDs that are only in the first but not in the second, and so on. These many options make the tool very powerful. Here we would like to use the sample quantification table. That's the one where we just replaced the protein IDs so that they have the same format as our ID list. The IDs are in column one. We would like to join this with the IDs that we have from the metastasized and the RDEB file, where the IDs are also in column one, which we can verify by looking at the data set once it's finished. In both files we should still have a header line, so we say header line, yes.
Last, we use a heatmap tool and put our data into the heatmap. The data have a header and row names: after joining, the IDs are still in the first column, and that is what is selected here automatically. We would like to change the output a bit and make the plot a bit bigger, so a width of 15 and a height of 10 should be fine. We need to wait for the concatenate job to finish, and then the join tool and the heatmap will start automatically. Okay, the jobs are finished, and we can look at the heatmap. There was only one protein that was present in RDEB samples, in four out of six, and in none of the metastasized samples.
And there were many proteins that were not detected at all in RDEB but in several metastasized samples. But there's only one protein that is actually present in every metastasized sample and in none of the RDEB samples. This UniProt ID corresponds to collagen 7, which is expected to be absent in the RDEB patients, who have the genetic disease. This shows that looking into proteins that are missing in one condition is also an important step. One might want to apply further filter criteria because there are, for example, a few proteins that are present in only four of the metastasized samples. They might not be of interest, but others that are found more frequently could actually be interesting proteins.
That was the part on the missing proteins. Now we will continue with the significant proteins that were not missing in one condition. To do so, we go back to the first filtering steps. We already had a filtering step to keep only the proteins with p-values below 0.05, and then we filtered with the fold change cutoff to find up- and down-regulated proteins. We have seen that proteins have an adjusted p-value of zero when they were missing in one condition. But now we are interested in the proteins that have a p-value above zero and below 0.05. So we will continue with file 62 and afterwards with file 63 and filter on the p-value. We filter file 62, the metastasized up-regulated file, on column 8, and we would like to have a p-value above zero. Because we have already filtered for values below 0.05, this should now give us the proteins that are significant and not missing in one condition. We repeat exactly the same for the RDEB up-regulated proteins. And then we only keep the ID column, so we use the cut tool. In the training material we are now at this point: we follow up on the differentially abundant proteins. We use the cut tool and keep column one.
Because it's exactly the same procedure, we can use the multiple data sets option here and select both files that we have just filtered for a p-value between zero and 0.05. From these we cut column one to obtain the protein IDs. Here we can already see how many proteins we have: seven lines means we have six proteins that are up-regulated in RDEB, and we have 36 that are up-regulated in metastasized cSCC. Now we would also like to visualize their quantities. So we will again match the IDs with the sample quantification matrix. But this time we do it separately for RDEB and metastasized, so we don't concatenate the IDs and make only one heatmap, but we repeat the procedure twice.
So we would like to have the sample quantification matrix, but in the version where we replaced the IDs before. That was file 70, where we had the correct IDs and then the quantities for each sample, and the IDs are in column one. We should be able to do the same trick again because we do it for the RDEB and the metastasized files: we select both files here and obtain the joined data once for the metastasized IDs and the sample quantification, and once for the RDEB IDs and the sample quantification. The files all have a header line, and we need to select a column, which is column one for both files.
This time we use a different heatmap tool, just because it looks nicer. We could again run it in parallel for both files, but not if we want to give each plot a title; then we do it separately. If you want to use a plot title, this one would be the up-regulated proteins in metastasized cSCC. You can have a look here: column one and then the quantities. We don't want to use clustering, and we would like to scale the data. Now we can use the rerun button; we just need to change the input file to the RDEB file and the plot title to RDEB. Okay.
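The join step used above, which keeps only the rows of the quantification matrix whose ID also appears in an ID list, can be pictured with the following Python sketch. It is a simplified stand-in for the Galaxy join tool, not its implementation; the function name, sample names, IDs, and intensity values are invented for the example, and both inputs are assumed to carry a header line with the IDs in column one.

```python
def join_on_first_column(matrix_rows, id_rows):
    """Keep the matrix header and every matrix row whose first column
    appears in the ID list (both inputs have a header in their first row)."""
    wanted = {row[0] for row in id_rows[1:]}
    joined = [matrix_rows[0]]
    joined.extend(row for row in matrix_rows[1:] if row[0] in wanted)
    return joined


# Hypothetical sample quantification matrix and ID list.
MATRIX = [
    ["Protein", "sample_1", "sample_2"],
    ["P00001", "20.1", "19.8"],
    ["P00002", "NA", "18.2"],
    ["P00003", "21.0", "NA"],
]
IDS = [["Protein"], ["P00003"], ["P00001"]]

joined = join_on_first_column(MATRIX, IDS)
for row in joined:
    print("\t".join(row))
```

Running the same function once per ID list mirrors what happens when two ID files are selected as multiple data sets in the tool form.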
But now we are still at the level of the protein IDs, and to have something more meaningful than IDs we would like to know the protein names. There are many different ways to do this. There's a UniProt tool in Galaxy, the UniProt ID mapping and retrieval tool. Here we put in our joined metastasized file, use the first column, which is the ID column, and retrieve the UniProt entries as a FASTA file. We repeat the same for the RDEB file, column one. Because we now have a FASTA file, it is quite tricky to read because of all the amino acid sequences. So as a last step we use the FASTA-to-tabular converter. We can again convert both files in parallel, and we split the header into two parts. That gives us a table with the protein names of the up- and down-regulated proteins.
In the meantime we can inspect the heatmap. At least one heatmap is finished. This is the heatmap for the proteins up-regulated in metastasized cSCC, and we clearly see that they have higher intensities than in the RDEB samples. We will see the opposite in the other heatmap once it is finished. And here's already one result: these are the proteins up-regulated in metastasized cSCC. We have now split the FASTA file so that the first column is the ID again, in the second column we can read the actual protein name, and in the third column comes the amino acid sequence. Here is, for example, vitronectin, which was also found in the original study. But in the original study the MaxQuant parameters were different and a different statistical approach was used. We also see several histones here and RNA-related proteins. Okay, so we just need to wait to see the remaining results.
If you want to analyze your own data with MSstats, the really crucial part is to set up the annotation file correctly, because this is the information that MSstats uses to decide how to fit the linear models to the data.
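Coming back to the FASTA conversion described above, the idea of turning a FASTA file into a three-column table (ID part of the header, rest of the header, sequence) can be sketched in Python as follows. This only illustrates the idea and is not the code of the Galaxy converter; splitting the header at the first whitespace is an assumption, and the entries and sequences in the example are made up.

```python
def fasta_to_rows(fasta_text):
    """Convert FASTA text to (first header word, rest of header, sequence) tuples."""
    rows = []
    header, seq_lines = None, []
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                rows.append(_make_row(header, seq_lines))
            header, seq_lines = line[1:], []
        else:
            seq_lines.append(line)
    if header is not None:
        rows.append(_make_row(header, seq_lines))
    return rows


def _make_row(header, seq_lines):
    # Split the header into two parts at the first run of whitespace.
    parts = header.split(None, 1)
    first = parts[0] if parts else ""
    rest = parts[1] if len(parts) > 1 else ""
    return (first, rest, "".join(seq_lines))


# Placeholder entries; IDs, names, and sequences are not real proteins.
EXAMPLE_FASTA = """>sp|P00001|EX1_HUMAN Example protein one OS=Homo sapiens
MKT
AAG
>sp|P00002|EX2_HUMAN Example protein two
MSS
"""

table = fasta_to_rows(EXAMPLE_FASTA)
for row in table:
    print("\t".join(row))
```

With the header split in two, the second column carries the readable protein name, which is the part of the table that is inspected in the video.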
And as you have seen, there are many text manipulation tools in Galaxy, so everything that you can do in Excel is also doable in Galaxy. The most complicated part is probably knowing the exact name of the tool. But we have already used a lot of tools here, so you already know some of them.
Now the tools are finished. We can look at the heatmap here, and we see that these are the proteins up-regulated in the RDEB condition; they have higher intensities than in the metastasized condition. Now we also have the tabular file with the protein names. We find two collagens in the list of proteins that are more abundant in the RDEB condition than in the metastasized condition. That could be explained by a compensation effect: when collagen 7 is missing, other collagens become up-regulated. For collagen 14, immunofluorescence stainings were also done in the original publication, and they confirmed that the abundance of collagen 14 is much higher in RDEB cSCC than in metastasized cSCC.
So that was the training. I hope you have learned something and that you will be able to repeat the training on these data sets and maybe in the future also on your own data. Thank you very much for joining this hands-on video demonstration.