Welcome to the MOOC course on Introduction to Proteogenomics. In the last lecture, Dr. Mani provided you an overview of the Protigy software. Features of Protigy include data normalization and filtering, data quality-control checks, marker selection, interactive visualization of results, integration of protein-protein interaction databases, the ability to save your analysis sessions and share them with fellow researchers and collaborators, and finally the export of results into output files such as Excel or PDF formats. In today's lecture, continuing from the previous session, Dr. Mani will show how to implement log2 transformation, normalization, data filtering, and selection of a statistical test in Protigy. He will also discuss the output summary and data visualization in Protigy, and will show how Protigy can present your results in multiple ways, such as PCA plots, scatter plots, and volcano plots. So, let us welcome Dr. Mani for today's session.

I want to show you what it can do, so that you can explore things either with the given data or, if you are adventurous enough, with your own data. So, let us just go through it. Pick PAM50, and you will get the list of groups and how many samples are in each group; then click OK. You then get to the main page of Protigy, where you can decide what you want to do to your data. The first column is log transformation. Suppose you had only ratios and had not log-transformed them; then you can say you want to transform the data using log base 2 or log base 10, and clicking the appropriate option would transform your data. The data we have here is already log2-transformed, so we need to leave it at "none". You do not want to log-transform already log-transformed data: what is the log of a negative number? It is not defined.
So, half of your data would get thrown away if you tried to log-transform again, because once you have log-transformed, the up-regulated values are positive and the down-regulated ones are negative. If you log-transform again, all the negative values become missing, because the log of a negative number is not defined. So you do not want to do that.

Then, we discussed many different ways of normalizing, and some of those are implemented here. If you pick "median", it just median-centers and does not scale. If you pick "median-MAD", it median-centers and MAD-scales. If you pick "two-component", it performs the two-component normalization that we went through, and quantile normalization is also available if you really want it. If your data has already been normalized, you would pick "none". In this case, the data set I provided is intentionally not normalized, so you will want to normalize it. You could try two-component if you want to see it, but two-component normalization takes a while: each sample takes about a minute, and we have about a hundred samples, so it would take half an hour or so. Let us not try that; median-MAD is a reasonable alternative. For this exercise we will use median-MAD; you can try two-component later, perhaps in the evening.

Now, this data set has about 15,000 proteins that came out of Spectrum Mill with basically very minimal filtering. If a protein was missing in 90 percent of the samples, it is still here; if a protein did not change in any of the samples, it is still here. So you might want to consider implementing some kind of filter. Ideally, in our analyses we use a missing-value filter, where proteins that are missing in too many samples are excluded. We do not have that filter here, but a reasonable filter that I am going to use is the standard deviation filter.
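To make these two points concrete (Protigy itself is an R application; this is an illustrative Python sketch with made-up ratio values, not Protigy's code): re-applying log2 turns every negative, already-logged value into a missing value, and median-MAD normalization centers a sample at its median and scales it by its median absolute deviation.

```python
import math
import statistics

# Already log2-transformed ratios: up-regulated values are positive,
# down-regulated ones negative (hypothetical values for illustration).
log_ratios = [1.5, -2.0, 0.25, -0.75]

# Taking the log again is undefined for the negative entries, so roughly
# half of the data would turn into missing values (NaN here).
relogged = [math.log2(x) if x > 0 else float("nan") for x in log_ratios]

# Median-MAD normalization: center the sample at its median and scale it
# by its median absolute deviation, a robust analogue of mean/SD scaling.
def median_mad_normalize(values):
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    return [(v - med) / mad for v in values]

normalized = median_mad_normalize(log_ratios)
```

After normalization the sample's median is 0 and its MAD is 1, which is what makes differently scaled samples comparable.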
What it does is take each protein, look at its measurements across all the samples, and calculate the standard deviation. It then ranks all the proteins by standard deviation and keeps some of them. Which do you want to keep: the most-varying ones or the least-varying ones? I hear "least"; does anyone vote for "most"? Everybody thinks we should keep the least-varying proteins. But if you have a protein that never changes in any of your subtypes, would you want to keep it? No. You want to keep the proteins that are most varying, because those are, hopefully, your markers. Your ideal marker is absent in one group and at a high level in another group, so it has a high standard deviation. Therefore you want to keep the proteins that vary more, with high standard deviation. Here you can pick what fraction of proteins you want to keep; just to make things fast, I will pick 50. Usually 50 is an aggressive number: you would more typically throw away only the bottom 10 percentile, but I am throwing away the bottom 50 percentile just to have fewer proteins in the analysis. You will still have about 7,000 proteins.

Then the question is what kind of test you want to do: a one-sample test, a two-sample test, or a moderated F test? Here I am going to pick two-sample, and I will tell you why. If you pick two-sample, with PAM50 as the annotation, you can compare two of your PAM50 classes. To decide which two you want to compare, you click on "select groups". When you click on "select groups", it lists all the comparisons that are possible given the annotation you picked: basal versus HER2, basal versus luminal A, basal versus luminal B, and so on.
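The standard deviation filter described above can be sketched as follows. This is a hypothetical Python illustration with invented protein names and values, not Protigy's implementation.

```python
import statistics

# Hypothetical protein-by-sample matrix (4 proteins x 5 samples).
proteins = {
    "P1": [0.1, 0.1, 0.1, 0.1, 0.1],    # never changes -> SD of 0
    "P2": [2.0, -1.5, 1.8, -2.2, 0.5],  # varies a lot -> high SD
    "P3": [0.3, 0.2, 0.4, 0.3, 0.2],
    "P4": [1.0, -1.0, 1.2, -0.8, 0.9],
}

def sd_filter(data, keep_fraction=0.5):
    """Rank proteins by standard deviation across samples and keep
    only the most-varying fraction (the potential markers)."""
    sds = {name: statistics.stdev(vals) for name, vals in data.items()}
    ranked = sorted(sds, key=sds.get, reverse=True)
    n_keep = max(1, int(len(ranked) * keep_fraction))
    kept = set(ranked[:n_keep])
    return {name: vals for name, vals in data.items() if name in kept}

filtered = sd_filter(proteins, keep_fraction=0.5)
```

With `keep_fraction=0.5` the two most-varying proteins survive, and the flat-line protein is dropped, which is exactly the behavior argued for in the lecture.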
So, I am just going to say I want basal versus luminal A, and I will deselect all the others. You can do any number of these, or all of them if you want, but the results will be more confusing and harder to interpret; you can pick which comparison you are looking at at any point, and I will just look at one. When you display the data as a heat map, it can show you annotation tracks. Say you want to see which samples were ER-positive versus ER-negative, and so forth: you can keep that information and show it in the plots if you want. That is the annotation track selection. I am going to remove the annotations that differ for basically every sample, like sample ID, experiment, TMT channel, and sample name; I will also remove QC status, and just keep the ER, PR, and HER2 status and the mutation status for the three genes that are there, so we can look at those. Then click "update" to make sure the selection is registered, and close this.

Now you are ready to run your analysis. We pick median-MAD again; that reset was not supposed to happen, and I am not sure why it did. Let us just make sure the selected groups are still there; yes, they are, and all the rest is fine. So, double-check everything and then click "run analysis"; it will take about half a minute, and it tells you at the bottom what it is doing. It is applying the standard deviation filter now, after normalization, and then it will do the two-sample t-test, and you will get a page that shows the results. While we are waiting, I will keep talking about what you will get at the end of this. Actually, it is going through; it is running the two-sample test now. I think it is done: here are the results. You will get a page like this. The top of the page gives you a summary of the data: you had 15,369 proteins and 5 groups, and the number of expression columns is 55.
Remember, we had about 105 samples, I think, but we did a basal versus luminal A comparison, and the total number of basal and luminal A samples together is 55. Then the workflow shows what log scaling, what normalization, what filtering you used, and what test you ran. We filter the results to look only at things that are statistically significant after adjusting the p-value; the p-value adjustment is the Benjamini-Hochberg FDR correction, which is the only one we have, and it is always applied. The result is that there are 476 markers of basal versus luminal A in the filtered data set that are statistically significant at an adjusted p-value of 0.05. Maybe you changed your p-value threshold, or used a different normalization, a different test, or different groups; then, obviously, your numbers will differ.

The bar plots here show how many proteins were observed in each of your samples. The number of proteins in this sample was about 12,000, the second sample had a little more, and so on. The red bars are the basal samples and the green bars are the luminal A samples. Suppose you looked at this and all the red bars were full height while all the green bars were half the size: then you would be very worried, because some batch effect, or some effect somewhere, is consistently yielding fewer observed proteins in the luminal samples only. This is basically a QC check to make sure that nothing is grossly wrong with your data.

This next plot shows how many missing values there are in each of the samples. There are about 8,000 proteins that are seen in every sample: all samples have those 8,000 proteins observed. When you take into account proteins with up to 50 percent missing values, that is, proteins not observed in up to half of the samples, there are in total about 12,000 proteins in that category of less than or equal to 50 percent missing.
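The Benjamini-Hochberg FDR adjustment that is always applied to the per-protein p-values can be sketched like this. The step-up procedure itself is standard; the p-values below are invented for illustration, and this is not Protigy's own code.

```python
# Benjamini-Hochberg FDR adjustment of a list of p-values.
def bh_adjust(pvalues):
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, scaling each by n/rank and
    # enforcing monotonicity of the adjusted values.
    for rank in range(n - 1, -1, -1):
        i = order[rank]
        running_min = min(running_min, pvalues[i] * n / (rank + 1))
        adjusted[i] = running_min
    return adjusted

# Hypothetical per-protein p-values from a two-sample test.
pvals = [0.001, 0.008, 0.039, 0.041, 0.27, 0.60]
adj = bh_adjust(pvals)
significant = [p <= 0.05 for p in adj]
```

Note how the two raw p-values just under 0.05 are no longer significant after adjustment; this is exactly why the marker count is reported on adjusted p-values.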
So, this basically shows the rate of missing values: how many you have in the average sample and how many proteins are observed in all the samples. It is just to make sure that you do not have too many missing values.

The other thing to look at is the clustering tab, which can generate a heat map. Here is a heat map; you can see the annotation bars on top. Red is basal, green is luminal A, and then there are the other annotations on this side. I think my screen is too small to see them, but these tracks are the ER, PR, and HER2 status, and you can see that ER, PR, and HER2 are basically all negative in the basal samples, because triple-negative samples generally fall into the basal category. The other thing you notice here is that the black marks are missing values. Remember, we kept missing values, and the test that we use can actually handle them, so we did not fill them in, and this display shows which values are missing for each of the proteins. These are all the statistically significant proteins that differ between basal and luminal A. You can see at the top that this line is almost completely black, with one or two entries here and there. That says the protein was present in only a couple of samples, and because those few values were reasonably different between basal and luminal, the statistics calls it significant. Would you really believe it? I would not. This is a strong indication that you really need to filter out proteins that are missing in too many samples. If you did that, this block would get chopped off somewhere around here, and after that it is fine: you have missing values here and there, and that is OK. But if your conclusion rests on two samples in one group and one sample in the other, with all the rest missing, that is not what you want.
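A missing-value filter of the kind recommended here (which, as noted, this session's workflow does not apply) could look like the following Python sketch; the protein table and the 50 percent threshold are hypothetical choices for illustration.

```python
# Drop proteins that are missing (None) in more than a chosen
# fraction of samples.
def missing_value_filter(data, max_missing_fraction=0.5):
    kept = {}
    for name, values in data.items():
        missing = sum(1 for v in values if v is None)
        if missing / len(values) <= max_missing_fraction:
            kept[name] = values
    return kept

# Hypothetical proteins: "B" is the almost-all-black heat map row.
proteins = {
    "A": [1.2, None, 0.8, 1.1],    # 25% missing -> kept
    "B": [None, None, None, 2.0],  # 75% missing -> dropped
    "C": [0.5, 0.4, 0.6, 0.5],     # complete -> kept
}
filtered = missing_value_filter(proteins, max_missing_fraction=0.5)
```

Protein "B" is exactly the situation criticized above: a significance call resting on one or two observed values, which this filter removes before testing.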
So, by visually exploring your data you can get a feel for what is happening and whether your analysis is reasonable, and you can constantly make sanity and quality checks to ensure that your analysis is sound and the results you are getting are reliable. You can play around with a lot of settings to make the display look the way you want; I will not go through all of those, but the other thing you want to look at is volcano plots. Are you familiar with volcano plots? No? OK, a couple of you are. A volcano plot plots fold change on the x axis against statistical significance, the p-value, on the y axis. On the x axis we have log fold change: negative means down-regulated and positive means up-regulated, but since this is comparing basal versus luminal A, a protein on the left side is up in basal and a protein on the right side is up in luminal A, and the farther it is from the x axis, the more statistically significant it is. P-values decrease as they become more significant, but we plot the negative logarithm, so significant points go up, which is visually quite impactful. If a point is far out at the top, it is a statistically significant protein, and anything beyond the threshold of 0.05 adjusted p-value is marked in red. Remember, it said there were 400-something significant markers; you are seeing all of those markers here, and you can see which ones are up in basal and which are up in luminal A. If you click on a point, it will tell you which marker it is. So if you click there, it will hopefully tell you; ah yes, there. That protein's gene is AGR2, and the protein's RefSeq ID is shown over there. Now, if you look at the breast cancer paper, we said that for p53 there is some up-regulation in basals and down-regulation in luminal A.
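The coordinates behind a volcano plot can be computed as below. The gene names, fold changes, and p-values are invented for illustration; only the transformation itself (x = log2 fold change, y = negative log10 of the p-value, red above a 0.05 threshold) follows the description above.

```python
import math

# Hypothetical per-protein results: (gene, log2 fold change, p-value).
results = [
    ("EGFR", -2.1, 1e-8),   # negative x: up in basal in this layout
    ("AGR2", 1.8, 1e-6),    # positive x: up in luminal A
    ("GAPDH", 0.05, 0.8),   # near zero and not significant
]

points = []
for gene, log2_fc, p in results:
    y = -math.log10(p)      # smaller p-value -> larger y, so hits rise
    is_hit = p <= 0.05      # would be drawn in red on the plot
    points.append((gene, log2_fc, y, is_hit))
```

A p-value of 1e-8 lands at y = 8, far out at the top, which is why the strongest markers are the most visually prominent points.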
If you look at ERBB2, it is kind of similar in both basal and luminal A, and we are comparing basal-like and luminal A in this comparison. If you looked at p53, it should be up in basal and kind of down in luminal A, and similarly with PIK3CA. And I think another protein known to be up in basals is EGFR. We can take a look at all of those here by doing the following: go to protein-protein interactions, click on the plus sign, deselect everything under source data, and then type in the name of your protein. You want EGFR; it gives you the protein with the gene name, and you click on it. If you go here, you can see EGFR is way over there: it is statistically significant and up in basal, as we expected. Can you select multiple proteins? No. The other thing you can do is TP53; pick that, and you can see it is also up in basal, but not as statistically significant as EGFR. So you can explore your proteins, and you can export the list of statistically significant markers and look at it; there is an export function you can use. And if you look at the table, you can see the actual values: the adjusted p-values, the average expression, the log fold change, all the information you might want to include in your paper or import into some other software is here, and you can export it as a table to work on. If you want to see how a given protein measures in one sample versus another, you can do a scatter plot: this sample versus that sample, showing all the proteins measured in those two samples. You can see most of them are similar, and some are more extreme. These are all basically ways to look at the data and get a feel for what is happening. The other thing I want to show is PCA.
So, when you do PCA, this is the plot you get, colored by basal versus luminal A. If you draw a vertical line, it essentially separates basals from luminals; that says this is the most dominant signature in the data. But you can also see that there is one red dot in the middle of all the greens and one green dot in the middle of all the reds, so there are two samples that behave like the other group. What are these? You might want to go and explore. In clustering, if you look at the heat map, you may be able to see this too, but here it is much more striking, and now you can ask what happened with those samples: are they mislabeled, and why are they behaving like the other group? So you might want to go and explore those. These are all tools to generate hypotheses, so that you can explore further biologically and build a biological story. It is not as if this is going to build the story for you and write the paper, but it gives you tools to look at the data from a biological perspective. And the tools are set up so that people who do not program, who are primarily biologists or experimentalists, can use them; that is the whole point of Protigy. People who do not do R programming can use the results of other people's R programming.

I think I will stop there; the QC section has many plots, but let us just look at the box plots. This shows what happens with normalization: on the left side are the box plots before normalization, on the right side are the box plots after normalization. I think the screen is so small that it is all squished, but you can see this sample was adjusted quite a bit by the normalization to bring it into agreement with all the others, while the other samples were adjusted to a lesser degree.
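The PC1 axis that the PCA plot projects onto can be sketched with a minimal power iteration on the covariance matrix. This is an illustrative Python version, not Protigy's code, and the four "samples" over two "proteins" are invented so that two groups separate cleanly along the first component.

```python
# First-principal-component scores via power iteration.
def pc1_scores(samples, iters=200):
    n, d = len(samples), len(samples[0])
    means = [sum(row[j] for row in samples) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in samples]
    # Sample covariance matrix (d x d).
    cov = [[sum(centered[i][a] * centered[i][b] for i in range(n)) / (n - 1)
            for b in range(d)] for a in range(d)]
    # Power iteration converges to the dominant eigenvector, the PC1 axis.
    v = [1.0] * d
    for _ in range(iters):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    # Project each centered sample onto PC1: the x-axis of the PCA plot.
    return [sum(centered[i][j] * v[j] for j in range(d)) for i in range(n)]

# Two "basal-like" and two "luminal-like" samples (hypothetical data).
samples = [[2.0, 0.1], [1.8, -0.1], [-2.1, 0.0], [-1.9, 0.2]]
scores = pc1_scores(samples)
```

The two groups end up on opposite sides of zero along PC1, which is the "vertical line separates basals from luminals" observation in the lecture; a sample whose score lands on the wrong side is exactly the kind of outlier worth investigating.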
So, you can get a feel for whether there are samples that needed extreme normalization factors to bring them into agreement with all the others. In that case, you might want to check whether those samples had any issues, whether there was less material for that sample for whatever reason, or whether those samples simply did not work out and failed for some reason. If you can show that they failed for some experimental reason, you can throw them out and redo the analysis; it will be more robust with fewer offending samples. But you should not throw away samples simply to get a better result; only if there is an experimental reason why a sample failed should you remove it and redo the analysis. That is about it; I will stop there, and if there are any quick questions, I will answer them.

So, here the red and green were basal versus luminal A, yes. Now, if you have an Excel file or a CSV file, you can go to Morpheus; let me bring it up here. If you go to Google and search for "Morpheus Broad", you will get the website of the software, which is called Morpheus. You will see "select file or drag and drop file"; take your Excel or CSV file and drop it in there, and it will open it and show a heat map. You can add annotations using other files if you want, save it as a GCT 1.3 file, and then use it in Protigy. You can also use CSV files directly in Protigy, but when you load one, it is going to ask you to annotate the samples: it creates a template, you fill the template with your annotations, and then load the template back in. For a hands-on session that was a little more complex, so we did not do it, but it is also possible. Once you have done that in Morpheus, you get a top bar with a menu, and there is a way to export the table; it will ask you what format, and you pick GCT 1.3 for Protigy.
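A minimal writer for the GCT 1.3 layout mentioned above might look like the following. This is a hedged sketch based on my reading of the format (a version line, a dimensions line, a header row, column-annotation rows, then data rows); double-check against the Morpheus documentation before relying on it, and note that the sample and gene names below are invented.

```python
# Hedged sketch of a GCT 1.3 writer (assumed layout):
#   line 1: "#1.3"
#   line 2: <data rows> <data cols> <row-metadata cols> <col-metadata rows>
#   line 3: "id", any row-metadata names (none here), then sample ids
#   then one line per column annotation, then one line per protein.
def write_gct13(path, sample_ids, groups, matrix):
    lines = ["#1.3",
             f"{len(matrix)}\t{len(sample_ids)}\t0\t1",
             "id\t" + "\t".join(sample_ids),
             "group\t" + "\t".join(groups)]
    for protein, values in matrix.items():
        lines.append(protein + "\t" + "\t".join(str(v) for v in values))
    with open(path, "w") as fh:
        fh.write("\n".join(lines) + "\n")

# Invented sample/gene names and values, purely for illustration.
write_gct13("demo.gct", ["S1", "S2"], ["basal", "lumA"],
            {"EGFR": [1.2, -0.3], "AGR2": [-0.8, 1.1]})
```

The "group" annotation row plays the role of the sample annotations (like the PAM50 classes) that the analysis uses to define comparison groups.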
So, go to GitHub and search for Protigy; you will get the Broad Institute Protigy repository. There is a directory there called "docs", with an introduction, data formats, installation instructions, and some documentation, and I think on the main page there is also some documentation on what it supports. In Protigy itself, if I refresh, the first page has the same information, and when you load a data set it also tells you, for each operation, what it does. So there is some documentation, but it is not like Thermo documentation, because we are not paid like Thermo. Yes, you need an account to sign in to GitHub; it is free, but you have to sign in. I am sure you have a lot more questions and that some things were a little unclear, but the more you explore on your own, the more you will remember what you discovered; I could give you step-by-step instructions for everything, but it would stick less.

I hope the last two lectures, and especially today's session, were helpful for you to get a glimpse of how to use the Protigy software for analysis as well as visualization of your data. We also learned that the Morpheus software can be used to prepare an annotated file that can then be used in the Protigy analysis. Apart from the annotation, you can also visualize your data using heat maps, or explore further with interactive tools like Morpheus. A variety of tools are available in Protigy to give you a visual glimpse of what is lying in your big omics data set. I hope these sessions are giving you not only a basic understanding of the various concepts involved in looking at your data, but also open-access tools that you can start using right away, on any data set available from public databases or on your own data. In the next lecture we will have a guest speaker, Dr. Debashis, who will talk about Pro-2Mess data analysis. Thank you.