Welcome to the MOOC course on Introduction to Proteogenomics. In the previous lectures you have seen how new technologies like genomics and proteomics can generate big data sets. However, to obtain meaningful insights from these data you need to utilize various statistical and computational tools. Today Dr. D. R. Mani from the Broad Institute is going to give you the first lecture, on data normalization, which is a very important aspect of omic data analysis. There is a lot of variation introduced during manual sample preparation steps, as well as artifacts that might arise from run issues, day-to-day instrument variability, or batch-to-batch sample variability. Knowing how to correct for these sources of variability is very crucial. Normalization techniques allow simultaneous correction of the various issues caused by instrumentation, such as, in the mass spectrometry context, the ionization efficiency of the detected peaks, and more accurate quantitative values can be obtained after normalization. Dr. Mani will explain different strategies that can be employed for normalization, such as quantile normalization, median normalization, median-MAD normalization and many other methods. So, let us welcome Dr. Mani for his lecture on data normalization.

We will start with data normalization. So, what is normalization? Normalization is transforming data so that they are comparable. Say you ran 20 TMT reactions, so you have 20 10-plexes and you got data from all of them. Now you want to compare them all together, but for whatever reason each one is slightly different. So you want to put them on the same scale so that you can compare them. That is the purpose of normalization.
So, what do you hope to accomplish by doing it? You can adjust for sample-handling deviations: for one sample you pipetted slightly more, for another slightly less, or the temperature on one day was different from another, so you got something that was slightly different. These slight sample-to-sample differences can be taken care of. You could also have slight differences in experimental conditions; those are batch effects, which I will get to separately, but if the batch effect is very small and sample specific you may be able to remove it using normalization. The whole purpose is this: you have peptides or proteins that you are interested in, and those are your signal, the things that differ between your different types of cancer or between cancer and control. Things that are really different are called signal because that is what you are looking for. But because of the measurement process and technical variation (and, to a lesser extent, biological variation), a lot of noise is introduced into your data. The purpose of normalization is to minimize noise and maximize signal. I have slides about this later, but it is a point worth repeating: minimizing noise and maximizing signal assumes you know what noise is and what signal is. In most real-world experiments you do not know where the line dividing noise from signal lies. So, in trying to remove noise, if you over-correct you will remove signal, and if you under-correct you will retain more noise than you want. This is a process you have to carry out carefully, being thoughtful about what you are doing and how it affects the questions you are going to ask of the experiment.
So, there is usually no one procedure that works all the time. You have to look carefully at what your experiment is and what questions you want answered from it, and then make sure the normalization process is appropriate for that. There are many ways of doing normalization; the simplest is called centering. Remember that we are talking about log ratios, so what I am showing here is a histogram of log ratios: how many ratios have value 0, how many have value 1, and so on, with each color being a sample. As a mock-up I am showing three samples and the distribution of log ratios for each. You can see that the green one is centered around 0.5, the red one around 0, and the blue one around approximately -0.5. That means the average ratio was slightly off, which means the amount of material you put in was slightly off. One thing you can do is line up all the peaks; that is centering. You subtract the mean (or the median) from each one and they all line up. If you use the mean you will be affected more by extreme values, or outliers, so your actual biomarkers will influence how much you shift, which you probably do not want. In that case you would use the median, which is much more robust: it effectively ignores outliers and uses the central part of the distribution to decide how much to shift. The second operation is scaling. You can see that the green distribution is much fatter than the red one; the red one is a thin, tall peak whereas the green one is spread out. To make things more comparable, it would be helpful in many situations to have the spreads be equal as well.
In other words, you want to scale so that the standard deviation, the fatness of the distribution, is essentially the same. To do that you divide by the standard deviation, or by a more robust measure called the median absolute deviation, abbreviated MAD. When you divide by one of these values, the spread also becomes the same. Doing both centering and scaling is called standardization: you first center by subtracting the mean or median, and then scale by dividing by the standard deviation or median absolute deviation. If you take the three samples at the top and standardize them, you get what you see at the bottom: the centers are lined up, the spreads are the same, the samples are now normalized and you can compare them. The question is that after centering and scaling, the distributions at the bottom still do not line up exactly. Well, it may not be exact. Not all your samples are exactly the same, and the distributions may not be identical in shape. If they were all perfect theoretical Gaussian distributions they would line up exactly, but these are real data, with kinks in the middle, not exactly normally distributed, and we are applying a transformation that assumes they are roughly normal. The only outliers that would really affect you are the ones at the extreme tails, because those are your markers: the ones that are most different in your samples.
So, if the log2 ratio is 2, that protein was fourfold upregulated compared to your reference, whereas if it is on the negative side the protein was significantly downregulated compared to your reference. The reference is in effect the average, so something very different from the average is what is going to affect your analysis. Little mismatches in the middle are not too much of a problem, but you need to be careful about the tails, and we will talk a lot more about tails when we get to the other two types of normalization listed there. In terms of terminology: when people say they z-scored their data, z-scoring means standardization by mean centering and standard-deviation scaling. If you use the median version, median centering with MAD scaling, it is called median-MAD normalization; there is no other special name, people just describe it. There are two other types of normalization listed here: quantile normalization and two-component normalization. We just discussed that what really matters in these distributions are the tails: all the proteins of interest to you are in the tails. On one side are the most upregulated proteins in your set of samples, on the other side the most downregulated, and those are the things you are going to be interested in. You want to make sure your normalization process does not disturb them, to the degree possible. So, let us look at these two other procedures.
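Before moving on, the centering and scaling operations just described can be sketched in code. This is a minimal illustration, not the lecture's actual pipeline; the 1.4826 factor makes the MAD consistent with the standard deviation for normally distributed data:

```python
import numpy as np

def zscore(x):
    """Standardization: mean centering followed by standard-deviation scaling."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

def median_mad_normalize(x):
    """Robust standardization: median centering followed by MAD scaling.
    The median and MAD largely ignore outliers in the tails."""
    x = np.asarray(x, dtype=float)
    med = np.median(x)
    mad = 1.4826 * np.median(np.abs(x - med))  # consistent with sd for normal data
    return (x - med) / mad
```

On a sample with one extreme log ratio, the median-MAD version leaves the outlier far out in the tail, whereas the mean and standard deviation used by z-scoring are pulled toward it.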
Quantile normalization takes two or more distributions and makes them statistically identical. It sorts all the values within each sample, and then for each rank it replaces the value with the median, across all your samples, of the values at that rank; for instance, the normalized value of the highest measurement in a sample is the median of the highest values seen in all the samples. Let us say you had two different types of breast cancer samples in your study. One of them had a protein that was highly upregulated, and in the other that protein was not upregulated. In this case the highly upregulated protein is going to be squashed to conform to the median, because it was not upregulated in the other sample. Well, that is not completely true, because you are sorting just by value. Suppose one sample did not have much happening, say a normal sample where everything was around average, and the other was a cancer sample where some proteins were very high and some very low. When you sort all the values, the highest value in the normal sample will be some normal protein, whereas the highest value in the cancer sample will be the biomarker you are looking for; quantile normalization will take those values and replace them with their median, which will be somewhere in the middle. So you have just lost your biomarker. That is why I call quantile normalization a destructive normalization: there is the possibility of losing signal. It was introduced to work with Affymetrix microarray data, where you have a set of numbers and you know the range is going to be between 0 and 20000, but for some reason in one case the numbers ended up between 100 and 7000.
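The sort-and-replace procedure just described can be sketched as follows. This is a minimal illustration with samples as columns, not the implementation from any particular package (and it uses the cross-sample median at each rank, as in the lecture; many implementations use the mean):

```python
import numpy as np

def quantile_normalize(mat):
    """Quantile-normalize the columns (samples) of a matrix: each value is
    replaced by the cross-sample median of the values at its rank, making
    all columns statistically identical."""
    mat = np.asarray(mat, dtype=float)
    ranks = mat.argsort(axis=0).argsort(axis=0)    # rank of each value within its column
    ref = np.median(np.sort(mat, axis=0), axis=1)  # reference: row-wise median of sorted columns
    return ref[ranks]                              # look up each value's replacement by rank
```

For example, for columns [5, 2, 3] and [4, 1, 6] the reference distribution is [1.5, 3.5, 5.5], and after normalization both columns contain exactly those three values, each placed according to its original rank.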
With data like that you can say: OK, I am going to bring my 7000 range closer to 20000, or vice versa. But in real-world projects, when you use quantile normalization you have to be really careful that you know exactly what you are doing and are sure it is not destroying signal that might be present in your study. To deal with these problems, and with what happens in the extremes, one approach we came up with is called two-component normalization. The concept is this: if you look at the distribution of proteins, you have a large number of proteins in the middle with an average log ratio of 0. Those proteins are not changing between your samples and your reference; they are the unchanging proteins. Then there are proteins in the tails that are either up- or down-regulated compared to the reference. What you want to do is normalize based only on the unchanging proteins, so that you leave the extreme proteins alone and do not disturb them. To do that we use a mechanism called a mixture model, where you can say: I know there are two different distributions mixed together in the plot I showed. The black curve has things that are not changing in the middle and things that are changing in the tails, and I want to find just the set of proteins in the middle that are not changing. That means there are two distributions inside the black line: one representing proteins that are not changing, and another representing proteins that are changing. So we fit two Gaussian (normal) distributions using a process called mixture modeling, which says: you told me there are two distributions; let me look at the data and figure out where those two distributions are, and what their means and standard deviations are.
This procedure will come up with the red curve, which says: my mean is 0 and my standard deviation is, say, 1; and the blue curve, representing the proteins that are changing, for which the mean is also 0 but the standard deviation is, say, 100, because they are very spread out. Then you say: the red curve represents the proteins that are not changing, so I am going to use only the red curve to do my normalization. This is basically like z-scoring, except you z-score against a specific distribution calculated to include only the proteins that are not changing. That way you avoid disturbing the proteins that are changing, and you focus your normalization only on the unchanging ones. There is no one normalization that is good for everything. Usually we use two-component normalization in most big studies, but it assumes there is a set of proteins that are not changing. Suppose you did a protein-protein interaction experiment, where you used immunoprecipitation against a protein of interest, pulled that protein down, and then did proteomics on what came down. There, most likely, a lot of things are changing, and there may not be a set of proteins that are not changing. In that case you cannot use two-component normalization, because it assumes such a set exists. What would you use instead? You can do median-MAD normalization. So usually we look first at two-component normalization; if it seems appropriate and is working, we use it. If not, the default we fall back to is median-MAD normalization. But in some cases even that may not be possible: we have had experiments with a control sample and a treated sample where the mean or median for the control was very different from that for the treated sample.
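A minimal sketch of the two-component idea, assuming log ratios that are already roughly median-centered: fit two zero-mean Gaussians by expectation-maximization and z-score against the narrow (unchanging) component only. This illustrates the concept; it is not the production implementation used in the CPTAC pipelines:

```python
import numpy as np

def two_component_normalize(x, n_iter=200):
    """Fit a mixture of two zero-mean Gaussians by EM (a narrow component
    for unchanging proteins, a wide one for changing proteins), then scale
    by the narrow component's standard deviation so the regulated proteins
    in the tails are left alone. Assumes x is already roughly centered."""
    x = np.asarray(x, dtype=float)
    s1, s2, w = x.std() / 2, x.std() * 2, 0.5      # initial guesses
    for _ in range(n_iter):
        # E-step: responsibility of the narrow component for each point
        p1 = w * np.exp(-x**2 / (2 * s1**2)) / s1
        p2 = (1 - w) * np.exp(-x**2 / (2 * s2**2)) / s2
        r = p1 / (p1 + p2)
        # M-step: update mixing weight and the two standard deviations
        w = r.mean()
        s1 = np.sqrt((r * x**2).sum() / r.sum())
        s2 = np.sqrt(((1 - r) * x**2).sum() / (1 - r).sum())
    return x / min(s1, s2)  # z-score against the unchanging-protein component only
```

Because the scale factor comes only from the narrow component, a strongly regulated protein out in the tail keeps a large normalized ratio instead of being squashed, which is exactly the behavior the lecture motivates.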
If you normalized control and treated together in that situation, you would end up pulling everything together, which may not be the appropriate thing to do. In that case we split the data into two groups and normalize each group separately. So it really depends on the data you have, the experiment you have done, and the questions you are addressing through the experiment. If there is one thing you always want to try, median-MAD would be a simple choice. You can probably do median-MAD normalization in Excel; you do not need any additional software. For two-component normalization you will need some kind of statistical analysis package. We do it in R, but there are lots of others that can implement the same process; in your hands-on session we are going to look at Protigy, which has two-component normalization implemented in it. You can explore that on your own. I think we probably will not do it today because the normalization takes a while, so you might want to try it yourself. In general, two-component and median-MAD normalization are what we look at unless there is a reason not to. The question is about something like plasma proteins: with three different clinical conditions there will be very few proteins that are not changing across all three conditions, so if we want to do two-component normalization, what is the minimum number of unchanging proteins we should look for? My rule of thumb is that you should expect at least a few hundred proteins that are not changing. If you have only five or ten, or numbers in the low tens, it is not a good method to apply; a few hundred or beyond is fine. In most discovery experiments, like the CPTAC experiments, we have thousands of proteins that are expected not to change.
In those situations this will definitely be applicable, but if you have only three hundred proteins and you think a large fraction of them would differ across your experiments, I would not use this; I would just use median-MAD normalization. The question is whether we can use Pareto scaling as well for protein mixtures. Sure. Pareto scaling is for when you expect fatter tails. Two-component normalization is also trying to accommodate fat tails: usually you assume a normal distribution, but in most real data the central part is normally distributed while the tails are fatter. If you expect a lot of things to be changing, you can try a Pareto distribution; I think it still needs a central part that is not changing, but it will accommodate fatter tails. So if a larger proportion of things are changing, you can deal with that using that distribution. The fold change is only relative to the reference, and you are trying to make that measure the same for all your samples, so the problem you describe arises when you do this with actual intensities; that is partly why we take ratios and log-transform. If you do not like the normalization the software is doing, you should disable it and do this, or something else, separately; but you should not normalize data that has already been normalized. The question is whether the median-MAD point I talked about was for quantitative interaction data. Well, in theory it is for any collection of data, but yes, for smaller sets it is a more reasonable thing to do compared to two-component normalization, because the number of things that are changing would be smaller.
Changing the normalization method will obviously affect the outcome of an analysis. Things that were differential markers under one normalization may not be under another, or may have a lower or higher p-value. In other words, changing the normalization changes the outcome of your analysis. That is why you want to think about how you are normalizing well before you start looking at results; otherwise you will tend to tweak your normalization to get results that are more agreeable, which is not the right statistical approach. As I am saying here, the results will be different, but it is not good in one case and bad in another; it is just different. You want to make sure the normalization agrees with your methodology, your experiment, and the questions you are addressing before you go ahead. To answer the question about the reference: the reference is basically a sample we construct so that the same material is in every TMT 10-plex; if there are differences, we can use it to normalize things across plexes. And yes, if you had a control or a normal sample, that would also be compared to the reference. As you can see, the advantage of this kind of approach is that you take ratios relative to the reference for every sample in your channels, and because of that most batch effects you might have are taken care of right there; you do not need to do separate batch correction or other manipulations of the data. The downside is that you are looking only at ratios relative to the reference. If you wanted to know whether one protein is higher in a sample than another protein, you cannot answer that question, because all you have are relative ratios for protein A and relative ratios for protein B; you cannot compare across proteins. But if you want to know whether a protein was higher in sample 1 versus sample 2, you can, because we are measuring ratios relative to the same reference.
So the advantage is that it minimizes the manipulation of the data you need to do later, and it takes care of batch effects and other systematic technical artifacts in a more agreeable way; the downside is that you cannot make absolute-level comparisons. The question is whether the reference should preferably include the tumor type being studied. Preferably, yes, because there could be proteins specific to ovarian cancer, say, that do not occur in other cancer types or in normal tissue you might include, and if a protein is not in the reference you are more likely to miss it. The next question is: if we do not have a normal control, can we pool a mixture of the tumors and use it as an internal control? That is essentially what we are doing here; we call it the reference because we use it to take relative ratios. In the breast cancer project you are going to hear about today, the CPTAC prospective and retrospective breast cancer project, the reference was basically a collection of tumors. We had about 100 tumors, and I think 40 of those went into the reference, but we made sure that the proportion of the different subtypes was the same. The question was how you ensure that things do not change from lab to lab and stay consistent. First off, you start with a standard operating protocol that everybody uses. There has been a lot of effort in CPTAC to make sure that results are reproducible within a lab and across labs. A lot of the work you are seeing here is what we call CPTAC 2, which covered breast, ovarian and colorectal cancer; CPTAC 3, some of which we have mentioned, includes the lung cancer and the newer samples. CPTAC 1 was a 3-to-5-year project whose entire goal was to make sure that proteomics is reproducible across labs.
You need to set up proper SOPs and have a common sample that you run to see whether you are getting the same results; you just have to go through the process and make sure things are reproducible. The question is whether the reference is shared across labs. Right now I think the answer is no. The NCI is trying to come up with something like that, but at the moment there is not anything you can get from some lab or institution to share across labs. In your slides there is a reference to a paper in Nature Protocols; that was a standardized protocol for TMT 10 derived from all the CPTAC labs and has been published, so if you want a standard protocol for TMT, that is a good place to look. The question is: if you start fresh on a new project and you do not have any reference, what do you do? The way I have presented it here, you create a reference for each big project. If you have a project with only four samples run in duplicate, that is one TMT 10-plex: you use 8 channels and maybe a couple of replicates, and that is it; you do not need a reference. But if you have a project with 100 samples, you first have to get enough of those samples in hand, create your reference by combining equal amounts of multiple samples, and then start. For all of these projects we did not have any reference before we started: for the breast cancer project we started by creating the reference, and for the lung cancer we started by creating a new reference. The question was whether, in breast cancer, with its several subtypes, you have a reference for each subtype. No; but in the reference you have the same fraction of each subtype as you have in your sample set. By fraction I mean you combine the different subtypes of breast cancer in proportion.
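As an illustration of this proportional-reference idea, here is a hypothetical helper (the function name and the largest-remainder rounding scheme are my own, not from the lecture) that apportions a fixed number of reference-pool slots across subtypes in proportion to their frequency in the full sample set:

```python
def reference_composition(counts, pool_size):
    """Apportion pool_size reference slots across subtypes in proportion to
    their counts in the full sample set, using largest-remainder rounding
    so the allocations sum exactly to pool_size."""
    total = sum(counts.values())
    quotas = {k: pool_size * c / total for k, c in counts.items()}
    alloc = {k: int(q) for k, q in quotas.items()}          # floor of each quota
    leftover = pool_size - sum(alloc.values())
    # hand remaining slots to the subtypes with the largest fractional parts
    for k in sorted(quotas, key=lambda k: quotas[k] - int(quotas[k]), reverse=True)[:leftover]:
        alloc[k] += 1
    return alloc
```

With four subtypes at 25 percent each and material for a 60-sample pool, each subtype gets 15 slots.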
So, if you had, let us say, 25 percent each of four different subtypes of breast cancer, then when you create your reference you want one fourth of each subtype in it. Say you have 100 samples but enough material to create the reference from only 60 of them: then you want 15 from subtype 1, 15 from subtype 2, and so on. The proportion of the different subtypes that go into your reference should match the actual proportions in your sample set, so that you are not overly emphasizing one group or another. It is just a guideline; if it is off by a little, that should be fine. But if it is completely off, or you left out one or two subtypes, then it is more likely that proteins specific to those subtypes will not be observed, which is a disadvantage: your perfect biomarker for that subtype will not be there.

I hope you have learnt and appreciated how the big data obtained from genomic and proteomic technologies can be very meaningful, but that to obtain the relevant insights you still need to normalize the data. Today's lecture by Dr. Mani has illuminated how to normalize your data and how normalization strategies differ from one data set type to another. You cannot just apply the same normalization to all types of data sets, not even to all data obtained from mass spectrometry or proteomics experiments. You have also heard about centering and scaling, when they should be done, and how they can help you obtain robust standardization. I hope you also noted how the correct normalization strategy leads to addressing the correct biological question, and even to correct biomarker identification; these outcomes could be very wrong if your baseline was wrong, if your normalization was not correct to begin with.
Therefore, knowing about these issues, and planning to rectify them using the right normalization, becomes very crucial. You also heard about the contexts in which you should normalize, and whether it should be applied to the raw data only once for the whole data set. You learned about the importance of two-component normalization and its robustness; at the same time, you heard that two-component normalization cannot be used for all kinds of data sets. So again, the context in which a normalization strategy should be utilized depends on the distribution of the data, the tails of the distribution, and the data size. We will continue the next lecture, again by Dr. D. R. Mani, where you will be introduced to the concepts and importance of batch correction as well as missing-value imputation in data analysis. Thank you.