Hello students. We have talked about many technologies, both label-based and label-free platforms, and you may appreciate that omics tools often generate huge data, as you have seen in the case of microarrays, SPR and NGS platforms. The data can be obtained from different equipment. When we are doing measurements using microarrays, we are measuring intensities. When we are looking at mass spectrometry platforms, we are looking at the area under the curve. So different instruments give different ways of measurement, but finally what we get is a huge data set, and then we have to see how best we can make sense of this biological data set and how relevant it is to the primary question we are interested in addressing.
In the previous lectures, you were exposed to the basics of statistical analysis and the importance of having a clear understanding of the statistics to be implemented to get meaningful insight and interpretation from your data. The data obtained from these high-throughput omics technology platforms mostly comprise a huge range of values that need to be scaled down to perform the analysis. Also, big data comes with a lot of variability: sometimes it is day-to-day experimental variability, sometimes instrumentation error or batch effects, and of course manual errors will be there as well. Therefore, there is a need to look at how best to normalize the data and how to remove or minimize the technical artifacts to obtain a biologically relevant and meaningful data set. So we need many preprocessing steps, which include data scaling and normalization, before we actually proceed to the statistical tests. In today's lecture and hands-on session, I have two of my research scholars, Deepthrub Biswas and Arijit Mukherjee, who will walk you through the different steps required for analyzing big data sets.
They will further elaborate on different aspects of statistical analysis, data visualization and plotting. I would like to mention that there are many very nice open-access platforms and software tools available now, and it is a good idea for all of you students and participants to explore many of these tools to make the best sense of your data: not only how to best analyze the data, but also how to interpret the data and finally how to present it effectively. This is where data visualization becomes very crucial. The data generated from these techniques come in the form of big tables, but big tables do not convey the biological sense of what we want to address. Therefore, you have to convert all these findings into graphs and visual displays. So we thought you should learn some of these software tools, which can help you to best utilize and visualize your data sets. In today's session, we will cover the basics of secondary data analysis, and we will walk you through the different steps involved in preprocessing of the data, basic statistical analysis, data visualization and plotting. So, let us start today's lecture.
Let us take an example in which we have designed an experiment with 13 control and 15 disease samples. After running all these samples individually, we get 28 files, of which 13 are control files and 15 are disease files. Now we need to analyze these files and find out how the controls differ from the disease samples. The first thing we should do is take these files and do a primary analysis. After the primary data analysis, what you get is the total number of files, which in our example is 28. When we are talking about all 28 files, a number of things come into play, such as technical variation, experimental variation, batch effects and so on. That means we need to apply some strategy to normalize the data.
So, the next part of the data analysis is the secondary analysis, where we first transform the data, which can be a log2 transformation, and try to understand whether the data follow a Gaussian, that is normal, distribution or not. If the data set is skewed, we then go for sample-wise normalization and, if needed, data filtering, followed by different statistical tests. After the analysis of the 28 files, what we get is this kind of distribution, and with the help of normalization, transformation and scaling, we need to get the data to look like this so that we can proceed to the defined statistical tests and statistical plots. Finally, after the secondary analysis, we try to link our list of significant proteins with biological pathways, protein-protein interactions and different pathway enrichment networks to understand the biology behind the disease. So, now, let us welcome Arijit Mukherjee, a junior research fellow in the proteomics lab, who will give detailed information on secondary analysis.
You are now aware of the overview of the complete data analysis workflow. In data analysis, we go through three main steps: primary analysis, secondary analysis and then tertiary analysis. What you get after the primary analysis is the list of masses with their respective abundances or intensities. Now, from these abundances or intensities of a list of proteins, which run into thousands of proteins, we need to make valid inferences based on some statistical test. These statistical tests, along with transformation, scaling and normalization, come under the secondary analysis steps. So, let me take you through the steps in secondary analysis. Once we have the raw intensity values, we do missing value imputation. Missing values are a common feature in an omics data set. They may occur due to some technical limitations or due to manual handling.
We will be talking in detail about missing value imputation in terms of KNN, that is k-nearest neighbor, mean replacement and median replacement. The next step is log transformation and scaling. We generally do a log2- or log10-based transformation, followed by scaling and centering, in order to be able to compare different kinds of distributions, because different samples may take up different kinds of distributions after log transformation. The next step we can perform is normalization. Here, normalization means median absolute deviation normalization or quantile normalization. There are many other kinds of normalization that we can perform depending on the data structure, but here we will focus on these two specifically. Next come the statistical analysis steps. Statistical analysis is basically performed to test a hypothesis. Depending on your experimental question, for example that you want to know the significantly dysregulated genes, proteins or mRNAs present in disease samples as compared to controls, we need to perform a t-test, an ANOVA test or a regression analysis. These things will be covered in this lecture. Finally comes data visualization and plotting, where we can draw heat maps or cluster maps, box plots, volcano plots, etcetera. So, this is the overview of the secondary analysis that we are going to talk about in this lecture.
Now, coming to the first step. After you get your raw spectra, if you look at the mass spectra abundances, they come in terms of 1e9, 1e10 (that is, 10 to the power 9, 10 to the power 10) and so on. With numbers like these, it is quite difficult to play around and to make valid assumptions.
As you can see in the figure on the left side, where 4108 elements have been plotted in a frequency distribution curve, the distribution is quite skewed, and with this kind of skewed distribution we cannot make any statistical inference. So what we do is perform a log transformation. Generally we take log base 2 or log base 10 and transform the data. On the right hand side of the figure you can see the log-transformed data distribution. It is not a perfectly normal distribution, but it approximates a normal distribution. This log transformation brings the data to a workable range, which is 15 to 40 in this example. It is quite easy to make inferences with numbers of this kind.
Coming to the importance of log transformation: as you can see in this figure, the skewed distribution is transformed towards a normal distribution, or becomes less skewed, and this allows you to interpret patterns in the data and to perform inferential statistical tests such as the t-test or ANOVA. As you know, in the t-test the basic assumption is that your data should be normally distributed. So the log transformation helps us to perform such inferential statistical tests. In omics studies we generally perform a log2 or log10 transformation, depending on the objective of the experiment. In biology we generally do a log2 transformation because we are interested in which genes or proteins show a fold change of twice as compared to the normal, that is, which are upregulated or downregulated at a fold change of two. Whether you take base 2 or base 10 depends completely on the objective of your experiment.
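The log transformation described above can be sketched in a few lines of Python. This is only a minimal illustration using the standard library: the function name `log2_transform` and the abundance values are made up for this example and do not come from the data set shown in the lecture.

```python
import math

def log2_transform(intensities):
    """Log2-transform raw abundances; missing (None) or non-positive values stay None."""
    return [math.log2(x) if x is not None and x > 0 else None for x in intensities]

# Made-up raw abundances in the 1e9 to 1e10 range, for illustration only.
raw_abundances = [1.0e9, 2.0e9, 4.0e9, 3.2e10]
log2_abundances = log2_transform(raw_abundances)

# On the log2 scale, each doubling of the raw value adds one unit.
for raw, logged in zip(raw_abundances, log2_abundances):
    print(f"{raw:.1e} -> {logged:.2f}")
```

Because a two-fold change becomes a difference of one unit, the log2 scale is convenient when the question is about fold changes of two.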
Now, coming to the topic of scaling and centering. If we take the log-transformed distribution, it looks like a normal distribution. But suppose we are taking control and disease samples: the control samples will have one mean after log transformation and the disease samples will have a different mean. Even though both distributions are normal, we cannot compare them directly when the two means are very different. We need to scale and center so that the distributions take up similar mean or median values and are distributed around those values. This is the purpose of scaling and centering: we can compare different samples, since all samples become normally distributed with a mean of 0, and this is performed in terms of the z-score. The z-score, as you can see in the formula, is x minus mu upon sigma, where x is the individual value of the variable, mu is the sample mean and sigma is the standard deviation. So the z-score reflects the distance from the mean in units of standard deviation. Assuming a normal distribution, the z-score should range from about minus 3 to plus 3, because in a normal distribution about 68 percent of the population lies within the mean plus or minus 1 standard deviation, about 95 percent lies within the mean plus or minus 2 standard deviations, and about 99.7 percent, that is almost all your data, should lie within 3 standard deviations of the mean. So this is a generally very convenient method of scaling and centering, with which you can compare two or more different distributions in terms of the z-score. So, what are the other methods available for scaling and centering?
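The z-score scaling just described, (x minus mu) upon sigma, can be sketched as follows. This is a minimal example under some assumptions: the helper name `z_scores` and the sample values are invented for illustration, and the sample standard deviation (dividing by n minus 1) is used, which the lecture does not specify.

```python
import statistics

def z_scores(values):
    """Center on the sample mean and scale by the sample standard deviation."""
    mu = statistics.mean(values)
    sigma = statistics.stdev(values)  # assumption: sample SD with n - 1 in the denominator
    return [(x - mu) / sigma for x in values]

# Invented log2 abundances of one sample, for illustration only.
sample = [20.0, 22.0, 24.0, 26.0, 28.0]
scaled = z_scores(sample)

# After scaling, the values are centered on 0 with a standard deviation of 1.
print([round(z, 3) for z in scaled])
```

Applying the same function to each sample puts all of them on a common scale, which is what makes the distributions comparable.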
In the z-score we assume that your data distribution is completely normal, or Gaussian, but this is not always the case. In real-life examples the distribution may not be perfectly normal, and in that case assuming a normal distribution is not the right way to go. We can then take up other scaling methods, such as scaling our data to a 0 to 1 scale. This is particularly useful when variables come from possibly different distributions, because it preserves the shape of each distribution while making them easily comparable on the same scale. If we call the scaled value a score, it is x minus the minimum, upon the maximum minus the minimum. So the minimum value of all observations is subtracted from each observation, and the result is divided by the range, that is, the maximum minus the minimum of all observations. This brings your data to the range of 0 to 1, and no data point will lie above or below that range. In this way we can compare different distributions in terms of scaling and centering.
Now, our next point is missing values. As we have already said, missing values are a common feature in omics data. Let us think about what could be the source of missing values. The first reason could be that a peptide is present in some samples and absent in others. It is generally the case that if you take 10 control samples or 10 disease or cancer samples, you may detect a particular protein in only 6 or 7 of them, and it is not found in the remaining samples. This can happen for various reasons. The peptide may be present below the detection limit in the sample, or the peptide is detected but its abundance is reported as NA, not applicable, due to technical limitations.
In the data analysis step, missing values can be imputed. There are many methods for imputation of missing values, and the choice of method should ideally rely on the reason behind getting a missing value. Next, we talk about the categories of missing values. There are two categories: missing completely at random, that is MCAR, and abundance-dependent missing values. Missing completely at random values occur due to technical glitches in instrumentation, such as poor ionization, other peptides competing for charges, etcetera. Abundance-dependent missing values occur because peptides are below the detection limit or may not be present in the sample at all, or it may be that the detector got saturated and failed to record the abundance. So these are the two main categories of missing values.
Now, coming to the imputation of missing values, here we will be talking about imputation using the mean, the median or the lowest value. For each protein, we can use the mean across the samples in a particular control or disease group to replace the missing values there, or we can use the median. However, this method has some pitfalls: since we are substituting multiple entries with the same value, it may not truly reflect the biological variation. Now, in this slide you can look at an example of missing value imputation using the mean, the median or the lowest observed value. As you can see in the matrix here, we have genes on the column side and samples, divided into controls and disease, on the row side. There are four control and four disease samples, and you can see that there are a lot of missing values: gene A is missing for the disease 3 and disease 4 samples, gene B is missing for the control 4 sample, and so on.
Missing values of this kind always interfere with further data analysis, and based on the distribution and on some logic we can impute them so that we get a complete matrix for further analysis. As you can see on the right hand side, there is replacement by the mean, where the red ones are replaced by the value of the mean. If you look at the disease category on the right hand side, 35 and 36 are the values for the disease 1 and disease 2 samples, but the disease 3 and disease 4 samples are missing. The mean of the first two disease values is 35.5, and we can impute the missing values with this mean. So this is imputation using mean values. Again, if you look at the replacement by median, highlighted in blue, you can see for gene E that the control samples have a median of 23, from the values 23, 25 and 22. These are the three observed values among the four control samples. So we can impute the missing value for gene E in the control 3 sample with the median, 23. This is median-based missing value imputation. The last one, highlighted in green, is replacement by the lowest observed value. It assumes that the peptide is below the detection limit in the sample, which is why we are encouraged to use the lowest observed value for this kind of imputation. As you can see in green, for the control samples the lowest observed value across all the genes is 21. So we can substitute 21 in all the missing places with this imputation procedure. So you now have a general summary of what missing values are and how they can be imputed.
Now, coming to another missing value imputation approach, KNN, which stands for k-nearest neighbor. KNN uses an unsupervised algorithm to impute the missing values.
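The mean, median and lowest-observed-value replacements walked through above can be sketched in Python as below. This is only an illustration: the helper names are invented, the positions of the values within each list are assumed, and `gene_x` is a hypothetical second gene added so that the lowest observed value of the control group is 21, as in the slide.

```python
import statistics

def observed(values):
    """Return only the values that are actually present (not None)."""
    return [v for v in values if v is not None]

def fill_missing(values, fill_value):
    """Replace every missing entry (None) with fill_value."""
    return [fill_value if v is None else v for v in values]

# Mean replacement: gene A, disease 3 and disease 4 are missing.
disease_gene_a = [35.0, 36.0, None, None]
mean_filled = fill_missing(disease_gene_a, statistics.mean(observed(disease_gene_a)))

# Median replacement: gene E, control 3 is missing (order of values assumed).
control_gene_e = [23.0, 25.0, None, 22.0]
median_filled = fill_missing(control_gene_e, statistics.median(observed(control_gene_e)))

# Lowest observed value across all genes of the control group.
control_group = {
    "gene_x": [21.0, None, 24.0, 26.0],  # hypothetical gene, values invented
    "gene_e": control_gene_e,
}
lowest = min(v for row in control_group.values() for v in observed(row))
lowest_filled = {gene: fill_missing(row, lowest) for gene, row in control_group.items()}

print(mean_filled)    # missing entries replaced by 35.5
print(median_filled)  # missing entry replaced by 23.0
print(lowest_filled)  # missing entries replaced by 21.0
```

Note that every missing entry in a group receives the same value, which is exactly the pitfall mentioned earlier: the imputed values carry no biological variation.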
Now, if you look at the graph, there is a star-shaped point that we need to classify; we do not know which category it belongs to. So we need to decide whether this star-shaped point goes with the circular type or with the rectangular type. What the KNN algorithm does is compute the distance between the unknown point and all the observations, and from all the observations it takes the nearest k, where k can be any positive integer; in this example we are showing k equal to 3. That means it takes the three shortest distances to the circular type and to the rectangular type, and the category with the lowest sum of distances is assigned to the unknown point. In this case you can see that the star point might go to the rectangular side. So this is how the algorithm works. Based on whether the point is of the rectangular or the circular type, it may be assigned a value based on the distribution of values in that particular category, and this is the underlying assumption of KNN-based missing value imputation.
Now, the next topic is normalization. Normalization is a preprocessing step for omics data sets that allows us to compare different samples and make valid inferences. Normalization is performed to remove the technical artifacts in experiments. Technical artifacts may come from changes in the column, from manual handling, or even from day-to-day changes in temperature. These technical artifacts are different from true biological variation. So the whole aim of the normalization process is to remove these technical artifacts without disturbing the biological variation.
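Looking back at the KNN idea described earlier in this part, a distance-based imputation can be sketched as follows. This is one simple variant written for illustration, not necessarily the implementation behind the slide: rows are treated as proteins, the distance is a root mean squared difference over the samples observed in both rows, and a missing entry is filled with the average of that sample's value in the k nearest rows. The function names and the data are invented.

```python
import math

def row_distance(a, b):
    """Root mean squared difference over positions observed in both rows."""
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not pairs:
        return math.inf
    return math.sqrt(sum((x - y) ** 2 for x, y in pairs) / len(pairs))

def knn_impute(matrix, k=3):
    """Fill each missing entry with the mean of that column in the k nearest rows."""
    imputed = [list(row) for row in matrix]
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate neighbors: other rows that have column j observed.
            candidates = sorted(
                (row_distance(row, other), other[j])
                for n, other in enumerate(matrix)
                if n != i and other[j] is not None
            )
            nearest = [v for d, v in candidates[:k] if math.isfinite(d)]
            if nearest:
                imputed[i][j] = sum(nearest) / len(nearest)
    return imputed

# Invented log2 abundances: rows are proteins, columns are samples.
abundances = [
    [20.0, 21.0, None],  # protein with a missing value in sample 3
    [20.0, 21.0, 22.0],
    [21.0, 22.0, 23.0],
    [30.0, 31.0, 32.0],  # very different profile, not among the 3 nearest
    [19.0, 20.0, 21.0],
]
completed = knn_impute(abundances, k=3)
print(completed[0])
```

Unlike mean or median replacement, the imputed value here depends on proteins with a similar abundance profile, so different missing entries can receive different values.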
So, given these technical and biological variations, we need to decide whether the data set can be used for comparison, and to remove the technical artifacts we use various normalization methods such as quantile normalization and median MAD normalization. MAD stands for median absolute deviation. Here in this slide we have shown you an example of quantile normalization. Quantile normalization is a technique for making two or more distributions identical in their statistical properties. Let us take a case of 4 genes and 3 cases. The values are filled in arbitrarily, and I am going to take you through quantile normalization step by step. From the raw data, the first task is to rank the values within each case from lower to higher. As you can see in the ranking, in case 1 the value 5 is given rank 4, that is, the ranks run from lower to higher. The next step is to sort the data according to rank, that is 1, 2, 3, 4, since there are 4 genes here. After that, we replace the values by the mean of that particular rank across the cases. If you compare the raw data with the final data replaced by means, the ranks remain the same and the order of genes A, B, C, D remains the same, but the values have changed and the cases now have identical statistical properties. So this is the underlying principle of quantile normalization.
However, there are some pitfalls in quantile normalization: extreme values could be masked. Hence we may lose potential biomarkers, because biomarkers are the proteins of interest with dysregulated expression levels, so in the normal distribution curve they might be in the tail region. Quantile normalization may therefore mask your potential biomarkers from the analysis, so it should be implemented with proper care. Quantile normalization is generally used for microarray data analysis.
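The quantile normalization steps above (rank, sort, average per rank, substitute back) can be sketched as follows. This is a minimal illustration: the function name and the values for 4 genes in 3 cases are made up, chosen so that the value 5 in case 1 receives rank 4 as in the slide. Ties are not treated specially here, which is a simplification.

```python
def quantile_normalize(columns):
    """Quantile-normalize a list of equally long columns (one column per case)."""
    n = len(columns[0])
    sorted_cols = [sorted(col) for col in columns]
    # Mean of the values that share the same rank across all cases.
    rank_means = [sum(col[r] for col in sorted_cols) / len(columns) for r in range(n)]
    normalized = []
    for col in columns:
        # Indices of the genes ordered from lowest to highest value in this case.
        order = sorted(range(n), key=lambda i: col[i])
        out = [0.0] * n
        for rank, gene_index in enumerate(order):
            out[gene_index] = rank_means[rank]
        normalized.append(out)
    return normalized

# Made-up values for genes A, B, C, D in three cases.
case1 = [5.0, 2.0, 3.0, 4.0]
case2 = [4.0, 1.0, 3.0, 2.0]
case3 = [3.0, 4.0, 6.0, 8.0]
norm1, norm2, norm3 = quantile_normalize([case1, case2, case3])

print(norm1)  # gene A keeps the highest rank in case 1 and gets the rank-4 mean
print(norm2)
print(norm3)
```

After normalization every case contains the same set of values, so the distributions are identical, while the ordering of the genes within each case is preserved.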
Next comes median absolute deviation normalization, which we abbreviate as MAD normalization. MAD normalization is preferred when the distribution of the data is not normal, and it is particularly useful in that case. The z-score, in contrast, is applicable when we assume that the data distribution is Gaussian, that is, normally distributed. MAD normalization is similar to z-score scaling; the only difference is that it uses the median and the median of absolute deviations, so it is not affected by extreme values, and this is particularly useful for non-normal distributions. The median absolute deviation, as you can see in the formula, is the median of the absolute deviations from the median.
Now, let us talk about data visualization and plotting techniques. If you look at any omics paper in proteomics or genomics, these kinds of plots are very common: heat maps, cluster maps, box plots, volcano plots, PCA plots and so on. So let me give you a glimpse of all these data visualization and plotting techniques. Let us talk first about heat maps or cluster maps. A heat map is a useful tool to visualize patterns in large-scale biological data. You can visualize thousands of genes and their patterns across numerous samples. As you can see in the figure, the rows are the list of proteins and the columns are the list of samples, and if you go along the row of a particular protein you can see the pattern of that protein across the different samples. This gives us a visual overview of all the patterns in large-scale biological data. The color scale, which ranges from minus 4 to plus 4, comes from the z-score scaling. So after you log-transform your data, you do the z-score scaling and you can plot the result as a heat map.
The blue ones represent the proteins with lower expression values and the red ones represent the highly expressed proteins. So this is a very useful technique in omics studies, with which you can make out patterns in large-scale biological data.
Next we talk about volcano plots. Volcano plots are a quick and easy way to visualize the genes or proteins that are statistically significant. In a volcano plot, we plot the log2 fold change values on the x axis and the negative logarithm of the p values on the y axis. After you perform your t-test or ANOVA, you end up with a p value for each protein, and you can also determine the fold change of the disease samples in comparison to the control samples. This ratio of disease samples to control samples in terms of expression values is called the fold change. If we plot the log2 of the fold change against the negative logarithm of the p values, we can see which points are statistically significant in terms of p values and also significant in terms of fold change. As you can see in the figure, the red dots lie above the cutoff, so these have statistically significant p values. The green dots on the left hand side have a log2 fold change in the range of minus 1.5 to minus 2, that is, these proteins are quite downregulated as compared to the reference samples, and the green dots on the right hand side denote proteins that are upregulated as compared to the reference samples. So this is a quick and easy way to visualize your data and see which proteins are of interest for your biological question.
Now, coming to box plots. Box plots are an easy way to visualize the spread of your data points for a particular gene or protein. When you have narrowed down to your proteins or genes of interest, a box plot can be an easy way to quickly compare the distribution of the abundance values across the samples.
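Returning to the volcano plot axes described above, the two coordinates of each point can be computed as sketched below. This is only an illustration under stated assumptions: the mean abundances are taken on the raw (not log) scale, the p values are assumed to come from a test performed beforehand, and the cutoffs (log2 fold change of 1 and p value of 0.05) as well as all function names and numbers are choices made for this example, not values from the lecture.

```python
import math

def volcano_point(mean_disease, mean_control, p_value):
    """Return (log2 fold change, -log10 p) for one protein."""
    log2_fc = math.log2(mean_disease / mean_control)
    neg_log10_p = -math.log10(p_value)
    return log2_fc, neg_log10_p

def classify(log2_fc, neg_log10_p, fc_cutoff=1.0, p_cutoff=0.05):
    """Label a point using assumed cutoffs on fold change and p value."""
    if neg_log10_p < -math.log10(p_cutoff):
        return "not significant"
    if log2_fc >= fc_cutoff:
        return "up"
    if log2_fc <= -fc_cutoff:
        return "down"
    return "not significant"

# Invented (mean disease, mean control, p value) triples for three proteins.
proteins = {
    "protein_1": (400.0, 100.0, 0.001),
    "protein_2": (100.0, 400.0, 0.01),
    "protein_3": (120.0, 100.0, 0.5),
}
labels = {name: classify(*volcano_point(*vals)) for name, vals in proteins.items()}
print(labels)
```

Plotting the first coordinate on the x axis and the second on the y axis, colored by label, gives the volcano plot; points far from zero on the x axis and high on the y axis are the candidates of interest.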
As you can see in the example on the right hand side, we have taken thymosin beta-4-like protein 3. For this protein we have plotted the fold change values across different types of samples, comparing grade 2, grade 3 and grade 4 samples. You can see the distribution of this protein in terms of fold change values across those samples, and this is generally plotted as a box plot. In a box plot you can see that there is a median line in the middle of the box, which represents the median expression value of that particular protein; on the upper side you have the third quartile and on the lower side you have the first quartile of the expression values. These are the parameters of a box plot that are used to visualize the spread of the whole data.
Now, let us talk about PCA, or principal component analysis. Principal component analysis is a dimensionality reduction technique used to interpret the clustering of similar types or grades of samples. In PCA, what you generally see is a two-dimensional plot. The x axis represents principal component 1 and the y axis represents principal component 2. Principal component 1 represents the direction of maximum variation in your samples. So this is the axis along which you can group or cluster your data and see how the samples are clustering. As you can see, individual samples are grouped into different clusters shown in different colors. That means they are quite different from the other samples. The PCA plot is another easy way to visualize the grouping and clustering in your samples for making further biological inferences.
I hope today's session was informative and you got to learn many new tools, which you should start playing with. There are a lot of data sets available in the public repositories. Of course, you can even ask our lab to provide you with some more raw data sets, which you can play with and try to analyze yourself.
With these sessions, I hope you are getting a good understanding of the different steps involved in data processing and analysis. Of course, for each instrument platform, whether we talk about microarrays, surface plasmon resonance, next-generation sequencers or mass spectrometers, the initial raw data processing is unique. But once you have obtained the data in an Excel sheet format, a lot of things are pretty much common, because you want to make biological sense out of these raw numbers and big data tables. What was your background? What is the noise? What is the actual signal? So many of the processing steps remain very similar irrespective of which technology platform we are using. Of course, there can be unique aspects arising from the constraints of each technology as well. But finally, what do you want? You want to make biological sense of the data and you want to plot the data in a more meaningful way. The different ways of visualizing data and the tools available to effectively represent your data are very crucial. So I hope that what you learned today about some of these tools has given you a much better understanding of how effectively you can start looking at your data. In the upcoming lecture, we will continue further on this and show you more software and analyses that build from your protein and gene lists: how we can make networks and pathways, and how to make better biological sense of the complexity of the problem you are working on by targeting different pathways and interaction networks. We will continue with more of this session in the next lecture. Till then, thank you.