We have been discussing different microarray-based platforms and how to perform biological applications on these chips. In the last lecture, Ms. Apurva Venkatesh showed you how to perform a microarray experiment using serum samples obtained from patients suffering from falciparum or vivax malaria. Today we are going to continue the demonstration and also show you how to normalize and analyze microarray data when your goal is to answer a specific biological question. In this case, we are going to talk about several ways of extracting meaningful data from malaria patients using protein microarray-based platforms. Let us have this lecture and demonstration session today.

I am Apurva Venkatesh, your TA for this course, and today we are going to talk about microarray data normalization and analysis. In the last lecture, we were trying to profile the humoral responses of malaria-positive patients using microarray technology, so we are going to start from there. Today we will see how to normalize microarray data using Excel. We will start with the raw file you get from the microarray scanner. Once you scan your slide in a scanner, you extract the raw data, and here is the Excel sheet you get. I am showing you one particular slide. I would like to repeat that one slide can probe sera from 8 patients, so this one Excel sheet has data for 8 patients. The first thing I am going to show you is how to reorganize this data. Let us first see what kind of parameters are exported. You will see that all the important parameters are provided in this Excel sheet. For example, the pixel size is 10. The slide was scanned at a wavelength of 635 nanometers. Then, going down, you see the normalization method. This was not normalized yet.
So, it says none. If you scroll down further, you can see the PMT gain, which is 400, the scan power, which is 100, and the laser power, which is 1.34. If you later forget the parameters you used, you can always open this Excel sheet to see what you had done. Now let us scroll down further. You will see block, column and row. This is very important. Let us go back to the slide layout. One slide can probe 8 patient sera, and one pad, which probes one patient's serum, has four blocks. This means that as I keep scrolling down, every four blocks represent one patient's data. When I reach block 5, a new patient begins. That is how we are going to reorganize this. For example, you can see that block 4 ends here with blank, and then IgG mix 1 starts again. This is your new patient. Before we reorganize, let me tell you which columns are important for us. I am going to scroll back up and go through the columns in this Excel sheet. Apart from block, column, row, the name and the ID, which is your protein ID, we do not need any of the other columns except column AH. Column AH is your F635 median minus B635, that is, the median foreground signal at 635 nm with the background signal subtracted. This is the column we actually extract and use for analysis, and we do not need any other column here. So I am first going to delete all the unwanted columns to make this Excel sheet less complicated. Let us delete all of this, and then this as well. We also do not need the scan parameters for the analysis, so I am going to delete those too. Finally, this is what you get.
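If you would rather do this regrouping in code than by scrolling, the rule that every four blocks belong to one pad can be written as a small function. This is only a minimal sketch in Python: the function names and the example rows are hypothetical, and the only thing taken from the demonstration is the layout of 8 pads per slide with 4 blocks per pad.

```python
def pad_for_block(block, blocks_per_pad=4):
    """Return the 1-based pad (patient) number for a 1-based block number."""
    if block < 1:
        raise ValueError("block numbers start at 1")
    return (block - 1) // blocks_per_pad + 1


def group_by_pad(rows):
    """Group (block, name, signal) rows by the pad they belong to."""
    by_pad = {}
    for block, name, signal in rows:
        by_pad.setdefault(pad_for_block(block), []).append((name, signal))
    return by_pad


# Hypothetical rows: block number, spot name, background-subtracted signal
rows = [
    (1, "IgG mix 1", 41000.0),
    (4, "blank", 35.0),
    (5, "IgG mix 1", 39000.0),  # block 5 is where the second patient starts
    (32, "blank", 12.0),        # last block of the eighth pad
]

signals_by_pad = group_by_pad(rows)
```

With this, blocks 1 to 4 land in pad 1, blocks 5 to 8 in pad 2, and so on up to pad 8, which is the same side-by-side arrangement built by hand in the sheet.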
Now you keep scrolling down and arrange all the patients side by side. When you do that, this is what you get. You also see that there are additional columns here. These come from your GAL file, and when you scroll right you will see that all the patients are now placed next to each other. This is the kind of Excel sheet you get first. Now I am going to reorganize it to make it easier to work with. This is combined data for all your patients, and now we will reorganize the proteins. As you can see, the proteins are listed in the same order as they are present on your slide. Since this is common data for all patients and we have already put them together, we are going to bring the IgG mix spots from all four blocks together. I will repeat the slide layout once more. You have six IgG mix spots here in block 1. Similarly, you have the same spots in all four blocks of the same pad. What I am trying to do is get all the IgG mix spots together, so you will have 24 such spots one after the other. We will do the same thing for the anti-human IgG spots and for the remaining spots. When we rearrange the sheet, this is how it will look. I have put all 24 IgG mix spots of one pad together. Let us go through all the columns once more. These first columns are just your spot details. Next is your ADI spot ID, which you get from your GAL file. I will come to this in a minute; before that, let us talk about ORF. This column is your ID, and it also gives you details about the fragment that has been printed on the chip. Now let us go to the PlasmoDB ID. Each and every protein has a unique PlasmoDB ID, and that is what is given in this column.
If you go to the next column, which is ORF fragment, it will explain column H, your ORF column, better. You will see that it specifies which exon segment is printed on the chip. As you know, the spots printed on the chip were not purified proteins; they were IVTT spots. What is IVTT? In vitro transcription and translation. The whole protein was not expressed here; only a certain segment of a particular exon of a protein was expressed. So it is not really right to say that proteins were expressed on the chip. It is better to say that polypeptides were expressed on the chip. Column J gives us details about the polypeptide that was expressed and printed on the chip. That is how you get the ADI spot ID, which is a unique ID for each and every spot. What I mean is that if you look for duplicates among the PlasmoDB IDs, you will actually find them, because different exon fragments of the same protein may have been printed on the chip. In the ADI spot ID column you will not find a single duplicate, because these are unique IDs that take into account the exon fragment printed on the chip. For example, look at this particular row: if this was the ID and this is exon 1 of 2, you will see the ID here followed by 1o2. This becomes your unique ID for each and every spot. I am telling you all this because it is very important for data analysis. Sometimes you might start by analyzing your I column, and you will figure out later that there are a lot of duplicates and you do not know what you are actually doing.
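The duplicate problem is easy to check in a few lines. The sketch below is only an illustration: the gene IDs are placeholders, and joining the ID and the fragment label with an underscore is an assumption, since the exact format of the ADI spot ID on the real chip is not spelled out here.

```python
from collections import Counter

# Hypothetical (gene ID, exon fragment) pairs; "1o2" stands for exon 1 of 2
records = [
    ("GENE_A", "1o2"),
    ("GENE_A", "2o2"),
    ("GENE_B", "1o1"),
]

# Counting by gene ID alone exposes the duplicates described above
gene_counts = Counter(gene_id for gene_id, _ in records)
duplicated_genes = sorted(g for g, n in gene_counts.items() if n > 1)

# Combining gene ID and fragment gives one unique key per printed spot
# (the underscore separator is an assumed convention, not the chip's own format)
spot_ids = [f"{gene_id}_{fragment}" for gene_id, fragment in records]
all_unique = len(set(spot_ids)) == len(spot_ids)
```

Here `duplicated_genes` lists the gene that was printed as two fragments, while the combined keys stay unique, which is why the per-spot ID is the safer column to analyze.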
So if you want to shortlist any antigens, we need to use the G column for analysis. Now let us move on to the next column, which is your description. This describes what was printed on the chip; these are just the names of the antigens. The next column is your organism. As you know, you have two types of spots here, Plasmodium falciparum and Plasmodium vivax, and this column tells you which organism the antigen belongs to. You have Plasmodium falciparum 3D7 here, for instance, and if you scroll down further you will see Plasmodium vivax Sal-1. The next column, which is M, tells you about the preparation of the spot. For example, the first few spots are your IgG mix, so the preparation is IgG mix and not an IVTT spot. If you scroll down further you have anti-human IgG, and further down you have certain purified proteins, which are nothing but control proteins. Our control spots were printed as purified proteins and not as IVTT spots. If you scroll down further you will find all your other spots; the antigens you are trying to study are all printed as IVTT spots. So this entire column M gives you details about the spot preparation. All the other columns are your patient samples. Let us consider the first sample. This is a positive control which was part of batch 1, set 1, slide 1 and pad 1. Let me take you back to the experiment: it was performed in two batches of two sets each. So you have batch 1 set 1, batch 1 set 2, batch 2 set 1 and batch 2 set 2.
What this tells me is that this particular positive control was probed in batch 1, set 1, on slide 1 and pad 1. That was just a positive control, so let us go to the next one, which is a real sample. This was probed in batch 1, set 1, slide 1 and pad 2. This tells me the position of the sample, so if I ever want to go back to the slides and check the images of the spots, I know exactly where to go. For example, if some sample is giving me very high intensity signals and I want to check whether they are real, I will know exactly which file to open, because I have all the details here. That covers all the other columns. I hope you have now understood how the Excel sheet looks. The next thing we are going to do is apply a color gradient to this sheet, and then I will tell you why. For this I go to Conditional Formatting, then Color Scales, then More Rules, and I choose a 3-color scale and the Number type. I am going to assume that my entire data falls between certain negative values and a maximum of around 80,000. I am going to split my data based on three numbers: 0 as the minimum, 20,000 as the midpoint and 40,000 as the maximum. Then I choose some colors: gray for the minimum, black for the midpoint and red for the maximum.
What this does is show all my values above 40,000 in dark red, values around 20,000 in black, and the lowest values in gray, while the negative values will be almost white. That is how I apply a color gradient here. In this slide I have zoomed out of the Excel sheet a little, so you can see all four sets at once. I do not know if you can see a black line here; it separates set 1 from set 2 of the first batch and set 1 from set 2 of the second batch. So this is batch 1 set 1, batch 1 set 2, batch 2 set 1 and batch 2 set 2. When you zoom out and apply this color gradient, you can see that the signal intensities for batch 2, set 1 are really high compared to the rest. You know that mainly from the IgG signals. This line here is all your IgG mix spots, and this line is your anti-human IgG. This is the control that tells you whether you need to re-scan your slide or not. If it is very high, then all your signals for this particular batch will also be high. That is going to distort your results later, because all the patients in this batch will show high signals, which is not correct. So the IgG mix printed on the chip helps you decide whether you need to re-scan your slides at different PMT and power settings.
What we will do here is re-scan the slides once more and bring these settings down a little, not as low as this, but a little lower, because this is also a little high compared to this. Later on we realized that this was because of the membrane thickness of the slides; there could be other issues that you might encounter as well. To avoid this, you first need to bring down the signals, and any differences that remain will be corrected by normalization. Now, having re-scanned all the slides, you can see in the slide that the signals look fairly uniform. It is still not perfectly uniform, and you will still feel that batch 2, set 1 has higher signals, but overall it is fine because normalization will take care of this. So now we will proceed with normalization using Excel. There are two strategies I am going to talk about today. The first is a very simple normalization method that we will use only for visualization, for example if you want to prepare heat maps. If you want to perform statistical tests, we will use the second kind of normalization, which I will come to later. Let us first go through the first method. In the first normalization, we subtract the sample-specific median value of the no-DNA controls from the raw value of each IVTT spot. I am sure this is a little confusing, so we will go step by step. First I am going to show you what the raw values are, and then what the no-DNA controls are. We come back to the same Excel sheet. It is color coded and we have reached this stage. You also know that in this data we have IgG mix, anti-human IgG and purified proteins.
We do not need any of those right now for our analysis. We are going to go directly down to the IVTT spots. In fact, we will just delete those rows to avoid confusion. Let us start from here. I have zoomed in a little. We are going to delete the unwanted rows now: we do not want IgG mix, anti-human IgG or purified proteins. Let me tell you again that we do not require the purified proteins in the analysis, but they are important when, for example, your slide has not worked at all or you have not got the signals you required. You can always go back to the positive control spots to see what their signals were. They are used for such checks, just as controls. So right now we delete those rows and keep only the rows that are IVTT spots. That gives us 500 Plasmodium falciparum IVTT spots and 515 Plasmodium vivax IVTT spots. Having deleted the unwanted rows above, we go down, and there are a few more rows below that we do not need. After these 1015 spots there are a few more, like TTBS, which is nothing but your buffer spot where only buffer is spotted, then some empty spots, and then data for blank. This is also unwanted, so we delete it as well. What we have now are 1015 IVTT spots and 24 no-DNA control spots. What are these no-DNA control spots? These spots have the entire IVTT mix except the plasmid. What you expect here is no expression, because you do not even have the plasmid. The IVTT spots have the entire IVTT machinery, just like the no-DNA spots, but they also have the plasmid from which you express your gene of interest.
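The same row clean-up can be sketched in code by filtering on the preparation column instead of deleting rows by hand. The labels and values below are hypothetical; the real sheet's column M may word the preparation types differently.

```python
# Hypothetical preparation labels; the real sheet's column M may word them differently.
KEEP_PREPARATIONS = {"IVTT", "No DNA"}

# Hypothetical rows covering each spot type mentioned in the demonstration
rows = [
    {"name": "IgG mix 1", "preparation": "IgG mix", "signal": 41000.0},
    {"name": "anti-human IgG", "preparation": "anti-human IgG", "signal": 38000.0},
    {"name": "control protein", "preparation": "purified protein", "signal": 5200.0},
    {"name": "antigen_1", "preparation": "IVTT", "signal": 1800.0},
    {"name": "no DNA 1", "preparation": "No DNA", "signal": 760.0},
    {"name": "TTBS", "preparation": "buffer", "signal": 40.0},
    {"name": "blank", "preparation": "blank", "signal": 12.0},
]


def keep_analysis_rows(rows, keep=KEEP_PREPARATIONS):
    """Return only the rows needed for normalization, leaving the input untouched."""
    return [row for row in rows if row["preparation"] in keep]


analysis_rows = keep_analysis_rows(rows)
```

Filtering into a new list rather than deleting means the IgG and purified-protein controls are still available if you need to go back and check whether a slide worked.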
What the no-DNA spots provide is your background signal. So in the first type of normalization we are subtracting the background from our raw signals. There are 24 such spots, which, as you remember, we rearranged, and that is why they are grouped together like this. The first thing you do is take the median of these, which I have already provided here. This is the formula for it; I have done the whole thing in Excel. Now we have a median value. For this particular sample, which is in column N, I subtract that median from the value of each and every IVTT spot. I do this for each of the spots you see here, and that is my sample-specific median normalization. We will do this in the same Excel sheet; I have kept space for it here. This is called IVTT spots minus median of IVTT control. That is exactly what we have to do: we type the equals sign, then go to that particular spot, say for the first patient, then type minus and go to the median value, which is 7842. Because I want this row to remain constant throughout, I put a dollar sign in front of the row. This is what we get, and now I just drag this across as well as down. Once you drag and drop, this is the kind of sheet you get. I have zoomed out here, but if you apply your color gradient, this is how it looks overall. This is what you can now use to make your heat maps. I have also sorted this by antigen and by patient group, splitting the falciparum-positive and vivax-positive patients completely, and I have made another Excel sheet sorted by age. This is how I have sorted them.
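The sample-specific median subtraction described above comes down to one line per sample. Here is a minimal sketch for a single patient column, with made-up numbers and only 3 control spots instead of 24; the function name is hypothetical.

```python
from statistics import median


def subtract_no_dna_median(ivtt_signals, no_dna_signals):
    """Subtract the sample's own no-DNA median from each IVTT spot value."""
    background = median(no_dna_signals)
    return [value - background for value in ivtt_signals]


# Hypothetical signals for one patient (one column of the sheet)
no_dna = [700.0, 800.0, 900.0]    # median is 800
ivtt = [1800.0, 650.0, 12800.0]

normalized = subtract_no_dna_median(ivtt, no_dna)
```

Spots below the background come out negative, which is fine for a heat map but is one reason this version of the data is not the one carried into statistical tests.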
I have put all the PV-positive patients together and all the PF-positive patients together, and I have also sorted by age. You can sort your Excel sheet in different ways, and you can also use other software to make your heat map. But once you normalize the data in this way, you do not perform any statistical analysis with it. For statistical analysis I am going to show you the next normalization method, which is the log2-transformed fold-over-control normalization. I am not going to show you the entire method again, because now you know how to do it in Excel. I am only going to show you the steps. The first thing we do is set a floor of 100. This means that every value below 100 is set to 100, which removes all the negative values from my data. I have done that here for you, and we keep scrolling right. The next step is to divide each and every raw value by the median of the IVTT control spots. Previously we subtracted the median of the IVTT control spots from the raw values; this time we divide by it. That is what is called fold over control. Once you have set a floor of 100 and divided, the next thing you do is convert the whole data into log values. You log2-transform the entire data, and that is why it is called log2 fold over control. Once you do this, the data can be used for any statistical analysis, because this normalization is known to be more stringent. Now, you can either use programming to do your statistical analysis, or you can use different software packages that provide statistical tests.
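The three steps (floor at 100, divide by the control median, take log2) can be sketched as follows for one sample. One assumption to flag: the floor is applied here to the control spots as well as to the IVTT spots, on the reading that it is applied to the whole sheet; the numbers and the function name are made up for illustration.

```python
import math
from statistics import median


def log2_fold_over_control(ivtt_signals, no_dna_signals, floor=100.0):
    """Floor the signals, divide by the no-DNA median, then log2-transform."""
    # Assumption: the floor is applied to the control spots too
    floored_controls = [max(value, floor) for value in no_dna_signals]
    control_median = median(floored_controls)
    return [math.log2(max(value, floor) / control_median) for value in ivtt_signals]


# Hypothetical signals for one patient
no_dna = [400.0, 50.0, 1600.0]   # floored to 400, 100, 1600, so the median is 400
ivtt = [1600.0, -30.0, 400.0]    # the negative value is raised to 100

log2_foc = log2_fold_over_control(ivtt, no_dna)
```

A value of 0 means the spot is at the background level, 2 means four times the background, and the floored negative spot ends up at -2 instead of being undefined.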
What you need to know, though, is which type of test to use, which is beyond the scope of this lecture. You can always read about what you want to do and then decide which software to use. For example, GraphPad Prism is an excellent tool for preparing graphs, and it also helps with a lot of statistical analysis. But our data, although not very large, is still not small for GraphPad Prism, which can get stuck in the middle even with data of this size. Of course, if you have bigger data sets, it is very difficult to use software like GraphPad Prism. However, if you have only 10 or 20 patients with 40 proteins or so, GraphPad Prism does offer you a lot of statistical tests. Apart from that, there is other software as well. You have MetaboAnalyst; though it is for metabolomics data, it has a module called Significance Analysis of Microarrays, which you can explore for your microarray data analysis. But R, Python and other programming tools will definitely be much better for your analysis, and you will save a lot of time as well. We are coming towards the end of the lecture, so I am only going to show you a very small analysis I have done in Excel. My aim here is to identify the most seroreactive proteins on my chip, that is, the proteins which elicit the maximum antibody response in malaria patients. You can get this list of the most seroreactive proteins in Excel using a particular formula, which I will show you now. Let us go back to the Excel sheet. This is how our sheet was before we removed all the rows which are IgG and anti-human IgG. We retain them for now, and I will zoom in a little.
You can see that I have retained all the rows. Of course we do not need them, but I have still retained the entire sheet from the beginning. You will see that there are 4 patients who are deliberately kept out of the analysis. If you look here, they are marked PF plus PV. These are my patients who were diagnosed with mixed infection, and I do not want any such patients in my analysis. I am going to have purely Plasmodium falciparum and Plasmodium vivax groups, and I am going to look at their responses to Plasmodium falciparum antigens and Plasmodium vivax antigens respectively. So I have kept these mixed patients out; if you want, you can also delete them. We need to start from row number 82, because that is where the IVTT spots begin. The first thing I do is take an average for each and every spot. Let us write "average" here, type the equals sign and enter the average formula. I get an average value, and I just drag this down. Now you have an average value for each and every spot across all the patients. What I missed telling you before is that the previous sheet had many more columns, because we had a lot more samples probed on the chips. For example, we had positive controls, which are samples taken from a highly malaria-endemic region, so you know that those samples have to give you a signal. Those are my positive control samples. Do not confuse positive control samples with positive control spots; they are totally different. I have excluded these positive control samples from this particular analysis.
We also had healthy controls, which are malaria-naive individuals, meaning patients who were not diagnosed with malaria at the time of admission, so they were malaria negative. Those patients were also probed on the chip just to see whether there is a difference in response. I have also removed them from the analysis. There are also certain samples which I probed repeatedly, in duplicate or four times, once in each of the sets, just to check for reproducibility. Here are some scatter plots where I am showing patient-to-patient reproducibility, that is, reproducibility between my batch runs. All of these repeats I have also removed from the analysis. In this Excel sheet I now have 200 patients: 100 Plasmodium vivax patients, 96 Plasmodium falciparum patients and 4 with mixed infection, which I have also kept out. In this way you can choose to remove rows and columns based on what you want to study and make your Excel sheet less complicated. That is what I had missed mentioning. So, I have taken an average. Now I am going to apply the formula you see here: if the average value for a particular spot is greater than the mean plus two standard deviations of the no-DNA controls, then that particular antigen is seroreactive. I take the mean of the no-DNA spots plus 2 times their standard deviation, and this is my cutoff. If the value for a spot is greater than this cutoff, then that antigen is seroreactive. So I write an IF function: if this spot is greater than the cutoff, then 1, else 0.
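The IF rule just described can be written out as a small sketch. The spot averages and control values below are invented, and the sample standard deviation is used as an assumption, since the demonstration does not say which Excel standard deviation function was applied.

```python
from statistics import mean, stdev


def seroreactivity_cutoff(no_dna_signals):
    """Mean of the no-DNA controls plus two sample standard deviations."""
    # Assumption: sample standard deviation, as Excel's STDEV.S would give
    return mean(no_dna_signals) + 2 * stdev(no_dna_signals)


def flag_seroreactive(spot_averages, cutoff):
    """Return 1 for spots whose average exceeds the cutoff, else 0."""
    return {spot: 1 if value > cutoff else 0 for spot, value in spot_averages.items()}


# Hypothetical group averages per spot and no-DNA control values
no_dna_averages = [100.0, 200.0, 300.0]   # mean 200, sample SD 100
spot_averages = {"antigen_A": 950.0, "antigen_B": 380.0, "antigen_C": 120.0}

cutoff = seroreactivity_cutoff(no_dna_averages)
flags = flag_seroreactive(spot_averages, cutoff)
```

With these numbers the cutoff is 400, so only the first antigen is flagged with a 1. As in the sheet, this is a shortlisting rule and not a statistical test.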
In the sheet, I get 1 here, and what I finally get is an Excel sheet like this, with ones and zeros scattered through it. All of these ones I am now going to call my seroreactive proteins, because they are greater than the mean plus two standard deviations of my control spots. A lot of people also use healthy controls in their analysis, comparing signal intensities in a malaria group versus a healthy population. Since we do not have that here, I am simply going to say that this is the list of my seroreactive proteins, which I can now take forward for further analysis. This is not a statistical test; I am only shortlisting my proteins from 1500 to a handful which I can then take forward and study. That is what that sheet is. I have done this for a group of patients, but what if I want to check this for a particular patient? That is my antibody breadth, which you see here. I have done this individually for every single patient; let me zoom in a little. The previous sheet used the average of a spot across a group of patients as well as the average of the no-DNA controls. Here I have done it for each patient, which means it is sample specific. If I scroll down, you get the number of seroreactive antigens per patient. For example, one patient here is seroreactive to only 12 antigens, whereas another patient is seroreactive to 77 antigens. This is my antibody breadth. These are the two basic kinds of analysis which I can show you in Excel for now.
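Antibody breadth per patient is the same cutoff applied sample by sample, followed by a count. The sketch below uses two invented patients with four antigens each; the names, values and the use of the sample standard deviation are all assumptions for illustration.

```python
from statistics import mean, stdev


def antibody_breadth(ivtt_signals, no_dna_signals):
    """Count the spots above this patient's own mean + 2 SD no-DNA cutoff."""
    cutoff = mean(no_dna_signals) + 2 * stdev(no_dna_signals)
    return sum(1 for value in ivtt_signals if value > cutoff)


# Hypothetical per-patient signals; each patient uses their own control spots
patients = {
    "patient_1": {"ivtt": [900.0, 150.0, 2500.0, 90.0], "no_dna": [100.0, 200.0, 300.0]},
    "patient_2": {"ivtt": [450.0, 300.0, 310.0, 120.0], "no_dna": [100.0, 200.0, 300.0]},
}

breadth = {
    name: antibody_breadth(data["ivtt"], data["no_dna"])
    for name, data in patients.items()
}
```

Each patient ends up with a single number, the count of antigens they react to, which can then be compared against age or any other clinical information.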
The power of microarray technology is that you can perform the experiment very fast, probably in a day or two. Then all you need to do is map the microarray data to the clinical information you have for each and every patient, and you can perform any kind of statistical analysis and generate several results from the same single experiment. That is the beauty of this. I hope you have got a glimpse of how to perform data analysis and how basic statistics can be done. This is not the only way to do statistics at all; you can use software and programming. I still recommend programming, because if you want to make even a single small change, you do not have to repeat the entire analysis. Also, if tomorrow somebody provides you some other clinical information on the same patient population, you do not have to repeat the analysis in Excel; you can simply write code for it and get results in a few minutes. That is all for now. Thank you.
After going through this demonstration session and its insights into microarray-based data analysis, you must have realized that there are many ways of analyzing and representing microarray data. Of course, there is no single correct way of doing the analysis. There are many considerations to keep in mind when you are thinking about how to extract meaningful information from this high-throughput data. Several questions can be answered using microarray data, provided your data passes the quality control checks and is properly normalized. In such experiments your control features become very crucial. Both the positive controls and the negative controls guide you about how accurate and real your data is, and with proper analysis methods they help distinguish real signals from background noise. In the next class you will see another application of protein microarrays. We will shift gears to cancer research and also to a different platform. So far we have talked about the cell-free expression-based protein microarray platform. We will now talk about how purified proteins are printed on the chip in human proteome arrays, and how to apply those to investigate a deadly disease, cancer. We will talk about both experimental demonstrations and the theoretical concepts involved in performing such biological experiments. See you in the next lecture. Thank you.