Today, Professor Santosh Noronha from IIT Bombay will continue his lecture on considerations for data analysis, especially for omics data sets. Today's lecture is about why a basic understanding of data analysis is required. For example, a 5 percent accepted error rate in significance testing, used without a basic understanding of the data, may result in false interpretations. Professor Noronha will also talk about the importance of replicates and how one should choose controls, which are some of the most important samples in big-data or omics-based experiments. So, again, thinking about a good experimental design, what the replicates should be and what your strategy for data analysis should be actually determine whether your experiments make meaningful sense; despite all the advancements in these technologies and the pace at which we can generate data, getting meaningful results is still not straightforward. I hope today's lecture, together with the previous one, will illuminate these concepts for you: good experimental design and the considerations to look for in order to get meaningful insights from your data. So, let us welcome Professor Santosh Noronha.

Think of a researcher who systematically tested many possible candidates for significance. With a 5 percent error rate in your analysis, you would have randomly found a candidate and called it significant, and we end up fixating all our energies on the one or two candidates that turn up when it is sheer randomness that has caused them to turn up. So, unless you have an independent way of carrying out an analysis with these candidates and validating that they are important to you, it is pointless proceeding further. Now, at this point, if you are in the publishing game, it is very important to notice that journals generally do not allow you to publish negative results. So, all these other findings you cannot publish; the only thing you can publish is this one positive result. So, there is pressure on you to find that needle in a haystack as a positive result and publish it, and that is the nature of confirmation bias and p-hacking, which pushes you into focusing your research entirely on this particular candidate, the green candidate, as if it were the only relevant one.

So, what is the way around this? Again, something that we typically do not do is an adjustment to this. A 5 percent error rate in an analysis is a dangerous thing if you are doing 10,000 studies; I want you to appreciate that 5 percent at a different level. Take any 100 published papers in which a scientific hypothesis is being tested, and I can tell you, without even reading those papers, that about 5 of them have got to be false, because all of them have used a 95 percent confidence level for the analysis, and if there is a 5 percent sheer-bad-luck error rate, then several of those researchers have unfortunately suffered from randomness in the data they collected, which means their results are going to be false. It is not that they set out to cheat; it is unfortunate that they are unable to reproduce their own data and that they are in a rush to publish. So, the trick, the way to control for this in some ways: how do I reduce my error rate?
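To make that concrete, here is a minimal simulation sketch, not from the lecture itself; it assumes numpy and scipy are available and all the values are synthetic. It runs thousands of tests on pure noise and counts how many pass a 5 percent threshold.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_tests, n_per_group = 10_000, 5

# Every "gene" here is pure noise: control and test samples come from the same
# distribution, so the null hypothesis is true for all 10,000 tests.
control = rng.normal(0.0, 1.0, size=(n_tests, n_per_group))
treated = rng.normal(0.0, 1.0, size=(n_tests, n_per_group))

pvals = stats.ttest_ind(control, treated, axis=1).pvalue

# At a 5 percent threshold, roughly 500 "significant" candidates appear by chance alone.
print((pvals < 0.05).sum())
```

Roughly 500 of the 10,000 candidates look significant even though nothing real is going on, which is exactly the needle-in-a-haystack trap just described.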
So, if my error rate is this red portion, the 5 percent, and I want to reduce my error rate, I have to move my goalpost further out; that is the only solution. Of course, it now gets harder: the candidates which pass this wider goalpost have to be extreme ones; what you are saying is that your results must be so extreme that they are well outside this wider goalpost range. So, what do you do if you are going to do 10,000 tests? Each test should not have been done at a 5 percent level; instead, each test should have been done at 5 percent divided by 10,000 if I am doing 10,000 tests. That 5 percent error rate should be spread across the 10,000 tests that you are doing, and if you were to attempt that, this area is now 5 percent over 10,000, a tiny area, but I have effectively pushed my goalposts out. The odds of randomly passing my test are now much, much lower, and that is a core trick in statistical analysis: it is called a Bonferroni adjustment. Good software packages for omics testing will have this as a setting where you can correct for the number of tests that you are doing, and it is a critical thing. In other words, one of the things you ought not to be doing in an omics framework, where statistical tools are being provided to you by the manufacturer, is to use the default settings in a workflow. You have got to ask: what are the settings which control for statistical significance, and do these need to be tweaked to correct for the number of tests you propose to do on that software?

There is also this aspect of the power of a test, and what I want you to appreciate is that all the emphasis on asking whether a gene candidate is significant or not involves looking only at this particular curve. If you look at this particular curve, and forget the other curve for a moment, then your 95 percent confidence leaves out this blue area on either side; at 5 percent, the blue area would have been 5 percent. But for the sake of argument, I am going to pretend that the reality was some other hypothesis, under which this would have been the mean value and this would have been the range of outcomes I would have seen if that other hypothesis were true. Now you will see something problematic happen. If the null hypothesis were true, then this blue area corresponds to the percentage of the time you are going to get your hypothesis outcome wrong: 5 percent of the time we are getting the outcome wrong under the null hypothesis. What do these thresholds mean for you? Within these thresholds you say this hypothesis is fine, I am in agreement with it; outside those thresholds you end up saying, I do not believe in this hypothesis, I will go with the other hypothesis, in this case H1. If you now look at H1, H1 is allowed to be true only from this coordinate to the right; beyond that region you believe in H1, and to the left you have already argued that you prefer to go with H0 as the hypothesis. But do you now see that, under H1, this area in purple corresponds to an error: H1 could have been the true hypothesis, you have gone with H0, and therefore, as a technicality, you are committing a mistake by saying H0 is true when H1 should have been true.
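As a rough sketch of that adjustment, before we come back to the second kind of error, here is how the Bonferroni correction can be applied to the simulated p-values from the previous snippet; this assumes the statsmodels package and is only illustrative.

```python
from statsmodels.stats.multitest import multipletests

# Reusing the simulated p-values (pvals) from the previous sketch.
alpha = 0.05
reject_raw = pvals < alpha                                      # ~500 false positives
reject_bonf, pvals_adj, _, _ = multipletests(pvals, alpha=alpha, method="bonferroni")

# The Bonferroni-corrected threshold per test is alpha / n_tests = 0.05 / 10,000 = 5e-06,
# i.e. the goalpost has been pushed far out, so almost nothing passes by chance.
print(reject_raw.sum(), reject_bonf.sum(), alpha / len(pvals))
```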
So, there is a mistake either way; it is just like false positives and false negatives. In fact, it is related to the concept of false positives and false negatives: you will make one mistake or the other. If you were to create a diagnostic kit and you are going to change the threshold for detection of a particular measurement in trying to cut down the false positives, just think about this: if I take this threshold and move it to the right, what happens to the blue area under the H0 curve? The blue area goes down; I commit less of a mistake with respect to my original hypothesis. But if I move this coordinate to the right, what happens to the purple area? The purple area grows. If you are trying to minimize false positives in your analysis, you run the risk of increasing false negatives, and vice versa. So, that is a key issue.

The headache comes about because, in what we did previously, we only paid attention to one curve; we did not ask how the other hypothesis might behave, which is the case here. If you start paying attention to the alternate hypothesis, you suddenly realize that yes, I might have a diagnostic kit, for example, which is accurate 95 percent of the time, which is what the confidence interval tells you, and therefore 5 percent of the time I am making a mistake of a certain kind, let us say falsely calling somebody positive. So, false positives. But what it is not giving me any information about is my false negative rate, and for the false negative rate you have to be looking at the other curve, and that is this beta. In other words, you want beta to be small, just as you want alpha to be small; 1 minus beta is called the power of a test, and it is good practice, whenever you claim that something was a significant candidate, a significant target, not just to tell me how significant that result was but also how powerful that test was. In other words, tell me what this value beta is, because I might actually have a lousy test. This power of a test is a concept which is buried somewhere in the software, typically as an option for you to report, but it is not something researchers are in the habit of reporting. So, when somebody tells me they have found a candidate, in fact a shortlist of candidates which are all significant, what they are not telling me is how powerful the analysis was, and therefore what the probability was that they got the analysis wrong the other way around. While they are telling me that they are confident to within 95 percent, what they are not telling me is whether the purple area was as bad as 20, 30 or 40 percent. If one of these values is greater than 10 percent, your analysis is already in trouble. So, neither alpha nor beta, neither of these two shaded areas, can be allowed to be large, because they are both errors in your interpretation.

And here are cartoons which quickly illustrate the point. If you are trying to distinguish two hypotheses and they are so similar to each other that the effect size is small, you will have such an overlap between the predictions coming out of the two hypotheses that you are unable to discriminate and say which hypothesis is true; you will not be able to do that.
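To get a feel for how effect size, the number of replicates and the significance threshold trade off against each other, here is a minimal power-calculation sketch; it assumes the statsmodels package, and the effect sizes and sample sizes are purely illustrative.

```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Power (1 - beta) of a two-sample t-test for a small and a large effect size,
# at different numbers of replicates per group.
for effect_size in (0.2, 0.8):
    for n in (3, 10, 30):
        power = analysis.power(effect_size=effect_size, nobs1=n, alpha=0.05)
        print(f"effect={effect_size}, n={n}: power={power:.2f}, beta={1 - power:.2f}")

# Tightening alpha (say 0.01 instead of 0.05) lowers the power further,
# which is the alpha/beta trade-off described above.
```

Small effect sizes and few replicates leave the power low (beta large), and tightening alpha makes it worse, which is exactly the interlinkage discussed next.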
If you have a bunch of replicates, and I am not getting into the math of this, your curves get thinner. If your curves are thinner, the spread here is narrow and the overlap is reduced compared to before. So, in a nutshell, you want more replicates of any analysis that you do; otherwise your errors are going to be large. If you are going to make your confidence level larger, if somebody says you really ought not to be talking about 95 percent confidence, you need 99 percent confidence, then the immediate outcome is that the moment you move your thresholds further out, this purple area, which was small here, becomes large here. So, that worsens the case. There is no clean solution here: the moment you try to improve the situation in one sense, something else in your analysis worsens somewhere else, and my point is that everything is interlinked, and you therefore ought to be talking about the significance of a result as well as whether the test being done is a powerful one.

This is a famous study, and it is really worth looking it up on your own later on. Amgen is one of the two top biotech companies in the world; they are a dominant manufacturer of biopharmaceuticals, protein drugs for the most part, and while initially they worked on things which were already discovered in research labs, increasingly they have been doing their own research, trying to find out what the next generation of pharmaceuticals to be manufactured is. They obviously keep track of the literature. So, one of the things they did was to take 53 landmark papers published in the top journals in oncology and hematology. These are publications coming out of MIT, Caltech, Stanford, Berkeley, the top labs and universities in the world. Their logic was: these are all published in the top journals, so let us repeat these results in-house, and if, as published, these candidates are good candidates, let us get into the business of manufacturing them; that is where they were going. Out of 53, they could reproduce only six papers, and this is MIT, Caltech, Stanford; we are not talking about some tiny college somewhere. So, what is going on? It is not that people at MIT and Berkeley and Caltech were deliberately cheating, but there is a situation where results coming out of even these top labs cannot be reproduced. So, why do you think they cannot be reproduced?

In the six cases where the results could be reproduced, when you look carefully at what happened, attention had been paid to doing the right controls in the experiment. You need the right controls; you do not make claims about results based on only a test case, you do the controls. The reagents were reproducible, and this most of you doing wet-lab experiments will recognize: reagents, especially in the immunology space, are hard to obtain in a reproducible fashion; with antibodies in particular, batch-to-batch variation exists and you are unable to reproduce results. In those six cases, there was the ability to manufacture these reagents reproducibly, and that made a huge difference. The investigators were not biased; they were not trying to push for a particular insight or outcome, and, importantly, they were honest about reporting all their data. You remember that straight-line plot where I deleted the mid-section of the data and you then claim a better result than it actually is.
They were honest enough to report all their data, which meant that when somebody tried to reproduce it, they also saw some bad data, much like what the original authors had found. It is a surprising result. It tells you to what extent there is pressure on people to publish positive insights, even at the top labs, and the moment this study came out and this pharma company published this insight, many companies started following suit. So Bayer, another big pharma major, did a similar study: they looked at 67 targets published in the literature, and out of 67 they could reproduce 14 results, which tells you this is a serious problem.

Now, this problem goes beyond statistics alone. You can argue that a lot of that is bad luck, with data not being reproducible because of the one time you did the experiment with that one material, so that inherently it is not a reproducible experiment. But it also reflects other aspects of poor design. Here is one experiment of poor design, where you are screening for certain drugs to do with epigenetic control in a glioblastoma system. There were two ways this finally got done: an in vitro screen, basically relying on an RNA-interference kind of protocol to try to identify targets, and an in vivo screen, where you are directly loading these cells into the brain and then looking for changes in function. The in vivo screen and the in vitro screen have practically no overlap in terms of what is upregulated and what is downregulated. Which means that if you had just done the in vitro experiment, generated a bunch of targets and then proposed to design drug candidates against those targets, you would have wasted a huge amount of money. There have been several such cases; if you start going through the literature you will see these things, although it is not something journals conventionally publish, so some of these critiques, ironically, appear in blogs rather than in the journals themselves.

There have, for example, been studies looking at gene signatures predicting the response of breast cancer to chemotherapy, and in this case the problems are even more ridiculous. This is another strategic problem with handling large data sets. One of the things that happened here, and they finally traced it to a student, this was at Duke University, is that when this omics data set was finally taken into a spreadsheet for subsequent analysis and the data set was sorted, many columns got sorted but one column did not get sorted. Now every value is being assigned to the wrong gene label, the wrong gene ID. This was one mistake which happened very early on, in a rush to carry out the analysis. Nobody followed it up, and it went through an entire bioinformatics analytics pipeline. Drug candidates were created, probe sets were created, and three clinical trials were started on human patients on this basis; a huge amount of money was spent by the NIH, and running clinical trials, you will appreciate, can sometimes mean billion-dollar experiments. Millions of dollars later, because this was an early-stage clinical trial, the Duke researchers went back to the NIH and said our results are not reproducible: these candidates, despite the bioinformatics, do not seem to work in reality.
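As an aside, here is a minimal, entirely hypothetical sketch (assuming pandas, with made-up gene names and values) of how that kind of spreadsheet mistake arises: sorting one column on its own detaches the values from their gene labels, whereas sorting whole rows keeps them attached.

```python
import pandas as pd

# Hypothetical expression table: gene IDs plus two measured columns.
df = pd.DataFrame({
    "gene":        ["TP53", "BRCA1", "EGFR", "MYC"],
    "fold_change": [1.2, 3.5, 0.4, 2.1],
    "p_value":     [0.30, 0.001, 0.02, 0.04],
})

# Safe: sort whole rows together, so values stay attached to their gene labels.
safe = df.sort_values("fold_change")

# Dangerous: sorting one column in isolation (the spreadsheet equivalent of sorting a
# selected column without extending the selection) silently detaches values from genes.
broken = df.copy()
broken["fold_change"] = broken["fold_change"].sort_values(ignore_index=True)
print(broken)   # fold changes no longer belong to the gene in the same row
```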
And the hard question got asked: why? Show your lab notebooks. You go all the way back, you look at the printouts of these spreadsheets, and you suddenly realize one column has not been shuffled and sorted, and it comes down to a simple IT mistake which has wasted a lot of time and money. These were clinical trials; it could have been worse if people had died or been seriously hurt as a consequence of the trial, because you are actually playing around with therapies, or proposed therapies. It could have been much, much worse in terms of how this would have hurt the university and the researchers. So I am giving you a bunch of links here; really, what I want you to appreciate is that leaving data analysis and statistics as an afterthought to a bioinformatics pipeline is a dangerous business. Remember that phrase I came up with: hypothesizing after the results are known. There is this philosophy that getting the data is the hard thing, therefore all the effort goes into getting the data, and once you have the data you say you will actually get down to doing the science. But really it should have been the other way around: a robust experimental method and a computational method should have been identified before the experiment was even done, and then you report whatever results you get as per that methodology.

In fact, the whole publishing paradigm in the omics space is going to change now. For example, journals like Nature are already starting to follow an altered publication protocol: they are so concerned about the fact that data is generated first and hypotheses are created afterwards that they are saying the entire review process must now happen in two stages. In stage one, before you even do your experiments, you actually try to publish a paper, and this is a weird thing: you try to publish a paper by submitting to the editor a protocol that you wish to follow. You say, look, I wish to work in such and such a system, these are the experimental methods I am going to use, and this is the statistical analysis I propose to do once I get the data. And you want a bunch of reviewers to look at this protocol and tell you in advance whether it is correct or not. Why would you do this? Because, remember, we are under pressure to publish positive results and not negative results. So how do you take away that pressure? One way is to let your methodology be accepted by the peer group, the editors and reviewers, and at that point, regardless of whether your results are good or not, they publish the paper. So you get guaranteed publication of the paper once the protocol is approved. Therefore, before you even publish, you state how you are going to assess your data sets: what the hypothesis is, what protocol you are going to follow, both experimental and computational, and what your detailed analysis plan is for interpreting which gene sets are important to you and which are not. If these are of acceptable standard, you are guaranteed publication, and afterwards you publish the data that you actually generated, both good data and bad data, because now there is no penalty for publishing bad data. One argument against this has been that if you have to announce your analysis protocol in advance, does that still allow you to do more creative analysis later on?
Because you are forced, you are locked into some kind of analysis already, because that is what got approved. But the reality is, as long as you label any extra analysis that you do, it is in fact called a post hoc analysis, as long as you flag in your publication that it was done afterwards, it is still acceptable to the peer-review community. So this is a game changer in the way the omics enterprise is potentially going to function down the road. At the moment it is small, probably about 30 journals have signed on to this kind of publication paradigm, but it is a community which is so concerned that what it is doing is not reproducible that it is willing to collectively go by this protocol of publication.

So I want to spend the last few minutes on what else might happen. If it is not a good idea to analyze one gene at a time and ask what is or is not important as a candidate, what else can you do? One of the things that is lost to you when you analyze one gene at a time, basically in any system when you analyze one component at a time, is sight of the linkages between the components. It is like saying that in a car I have got each component and I will study each component separately; yes, you then know precisely how a brake works and how an accelerator works, but what you do not know is how the car works given a brake and an accelerator. You do not understand how the system works, and clearly there is some interaction between the brake and the accelerator which finally governs how the system works; these interactions are lost to you if you study things in isolation. So the only computational solution is to go into a multivariate mode: do not study things one at a time, study the whole data set in one shot, not one variable at a time. It turns out there are many ways in which you can do this, and I am just throwing a few buzzwords out there. Some of you, if you have done courses in bioinformatics, will be familiar with things like clustering. Hierarchical clustering is something that, for example, phylogenetics or multiple sequence alignment methods would require, say if you are building some kind of tree of species or of how genes have evolved over time. These are all approaches where whole data sets get interpreted in one shot. And if you start looking through the pattern-recognition literature, and again I am throwing more buzzwords at you, you will realize there is a whole bunch of methods available out there. Some of these may be built into some omics tools, but they are more likely to be present in some statistics toolbox, in which case you have to make the effort to go to the toolbox and figure out what is going on.

Now, trying to find patterns in multivariate data sets can be problematic for many reasons. For example, go back to the omics question where you have carried out an experiment, control versus test case, and you are looking for fold change. You would normally have asked: to what extent is something upregulated or downregulated? If something is upregulated beyond, let us say, two-fold on a log scale, that has got to be significant for you; you are going to make that kind of argument. Now, one of the problems is: why is two-fold an important cutoff and not some other, lower cutoff?
I can give you a simple kinetic argument for why a value of two is arbitrary. Think of two branched pathways: A going to B going to C, and A going to P going to Q. Let us say these are metabolic pathways; there is some metabolism going on, with branched metabolism at A: something goes down one pathway to C, something goes down the other to Q. If you look at fold changes, then if I upregulate something at A, that upregulation of an activity at A cascades into some change for B and some change for C, and similarly for P and Q. Which ones would you expect to be most upregulated as a function of a fold change at A? If I have a two-fold change at A, what can I expect at B and P? B and P can go up five-fold, because typically I have a transcriptional regulator being toggled a little bit, that effect impacts some effector genes a bit more, and it goes further down the chain very quickly. This is what I was saying: if I have gone up two-fold here and you are trying to say this is significant, then it is typically the case in a metabolic pathway that this goes up something like five-fold and this goes up something like fifty-fold, because products then start accumulating even more. And why is nature this way? Because it does not make sense to directly push this up fifty-fold; you would lose fine-tuned control over how things propagate down different pathways, and you want to control the expression levels of each intermediate along the various pathways.

If I asked you to find the species which are most upregulated, you would have told me C and Q are the most upregulated, because they have the highest fold changes. Therefore, if you were to cluster them, if you did not know better, if I did not draw this pathway structure and you simply told me from a spreadsheet that this went up fifty-fold and this went up fifty-fold, one temptation at this point is to assume there is a relationship between C and Q: that C and Q are both part of, let us say, some operon, and that is why the whole operon has gone up fifty-fold. But there is no connection between C and Q; the connections are via A. So if the question being asked was which cluster of affected genes goes up in response to whatever intervention you did, the clusters were not C and Q as one cluster and B and P as another, clustered on the basis of fold changes; remember, that is not a cluster. One pathway should have been one cluster and the other pathway should have been the other cluster, because there is a more obvious biological explanation in terms of a cascade effect of up- and down-regulation as you go down a pathway. You can immediately see, therefore, that any clustering approach which clusters under an assumption of fold change alone is problematic. If you are going to start grouping together candidates or targets on the basis of expression levels alone, that is a problem. So you have got to be looking for relationships. So what is a relationship?
What you want to start looking for is: if this moves up, is something else going up, is something else coming down? And what you want to see is that, across every patient, across every disease condition, these things go up and down in a coordinated fashion; then there is something going on between this bunch, and that bunch deserves to be clustered. There is some genotypic relationship that you are now seeing across these species, because they are ultimately related by one physical process. Now, that is a subtlety, because I am now saying I am not so interested in the raw magnitudes of these up and down fold changes; that is not important to me. What is more important to me is whether, when this goes up, that goes up too, regardless of whether it goes up two-fold or one-point-five-fold, and whether this happens across all patients. Those kinds of pairwise relationships are what I start looking for. And what do we call those? If I were plotting a line between x and y, you would call that pairwise relationship a correlation coefficient, that r-squared value I showed you a while back.

So here, suddenly, is an insight: instead of simply looking at fold changes, asking whether a fold change is important, and then trying to identify targets on that basis, it is sometimes more intriguing to ask whether correlations between pairs of candidates are important and whether that is telling you something. The reason I bring this up is that if I were to somehow plot this data, if you look at the previous slide, those are clusters based on magnitude: the fifty-fold-change genes are all together, the five-fold-change genes are all together, and so on. But if I wanted to look at correlations, that is a different model. Correlation structures are usually far more important in biology than simple fold changes, because a fold change could have occurred through sheer bad luck; that fifty-fold, for example, remember our whole discussion of randomness, could have been bad luck. So instead you need correlation-based analysis; if you are talking about hierarchies, you would rather have species correlating amongst themselves in a hierarchical analysis, so choose something based on correlation analysis. There are methods out there, for example gene set enrichment, where we build clusters based on which genes go together. One statistical tool, which is what I will end with, is something called correspondence analysis, where you look at how things cluster together. Here it has been done for a series of data sets, in this case a medulloblastoma analysis across different types, so what you are seeing are different patient types, different patients and samples, and you are looking for the relationship between them. These are all exploratory methods, where somebody is saying: we have got so many tissue samples across so many patients.
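Here is a minimal sketch of that contrast, assuming numpy and scipy; the branched-pathway values are simulated, not real data. Clustering by raw fold change groups C with Q, whereas clustering on correlation distance recovers the two branches.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
n_patients = 40

# Hypothetical branched pathway A -> B -> C and A -> P -> Q across patients.
# The upstream change at A cascades down both branches, and the split between
# the branches varies from patient to patient.
a = rng.normal(2.0, 0.3, n_patients)            # ~2-fold change at A
split = rng.uniform(0.3, 0.7, n_patients)       # fraction of flux down the B/C branch
b = 2.5 * a * split        + rng.normal(0, 0.1, n_patients)
c = 25.0 * a * split       + rng.normal(0, 1.0, n_patients)
p = 2.5 * a * (1 - split)  + rng.normal(0, 0.1, n_patients)
q = 25.0 * a * (1 - split) + rng.normal(0, 1.0, n_patients)

genes = ["B", "C", "P", "Q"]
expr = np.vstack([b, c, p, q])

# By raw magnitude, C and Q look like one group (both ~50-fold) and B and P another.
print({g: round(float(m), 1) for g, m in zip(genes, expr.mean(axis=1))})

# By correlation distance (1 - r), B clusters with C and P clusters with Q,
# reflecting the pathway structure rather than the fold-change magnitude.
dist = 1 - np.corrcoef(expr)
condensed = dist[np.triu_indices(len(genes), k=1)]
labels = fcluster(linkage(condensed, method="average"), t=2, criterion="maxclust")
print(dict(zip(genes, labels)))
```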
Can you find out how many subsets of medulloblastoma you might find? Where this is going is that nobody knows the cause: how many subgroups might exist within this particular disease condition? Only later do you ask what could be causing each subgroup, or which genes are signatures for each subgroup; the question of how many subgroups exist in the first place is itself an open question. So, if I do hierarchical clustering I find this kind of analysis, and if I use something called correspondence analysis, the same data, literally the same spreadsheet, is plotted in a different way, and I want you to appreciate that there is no one perfect way to do this, which is why in a better analysis you try different ways. Now, in this case the interpretation is slightly different. Here, most of you are familiar with how to interpret this: two nodes in here are very closely related, and related to something else over here. But here you are essentially asking whether things are far away from the centre. At the centre you have an average condition for a tissue sample, and as you move further away, these are all patients, you are deviating from the normal. So distance from the origin matters, and if you are moving along a diagonal away from the origin, all the samples on that diagonal are related. So my notion of a cluster is no longer one nice spherical cloud: this group of patients is a cluster, out here is some other cluster, and there is another group of patients behaving differently which also shows up; there is one cluster here, one cluster here, and another cluster of patients there. Different ways of interpreting this throw out different insights.

What was very useful about this method is that it allows one to plot not just patients; you can also plot genes on the same coordinate system. Remember your data set: you have different patients or sample conditions, and for each sample condition, based on your omics throughput, you have so many gene expression levels or protein expression levels, same logic. Now I am plotting just the genes. Those near the centre are all the normal genes, housekeeping genes probably, marginally changing their expression levels, and you ask which genes are sitting out at the extremes. The genes radially furthest away from the origin are probably doing something interesting, in terms of having their expression consistently go up or down in correlation with patient groups; what is being plotted is not raw magnitude but correlation structure. So these gene candidates are all related to each other somehow, and one insight, by the way, is that when one goes looking, these gene candidates all turn out to be related to one particular signaling pathway, and it is no surprise that they are all nicely correlated with each other: one gene went up, many other genes responded to that signal and went up or down, and they all show up as a cluster along this axis. Another bunch of genes is clustered around here, and so on. What is very powerful about this analytical procedure is that you can then superpose this plot on top of that one, and you then ask: remember the clusters of patients we had, there is a cluster of patients here and another cluster there; now what are the genetic signatures? These genes over here are signatures specific to this cluster of patients.
Now, what has happened here is that rather than test one gene at a time, and we now know the problems of testing one gene at a time, by sheer bad luck five percent of the time we get things wrong, and that can mount up as an error rate if I am doing ten thousand analyses, instead the entire cloud of data, the entire matrix of data, is being analyzed. When you think about it, my patients are columns and my genes are rows in the data set; I am looking at columns of patients, rows of genes, the two superimposed, and all my data projected in one shot. What I learn from this methodology is that one subset of genes is associated with these patients, a different subset of genes is associated with a different set of patients, and so on, and I have already found my clusters and my markers for the subtypes. So it turns out that in the statistics world, at least in the multivariate statistics world, the appropriate methodology for the statistical analysis of this data set already existed; it was just a case of being a little adventurous, going out there and finding out whether there was a method which would more accurately answer the question of what the relevant targets were, rather than simply trusting the least complicated statistical procedure. The least complicated procedure was just one gene at a time, and that procedure is prone to a large number of mistakes, whereas a more robust approach, which looks at all the data in one shot in a multivariate mode, captured the relationships. When we went looking, it turned out there were nice insights about why these genes were part of a signaling pathway and how defects in one particular gene could escalate into this condition, and that has led to better science.

The same thing has been done later on with proteomics data, for classifying different types of infections from blood. I am not sure you can make this out other than by the colours here, but these are healthy patients, blood from healthy patients, and they form a nice cluster on their own. You are looking at falciparum malaria, you are looking at vivax malaria, and you can clearly see there is a differentiation between vivax and falciparum that shows up when I just cluster this data in one shot. So we are able to fundamentally differentiate falciparum from vivax as another malaria type, and in fact we have gone further to ask whether there is a differentiation, for example, from leptospirosis, which are all conditions you would normally see as blood infections causing a high fever. So if somebody wants a rapid diagnosis, here is an approach which does this. I am not showing all the data here, but you are seeing a subset of the gene candidates, and clearly these candidates are capable of differentiating multiple clusters. So, multivariate analysis: it is not a question of analyzing one of these genes at a time; in fact you go the other way around. You analyze all the data in one shot on one plot, ask which gene subsets are important, and then go and ask, for each individual gene, why it turned out to be important. You do not flip it the other way around, asking each gene whether it is important or not and then trying to make a story out of it; instead, the whole data set gets analyzed in one shot, a subset is chosen, and each candidate is reconfirmed as being important one at a time. I am not expecting you to turn into statisticians overnight, but this is more in terms of being aware that there are methods out there.
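For the curious, here is a minimal sketch of correspondence analysis written directly from its textbook definition; it assumes only numpy, and the patient-by-gene counts are made up. It shows how patients and genes end up as coordinates in the same space, which is what makes the superposition described above possible.

```python
import numpy as np

def correspondence_analysis(N):
    """Minimal correspondence analysis of a non-negative patients x genes matrix.

    Returns 2-D row (patient) and column (gene) coordinates in the same space,
    so both can be overlaid on a single plot.
    """
    P = N / N.sum()                                        # correspondence matrix
    r = P.sum(axis=1)                                      # row masses (patients)
    c = P.sum(axis=0)                                      # column masses (genes)
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))     # standardized residuals
    U, s, Vt = np.linalg.svd(S, full_matrices=False)
    row_coords = (U * s) / np.sqrt(r)[:, None]             # principal coordinates
    col_coords = (Vt.T * s) / np.sqrt(c)[:, None]
    return row_coords[:, :2], col_coords[:, :2]

# Hypothetical toy data: 6 patients x 4 genes of count-like expression values.
rng = np.random.default_rng(2)
N = rng.poisson(lam=[[5, 50, 5, 5], [5, 45, 5, 5], [5, 5, 40, 5],
                     [5, 5, 50, 5], [30, 5, 5, 30], [40, 5, 5, 25]]).astype(float)

patients_xy, genes_xy = correspondence_analysis(N)
# Patients far from the origin deviate from the "average" profile; genes lying in the
# same direction as a patient cluster are candidate signatures for that subgroup.
print(patients_xy)
print(genes_xy)
```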
There are several other methods out there which improve the quality of your analysis. So, in a nutshell, there are several approaches, and it is a democratic philosophy: do not trust one method, just as you would not trust one voter; you trust many people to vote for a given candidate, and if there are independent statistical methods which all seem to be voting for the same target, then you have probably found a target. If one method alone is talking about a target, then it is probably bad luck and surely not a significant target. So that is another insight to take from this, and I will stop there.

From today's lecture, I hope you have learnt about the errors created by a lack of knowledge and understanding of p-values. You also studied how the Bonferroni correction can help in reducing false positive candidates in these data sets, and you heard about the role of false positives and false negatives in the search for potential biomarkers with increased sensitivity and specificity. I hope it also reminded you of Dr. Joshua LaBaer's earlier lecture about what makes a good biomarker and the considerations for biomarker discovery programs. So, again, you can see that different experts have the same opinion about experimental design: how to really find the right candidates, the right targets, which could be potential biomarkers or discovery targets, especially by sorting out the false positives and false negatives. I hope these two lectures have made you much more aware of the need for good experimental design and the various crucial considerations in data analysis.

Before I close, let me give you an overall summary of all the lectures we have covered in this course. We started this course from the basic microarray technologies, especially the nucleic acid programmable protein array (NAPPA), and the leading expert in the area, Professor Joshua LaBaer, gave you some very interesting lectures about the basics of this technology as well as its different applications, with more focus on biomarker discovery programs in various diseases. We then learnt how to use NAPPA technology for screening various autoantibodies in different disease conditions, or how to use the same technology platform for drug discovery screening. We also learnt how to use these technology platforms for studying protein interactions and for looking at various types of protein modifications. These examples and applications have broadened your horizon to see that these technologies can be used for the identification of biomarkers and therapeutic targets and for functional proteomics based screening. We also got a chance to look into the applications of other types of array-based platforms, especially reverse phase protein arrays, along with the considerations for making good arrays and good slides by doing the right type of printing. Then different applications of purified protein arrays were shown to you directly through demonstration sessions by my research scholars in the laboratory, using a few projects, where you learnt from examples of malaria and cancer research how these areas can benefit from employing protein microarray based technologies. Next we learnt, very briefly, about immunoprecipitation and the use of advanced mass spectrometry based technologies. Of course, we did not talk much about mass spectrometry in this course, because that was not its scope, but this is one very promising technology which is now helping the entire field of interactomics, indeed the entire field of proteomics,
for various applications. So, of course, you should try to get more advanced training in this area, but with this one application we tried to emphasize that IP followed by MS is a strong platform to identify potential interactors. During these lectures we also tried to give you the idea that different types of label-free biosensors are very important. Label-based technologies may introduce some bias into what the signal looks like: is that a real signal, or is it an artifact? You have to negate many of the false positives, many of those false fluorescence signals, in such experiments. Label-free sensors and label-free technologies have tried to overcome that and look at just the biomolecular interaction in its original state, trying to avoid many of the confounding factors which one may observe in routine microarray based technologies. So I hope technologies like biolayer interferometry (BLI), surface plasmon resonance (SPR) and microscale thermophoresis (MST) have given you the broad idea that many label-free biosensors can also be used for biomolecular interaction studies.

Along with these technologies of microarrays and label-free biosensors, one of the latest advancements in the entire biomedical field is next-generation sequencing. These sequencing technologies have immense applications, from whole genome sequencing to RNA sequencing to a variety of other uses, and we tried to give you at least some idea of what can be done on NGS platforms, with application scientists from two of the leading industry players, Illumina and Thermo Fisher, talking to you about the latest advancements in this area as well as the possible applications on these technology platforms. Then we also had an interaction with one of the leading scientists and clinicians, Dr. Sanjay Navani, who talked to you about another mega project, the Human Protein Atlas, the very important role of India in the pathology atlas project, the associated challenges of that journey, and the major outcomes of the project.

So, all of these rapidly evolving technology platforms have immense applications in life sciences and translational biology; they also provide a much more comprehensive picture for a better understanding of crucial physiological processes in a systems approach. I hope these lectures and the various discussion points have really made you aware of the pros and cons of designing these experiments and of choosing the right technology for your given experiment. I hope the weekly assignments and live interactive sessions were helpful and that you enjoyed attending this course as much as we enjoyed making the effort to teach you this course and these advanced technologies. Thank you very much.